Data Discovery on

Introduction

In an era where data privacy is paramount, safeguarding sensitive information in unstructured data has become critical—especially for organizations leveraging AI and machine learning technologies. Data Discovery is a powerful, developer-friendly product designed specifically to address this challenge.

Data Discovery specializes in detecting Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Information (PCI) within free-text (unstructured) and table-based (structured, CSV) inputs. Unlike traditional data tools, it excels in dynamic, unstructured environments such as chatbot conversations, call transcripts, and Generative AI (Gen AI) outputs.

What's New

Data Discovery 2.0

Major changes

Standardized API Endpoints

Updated Classify and Transform APIs:
- http://{Host Address}/pty/data-discovery/v2/classify/text - Classify Text API
- http://{Host Address}/pty/data-discovery/v2/classify/tabular - Classify Tabular Data API
- http://{Host Address}/pty/data-discovery/v2/transform/label - Transform Text API
Added new Endpoints:
- http://{Host Address}/pty/data-discovery/doc – Provides the API documentation for Data Discovery. For more information, see API Specification.
- http://{Host Address}/pty/data-discovery/log – Gets or sets the log level for Data Discovery. For more information, see Log level API.
- http://{Host Address}/pty/data-discovery/version – Retrieves the current version of Data Discovery. For more information, see Version API.
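
As a minimal sketch, the endpoint list above can be turned into a small URL helper. The paths are copied from the list; the dictionary keys, the function name, and the example host are hypothetical. Request and response formats are not described in this section, so the sketch stops at building the URL.

```python
# Paths are taken from the endpoint list; the keys are hypothetical labels.
ENDPOINT_PATHS = {
    "classify_text": "/pty/data-discovery/v2/classify/text",
    "classify_tabular": "/pty/data-discovery/v2/classify/tabular",
    "transform_label": "/pty/data-discovery/v2/transform/label",
    "doc": "/pty/data-discovery/doc",
    "log": "/pty/data-discovery/log",
    "version": "/pty/data-discovery/version",
}


def endpoint_url(host_address: str, name: str) -> str:
    """Return the full URL for a named endpoint on the given host address."""
    return f"http://{host_address}{ENDPOINT_PATHS[name]}"


if __name__ == "__main__":
    # "discovery.example.internal:8080" is a placeholder host, not a documented default.
    print(endpoint_url("discovery.example.internal:8080", "version"))
```

Keeping the paths in one table makes a later version change a one-line edit per endpoint.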

Enhancements

Updated Context Provider AI model for improved contextual accuracy.
Updated Pattern Provider model for better pattern recognition.
Updated the default score threshold for the Classify API from 0.0 to 0.7, aligning it with the Transform API which already defaults to 0.7. Low-confidence classifications below the threshold are filtered out. The legacy v1.1 classification endpoint retains a threshold of 0.0 for backward compatibility.
Added usage metrics logging to the Classification Service for improved analytics and visibility. See Usage Metrics for more details.
Added per-language accuracy metrics to improve visibility into multilingual performance. See Language Metrics for more details.
Added PII detection in multiple Markdown dialects.
Bug Fixes.
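
The effect of the new default score threshold described in the enhancements above can be illustrated with a short sketch. The record shape (a dictionary with `label` and `score` keys) and the function name are assumptions made for illustration; only the threshold values 0.7 and 0.0 come from the release notes, and whether a score exactly equal to the threshold is kept is inferred from the phrase "below the threshold".

```python
DEFAULT_SCORE_THRESHOLD = 0.7  # default for the v2 Classify API, per the release notes
LEGACY_SCORE_THRESHOLD = 0.0   # default retained by the legacy v1.1 endpoint


def filter_by_score(classifications, threshold=DEFAULT_SCORE_THRESHOLD):
    """Drop classifications whose score falls below the threshold."""
    return [c for c in classifications if c["score"] >= threshold]


if __name__ == "__main__":
    # Hypothetical results; real responses may use different field names.
    sample = [
        {"label": "EMAIL", "score": 0.95},
        {"label": "NAME", "score": 0.70},
        {"label": "PHONE", "score": 0.42},
    ]
    print(filter_by_score(sample))
    print(filter_by_score(sample, LEGACY_SCORE_THRESHOLD))
```

With the 0.7 default the low-confidence entry is removed, while the legacy threshold of 0.0 keeps every entry.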

General Architecture

Tue, 20 Feb 2024 00:00:00 +0000

The main components of the Protegrity Data Discovery product are as follows:

Classification service: The Classification Service serves as the primary access point for all classification-related interactions. It orchestrates various back-end components known as Providers, which are responsible for executing the actual classification tasks.
Pattern and Context classification providers: The Providers are specialized modules that identify and classify Personally Identifiable Information (PII). They analyze input data to detect, classify, and locate sensitive information.
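
The relationship between the Classification Service and its Providers can be sketched roughly as follows. This is a toy illustration, not the product's implementation: the `Finding` type, the function names, and the single email regular expression are all hypothetical, and the context provider is a stub standing in for a model-based component.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    label: str     # what was classified
    start: int     # where it was located (inclusive offset)
    end: int       # where it ends (exclusive offset)
    provider: str  # which provider reported it


Provider = Callable[[str], List[Finding]]

# Deliberately simple pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pattern_provider(text: str) -> List[Finding]:
    """Toy pattern provider: reports email-like spans."""
    return [Finding("EMAIL", m.start(), m.end(), "pattern")
            for m in EMAIL_RE.finditer(text)]


def context_provider(text: str) -> List[Finding]:
    """Stub for a model-based provider; returns nothing in this sketch."""
    return []


def classify(text: str, providers: List[Provider]) -> List[Finding]:
    """Single access point: run every provider and merge the findings."""
    findings: List[Finding] = []
    for provider in providers:
        findings.extend(provider(text))
    return sorted(findings, key=lambda f: f.start)
```

Callers interact only with `classify`, which mirrors the role of the service as the single access point while the providers do the detection work.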

Performance and Accuracy

Introduction

Performance and accuracy are critical metrics for data discovery tools. These ensure that large datasets can be processed swiftly and sensitive information is correctly identified. High performance minimizes latency and maximizes productivity, while accuracy reduces the risk of data breaches and ensures compliance with regulatory standards like GDPR and CCPA.

Together, these qualities are essential for maintaining data integrity and security in environments where unstructured data flows through various systems.

Usage Metrics

This section outlines the usage metrics generated by Data Discovery for classification requests. These metrics provide visibility into service usage and support scenarios such as internal chargeback across departments. The logs are designed to support monitoring, auditing, and capacity planning.

Overview

When you submit a classification request to Data Discovery, the service generates a usage log entry after the request is processed. A log entry is created for every request, regardless of whether the request succeeds or fails.
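
The "one entry per request, success or failure" behavior described above is commonly achieved with a `try`/`finally` wrapper. The sketch below shows that general pattern only; the function name, the log fields (`status`, `duration_ms`), and the list used as a log sink are hypothetical and do not reflect the actual Data Discovery log format.

```python
import json
import time


def classify_with_usage_log(handler, payload, log_sink):
    """Run a request handler and always record one usage entry afterwards."""
    started = time.monotonic()
    status = "failure"  # assumed until the handler returns normally
    try:
        result = handler(payload)
        status = "success"
        return result
    finally:
        # Runs whether the handler returned or raised, so every request is logged.
        entry = {
            "status": status,  # hypothetical field name
            "duration_ms": round((time.monotonic() - started) * 1000, 3),
        }
        log_sink.append(json.dumps(entry))
```

Because the entry is written in the `finally` clause, an exception raised by the handler still propagates to the caller after the log entry has been recorded.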