Data Discovery is currently in Private Preview and is not yet Generally Available (GA). Do not use it in production environments, because features and functionality may change before the final GA release.

Performance and Accuracy

Details on performance and accuracy results.

Introduction

Performance and accuracy are critical metrics for data discovery tools. These ensure that large datasets can be processed swiftly and sensitive information is correctly identified. High performance minimizes latency and maximizes productivity, while accuracy reduces the risk of data breaches and ensures compliance with regulatory standards like GDPR and CCPA.

Together, these qualities are essential for maintaining data integrity and security in environments where unstructured data flows through various systems.

Performance Evaluation

The evaluation covered Data Discovery deployed on Amazon EKS using a Helm chart. The primary goal was to validate the application’s scalability and the infrastructure’s ability to handle varying loads under real-world conditions. Performance will still vary between deployments, because customer use cases differ. The key findings are as follows:

  • Scalability: The application and infrastructure configurations can efficiently scale to meet usage demands and support parallel service calls.

  • Instance Type: The m5.large instance was identified as a well-balanced choice for performance and cost.

    • If the priority is faster response times: Split messages into smaller chunks and process them in parallel. This is more cost-effective with multiple less powerful instances.
    • If the priority is processing efficiency (characters processed per second): Merge content into a single, larger request and use more powerful instance types.
  • Optimized CPU Usage: Keep the CPU reservation low so that usage is measured accurately and the Horizontal Pod Autoscaler (HPA), which scales based on CPU usage percentage, can self-regulate effectively, balancing throughput and idle time.
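
The chunk-and-parallelize approach from the findings above can be sketched as follows. This is a minimal illustration, not the product's client: `classify_chunk` is a hypothetical stand-in for a call to the classification service (here it only matches 10-digit numbers so the sketch runs offline), and the chunk size and worker count are arbitrary assumptions. Entity positions from each chunk are shifted by the chunk offset so they refer to the original text.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(text, max_chars):
    """Split text into (offset, chunk) pairs of at most max_chars characters,
    breaking after whitespace where possible so words are not cut in half."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def classify_chunk(chunk):
    """Hypothetical stand-in for a request to the classification service."""
    return [
        {"entity_type": "PHONE_NUMBER", "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\b\d{10}\b", chunk)
    ]


def classify_in_parallel(text, max_chars=50, workers=4):
    """Classify chunks concurrently and map positions back to the full text."""
    chunks = split_into_chunks(text, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda item: classify_chunk(item[1]), chunks))

    results = []
    for (offset, _), entities in zip(chunks, per_chunk):
        for entity in entities:
            results.append({
                **entity,
                "start": entity["start"] + offset,
                "end": entity["end"] + offset,
            })
    return results
```

Splitting on whitespace keeps a single word intact, but a multi-word entity can still straddle a chunk boundary, so smaller chunks trade some detection context for lower latency.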

Detection Accuracy

Protegrity Data Discovery employs sophisticated Machine Learning (ML) and Natural Language Processing (NLP) technologies to achieve high accuracy in identifying sensitive data. The system processes text inputs with an NLP model that pinpoints the text spans in a document that correspond to PII elements. For each detected span, the output includes the entity type, the entity position (start and end), and a confidence score. The confidence score reflects the likelihood that the text span is a PII entity, which supports precise detection.
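
The shape of a detection result described above could be modeled as in the following sketch. The field names, the exclusive end position, the 0 to 1 score range, and the sample values are all assumptions made for illustration. They are not the product's actual response schema or real output.

```python
from dataclasses import dataclass


@dataclass
class DetectedEntity:
    text: str         # the text span flagged as PII
    entity_type: str  # for example "PERSON" or "PHONE_NUMBER"
    start: int        # start position of the span (assumed inclusive)
    end: int          # end position of the span (assumed exclusive)
    score: float      # confidence score (assumed to lie between 0 and 1)


def filter_by_confidence(entities, threshold):
    """Keep only detections whose confidence score meets the threshold."""
    return [entity for entity in entities if entity.score >= threshold]


# Illustrative values only, not real output from the service.
document = "Contact John Smith at 555-0100."
detections = [
    DetectedEntity("John Smith", "PERSON", 8, 18, 0.97),
    DetectedEntity("555-0100", "PHONE_NUMBER", 22, 30, 0.62),
]
confident = filter_by_confidence(detections, threshold=0.8)
```

A consumer would typically choose the threshold based on whether missed PII or false alerts are more costly for its use case.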

Dataset

Diverse datasets containing PII data were used. They differ in demographic composition (volume and diversity), data characteristics, label types, and other influencing factors. For example, labels such as “PERSON” and “PHONE_NUMBER” are used. The overall detection rate across the various PII data combinations in the datasets exceeded 96%.

Accuracy

The accuracy of the PII detection system is evaluated using Precision, Recall, and F1 Score. These metrics are standard in information extraction and named entity recognition (NER) tasks and provide a clear and consistent way to measure detection performance.

  • Ground truth: Evaluation is performed against a labeled dataset where all PII entities are predefined. This labeled data represents the ground truth and is used to determine whether detected entities are correct.

  • Precision: Precision describes how reliable the system’s detections are. It focuses on the quality of the results. When the system identifies something as PII, Precision tells you how often that decision is correct. If Precision is high, most of the detected PII is valid and there are fewer false alerts.

  • Recall: Recall describes how complete the system’s detections are. It focuses on coverage. Recall shows how much of the actual PII present in the text was successfully detected. If Recall is high, the system is finding most of the PII and missing very little.

  • F1 Score: F1 Score combines Precision and Recall into a single value. It reflects the overall effectiveness of the system by balancing:

    • Avoiding false detections (Precision)
    • Avoiding missed PII (Recall)

    A high F1 Score means the system is both accurate and thorough, without favoring one at the expense of the other.

Interpretation of Metrics:

  • High Precision, Low Recall: The system is conservative and accurate but misses some PII.
  • Low Precision, High Recall: The system detects most PII but includes more false positives.
  • High F1 Score: The system achieves a good balance between Precision and Recall.
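
The three metrics can be computed from a labeled dataset as in the sketch below, using the standard definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × Precision × Recall / (Precision + Recall). Representing entities as `(start, end, entity_type)` tuples and counting only exact matches as correct are assumptions of this sketch. The matching rules used in the actual evaluation are not stated here.

```python
def precision_recall_f1(predicted, ground_truth):
    """Score predicted entities against labeled ground truth.

    Both arguments are sets of (start, end, entity_type) tuples. A prediction
    counts as correct only if it matches a ground-truth entity exactly.
    """
    true_positives = len(predicted & ground_truth)
    false_positives = len(predicted - ground_truth)
    false_negatives = len(ground_truth - predicted)

    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative labels: three true entities, two of which were detected.
ground_truth = {(8, 18, "PERSON"), (22, 30, "PHONE_NUMBER"), (40, 50, "PERSON")}
predicted = {(8, 18, "PERSON"), (22, 30, "PHONE_NUMBER")}
precision, recall, f1 = precision_recall_f1(predicted, ground_truth)
```

In this example every detection is correct but one entity is missed, which corresponds to the "High Precision, Low Recall" case.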

Supported Languages

Data Discovery provides accurate language detection across multiple supported languages. The F1 score demonstrates the consistency of performance across languages and enables quick comparison of detection quality in multilingual deployments.

Language metrics:

  • French / German / Spanish / Italian / Dutch: F1 ≥ 0.90
  • English: F1 ≥ 0.95

Last modified: March 06, 2026