Data Discovery is currently in Private Preview and has not reached General Availability (GA). It should not be used in production environments, as features and functionality may change before the GA release.

Performance and Accuracy

Details on performance and accuracy results.

Introduction

Performance and accuracy are critical metrics for data discovery tools. These ensure that large datasets can be processed swiftly and sensitive information is correctly identified. High performance minimizes latency and maximizes productivity, while accuracy reduces the risk of data breaches and ensures compliance with regulatory standards like GDPR and CCPA.

Together, these qualities are essential for maintaining data integrity and security in environments where unstructured data flows through various systems.

Performance Evaluation

The evaluation used Data Discovery deployed on Amazon EKS with a Helm chart. The primary goal was to validate the application’s scalability and the infrastructure’s ability to handle varying loads under real-world conditions. Performance will nevertheless vary between deployments because customer use cases differ. The key findings are as follows:

  • Scalability: The application and infrastructure configurations can efficiently scale to meet usage demands and support parallel service calls.

  • Instance Type: The m5.large8 instance was identified as a well-balanced choice for performance and cost.

    • If the priority is Faster Response Times: Split messages into smaller chunks and process them in parallel. This approach is more cost-effective on multiple less powerful instances.
    • If the priority is Maximizing Processing Efficiency (characters processed per second): Merge content into a single, larger request and use more powerful instance types.
  • EKS Auto Mode: Running EKS in auto mode offers a fully managed Kubernetes cluster with minimal maintenance. This enables the service to self-regulate by automatically scaling up or down based on demand.

  • Optimized CPU Usage: Maintain a low CPU reservation so that usage is measured accurately and the service can self-regulate through the Horizontal Pod Autoscaler (HPA). The HPA adjusts the number of pods based on CPU usage percentage, balancing throughput and idle time.
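The effect of the CPU reservation on HPA scaling can be illustrated with a short calculation. The following is a minimal Python sketch, not part of Data Discovery. It assumes the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired replicas = ceil(current replicas × current utilization / target utilization), with the default 10% tolerance, and it assumes CPU utilization is expressed as a percentage of the pod's CPU request. The function names and sample numbers are hypothetical.

```python
import math


def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """CPU utilization as a percentage of the pod's CPU request (reservation)."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     tolerance: float = 0.1) -> int:
    """Replica count from the documented HPA formula.

    No scaling happens while the usage ratio stays within the tolerance
    (0.1 is the Kubernetes default).
    """
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: the same 450m of real usage looks very different
# depending on how much CPU the pod reserves.
low_request = cpu_utilization_percent(450, 500)     # 90.0
high_request = cpu_utilization_percent(450, 2000)   # 22.5

print(desired_replicas(4, low_request, 60.0))    # scales out to 6 replicas
print(desired_replicas(4, high_request, 60.0))   # scales in to 2 replicas
```

With an oversized reservation, the same workload reports low utilization and the autoscaler removes pods, which is one reason a low, realistic reservation helps the service self-regulate.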

Detection Accuracy

Protegrity Data Discovery uses Machine Learning (ML) and Natural Language Processing (NLP) technologies to achieve high accuracy in identifying sensitive data. The system processes English text inputs, and an NLP model pinpoints the text spans within the document that correspond to various PII elements. The output includes each text span identified as a PII entity, along with the entity type, the entity position (start and end), and a confidence score. The confidence score reflects the likelihood that the text span is a PII entity.
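The output fields described above can be pictured as a small record per detection. The following Python sketch is illustrative only and does not reproduce the actual response format of Data Discovery: the class name, the field names, the exclusive end offset, the 0 to 1 score range, and the sample values are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """One detected PII span (hypothetical field names)."""
    text: str          # the matched text span
    entity_type: str   # for example "PERSON" or "PHONE_NUMBER"
    start: int         # start offset in the input text
    end: int           # end offset, assumed to be exclusive
    score: float       # confidence score, assumed to lie in [0, 1]


def filter_by_confidence(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose confidence score meets the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up sample data for illustration.
text = "Call John Smith at 555-0100."
detections = [
    Detection("John Smith", "PERSON", 5, 15, 0.98),
    Detection("555-0100", "PHONE_NUMBER", 19, 27, 0.91),
    Detection("Call", "PERSON", 0, 4, 0.12),  # low-confidence candidate
]

confident = filter_by_confidence(detections, threshold=0.5)
for d in confident:
    # The position fields let the caller recover the span from the input.
    assert text[d.start:d.end] == d.text
    print(d.entity_type, d.start, d.end, d.score)
```

The threshold of 0.5 is arbitrary; a suitable value depends on how the consuming workflow weighs missed detections against false positives.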

Dataset

The evaluation used diverse datasets containing PII data. The datasets differ in demographic composition (volume and diversity), data characteristics, types of labels, and other influencing factors. For example, labels such as “PERSON” and “PHONE_NUMBER” are used. Across the various PII data combinations in the datasets, the overall detection rate exceeded 96%.

Accuracy

Accuracy is defined as the average of the detection rates across the sentences in a given text.

Detection Rate = Valid Detections / Ground Truth

Here, Valid Detections is the number of correctly detected PII entities and Ground Truth is the total number of PII entities.
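The definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used for the reported results: a detection counts as valid only if its start, end, and entity type exactly match a ground-truth entry, sentences without any ground-truth PII are skipped, and all names and sample spans are hypothetical.

```python
from typing import List, Set, Tuple

# A span is (start, end, entity_type); exact matching is an assumption.
Span = Tuple[int, int, str]


def detection_rate(detected: Set[Span], ground_truth: Set[Span]) -> float:
    """Detection Rate = Valid Detections / Ground Truth for one sentence."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one PII entity")
    valid_detections = len(detected & ground_truth)
    return valid_detections / len(ground_truth)


def accuracy(sentences: List[Tuple[Set[Span], Set[Span]]]) -> float:
    """Average detection rate over sentences that contain ground-truth PII."""
    rates = [detection_rate(det, truth) for det, truth in sentences if truth]
    if not rates:
        raise ValueError("no sentence contains ground-truth PII")
    return sum(rates) / len(rates)


# Made-up example: the first sentence is fully detected, the second misses
# one of its two entities and also contains one spurious detection.
sentences = [
    ({(0, 10, "PERSON"), (14, 22, "PHONE_NUMBER")},
     {(0, 10, "PERSON"), (14, 22, "PHONE_NUMBER")}),
    ({(3, 13, "PERSON"), (40, 45, "CITY")},
     {(3, 13, "PERSON"), (20, 35, "EMAIL_ADDRESS")}),
]

print(accuracy(sentences))  # 0.75, the mean of 1.0 and 0.5
```

Because the formula divides by the ground-truth count only, spurious detections such as the "CITY" span above do not lower the detection rate. Customers who also care about false positives may want to track them separately when validating against their own data.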

Customer applications vary, so detection accuracy may fluctuate with the quality of the input text. Error rates in identifying PII are influenced not only by the detection service but also by customer workflows and evaluation datasets. Customers should assess and validate accuracy against their specific use cases and requirements. The confidence score reported for a given input text may also vary negligibly from user to user, depending on the underlying hardware configuration.

Supported Entity Types

PII entities supported by Data Discovery.

Entity Name                Description
ACCOUNT_NAME               Name associated with a financial account.
ACCOUNT_NUMBER             Bank account number used to identify financial accounts.
AGE                        Age information used to identify individuals.
AMOUNT                     Specific amount of money, which can be linked to financial transactions.
AU_ABN                     Australian Business Number used to identify businesses in Australia.
AU_ACN                     Australian Company Number used to identify businesses in Australia.
AU_MEDICARE                Medicare number used to identify individuals for healthcare services in Australia.
AU_TFN                     Tax File Number used to identify taxpayers in Australia.
BIC                        Bank Identifier Code used to identify financial institutions.
BITCOIN_ADDRESS            Bitcoin wallet address used for digital transactions.
BUILDING                   Building information used to identify specific locations.
CITY                       City information used to identify geographic locations.
COMPANY_NAME               Name of a company used to identify businesses.
COUNTRY                    Country information used to identify geographic locations.
COUNTY                     County information used to identify geographic locations.
CREDIT_CARD                Credit card number used for financial transactions.
CREDIT_CARD_CVV            Card Verification Value used to secure credit card transactions.
CRYPTO                     Cryptocurrency wallet address used for digital transactions.
CURRENCY                   Currency information used in financial transactions.
CURRENCY_CODE              Code representing currency used in financial transactions.
CURRENCY_NAME              Name of currency used in financial transactions.
CURRENCY_SYMBOL            Symbol representing currency, sometimes linked to financial transactions.
DATE                       Specific date that can be linked to personal activities.
DATE_OF_BIRTH              Date of birth used to identify individuals.
DATE_TIME                  Specific date and time that can be linked to personal activities.
DRIVER_LICENSE             Driver’s license number used to identify individuals.
EMAIL_ADDRESS              Email address used for communication and identification.
ES_NIE                     Foreigner Identification Number used to identify non-residents in Spain.
ES_NIF                     Tax Identification Number used to identify taxpayers in Spain.
ETHEREUM_ADDRESS           Ethereum wallet address used for digital transactions.
FI_PERSONAL_IDENTITY_CODE  Personal identity code used to identify individuals in Finland.
GENDER                     Gender information used to identify individuals.
GEO_CCORDINATE             Geographic coordinates used to identify specific locations.
IBAN_CODE                  International Bank Account Number used to identify bank accounts globally.
ID_CARD                    Identity card number used to identify individuals.
IN_AADHAAR                 Unique identification number used to identify residents in India.
IN_PAN                     Permanent Account Number used to identify taxpayers in India.
IN_PASSPORT                Passport number used to identify individuals in India.
IN_VEHICLE_REGISTRATION    Vehicle registration number used to identify vehicles in India.
IN_VOTER                   Voter ID number used to identify registered voters in India.
IP_ADDRESS                 Internet Protocol address used to identify devices on a network.
IPV4                       IPv4 address used to identify devices on a network.
IPV6                       IPv6 address used to identify devices on a network.
IT_DRIVER_LICENSE          Driver’s license number used to identify individuals in Italy.
IT_FISCAL_CODE             Fiscal code used to identify taxpayers in Italy.
IT_IDENTITY_CARD           Identity card number used to identify individuals in Italy.
IT_PASSPORT                Passport number used to identify individuals in Italy.
LITECOIN_ADDRESS           Litecoin wallet address used for digital transactions.
LOCATION                   Specific location or address that can be linked to an individual.
MAC                        Media Access Control address used to identify devices on a network.
MEDICAL_LICENSE            License number used to identify medical professionals.
NRP                        National Registration Number used to identify individuals.
ORGANIZATION               Name or identifier used to identify an organization.
PASSPORT                   Passport number used to identify individuals.
PASSWORD                   Password used to secure access to personal accounts.
PERSON                     Name or identifier used to identify an individual.
PHONE_NUMBER               Number used to contact or identify an individual.
PIN                        Personal Identification Number used to secure access to accounts.
PL_PESEL                   Personal Identification Number used to identify individuals in Poland.
SECONDARY_ADDRESS          Additional address information used to identify locations.
SG_NRIC_FIN                National Registration Identity Card number used to identify residents in Singapore.
SG_UEN                     Unique Entity Number used to identify businesses in Singapore.
SOCIAL_SECURITY_NUMBER     Social Security Number used to identify individuals.
STATE                      State information used to identify geographic locations.
STREET                     Street address used to identify specific locations.
TIME                       Specific time that can be linked to personal activities.
TITLE                      Title or honorific used to identify individuals.
UK_NHS                     National Health Service number used to identify individuals for healthcare services in the United Kingdom.
URL                        Web address that can sometimes contain personal information.
US_BANK_NUMBER             Bank account number used to identify financial accounts in the United States.
US_DRIVER_LICENSE          Driver’s license number used to identify individuals in the United States.
US_ITIN                    Individual Taxpayer Identification Number used to identify taxpayers in the United States.
US_PASSPORT                Passport number used to identify individuals in the United States.
US_SSN                     Social Security Number used to identify individuals in the United States.
USERNAME                   Username used to identify individuals in online systems.
ZIP_CODE                   Postal code used to identify specific geographic areas.
Last modified : September 03, 2025