Data Discovery is currently in Private Preview and is not yet Generally Available (GA). Do not use it in production environments, because features and functionality may change before the GA release.
Performance and Accuracy
Introduction
Performance and accuracy are critical metrics for data discovery tools. These ensure that large datasets can be processed swiftly and sensitive information is correctly identified. High performance minimizes latency and maximizes productivity, while accuracy reduces the risk of data breaches and ensures compliance with regulatory standards like GDPR and CCPA.
Together, these qualities are essential for maintaining data integrity and security in environments where unstructured data flows through various systems.
Performance Evaluation
The evaluation covered Data Discovery deployed on Amazon EKS using a Helm chart. The primary goal was to validate the application's scalability and the infrastructure's ability to handle varying loads under real-world conditions. Performance will still vary between deployments because customer use cases differ. The key findings are as follows:
Scalability: The application and infrastructure configurations can efficiently scale to meet usage demands and support parallel service calls.
Instance Type: The m5.large8 instance was identified as a well-balanced choice for performance and cost.
- If the priority is faster response times: Splitting messages into smaller chunks and processing them in parallel on multiple less powerful instances is more cost-effective.
- If the priority is maximizing processing efficiency (characters processed per second): Merging content into a single, larger request and using more powerful instance types gives better results.
EKS Auto Mode: Running EKS in auto mode offers a fully managed Kubernetes cluster with minimal maintenance. This enables the service to self-regulate by automatically scaling up or down based on demand.
Optimized CPU Usage: Keep the CPU reservation low so that usage is measured accurately and the Horizontal Pod Autoscaler (HPA), which scales based on CPU usage percentage, can self-regulate effectively and balance throughput against idle time.
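The trade-off described under Instance Type, between many small parallel requests and one larger request, can be sketched as follows. This is a minimal illustration only: `classify_chunk` is a hypothetical stand-in for a call to the Data Discovery service, and the chunk size and worker count are arbitrary assumptions, not recommended values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def classify_chunk(chunk: str) -> Dict[str, int]:
    """Hypothetical placeholder for a request to the detection service."""
    return {"chars": len(chunk)}


def process_parallel(
    text: str,
    max_chars: int,
    classify: Callable[[str], Dict[str, int]] = classify_chunk,
    workers: int = 4,
) -> List[Dict[str, int]]:
    """Send each chunk as its own request; results keep the chunk order."""
    chunks = split_into_chunks(text, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify, chunks))


if __name__ == "__main__":
    message = "alpha beta gamma delta"
    print(split_into_chunks(message, 11))
    print(process_parallel(message, 11))
```

Calling `classify` once on the whole text corresponds to the single, larger request case. Note that splitting changes character offsets, so any entity positions returned per chunk would need to be mapped back to the original text.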
Detection Accuracy
Protegrity Data Discovery employs Machine Learning (ML) and Natural Language Processing (NLP) technologies to achieve high accuracy in identifying sensitive data. The system processes English text inputs, with an NLP model pinpointing the text spans within the document that correspond to various PII elements. The output reports each detected text span as a PII entity, along with its entity type, its position (start and end), and a confidence score. The confidence score reflects the likelihood that the text span is a PII entity.
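The output fields described above can be modeled as a small record type. The sketch below is illustrative only: the class name, field names, the exclusive end offset, and the sample scores are assumptions, not the service's actual response schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    text: str         # the detected text span
    entity_type: str  # for example "PERSON" or "PHONE_NUMBER"
    start: int        # start offset in the input text
    end: int          # end offset (assumed exclusive in this sketch)
    score: float      # confidence score between 0 and 1


def make_detection(source: str, span: str, entity_type: str, score: float) -> Detection:
    """Build a Detection by locating span inside source (first occurrence)."""
    start = source.index(span)
    return Detection(span, entity_type, start, start + len(span), score)


def filter_by_confidence(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose confidence score meets the threshold."""
    return [d for d in detections if d.score >= threshold]


if __name__ == "__main__":
    sample = "Call John Smith at 555-0100."
    # The scores below are made-up values for illustration.
    found = [
        make_detection(sample, "John Smith", "PERSON", 0.97),
        make_detection(sample, "555-0100", "PHONE_NUMBER", 0.62),
    ]
    for d in filter_by_confidence(found, 0.8):
        print(d.entity_type, d.start, d.end, d.score)
```

Filtering on the confidence score is one way a consumer might trade detection coverage against false positives; the appropriate threshold depends on the use case.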
Dataset
The evaluation used diverse datasets containing PII data. The datasets differ in demographic composition (volume and diversity), data characteristics, label types, and other influencing factors. Labels include, for example, “PERSON” and “PHONE_NUMBER”. Across the various PII data combinations in the datasets, the measured overall detection rate exceeded 96%.
Accuracy
Accuracy is defined as the average of the detection rates across the sentences in a given text.
Detection Rate = Valid Detections / Ground Truth
Here, Valid Detections is the number of correctly detected PII entities and Ground Truth is the total number of PII entities.
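As a worked illustration of the definition above, the following sketch computes the detection rate per sentence and averages the results. The function names and the sample counts are hypothetical, and skipping sentences that contain no PII is an assumption of this sketch, since the definition does not say how such sentences are handled.

```python
from typing import List, Tuple


def detection_rate(valid_detections: int, ground_truth: int) -> float:
    """Detection Rate = Valid Detections / Ground Truth."""
    if ground_truth <= 0:
        raise ValueError("ground_truth must be positive")
    return valid_detections / ground_truth


def accuracy(per_sentence_counts: List[Tuple[int, int]]) -> float:
    """Average the detection rate over sentences.

    Each tuple is (valid_detections, ground_truth) for one sentence.
    Sentences with no ground-truth PII are skipped (assumption).
    """
    rates = [detection_rate(v, g) for v, g in per_sentence_counts if g > 0]
    if not rates:
        raise ValueError("no sentences with ground-truth PII")
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Made-up counts: three sentences with rates 2/2, 1/2, and 3/4.
    counts = [(2, 2), (1, 2), (3, 4)]
    print(accuracy(counts))
```

With these sample counts the per-sentence rates are 1.0, 0.5, and 0.75, so the average is 0.75.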
Because customer applications vary, detection accuracy may fluctuate with the quality of the input text. Error rates in identifying PII are influenced not only by the detection service but also by customer workflows and evaluation datasets. Customers should assess and validate accuracy against their specific use cases and requirements. The score reported for a given input text may also vary negligibly from user to user, depending on the underlying hardware configuration.
Supported Entity Types
The following table lists the PII entities supported by Data Discovery.
| Entity Name | Description |
|---|---|
| ACCOUNT_NAME | Name associated with a financial account. |
| ACCOUNT_NUMBER | Bank account number used to identify financial accounts. |
| AGE | Age information used to identify individuals. |
| AMOUNT | Specific amount of money, which can be linked to financial transactions. |
| AU_ABN | Australian Business Number used to identify businesses in Australia. |
| AU_ACN | Australian Company Number used to identify businesses in Australia. |
| AU_MEDICARE | Medicare number used to identify individuals for healthcare services in Australia. |
| AU_TFN | Tax File Number used to identify taxpayers in Australia. |
| BIC | Bank Identifier Code used to identify financial institutions. |
| BITCOIN_ADDRESS | Bitcoin wallet address used for digital transactions. |
| BUILDING | Building information used to identify specific locations. |
| CITY | City information used to identify geographic locations. |
| COMPANY_NAME | Name of a company used to identify businesses. |
| COUNTRY | Country information used to identify geographic locations. |
| COUNTY | County information used to identify geographic locations. |
| CREDIT_CARD | Credit card number used for financial transactions. |
| CREDIT_CARD_CVV | Card Verification Value used to secure credit card transactions. |
| CRYPTO | Cryptocurrency wallet address used for digital transactions. |
| CURRENCY | Currency information used in financial transactions. |
| CURRENCY_CODE | Code representing currency used in financial transactions. |
| CURRENCY_NAME | Name of currency used in financial transactions. |
| CURRENCY_SYMBOL | Symbol representing currency, sometimes linked to financial transactions. |
| DATE | Specific date that can be linked to personal activities. |
| DATE_OF_BIRTH | Date of birth used to identify individuals. |
| DATE_TIME | Specific date and time that can be linked to personal activities. |
| DRIVER_LICENSE | Driver’s license number used to identify individuals. |
| EMAIL_ADDRESS | Email address used for communication and identification. |
| ES_NIE | Foreigner Identification Number used to identify non-residents in Spain. |
| ES_NIF | Tax Identification Number used to identify taxpayers in Spain. |
| ETHEREUM_ADDRESS | Ethereum wallet address used for digital transactions. |
| FI_PERSONAL_IDENTITY_CODE | Personal identity code used to identify individuals in Finland. |
| GENDER | Gender information used to identify individuals. |
| GEO_CCORDINATE | Geographic coordinates used to identify specific locations. |
| IBAN_CODE | International Bank Account Number used to identify bank accounts globally. |
| ID_CARD | Identity card number used to identify individuals. |
| IN_AADHAAR | Unique identification number used to identify residents in India. |
| IN_PAN | Permanent Account Number used to identify taxpayers in India. |
| IN_PASSPORT | Passport number used to identify individuals in India. |
| IN_VEHICLE_REGISTRATION | Vehicle registration number used to identify vehicles in India. |
| IN_VOTER | Voter ID number used to identify registered voters in India. |
| IP_ADDRESS | Internet Protocol address used to identify devices on a network. |
| IPV4 | IPv4 address used to identify devices on a network. |
| IPV6 | IPv6 address used to identify devices on a network. |
| IT_DRIVER_LICENSE | Driver’s license number used to identify individuals in Italy. |
| IT_FISCAL_CODE | Fiscal code used to identify taxpayers in Italy. |
| IT_IDENTITY_CARD | Identity card number used to identify individuals in Italy. |
| IT_PASSPORT | Passport number used to identify individuals in Italy. |
| LITECOIN_ADDRESS | Litecoin wallet address used for digital transactions. |
| LOCATION | Specific location or address that can be linked to an individual. |
| MAC | Media Access Control address used to identify devices on a network. |
| MEDICAL_LICENSE | License number used to identify medical professionals. |
| NRP | National Registration Number used to identify individuals. |
| ORGANIZATION | Name or identifier used to identify an organization. |
| PASSPORT | Passport number used to identify individuals. |
| PASSWORD | Password used to secure access to personal accounts. |
| PERSON | Name or identifier used to identify an individual. |
| PHONE_NUMBER | Number used to contact or identify an individual. |
| PIN | Personal Identification Number used to secure access to accounts. |
| PL_PESEL | Personal Identification Number used to identify individuals in Poland. |
| SECONDARY_ADDRESS | Additional address information used to identify locations. |
| SG_NRIC_FIN | National Registration Identity Card number used to identify residents in Singapore. |
| SG_UEN | Unique Entity Number used to identify businesses in Singapore. |
| SOCIAL_SECURITY_NUMBER | Social Security Number used to identify individuals. |
| STATE | State information used to identify geographic locations. |
| STREET | Street address used to identify specific locations. |
| TIME | Specific time that can be linked to personal activities. |
| TITLE | Title or honorific used to identify individuals. |
| UK_NHS | National Health Service number used to identify individuals for healthcare services in the United Kingdom. |
| URL | Web address that can sometimes contain personal information. |
| US_BANK_NUMBER | Bank account number used to identify financial accounts in the United States. |
| US_DRIVER_LICENSE | Driver’s license number used to identify individuals in the United States. |
| US_ITIN | Individual Taxpayer Identification Number used to identify taxpayers in the United States. |
| US_PASSPORT | Passport number used to identify individuals in the United States. |
| US_SSN | Social Security Number used to identify individuals in the United States. |
| USERNAME | Username used to identify individuals in online systems. |
| ZIP_CODE | Postal code used to identify specific geographic areas. |