1 - Introduction

About Protegrity’s Data Discovery.

In an era where data privacy is paramount, safeguarding sensitive information in unstructured data has become critical—especially for organizations leveraging AI and machine learning technologies. Data Discovery is a powerful, developer-friendly product designed specifically to address this challenge.

Data Discovery’s Classification Service specializes in the detection of Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Information (PCI) within free-text (unstructured) and table-based (structured, CSV) inputs. Unlike traditional data tools, it excels in dynamic, unstructured environments such as chatbot conversations, call transcripts, and Generative AI (Gen AI) outputs.

Harnessing a hybrid detection engine that combines machine learning and rule-based algorithms, Data Discovery offers unparalleled accuracy and flexibility. It empowers teams to perform the following:

  • Automate chatbot redaction to ensure compliance with privacy regulations.

  • Perform transcript cleanup for customer service, healthcare, and financial industries.

  • Enhance Gen AI applications by proactively mitigating the risk of leaking sensitive information.

Built for developers, architects, and privacy engineers, Data Discovery seamlessly integrates into AI/ML pipelines and Gen AI workflows. Deployment is fast and flexible, with support for both Docker containers and AWS EKS clusters, and interaction via robust, intuitive REST APIs.

Whether you’re building next-generation AI applications or enhancing existing systems to meet evolving data privacy standards, Data Discovery equips you with the tools to discover, classify, and protect sensitive information at scale.

2 - What's New

Features introduced in this version for Data Discovery.

| Feature | Description | References (if any) |
| --- | --- | --- |
| Structured Data Classification | Classify data in CSV content by analyzing and assigning classifications to each column. | Classify CSV API |
| Harmonize Classification Responses | Standardize classification outputs from multiple providers by mapping them to a unified set of conventional categories. | Harmonize Responses |
| Transformation - Label | Replace sensitive text with corresponding entity labels (e.g., <CREDIT_CARD>) based on classified data types. | Label Text API |
| Terraform and Helm Deployment | Support deployment on EKS using Terraform and Helm Charts. | EKS Deployment |
| Registry Hosted Product Images | Product images used in this deployment are available on Protegrity’s public Image Registry at registry.protegrity.com. | Obtaining Package |
| Performance Improvement | General performance improvements. | NA |
| Accuracy Improvements | General accuracy improvements. | NA |
| Bug Fixes | Various bug fixes. | NA |

3 - General Architecture

High-level view of the main components and interactions.

The main components of the Protegrity Data Discovery product are as follows:

  • Classification service: The Classification Service serves as the primary access point for all classification-related interactions. It orchestrates various back-end components known as Providers, which are responsible for executing the actual classification tasks.

  • Pattern and Context classification providers: The Providers are specialized modules that identify and classify Personally Identifiable Information (PII). They analyze input data to detect, classify, and locate sensitive information.

The Pattern classification provider is a rule-based system that identifies PII using predefined patterns and heuristics. It is fast, customizable, and suitable for structured data with known formats.

The Context classification provider is an LLM-based machine learning model, designed within Protegrity, that detects PII using context and semantics. It is flexible, effective with unstructured data, and adapts to varied patterns.

The general architecture is illustrated in the following figure.

| Callout | Description |
| --- | --- |
| 1 | The user submits the data to be classified as a text body in a request to the Classification service. |
| 2 | The Classification service distributes the request to the Pattern and Context classification providers for processing. |
| 3 | The Pattern and Context classification providers process the data according to their own logic and return their classifications to the Classification service. |
| 4 | The Classification service aggregates the responses from the providers and sends the result to the user. |
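
The request flow in the callouts above can be illustrated with a small sketch. This is not the product's implementation: the two provider functions, the `classify` helper, and the finding fields are hypothetical stand-ins (a single regex and a canned name lookup) chosen only to show the fan-out and aggregation pattern.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Finding = Dict[str, object]
Provider = Callable[[str], List[Finding]]


def pattern_provider(text: str) -> List[Finding]:
    """Stand-in for a rule-based provider: one illustrative phone regex."""
    return [
        {"entity": "PHONE_NUMBER", "start": m.start(), "end": m.end(), "provider": "pattern"}
        for m in re.finditer(r"\b\d{3}-\d{3}-\d{4}\b", text)
    ]


def context_provider(text: str) -> List[Finding]:
    """Stand-in for the ML provider: a canned lookup, not a real model."""
    name = "Dave Elliot"
    start = text.find(name)
    if start < 0:
        return []
    return [{"entity": "PERSON", "start": start, "end": start + len(name), "provider": "context"}]


def classify(text: str, providers: List[Provider]) -> Dict[str, List[Finding]]:
    # Callout 2: distribute the same text to every provider in parallel.
    with ThreadPoolExecutor() as pool:
        per_provider = list(pool.map(lambda provider: provider(text), providers))
    # Callout 4: aggregate the findings, grouped by entity type.
    merged: Dict[str, List[Finding]] = {}
    for findings in per_provider:
        for finding in findings:
            merged.setdefault(str(finding["entity"]), []).append(finding)
    return merged


result = classify(
    "You can reach Dave Elliot by phone 203-555-1286",
    [pattern_provider, context_provider],
)
print(sorted(result))
```

Running the providers in parallel keeps the overall latency close to that of the slowest provider rather than the sum of all of them.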

4 - Deployment

Set up Data Discovery using Docker Compose for development or Amazon EKS for production environments.

4.1 - Obtaining the Deployment Package

Download the artifacts for deploying the product from the Protegrity Customer portal.

Run the following steps to download the artifacts.

  1. Log in to the Customer Portal.

  2. Navigate to the page and download the DataDiscovery_RHUBI-9-64_x86-64_Generic.K8S_1.1.0.tar.gz file on the system.

  3. Extract this package. The following files are available:

    • README - File containing the instructions to deploy the product.
    • docker_compose - Deploying the product on a local developer setup.
    • eks-terraform-helm - Deploying the product on a scalable deployment on Amazon EKS. Terraform is used for deploying the infrastructure and Helm Charts are used for the application components.

4.2 - Docker Compose

Set up Data Discovery using Docker Compose for development and testing.

4.2.1 - Docker Compose Deployment

Use Docker Compose for a non-production environment. This set up has been designed for a local deployment and testing.

Prerequisites

  • The Deployment Package provided by Protegrity is obtained from the portal and extracted.

  • Docker CLI version greater than or equal to 28.3.0 is installed. This is required for managing Docker containers.

  • Docker Compose version greater than or equal to 2.37.3 is installed. This is required for local containerized deployments.

This deployment requires Docker Compose v2, which uses the docker compose command syntax. Ensure that the installation supports this version.

For Apple MacBook users, refer to Additional Notes.

Starting the Containers

  1. If a Docker network does not exist, run the following command to create a Docker network.
docker network create protegrity-network

This step ensures that all services communicate with each other within the same Docker network.

  2. Run the following command to launch the services in detached mode.
docker compose up -d

The classification_service is exposed on port 8050.

Verifying the Deployment

When running commands from outside the Docker network (for example, from your host machine), use the published port mapping:

curl -XPOST http://localhost:8050/pty/data-discovery/v1.1/classify --data 'You can reach Dave Elliot by phone 203-555-1286' -H "Content-Type: text/plain"

When running commands from inside the Docker network (for example, from another container), use the service name directly. This leverages Docker’s internal DNS:

curl -XPOST http://classification_service:8050/pty/data-discovery/v1.1/classify --data 'You can reach Dave Elliot by phone 203-555-1286' -H "Content-Type: text/plain"
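
The same verification request can be issued from Python using only the standard library. The following is a minimal sketch: the helper names are illustrative, and the default base URL assumes the request is made from the host through the published port 8050. From another container on the Docker network, a base URL built from the service name would be passed instead.

```python
import json
import urllib.request

CLASSIFY_PATH = "/pty/data-discovery/v1.1/classify"


def build_classify_request(text: str, base_url: str = "http://localhost:8050") -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl command."""
    return urllib.request.Request(
        url=base_url + CLASSIFY_PATH,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def classify(text: str, base_url: str = "http://localhost:8050") -> dict:
    """Send the request and decode the JSON response. Requires running containers."""
    request = build_classify_request(text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


request = build_classify_request("You can reach Dave Elliot by phone 203-555-1286")
print(request.get_method(), request.full_url)
```

Splitting request construction from sending makes it possible to inspect the URL and headers without a running deployment.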
    

Stopping the Containers

  1. Run the following command to stop and remove the Docker services.

    docker compose down
    
  2. To remove a Docker network that has been created, run the following command:

    docker network rm protegrity-network
    

Additional Notes

The following settings apply when running containers on Apple Silicon (M1/M2/M3/M4).

  • For Docker Desktop on a MacBook.

    1. Open Docker Desktop.
    2. Navigate to Settings > General.
    3. Enable Use Virtualization Framework and Use Rosetta for x86/amd64 emulation on Apple Silicon.
    4. Click Apply & Restart.
  • For Colima, start Colima using Rosetta and Apple’s virtualization framework:

      colima start --vm-type vz --vz-rosetta
    

4.2.2 - Configuring Environment Variables

Setting the environment variables for Docker Compose.

Run the following steps to edit the environment variables:

  1. Navigate to the docker_compose directory.

  2. Open the .env file and set the following variables as required:

| Variable | Description | Required |
| --- | --- | --- |
| DOCKER_CLASSIFICATION_IMAGE | Repository path where the Docker image of the Classification Service is stored. | Yes |
| DOCKER_PATTERN_PROVIDER_IMAGE | Repository path where the Docker image of the Pattern classification Service is stored. | Yes |
| DOCKER_CONTEXT_PROVIDER_IMAGE | Repository path where the Docker image of the Context classification Service is stored. | Yes |
| DOCKER_NETWORK_NAME | Name of the Docker network. | No |
| PATTERN_PROVIDER_LOGGING_CONFIG | A valid JSON Python logging configuration for the Pattern Classification Provider. | No |
| CONTEXT_PROVIDER_LOGGING_CONFIG | A valid JSON Python logging configuration for the Context Classification Provider. | No |
| CLASSIFICATION_LOGGING_CONFIG | A valid JSON Python logging configuration for the Classification Service. | No |
| ENABLE_ALL_SECURITY_CONTROLS | Controls whether security mitigations are enabled. Accepted values: true (default) or false. | No |

  3. Save the changes.
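
Before starting the containers, it can help to confirm that the required variables are set. The sketch below parses simple KEY=VALUE lines and reports which required variables are missing or empty. The parser is deliberately simplified (it does not handle quoting or variable interpolation the way Docker Compose does), and the image path in the example is a placeholder, not a real repository path.

```python
from typing import Dict, List

REQUIRED_VARIABLES = (
    "DOCKER_CLASSIFICATION_IMAGE",
    "DOCKER_PATTERN_PROVIDER_IMAGE",
    "DOCKER_CONTEXT_PROVIDER_IMAGE",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_required(values: Dict[str, str]) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]


example = """
# placeholder value for illustration only
DOCKER_CLASSIFICATION_IMAGE=<classification-image-path>
DOCKER_NETWORK_NAME=protegrity-network
"""
print(missing_required(parse_env(example)))
```

In practice the text would come from reading the .env file in the docker_compose directory.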

4.2.3 - Viewing Application Logs

Viewing Docker Logs.

The application logs can be viewed using the following commands:

docker logs -f classification_service
docker logs -f context_provider
docker logs -f pattern_provider

Setting the Log Level and other logging configuration

The log level and other valid Python Logging configuration can be set in the .env file using JSON.

Run the following steps to set the overall logging level.

  1. Navigate to the docker_compose directory.

  2. Edit the .env file.

  3. Uncomment the required logging configuration and set the logging level to one of the following:

  • INFO
  • DEBUG
  • ERROR
  • WARNING

For example, to change the log level for PATTERN_PROVIDER_LOGGING_CONFIG, configure the parameter as follows.

PATTERN_PROVIDER_LOGGING_CONFIG={"root":{"level":"ERROR"}} 
  4. Save the changes.

  5. Run the following command to undeploy the application.

docker compose down
  6. Run the following command to redeploy the application.
docker compose up -d
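
Because the logging value must be valid JSON on a single line, generating it programmatically avoids quoting mistakes. The following sketch builds the .env line for a given variable and level. The helper name is illustrative, and only the root level shown in the example above is covered.

```python
import json

VALID_LEVELS = {"INFO", "DEBUG", "ERROR", "WARNING"}


def logging_env_line(variable: str, level: str) -> str:
    """Return a .env line that sets the root log level for one service."""
    normalized = level.upper()
    if normalized not in VALID_LEVELS:
        raise ValueError(f"Unsupported log level: {level!r}")
    config = {"root": {"level": normalized}}
    # Compact separators keep the JSON free of spaces, matching the example.
    return f"{variable}={json.dumps(config, separators=(',', ':'))}"


print(logging_env_line("PATTERN_PROVIDER_LOGGING_CONFIG", "error"))
```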

4.3 - Amazon EKS

Set up Data Discovery on Amazon EKS for scalable, production-grade infrastructure using Terraform and Helm.

4.3.1 - Prerequisites

Required tools, permissions, and infrastructure setup for EKS deployment

Before deploying Data Discovery on Amazon EKS, ensure that the following requirements are met for a smooth deployment process.

Tools and Permissions

The following tools must be installed and properly configured on your local machine:

  • AWS CLI version 2.28.3 is installed. It is a command-line interface for AWS services and must be configured with valid credentials that have EKS cluster creation and management permissions. For more information about the configuration details, refer to Configuration and credentials precedence.

  • kubectl (Client v1.32.0-eks-5ca49cb, Server v1.33.3-eks-ace6451) is installed. It is a Kubernetes command-line tool for cluster management and application deployment operations.

  • Helm version 3.18.4 is installed. It is a Kubernetes package manager used to deploy and manage Data Discovery application charts on the EKS cluster.

  • Terraform version 1.12.2 is installed. It is an infrastructure-as-code tool for provisioning and managing EKS cluster resources in a reproducible manner.

Infrastructure Requirements

  • Amazon VPC, a properly configured Virtual Private Cloud with at least two subnets in different availability zones for high availability and fault tolerance.

4.3.2 - EKS Deployment Architecture

Components and their description.

The following architectural diagram illustrates the main components in the deployment of the product on EKS.

| Component | Description |
| --- | --- |
| Ingress Controller | The Ingress controller acts as the single point of entry for the requests provided by a user. |
| Ingress rule | The Ingress rule routes the requests to the Classification service. |
| Classification pods | Classification service pods that act as the main entry point and the aggregator of the responses provided by the service providers. |
| Context and Pattern service providers | Pattern and Context service provider pods that perform the task of identifying the sensitive data. |

4.3.3 - Deploying the Application

Deploying the application and components.

The step-by-step deployment of Data Discovery on Amazon EKS is explained here. Each component builds on the previous, ensuring a reliable and production-ready environment.

The deployment is separated into two main phases:

  • Phase 1: Infrastructure (Terraform) - Provisions the EKS cluster and underlying AWS resources
  • Phase 2: Applications (Helm) - Deploys Kubernetes components and the Data Discovery application

After completing Phase 1 (Terraform), or if an existing EKS cluster is used instead, configure the kubectl context to connect to the cluster:

   aws eks update-kubeconfig --region <region> --name <cluster-name>
   # Replace `<region>` with your AWS region and `<cluster-name>` with your EKS cluster name.

4.3.3.1 - EKS Control Plane Provisioning (Terraform)

Deploy the required infrastructure - Terraform setup for EKS cluster, IAM roles, and VPC

Before you Begin

Ensure that the following points are considered.

  • The AWS CLI is configured.

  • The VPC is configured with at least two private subnets.

  • Terraform is installed.

  • kubectl is installed.

Configuring the Parameters

Configure the following parameters in the terraform.tfvars file available in the terraform directory.

| Name | Description | Type | Required |
| --- | --- | --- | --- |
| vpc_id | Existing VPC ID. | string | Yes |
| vpc_subnet_ids | List of private subnet IDs. | list(string) | Yes |
| cluster_name | Name of the EKS cluster. Default set to "eks-terraform". | string | No |
| aws_region | Region for the AWS deployment. Default set to "us-east-1". | string | No |
| eks_cluster_role_arn | Existing IAM role for EKS control plane. Default set to null. | string | No |
| eks_node_role_arn | Existing IAM role for node group. Default set to null. | string | No |
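
As an illustration of how these parameters fit together, the sketch below renders the contents of a terraform.tfvars file and enforces the requirement of at least two subnets. The helper name and the VPC and subnet IDs are placeholders, and the optional IAM role parameters are left out so that their defaults apply.

```python
import json
from typing import List


def render_tfvars(
    vpc_id: str,
    vpc_subnet_ids: List[str],
    cluster_name: str = "eks-terraform",
    aws_region: str = "us-east-1",
) -> str:
    """Render terraform.tfvars content for the documented parameters."""
    if len(vpc_subnet_ids) < 2:
        raise ValueError("At least two private subnets are required.")
    settings = {
        "vpc_id": vpc_id,
        "vpc_subnet_ids": vpc_subnet_ids,
        "cluster_name": cluster_name,
        "aws_region": aws_region,
    }
    # JSON strings and lists of strings are also valid HCL literals.
    lines = [f"{name} = {json.dumps(value)}" for name, value in settings.items()]
    return "\n".join(lines) + "\n"


# Placeholder IDs for illustration only.
print(render_tfvars("vpc-0example", ["subnet-0exampleA", "subnet-0exampleB"]))
```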

Deploying Terraform

Run the following commands to provision the infrastructure.

cd terraform
terraform init
terraform apply -auto-approve

Verifying the Installation

Run the following commands to verify the deployment.

terraform output

Sample output:

eks_cluster_name = "eks-terraform"
eks_cluster_endpoint = "<Endpoint URL>"
eks_cluster_region = "us-east-1"
eks_update_kubeconfig_command = "aws eks update-kubeconfig --region us-east-1 --name eks-terraform"

Run the following command to verify the cluster that was created.

kubectl get nodepools

Sample output:

NAME              NODECLASS   NODES   READY   AGE
general-purpose   default     0       True    ...
system            default     0       True    ...

Updating kubeconfig after Deployment

After deploying the cluster, update the local kubeconfig to interact with the cluster. The following command runs the kubeconfig update command reported by Terraform, which points the local kubeconfig at the new EKS cluster.

$(terraform output -raw eks_update_kubeconfig_command)
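
When the update needs to be driven from a script rather than through shell substitution, the same value can be read from the JSON form of the outputs (terraform output -json), in which each output is an object with a value field. The sketch below extracts the command and splits it into an argument list. The helper name is illustrative, and the sample JSON mirrors the sample output shown earlier rather than a real cluster.

```python
import json
import shlex
from typing import List


def kubeconfig_command(terraform_output_json: str) -> List[str]:
    """Extract the kubeconfig update command as an argument list."""
    outputs = json.loads(terraform_output_json)
    command = outputs["eks_update_kubeconfig_command"]["value"]
    # shlex.split avoids passing the string through a shell.
    return shlex.split(command)


sample = json.dumps({
    "eks_update_kubeconfig_command": {
        "sensitive": False,
        "type": "string",
        "value": "aws eks update-kubeconfig --region us-east-1 --name eks-terraform",
    }
})
print(kubeconfig_command(sample))
```

The resulting list can be passed to subprocess.run without shell=True.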

4.3.3.2 - Metrics Server

Deploy a Metrics Server for autoscaling capabilities.

Requirements

  • An EKS cluster is provisioned.

  • The cluster is connected and the kubeconfig is properly configured.

Run the following command to connect a local environment to the EKS cluster.

aws eks update-kubeconfig --region <region> --name <cluster-name>

Installing the Component

cd helm/metrics-server
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server || true
helm repo update
helm dependency build
helm install metrics-server . \
  --namespace kube-system \
  --create-namespace

For any custom configuration changes, create a values-override.yaml file and add -f values-override.yaml to the helm install command. It is not recommended to modify the configurations in the values.yaml file.

Verifying the Installation

Check that the Metrics Server deployment is ready:

kubectl get deployment metrics-server -n kube-system

Sample output.

NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           ...

Run the following command to verify that node metrics are available.

kubectl top nodes

Uninstalling the Component

Run the following command to uninstall the Metrics Server:

helm uninstall metrics-server \
  --namespace kube-system

4.3.3.3 - Karpenter NodePool

Deploy a Karpenter NodePool for EKS to enable automatic node provisioning and scaling for Data Discovery workloads.

Requirements

  • An EKS cluster is provisioned.

  • The cluster is connected and the kubeconfig is properly configured.

  • karpenter.sh/v1 CRDs are available. Auto Mode includes these by default.

Run the following command to connect a local environment to the EKS cluster.

aws eks update-kubeconfig --region <region> --name <cluster-name>

Installing the Component

cd helm/karpenter-node-pool
helm install karpenter-nodepool . \
  --namespace default \
  --create-namespace

Verifying the Installation

Run the following command to check the NodePool resource.

kubectl get nodepools

Sample output after the process is completed.

NAME                  NODECLASS   NODES   READY   AGE
m5-large-node-pool    default     0       True    ...

No nodes will appear until a matching workload is scheduled. Node creation is confirmed after a pod requests this NodePool’s label.

Uninstalling the Component

Run the following command to uninstall the Karpenter NodePool.

helm uninstall karpenter-nodepool \
  --namespace default

Ensure that no workloads are actively using this NodePool before removal. Any running pods scheduled on nodes from this pool may be terminated during the uninstall process.

4.3.3.4 - Ingress Controller

Deploy an internal-only NGINX ingress controller with private AWS NLB for a secure TLS-only access to Data Discovery services within your VPC.

Requirements

  • The EKS cluster is provisioned.

  • The cluster is connected and the kubeconfig is properly configured.

Run the following command to connect a local environment to the EKS cluster.

aws eks update-kubeconfig --region <region> --name <cluster-name>

Configuration

This chart wraps the official ingress-nginx chart using the alias private-ingress and lets you customize the default certificate that is used for all TLS communication handled by this controller.

To configure TLS certificates, place the certificate files in the following folder.

ingress-controller/certs/tls.crt
ingress-controller/certs/tls.key

For more information about creating TLS certificates, refer to Create and configure certificates (AWS docs)

It is recommended not to edit the values.yaml file unless required. To customize configurations, create a values-override.yaml file with the desired changes and use the -f values-override.yaml flag during installation.

Installing the Component

cd helm/ingress-controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx || true
helm repo update
helm dependency build
helm install ingress-controller . \
  --namespace ingress-nginx \
  --create-namespace \
  --set-file tls.crt=./certs/tls.crt \
  --set-file tls.key=./certs/tls.key

If TLS is not configured, omit the --set-file tls lines in the command above.

For any custom configuration changes, create a values-override.yaml file and add -f values-override.yaml to the helm install command. It is not recommended to modify the configurations in the values.yaml file.

This deploys the controller (and a TLS secret if configured) under the ingress-nginx namespace and exposes it through an internal AWS NLB.

Verifying the Installation

Checking the controller pods

kubectl get pods -n ingress-nginx

Example output:

NAME                             READY   STATUS    RESTARTS   AGE
private-ingress-controller-xxx   1/1     Running   0          ...

Confirming the service is created

kubectl get svc -n ingress-nginx

Example output:

NAME                        TYPE           CLUSTER-IP     EXTERNAL-IP                                                               PORT(S)
private-ingress-controller  LoadBalancer   10.x.x.x       internal-<hash>.<region>.elb.amazonaws.com   443:xxxx/TCP

Checking the IngressClass

kubectl get ingressclass

Example output:

NAME             CONTROLLER             PARAMETERS   AGE
private-nginx    k8s.io/ingress-nginx   <none>       ...

This IngressClass is automatically used by any Ingress with no ingressClassName or one explicitly set to private-nginx.

Uninstalling the Component

Run the following command to uninstall the Ingress Controller.

helm uninstall ingress-controller \
  --namespace ingress-nginx

This will remove the AWS Load Balancer and make any applications using this ingress controller inaccessible from outside the cluster. Ensure all dependent services are stopped or reconfigured before removal.

4.3.3.5 - Data Discovery Classification

Deploy the Data Discovery Classification service with Pattern and Context providers for data classification and transformation.

Requirements

The following requirements are mandatory before deploying the product.

  • An EKS cluster is provisioned.

  • The cluster is connected and the kubeconfig is properly configured.

The following components are optional.

Run the following command to connect a local environment to the EKS cluster.

aws eks update-kubeconfig --region <region> --name <cluster-name>

Installing the Service

  1. Set the Docker registry credentials that were provided as environment variables:
export DOCKER_USERNAME=myuser
export DOCKER_PASSWORD=mypassword
  2. Install the chart using the following command.
cd helm/data-discovery-classification
helm install data-discovery-classification . \
  --namespace default \
  --create-namespace \
  --wait \
  --wait-for-jobs \
  --timeout 900s \
  --set docker.creds.username=$DOCKER_USERNAME \
  --set docker.creds.password=$DOCKER_PASSWORD

Note: For any custom configuration changes, create a values-override.yaml file and add -f values-override.yaml to the helm install command instead of modifying the default values.yaml file.

The --wait flag with a 15-minute timeout is recommended as the installation typically completes in 5-7 minutes due to large Docker image downloads. Monitor the installation progress in another terminal using the verification commands.

If the registry that is used does not require basic authentication (e.g., ECR or a private registry), omit the --set docker lines in the command above.

Verifying the Installation

Get Deployments, Services, and HPAs

kubectl get deploy,svc,hpa -n default

Expected output:

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/classification-deployment     1/1     1            1           ...
deployment.apps/context-provider-deployment   1/1     1            1           ...
deployment.apps/pattern-provider-deployment   1/1     1            1           ...

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/classification-service     ClusterIP   172.20.x.x      <none>        8050/TCP   ...
service/context-provider-service   ClusterIP   172.20.x.x      <none>        8052/TCP   ...
service/pattern-provider-service   ClusterIP   172.20.x.x      <none>        8051/TCP   ...

NAME                                                             REFERENCE                                TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/classification-service-hpa   Deployment/classification-deployment     cpu: 50%/50%        1         5         1          ...
horizontalpodautoscaler.autoscaling/context-provider-hpa         Deployment/context-provider-deployment   cpu: 65%/65%        1         20        1          ...
horizontalpodautoscaler.autoscaling/pattern-provider-hpa         Deployment/pattern-provider-deployment   cpu: 90%/90%        1         3         1          ...

All deployments must show 1/1 in the READY column after the deployment is completed. During startup, it is expected to see 0/1 and cpu: <unknown>.

Ingress

kubectl get ingress -n default

Expected output:

NAME                          CLASS           HOSTS   ADDRESS                                        PORTS   AGE
classification-ingress-rule   private-nginx   *       <load-balancer-dns>.elb.amazonaws.com.         443     ...

Ingress Endpoint Testing

INGRESS_HOST=$(kubectl get svc ingress-controller-private-ingress-controller \
  -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Fallback to IP
if [ -z "$INGRESS_HOST" ]; then
  INGRESS_HOST=$(kubectl get svc ingress-controller-private-ingress-controller \
    -n ingress-nginx \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
fi

echo "Ingress available at: $INGRESS_HOST"
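
The hostname-then-IP fallback in the script above can also be expressed in Python against the JSON form of the Service (kubectl get svc <name> -n ingress-nginx -o json). This is a sketch only: the helper name is illustrative, and the sample documents below are trimmed, made-up Service objects.

```python
import json
from typing import Optional


def ingress_host(service_json: str) -> Optional[str]:
    """Return the load balancer hostname, falling back to its IP address."""
    service = json.loads(service_json)
    entries = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if not entries:
        return None  # The load balancer has not been provisioned yet.
    first = entries[0]
    return first.get("hostname") or first.get("ip")


with_hostname = json.dumps(
    {"status": {"loadBalancer": {"ingress": [{"hostname": "internal-example.elb.amazonaws.com"}]}}}
)
with_ip = json.dumps({"status": {"loadBalancer": {"ingress": [{"ip": "10.0.0.12"}]}}})
pending = json.dumps({"status": {"loadBalancer": {}}})

print(ingress_host(with_hostname), ingress_host(with_ip), ingress_host(pending))
```

Returning None for a pending load balancer lets a caller retry instead of sending requests to an empty host.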

Running Requests

curl -k https://$INGRESS_HOST/readiness
curl -k https://$INGRESS_HOST/healthz
curl -k https://$INGRESS_HOST/startup

curl -k -X POST https://$INGRESS_HOST/pty/data-discovery/v1.1/classify \
  -H 'Content-Type: text/plain' \
  --data 'You can reach Dave Elliot by phone 203-555-1286'

Custom Configuration

The chart is production-ready and the required configurations and default container images are set in the values.yaml file. However, customized container images can also be configured.

To use your own container images, perform the following steps:

  1. Create a values-override.yaml file with the following configuration.
docker:
  registry: "<Address of the image-repository>"
# e.g.: 
# docker:
#   registry: "registry.protegrity.com"

serviceImages:
  classification: "<Name of the classification-image>"
  pattern: "<Name of the pattern-provider-image>"
  context: "<Name of the context-provider-image>"
# e.g.:
# serviceImages:
#  classification: "products/data_discovery/1.1/classification_service:latest"
#  pattern: "products/data_discovery/1.1/pattern_classification_provider:latest"
#  context: "products/data_discovery/1.1/context_classification_provider:latest"
  2. Run the following installation command.
helm install data-discovery-classification . \
  --namespace default \
  --create-namespace \
  --wait \
  --wait-for-jobs \
  --timeout 900s \
  --set docker.creds.username=$DOCKER_USERNAME \
  --set docker.creds.password=$DOCKER_PASSWORD \
  -f values-override.yaml

Uninstalling the Service

Run the following command to uninstall the Data Discovery Classification application.

helm uninstall data-discovery-classification \
  --namespace default \
  --wait \
  --timeout 300s

This will remove the classification, pattern provider, and context provider services. Also, the associated ConfigMaps, Services, and HPA resources will be removed. Any persistent data or logs will be lost during this process.

Resources may take a couple of minutes to be fully terminated. Re-installing immediately after uninstall can lead to an inconsistent state. Wait for all pods to be completely removed before reinstalling.

Troubleshooting

Run the following commands to inspect the state of the deployment.

Viewing all Pods in the Namespace

kubectl get pods -n default

Viewing all Services in the Namespace

kubectl get svc -n default

Viewing Logs for a Specific Pod

kubectl logs <pod-name> -n default

Describing a Specific Pod

kubectl describe pod <pod-name> -n default

4.3.4 - Viewing Application Logs

Viewing EKS application logs.

The application logs can be viewed using the following commands:

 kubectl logs classification-deployment-{version} -n default -f
 kubectl logs context-provider-deployment-{version} -n default -f
 kubectl logs pattern-provider-deployment-{version} -n default -f

Run the kubectl get pods -n <namespace-name> command to obtain the full pod names, where {version} is the generated suffix of each pod.

Setting the Log Level and other Logging Configuration

Set the log level and other valid Python Logging configuration.

  1. Navigate to the helm/data-discovery-classification directory in your downloaded deployment package.

  2. Create a values-override.yaml file with the required logging configuration.

classificationAppConfig:
  loggingConfig:
    root:
      level: WARNING  # Can be INFO, DEBUG, ERROR, or WARNING
  3. Save the changes.

  4. Run the following installation command.

helm install data-discovery-classification . \
  --namespace default \
  --create-namespace \
  --wait \
  --wait-for-jobs \
  --timeout 900s \
  -f values-override.yaml

5 - APIs

APIs and supporting information.

5.1 - Classify

Identify, classify and locate sensitive data.

5.1.1 - Classify Text API

Classify plain text unstructured data.

POST https://{Host Address}/pty/data-discovery/v1.1/classify

Query Parameters

score_threshold

  • Type: float
  • Description: Optional. Exclude results with a score lower than this threshold.
  • Values: Minimum 0, Maximum 1.0
  • Default: 0.00

Body

  • The content must be plain text in UTF-8 encoding.

  • The length of the body is limited to 10K bytes.
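
These constraints can be checked on the client before a request is sent. The sketch below validates the threshold range and the encoded body size. The helper name is illustrative, and the limit is taken as 10,000 bytes, which is an assumption: the text above says only "10K", and 10,000 is the more conservative reading.

```python
# Assumption: "10K bytes" is read as 10,000 bytes, the stricter interpretation.
MAX_BODY_BYTES = 10_000


def validate_request(text: str, score_threshold: float = 0.0) -> bytes:
    """Return the UTF-8 encoded body, or raise ValueError if a limit is exceeded."""
    if not 0.0 <= score_threshold <= 1.0:
        raise ValueError("score_threshold must be between 0 and 1.0")
    body = text.encode("utf-8")
    # The limit applies to bytes, so non-ASCII characters count more than once.
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"Body is {len(body)} bytes; the limit is {MAX_BODY_BYTES}")
    return body


body = validate_request("You can reach Dave Elliot by phone 203-555-1286", 0.85)
print(len(body))
```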

Sample Request

cURL:

curl -X POST "https://<SERVER_IP>/pty/data-discovery/v1.1/classify?score_threshold=0.85" \
  -H "Content-Type: text/plain" \
  --data "You can reach Dave Elliot by phone 203-555-1286"

Python:

import requests

url = "https://<SERVER_IP>/pty/data-discovery/v1.1/classify"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/plain"}
data = "You can reach Dave Elliot by phone 203-555-1286"

# verify=False disables TLS certificate verification; use it only with self-signed test certificates.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
print("Response JSON:", response.json())

Generic:

URL: POST https://<SERVER_IP>/pty/data-discovery/v1.1/classify
Query Parameters:
- score_threshold (optional), float between 0.0 and 1.0, default: 0.
Headers:
- Content-Type: text/plain
Body:
- You can reach Dave Elliot by phone 203-555-1286

Sample Response

{
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.028261899948120117,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.040960073471069336,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "PERSON": [
            {
                "score": 0.9238499879837037,
                "location": {
                    "start_index": 14,
                    "end_index": 25
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "PERSON",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9976999759674072,
                        "original_entity": "NAME",
                        "details": {}
                    }
                ]
            }
        ],
        "PHONE_NUMBER": [
            {
                "score": 0.9995999932289124,
                "location": {
                    "start_index": 35,
                    "end_index": 47
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9995999932289124,
                        "original_entity": "PHONE",
                        "details": {}
                    }
                ]
            }
        ]
    }
}

Response Fields Description

Providers Section

Name | Example Response | Description
providers | Array | Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name | Pattern Classification Provider | Product name of the provider.
providers[n].version | 1.1.0 | Version of the provider.
providers[n].status | 200 | HTTP response code returned by the provider.
providers[n].elapsed_time | 0.028 | Time, in seconds, taken by the provider to process the request.
providers[n].config_provider | Object | Object containing configuration details for each provider.
providers[n].config_provider.name | Pattern | Internal name of the provider.
providers[n].config_provider.address | http://pattern_provider_service:8051 | Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types | [] | Array of supported content types. An empty array indicates support for all content types.

Classifications Section

Name | Example Response | Description
classifications | Dictionary | A dictionary mapping entity types (e.g., “PERSON”, “PHONE_NUMBER”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details.
classifications['entity'][n].score | 0.9238 | The confidence score for the detected entity, aggregated from all contributing classifiers.
classifications['entity'][n].location | Object | An object specifying the location of the entity within the input text.
classifications['entity'][n].location.start_index | 14 | The starting index of the entity in the input text.
classifications['entity'][n].location.end_index | 25 | The ending index of the entity in the input text.
classifications['entity'][n].classifiers | Array | An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index | 0 | The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name | SpacyRecognizer | The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score | 0.85 | The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].original_entity | PERSON | The original entity type detected by the classifier. See Harmonization for details.
classifications['entity'][n].classifiers[m].details | Object | Optional. Additional key-value details provided by the classifier.
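
As a rough illustration of how the fields above fit together, the following Python sketch walks a classify response and slices each detected entity out of the original input. It assumes that end_index is exclusive, which is consistent with the sample response (14 to 25 covers "Dave Elliot") but is not stated explicitly in the field descriptions. The helper name extract_entities is hypothetical and is not part of the product.

```python
def extract_entities(text, response):
    """Return (entity_type, matched_text, score) tuples ordered by position.

    Assumption: end_index is exclusive, as the sample response suggests.
    """
    found = []
    for entity_type, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            start = occ["location"]["start_index"]
            end = occ["location"]["end_index"]
            found.append((start, entity_type, text[start:end], occ["score"]))
    found.sort()  # order by start position in the input text
    return [(etype, value, score) for _, etype, value, score in found]


# Trimmed copy of the sample request body and sample response from this section.
sample_text = "You can reach Dave Elliot by phone 203-555-1286"
sample_response = {
    "classifications": {
        "PERSON": [{"score": 0.9238499879837037,
                    "location": {"start_index": 14, "end_index": 25}}],
        "PHONE_NUMBER": [{"score": 0.9995999932289124,
                          "location": {"start_index": 35, "end_index": 47}}],
    }
}

for entity_type, value, score in extract_entities(sample_text, sample_response):
    print(f"{entity_type}: {value!r} (score {score:.4f})")
```

In a real client, response would be the parsed JSON body returned by the classify endpoint and text the exact string that was posted.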

Response Codes

Response Code | Description
200 | Successful Response.
206 | Partial Content. Only some providers classified data successfully.
400 | Bad Request. Invalid input parameters or content.
413 | Payload too large.
415 | Unsupported media type.
422 | Untrusted input. For more information, refer to Input Validation.
502 | Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598 | Unexpected internal server error. Check server logs.
599 | Internal server error. Check server logs.
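
One way a client might act on the response codes listed above is sketched below in Python. The descriptions are shortened from the table. Treating 206 as usable but incomplete is an assumption of this sketch, and the function name handle_status is hypothetical.

```python
# Descriptions shortened from the response code table above.
RESPONSE_CODES = {
    200: "Successful response.",
    206: "Partial content. Only some providers classified data successfully.",
    400: "Bad request. Invalid input parameters or content.",
    413: "Payload too large.",
    415: "Unsupported media type.",
    422: "Untrusted input.",
    502: "Bad gateway. All upstream providers failed.",
    598: "Unexpected internal server error. Check server logs.",
    599: "Internal server error. Check server logs.",
}


def handle_status(status_code):
    """Return (usable, message) for a status code.

    Assumption: both 200 and 206 carry classification results, with 206
    meaning the results may be incomplete.
    """
    message = RESPONSE_CODES.get(status_code, "Undocumented response code.")
    usable = status_code in (200, 206)
    return usable, message


for code in (200, 206, 502):
    usable, message = handle_status(code)
    print(code, "usable" if usable else "not usable", "-", message)
```

For a 206 response, the status field of each entry in the providers array shows which provider failed.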

5.1.2 - Classify CSV API

Classify structured CSV data.

POST https://{Host Address}/pty/data-discovery/v1.1/classify

Query Parameters

score_threshold

  • Type: float
  • Description: Optional. Exclude results with a score lower than this threshold.
  • Values: Minimum 0, Maximum 1.0
  • Default: 0.00

has_headers

  • Type: boolean
  • Description: Optional. Indicates whether the first row represents the column header.
  • Values: true/false
  • Default: true

column_delimiter

  • Type: char
  • Description: Optional. Delimiter to separate the columns.
  • Values: , |
  • Default: ,

quote_char

  • Type: char
  • Description: Optional. Character used to quote fields that contain special characters, such as the column_delimiter or new-line characters.
  • Values: ""

Body

  • Content type should be text/csv and in UTF-8 format.

  • Body size is limited to 10K Bytes
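
The query parameters and body constraints above can be combined in a small client-side helper. The Python sketch below serializes rows with the standard csv module, applies the chosen delimiter and quote character, and rejects bodies over the size limit before anything is sent. Several points are assumptions: "10K Bytes" is read as 10 * 1024 bytes, the boolean is sent as the string true or false, and the name build_csv_request is hypothetical.

```python
import csv
import io

MAX_BODY_BYTES = 10 * 1024  # Assumption: "10K Bytes" is read as 10 KiB.


def build_csv_request(rows, delimiter=",", quote_char='"', has_headers=True):
    """Serialize rows to CSV and return (params, headers, body_bytes)."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar=quote_char,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\n",
    )
    writer.writerows(rows)
    body = buffer.getvalue().encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"CSV body is {len(body)} bytes; limit is {MAX_BODY_BYTES}")
    params = {
        # Assumption: the service accepts the literal strings "true"/"false".
        "has_headers": "true" if has_headers else "false",
        "column_delimiter": delimiter,
        "quote_char": quote_char,
    }
    headers = {"Content-Type": "text/csv"}
    return params, headers, body


rows = [["Name", "Note"], ["Jake Filbert", "lives in Hamden, Connecticut"]]
params, headers, body = build_csv_request(rows)
print(body.decode("utf-8"))
```

The second field contains the delimiter, so the writer wraps it in the quote character. The returned values can then be passed to requests.post(url, params=params, headers=headers, data=body).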

Sample Request

curl -X POST "https://<SERVER_IP>/pty/data-discovery/v1.1/classify?score_threshold=0.85" \
     --header 'Content-Type: text/csv' \
     --data-raw 'Social Security Number,Credit Card Number,IBAN,Phone Number
589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371'
import requests

url = "https://<SERVER_IP>/pty/data-discovery/v1.1/classify"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/csv"}
data = """Social Security Number,Credit Card Number,IBAN,Phone Number
589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371
"""

# verify=False skips TLS certificate verification; use it only against test servers.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    # The body is not valid JSON (for example, an error page), so print it as text.
    print("Response Text:", response.text)
    
URL: POST `https://<SERVER_IP>/pty/data-discovery/v1.1/classify`
      Query Parameters:
      -score_threshold (optional), float between 0.0 and 1.0, default: 0.
      -has_headers (optional), Indicates whether the first row represents the column header.
      -column_delimiter (optional), Delimiter to separate the columns.
      -quote_char (optional), Character to quote fields containing special characters, such as, the column_delimiter or new-line characters.
      Headers:
      -Content-Type: text/csv
      Body:
      -Social Security Number,Credit Card Number,IBAN,Phone Number
     589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
     636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
     748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
     516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
     121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
     838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
     439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
     564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
     518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371
   

Sample Response

{
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.31273603439331055,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 1.1383004188537598,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "SOCIAL_SECURITY_ID": [
            {
                "score": 0.9994888835483127,
                "rows_processed": 9,
                "location": {
                    "column_name": "Social Security Number",
                    "column_index": 0
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9994888835483127,
                        "details": {}
                    }
                ]
            }
        ],
        "CREDIT_CARD": [
            {
                "score": 0.9986333317226834,
                "rows_processed": 9,
                "location": {
                    "column_name": "Credit Card Number",
                    "column_index": 1
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9986333317226834,
                        "details": {}
                    }
                ]
            }
        ],
        "BANK_ACCOUNT": [
            {
                "score": 0.7901234567901234,
                "rows_processed": 9,
                "location": {
                    "column_name": "IBAN",
                    "column_index": 2
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "IbanRecognizer",
                        "rows_with_classification": 8,
                        "total_classifications": 8,
                        "score": 0.8888888888888888,
                        "details": {}
                    }
                ]
            }
        ],
        "PHONE_NUMBER": [
            {
                "score": 0.9961333341068692,
                "rows_processed": 9,
                "location": {
                    "column_name": "Phone Number",
                    "column_index": 3
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9961333341068692,
                        "details": {}
                    }
                ]
            }
        ]
    }
}

Response Fields Description

Providers Section

Name | Example Response | Description
providers | Array | Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name | Pattern Classification Provider | Product name of the provider.
providers[n].version | 1.1.0 | Version of the provider.
providers[n].status | 200 | HTTP response code returned by the provider.
providers[n].elapsed_time | 0.028 | Time, in seconds, taken by the provider to process the request.
providers[n].config_provider | Object | Object containing configuration details for each provider.
providers[n].config_provider.name | Pattern | Internal name of the provider.
providers[n].config_provider.address | http://pattern_provider_service:8051 | Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types | [] | Array of supported content types. An empty array indicates support for all content types.

Classifications Section

Name | Example Response | Description
classifications | Dictionary | A dictionary mapping entity types (e.g., “SOCIAL_SECURITY_ID”, “CREDIT_CARD”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location, classifier, and row details.
classifications['entity'][n].score | 0.9995 | The confidence score for the detected entity, aggregated from all contributing classifiers and their reported scores.
classifications['entity'][n].rows_processed | 9 | The number of rows passed to and processed by the classification request.
classifications['entity'][n].location | Object | An object specifying the location of the entity within the CSV data.
classifications['entity'][n].location.column_name | Social Security Number | The name of the column in which the entity was detected.
classifications['entity'][n].location.column_index | 0 | The index of the column in which the entity was detected.
classifications['entity'][n].classifiers | Array | An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index | 1 | The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name | context | The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score | 0.9995 | The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].rows_with_classification | 9 | The number of rows in which the entity was classified by this classifier.
classifications['entity'][n].classifiers[m].total_classifications | 9 | The total number of classifications made by this classifier in this location. Multiple entities can be found within a single column, for example, date and time or a complex address.
classifications['entity'][n].classifiers[m].details | Object | Optional. Additional key-value details provided by the classifier.
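
To show how the column-level fields above might be consumed, the following Python sketch turns a CSV classify response into a per-column summary. The "coverage" value (rows_with_classification divided by rows_processed) is a convenience computed by this sketch and is not a field returned by the service. The function name summarize_columns is hypothetical, and the sketch assumes column_name is always present.

```python
def summarize_columns(response):
    """Map each column name to the entity types detected in it."""
    summary = {}
    for entity_type, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            column = occ["location"]["column_name"]
            rows = occ["rows_processed"]
            # Largest share of rows flagged by any single classifier
            # (a client-side convenience, not a service field).
            coverage = 0.0
            if rows:
                coverage = max(
                    (c["rows_with_classification"] / rows for c in occ["classifiers"]),
                    default=0.0,
                )
            summary.setdefault(column, []).append(
                {"entity": entity_type, "score": occ["score"], "coverage": coverage}
            )
    return summary


# Trimmed copy of two entries from the sample response in this section.
sample_response = {
    "classifications": {
        "SOCIAL_SECURITY_ID": [{
            "score": 0.9994888835483127, "rows_processed": 9,
            "location": {"column_name": "Social Security Number", "column_index": 0},
            "classifiers": [{"provider_index": 1, "name": "context",
                             "rows_with_classification": 9, "score": 0.9994888835483127}],
        }],
        "BANK_ACCOUNT": [{
            "score": 0.7901234567901234, "rows_processed": 9,
            "location": {"column_name": "IBAN", "column_index": 2},
            "classifiers": [{"provider_index": 0, "name": "IbanRecognizer",
                             "rows_with_classification": 8, "score": 0.8888888888888888}],
        }],
    }
}

for column, findings in summarize_columns(sample_response).items():
    for f in findings:
        print(f"{column}: {f['entity']} score={f['score']:.3f} coverage={f['coverage']:.2f}")
```

In the sample, the IBAN column is classified in 8 of the 9 processed rows, which is why its coverage is lower than that of the other columns.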

Response Codes

Response Code | Description
200 | Successful Response.
206 | Partial Content. Only some providers classified data successfully.
400 | Bad Request. Invalid input parameters or content.
413 | Payload too large.
415 | Unsupported media type.
422 | Untrusted input. For more information, refer to Input Validation.
502 | Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598 | Unexpected internal server error. Check server logs.
599 | Internal server error. Check server logs.

5.2 - Transform

Identify, classify, and transform sensitive data.

5.2.1 - Label Text API

Identify and classify sensitive data in plain text, then replace it with labels for the classified entity types, such as [CREDIT_CARD].

POST https://{Host Address}/pty/data-discovery/v1.1/transform/label

Query Parameters

score_threshold

  • Type: float
  • Description: Optional. Label results where the score is greater than this threshold.
  • Values: Minimum 0, Maximum 1.0
  • Default: 0.7

include_providers

  • Type: binary
  • Description: Optional. Include details of the service providers in the response.
  • Values: Yes / No
  • Default: No

include_classification_details

  • Type: binary
  • Description: Optional. Include classification details in the response.
  • Values: Yes / No
  • Default: No

Body

  • Content type must be text/plain and in UTF-8 format.

  • Body size is limited to 10K Bytes

Sample Request

curl -X POST "https://<SERVER_IP>/pty/data-discovery/v1.1/transform/label?score_threshold=0.85" \
          -H "Content-Type: text/plain" \
          --data "Jake lives at 15 Main st, Hamden 06517, Connecticut."
import requests

url = "https://<SERVER_IP>/pty/data-discovery/v1.1/transform/label"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/plain"}
data = "Jake lives at 15 Main st, Hamden 06517, Connecticut."

# verify=False skips TLS certificate verification; use it only against test servers.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
print("Response JSON:", response.json())
URL: POST `https://<SERVER_IP>/pty/data-discovery/v1.1/transform/label`
   Query Parameters:
   -score_threshold (optional), float between 0.0 and 1.0, default: 0.7.
   Headers:
   -Content-Type: text/plain
   Body:
   -Jake lives at 15 Main st, Hamden 06517, Connecticut.

Sample Responses


Sample Response Default

{ "transform": { "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]." } }

The fields are described as follows:

Name | Example Response | Description
transform.text | [PERSON] lives at [LOCATION].. | The labeled input text, with the names of the classified entities in place of the original sensitive data.

Sample Response with Detail

{
        "transform": {
            "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
        },
        "providers": [
            {
                "name": "Pattern Classification Provider",
                "version": "1.1.0",
                "status": 200,
                "elapsed_time": 0.011328935623168945,
                "config_provider": {
                    "name": "Pattern",
                    "address": "http://pattern_provider_service:8051",
                    "supported_content_types": []
                }
            },
            {
                "name": "Context Classification Provider",
                "version": "1.1.0",
                "status": 200,
                "elapsed_time": 0.03895401954650879,
                "config_provider": {
                    "name": "Context",
                    "address": "http://context_provider_service:8052",
                    "supported_content_types": []
                }
            }
        ],
        "classifications": {
            "LOCATION": [
                {
                    "score": 0.85,
                    "location": {
                        "start_index": 17,
                        "end_index": 24
                    },
                    "classifiers": [
                        {
                            "provider_index": 0,
                            "name": "SpacyRecognizer",
                            "score": 0.85,
                            "original_entity": "LOCATION",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9240000128746033,
                    "location": {
                        "start_index": 26,
                        "end_index": 32
                    },
                    "classifiers": [
                        {
                            "provider_index": 0,
                            "name": "SpacyRecognizer",
                            "score": 0.85,
                            "original_entity": "LOCATION",
                            "details": {}
                        },
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9980000257492065,
                            "original_entity": "CITY",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9244499981403351,
                    "location": {
                        "start_index": 40,
                        "end_index": 51
                    },
                    "classifiers": [
                        {
                            "provider_index": 0,
                            "name": "SpacyRecognizer",
                            "score": 0.85,
                            "original_entity": "LOCATION",
                            "details": {}
                        },
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9988999962806702,
                            "original_entity": "STATE",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9958999752998352,
                    "location": {
                        "start_index": 14,
                        "end_index": 16
                    },
                    "classifiers": [
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9958999752998352,
                            "original_entity": "BUILDING",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9983999729156494,
                    "location": {
                        "start_index": 33,
                        "end_index": 38
                    },
                    "classifiers": [
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9983999729156494,
                            "original_entity": "ZIPCODE",
                            "details": {}
                        }
                    ]
                }
            ],
            "PERSON": [
                {
                    "score": 0.8819000124931335,
                    "location": {
                        "start_index": 0,
                        "end_index": 4
                    },
                    "classifiers": [
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.8819000124931335,
                            "original_entity": "NAME",
                            "details": {}
                        }
                    ]
                }
            ]
        }
    }

The fields for the transform section are described as follows:

Name | Example Response | Description
transform.text | [PERSON] lives at [LOCATION].. | The labeled input text, with the names of the classified entities in place of the original sensitive data.

The fields for the providers section are described as follows:

Name | Example Response | Description
providers | Array | Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name | Pattern Classification Provider | Product name of the provider.
providers[n].version | 1.1.0 | Version of the provider.
providers[n].status | 200 | HTTP response code returned by the provider.
providers[n].elapsed_time | 0.028 | Time, in seconds, taken by the provider to process the request.
providers[n].config_provider | Object | Object containing configuration details for each provider.
providers[n].config_provider.name | Pattern | Internal name of the provider.
providers[n].config_provider.address | http://pattern_provider_service:8051 | Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types | [] | Array of supported content types. An empty array indicates support for all content types.

The fields for the classifications section are described as follows:

Name | Example Response | Description
classifications | Dictionary | A dictionary mapping entity types (e.g., “PERSON”, “PHONE_NUMBER”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details.
classifications['entity'][n].score | 0.9238 | The confidence score for the detected entity, aggregated from all contributing classifiers.
classifications['entity'][n].location | Object | An object specifying the location of the entity within the input text.
classifications['entity'][n].location.start_index | 14 | The starting index of the entity in the input text.
classifications['entity'][n].location.end_index | 25 | The ending index of the entity in the input text.
classifications['entity'][n].classifiers | Array | An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index | 0 | The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name | SpacyRecognizer | The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score | 0.85 | The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].original_entity | PERSON | The original entity type detected by the classifier. See Harmonization for details.
classifications['entity'][n].classifiers[m].details | Object | Optional. Additional key-value details provided by the classifier.
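
The relationship between the classification spans and transform.text can be checked on the client side. The Python sketch below replaces each span from the detailed sample response with its bracketed entity name and arrives at the same labeled text as the sample. It is only an illustration of how the indices line up, not a description of how the service builds the text. It assumes spans do not overlap and end_index is exclusive, and the name apply_labels is hypothetical.

```python
def apply_labels(text, classifications):
    """Replace each classified span with its bracketed entity name.

    Assumptions: spans do not overlap and end_index is exclusive.
    """
    spans = [
        (occ["location"]["start_index"], occ["location"]["end_index"], entity)
        for entity, occurrences in classifications.items()
        for occ in occurrences
    ]
    # Work from right to left so earlier indices stay valid after each replacement.
    for start, end, entity in sorted(spans, reverse=True):
        text = text[:start] + f"[{entity}]" + text[end:]
    return text


def _occ(start, end):
    """Build a minimal occurrence object holding only the location."""
    return {"location": {"start_index": start, "end_index": end}}


# Span locations copied from the detailed sample response in this section.
original = "Jake lives at 15 Main st, Hamden 06517, Connecticut."
classifications = {
    "LOCATION": [_occ(17, 24), _occ(26, 32), _occ(40, 51), _occ(14, 16), _occ(33, 38)],
    "PERSON": [_occ(0, 4)],
}

labeled = apply_labels(original, classifications)
print(labeled)
```

Cases where two entities claim overlapping text are not handled by this sketch.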

Response Codes

Response Code | Description
200 | Successful Response.
206 | Partial Content. Only some providers classified data successfully.
400 | Bad Request. Invalid input parameters or content.
413 | Payload too large.
415 | Unsupported media type.
422 | Untrusted input. For more information, refer to Input Validation.
502 | Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598 | Unexpected internal server error. Check server logs.
599 | Internal server error. Check server logs.

5.2.1.1 - Handling Overlapping Conflicts

Resolving conflicts between entities that label sensitive data.

While classifying data, the providers may assign two different entities to the same or overlapping text. This happens because the classifiers use different detection strategies. Data Discovery resolves these conflicts by applying a set of rules to the conflicting entities.

The rules for handling the conflicting entities are as follows:

  • No overlap: If the two entities do not overlap, both results are retained in their original form.

    For example, Jake Filbert lives in Connecticut. If only Jake Filbert is identified, the result will be labeled as [NAME] lives in Connecticut.

  • Full overlap: If both entities cover exactly the same text, the following logic is applied:

    • Select the entity with the higher confidence score.
    • If both entities have the same confidence score, select the first entity.

    For example, Jake Filbert lives in Connecticut. Here, the name is recognized as [USER] with a score of 0.7 and as [NAME] with a score of 0.9. As [NAME] has the higher score, the result will be labeled as [NAME] lives in Connecticut.

  • One entity contained in the other: If one entity is completely contained in the other, select the entity with the longer text.

    For example, jake@email.com. Here, the classifiers may recognize jake as [NAME] and the whole text as [EMAIL]. As [EMAIL] covers the longer text, the result will be labeled as [EMAIL].

  • Partial intersection: If the two entities overlap partially, the result will be a combination of both.

    For example, 092-33445. Here, the classifiers may recognize overlapping parts of the text as [PHONE_NUMBER] and [SSN]. The result will be labeled as [PHONE_NUMBER&SSN].
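
The four rules above can be sketched in Python as a function that resolves a pair of spans. This is a minimal illustration under stated assumptions, not the product's implementation: the Span type and the resolve function are hypothetical, end is treated as exclusive, and for a partial intersection the combined span is assumed to cover the union of both spans and to carry the higher of the two scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int    # inclusive
    end: int      # exclusive (assumption)
    label: str
    score: float


def resolve(first, second):
    """Apply the four rules to two spans and return the spans to keep."""
    # No overlap: keep both results unchanged.
    if first.end <= second.start or second.end <= first.start:
        return [first, second]
    # Full overlap: the higher score wins; on a tie, the first entity wins.
    if (first.start, first.end) == (second.start, second.end):
        return [second if second.score > first.score else first]
    # One entity contained in the other: keep the longer one.
    if first.start <= second.start and second.end <= first.end:
        return [first]
    if second.start <= first.start and first.end <= second.end:
        return [second]
    # Partial intersection: combine both labels (union span and max score
    # are assumptions of this sketch).
    return [Span(min(first.start, second.start), max(first.end, second.end),
                 f"{first.label}&{second.label}", max(first.score, second.score))]


# Full overlap on "Jake Filbert": NAME (0.9) beats USER (0.7).
print(resolve(Span(0, 12, "USER", 0.7), Span(0, 12, "NAME", 0.9)))
# Containment in "jake@email.com": EMAIL covers the longer text.
print(resolve(Span(0, 4, "NAME", 0.8), Span(0, 14, "EMAIL", 0.8)))
# Partial intersection in "092-33445": the labels are combined.
print(resolve(Span(0, 7, "PHONE_NUMBER", 0.8), Span(4, 9, "SSN", 0.8)))
```

The span boundaries in the usage lines are made up for illustration; only the labels and scores of the first example come from the text above.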

5.2.1.2 - Sample Response Default

Sample Response Default.
{ "transform": { "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]." } }

The fields are described as follows:

Name | Example Response | Description
transform.text | [PERSON] lives at [LOCATION].. | The labeled input text, with the names of the classified entities in place of the original sensitive data.

5.2.1.3 - Sample Response with Detail

Sample Response with Detail.
{
    "transform": {
        "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
    },
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.011328935623168945,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.03895401954650879,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "LOCATION": [
            {
                "score": 0.85,
                "location": {
                    "start_index": 17,
                    "end_index": 24
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9240000128746033,
                "location": {
                    "start_index": 26,
                    "end_index": 32
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9980000257492065,
                        "original_entity": "CITY",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9244499981403351,
                "location": {
                    "start_index": 40,
                    "end_index": 51
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9988999962806702,
                        "original_entity": "STATE",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9958999752998352,
                "location": {
                    "start_index": 14,
                    "end_index": 16
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9958999752998352,
                        "original_entity": "BUILDING",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9983999729156494,
                "location": {
                    "start_index": 33,
                    "end_index": 38
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9983999729156494,
                        "original_entity": "ZIPCODE",
                        "details": {}
                    }
                ]
            }
        ],
        "PERSON": [
            {
                "score": 0.8819000124931335,
                "location": {
                    "start_index": 0,
                    "end_index": 4
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.8819000124931335,
                        "original_entity": "NAME",
                        "details": {}
                    }
                ]
            }
        ]
    }
}

The fields for the transform section are described as follows:

Name | Example Response | Description
transform.text | [PERSON] lives at [LOCATION].. | The labeled input text, in which each classified entity is replaced by its entity name in place of the original sensitive data.

The fields for the providers section are described as follows:

Name | Example Response | Description
providers | Array | Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name | Pattern Classification Provider | Product name of the provider.
providers[n].version | 1.0.0 | Version of the provider.
providers[n].status | 200 | HTTP response code returned by the provider.
providers[n].elapsed_time | 0.028 | Time, in seconds, taken by the provider to process the request.
providers[n].config_provider | Object | Object containing configuration details for each provider.
providers[n].config_provider.name | Pattern | Internal name of the provider.
providers[n].config_provider.address | http://pattern_provider_service:8051 | Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types | [] | Array of supported content types. An empty array indicates support for all content types.
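As a minimal sketch of how a client might use the providers section, the following Python function lists the providers whose status is not 200. The function name and the second provider's name and status in the sample are hypothetical and used only for illustration; the field names come from the table above.

```python
def failed_providers(response):
    """Return (name, status) pairs for providers whose status is not 200."""
    return [
        (provider.get("name"), provider.get("status"))
        for provider in response.get("providers", [])
        if provider.get("status") != 200
    ]


# Sample data: the second provider entry is a made-up failure case.
sample = {
    "providers": [
        {"name": "Pattern Classification Provider", "version": "1.0.0",
         "status": 200, "elapsed_time": 0.028},
        {"name": "Context Classification Provider", "version": "1.0.0",
         "status": 500, "elapsed_time": 0.5},
    ]
}

print(failed_providers(sample))
```

A non-empty result is a hint that the classifications in the same response may be incomplete.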

The fields for the classifications section are described as follows:

Name | Example Response | Description
classifications | Dictionary | A dictionary mapping entity types (e.g., “PERSON”, “PHONE_NUMBER”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details.
classifications['entity'][n].score | 0.9238 | The confidence score for the detected entity, aggregated from all contributing classifiers.
classifications['entity'][n].location | Object | An object specifying the location of the entity within the input text.
classifications['entity'][n].location.start_index | 14 | The starting index of the entity in the input text.
classifications['entity'][n].location.end_index | 25 | The ending index of the entity in the input text.
classifications['entity'][n].classifiers | Array | An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index | 0 | The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name | SpacyRecognizer | The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score | 0.85 | The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].original_entity | PERSON | The original entity type detected by the classifier. See Harmonization for details.
classifications['entity'][n].classifiers[m].details | Object | Optional. Additional key-value details provided by the classifier.
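The nested structure described above can be flattened into one row per classifier for reporting or logging. The following Python sketch is illustrative only: the function name, the row keys, and the provider names in the sample are assumptions, while the response field names follow the tables above.

```python
def flatten_classifications(response):
    """Return one row per contributing classifier, sorted by position."""
    providers = response.get("providers", [])
    rows = []
    for entity, occurrences in response.get("classifications", {}).items():
        for occurrence in occurrences:
            location = occurrence["location"]
            for classifier in occurrence.get("classifiers", []):
                index = classifier["provider_index"]
                # provider_index points into the top-level providers array.
                provider = providers[index]["name"] if index < len(providers) else None
                rows.append({
                    "entity": entity,
                    "start": location["start_index"],
                    "end": location["end_index"],
                    "score": occurrence["score"],
                    "classifier": classifier["name"],
                    "provider": provider,
                    "original_entity": classifier["original_entity"],
                })
    return sorted(rows, key=lambda row: (row["start"], row["end"]))


# Sample data: provider names are hypothetical.
sample = {
    "providers": [
        {"name": "Pattern Classification Provider"},
        {"name": "Context Classification Provider"},
    ],
    "classifications": {
        "PERSON": [{
            "score": 0.8819,
            "location": {"start_index": 0, "end_index": 4},
            "classifiers": [{
                "provider_index": 1, "name": "context", "score": 0.8819,
                "original_entity": "NAME", "details": {},
            }],
        }],
    },
}

for row in flatten_classifications(sample):
    print(row)
```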

5.3 - Harmonizing Provider Outputs

Aggregate responses under a similar category.

Depending on their detection logic, the Pattern and Context classification providers might classify the same data under different labels. The classification service standardizes provider outputs into a unified response.

Consider the example sentence: You can visit our office located in New York City.

  • Context provider might categorize New York City as CITY.
  • Pattern provider might categorize New York City as LOCATION.

This can cause an inconsistency in the outputs generated across the providers.

Data Discovery ensures standardization of responses by aggregating similar outputs of the providers under a common classification name. In the example shown, the classification service will categorize New York City under the category LOCATION.

Harmonization Process

The following points describe the harmonization process in detail.

Providers Mapping Entities

Each provider is responsible for mapping its identified entities to harmonized classification entities that are consistent with those used by other providers. This ensures that the classification service can accurately aggregate and interpret responses across multiple providers. When a provider’s classification is harmonized, the response must include the originally identified entity alongside the harmonized classification.

The following snippet shows how the Context classification provider initially classified the entity as CITY, which was then harmonized into the category LOCATION.

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9222000122070313,
        "location": {
          "start_index": 36,
          "end_index": 49
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "SpacyRecognizer",
            "score": 0.85,
            "original_entity": "LOCATION",
            "details": {}
          },
          {
            "provider_index": 1,
            "name": "context",
            "score": 0.9944000244140625,
            "original_entity": "CITY",
            "details": {}
          }
        ]
      }
    ]
  }
}

Grouping by Matching Indexes

Entities are grouped together only if the provider responses contain the same start_index and end_index and map to the same harmonized classification entity. If the start_index or end_index differs, the entities are not grouped together.

As shown in the following snippet, the Pattern and Context providers classify the data as IT_IDENTITY_CARD and ID_CARD respectively. These are then grouped under the NATIONAL_ID category by the classification service.

{
  "providers": "...",
  "classifications": {
    "NATIONAL_ID": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 14,
          "end_index": 25
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_classification",
            "score": 0.85,
            "original_entity": "IT_IDENTITY_CARD" 
          }, {
            "provider_index": 1,
            "name": "context_classification",
            "score": 0.9972000122070312,
            "original_entity": "ID_CARD" 
          }
        ]
      }
    ]
  }
}
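The grouping rule can be sketched in Python as follows. This is not the service's implementation: the function name and the input record layout are hypothetical, and the use of the arithmetic mean for the aggregated score is an assumption inferred from the sample values in this section (for example, 0.85 and 0.9972 yielding 0.9236). The documentation only states that the score is aggregated from all contributing classifiers.

```python
from collections import defaultdict
from statistics import mean


def group_detections(detections):
    """Group detections that share harmonized entity, start index and end index.

    Each detection is a dict with the keys: entity (harmonized name),
    original_entity, provider_index, name, score, start, end.
    """
    groups = defaultdict(list)
    for detection in detections:
        key = (detection["entity"], detection["start"], detection["end"])
        groups[key].append(detection)

    classifications = defaultdict(list)
    for (entity, start, end), members in groups.items():
        classifications[entity].append({
            # Assumption: the aggregate is the mean of the classifier scores.
            "score": mean(m["score"] for m in members),
            "location": {"start_index": start, "end_index": end},
            "classifiers": [
                {
                    "provider_index": m["provider_index"],
                    "name": m["name"],
                    "score": m["score"],
                    "original_entity": m["original_entity"],
                }
                for m in members
            ],
        })
    return dict(classifications)


detections = [
    {"entity": "NATIONAL_ID", "original_entity": "IT_IDENTITY_CARD",
     "provider_index": 0, "name": "pattern_classification",
     "score": 0.85, "start": 14, "end": 25},
    {"entity": "NATIONAL_ID", "original_entity": "ID_CARD",
     "provider_index": 1, "name": "context_classification",
     "score": 0.9972000122070312, "start": 14, "end": 25},
]

grouped = group_detections(detections)
print(len(grouped["NATIONAL_ID"]), grouped["NATIONAL_ID"][0]["score"])
```

Because both detections share the same indexes and harmonized entity, they collapse into a single occurrence with two classifiers.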

Non-Matching Indexes

If the start_index and end_index values reported by the providers differ, the entities are not grouped together. However, the entities still appear under a common classification name.

The following table illustrates a common classification name for multiple providers.

Provider | Original Entity Labels | Common Classification Name
Pattern Provider | LOCATION | LOCATION
Context Provider | CITY, STATE, COUNTRY, COUNTY, ZIP_CODE, STREET, BUILDING, GEO_COORDINATE | LOCATION

The following snippet shows a sample response.

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 35
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_provider",
            "score": 0.85,
            "original_entity": "LOCATION"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 17
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "STREET"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 20,
          "end_index": 22
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "BUILDING"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 25,
          "end_index": 31
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "ZIP_CODE"
          }
        ]
      }
    ]
  }
}
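Because occurrences with different indexes are listed separately under the same classification name, a client may see one span nested inside another, as with the STREET span inside the wider LOCATION span above. The following Python sketch shows one possible client-side way to keep only the outermost spans. The function name is hypothetical, and whether to prefer outer or inner spans is a choice that depends on the use case.

```python
def outermost_spans(occurrences):
    """Keep occurrences that are not fully contained in a different, wider span."""
    spans = [
        (o["location"]["start_index"], o["location"]["end_index"])
        for o in occurrences
    ]
    kept = []
    for i, (start, end) in enumerate(spans):
        nested = any(
            j != i
            and (o_start, o_end) != (start, end)
            and o_start <= start
            and end <= o_end
            for j, (o_start, o_end) in enumerate(spans)
        )
        if not nested:
            kept.append(occurrences[i])
    return kept


# Locations taken from the sample response above (other fields omitted).
location_occurrences = [
    {"location": {"start_index": 0, "end_index": 35}},
    {"location": {"start_index": 0, "end_index": 17}},
    {"location": {"start_index": 20, "end_index": 22}},
    {"location": {"start_index": 25, "end_index": 31}},
]

print(outermost_spans(location_occurrences))
```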

Harmonization Fields

The following table lists the original provider entities and their corresponding harmonized classifications.

Original Provider Entity | Harmonized/Common Classification
US_BANK_NUMBER | BANK_ACCOUNT
IBAN_CODE | BANK_ACCOUNT
IBAN | BANK_ACCOUNT
BIC | BANK_ACCOUNT
CRYPTO | CRYPTO_ADDRESS
BITCOINADDRESS | CRYPTO_ADDRESS
ETHEREUMADDRESS | CRYPTO_ADDRESS
LITECOINADDRESS | CRYPTO_ADDRESS
IT_DRIVER_LICENSE | DRIVER_LICENSE
US_DRIVER_LICENSE | DRIVER_LICENSE
DRIVERLICENSE | DRIVER_LICENSE
US_PASSPORT | PASSPORT
IN_PASSPORT | PASSPORT
IT_PASSPORT | PASSPORT
PASSPORT | PASSPORT
IT_IDENTITY_CARD | NATIONAL_ID
FI_PERSONAL_IDENTITY_CODE | NATIONAL_ID
IN_AADHAAR | NATIONAL_ID
ES_NIE | NATIONAL_ID
SG_NRIC_FIN | NATIONAL_ID
PL_PESEL | NATIONAL_ID
SG_UEN | NATIONAL_ID
AU_ACN | NATIONAL_ID
IDCARD | NATIONAL_ID
US_ITIN | TAX_ID
AU_TFN | TAX_ID
IN_PAN | TAX_ID
ES_NIF | TAX_ID
IT_FISCAL_CODE | TAX_ID
AU_ABN | TAX_ID
IT_VAT_CODE | TAX_ID
US_SSN | SOCIAL_SECURITY_ID
UK_NINO | SOCIAL_SECURITY_ID
SSN | SOCIAL_SECURITY_ID
MEDICAL_LICENSE | HEALTH_CARE_ID
AU_MEDICARE | HEALTH_CARE_ID
UK_NHS | HEALTH_CARE_ID
DATE_TIME | DATETIME
DATE | DATETIME
TIME | DATETIME
EMAIL | EMAIL_ADDRESS
IP | IP_ADDRESS
IPV4 | IP_ADDRESS
IPV6 | IP_ADDRESS
NAME | PERSON
PHONE | PHONE_NUMBER
PIN | PASSWORD
PASSWORD | PASSWORD
CREDITCARDCVV | PASSWORD
BUILDING | LOCATION
COUNTRY | LOCATION
CITY | LOCATION
COUNTY | LOCATION
GEOCOORD | LOCATION
SECADDRESS | LOCATION
SECONDARYADDRESS | LOCATION
STATE | LOCATION
STREET | LOCATION
ZIPCODE | LOCATION
CCN | CREDIT_CARD
COMPANYNAME | ORGANIZATION
MAC | MAC_ADDRESS
ACCOUNTNAME | ACCOUNT_NAME
ACCOUNTNUMBER | ACCOUNT_NUMBER
CURRENCYCODE | CURRENCY_CODE
CURRENCYNAME | CURRENCY_NAME
CURRENCYSYMBOL | CURRENCY_SYMBOL
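On the client side, the table above can be treated as a simple lookup. The following Python sketch uses only a subset of the rows for brevity. The dictionary name and function name are hypothetical, and returning the label unchanged when it has no entry is an assumption, based on the samples in which a LOCATION reported by the Pattern provider remains LOCATION.

```python
# Subset of the harmonization table, for illustration only.
HARMONIZED = {
    "CITY": "LOCATION",
    "STATE": "LOCATION",
    "ZIPCODE": "LOCATION",
    "IT_IDENTITY_CARD": "NATIONAL_ID",
    "IDCARD": "NATIONAL_ID",
    "US_SSN": "SOCIAL_SECURITY_ID",
    "NAME": "PERSON",
    "EMAIL": "EMAIL_ADDRESS",
    "CCN": "CREDIT_CARD",
}


def harmonize(original_entity):
    """Map a provider label to its harmonized name (unchanged if unmapped)."""
    return HARMONIZED.get(original_entity, original_entity)


for label in ("CITY", "NAME", "LOCATION"):
    print(label, "->", harmonize(label))
```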

5.4 - Input Validation

Rejecting unsanitized data.

The Classification service in Data Discovery offers an input validation security feature that rejects invalid input data. Data that is malformed or non-normalized, or that contains homoglyphs, hieroglyphs, mixed Unicode variants, or control characters, is considered unsanitized or invalid. Such data is rejected and is not classified.

The following are a few examples of data that will be rejected:

  • 𝓉𝑒𝓍𝓉
  • Pep

Before invoking the Classification endpoint, ensure that the input text is normalized. Replace invalid characters with their corresponding normalized plaintext characters. If the input text contains any invalid character, the service returns status code 422 with the message Untrusted input.
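One possible way to pre-normalize text before calling the endpoint is sketched below in Python, using the standard unicodedata module. This is an illustration under stated assumptions, not the validation logic used by the service: the function name is hypothetical, NFKC is assumed to be an acceptable normalization form, and newline, carriage return, and tab are assumed to be acceptable. NFKC folds stylized variants such as mathematical script letters into plain letters, but it does not fold homoglyphs borrowed from other scripts, so text containing those may still be rejected.

```python
import unicodedata

# Assumption: these whitespace control characters are acceptable input.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def normalize_input(text):
    """Apply NFKC normalization and drop other control characters."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
    )


# The stylized sample from the list above, written with escapes.
stylized = "\U0001d4c9\U0001d452\U0001d4cd\U0001d4c9"
print(normalize_input(stylized))
```

Silently rewriting input can change its meaning, so logging the original alongside the normalized text may be worthwhile.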

For security purposes, the application rejects unsanitized data by default. It is recommended that this feature remains enabled. However, to override this feature, perform the following steps.

For Docker Compose deployments:

  1. Navigate to the docker_compose directory.

  2. Edit the docker-compose.yaml file.

  3. Under the environment section of classification_service, append the security parameter as follows.

- SECURITY_SETTINGS={"ENABLE_ALL_SECURITY_CONTROLS":false}

  4. Save the changes.

  5. If the application is already running, stop the containers first:

docker compose down

  6. Start the application with your configuration changes following the Docker Compose deployment guide:

docker compose up -d

For AWS EKS deployments:

  1. Navigate to the /eks/helm/classification_app directory.

  2. Create a values-override.yaml file with the required custom configuration.

securitySettings:
    ENABLE_ALL_SECURITY_CONTROLS: false

  3. Save the changes.

  4. If the application is already deployed, uninstall using the following command.

helm uninstall data-discovery-classification --namespace default --wait

  5. Run the following installation command.

helm install data-discovery-classification . \
    --namespace default \
    --create-namespace \
    --wait \
    --wait-for-jobs \
    --timeout 900s \
    -f values-override.yaml


5.5 -

Response Code | Description
200 | Successful Response.
206 | Partial Content. Only some providers classified data successfully.
400 | Bad Request. Invalid input parameters or content.
413 | Payload too large.
415 | Unsupported media type.
422 | Untrusted input. For more information, refer to Input Validation.
502 | Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598 | Unexpected internal server error. Check server logs.
599 | Internal server error. Check server logs.
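A client might branch on these response codes roughly as sketched below in Python. The function name and the four outcome labels are hypothetical; they are one possible grouping of the codes in the table, not behavior defined by the service.

```python
def handle_status(code):
    """Map a response code from the table to a coarse client-side outcome."""
    if code == 200:
        return "ok"
    if code == 206:
        # Some providers failed; results may be incomplete.
        return "partial"
    if code in (400, 413, 415, 422):
        # The request itself must change before it can succeed.
        return "fix_request"
    if code in (502, 598, 599):
        # Server-side or upstream failure; check server logs.
        return "server_error"
    return "unknown"


for code in (200, 206, 422, 502):
    print(code, handle_status(code))
```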

6 - Performance and Accuracy

Details on performance and accuracy results.

Introduction

Performance and accuracy are critical metrics for data discovery tools. These ensure that large datasets can be processed swiftly and sensitive information is correctly identified. High performance minimizes latency and maximizes productivity, while accuracy reduces the risk of data breaches and ensures compliance with regulatory standards like GDPR and CCPA.

Together, these qualities are essential for maintaining data integrity and security in environments where unstructured data flows through various systems.

Performance Evaluation

The evaluation was performed with Data Discovery deployed on Amazon EKS using a Helm chart. The primary goal was to validate the application’s scalability and the infrastructure’s ability to handle varying loads under real-world conditions. Performance will nevertheless vary between applications because customer use cases differ. The key findings are as follows:

  • Scalability: The application and infrastructure configurations can efficiently scale to meet usage demands and support parallel service calls.

  • Instance Type: The m5.large8 instance was identified as a well-balanced choice for performance and cost.

    • If the priority is Faster Response Times: Splitting messages into smaller chunks and processing them in parallel is more cost-effective with multiple weaker instance types.
    • If the priority is Maximizing Processing Efficiency: Merging content into a single, larger request and using more powerful instance types yields higher processing efficiency (characters processed per second).

  • EKS Auto Mode: Running EKS in auto mode offers a fully managed Kubernetes cluster with minimal maintenance. This enables the service to self-regulate by automatically scaling up or down based on demand.

  • Optimized CPU Usage: Maintain a low CPU reservation for accurate measurement and effective self-regulation via the Horizontal Pod Autoscaler (HPA), which adjusts based on CPU usage percentage to balance throughput and idle time.
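The chunk-and-parallelize approach mentioned in the findings above can be sketched in Python as follows. Everything here is illustrative: the function names, the max_chars and workers parameters, and the classify callable (assumed to take a chunk of text and return a list of (entity, start, end) tuples relative to that chunk) are hypothetical and do not describe a Data Discovery client API. Splitting at whitespace can still cut an entity that spans several words, so chunk sizes should be chosen with that in mind.

```python
from concurrent.futures import ThreadPoolExecutor


def split_text(text, max_chars):
    """Split text into (offset, chunk) pairs, breaking at a space when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the left-hand chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def classify_in_parallel(text, classify, max_chars=1000, workers=4):
    """Classify chunks concurrently and shift spans back to full-text offsets."""
    chunks = split_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda item: classify(item[1]), chunks))
    merged = []
    for (offset, _), spans in zip(chunks, results):
        for entity, start, end in spans:
            merged.append((entity, start + offset, end + offset))
    return merged


def fake_classify(chunk):
    """Stand-in for a real service call: finds the literal word 'Paris'."""
    index = chunk.find("Paris")
    return [("LOCATION", index, index + 5)] if index >= 0 else []


print(classify_in_parallel("John lives in Paris", fake_classify, max_chars=10))
```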

Detection Accuracy

Protegrity Data Discovery employs sophisticated Machine Learning (ML) and Natural Language Processing (NLP) technologies to achieve high accuracy in identifying sensitive data. The system processes English text inputs, with an NLP model pinpointing text spans within the document that correspond to various PII elements. The output includes each text span identified as a PII entity, along with the entity type, entity position (start and end), and a confidence score. This confidence score reflects the likelihood that the text span is a PII entity.
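Since each detection carries a confidence score, a consumer can discard low-confidence detections before acting on the results. The following Python sketch assumes the classifications structure shown in the response samples. The function name and the threshold value are hypothetical, and a suitable threshold depends on the use case.

```python
def filter_by_confidence(classifications, threshold):
    """Keep only occurrences whose score is at or above the threshold."""
    filtered = {}
    for entity, occurrences in classifications.items():
        kept = [o for o in occurrences if o["score"] >= threshold]
        if kept:
            filtered[entity] = kept
    return filtered


# Sample data with made-up scores.
classifications = {
    "PERSON": [{"score": 0.88, "location": {"start_index": 0, "end_index": 4}}],
    "LOCATION": [{"score": 0.42, "location": {"start_index": 14, "end_index": 16}}],
}

print(filter_by_confidence(classifications, 0.8))
```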

Dataset

The evaluation used diverse datasets containing PII data. The datasets differ in demographic composition (volume and diversity), data characteristics, label types, and other influencing factors. For example, labels such as “PERSON” and “PHONE_NUMBER” are used. The overall detection rate measured across the various PII data combinations in the datasets exceeded 96%.

Accuracy

Accuracy is defined as the average of the detection rates across the sentences in a given text.

Detection Rate = Valid Detections / Ground Truth

Where Valid Detections is the number of correctly detected PII entities and Ground Truth is the total number of PII entities.
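The definitions above can be worked through with a short Python sketch. The function names are hypothetical, and rejecting sentences with no ground-truth entities is an assumption, since the definition does not say how such sentences are handled.

```python
def detection_rate(valid_detections, ground_truth):
    """Detection Rate = Valid Detections / Ground Truth for one sentence."""
    if ground_truth <= 0:
        raise ValueError("ground_truth must be positive")
    return valid_detections / ground_truth


def accuracy(per_sentence_counts):
    """Average the detection rates over (valid_detections, ground_truth) pairs."""
    rates = [detection_rate(valid, truth) for valid, truth in per_sentence_counts]
    return sum(rates) / len(rates)


# Two sentences: 2 of 2 entities found, then 1 of 2 entities found.
print(accuracy([(2, 2), (1, 2)]))
```

In this example the per-sentence rates are 1.0 and 0.5, so the averaged accuracy is 0.75.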

The variability in customer applications introduces differences in performance, meaning detection accuracy may fluctuate based on the quality of input text. Error rates in identifying PII are influenced not just by the detection service but also by customer workflows and evaluation datasets. It is recommended that customers assess and validate accuracy according to their specific use cases and requirements. Note also that the confidence score reported for a given input text may vary negligibly from user to user, depending on the underlying hardware configuration.

Supported Entity Types

PII entities supported by Data Discovery.

Entity Name | Description
ACCOUNT_NAME | Name associated with a financial account.
ACCOUNT_NUMBER | Bank account number used to identify financial accounts.
AGE | Age information used to identify individuals.
AMOUNT | Specific amount of money, which can be linked to financial transactions.
AU_ABN | Australian Business Number used to identify businesses in Australia.
AU_ACN | Australian Company Number used to identify businesses in Australia.
AU_MEDICARE | Medicare number used to identify individuals for healthcare services in Australia.
AU_TFN | Tax File Number used to identify taxpayers in Australia.
BIC | Bank Identifier Code used to identify financial institutions.
BITCOIN_ADDRESS | Bitcoin wallet address used for digital transactions.
BUILDING | Building information used to identify specific locations.
CITY | City information used to identify geographic locations.
COMPANY_NAME | Name of a company used to identify businesses.
COUNTRY | Country information used to identify geographic locations.
COUNTY | County information used to identify geographic locations.
CREDIT_CARD | Credit card number used for financial transactions.
CREDIT_CARD_CVV | Card Verification Value used to secure credit card transactions.
CRYPTO | Cryptocurrency wallet address used for digital transactions.
CURRENCY | Currency information used in financial transactions.
CURRENCY_CODE | Code representing currency used in financial transactions.
CURRENCY_NAME | Name of currency used in financial transactions.
CURRENCY_SYMBOL | Symbol representing currency, sometimes linked to financial transactions.
DATE | Specific date that can be linked to personal activities.
DATE_OF_BIRTH | Date of birth used to identify individuals.
DATE_TIME | Specific date and time that can be linked to personal activities.
DRIVER_LICENSE | Driver’s license number used to identify individuals.
EMAIL_ADDRESS | Email address used for communication and identification.
ES_NIE | Foreigner Identification Number used to identify non-residents in Spain.
ES_NIF | Tax Identification Number used to identify taxpayers in Spain.
ETHEREUM_ADDRESS | Ethereum wallet address used for digital transactions.
FI_PERSONAL_IDENTITY_CODE | Personal identity code used to identify individuals in Finland.
GENDER | Gender information used to identify individuals.
GEO_COORDINATE | Geographic coordinates used to identify specific locations.
IBAN_CODE | International Bank Account Number used to identify bank accounts globally.
ID_CARD | Identity card number used to identify individuals.
IN_AADHAAR | Unique identification number used to identify residents in India.
IN_PAN | Permanent Account Number used to identify taxpayers in India.
IN_PASSPORT | Passport number used to identify individuals in India.
IN_VEHICLE_REGISTRATION | Vehicle registration number used to identify vehicles in India.
IN_VOTER | Voter ID number used to identify registered voters in India.
IP_ADDRESS | Internet Protocol address used to identify devices on a network.
IPV4 | IPv4 address used to identify devices on a network.
IPV6 | IPv6 address used to identify devices on a network.
IT_DRIVER_LICENSE | Driver’s license number used to identify individuals in Italy.
IT_FISCAL_CODE | Fiscal code used to identify taxpayers in Italy.
IT_IDENTITY_CARD | Identity card number used to identify individuals in Italy.
IT_PASSPORT | Passport number used to identify individuals in Italy.
LITECOIN_ADDRESS | Litecoin wallet address used for digital transactions.
LOCATION | Specific location or address that can be linked to an individual.
MAC | Media Access Control address used to identify devices on a network.
MEDICAL_LICENSE | License number used to identify medical professionals.
NRP | National Registration Number used to identify individuals.
ORGANIZATION | Name or identifier used to identify an organization.
PASSPORT | Passport number used to identify individuals.
PASSWORD | Password used to secure access to personal accounts.
PERSON | Name or identifier used to identify an individual.
PHONE_NUMBER | Number used to contact or identify an individual.
PIN | Personal Identification Number used to secure access to accounts.
PL_PESEL | Personal Identification Number used to identify individuals in Poland.
SECONDARY_ADDRESS | Additional address information used to identify locations.
SG_NRIC_FIN | National Registration Identity Card number used to identify residents in Singapore.
SG_UEN | Unique Entity Number used to identify businesses in Singapore.
SOCIAL_SECURITY_NUMBER | Social Security Number used to identify individuals.
STATE | State information used to identify geographic locations.
STREET | Street address used to identify specific locations.
TIME | Specific time that can be linked to personal activities.
TITLE | Title or honorific used to identify individuals.
UK_NHS | National Health Service number used to identify individuals for healthcare services in the United Kingdom.
URL | Web address that can sometimes contain personal information.
US_BANK_NUMBER | Bank account number used to identify financial accounts in the United States.
US_DRIVER_LICENSE | Driver’s license number used to identify individuals in the United States.
US_ITIN | Individual Taxpayer Identification Number used to identify taxpayers in the United States.
US_PASSPORT | Passport number used to identify individuals in the United States.
US_SSN | Social Security Number used to identify individuals in the United States.
USERNAME | Username used to identify individuals in online systems.
ZIP_CODE | Postal code used to identify specific geographic areas.