Data Discovery

Data Discovery for AI Team Edition.

Data Discovery specializes in the detection of Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Information (PCI) within free-text (unstructured) and table-based (structured, CSV) inputs. Unlike traditional data tools, it excels in dynamic, unstructured environments such as chatbot conversations, call transcripts, and Generative AI (Gen AI) outputs.

For more information about Data Discovery, refer to Data Discovery.

1 - Prerequisites

Prerequisites to install Data Discovery

Ensure that the following prerequisites are met before installing Data Discovery with PPC:

  • PPC Cluster Team Edition must be installed and accessible. For instructions, refer to Installing PPC.

  • Access to a Kubernetes cluster with sufficient permissions to manage the following resources:

    • Namespace
    • Deployment
    • Service
    • ConfigMap
    • Secret
    • HorizontalPodAutoscaler
    • Gateway API resources (HTTPRoute, ReferenceGrant, SecurityPolicy)
    • Karpenter resources (NodePool, EC2NodeClass)
  • Authorization to provision AWS m5.large instances.

  • Ensure that the jumpbox can connect to the required repositories. If not already authenticated, log in to the required repository.

  • For connecting and deploying from the Protegrity Container Registry (PCR), use the following command and the credentials obtained from the My.Protegrity portal during account creation:

helm registry login registry.protegrity.com:9443
  • For connecting and deploying to the local repository, use your local credentials and local repository endpoint as required.
  • AWS credentials with permission to read SSM parameters in the target region (required only when overriding the AMI ID).

Option A (Recommended): Run the following AWS CLI command to retrieve the AMI ID dynamically:

aws ssm get-parameter \
  --name /aws/service/bottlerocket/aws-k8s-1.34/x86_64/latest/image_id \
  --region <region> \
  --query "Parameter.Value" \
  --output text
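
The SSM parameter name in the preceding command follows a predictable pattern that encodes the Bottlerocket variant, Kubernetes version, and CPU architecture. A minimal sketch of how the path is assembled (the version and architecture values are the ones used in this guide; adjust them for your cluster, for example arm64):

```shell
# Build the Bottlerocket SSM parameter path used by the lookup above.
# K8S_VERSION and ARCH match the values in this guide; other clusters
# may need different values.
K8S_VERSION="1.34"
ARCH="x86_64"
PARAM="/aws/service/bottlerocket/aws-k8s-${K8S_VERSION}/${ARCH}/latest/image_id"
echo "$PARAM"
# → /aws/service/bottlerocket/aws-k8s-1.34/x86_64/latest/image_id
```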

Alternatively, use one of the example AMI IDs listed in Option B.

Option B: The following table provides the list of example AMI IDs:

Region            AMI ID
ap-south-1        ami-07959c05dcdb79a72
eu-north-1        ami-0268b0bfff0f25d31
eu-west-3         ami-0ea9454aef60045a2
eu-west-2         ami-0d5eee57a6a1398a3
eu-west-1         ami-00a8d14029b60a028
ap-northeast-3    ami-0e495c3ffd416c65e
ap-northeast-2    ami-0fc18a24aec719c1c
ap-northeast-1    ami-00ec85b83bf713aac
ca-central-1      ami-03891f0d8b41eb296
sa-east-1         ami-0a30f044a5781b4e0
ap-southeast-1    ami-0ae51324bf2e89725
ap-southeast-2    ami-0ef7e8095b163dc42
eu-central-1      ami-00e36131a0343c374
us-east-2         ami-0e486911b2d0a5f7e
us-west-1         ami-01183e1261529749e
us-west-2         ami-04f850c412625dfe6
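
For scripted installs, the table above can be encoded as a simple lookup. A sketch covering a subset of regions (the AMI values are the examples from the table and may be superseded by newer releases; prefer the SSM lookup in Option A when possible):

```shell
# Map a subset of regions from the table above to their example AMI IDs.
# Values may go stale; prefer the dynamic SSM lookup in Option A.
lookup_ami() {
  case "$1" in
    us-east-2)  echo "ami-0e486911b2d0a5f7e" ;;
    us-west-2)  echo "ami-04f850c412625dfe6" ;;
    eu-west-1)  echo "ami-00a8d14029b60a028" ;;
    ap-south-1) echo "ami-07959c05dcdb79a72" ;;
    *)          echo "unknown-region" ;;
  esac
}
lookup_ami eu-west-1   # → ami-00a8d14029b60a028
```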

2 - Installing Data Discovery

Steps to install Data Discovery.

The Data Discovery application can be deployed using Helm.

Note: For connecting and deploying from the Protegrity Container Registry (PCR), use the helm registry login <Container_Registry_Path> command and the credentials obtained from the My.Protegrity portal during account creation.

Install Data Discovery using the following command:

helm registry login <Container_Registry_Path>
helm upgrade --install data-discovery \
  oci://<Container_Registry_Path>/data-discovery/2.0/classification/helm/data-discovery \
  --version 2.0.0-373.gf464fa3e \
  --namespace data-discovery \
  --create-namespace

Replace the placeholder values in the command with the following variables.

Variable Name: <Container_Registry_Path>
Description: Location of the container registry where the Data Discovery Helm chart is published.
Value:
  • registry.protegrity.com:9443 if the Protegrity Container Registry is used.

  • Local registry endpoint if a local registry is used.

When installing Data Discovery in a region other than the default us-east-1, an AMI ID override may be required.

helm registry login <Container_Registry_Path>
helm upgrade --install data-discovery \
  oci://<Container_Registry_Path>/data-discovery/2.0/classification/helm/data-discovery \
  --version 2.0.0-373.gf464fa3e \
  --namespace data-discovery \
  --create-namespace \
  --set karpenterResources.nodeClass.amiId="<ami-id>"

Note: Ensure that <ami-id> in the preceding command is replaced with a valid AMI ID for the AWS region in use. For more information about AMI IDs and available options, refer to AMI ID in the Prerequisites section.
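
The two steps, looking up the region-specific AMI and passing it to Helm, can be combined in a small script. A sketch, shown with echo so the assembled command can be reviewed before running; remove the echo to execute (assumes the AWS CLI and Helm are configured on the jumpbox, and that the Protegrity Container Registry is used):

```shell
# Assemble the region-aware install command. REGION is a placeholder;
# the commented-out lookup mirrors Option A in the Prerequisites.
REGION="eu-west-1"
REGISTRY="registry.protegrity.com:9443"
# In a real run, fetch the AMI dynamically:
# AMI_ID=$(aws ssm get-parameter \
#   --name /aws/service/bottlerocket/aws-k8s-1.34/x86_64/latest/image_id \
#   --region "$REGION" --query "Parameter.Value" --output text)
AMI_ID="ami-00a8d14029b60a028"   # example value for eu-west-1 from the Prerequisites table
echo helm upgrade --install data-discovery \
  "oci://${REGISTRY}/data-discovery/2.0/classification/helm/data-discovery" \
  --version 2.0.0-373.gf464fa3e \
  --namespace data-discovery \
  --create-namespace \
  --set "karpenterResources.nodeClass.amiId=${AMI_ID}"
```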

Validating the deployment

After installing Data Discovery, validate the deployment using the following steps.

  1. Check whether all Data Discovery Pods are ready and running using the following command.
kubectl get pods -n data-discovery

NAME                                           READY   STATUS    RESTARTS   AGE
classification-deployment-75db967f47-88kkc     1/1     Running   0          5h40m
context-provider-deployment-54f44fb4b6-p9wx2   1/1     Running   0          5h32m
pattern-provider-deployment-6b6cb5f8dd-2kx25   1/1     Running   0          5h40m
  2. Submit a classification request to the Data Discovery API.

Note: The following requirements are necessary to submit a classification request to the Data Discovery API:

  • An authentication token.
  • A user with data_discovery_permission access. This permission is currently assigned to the security_administrator role.
curl -k https://<CLUSTER_FQDN>/pty/data-discovery/v2/classify/text \
      -H 'Content-Type: text/plain' \
      -H "Authorization: Bearer <JWT_TOKEN>" \
      --data 'You can reach Dave Elliot by phone 203-555-1286'

Where:

  • <CLUSTER_FQDN> is the Fully Qualified Domain Name (FQDN) of the cluster. For example, eclipse.aws.protegrity.com.
  • <JWT_TOKEN> is the authentication token.

To view a sample response, refer to API Endpoints in Data Discovery.

Tip: To test classification without authentication, refer to Verify application functionality without authentication in the Troubleshooting section.

3 - Configuring Data Discovery

Steps to configure Data Discovery.

This section provides guidance on configuring Data Discovery logging and service providers.

Configurations can be set during deployment by overwriting the configurations defined in the Data Discovery Helm values.yaml.

Overriding Configurations

  1. Create a values-override.yaml file with the custom configuration described in Logging Configuration.

  2. Save the changes.

  3. If the application is already deployed, uninstall it using the following command.

    helm uninstall data-discovery -n data-discovery
    
  4. Run the installation command mentioned in Installing Data Discovery, appending the following flag to apply the custom configuration.

    -f values-override.yaml

Logging Configuration

To configure the settings during deployment, add the following entries to the values-override.yaml file:

Setting the Log level

Update the log level in the values-override.yaml file.

  • Classification Service:
# Custom logging configuration for classificationService 
classificationService:
  loggingConfig: |
      {}
  • Providers:
# Custom logging configuration for Providers
providers:

  Pattern:
    loggingConfig: |
      {}

  Context:
    loggingConfig: |
      {}

The empty braces can be populated using the standard Python logging configuration JSON format. For more information, refer to the official Python logging documentation.

To set the log level, perform the following steps:

  1. Edit the values-override.yaml file.
  2. Under loggingConfig:, set the value of root:level to one of the following:
  • DEBUG
  • INFO
  • WARNING
  • ERROR
  • CRITICAL

For example, to change the log level to WARNING, configure any of the loggingConfig parameters as follows:

  • Classification Service:
classificationService:
  loggingConfig: |
    {
      "root": {
        "level": "WARNING"
      }
    }
  • Providers:
providers:

  Pattern:
    loggingConfig: |
      {
        "root": {
          "level": "WARNING"
        }
      }

  Context:
    loggingConfig: |
      {
        "root": {
          "level": "WARNING"
        }
      }
  3. Save the changes.
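
Because loggingConfig is parsed as JSON, a malformed snippet (for example, a trailing comma) can break the logging configuration at startup. One way to sanity-check the snippet locally before adding it to values-override.yaml, assuming python3 is available on the jumpbox:

```shell
# Validate the loggingConfig JSON locally before deploying.
# python3 is assumed to be available; the snippet matches the
# WARNING-level example above.
cat > /tmp/logcfg.json <<'EOF'
{
  "root": {
    "level": "WARNING"
  }
}
EOF
python3 -c 'import json; json.load(open("/tmp/logcfg.json")); print("valid JSON")'
# → valid JSON
```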

Configuring Input Validation Parameters

The Classification service in Data Discovery offers an input validation security feature that rejects invalid input data. For more information about Input Validation, refer to the Input Validation section.

Configure this feature during deployment by adding parameters to the values-override.yaml file. This configuration uses the same override mechanism described in the Overriding Configurations section.

The following example shows how to enable or disable the input validation security controls:

classificationAppConfig:
  securitySettings: 
    # Can be set to true or false
    ENABLE_ALL_SECURITY_CONTROLS: true 
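
The logging and input validation overrides can live in a single values-override.yaml file. A combined sketch based on the examples in this section (the keys mirror the snippets above; verify them against the chart's values.yaml before use):

```yaml
# Combined override sketch: WARNING-level logging for the classification
# service plus input validation enabled. Keys are taken from the
# examples in this section; confirm against the chart's values.yaml.
classificationService:
  loggingConfig: |
    {
      "root": {
        "level": "WARNING"
      }
    }
classificationAppConfig:
  securitySettings:
    ENABLE_ALL_SECURITY_CONTROLS: true
```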

4 - Uninstalling Data Discovery

Steps to uninstall Data Discovery.

Run the following command to uninstall:

helm uninstall data-discovery -n data-discovery

Tip: If the uninstall process hangs, refer to Manually remove the remaining resources in the Troubleshooting section.

5 - Troubleshooting

Troubleshooting procedures.

The following section provides a quick reference for common issues, their causes, and actions.

  • Pods remain in the Pending state.

    • Likely cause: The NodePool is not ready or does not have sufficient capacity.
    • Action: Check the NodePool status and verify that sufficient capacity is available.
  • HPA displays unknown metrics.

    • Likely cause: The Metrics Server is missing or unhealthy.
    • Action: Install the Metrics Server or restore its health.
  • Gateway API returns a 401 Unauthorized response.

    • Likely cause: The request is missing a valid authentication token, or the user does not have data_discovery_permission access.
    • Action: Obtain a valid token and retry the request. To test the service without authentication, refer to Verify application functionality without authentication.
  • Uninstall stops responding.

    • Likely cause: Karpenter resources with finalizers block the deletion.
    • Action: Refer to Manually remove the remaining resources.
  • NodePool or NodeClass resources remain in the cluster.

    • Likely cause: Finalizers on the EC2NodeClass prevent cleanup.
    • Action: Refer to Manually remove the remaining resources.

Verify application functionality without authentication

kubectl -n data-discovery run curl --image=curlimages/curl -it --rm --restart=Never -- \
curl -v -X POST classification-service:8050/pty/data-discovery/v2/classify/text \
-H 'Content-Type: text/plain' \
--data 'Detect Jane Roe phone 203-555-1111'

Manually remove the remaining resources

Follow the steps to manually remove resources.

  1. Remove the finalizer from the Data Discovery EC2NodeClass.
    kubectl patch ec2nodeclass data-discovery-nodeclass \
    --type merge \
    -p '{"metadata":{"finalizers":[]}}'
    
  2. Delete the NodePools.
    kubectl delete nodepool data-discovery-classification
    kubectl delete nodepool data-discovery-context
    kubectl delete nodepool data-discovery-pattern
    
  3. Delete the EC2NodeClass (AWS provider).
    kubectl delete ec2nodeclass data-discovery-nodeclass

6 - Logging usage metrics

Logging usage metrics

Data Discovery generates usage logs for all classification requests submitted to the service. These logs provide visibility into how the service is being used and support monitoring, auditing, and operational analysis. Usage logs include high-level metrics such as request outcomes and the volume of data processed, enabling administrators to track usage patterns and assess service behavior over time.

For a detailed description of the usage log format and available fields, see the
Data Discovery Usage Logs documentation.

For information about Insight’s dashboards and access to usage logs, refer to AI Teams Edition Insight documentation.