Data Discovery
Data Discovery for AI Team Edition.
Data Discovery specializes in detecting Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Information (PCI) within free-text (unstructured) and table-based (structured, such as CSV) inputs. Unlike traditional data tools, it excels in dynamic, unstructured environments such as chatbot conversations, call transcripts, and Generative AI (Gen AI) outputs.
For more information about Data Discovery, refer to Data Discovery.
1 - Prerequisites
Prerequisites to install Data Discovery
Ensure that the following prerequisites are met before installing Data Discovery with PPC:
PPC Cluster Team Edition must be installed and accessible. For instructions, refer to Installing PPC.
Access to a Kubernetes cluster with sufficient permissions to manage the following resources:
- Namespace
- Deployment
- Service
- ConfigMap
- Secret
- HorizontalPodAutoscaler
- Gateway API resources (HTTPRoute, ReferenceGrant, SecurityPolicy)
- Karpenter resources (NodePool, EC2NodeClass)
Authorization to provision AWS m5.large instances.
Ensure that the jumpbox can connect to the required repositories. If not already authenticated, log in to the required repository.
For connecting and deploying from the Protegrity Container Registry (PCR), use the following command and the credentials obtained from the My.Protegrity portal during account creation:
helm registry login registry.protegrity.com:9443
- For connecting and deploying to the local repository, use your local credentials and local repository endpoint as required.
- AWS credentials with permission to read SSM parameters in the target region (required only when overriding the AMI ID).
Option A (Recommended): Run the following AWS CLI command to retrieve the AMI ID dynamically:
aws ssm get-parameter \
--name /aws/service/bottlerocket/aws-k8s-1.34/x86_64/latest/image_id \
--region <region> \
--query "Parameter.Value" \
--output text
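If the AMI ID will be reused later (for example, in the Helm override described in Installing Data Discovery), it can be captured in a shell variable first. The following is a sketch that assumes the AWS CLI is installed and configured with credentials that can read SSM parameters in the target region; the region value is an example.

```shell
# Sketch: capture the latest Bottlerocket AMI ID for a region in a variable.
REGION=eu-west-1   # replace with your target region
AMI_ID=$(aws ssm get-parameter \
  --name /aws/service/bottlerocket/aws-k8s-1.34/x86_64/latest/image_id \
  --region "$REGION" \
  --query "Parameter.Value" \
  --output text)
echo "$AMI_ID"
```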
Option B: Alternatively, use one of the example AMI IDs from the following table.
| Region | AMI ID |
|---|---|
| ap-south-1 | ami-07959c05dcdb79a72 |
| eu-north-1 | ami-0268b0bfff0f25d31 |
| eu-west-3 | ami-0ea9454aef60045a2 |
| eu-west-2 | ami-0d5eee57a6a1398a3 |
| eu-west-1 | ami-00a8d14029b60a028 |
| ap-northeast-3 | ami-0e495c3ffd416c65e |
| ap-northeast-2 | ami-0fc18a24aec719c1c |
| ap-northeast-1 | ami-00ec85b83bf713aac |
| ca-central-1 | ami-03891f0d8b41eb296 |
| sa-east-1 | ami-0a30f044a5781b4e0 |
| ap-southeast-1 | ami-0ae51324bf2e89725 |
| ap-southeast-2 | ami-0ef7e8095b163dc42 |
| eu-central-1 | ami-00e36131a0343c374 |
| us-east-2 | ami-0e486911b2d0a5f7e |
| us-west-1 | ami-01183e1261529749e |
| us-west-2 | ami-04f850c412625dfe6 |
2 - Installing Data Discovery
Steps to install Data Discovery.
The Data Discovery application can be deployed using Helm.
Note: For connecting and deploying from the Protegrity Container Registry (PCR), use the helm registry login <Container_Registry_Path> command and the credentials obtained from the My.Protegrity portal during account creation.
Install Data Discovery using the following command:
helm registry login <Container_Registry_Path>
helm upgrade --install data-discovery \
oci://<Container_Registry_Path>/data-discovery/2.0/classification/helm/data-discovery \
--version 2.0.0-373.gf464fa3e \
--namespace data-discovery \
--create-namespace
Replace the placeholders in the command as described in the following table.
| Variable Name | Description | Value |
|---|---|---|
| `<Container_Registry_Path>` | Location of the container registry where the Data Discovery Helm chart is published. | `registry.protegrity.com:9443` if the Protegrity Container Registry is used; the local registry endpoint if a local registry is used. |
When installing Data Discovery in a region other than the default us-east-1, an AMI ID override may be required.
helm registry login <Container_Registry_Path>
helm upgrade --install data-discovery \
oci://<Container_Registry_Path>/data-discovery/2.0/classification/helm/data-discovery \
--version 2.0.0-373.gf464fa3e \
--namespace data-discovery \
--create-namespace \
--set karpenterResources.nodeClass.amiId="<ami-id>"
Note: Ensure that <ami-id> in the preceding command is replaced with a valid AMI ID for the AWS region in use. For more information about AMI IDs and available options, refer to AMI ID.
Validating the deployment
After installing Data Discovery, validate the deployment using the following steps.
- Check whether all Data Discovery Pods are ready and running using the following command.
kubectl get pods -n data-discovery
NAME READY STATUS RESTARTS AGE
classification-deployment-75db967f47-88kkc 1/1 Running 0 5h40m
context-provider-deployment-54f44fb4b6-p9wx2 1/1 Running 0 5h32m
pattern-provider-deployment-6b6cb5f8dd-2kx25 1/1 Running 0 5h40m
- Submit a classification request to the Data Discovery API.
Note: The following are required to submit a classification request to the Data Discovery API:
- An authentication token.
- A user with data_discovery_permission access. This permission is currently assigned to the security_administrator role.
curl -k https://<CLUSTER_FQDN>/pty/data-discovery/v2/classify/text \
-H 'Content-Type: text/plain' \
-H "Authorization: Bearer <JWT_TOKEN>" \
--data 'You can reach Dave Elliot by phone 203-555-1286'
Where:
- <CLUSTER_FQDN> is the Fully Qualified Domain Name (FQDN) of the cluster. For example, eclipse.aws.protegrity.com.
- <JWT_TOKEN> is the authentication token.
To view a sample response, refer to API Endpoints in Data Discovery.
Tip: To test classification without authentication, refer to Verify application functionality without authentication in the Troubleshooting section.
3 - Configuring Data Discovery
Steps to configure Data Discovery.
This section provides guidance on configuring Data Discovery logging and service providers.
Configurations can be set during deployment by overriding the defaults defined in the Data Discovery Helm values.yaml file.
Overriding Configurations
- Create a values-override.yaml file with the custom configuration described in Logging Configuration.
- Save the changes.
- If the application is already deployed, uninstall it using the following command.
helm uninstall data-discovery -n data-discovery
- Run the installation command from Installing Data Discovery, appending -f values-override.yaml to apply the custom configuration.
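Put together, a reinstall with the override file might look like the following sketch. The chart path and version are taken from Installing Data Discovery; -f is the standard Helm flag for passing a values file.

```shell
# Sketch: reinstall Data Discovery with custom values applied.
helm uninstall data-discovery -n data-discovery   # only if already deployed
helm upgrade --install data-discovery \
  oci://<Container_Registry_Path>/data-discovery/2.0/classification/helm/data-discovery \
  --version 2.0.0-373.gf464fa3e \
  --namespace data-discovery \
  --create-namespace \
  -f values-override.yaml
```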
Logging Configuration
To configure the settings during deployment, add the following entries to the values-override.yaml file:
Setting the Log level
Update the log level in the values-override.yaml file.
# Custom logging configuration for classificationService
classificationService:
  loggingConfig: |
    {}

# Custom logging configuration for Providers
providers:
  Pattern:
    loggingConfig: |
      {}
  Context:
    loggingConfig: |
      {}
The empty braces can be populated using the standard Python logging configuration JSON format. For more information, refer to the official Python documentation.
To set the log level, perform the following steps:
- Edit the values-override.yaml file.
- Under loggingConfig:, set the value of root:level to one of the following:
- DEBUG
- INFO
- WARNING
- ERROR
- CRITICAL
For example, to change the log level to WARNING, configure any of the loggingConfig parameters as follows:
classificationService:
  loggingConfig: |
    {
      "root": {
        "level": "WARNING"
      }
    }

providers:
  Pattern:
    loggingConfig: |
      {
        "root": {
          "level": "WARNING"
        }
      }
  Context:
    loggingConfig: |
      {
        "root": {
          "level": "WARNING"
        }
      }
- Save the changes.
Input Validation Configuration
The Classification service in Data Discovery offers an input validation security feature that rejects invalid input data. For more information, refer to the Input Validation section.
Configure this feature during deployment by adding parameters to the values-override.yaml file, using the same override mechanism described in Overriding Configurations. The following example shows how to enable or disable the Input Validation security controls:
classificationAppConfig:
  securitySettings:
    # Set to true or false
    ENABLE_ALL_SECURITY_CONTROLS: true
4 - Uninstalling Data Discovery
Steps to uninstall Data Discovery.
Run the following command to uninstall:
helm uninstall data-discovery -n data-discovery
Tip: If the uninstall process hangs, refer to Manually remove the remaining resources in the Troubleshooting section.
5 - Troubleshooting
Troubleshooting procedures.
The following section provides a quick reference for common issues, their causes, and actions.
Pods remain in the Pending state.
- Likely cause: The NodePool is not ready or does not have sufficient capacity.
- Action: Check the NodePool status and verify that sufficient capacity is available.
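The following commands are a sketch of how to inspect NodePool health, assuming the Karpenter CRDs are installed and using the NodePool name from the uninstall section:

```shell
# Sketch: inspect Karpenter NodePool status and recent scheduling events.
kubectl get nodepools
kubectl describe nodepool data-discovery-classification
# Pending Pods usually carry an event explaining why they cannot be scheduled.
kubectl -n data-discovery get events --sort-by=.lastTimestamp
```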
HPA displays unknown metrics.
- Likely cause: The Metrics Server is missing or unhealthy.
- Action: Install the Metrics Server or restore its health.
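A quick check, as a sketch that assumes the Metrics Server runs in its default location, the kube-system namespace:

```shell
# Sketch: verify the Metrics Server is deployed and serving metrics.
kubectl -n kube-system get deployment metrics-server
# If metrics are flowing, this prints CPU/memory usage per node.
kubectl top nodes
```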
Gateway API returns a 401 Unauthorized response.
- Likely cause: JWT authentication fails because the token is missing, malformed, or expired.
- Action: Provide a valid, unexpired JWT in the Authorization: Bearer header and resend the request.
Uninstall stops responding, or NodePool or NodeClass resources remain in the cluster.
- Likely cause: Karpenter resources are blocked from deletion by finalizers.
- Action: Manually remove the remaining resources as described below in this section.
Verify application functionality without authentication
Run a temporary curl Pod inside the cluster to call the classification Service directly, bypassing the gateway and authentication:
kubectl -n data-discovery run curl --image=curlimages/curl -it --rm --restart=Never -- \
curl -v -X POST classification-service:8050/pty/data-discovery/v2/classify/text \
-H 'Content-Type: text/plain' \
--data 'Detect Jane Roe phone 203-555-1111'
Manually remove the remaining resources
Follow these steps to manually remove the remaining resources.
- Remove the finalizer from the Data Discovery EC2NodeClass, then delete it.
kubectl patch ec2nodeclass data-discovery-nodeclass \
--type merge \
-p '{"metadata":{"finalizers":[]}}'
kubectl delete ec2nodeclass data-discovery-nodeclass
- Delete the NodePools.
kubectl delete nodepool data-discovery-classification
kubectl delete nodepool data-discovery-context
kubectl delete nodepool data-discovery-pattern
6 - Logging usage metrics
Usage logs generated by Data Discovery.
Data Discovery generates usage logs for all classification requests submitted to the service. These logs provide visibility into how the service is being used and support monitoring, auditing, and operational analysis.
Usage logs include high-level metrics such as request outcomes and the volume of data processed, enabling administrators to track usage patterns and assess service behavior over time.
For a detailed description of the usage log format and available fields, see the Data Discovery Usage Logs documentation.
For information about Insight’s dashboards and access to usage logs, refer to AI Teams Edition Insight documentation.