Data Discovery is currently in Private Preview and has not reached General Availability (GA). Do not use it in production environments, because features and functionality may change before the GA release.

Data Discovery Classification

Deploy the Data Discovery Classification service with Pattern and Context providers for data classification and transformation.

Requirements

The following requirements are mandatory before deploying the product.

An EKS cluster is provisioned.
The cluster is connected and the kubeconfig is properly configured.

The following components are optional.

Metrics Server to enable Horizontal Pod Autoscaling (HPA). If it is not installed, HPA will not function.
Ingress Controller for HTTPS access.
Karpenter NodePool for automatic node provisioning.

Run the following command to connect a local environment to the EKS cluster.

aws eks update-kubeconfig --region <region> --name <cluster-name>
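The requirements above can be checked before installing. The following is a minimal sketch, not part of the product: the function names are hypothetical, and the metrics check assumes the Metrics Server registers its usual v1beta1.metrics.k8s.io APIService.

```shell
# Preflight sketch. All function names here are illustrative.

# Return 0 if every named command is on PATH; report missing ones on stderr.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Print the active kubeconfig context, then list nodes to prove connectivity.
check_cluster() {
  kubectl config current-context && kubectl get nodes
}

# Report whether the metrics API is registered (optional; needed only for HPA).
check_metrics_server() {
  if kubectl get apiservice v1beta1.metrics.k8s.io >/dev/null 2>&1; then
    echo "metrics API registered"
  else
    echo "metrics API not found; HPA will not function"
  fi
}
```

For example, run require_tools aws kubectl helm, then check_cluster, then check_metrics_server after the update-kubeconfig command has completed.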

Installing the Service

Set the Docker registry credentials that were provided to you as environment variables:

export DOCKER_USERNAME=myuser
export DOCKER_PASSWORD=mypassword

Install the chart using the following command.

cd helm/data-discovery-classification
helm install data-discovery-classification . \
  --namespace default \
  --create-namespace \
  --wait \
  --wait-for-jobs \
  --timeout 900s \
  --set "docker.creds.username=$DOCKER_USERNAME" \
  --set "docker.creds.password=$DOCKER_PASSWORD"

Note: For any custom configuration changes, create a values-override.yaml file and add -f values-override.yaml to the helm install command instead of modifying the default values.yaml file.

Using the --wait flag with a 15-minute timeout (900s) is recommended. The installation typically takes 5-7 minutes because the Docker images are large. To monitor progress, run the verification commands in another terminal.

If the registry does not require basic authentication (e.g., ECR or a private registry), omit the two --set docker.creds lines from the command above, along with the trailing backslash on the --timeout line.
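The choice between installing with and without registry credentials can be scripted. The sketch below is one possible approach, assuming it is run from the helm/data-discovery-classification directory. The function name install_chart is hypothetical; the flags are the ones from the install command above.

```shell
# Run helm install, adding the credential flags only when DOCKER_USERNAME
# is set and non-empty. install_chart is an illustrative name.
install_chart() {
  # Base arguments, identical to the documented install command.
  set -- install data-discovery-classification . \
    --namespace default --create-namespace \
    --wait --wait-for-jobs --timeout 900s

  # Append registry credentials only for registries that need basic auth.
  if [ -n "${DOCKER_USERNAME:-}" ]; then
    set -- "$@" \
      --set "docker.creds.username=$DOCKER_USERNAME" \
      --set "docker.creds.password=${DOCKER_PASSWORD:-}"
  fi

  helm "$@"
}
```

Calling install_chart with the variables unset installs without credentials. Exporting both variables first reproduces the full command.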

Verifying the Installation

Get Deployments, Services, and HPAs

kubectl get deploy,svc,hpa -n default

Expected output:

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/classification-deployment     1/1     1            1           ...
deployment.apps/context-provider-deployment   1/1     1            1           ...
deployment.apps/pattern-provider-deployment   1/1     1            1           ...

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/classification-service     ClusterIP   172.20.x.x      <none>        8050/TCP   ...
service/context-provider-service   ClusterIP   172.20.x.x      <none>        8052/TCP   ...
service/pattern-provider-service   ClusterIP   172.20.x.x      <none>        8051/TCP   ...

NAME                                                             REFERENCE                                TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/classification-service-hpa   Deployment/classification-deployment     cpu: 50%/50%        1         5         1          ...
horizontalpodautoscaler.autoscaling/context-provider-hpa         Deployment/context-provider-deployment   cpu: 65%/65%        1         20        1          ...
horizontalpodautoscaler.autoscaling/pattern-provider-hpa         Deployment/pattern-provider-deployment   cpu: 90%/90%        1         3         1          ...

All deployments must show 1/1 in the READY column after the deployment is completed. During startup, seeing 0/1 and cpu: <unknown> is expected.
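Instead of re-running the get command until every deployment shows 1/1, the readiness check can be scripted with kubectl rollout status. This is a minimal sketch: the deployment names come from the expected output above, while the function name and the 300s timeout are assumptions.

```shell
# Block until each deployment reports a finished rollout, or return 1 as
# soon as one of them fails or exceeds the (assumed) 300s timeout.
wait_for_rollouts() {
  namespace="${1:-default}"
  for name in classification-deployment \
              context-provider-deployment \
              pattern-provider-deployment; do
    kubectl rollout status "deployment/$name" \
      -n "$namespace" --timeout=300s || return 1
  done
}
```

For example, wait_for_rollouts default returns 0 only after all three deployments are ready.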

Ingress

kubectl get ingress -n default

Expected output:

NAME                          CLASS           HOSTS   ADDRESS                                        PORTS   AGE
classification-ingress-rule   private-nginx   *       <load-balancer-dns>.elb.amazonaws.com.         443     ...

Ingress Endpoint Testing

INGRESS_HOST=$(kubectl get svc ingress-controller-private-ingress-controller \
  -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Fallback to IP
if [ -z "$INGRESS_HOST" ]; then
  INGRESS_HOST=$(kubectl get svc ingress-controller-private-ingress-controller \
    -n ingress-nginx \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
fi

echo "Ingress available at: $INGRESS_HOST"

Running Requests

curl -k "https://$INGRESS_HOST/readiness"
curl -k "https://$INGRESS_HOST/healthz"
curl -k "https://$INGRESS_HOST/startup"

curl -k -X POST "https://$INGRESS_HOST/pty/data-discovery/v1.1/classify" \
  -H 'Content-Type: text/plain' \
  --data 'You can reach Dave Elliot by phone 203-555-1286'
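The three probe requests can be turned into a pass/fail check by comparing HTTP status codes. The sketch below assumes that a healthy endpoint answers with 200, which this guide does not state. The function names are hypothetical.

```shell
# Print the HTTP status code returned by one path on the ingress host.
# -k skips TLS verification, matching the curl commands above.
probe_status() {
  curl -k -s -o /dev/null -w '%{http_code}' "https://$1$2"
}

# Check the three probe endpoints; return non-zero if any is not 200
# (200 as the healthy status is an assumption).
check_probes() {
  host="$1"
  failed=0
  for path in /readiness /healthz /startup; do
    code=$(probe_status "$host" "$path")
    echo "$path -> $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```

For example, check_probes "$INGRESS_HOST" prints one line per endpoint and exits non-zero if any probe fails.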

Custom Configuration

The chart is production-ready and the required configurations and default container images are set in the values.yaml file. However, customized container images can also be configured.

To use your own container images, perform the following steps:

Create a values-override.yaml file with the following configuration.

docker:
  registry: "<Address of the image-repository>"
# e.g.: 
# docker:
#   registry: "registry.protegrity.com"

serviceImages:
  classification: "<Name of the classification-image>"
  pattern: "<Name of the pattern-provider-image>"
  context: "<Name of the context-provider-image>"
# e.g.:
# serviceImages:
#   classification: "products/data_discovery/1.1/classification_service:latest"
#   pattern: "products/data_discovery/1.1/pattern_classification_provider:latest"
#   context: "products/data_discovery/1.1/context_classification_provider:latest"

Run the following installation command.

helm install data-discovery-classification . \
  --namespace default \
  --create-namespace \
  --wait \
  --wait-for-jobs \
  --timeout 900s \
  --set "docker.creds.username=$DOCKER_USERNAME" \
  --set "docker.creds.password=$DOCKER_PASSWORD" \
  -f values-override.yaml

Uninstalling the Service

Run the following command to uninstall the Data Discovery Classification application.

helm uninstall data-discovery-classification \
  --namespace default \
  --wait \
  --timeout 300s

This removes the classification, pattern provider, and context provider services, along with the associated ConfigMaps, Services, and HPA resources. Any persistent data or logs are lost during this process.

Resources may take a couple of minutes to be fully terminated. Re-installing immediately after uninstall can lead to an inconsistent state. Wait for all pods to be completely removed before reinstalling.
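Waiting for pod removal before a reinstall can be automated with a polling loop. This sketch assumes that the pods keep the default Kubernetes naming, where each pod name starts with its deployment name. The function names, retry count, and interval are hypothetical.

```shell
# Count pods whose names start with one of the three deployment names.
count_service_pods() {
  kubectl get pods -n "${1:-default}" -o name 2>/dev/null |
    grep -c -E '^pod/(classification|context-provider|pattern-provider)-deployment-'
}

# Poll until no matching pods remain; return 1 if the retry budget runs out.
wait_for_pod_removal() {
  namespace="${1:-default}"
  attempts="${2:-60}"   # assumed default: 60 checks
  interval="${3:-5}"    # assumed default: 5 seconds between checks
  while [ "$attempts" -gt 0 ]; do
    [ "$(count_service_pods "$namespace")" -eq 0 ] && return 0
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}
```

For example, run wait_for_pod_removal default after helm uninstall, and reinstall only when it returns 0.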

Troubleshooting

Run the following commands to inspect the state of the deployment.

Viewing all Pods in the Namespace

kubectl get pods -n default

Viewing all Services in the Namespace

kubectl get svc -n default

Viewing Logs for a Specific Pod

kubectl logs <pod-name> -n default

Describing a Specific Pod

kubectl describe pod <pod-name> -n default
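When several pods need inspection, the logs command can be looped over every pod in the namespace. The following is a minimal sketch. The function name and the ./pod-logs output directory are hypothetical.

```shell
# Save the logs of every pod in the namespace to <out_dir>/<pod-name>.log.
collect_pod_logs() {
  namespace="${1:-default}"
  out_dir="${2:-./pod-logs}"   # assumed output location
  mkdir -p "$out_dir" || return 1
  for pod in $(kubectl get pods -n "$namespace" -o name); do
    name="${pod#pod/}"         # strip the "pod/" prefix from the resource name
    kubectl logs "$name" -n "$namespace" --all-containers \
      > "$out_dir/$name.log" 2>&1
  done
}
```

For example, collect_pod_logs default ./pod-logs writes one log file per pod, which can be attached to a support request.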

Last modified: August 29, 2025