AWS Databricks

Big Data Protector on AWS Databricks

The Protegrity Big Data Protector for AWS Databricks delivers end‑to‑end data protection. Organizations deploying the Big Data Protector rely on modern, supported storage options such as Workspace storage, Unity Catalog Volumes, and cloud object storage like Amazon S3.

Designed to secure sensitive data across analytics pipelines, the Big Data Protector applies advanced tokenization and encryption during Spark execution and enforces centralized, policy‑driven controls. Whether installed via Workspace-backed paths or deployed using S3 buckets for configuration and script delivery, the Protector ensures resilient execution across AWS Databricks clusters.

By embracing cloud‑native storage paths, this approach ensures long‑term compatibility with Databricks platform changes while maintaining Protegrity’s standard of seamless and transparent protection. Organizations can continue to process high‑value datasets on AWS Databricks with confidence—knowing that sensitive information is secured across its lifecycle, even as the underlying platform evolves.

The Protegrity Big Data Protector for AWS Databricks empowers organizations to secure sensitive data across their analytics pipelines by combining high‑performance protection mechanisms with flexible deployment models tailored for modern cloud architectures. Central to this capability are two approaches: Application Protector REST (AP REST) and Cloud Protector. Each approach is designed to address different customer requirements around scalability, infrastructure usage, and cost optimization.

Application Protector REST Approach

The AP REST model enables data protection directly within the Databricks cluster itself, eliminating the need for a separate Cloud API infrastructure. This approach is particularly suitable for customers who want to avoid maintaining additional cloud-native services for protection operations.

With AP REST, protection workflows are executed through REST endpoints running on the cluster, allowing seamless scaling along with Databricks’ auto-scaling compute. This ensures that sensitive data remains protected throughout processing while also adapting automatically to dynamically assigned IPs in auto-scaling environments. This results in an operationally efficient fit for Spark-driven workloads on AWS.

For the Application Protector REST Approach, the following cluster types are supported:

  • Databricks Dedicated Compute
  • Databricks Standard Compute


Cloud Protector Approach

The Cloud Protector approach extends protection capabilities by offering centralized, cloud-hosted security services for environments that require externally managed protection layers. It enables highly scalable, policy-driven tokenization and encryption without requiring protection logic to reside inside the Databricks compute itself.

In contexts where Cloud Protector is integrated with the Big Data Protector, organizations benefit from lifecycle-wide protection that spans storage, compute, and inter-system data transfers. Cloud Protector provides the foundation for UDF-driven protections (including Spark and Unity Catalog–level enforcement), ensuring centralized governance across distributed analytics ecosystems.

For the Cloud Protector approach, the following cluster types are supported:

  • Databricks Dedicated Compute
  • Databricks Standard Compute
  • Databricks SQL Warehouse


Conclusion

Together, these two approaches provide enterprises the flexibility to choose a data protection strategy aligned with their architectural, cost, and compliance requirements—whether fully cluster-local using AP REST, centrally managed via Cloud Protector, or in hybrid deployments. This dual-path model ensures that AWS Databricks customers can achieve seamless, transparent, policy-based data protection while continuing to extract high-value insights from their data securely and efficiently.

1 - Understanding the architecture

1.1 - For the Application Protector REST Approach

The architecture for installing the AWS Databricks protector using the Application Protector REST approach is depicted in the image below.

The steps in the workflow are outlined below.

  1. Download the AWS Databricks build from the customer portal and extract the configurator script.
  2. Execute the configurator script to retrieve the IP address of the Application Protector REST server.
  3. Use the IP address to generate the CA, client, and server certificates.
  4. Store the content of the CA and the client certificates as Secrets in the Secrets Manager.
  5. Create a Databricks Unity Catalog Service Credential to access the Secrets from the Secrets Manager.
  6. Execute the configurator script to create the Unity Catalog Batch Python UDFs.
  7. Edit the cluster configuration to include the environment variables and attach the initialization script.

1.2 - For the Cloud Protector Approach

The architecture for installing the AWS Databricks protector using the Cloud Protector approach is depicted in the image below.

The steps in the workflow are outlined below.

  1. Install and configure the Cloud Protector.
  2. Create an AWS Databricks Unity Catalog Service Credential and connect it with the AWS IAM roles.
  3. Create a Databricks Compute.
  4. On a Linux staging machine, download the installation package for AWS Databricks for the Databricks Compute from the customer portal and extract it.
  5. Execute the configurator script to create the Batch Python UDFs at the Unity Catalog level.
  6. Attach an AWS Databricks Notebook to the Databricks Compute.
  7. Execute the Unity Catalog Batch Python UDFs to protect and unprotect data.

2 - System Requirements

2.1 - For the Application Protector REST Approach

Ensure that the following prerequisites are available before installing the Big Data Protector:

  • Python 3, along with the requests module, is installed on the machine that will execute the configurator script.

  • A compatible version of ESA is installed, configured, and running.

  • Access to the Databricks workspace is available.

  • A Databricks cluster of any one of the following types is created and is in the running state:

    • Dedicated Compute
    • Standard Compute
  • Create the Databricks Service Principal.

  • The Databricks Service Principal must have the Can attach to permission on the cluster.

  • Create the following certificates for mutual TLS authorization:

    • CA Certificate
    • Server Certificate
    • Non-encrypted Server Key
    • Client Certificate
    • Non-encrypted Client Key

    Note: These certificates must be generated ONLY after retrieving the IP address of the Application Protector REST server.

  • Permission to create and store secrets in AWS Secrets Manager is available.

  • Create an AWS Databricks Unity Catalog Service Credential.

    Note: For more information about creating the credential, refer to https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-services/service-credentials.

  • The Databricks Service Principal must have the access permissions on the Databricks Unity Catalog Service Credential.

  • A Databricks Unity Catalog Volume is available with a Catalog and a Schema, and the following permissions are granted:

    • The Databricks Service Principal must have the Read volume and Write volume permission on the Databricks Unity Catalog Volume.
    • The Databricks Service Principal must have the Use catalog permission at the Catalog level.
    • The Databricks Service Principal must have the Use schema permission at the Schema level.
    • The Databricks Service Principal must have the Create function permission at the Schema level.
    • The Databricks Service Principal must have the manage permission at the Schema level.

2.2 - For the Cloud Protector Approach

The prerequisites to install and run the Big Data Protector on a Databricks Compute are listed below.

  • Python 3, along with the requests module, is installed on the machine that will execute the configurator script.

  • A compatible version of ESA is installed, configured, and running.

  • Access to the Databricks workspace is available.

  • A Databricks cluster of any one of the following types is created and is in the running state:

    • Dedicated Compute
    • Standard Compute
    • SQL Warehouse
  • Create the Databricks Service Principal.

  • The Databricks Service Principal must have the Can attach to permission on the cluster.

  • Install and configure the Cloud API on AWS.

    Note: For more information about installing and configuring the Cloud API on AWS, refer to Cloud API.

  • To modify the core parameters for RPSync, refer to https://docs.protegrity.com/cloud-protect/4.0.0/docs/aws/api/installation/agent/#policy-agent-lambda-configuration.

  • Install and configure a compatible version of ESA.

    Note: For more information about compatible ESA versions, refer to Cloud API.

  • Create an AWS Databricks Unity Catalog Service Credential.

    Note: For more information about creating the credential, refer to https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-services/service-credentials.

  • Assign the ACCESS privilege to the principals that will use the AWS Databricks Unity Catalog Service Credential.

  • Create a service principal and OAuth secret to deploy the UDFs.

    Note: For more information, refer to https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m?language=Connect.

  • (Optional) Configure private connectivity to the Protegrity Cloud API.

    Note: For more information, refer to https://docs.databricks.com/aws/en/security/network/serverless-network-security/pl-to-internal-network.

  • A Databricks Unity Catalog Volume is available with a Catalog and a Schema, and the following permissions are granted:

    • The Databricks Service Principal must have the ATTACH or MANAGE permission on the compute.
    • The Databricks Service Principal must have the Read volume and Write volume permission on the Databricks Unity Catalog Volume.
    • The Databricks Service Principal must have the Use catalog permission at the Catalog level.
    • The Databricks Service Principal must have the Use schema permission at the Schema level.
    • The Databricks Service Principal must have the Create function permission at the Schema level.
    • The Databricks Service Principal must have the manage permission at the Schema level.
  • To use a SQL Warehouse with the Cloud Protector approach, create a SQL Warehouse. For more information, refer to https://docs.databricks.com/aws/en/compute/sql-warehouse/create.

3 - Preparing the Environment

3.1 - Extracting the Installation Package

Extract the contents of the installation package to access the configurator script. This script generates the required files to install the Big Data Protector.

To extract the files from the installation package:

  1. Log in to the Linux machine that has connectivity to ESA.

  2. Download the Big Data Protector package BigDataProtector_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.tgz to any local directory.

  3. To extract the files from the installation package, run the following command:

    tar -xvf BigDataProtector_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.tgz
    
  4. Press ENTER. The command extracts the installation package and the GPG signature files.

    BigDataProtector_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.tgz
    signatures/
    signatures/BigDataProtector_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.tgz_10.0.sig
    

    Verify the authenticity of the build using the signatures folder. For more information, refer to Verification of Signed Protector Build.

  5. To extract the configurator script, run the following command:

    tar -xvf BigDataProtector_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.tgz
    
  6. Press ENTER. The command extracts the configurator script.

    BigDataProtector-Configurator_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh
    

3.2 - Working with the Configurator Script

The configurator script performs the following tasks:

  1. Generate the IP address for the Application Protector REST server.
  2. Create the UDFs.
  3. Delete the UDFs.

The configurator script provides the --help option to understand the options and the arguments to be provided.

To understand the options and the arguments for the configurator script:

  1. Log in to the node where the installation files are extracted.
  2. To view the options and the arguments, run the following command:
    ./BigDataProtector-Configurator_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh --help
    
  3. Press ENTER. The command displays all the options and the arguments required to execute the configurator script.
    This script needs the following inputs as a string:
     1. The ID of the operation.
        ----------------------------------------------------------
        | ID | Operation                                         |
        ----------------------------------------------------------
        |  1 | Get Application Protector REST's Server IP        |
        |  2 | Create Databricks Unity Catalog Batch Python UDFs |
        |  3 | Delete Databricks Unity Catalog Batch Python UDFs |
        ----------------------------------------------------------
     2. The URL of the Databricks Workspace.
     3. The Application ID of the Databricks Service Principal
     4. The OAuth Secret of the Databricks Service Principal
     5. The ID of the Databricks Compute.
    
    If the ID of the operation is specified as "2" or "3", then the script will require the following additional inputs as a string:
     6. The name of the Databricks Unity Catalog Catalog-Schema.
     7. The ID of the approach.
        -----------------------------------
        | ID | Approach                   |
        -----------------------------------
        |  1 | Application Protector REST |
        |  2 | Cloud Protector            |
        -----------------------------------
    
    If the ID of the operation is specified as "2" and the ID of the approach is specified as "1", then the script will require the following additional inputs as a string:
     8. The path of the CA Certificate.
     9. The path of the Server Certificate.
    10. The path of the Server Key.
    11. The name of the AWS Secret.
    12. The name of the AWS Secret's AWS Region.
    13. The name of the Databricks Unity Catalog Service Credential.
    14. The path of the Databricks Unity Catalog Volume.
    
    If the ID of the operation is specified as "2" and the ID of the approach is specified as "2", then the script will require the following additional inputs as a string:
     8. The name of the AWS Lambda Function.
     9. The name of the AWS Lambda Function's AWS Region.
    10. The name of the Databricks Unity Catalog Service Credential.
    
    If the ID of the operation is specified as "3" and the ID of the approach is specified as "1", then the script will require the following additional input as a string:
     8. The path of the Databricks Unity Catalog Volume.
    
    
    This script accepts the above-mentioned inputs in any one of the following ways:
     1. Using .cfg file (pass the path of the .cfg file to this script as a command-line argument).
     2. Using command-line arguments.
     3. Using interactive prompts.
    
    
    Structure of the .cfg file:
    operation_id = "operation_id"
    databricks_workspace_url = "databricks_workspace_url"
    databricks_service_principal_application_id = "databricks_service_principal_application_id"
    databricks_service_principal_oauth_secret = "databricks_service_principal_oauth_secret"
    databricks_compute_id = "databricks_compute_id"
    databricks_unity_catalog_catalog_schema_name = "databricks_unity_catalog_catalog_schema_name"
    approach_id = "approach_id"
    ca_certificate_path = "ca_certificate_path"
    server_certificate_path = "server_certificate_path"
    server_key_path = "server_key_path"
    aws_secret_name = "aws_secret_name"
    aws_secret_aws_region_name = "aws_secret_aws_region_name"
    databricks_unity_catalog_service_credential_name = "databricks_unity_catalog_service_credential_name"
    databricks_unity_catalog_volume_path = "databricks_unity_catalog_volume_path"
    aws_lambda_function_name = "aws_lambda_function_name"
    aws_lambda_function_aws_region_name = "aws_lambda_function_aws_region_name"
    
    
    Syntax of the command-line arguments:
    --operation_id "operation_id"
    --databricks_workspace_url "databricks_workspace_url"
    --databricks_service_principal_application_id "databricks_service_principal_application_id"
    --databricks_service_principal_oauth_secret "databricks_service_principal_oauth_secret"
    --databricks_compute_id "databricks_compute_id"
    --databricks_unity_catalog_catalog_schema_name "databricks_unity_catalog_catalog_schema_name"
    --approach_id "approach_id"
    --ca_certificate_path "ca_certificate_path"
    --server_certificate_path "server_certificate_path"
    --server_key_path "server_key_path"
    --aws_secret_name "aws_secret_name"
    --aws_secret_aws_region_name "aws_secret_aws_region_name"
    --databricks_unity_catalog_service_credential_name "databricks_unity_catalog_service_credential_name"
    --databricks_unity_catalog_volume_path "databricks_unity_catalog_volume_path"
    --aws_lambda_function_name "aws_lambda_function_name"
    --aws_lambda_function_aws_region_name "aws_lambda_function_aws_region_name"
    
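As a hypothetical illustration of the .cfg input method, the file below covers operation 2 (create UDFs) with the Cloud Protector approach (approach 2); every value is a placeholder to be replaced with your own details.

```cfg
operation_id = "2"
databricks_workspace_url = "https://dbc-example.cloud.databricks.com"
databricks_service_principal_application_id = "11111111-2222-3333-4444-555555555555"
databricks_service_principal_oauth_secret = "dose-example-oauth-secret"
databricks_compute_id = "0123-456789-abcdefgh"
databricks_unity_catalog_catalog_schema_name = "main.protegrity"
approach_id = "2"
aws_lambda_function_name = "pty-cloud-protector-function"
aws_lambda_function_aws_region_name = "us-east-1"
databricks_unity_catalog_service_credential_name = "pty-service-credential"
```

The keys that apply only to the Application Protector REST approach (certificate paths, AWS Secret details, and the Volume path) are omitted here because they are not required for this operation and approach combination.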

3.3 - Retrieving the IP Address

Note: The instructions mentioned in the section apply only to the Application Protector REST approach.

The IP address for the Application Protector REST approach is required to generate the certificates. The certificates must be created using the retrieved IP address. These certificates will be used to establish a mutual trust between the Unity Catalog Batch Python UDFs and the Application Protector REST Server.

  1. Log in to the node where the installation files are extracted.

  2. To execute the configurator script, run the following command:

    ./BigDataProtector-Configurator_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh
    
  3. Press ENTER. The prompt to enter the operation ID appears.

    Creating installation files...
    Created installation files.
    
    Enter the ID of the operation:
    
  4. To retrieve the IP address of the Application Protector REST server, type 1.

  5. Press ENTER. The prompt to enter the Databricks Workspace URL appears.

    Enter the URL of the Databricks Workspace:
    
  6. Enter the Databricks Workspace URL.

  7. Press ENTER. The prompt to enter the application ID of the Databricks Service Principal appears.

    Enter the Application ID of the Databricks Service Principal:
    
  8. Enter the Application ID of the Databricks Service Principal.

  9. Press ENTER. The prompt to enter the OAuth secret for the Service Principal appears.

    Enter the OAuth Secret of the Databricks Service Principal:
    
  10. Enter the OAuth secret.

  11. Press ENTER. The prompt to enter the cluster ID appears.

    Enter the ID of the Databricks Compute:
    
  12. Enter the Cluster ID.

  13. Press ENTER. The script retrieves the IP address of the Application Protector REST server.

    Executing specified operation...
    
    APREST Protector's Server IP: x.x.x.x
    
    Executed specified operation.
    

3.4 - Uploading the Secrets

Note: The instructions mentioned in the section apply only to the Application Protector REST approach.

The CA and the Client certificates are important entities in the mutual trust process. These certificates determine the authentication and authorization to the Application Protector REST server. As a result, it is critical to store these certificates in a secured location. Therefore, the certificates must be uploaded to the Secrets Manager in AWS where they will be stored as secrets.

To upload the secrets:

  1. Create a Secrets Manager in AWS to upload the secrets.

  2. Assign the required access permissions to the Secrets Manager. For example:

    {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:*"
            ],
            "Resource": [
                "arn:aws:secretsmanager:<aws_region_name>:<aws_account>:secret:*"
            ]
        },
        {
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::<aws_account>:role/<aws_iam_role>",
            "Effect": "Allow"
        }
    ]
    }
    
  3. Log in to the machine where the certificates are created.

  4. Launch the Python console.

  5. To view the contents of the CA.pem file, which will be stored as PTY-APPLICATION-PROTECTOR-REST-CA-CERTIFICATE, run the following command:

    with open("ca/CA.pem") as file:
        print(file.read())
    
  6. Press ENTER. The command displays the contents of the CA.pem file.

  7. To view the contents of the client.pem file, which will be stored as PTY-APPLICATION-PROTECTOR-REST-CLIENT-CERTIFICATE, run the following command:

    with open("client/client.pem") as file:
        print(file.read())
    
  8. Press ENTER. The command displays the contents of the client.pem file.

  9. To view the contents of the client.key file, which will be stored as PTY-APPLICATION-PROTECTOR-REST-CLIENT-KEY, run the following command:

    with open("client/client.key") as file:
        print(file.read())
    
  10. Press ENTER. The command displays the contents of the client.key file.

  11. Log in to the AWS portal.

  12. Navigate to the required Secrets Manager.

  13. Click Store a new secret. The Choose secret type page appears.

  14. From the Secret type section, select Other type of secret.

  15. Enter the details listed below, adding a new row for each secret:

    Key: PTY-APPLICATION-PROTECTOR-REST-CA-CERTIFICATE
    Value: The contents of the CA.pem file.

    Key: PTY-APPLICATION-PROTECTOR-REST-CLIENT-CERTIFICATE
    Value: The contents of the client.pem file.

    Key: PTY-APPLICATION-PROTECTOR-REST-CLIENT-KEY
    Value: The contents of the client.key file.
  16. Click Next. The Configure secret page appears.

  17. In the Secret name box, enter a name to identify the secret.

  18. Click Next. The Configure rotation page appears.

  19. Click Next. The Review page appears.

  20. Verify the details.

  21. Click Store. The secrets are stored as per the specified details.

4 - Installing the Protector

4.1 - Creating the User Defined Functions

The following combinations will work for a successful execution of the configurator script:

  • Databricks Dedicated Compute + Application Protector REST approach
  • Databricks Dedicated Compute + Cloud Protector approach
  • Databricks Standard Compute + Application Protector REST approach
  • Databricks Standard Compute + Cloud Protector approach
  • Databricks SQL Warehouse + Cloud Protector approach

The Databricks SQL Warehouse + Application Protector REST approach combination is not supported. Protegrity executes a few Python commands on the Databricks Compute to retrieve a listening IP for the Application Protector REST server, and these commands fail on a SQL Warehouse because it supports only SQL commands.

For the Application Protector REST Approach

The configurator script is used to create the UDFs. These Unity Catalog Batch Python UDFs are used to perform data protection and unprotection operations. Select the required approach and the operation ID to create the UDFs using the Application Protector REST server. This section explains the process to create the UDFs using the interactive method of installation.

To create the UDFs:

  1. Log in to the staging machine.

  2. Navigate to the directory where the installation files are extracted.

  3. To execute the configurator script, run the following command:

    ./BigDataProtector-Configurator_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh
    
  4. Press ENTER. The prompt to enter the operation ID appears.

    Creating installation files...
    Created installation files.
    
    Enter the ID of the operation:
    
  5. To create the UDFs, type 2.

  6. Press ENTER. The prompt to enter the Databricks Workspace URL appears.

    Enter the URL of the Databricks Workspace:
    
  7. Enter the Databricks Workspace URL.

  8. Press ENTER. The prompt to enter the application ID of the Databricks Service Principal appears.

    Enter the Application ID of the Databricks Service Principal:
    
  9. Enter the Application ID of the Databricks Service Principal.

  10. Press ENTER. The prompt to enter the OAuth secret for the Service Principal appears.

    Enter the OAuth Secret of the Databricks Service Principal:
    
  11. Enter the OAuth secret.

  12. Press ENTER. The prompt to enter the cluster ID appears.

    Enter the ID of the Databricks Compute:
    

    Note: The Cluster ID can be either for Standard Compute or Dedicated Compute. For more information about identifying the Cluster ID, refer to https://docs.databricks.com/aws/en/workspace/workspace-details/.

  13. Enter the Cluster ID.

  14. Press ENTER. The prompt to enter the name of the schema appears.

    Enter the name of the Databricks Unity Catalog Catalog-Schema:
    
  15. Enter the name of the catalog and the schema in the <catalog_name.schema_name> format.

  16. Press ENTER. The prompt to select the approach appears.

    Enter the ID of the approach:
    
  17. To create the UDFs using the Application Protector REST approach, type 1.

  18. Press ENTER. The prompt to enter the path of the CA Certificate appears.

    Enter the path of the CA Certificate:
    
  19. Enter the path of the CA Certificate.

  20. Press ENTER. The prompt to enter the path of the Server Certificate appears.

    Enter the path of the Server Certificate:
    
  21. Enter the path of the Server Certificate.

  22. Press ENTER. The prompt to enter the path of the Server key appears.

    Enter the path of the Server Key:
    
  23. Enter the path of the Server Key.

  24. Press ENTER. The prompt to enter the name of the AWS Secret appears.

    Enter the name of the AWS Secret:
    
  25. Enter the name of the AWS Secret.

  26. Press ENTER. The prompt to enter the region of the Secret appears.

    Enter the name of the AWS Secret's AWS Region:
    
  27. Enter the region where the Secret is created.

  28. Press ENTER. The prompt to enter the name of the Service Credential appears.

    Enter the name of the Databricks Unity Catalog Service Credential:
    
  29. Enter the name of the Databricks Unity Catalog Service Credential.

  30. Press ENTER. The prompt to enter the path of the Unity Catalog Volume appears.

    Enter the path of the Databricks Unity Catalog Volume:
    
  31. Enter the path of the Databricks Unity Catalog Volume.

  32. Press ENTER. The script creates the UDFs at the specified location.

    Executing specified operation...
    
    1. Create the following environment variables in the Spark section of the Advanced properties of the Databricks Compute:
    PTY_ESA_IP=PTY_ESA_IP
    PTY_ESA_PORT=PTY_ESA_PORT
    Either PTY_ESA_TOKEN=PTY_ESA_TOKEN or PTY_ESA_ADMINISTRATOR_USERNAME=PTY_ESA_ADMINISTRATOR_USERNAME and PTY_ESA_ADMINISTRATOR_PASSWORD=PTY_ESA_ADMINISTRATOR_PASSWORD
    PTY_AUDIT_STORE_IP_PORT=PTY_AUDIT_STORE_IP_PORT
    PTY_PROTECTOR_CONFIGURATION=PTY_PROTECTOR_CONFIGURATION
    2. Attach "DATABRICKS_UNITY_CATALOG_VOLUME_PATH/DATABRICKS_INIT_SCRIPT_NAME" as an Init Script to the Databricks Compute.
    3. Restart the Databricks Compute.
    
    Executed specified operation.
    

For the Cloud Protector Approach

The configurator script is used to create the UDFs. These Unity Catalog Batch Python UDFs are used to perform data protection and unprotection operations. Select the required approach and the operation ID to create the UDFs using the Cloud Protector. This section explains the process to create the UDFs using the interactive method of installation.

To create the UDFs:

  1. Log in to the staging machine.

  2. Navigate to the directory where the installation files are extracted.

  3. To execute the configurator script, run the following command:

    ./BigDataProtector-Configurator_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh
    
  4. Press ENTER. The prompt to enter the operation ID appears.

    Creating installation files...
    Created installation files.
    
    Enter the ID of the operation:
    
  5. To create the UDFs, type 2.

  6. Press ENTER. The prompt to enter the Databricks Workspace URL appears.

    Enter the URL of the Databricks Workspace:
    
  7. Enter the Databricks Workspace URL.

  8. Press ENTER. The prompt to enter the application ID of the Databricks Service Principal appears.

    Enter the Application ID of the Databricks Service Principal:
    
  9. Enter the Application ID of the Databricks Service Principal.

  10. Press ENTER. The prompt to enter the OAuth secret for the Service Principal appears.

    Enter the OAuth Secret of the Databricks Service Principal:
    
  11. Enter the OAuth secret.

  12. Press ENTER. The prompt to enter the cluster ID appears.

    Enter the ID of the Databricks Compute:
    

    Note: The Cluster ID can be either for SQL Warehouse, Standard Compute or Dedicated Compute. For more information about identifying the Cluster ID, refer to https://docs.databricks.com/aws/en/workspace/workspace-details/.

  13. Enter the Cluster ID.

  14. Press ENTER. The prompt to enter the name of the schema appears.

    Enter the name of the Databricks Unity Catalog Catalog-Schema:
    
  15. Enter the name of the catalog and the schema in the <catalog_name.schema_name> format.

  16. Press ENTER. The prompt to select the approach appears.

    Enter the ID of the approach:
    
  17. To create the UDFs using the Cloud Protector approach, type 2.

  18. Press ENTER. The prompt to enter the name of the AWS Lambda Function appears.

    Enter the name of the AWS Lambda Function:
    
  19. Enter the name of the AWS Lambda Function.

  20. Press ENTER. The prompt to enter the region of the AWS Lambda function appears.

    Enter the name of the AWS Lambda Function's AWS Region:
    
  21. Enter the region name.

  22. Press ENTER. The prompt to enter the name of the Service Credential appears.

    Enter the name of the Databricks Unity Catalog Service Credential:
    
  23. Enter the name of the Databricks Unity Catalog Service Credential.

  24. Press ENTER. The script creates the UDFs at the specified location.

    Executing specified operation...
    Executed specified operation.
    

5 - Configuring the Protector

5.1 - Editing the Cluster Configuration

Note: The instructions mentioned in the section apply only to the Application Protector REST approach.

After the configurator script is executed and the UDFs are created, update the cluster with the following configurations:

  1. Include the required environment variables.
  2. Attach the BigDataProtector-Init-Script_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh script to the Databricks compute.

Ensure that ESA is started and in a running state before restarting the Databricks cluster after updating the configurations.

To edit the cluster:

  1. Log in to the Databricks portal.

  2. Edit the required cluster.

  3. Expand the Advanced section.

  4. Click the Spark tab.

  5. Under Environment variables, add the following variables with their values:

    Variable                          Value
    PTY_ESA_IP                        The ESA IP address.
    PTY_ESA_PORT                      The port number to connect to ESA.
    PTY_ESA_TOKEN                     The JWT token to connect to ESA.
    PTY_ESA_ADMINISTRATOR_USERNAME    The user name to connect to ESA.
    PTY_ESA_ADMINISTRATOR_PASSWORD    The password to connect to ESA.
    PTY_AUDIT_STORE_IP_PORT           The Audit Store endpoints, as a comma-separated string of <audit_store_ip>:<audit_store_port> values. For example, 11.22.33.44:9200,55.66.77.88:9200.
    PTY_PROTECTOR_CONFIGURATION       The protector settings, specified as [core]emptystring=empty, [sync]interval=10.
  6. Click the Init scripts tab.

  7. From the Source list, select Volumes.

  8. In the File path box, enter the location of the initialization script.

  9. To save the changes and restart the cluster, click Confirm and restart.

    Note: If the initialization script fails with a non-zero exit code, enable cluster logging to view the error log files for troubleshooting purposes.

    When the cluster is restarted, the initialization script starts the Application Protector REST service on every node in the cluster. After the Application Protector REST service is started, use the Unity Catalog Batch Python UDFs to protect and unprotect data.

    Note: The initialization script takes some time to complete before the cluster is ready for protect and unprotect operations. For more information on using the UDFs for protect and unprotect operations, refer to the section Unity Catalog Batch Python UDFs.
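
As a convenience, the variables in the table can be drafted as shell assignments before being copied into the cluster's Environment variables fields. The values below are illustrative placeholders (documentation addresses and dummy credentials), not working endpoints; substitute your own ESA and Audit Store details.

```shell
# Illustrative placeholder values only; replace with your real ESA details.
export PTY_ESA_IP="203.0.113.10"
export PTY_ESA_PORT="8443"
export PTY_ESA_TOKEN="<jwt_token>"
export PTY_ESA_ADMINISTRATOR_USERNAME="<esa_user>"
export PTY_ESA_ADMINISTRATOR_PASSWORD="<esa_password>"
# Comma-separated <audit_store_ip>:<audit_store_port> values.
export PTY_AUDIT_STORE_IP_PORT="203.0.113.20:9200,203.0.113.21:9200"
export PTY_PROTECTOR_CONFIGURATION="[core]emptystring=empty, [sync]interval=10"
```

Drafting the values this way makes it easy to spot a malformed PTY_AUDIT_STORE_IP_PORT string (for example, a missing port) before the cluster is restarted.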

6 - Uninstalling the Protector

6.1 - Dropping the User Defined Functions

For the Application Protector REST Approach

Deleting the UDFs is an optional step and must be performed ONLY to clean up the Databricks cluster. The configurator script is used to delete the UDFs. Select the required approach and the operation ID to delete the UDFs created for the Application Protector REST approach. This section explains how to delete the UDFs using the interactive method.

To delete the UDFs:

  1. Log in to the staging machine.

  2. Navigate to the directory where the installation files are extracted.

  3. To execute the configurator script, run the following command:

    ./BigDataProtector-Configurator_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh
    
  4. Press ENTER. The prompt to enter the operation ID appears.

    Creating installation files...
    Created installation files.
    
    Enter the ID of the operation:
    
  5. To delete the UDFs, type 3.

  6. Press ENTER. The prompt to enter the Databricks Workspace URL appears.

    Enter the URL of the Databricks Workspace:
    
  7. Enter the Databricks Workspace URL.

  8. Press ENTER. The prompt to enter the application ID of the Databricks Service Principal appears.

    Enter the Application ID of the Databricks Service Principal:
    
  9. Enter the Application ID of the Databricks Service Principal.

  10. Press ENTER. The prompt to enter the OAuth secret for the Service Principal appears.

    Enter the OAuth Secret of the Databricks Service Principal:
    
  11. Enter the OAuth secret.

  12. Press ENTER. The prompt to enter the cluster ID appears.

    Enter the ID of the Databricks Compute:
    
  13. Enter the Cluster ID.

    Note: The Cluster ID can be for either a Standard Compute or a Dedicated Compute. For more information about identifying the Cluster ID, refer to https://docs.databricks.com/aws/en/workspace/workspace-details/.

  14. Press ENTER. The prompt to enter the name of the schema appears.

    Enter the name of the Databricks Unity Catalog Catalog-Schema:
    
  15. Enter the name of the catalog and the schema in the <catalog_name.schema_name> format.

  16. Press ENTER. The script deletes the UDFs from the specified location.

    Executing specified operation...
    
    Executed specified operation.
    

For the Cloud Protector Approach

Deleting the UDFs is an optional step and must be performed ONLY to clean up the Databricks cluster. The configurator script is used to delete the UDFs. Select the required approach and the operation ID to delete the UDFs created for the Cloud Protector approach. This section explains how to delete the UDFs using the interactive method.

To delete the UDFs:

  1. Log in to the staging machine.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the configurator script, run the following command:
    ./BigDataProtector-Configurator_Linux-ALL-64_x86-64_AWS.Databricks-<DBR_version>-64_<BDP_version>.sh
    
  4. Press ENTER. The prompt to enter the operation ID appears.
    Creating installation files...
    Created installation files.
    
    Enter the ID of the operation:
    
  5. To delete the UDFs, type 3.
  6. Press ENTER. The prompt to enter the Databricks Workspace URL appears.
    Enter the URL of the Databricks Workspace:
    
  7. Enter the Databricks Workspace URL.
  8. Press ENTER. The prompt to enter the application ID of the Databricks Service Principal appears.
    Enter the Application ID of the Databricks Service Principal:
    
  9. Enter the Application ID of the Databricks Service Principal.
  10. Press ENTER. The prompt to enter the OAuth secret for the Service Principal appears.
    Enter the OAuth Secret of the Databricks Service Principal:
    
  11. Enter the OAuth secret.
  12. Press ENTER. The prompt to enter the cluster ID appears.
    Enter the ID of the Databricks Compute:
    
  13. Enter the Cluster ID.

    Note: The Cluster ID can be for a SQL Warehouse, a Standard Compute, or a Dedicated Compute. For more information about identifying the Cluster ID, refer to https://docs.databricks.com/aws/en/workspace/workspace-details/.

  14. Press ENTER. The prompt to enter the name of the schema appears.
    Enter the name of the Databricks Unity Catalog Catalog-Schema:
    
  15. Enter the name of the catalog and the schema in the <catalog_name.schema_name> format.
  16. Press ENTER. The prompt to select the approach appears.
    Enter the ID of the approach:
    
  17. To delete the UDFs using the Cloud Protector approach, type 2.
  18. Press ENTER. The prompt to enter the name of the AWS Lambda Function appears.
    Enter the name of the AWS Lambda Function:
    
  19. Enter the name of the AWS Lambda Function.
  20. Press ENTER. The prompt to enter the region of the AWS Lambda function appears.
    Enter the name of the AWS Lambda Function's AWS Region:
    
  21. Enter the region name.
  22. Press ENTER. The prompt to enter the name of the Service Credential appears.
    Enter the name of the Databricks Unity Catalog Service Credential:
    
  23. Enter the name of the Databricks Unity Catalog Service Credential.
  24. Press ENTER. The script deletes the UDFs from the specified location.
    Executing specified operation...
    
    Executed specified operation.
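
After either delete flow finishes, the drop can be confirmed from a Databricks notebook by listing the functions remaining in the schema. The helper below only assembles the SQL statement; the spark.sql call must run inside Databricks, and the backtick quoting is a minimal sketch rather than full Unity Catalog identifier escaping.

```python
def show_functions_sql(catalog_schema: str) -> str:
    # catalog_schema is the <catalog_name.schema_name> value given
    # to the configurator, e.g. "main.protegrity".
    catalog, schema = catalog_schema.split(".", 1)
    return f"SHOW USER FUNCTIONS IN `{catalog}`.`{schema}`"

# In a Databricks notebook (not runnable outside Databricks):
#   spark.sql(show_functions_sql("main.protegrity")).show()
print(show_functions_sql("main.protegrity"))
```

If the result set still lists the protect and unprotect UDFs after the configurator reports success, rerun the delete operation and verify that the same catalog and schema were supplied.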