1 - Amazon EMR Protector

Using Amazon EMR Protector

The Big Data Protector UDFs and APIs provide a robust framework for securing sensitive data within EMR environments on AWS. These components are part of the Protegrity Big Data Protector architecture, enabling developers and data engineers to integrate advanced data protection directly into big data workflows. The User Defined Functions (UDFs) allow seamless encryption, tokenization, and de-tokenization of sensitive fields during Hive and Spark operations. By embedding Protegrity UDFs into SQL queries, organizations can enforce column-level security without altering application logic. This ensures compliance while maintaining analytical performance.
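
As an illustration of embedding protection UDFs in SQL, the following is a minimal sketch that builds a Hive/Spark query wrapping sensitive columns. The UDF name (`pty_protect`) and the data-element name (`cc_number_de`) are hypothetical placeholders; use the names registered in your own deployment.

```python
# Sketch: wrapping sensitive columns with a protection UDF in a
# Hive/Spark SQL query. Only the column expressions change; the
# rest of the application's SQL stays as-is.
def protected_select(table, plain_cols, sensitive_cols):
    """Build a SELECT that tokenizes sensitive columns in place.

    sensitive_cols maps column name -> data-element name.
    """
    exprs = list(plain_cols)
    for col, data_element in sensitive_cols.items():
        exprs.append(f"pty_protect({col}, '{data_element}') AS {col}")
    return f"SELECT {', '.join(exprs)} FROM {table}"

query = protected_select(
    "customers",
    plain_cols=["customer_id", "signup_date"],
    sensitive_cols={"cc_number": "cc_number_de"},
)
```

Because only the column expressions are wrapped, this is how column-level security can be enforced without altering application logic.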

To perform protect and unprotect operations using the User Defined Functions, refer to User Defined Functions and APIs.

1.1 - Installing the Amazon Elastic MapReduce Protector

Steps to install the Amazon Elastic MapReduce Protector

Setting up the Amazon EMR Protector

The Amazon EMR Protector v10.0.0 is part of the Protegrity Big Data Protector suite, designed to secure sensitive data in distributed processing environments on AWS Elastic MapReduce (EMR). This protector enables organizations to run analytics on large-scale datasets while ensuring compliance with stringent data privacy regulations.

The Bootstrap Installer is designed to automate the deployment of the Protegrity Big Data Protector (BDP) components during the creation of an Amazon EMR cluster. By leveraging AWS bootstrap actions, this method ensures that all required libraries, configuration files, and services are installed and configured as part of the cluster initialization process.
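
To make the bootstrap flow concrete, the following sketch composes an `aws emr create-cluster` invocation that attaches an installer script as a bootstrap action. The S3 bucket, script name, and release label are placeholders; the actual installer path comes from your Protegrity distribution.

```python
import shlex

# Sketch: attaching a BDP installer script as an EMR bootstrap action.
# The bucket/script path and release label are hypothetical.
def emr_create_command(bootstrap_script_s3, release="emr-6.15.0"):
    args = [
        "aws", "emr", "create-cluster",
        "--release-label", release,
        "--applications", "Name=Hive", "Name=Spark",
        "--bootstrap-actions",
        f"Path={bootstrap_script_s3},Name=InstallBDP",
    ]
    return shlex.join(args)

cmd = emr_create_command("s3://my-bucket/bdp/bootstrap_install.sh")
```

Because bootstrap actions run during cluster initialization, every node comes up with the protector libraries and configuration already in place.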

The Static Installer provides a manual or scripted approach for installing BDP components on existing EMR clusters. This method is best suited for environments where clusters are persistent or require custom installation steps outside the bootstrap lifecycle.

Prerequisites

Register the jumpbox

To register and prepare the jumpbox, refer to Registering and preparing the jumpbox.

For detailed information on the prerequisites for the Bootstrap Installer, refer to Verifying the prerequisites.

For detailed information on the prerequisites for the Static Installer, refer to Verifying the prerequisites for Static Installer.

Integrating the Amazon EMR Protector with Protegrity Provisioned Cluster (PPC)

To integrate the Amazon EMR Protector with PPC, perform the following steps:

  1. Install the EMR Protector using the Bootstrap Installer by following the steps in Using the Bootstrap Installer.

OR

  1. Install the EMR Protector using the Static Installer by following the steps in Using the Static Installer.

Note: When prompted for the ESA IP address, enter the PPC FQDN as configured in Step 4 of Deploying PPC. Ensure the FQDN does not exceed 50 characters. For the ESA listening port, enter 25400. These specific values are required to integrate the protector with the PPC.
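
The constraints in the note above can be pre-checked before running the installer. The following is a minimal validation sketch; the FQDN shown is a placeholder.

```python
# Sketch: validating the values the installer prompts for during PPC
# integration: the PPC FQDN must not exceed 50 characters, and the
# ESA listening port must be 25400.
PPC_ESA_PORT = 25400
MAX_FQDN_LEN = 50

def validate_ppc_values(fqdn, port):
    """Return a list of problems; an empty list means the values are OK."""
    errors = []
    if len(fqdn) > MAX_FQDN_LEN:
        errors.append(f"FQDN exceeds {MAX_FQDN_LEN} characters ({len(fqdn)})")
    if port != PPC_ESA_PORT:
        errors.append(f"ESA listening port must be {PPC_ESA_PORT}, got {port}")
    return errors

problems = validate_ppc_values("ppc.example.internal", 25400)
```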

Post Configuration Steps

For detailed information on the post-configuration steps, refer to Configuring the Protector.

1.2 - Uninstalling the Amazon Elastic MapReduce Protector

Steps to uninstall the Amazon Elastic MapReduce Protector

For more information about uninstalling the Amazon EMR Protector, refer to Uninstalling the protector.

2 - AWS Databricks Protector

Using AWS Databricks Protector

The Protegrity Big Data Protector for AWS Databricks delivers end‑to‑end data protection. Organizations deploying the Big Data Protector rely on modern, supported storage options such as Workspace storage, Unity Catalog Volumes, and cloud object storage like Amazon S3.

Designed to secure sensitive data across analytics pipelines, the Big Data Protector applies advanced tokenization and encryption during Spark execution and enforces centralized, policy‑driven controls. Whether installed via Workspace-backed paths or deployed using S3 buckets for configuration and script delivery, the Protector ensures resilient execution across AWS Databricks clusters.
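
The storage options mentioned above correspond to different init-script destinations in a cluster definition. The following sketch shows the shape of such a configuration fragment; all paths are placeholders, and the actual installer script location comes from your deployment.

```python
import json

# Sketch: referencing an installer script from the storage options
# discussed above. Paths are hypothetical placeholders.
init_scripts = [
    # Workspace-backed path
    {"workspace": {"destination": "/Shared/protegrity/install_bdp.sh"}},
    # Unity Catalog Volume
    {"volumes": {"destination": "/Volumes/main/default/scripts/install_bdp.sh"}},
    # S3 bucket (the cluster needs read access to the bucket)
    {"s3": {"destination": "s3://my-bucket/protegrity/install_bdp.sh"}},
]

cluster_fragment = json.dumps({"init_scripts": init_scripts}, indent=2)
```

In practice a cluster would reference only one of these destinations; they are listed together here to show the three storage options side by side.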

By embracing cloud‑native storage paths, this approach ensures long‑term compatibility with Databricks platform changes while maintaining Protegrity’s standard of seamless and transparent protection. Organizations can continue to process high‑value datasets on AWS Databricks with confidence—knowing that sensitive information is secured across its lifecycle, even as the underlying platform evolves.

The Protegrity Big Data Protector for AWS Databricks empowers organizations to secure sensitive data across their analytics pipelines by combining high‑performance protection mechanisms with flexible deployment models tailored for modern cloud architectures. Central to this capability are two approaches: the Application Protector REST (AP REST) approach and the Cloud Protector approach. Each approach is designed to address different customer requirements around scalability, infrastructure usage, and cost optimization.
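
Under the AP REST approach, protection is requested over HTTP rather than inside the cluster process. The following sketch shows only the general shape of such a request; the endpoint host, path, field names, and data-element name are all hypothetical, so consult the AP REST interface documentation for the actual contract.

```python
import json

# Sketch: composing a protect request for the AP REST approach.
# Endpoint, payload fields, and data-element name are placeholders.
def build_protect_request(values, data_element):
    """Return (endpoint, JSON payload) for a batch protect call."""
    endpoint = "https://aprest.example.internal/protect"  # placeholder host
    payload = json.dumps({"data_element": data_element, "data": values})
    return endpoint, payload

endpoint, payload = build_protect_request(["4111111111111111"], "cc_number_de")
```

Batching multiple values into one request, as sketched here, is what lets a REST-based protector scale with large analytics workloads instead of issuing one call per row.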

2.1 - Installing the AWS Databricks Protector

Steps to install the AWS Databricks Protector

Prerequisites

For more information about the prerequisites, refer to the sections listed below.

Register the jumpbox

To register and prepare the jumpbox, refer to Registering and preparing the jumpbox.

For the Application Protector REST Approach

For more information about the prerequisites, refer to For the Application Protector REST Approach.

For the Cloud Protector Approach

For more information about the prerequisites, refer to For the Cloud Protector Approach.

Preparing the Environment

For more information about preparing the environment, refer to Preparing the Environment.

Installing the Protector

For more information about installing the protector, refer to Creating the User Defined Functions.

Integrating the AWS Databricks Protector with Protegrity Provisioned Cluster (PPC)

To integrate the AWS Databricks Protector with PPC, install the protector as described in Creating the User Defined Functions.

Note: When prompted for the ESA IP address, enter the PPC FQDN as configured in Step 4 of Deploying PPC. Ensure the FQDN does not exceed 50 characters. For the ESA listening port, enter 25400. These specific values are required to integrate the protector with the PPC.

Configuring the Protector

For more information about protector configuration, refer to Editing the Cluster Configuration.

2.2 - Uninstalling the AWS Databricks Protector

Steps to uninstall the AWS Databricks Protector

For more information about uninstalling the AWS Databricks Protector, refer to Dropping the User Defined Functions.

3 - CDP-AWS-DataHub Protector

Using CDP-AWS-DataHub Protector

The CDP-AWS-DataHub UDFs and APIs provide a robust framework for securing sensitive data within Cloudera Data Platform (CDP) environments on AWS. These components are part of the Protegrity Big Data Protector architecture, enabling developers and data engineers to integrate advanced data protection directly into big data workflows. The User Defined Functions (UDFs) allow seamless encryption, tokenization, and de-tokenization of sensitive fields during Hive, Spark, and Impala operations. By embedding Protegrity UDFs into SQL queries, organizations can enforce column-level security without altering application logic. This ensures compliance while maintaining analytical performance.
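
A common pattern in CDP environments is to expose a view whose definition wraps the de-tokenization UDF, so that consumers querying Hive or Impala see clear text only when the Protegrity policy permits it at query time. The following sketch generates such a view DDL; the UDF name (`pty_unprotect`) and data-element name are hypothetical placeholders for the names registered in your deployment.

```python
# Sketch: generating a Hive/Impala view that de-tokenizes sensitive
# columns through a UDF. Policy enforcement happens in the Protegrity
# layer when the view is queried, not in this DDL.
def unprotect_view_ddl(view, table, plain_cols, sensitive_cols):
    """sensitive_cols maps column name -> data-element name."""
    exprs = list(plain_cols)
    for col, data_element in sensitive_cols.items():
        exprs.append(f"pty_unprotect({col}, '{data_element}') AS {col}")
    return f"CREATE VIEW {view} AS SELECT {', '.join(exprs)} FROM {table}"

ddl = unprotect_view_ddl(
    "customers_clear", "customers_protected",
    plain_cols=["customer_id"],
    sensitive_cols={"ssn": "ssn_de"},
)
```

Applications then query the view exactly as they would the base table, which is how column-level security is applied without changing application logic.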

To perform protect and unprotect operations using the User Defined Functions, refer to User Defined Functions and APIs.

3.1 - Installing the CDP-AWS-DataHub Protector

Steps to install the CDP-AWS-DataHub Protector

Setting up the CDP-AWS-DataHub Protector

The CDP-AWS-DataHub Protector v10.0.0 secures sensitive data across the Cloudera Data Platform (CDP) environments hosted on AWS. The protector leverages Protegrity’s tokenization and encryption features to secure data at rest, in transit, and during processing within AWS DataHub clusters.

Prerequisites

For detailed information on the prerequisites, refer to System Requirements.

Register the jumpbox

To register and prepare the jumpbox, refer to Registering and preparing the jumpbox.

Integrating the CDP-AWS-DataHub Protector with Protegrity Provisioned Cluster (PPC)

To integrate the CDP-AWS-DataHub Protector with PPC, perform the following steps:

  1. Prepare the environment by following the steps in Preparing the Environment.

  2. Install the Big Data Protector by following the steps in Installing the Big Data Protector.

Note: When prompted for the ESA IP address, enter the PPC FQDN as configured in Step 4 of Deploying PPC. Ensure the FQDN does not exceed 50 characters. For the ESA listening port, enter 25400. These specific values are required to integrate the protector with the PPC.

Post Configuration Steps

For detailed information on the post-configuration steps, refer to Configuring the Big Data Protector.

3.2 - Uninstalling the CDP-AWS-DataHub Protector

Steps to uninstall the CDP-AWS-DataHub Protector

For more information about uninstalling the CDP AWS DataHub Protector, refer to Uninstalling the Big Data Protector.