1 - Using the Bootstrap Installer

Installing the Big Data Protector using the Bootstrap Installer

The Big Data Protector on Amazon EMR enables cluster creation using a bootstrap action. This action enables:

  • configuration of cluster instances
  • installation of custom and additional software
  • setting up of the environment variables

Bootstrap actions are scripts that run on cluster instances after they are launched. These scripts install the specified applications during cluster creation, before the cluster nodes start processing data. To create a bootstrap action, specify the script when creating the cluster using any one of the following methods:

  • Amazon EMR console - pass the location of the script in the Bootstrap actions section.
  • AWS CLI - pass the location of the script to the --bootstrap-actions parameter.
  • API

In this method of cluster creation, the nodes are automatically scaled depending on the workload. When the workload on a node is minimal, Amazon EMR decommissions the node to balance the workload optimally.

1.1 - Creating a Cluster

The procedures mentioned in this section are applicable only for the Bootstrap approach to install the Big Data Protector.

Perform the following steps to create an EMR cluster on AWS and install Big Data Protector on all the nodes in the EMR cluster.

To install Big Data Protector on a New EMR Cluster:

  1. On the AWS services screen, click EMR under the Analytics section.

    The Amazon EMR screen appears.

  2. Click Create cluster.

    The Create Cluster - Quick Options screen appears.

  3. Type the name of the cluster in the Cluster name box.

  4. Depending on the requirements, enter the sum of the master and core nodes in the Number of instances box.

  5. Click Create cluster.

    The Software and Steps tab on the Create Cluster - Advanced Options screen appears.

  6. Depending on the requirements, select the components under the Software Configuration section.

  7. Click Next.

    The Hardware tab on the Create Cluster - Advanced Options screen appears.

  8. On the Hardware tab, if required, you can add or reduce the number of instances of the Master, Core, and Task nodes.

  9. Click Next.

    The General Cluster Settings tab on the Create Cluster - Advanced Options screen appears.

  10. Type the name of the cluster in the Cluster name box.

  11. Under the Bootstrap Actions area, in the Add bootstrap action drop-down list, click Custom action.

    The Add Bootstrap Action dialog box appears.

  12. Enter the name of the bootstrap action in the Name box.

  13. To select the location of the bootstrap script, click the icon beside the Script location box.

    The Select S3 File dialog box appears.

  14. Enter the path of the S3 bucket in the URL box.

    The contents of the S3 bucket appear.

  15. Select the bdp_bootstrap_installer.sh file from the S3 bucket.

  16. Click Select.

    The Big Data Protector bootstrap script file is selected and the Add Bootstrap Action dialog box appears.

  17. To specify the directory in which the Big Data Protector needs to be installed on the nodes in the cluster, provide the directory path in the Optional arguments box.

    If an installation directory for the Big Data Protector is not specified, then /opt/protegrity/ is considered as the default directory.

  18. Click Add.

    The General Cluster Settings tab on the Create Cluster - Advanced Options screen appears and the Bootstrap actions are updated.

  19. Click Next.

    The Security tab on the Create Cluster - Advanced Options screen appears.

  20. Select the required EC2 key pair for the EMR cluster from the EC2 key pair drop-down list.

  21. Click Create Cluster.

    The EMR cluster is created, Big Data Protector is installed on all the nodes in the cluster, and the required Big Data Protector parameters are configured.

  22. You can also create a new EMR cluster and install Big Data Protector on the nodes in the cluster from the AWS CLI using the following command:

    aws emr create-cluster \
      --auto-scaling-role EMR_AutoScaling_DefaultRole \
      --termination-protected \
      --applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark Name=Tez Name=HBase \
      --bootstrap-actions '[{"Path":"<S3_Path_For_BootstrapInstaller>","Name":"<Script_Name>"}]' \
      --ec2-attributes '{"KeyName":"<KEY_NAME>","InstanceProfile":"EMR_EC2_DefaultRole","EmrManagedSlaveSecurityGroup":"sg-c8ef00de","EmrManagedMasterSecurityGroup":"sg-2deb043b"}' \
      --service-role EMR_DefaultRole \
      --enable-debugging \
      --release-label emr-<EMR_Version> \
      --log-uri 's3n://aws-logs-406396743807-us-east-1/elasticmapreduce/' \
      --name '<Cluster_Name>' \
      --instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"}]' \
      --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
      --region us-east-1
    

    where:

    • S3_Path_For_BootstrapInstaller: Specifies the S3 bucket path containing the Big Data Protector bootstrap installer script.
    • Script_Name: Specifies the name of the Big Data Protector installation script.
    • KEY_NAME: Specifies the Private Key file on the Master node in the EMR cluster, which is used to communicate with the other nodes in the cluster.
    • Cluster_Name: Specifies the name of the new EMR cluster.
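
The same command can also be assembled programmatically, which avoids hand-quoting the JSON arguments. The following is a minimal sketch, not part of the product: the function name, the bucket path, the key pair name, and the release label are illustrative assumptions, and only a subset of the flags from the command above is included. It also assumes that the installation directory is passed to the bootstrap script as its first argument, mirroring the Optional arguments box in the console.

```python
import json
import shlex


def build_create_cluster_command(cluster_name, release_label, key_name,
                                 script_path, script_name, install_dir=None):
    """Return the argument list for an `aws emr create-cluster` call."""
    bootstrap_action = {"Path": script_path, "Name": script_name}
    if install_dir:
        # Assumption: the installer takes the target directory as its
        # first argument, like the Optional arguments box in the console.
        bootstrap_action["Args"] = [install_dir]

    # Same node layout as the documented command: 2 core nodes, 1 master.
    instance_groups = [
        {"InstanceCount": 2, "InstanceGroupType": "CORE",
         "InstanceType": "m3.xlarge", "Name": "Core - 2"},
        {"InstanceCount": 1, "InstanceGroupType": "MASTER",
         "InstanceType": "m3.xlarge", "Name": "Master - 1"},
    ]
    ec2_attributes = {"KeyName": key_name,
                      "InstanceProfile": "EMR_EC2_DefaultRole"}

    return [
        "aws", "emr", "create-cluster",
        "--name", cluster_name,
        "--release-label", release_label,
        "--applications", "Name=Hadoop", "Name=Hive", "Name=Spark",
        "--bootstrap-actions", json.dumps([bootstrap_action]),
        "--ec2-attributes", json.dumps(ec2_attributes),
        "--instance-groups", json.dumps(instance_groups),
        "--service-role", "EMR_DefaultRole",
        "--region", "us-east-1",
    ]


# Illustrative values only; replace them with your own.
command = build_create_cluster_command(
    cluster_name="bdp-demo",
    release_label="emr-6.15.0",
    key_name="my-key-pair",
    script_path="s3://example-bucket/bdp_bootstrap_installer.sh",
    script_name="Install Big Data Protector",
    install_dir="/opt/protegrity/",
)
# Print a shell-quoted command line that can be reviewed before running it.
print(shlex.join(command))
```

Building the arguments as a list and quoting them with shlex.join keeps the JSON intact when the command is pasted into a shell.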

1.2 - Managing the Cluster Nodes

The steps mentioned in this section are applicable only for the Bootstrap approach to install the Big Data Protector.

Depending on the workload on the EMR cluster, you can add or remove the Big Data Protector nodes. You can either set the cluster to automatically scale or manually add or remove nodes in the EMR cluster. You can add or remove nodes in the EMR cluster either while you create the cluster or after you have created the cluster. Before you add or remove the nodes from the cluster, ensure that you save all your data to S3, as standard practice, to avoid any data loss.

This section covers the procedure to add or remove nodes from an Amazon EMR cluster after you have created it.

To add or remove nodes from an Amazon EMR cluster:

  1. On the AWS management console, expand Services and click Analytics.

    The sub-menu appears.

  2. From the sub-menu, click EMR.

    The Amazon EMR page appears.

  3. Click the required cluster.

    The Properties tab of the cluster appears.

  4. Click the Instances tab.

  5. To add an instance, perform the following steps:

    1. Under Instance groups, click Add task instance group. The Add task instance group page appears.
    2. In the Name box, enter the name to identify the node.
    3. From the Choose EC2 instance type list, select the required storage type.
    4. In the Instance group size box, enter the required number of instances.
    5. Click Add task instance group. The new instance is added to the node and appears on the Instances tab.
  6. To resize an instance, perform the following steps:

    1. Under Instance groups, select the required instance that you want to resize.
    2. Click Resize instance group. The Resize page appears.
    3. In the Instance group size box, enter the required number of instances.
    4. Click Resize. The instance is resized as per the inputs and appears on the Instances tab.
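
The console steps above have AWS CLI counterparts in the `aws emr add-instance-groups` and `aws emr modify-instance-groups` commands. The following sketch only builds the argument lists for those commands; the function names, the cluster ID, the instance group ID, and the instance type are illustrative assumptions. The shorthand syntax used here does not handle group names that contain commas or spaces.

```python
def add_task_group_command(cluster_id, name, instance_type, count):
    """Build the CLI call that adds a task instance group to a cluster."""
    if count < 1:
        raise ValueError("a new instance group needs at least one instance")
    group = (
        f"InstanceGroupType=TASK,Name={name},"
        f"InstanceType={instance_type},InstanceCount={count}"
    )
    return ["aws", "emr", "add-instance-groups",
            "--cluster-id", cluster_id,
            "--instance-groups", group]


def resize_group_command(cluster_id, instance_group_id, count):
    """Build the CLI call that changes the size of an instance group."""
    if count < 0:
        raise ValueError("instance count cannot be negative")
    group = f"InstanceGroupId={instance_group_id},InstanceCount={count}"
    return ["aws", "emr", "modify-instance-groups",
            "--cluster-id", cluster_id,
            "--instance-groups", group]


# Illustrative IDs only.
print(" ".join(add_task_group_command("j-EXAMPLE", "bdp-task", "m5.xlarge", 2)))
print(" ".join(resize_group_command("j-EXAMPLE", "ig-EXAMPLE", 4)))
```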

1.3 - Verifying the Parameters

Verifying the Parameters for the Bootstrap Installer

The content mentioned in this section is applicable only for the Bootstrap approach to install the Big Data Protector.

Before using the Big Data Protector, verify that the required Protegrity-related parameters are configured in EMR. The installer sets the Big Data Protector configuration parameters for the EMR cluster when the Big Data Protector is installed on all the nodes in the cluster.

The following table lists the parameters that are set for the existing Amazon EMR cluster before using the Big Data Protector:

| Component | Configuration File | Updated Classpath Parameter |
|---|---|---|
| MapReduce | /etc/hadoop/conf/mapred-site.xml | mapreduce.application.classpath : /opt/protegrity/pepmapreduce/lib/* |
| | | /opt/protegrity/pephive/lib/* |
| | | /opt/protegrity/bdp_version/ |
| | | mapreduce.admin.user.env : LD_LIBRARY_PATH=/opt/protegrity/jpeplite/lib |
| Hive | /etc/hive/conf/hive-site.xml | hive.exec.pre.hooks : com.protegrity.hive.PtyHiveUserPreHook |
| | /etc/tez/conf/tez-site.xml | tez.cluster.additional.classpath.prefix:/opt/protegrity/pephive/lib/:/opt/protegrity/bdp_version/ |
| | | tez.am.launch.env: LD_LIBRARY_PATH=/opt/protegrity/jpeplite/lib/ |
| | /etc/hive/conf/hive-env.sh | export HIVE_CLASSPATH=${HIVE_CLASSPATH}:/opt/protegrity/pephive/lib/:/opt/protegrity/bdp_version/ |
| | | export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/opt/protegrity/jpeplite/lib/ |
| Pig | /etc/pig/conf/pig-env.sh | PIG_CLASSPATH="/opt/protegrity/peppig/lib/*:/opt/protegrity/bdp_version/" |
| | | export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/opt/protegrity/jpeplite/lib/ |
| HBase | /etc/hbase/conf/hbase-site.xml | hbase.coprocessor.region.classes:com.protegrity.hbase.PTYRegionObserver |
| | /etc/hbase/conf/hbase-env.sh | export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/opt/protegrity/pephbase/lib/*:/opt/protegrity/bdp_version/ |
| | | export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/opt/protegrity/jpeplite/lib/ |
| Spark | /etc/spark/conf/spark-defaults.conf | spark.driver.extraClassPath=/opt/protegrity/pephive/lib/:/opt/protegrity/pepspark/lib/:/opt/protegrity/bdp_version/ |
| | | spark.executor.extraClassPath=/opt/protegrity/pephive/lib/:/opt/protegrity/pepspark/lib/:/opt/protegrity/bdp_version/ |
| | | spark.executor.extraLibraryPath= /opt/protegrity/jpeplite/lib |
| | | spark.driver.extraLibraryPath= /opt/protegrity/jpeplite/lib |
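
One way to check the XML-based entries from this table on a node is to parse the site file and look for the expected path fragments. The following is a minimal sketch under stated assumptions: the helper names are illustrative, the sample XML stands in for /etc/hadoop/conf/mapred-site.xml, and the comma separator inside the sample classpath value is an assumption. The check only tests whether each fragment occurs in the property value, so it does not depend on the separator.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of /etc/hadoop/conf/mapred-site.xml.
SAMPLE_XML = """<configuration>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/opt/protegrity/pepmapreduce/lib/*,/opt/protegrity/pephive/lib/*,/opt/protegrity/bdp_version/</value>
  </property>
  <property>
    <name>mapreduce.admin.user.env</name>
    <value>LD_LIBRARY_PATH=/opt/protegrity/jpeplite/lib</value>
  </property>
</configuration>"""

# Property name -> fragments that must occur in its value (from the table).
EXPECTED = {
    "mapreduce.application.classpath": [
        "/opt/protegrity/pepmapreduce/lib/*",
        "/opt/protegrity/pephive/lib/*",
        "/opt/protegrity/bdp_version/",
    ],
    "mapreduce.admin.user.env": ["/opt/protegrity/jpeplite/lib"],
}


def read_site_properties(xml_text):
    """Return a dict of property name -> value from a Hadoop site file."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties


def missing_entries(properties, expected):
    """List (property, fragment) pairs that are not present."""
    missing = []
    for name, fragments in expected.items():
        value = properties.get(name, "")
        for fragment in fragments:
            if fragment not in value:
                missing.append((name, fragment))
    return missing


print(missing_entries(read_site_properties(SAMPLE_XML), EXPECTED))
```

To check a real node, read the file with open("/etc/hadoop/conf/mapred-site.xml").read() and pass the text to read_site_properties.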

2 - Using the Static Installer

Installing the Big Data Protector using the Static Installer

The static installer method of installation is applicable where the Big Data Protector must be installed on an existing EMR cluster. Using the Static Installer, users can enforce data protection policies at a granular level. This feature helps organizations to define specific rules for data protection based on sensitivity and usage.

The nodes in the cluster created using the static installer do not have auto-scaling enabled. The nodes must be manually added or decommissioned depending on the usage. The installation provides additional scripts to monitor and control the cluster behaviour. These scripts are available in the <installation_directory>/cluster_utils/ directory after installation.

2.1 - Installing the Protector on all the Nodes

Installing the Protector on all the Nodes using the Static Installer

The steps mentioned in this section are applicable only for the Static Installer approach to install the Big Data Protector.

  1. Log in to the Master or Lead node of the EMR cluster.

  2. Navigate to the directory that contains the BdpInstallx.x.x_Linux_<BDP_version>.sh script.

  3. To run the installer, execute the following script:

    ./BdpInstallx.x.x_Linux_<BDP_version>.sh
    
  4. Press ENTER.

    The prompt to continue the installation of the Big Data Protector appears.

    ************************************************************************************
               Welcome to the Hadoop Big Data Protector Setup Wizard
    ************************************************************************************
    This will install the Hadoop Big Data Protector on your system.
    
    This installation requires a Private Key file for communicating with other nodes in the cluster.
    
    Do you want to continue? [yes or no]:
    
  5. To continue, type yes.

  6. Press ENTER.

    The prompt to enter path of the Private Key file (.pem file) appears.

    Big Data Protector installation started
    Enter the path of the Private Key (.PEM) file:
    
  7. Enter the path of the .PEM file.

  8. Press ENTER.

    The prompt to enter the ESA hostname or IP address appears.

    libhadoop.so located in directory '/usr/lib/hadoop/lib/native'
    Unpacking...
    Extracting files...
    
    Preparing for cluster deploy, Wait...
    
    Enter ESA Hostname or IP Address:
    
  9. If you have installed a proxy, then enter the IP address of the proxy node. Alternatively, enter the IP Address of ESA.

  10. Press ENTER.

    The prompt to enter the listening port for ESA appears.

    Enter ESA host listening port [8443]:
    
  11. Enter the port for ESA.

  12. Press ENTER.

    The prompt to enter the JWT token appears.

    If you have an existing ESA JSON Web Token (JWT) with Export Certificates role, enter it otherwise enter 'no':
    
  13. Enter the JWT token.

  14. Press ENTER.

    If you do not provide a JWT token, the script prompts for the ESA username and password.

    JWT was not provided. Script will now prompt for ESA username and password.
    
    Enter ESA Username:
    
  15. Enter the username for ESA.

  16. Press ENTER.

    The prompt to enter the password appears.

    ************************************************************************************
                    Welcome to the RPAgent Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked rpagent compressed file...
    RPAgent Installing in Lead Node...
    Please enter the password for downloading certificates[]:
    
  17. Enter the password.

  18. Press ENTER.

    The script retrieves the JWT token from ESA, installs the RPAgent, and the prompt to select the Audit Store type appears.

    Unpacking...
    Extracting files...
    Obtaining token from <ESA_IP_Address>:8443...
    Downloading certificates from <ESA_IP_Address>:8443...
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
    100 11264  100 11264    0     0  12124      0 --:--:-- --:--:-- --:--:-- 12111
    
    Extracting certificates...
    Certificates successfully downloaded and stored in /opt/protegrity/rpagent/data
    
    Protegrity RPAgent installed in /opt/protegrity/rpagent.
    
    
    RPAgent installed on Lead node at location /opt/protegrity/rpagent.
    
    Performing install on other nodes...
    
    RPAgent installed on other nodes at location /opt/protegrity/rpagent.
    
    Check the status in /opt/protegrity/logs/rpagent_setup.log
    
    
    Select the Audit Store type where Log Forwarder(s) should send logs to.
    
    [ 1 ] : Protegrity Audit Store
    [ 2 ] : External Audit Store
    [ 3 ] : Protegrity Audit Store + External Audit Store
    
    Enter the no.:
    
  19. Depending on the Audit Store type, select any one of the following options:

    | Option | Description |
    |---|---|
    | 1 | To use the default setting using the Protegrity Audit Store appliance, type 1. If you enter 1, then the default Fluent Bit configuration files are used and Fluent Bit will forward the logs to the Protegrity Audit Store appliances. |
    | 2 | To use an external audit store, type 2. If you enter 2, then the default Fluent Bit configuration files used for the Protegrity Audit Store (out.conf and upstream.cfg in the /opt/protegrity/fluent-bit/data/config.d/ directory) are renamed (out.conf.bkp and upstream.cfg.bkp) so that they will not be used by Fluent Bit. Additionally, the custom Fluent Bit configuration files for the external audit store are copied to the /opt/protegrity/fluent-bit/data/config.d/ directory. |
    | 3 | To use a combination of the default setting with an external audit store, type 3. If you enter 3, then the default Fluent Bit configuration files used for the Protegrity Audit Store (out.conf and upstream.cfg in the /opt/protegrity/fluent-bit/data/config.d/ directory) are not renamed. However, the custom Fluent Bit configuration files for the external audit store are copied to the /opt/protegrity/fluent-bit/data/config.d/ directory. |
  20. Press ENTER.

    The prompt to enter the comma separated list of hostnames/IP addresses appears.

    Enter comma-separated list of Hostnames/IP Addresses and/or Ports of Protegrity Audit Store.
    Allowed Syntax: hostname[:port][,hostname[:port],hostname[:port]...] (Default Value - <ESA_IP_Address>:9200)
    Enter the list:
    
  21. To use the default value, press ENTER.

    The prompt to enter the location of the Fluent Bit configuration file appears.

    Enter the local directory path on this node that stores the custom Fluent-Bit configuration files for External Audit Store:
    

    The script will display this prompt only if you select option 2 in step 19. When you select option 2 in step 19, the custom configuration files are copied to the /<Installation directory>/fluent-bit/data/config.d/ directory on all the EMR nodes selected for installation.

  22. Enter the path that contains the Fluent Bit configuration file.

  23. Press ENTER.

    The prompt to save the RPAgent’s log in a file appears.

    Do you want RPAgent's log to be generated in a file? [yes or no]:
    
  24. To generate the logs in a file, type yes.

  25. Press ENTER.

    The script installs the protector on all the nodes in the cluster.

    RPAgent's log will be generated in a file.
    ************************************************************************************
                    Welcome to the LogForwarder Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked logforwarder compressed file...
    Logforwarder Installing in Lead Node...
    Unpacking...
    Extracting files...
    
    Protegrity Log Forwarder installed in /opt/protegrity/logforwarder.
    
    
    LogForwarder installed on Lead node at location /opt/protegrity/logforwarder.
    
    Performing install on other nodes...
    
    Logforwarder installed on other nodes at location /opt/protegrity/logforwarder.
    
    Check the status in /opt/protegrity/logs/logforwarder_setup.log
    ************************************************************************************
                        Welcome to the JcoreLite Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked jcorelite compressed file...
    Installing JcoreLite ....
    
    JcoreLite installed on lead node at location /opt/protegrity/bdp/lib.
    
    Performing install on other nodes...
    
    JcoreLite installed on other nodes at location /opt/protegrity/bdp/lib.
    
    Check the status in /opt/protegrity/logs/jcorelite_setup.log
    ************************************************************************************
                    Welcome to the Hive Protector Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked pephive compressed file...
    
    Hive Big Data Protector installed on lead node at location /opt/protegrity/bdp/lib/ and /opt/protegrity/pephive/scripts/.
    
    Performing install on other nodes...
    
    Hive Big Data Protector installed on other nodes at location /opt/protegrity/bdp/lib/ and /opt/protegrity/pephive/scripts/.
    
    Check the status in /opt/protegrity/logs/pephive_setup.log
    ************************************************************************************
                        Welcome to the Pig Protector Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked peppig compressed file...
    
    Pig Big Data Protector installed on lead node at location /opt/protegrity/bdp/lib/ and /opt/protegrity/peppig.
    
    Performing install on other nodes...
    
    Pig Big Data Protector installed on other nodes at location /opt/protegrity/bdp/lib/ and /opt/protegrity/peppig.
    
    Check the status in /opt/protegrity/logs/peppig_setup.log
    ************************************************************************************
                    Welcome to the MapReduce Protector Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked pepmapreduce compressed file...
    
    Mapreduce Big Data Protector installed on lead node at location /opt/protegrity/bdp/lib/.
    
    Performing install on other nodes...
    
    Mapreduce Big Data Protector installed on other nodes at location /opt/protegrity/bdp/lib/.
    
    Check the status in /opt/protegrity/logs/pepmapreduce_setup.log
    ************************************************************************************
                        Welcome to the Hbase Protector Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked pephbase compressed file...
    
    Hbase Big Data Protector installed on lead node at location /opt/protegrity/bdp/lib/.
    
    Performing install on other nodes...
    
    Hbase Big Data Protector installed on other nodes at location /opt/protegrity/bdp/lib/.
    
    Check the status in /opt/protegrity/logs/pephbase_setup.log
    ************************************************************************************
                    Welcome to the Spark Protector Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked pepspark compressed file...
    
    Spark Big Data Protector installed on lead node at location /opt/protegrity/bdp/lib/ and /opt/protegrity/pepspark/scripts/.
    
    Performing install on other nodes...
    
    Spark Big Data Protector installed on other nodes at location /opt/protegrity/bdp/lib/ and /opt/protegrity/pepspark/scripts/.
    
    Check the status in /opt/protegrity/logs/pepspark_setup.log
    
    Starting Logforwarder on lead node...
    
    Starting Logforwarder on other nodes...
    
    Starting RPAgent on lead node...
    
    Starting RPAgent on other nodes...
    
    Hadoop Big Data Protector installed in /opt/protegrity.
    
    Generating Big Data Protector installation status report ...
    
    Clearing previous logs files ...
    
    Installation Status report generated in /opt/protegrity/cluster_utils/installation_report.txt
    
  26. Restart the Hadoop, Hive, and HBase service daemon processes to start using the updated configuration.
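
The Audit Store prompt in the procedure above accepts a list in the form hostname[:port][,hostname[:port]...]. A list can be checked against that syntax before it is typed into the installer. The following is a minimal sketch: the function name is illustrative, and falling back to port 9200 when an entry has no port is an assumption based on the default value shown in the prompt. IPv6 literals are not handled.

```python
def parse_audit_store_list(text, default_port=9200):
    """Parse 'host[:port],host[:port]' into a list of (host, port) tuples."""
    endpoints = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            raise ValueError("empty entry in the Audit Store list")
        host, separator, port_text = item.partition(":")
        if not host:
            raise ValueError(f"missing hostname in {item!r}")
        if separator:
            # A colon was given, so a valid TCP port must follow it.
            if not port_text.isdigit() or not 0 < int(port_text) < 65536:
                raise ValueError(f"invalid port in {item!r}")
            port = int(port_text)
        else:
            # Assumption: entries without a port use the default port.
            port = default_port
        endpoints.append((host, port))
    return endpoints


print(parse_audit_store_list("10.0.0.5:9200, audit2.example.com"))
```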

2.2 - Installing the Protector on Specific Nodes

Installing the Protector on Specific Nodes using the Static Installer

The steps mentioned in this section are applicable only for the Static Installer approach to install the Big Data Protector.

Protegrity provides the BdpInstallx.x.x_Linux_<arch>_<BDP_version>.sh script to install the Big Data Protector on the new nodes that you add to an existing EMR cluster.

Ensure that you install the Big Data Protector from an account that has full sudoer privileges.

  1. Log in to the Lead Node on the EMR cluster.

  2. Navigate to the <PROTEGRITY_DIR>/cluster_utils directory.

  3. In the NEW_HOSTS_FILE file, add an additional entry for each new node in the EMR cluster, on which you want to install the Big Data Protector. The new nodes from the NEW_HOSTS_FILE file will be appended to the CLUSTERLIST_FILE.

  4. To install the Big Data Protector on the new nodes, run the following command:

    ./BdpInstallx.x.x_Linux_<arch>_<BDP_version>.sh -a <NEW_HOSTS_FILE>
    
  5. Press ENTER.

    The prompt to enter the path of the Private Key file (.pem file) appears.

  6. Enter the path of the Private Key file.

  7. Press ENTER.

    The script installs the Big Data Protector on the new nodes in the EMR cluster.
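
Before running the installer, it can help to confirm that the entries in the NEW_HOSTS_FILE file are not already part of the cluster list. The following is a minimal sketch, and the file format is an assumption: it treats both files as plain text with one hostname or IP address per line, which the procedure above does not state explicitly. The function name is illustrative.

```python
def hosts_to_add(cluster_list_text, new_hosts_text):
    """Return new hosts that are not yet in the cluster list, in order."""

    def entries(text):
        # Assumption: one host per line; blank lines are ignored.
        return [line.strip() for line in text.splitlines() if line.strip()]

    known = set(entries(cluster_list_text))
    result = []
    for host in entries(new_hosts_text):
        if host not in known:
            known.add(host)  # also drops duplicates within the new file
            result.append(host)
    return result


cluster_list = "node1.example.com\nnode2.example.com\n"
new_hosts = "node2.example.com\nnode3.example.com\nnode3.example.com\n"
print(hosts_to_add(cluster_list, new_hosts))
```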

2.3 - Verifying the Parameters

Verifying the Parameters for the Static Installer

The content in this section is applicable only for the Static installer approach to install the Big Data Protector.

Before using the Big Data Protector, verify that the required Protegrity-related parameters are configured in EMR. The installer sets the Big Data Protector configuration parameters for the EMR cluster when the Big Data Protector is installed on all the nodes in the cluster.

The following table lists the parameters that are set for the existing Amazon EMR cluster before using the Big Data Protector:

| Component | Configuration File | Updated Classpath Parameter |
|---|---|---|
| MapReduce | /etc/hadoop/conf/mapred-site.xml | mapreduce.application.classpath : /opt/protegrity/pepmapreduce/lib/* |
| | | /opt/protegrity/pephive/lib/* |
| | | /opt/protegrity/bdp_version/ |
| | | mapreduce.admin.user.env : LD_LIBRARY_PATH=/opt/protegrity/jpeplite/lib |
| Hive | /etc/hive/conf/hive-site.xml | hive.exec.pre.hooks : com.protegrity.hive.PtyHiveUserPreHook |
| | /etc/tez/conf/tez-site.xml | tez.cluster.additional.classpath.prefix:/opt/protegrity/pephive/lib/:/opt/protegrity/bdp_version/ |
| | | tez.am.launch.env: LD_LIBRARY_PATH=/opt/protegrity/jpeplite/lib/ |
| | /etc/hive/conf/hive-env.sh | export HIVE_CLASSPATH=${HIVE_CLASSPATH}:/opt/protegrity/pephive/lib/:/opt/protegrity/bdp_version/ |
| | | export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/opt/protegrity/jpeplite/lib/ |
| Pig | /etc/pig/conf/pig-env.sh | PIG_CLASSPATH="/opt/protegrity/peppig/lib/*:/opt/protegrity/bdp_version/" |
| | | export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/opt/protegrity/jpeplite/lib/ |
| HBase | /etc/hbase/conf/hbase-site.xml | hbase.coprocessor.region.classes:com.protegrity.hbase.PTYRegionObserver |
| | /etc/hbase/conf/hbase-env.sh | export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/opt/protegrity/pephbase/lib/*:/opt/protegrity/bdp_version/ |
| | | export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/opt/protegrity/jpeplite/lib/ |
| Spark | /etc/spark/conf/spark-defaults.conf | spark.driver.extraClassPath=/opt/protegrity/pephive/lib/:/opt/protegrity/pepspark/lib/:/opt/protegrity/bdp_version/ |
| | | spark.executor.extraClassPath=/opt/protegrity/pephive/lib/:/opt/protegrity/pepspark/lib/:/opt/protegrity/bdp_version/ |
| | | spark.executor.extraLibraryPath= /opt/protegrity/jpeplite/lib |
| | | spark.driver.extraLibraryPath= /opt/protegrity/jpeplite/lib |
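
The Spark entries from this table live in a key/value file rather than in XML, so they can be checked with a small parser. The following is a minimal sketch: the helper names are illustrative, the sample text stands in for /etc/spark/conf/spark-defaults.conf, and the parser is a simplification that splits each line at the first "=", ":", or whitespace and ignores comment lines.

```python
import re

# Stand-in for the contents of /etc/spark/conf/spark-defaults.conf.
SAMPLE_CONF = """# Protegrity entries
spark.driver.extraClassPath=/opt/protegrity/pephive/lib/:/opt/protegrity/pepspark/lib/:/opt/protegrity/bdp_version/
spark.executor.extraLibraryPath= /opt/protegrity/jpeplite/lib
spark.master yarn
"""

# Key, then one separator ("=", ":", or whitespace), then the value.
LINE_PATTERN = re.compile(r"^\s*([^\s=:#]+)\s*[=:\s]\s*(.*?)\s*$")


def parse_spark_defaults(text):
    """Return a dict of property name -> value, skipping comments."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = LINE_PATTERN.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def has_path(value, path):
    """Check whether a colon-separated path list contains the given path."""
    entries = [entry.strip().rstrip("/") for entry in value.split(":")]
    return path.rstrip("/") in entries


conf = parse_spark_defaults(SAMPLE_CONF)
for expected in ("/opt/protegrity/pephive/lib/",
                 "/opt/protegrity/pepspark/lib/",
                 "/opt/protegrity/bdp_version/"):
    print(expected, has_path(conf["spark.driver.extraClassPath"], expected))
```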

3 - Using the EMR Serverless Installer

The overall process of installing the Big Data Protector is explained in the following sections:

  • Installing the EMR Serverless protector
  • Setting up the Log Forwarder

3.1 - EMR Serverless Setup CLI

The instructions mentioned in this section are applicable only for the Serverless approach to install the Big Data Protector.

The EMR Serverless Setup CLI automates the complete Docker image build and deployment pipeline for the Big Data Protector. It validates the environment, prepares the configuration files, generates the Docker files, builds images with ESA certificate injection, and pushes the artifacts to AWS ECR.

To facilitate the installation, the configurator script generates a set of Python scripts within the ./Installation_Files/ directory. The script and its arguments are listed below.

    python scripts/emr_serverless_setup_cli.py <argument>

| Argument | Purpose |
|---|---|
| validate | Verifies the working directory and config.json schema. Also validates AWS CLI connectivity and Docker presence. |
| prepare-assets | Updates the config.ini file and the GetCertificates.sh script with ESA details. |
| generate-dockerfile | Creates the runtime-specific Dockerfile (Spark/Hive). |
| build | Builds the Docker image with ESA certificate injection. |
| push | Pushes the custom image to AWS ECR. |
| deploy | Runs the full pipeline, from validation to push, in a single command. |

Note: Execute the individual commands to accommodate custom modifications at any step.
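
When the individual commands are run one at a time, a small wrapper can run them in order and stop at the first failure, leaving room for manual changes between stages. The following is a minimal sketch: the wrapper function and its runner parameter are illustrative assumptions, while the script path and the argument names come from the table above. The build stage prompts for ESA credentials, so the wrapper is intended for an interactive terminal.

```python
import subprocess
import sys

CLI_SCRIPT = "scripts/emr_serverless_setup_cli.py"
STAGES = ["validate", "prepare-assets", "generate-dockerfile", "build", "push"]


def run_stages(stages=STAGES, runner=None):
    """Run each stage in order and stop at the first non-zero exit code.

    Returns (completed_stages, failed_stage); failed_stage is None when
    every stage succeeded. `runner` exists so the loop can be exercised
    without invoking the real CLI.
    """
    if runner is None:
        def runner(command):
            return subprocess.run(command).returncode

    completed = []
    for stage in stages:
        exit_code = runner([sys.executable, CLI_SCRIPT, stage])
        if exit_code != 0:
            return completed, stage
        completed.append(stage)
    return completed, None


# Dry run with a stand-in runner that only prints the commands.
def print_only(command):
    print(" ".join(command))
    return 0


print(run_stages(runner=print_only))
```

Calling run_stages() without a runner executes the real commands from the directory where the installation files are extracted.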

Validating the Environment

The validate argument in the Python script:

  • Validates the config.json schema and the required parameters.
  • Verifies the Docker installation and the daemon status.
  • Verifies the AWS CLI configuration and credentials.
  • Tests ECR repository connectivity.
  • Validates the presence of BDP artifacts, such as .jar and configuration files.
  • Tests ESA connectivity on the configured port.
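
The ESA connectivity test in the list above can be approximated manually with a plain TCP check, which is useful when validation fails and the network path needs to be ruled out. The following is a minimal sketch and not the CLI's own implementation: the function name and the host value are illustrative assumptions, and port 8443 is used only because it is the default ESA listening port shown by the static installer.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Replace the host with the ESA hostname or IP address.
    print(can_connect("esa.example.com", 8443, timeout=3.0))
```

A successful TCP connection only shows that the port is reachable; it does not confirm that the ESA service or its certificates are valid.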

To validate the environment:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the Python script, run the following command:
    python scripts/emr_serverless_setup_cli.py validate
    
  4. Press ENTER. The script performs the required validations and the status of each step appears.
    [Validation]
    ============================================================
    [OK] config.json schema valid
    + docker info
    + docker buildx version
    + aws sts get-caller-identity --output json
    + aws ecr describe-repositories --repository-names bdp-emr-serverless --region <region_name>
    
    Summary:
    [OK] Working directory
    [OK] Config schema
    [OK] Docker installed
    [OK] Docker daemon
    [OK] BuildKit support
    [OK] AWS CLI installed
    [OK] AWS credentials
    [OK] Assets prepared
    [OK] Dockerfile exists
    [OK] COPY sources exist
    [OK] ECR repo exists
    
    [VALIDATION PASSED]
    

Preparing the Assets

The prepare-assets argument in the Python script:

  • Reads the common/config.ini template.
  • Appends the [sync] section in the config.ini file with ESA connection settings from the config.json file.
  • Appends the [log] section in the config.ini file with output = stdout.
  • Updates the /common/GetCertificates.sh file with the ESA host/port.
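
The effect of the config.ini updates listed above can be pictured with Python's configparser module. The following is a minimal sketch and not the CLI's own code: the function name is illustrative, and the host and port key names in the [sync] section and the example_key entry are hypothetical placeholders, because the actual key names are not documented here. Only the [sync] and [log] section names and the output = stdout setting come from the list above.

```python
import configparser
import io


def add_sync_and_log(config_text, esa_host, esa_port):
    """Return config text with [sync] and [log] sections added or updated."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the original case of the keys
    parser.read_string(config_text)

    if not parser.has_section("sync"):
        parser.add_section("sync")
    # Hypothetical key names; check the generated config.ini for the real ones.
    parser["sync"]["host"] = esa_host
    parser["sync"]["port"] = str(esa_port)

    if not parser.has_section("log"):
        parser.add_section("log")
    parser["log"]["output"] = "stdout"

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Stand-in template; example_key is a placeholder entry.
template = "[protector]\nexample_key = 1\n"
print(add_sync_and_log(template, "esa.example.com", 8443))
```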

To prepare the assets:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the Python script, run the following command:
    python scripts/emr_serverless_setup_cli.py prepare-assets
    
  4. Press ENTER.
    The script performs the required actions and a confirmation appears.
    [Phase 1: Prepare Assets]
    ============================================================
    [INFO] Runtime: SPARK
    [INFO] Log Output: stdout (audit logs will be sent to stdout)
    
    [OK] inserted [sync] after [protector] and updated [log] section (output=stdout, mode=drop) -> ../common/config.ini
    [OK] updated GetCertificates.sh -> ../common/GetCertificates.sh
    
    
    
    

Generating the Dockerfile

The generate-dockerfile argument in the Python script:

  • Reads the runtime configuration from the config.json file for the spark or hive application.
  • Generates a multi-stage Dockerfile optimized for EMR Serverless.
  • Configures BuildKit secrets for secure ESA credential handling.
  • Stores the config.ini file in both Spark and Hive locations to ensure runtime interoperability.
  • Sets up the certificate fetch to run at build time rather than at runtime.
  • Configures the required permissions for the hadoop:hadoop user.

To generate the Dockerfile:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the Python script, run the following command:
    python scripts/emr_serverless_setup_cli.py generate-dockerfile
    
  4. Press ENTER. The script performs the required actions and a confirmation appears.
    [Phase 2: Generate Dockerfile]
    ============================================================
    + which docker 2>/dev/null
    + docker info 2>/dev/null | grep -i 'docker root dir' || true
    [INFO] traditional Docker - using BuildKit secrets (secure)
    [OK] Generated /home/ubuntu/serverless/final_build/spark/Installation_Files/Dockerfile
    

Building the Docker Image

The build argument in the Python sript:

  • Prompts for ESA credentials, such as, username and password.
  • Executes the Docker build with BuildKit secrets.
  • Cleans up the temporary credential files immediately after building the image.

To build the Docker image:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the Python script, run the following command:
    python scripts/emr_serverless_setup_cli.py build
    
  4. Press ENTER. The script starts the build process and the prompt to select the authentication method appears.
    ============================================================
    EMR Serverless BDP Image Builder (Build Only)
    ============================================================
    
    Runtime: spark
    + docker info
    + docker buildx version
    
    [INFO] Using existing config.ini and Dockerfile
    [INFO] If you need to regenerate them, use 'prepare-assets' command first
    
    ============================================================
          ESA Authentication Required
    ============================================================
    Credentials needed to fetch certificates during Docker build.
    NOT stored in config files or image layers.
    Passed securely via Docker BuildKit secrets.
    
    Authentication Method:
    [1] Username/Password
    [2] JWT Token
    
    Select authentication method (1 or 2): 
    
  5. To authenticate with a username and password, type 1.
  6. Press ENTER.
    The prompt to enter the ESA username appears.
    Enter ESA Username: 
    
  7. Enter the username.
  8. Press ENTER. The prompt to enter the password appears.
    Enter ESA Password:
    
  9. Enter the password.
  10. Press ENTER. The script resumes and completes the build process.
    [Phase 3: Build]
    ============================================================
    + aws ecr describe-repositories --repository-names bdp-emr-serverless --region <region_name>
    + aws ecr get-login-password --region <region_name> | docker login --username AWS --password-stdin <Account_ID>.dkr.ecr.<region_name>.amazonaws.com
    + which docker 2>/dev/null
    + docker info 2>/dev/null | grep -i 'docker root dir' || true
    
    [BUILD] traditional Docker - using BuildKit secrets (secure)
    + cd /home/ubuntu/serverless/final_build/spark/Installation_Files && DOCKER_BUILDKIT=1 docker build --secret id=esa_user,src=/tmp/tmpoyvdsake.secret --secret id=esa_password,src=/tmp/tmpq6l9mn8v.secret -t bdp-emr-serverless:tag_spark -f Dockerfile .
    
    [OK] Built local image bdp-emr-serverless:tag_spark for runtime 'spark'
    
    
    ============================================================
    [SUCCESS] Image built locally
    Use 'push' command to push to ECR
    ============================================================

Pushing the Image to ECR

The push argument in the Python script:

  • Authenticates with AWS ECR using aws ecr get-login-password.
  • Tags the local image with the full ECR URI.
  • Pushes all image layers to ECR.
  • Verifies the image exists in ECR after push.
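
As an illustration of the tagging step, the following Python sketch assembles the full ECR URI and the matching docker commands. The function names are hypothetical. The URI layout follows the `<Account_ID>.dkr.ecr.<region_name>.amazonaws.com/<repository>:<tag>` form visible in the console output of this section.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build the full ECR URI for an image."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


def tag_and_push_commands(local_image: str, account_id: str, region: str) -> list:
    """Return the docker tag and docker push commands for a local image."""
    repository, _, tag = local_image.partition(":")
    # Assumption: fall back to "latest" when the local image has no explicit tag.
    remote = ecr_image_uri(account_id, region, repository, tag or "latest")
    return [
        ["docker", "tag", local_image, remote],
        ["docker", "push", remote],
    ]
```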

To push the image to ECR:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the Python script, run the following command:
    python scripts/emr_serverless_setup_cli.py push
    
  4. Press ENTER. The script pushes the image to ECR and a confirmation appears.
    [Push Image to ECR]
    ============================================================
    + aws sts get-caller-identity --output json
    + aws ecr describe-repositories --repository-names bdp-emr-serverless --region <region_name>
    + docker info
    + docker images --format '{{.Repository}}:{{.Tag}}'
    + aws ecr get-login-password --region <region_name> | docker login --username AWS --password-stdin <Account_ID>.dkr.ecr.<region_name>.amazonaws.com
    [OK] Logged in to ECR: <Account_ID>.dkr.ecr.<region_name>.amazonaws.com
    + docker tag bdp-emr-serverless:tag_spark <Account_ID>.dkr.ecr.<region_name>.amazonaws.com/bdp-emr-serverless:tag_spark
    [OK] Tagged image bdp-emr-serverless:tag_spark -> <Account_ID>.dkr.ecr.<region_name>.amazonaws.com/bdp-emr-serverless:tag_spark
    + docker push <Account_ID>.dkr.ecr.<region_name>.amazonaws.com/bdp-emr-serverless:tag_spark
    [OK] Pushed image <Account_ID>.dkr.ecr.<region_name>.amazonaws.com/bdp-emr-serverless:tag_spark
    
    [SUCCESS] Image pushed to ECR
    

Deploying the Image

The deploy argument enables the execution of the complete pipeline starting from validation to deployment in a single command.

Note: This is an optional step.

To deploy the image:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the Python script, run the following command:
    python scripts/emr_serverless_setup_cli.py deploy
    
  4. Press ENTER. The script deploys the image and a confirmation appears.
    ============================================================
    EMR Serverless BDP Image Deployment (Full Pipeline)
    ============================================================
    
    Runtime: spark
    + docker info
    + docker buildx version
    + aws sts get-caller-identity --output json
    + aws ecr describe-repositories --repository-names bdp-emr-serverless --region <region_name>
    
    [Phase 1/3] Preparing assets...
    
    [Phase 1: Prepare Assets]
    ============================================================
    [INFO] Runtime: SPARK
    [INFO] Log Output: stdout (audit logs will be sent to stdout)
    
    [OK] replaced [sync] and updated [log] section (output=stdout, mode=drop) -> ../common/config.ini
    [OK] updated GetCertificates.sh -> ../common/GetCertificates.sh
    
    
    [Phase 2/3] Generating Dockerfile...
    
    [Phase 2: Generate Dockerfile]
    ============================================================
    + which docker 2>/dev/null
    + docker info 2>/dev/null | grep -i 'docker root dir' || true
    [INFO] traditional Docker - using BuildKit secrets (secure)
    [OK] Generated /home/ubuntu/serverless/final_build/spark/Installation_Files/Dockerfile
    
    
    [Phase 3/3] Building and pushing image...
    
    ============================================================
          ESA Authentication Required
    ============================================================
    Credentials needed to fetch certificates during Docker build.
    NOT stored in config files or image layers.
    Passed securely via Docker BuildKit secrets.
    
    Authentication Method:
    [1] Username/Password
    [2] JWT Token
    
    Select authentication method (1 or 2): 1
    Enter ESA Username: admin
    Enter ESA Password:
    
    [Phase 3: Build]
    ============================================================
    + aws ecr describe-repositories --repository-names bdp-emr-serverless --region <region_name>
    + aws ecr get-login-password --region <region_name> | docker login --username AWS --password-stdin <Account_ID>.dkr.ecr.<region_name>.amazonaws.com
    + which docker 2>/dev/null
    + docker info 2>/dev/null | grep -i 'docker root dir' || true
    
    [BUILD] traditional Docker - using BuildKit secrets (secure)
    + cd /home/ubuntu/serverless/final_build/spark/Installation_Files && DOCKER_BUILDKIT=1 docker build --secret id=esa_user,src=/tmp/tmphax6dcg9.secret --secret id=esa_password,src=/tmp/tmpzgrig1jz.secret -t bdp-emr-serverless:tag_spark -f Dockerfile .
    
    [OK] Built local image bdp-emr-serverless:tag_spark for runtime 'spark'
    
    + docker tag bdp-emr-serverless:tag_spark <Account_ID>.dkr.ecr.<region_name>.amazonaws.com/bdp-emr-serverless:tag_spark
    + docker push <Account_ID>.dkr.ecr.<region_name>.amazonaws.com/bdp-emr-serverless:tag_spark
    
    [OK] Pushed <Account_ID>.dkr.ecr.<region_name>.amazonaws.com/bdp-emr-serverless:tag_spark
    
    
    ============================================================
    [SUCCESS] All phases completed
    ============================================================
    

3.2 - Setting up the Log Forwarder

The instructions in this section are applicable only for the Serverless approach to install the Big Data Protector.

In the native EMR setup, Protegrity processes could be managed directly within the cluster nodes. However, in the containerized EMR Serverless environment, this level of control is limited. As a result, logs must be redirected to either Amazon S3 or CloudWatch. Using a CloudWatch Logs subscription filter, relevant log entries are streamed into Amazon Kinesis Data Streams. A Lambda function then processes these Kinesis batches, extracts the Protegrity audit JSON lines, constructs an OpenSearch Bulk (_bulk) payload, and sends it to the ESA endpoint.
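
The processing path described above can be sketched in Python. This is an illustrative sketch only, not the lambda_function.py file produced by the configurator script. The function names are hypothetical, and the sketch assumes that each audit entry is a complete JSON object that runs to the end of its log line. It relies on the documented behavior that CloudWatch Logs subscription data arrives in Kinesis as gzip-compressed JSON, which Lambda presents as a base64 string.

```python
import base64
import gzip
import json


def decode_log_events(kinesis_record: dict) -> list:
    """Decode one Kinesis record written by a CloudWatch Logs subscription."""
    raw = base64.b64decode(kinesis_record["kinesis"]["data"])
    payload = json.loads(gzip.decompress(raw))
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # control messages carry no log lines
    return [event["message"] for event in payload.get("logEvents", [])]


def extract_audit_docs(messages: list, marker: str = "logtype") -> list:
    """Keep only the messages that contain the marker and parse as a JSON object."""
    docs = []
    for message in messages:
        start = message.find("{")
        if marker not in message or start < 0:
            continue
        try:
            doc = json.loads(message[start:])
        except json.JSONDecodeError:
            continue  # not a complete JSON line, skip it
        if isinstance(doc, dict):
            docs.append(doc)
    return docs


def build_bulk_payload(docs: list) -> str:
    """Build a newline-delimited _bulk body: one action line per document line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n" if lines else ""
```

The empty `{"index": {}}` action line works when the target index is part of the request URL, as it is in a URL of the form `https://<ESA_IP_Address>:9200/pty_insight_audit/_bulk`.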

Note: CloudWatch log lines do not always appear instantly. Some delay between the job run and the log entry is expected.

Important: The logging functionality works only when the jobs are submitted using the AWS CLI with the aws emr-serverless start-job-run command. A sample command is listed below.

aws emr-serverless start-job-run \
  --region <region_name> \
  --application-id <application_id> \
  --execution-role-arn arn:aws:iam::<Account_ID>:role/EMR-Servlerless-Execution-Role \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<script_path>/<script_name>.py"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "cloudWatchLoggingConfiguration": {
        "enabled": true,
        "logGroupName": "<log_group_name>",
        "logStreamNamePrefix": "emrs",
        "logTypes": {
          "SPARK_DRIVER": ["STDOUT","STDERR"],
          "SPARK_EXECUTOR": ["STDOUT","STDERR"]
        }
      }
    }
  }'
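
When the job is submitted from a script, the quoting of the two JSON arguments is easy to get wrong. The following Python sketch, with a hypothetical function name, builds the same argument list as the sample command and leaves the JSON serialization to the json module. The caller supplies all values.

```python
import json
import shlex


def start_job_run_args(region, application_id, role_arn, entry_point,
                       log_group, stream_prefix="emrs"):
    """Return the argument list for aws emr-serverless start-job-run."""
    job_driver = {"sparkSubmit": {"entryPoint": entry_point}}
    overrides = {
        "monitoringConfiguration": {
            "cloudWatchLoggingConfiguration": {
                "enabled": True,
                "logGroupName": log_group,
                "logStreamNamePrefix": stream_prefix,
                "logTypes": {
                    "SPARK_DRIVER": ["STDOUT", "STDERR"],
                    "SPARK_EXECUTOR": ["STDOUT", "STDERR"],
                },
            }
        }
    }
    return [
        "aws", "emr-serverless", "start-job-run",
        "--region", region,
        "--application-id", application_id,
        "--execution-role-arn", role_arn,
        "--job-driver", json.dumps(job_driver),
        "--configuration-overrides", json.dumps(overrides),
    ]
```

The list can be passed directly to subprocess.run, or printed as a shell-safe command line with shlex.join(args).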

Note: Only driver logs will be generated when a job is executed from the AWS Web UI. Therefore, execute the jobs only through the AWS CLI to generate both the driver and the executor logs in the CloudWatch Log group.

Prerequisites

The Lambda function is able to reach ESA

The ESA is configured in a private network. Therefore, the Lambda function must run in a VPC subnet that has a network route to the ESA IP address, for example, through a VPN, a transit gateway, VPC peering, or the same network. Ensure the following:

  • The Lambda function is attached to the VPC subnet that can route to the ESA IP address.
  • The Security Group egress allows TCP 9200 to the ESA IP address.
  • The network ACLs (NACLs) allow this traffic.
  • The TLS CA cert is available to the Lambda function.
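
A quick way to confirm the routing and security group requirements is to attempt a TCP connection from inside the Lambda function or from a host in the same subnet. The following Python sketch uses a hypothetical helper name and checks TCP reachability only. It does not validate TLS or the CA certificate.

```python
import socket


def can_reach(host: str, port: int = 9200, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and missing routes.
        return False
```

A False result from the Lambda subnet usually points to routing, security group, or NACL configuration rather than to the ESA itself.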

The Lambda function is able to access the Kinesis Stream

The Lambda function reading from Kinesis must be able to reach the Kinesis API endpoints. If a NAT gateway is available, VPC endpoints are not required.

The Kinesis Stream is able to retrieve the Logs from the CloudWatch Log group

The Kinesis Stream must be able to retrieve the Logs from the CloudWatch Log group.

EMR Serverless is able to send the logs to the CloudWatch Log group

The EMR Serverless cluster must be able to send the logs to the CloudWatch Log group.

Creating the Kinesis Data Stream

  1. Log in to the AWS console.

  2. Navigate to the Amazon Kinesis page.

  3. Click Data streams.

  4. Click Create data stream.

  5. In the Data stream name box, enter a name to identify the stream.

  6. Under Capacity mode, select the required mode.

    Note: In case of Provisioned mode, start with 1 shard. This can be increased later.

  7. Click Create data stream.

  8. After the data stream is created, open the data stream.

  9. Note the ARN.

    Note: The default retention period is 24 hours. To increase the retention period, set the required duration in the Retention period box under the Configuration tab.

Creating the IAM Role

CloudWatch requires permissions to write the logs into the Kinesis stream. Create an IAM role that grants the required permissions to CloudWatch for writing the logs into the Kinesis stream.

  1. To create the role, log in to the AWS console.
  2. Navigate to IAM > Roles > Create role.
  3. Set the Trusted entity as AWS service.
  4. Set the Use case as CloudWatch Events.
  5. Set a Name for the role.
  6. Include permissions for the policy. A sample is listed below.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowPutToKinesis",
          "Effect": "Allow",
          "Action": [
            "kinesis:PutRecord",
            "kinesis:PutRecords"
          ],
          "Resource": "arn:aws:kinesis:<region_name>:<Account_ID>:stream/emr-protegrity-audit-stream"
        }
      ]
    }
    
  7. Ensure that the trust policy allows the CloudWatch Logs service to assume the role.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "Service": "logs.<region_name>.amazonaws.com" },
          "Action": "sts:AssumeRole"
        }
      ]
    }
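
To avoid hand-editing the placeholders, the two policy documents can be generated. The following Python sketch, with hypothetical function names, fills in the region, account ID, and stream name and returns the same structures as the samples above.

```python
import json


def kinesis_stream_arn(region: str, account_id: str, stream_name: str) -> str:
    """Build the ARN of a Kinesis data stream."""
    return f"arn:aws:kinesis:{region}:{account_id}:stream/{stream_name}"


def permissions_policy(stream_arn: str) -> dict:
    """Policy that lets the role put records into the given stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPutToKinesis",
                "Effect": "Allow",
                "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
                "Resource": stream_arn,
            }
        ],
    }


def trust_policy(region: str) -> dict:
    """Trust policy that lets the regional CloudWatch Logs service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": f"logs.{region}.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
```

Calling json.dumps(permissions_policy(arn), indent=2) produces text that can be pasted into the IAM policy editor.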
    

Creating the CloudWatch Log group

  1. Log in to the AWS console.
  2. Navigate to the CloudWatch page.
  3. Navigate to Logs > Log management.
  4. Click Create log group.
  5. In the Log group name box, enter a name to identify the group in the following syntax:
    /aws/<log_group_name>
    
  6. From the Retention setting list, select the required option.
  7. From the Log class list, select the required option.
  8. Click Create.

Note: Ensure that the required IAM permissions are assigned for the Log group. The EMR Serverless application execution role must have permissions to access the CloudWatch Log group created above.

Creating the CloudWatch Logs Subscription Filter

  1. Log in to the AWS console.
  2. Navigate to the CloudWatch page.
  3. Navigate to Logs > Log management.
  4. Select the CloudWatch log group name that is created.
  5. Select Actions > Create subscription filter.
  6. Select the required Destination account.
  7. Under Kinesis data stream, select the stream name that is created.
  8. Under IAM role, select the role that was created to allow CloudWatch to write the logs into the Kinesis stream.
  9. If the Protegrity JSON lines contain “logtype”, specify the filter pattern as logtype.

    Note: If the JSON is embedded in other text, filter on a unique token, such as correlationid or protection.

  10. Click Start streaming.

Note: CloudWatch Logs allows only a limited number of subscription filters per log group. The common limit is 2 subscription filters per log group.

Creating the Lambda Function

The Lambda function is responsible for sending the logs from the Kinesis stream to the ESA.

  1. Log in to the AWS console.
  2. Navigate to the Lambda page.
  3. To create a function, click Create function.
  4. Select the Author from scratch option.
  5. In the Function name box, enter a name to identify the function.
  6. From the Runtime list, select the required language, such as Python.
  7. Under Execution role, select the Create a new role with basic Lambda permissions option.
  8. Click Create function.

    Note: Ensure that the Lambda function has access to the Kinesis stream and to SQS. The function must also have the AWSLambdaBasicExecutionRole and AWSLambdaVPCAccessExecutionRole permissions.

Attaching a VPC to the Lambda Function

  1. To edit the function and attach a VPC, on the Lambda page, click the function name.
  2. Click the Configuration tab.
  3. From the left pane, click VPC.
  4. To modify the configuration, click Edit.
  5. From the VPC list, select the required VPC.
  6. From the Subnets list, select the required subnet.

    Note: Ensure the subnet can connect to the ESA IP address.

  7. From the Security groups list, select the group that allows egress to the ESA IP address.
  8. To persist the changes, click Save.

    Note: Attaching a Lambda function to a VPC without any NAT or endpoints can result in the Lambda function being unable to call the AWS APIs including the Kinesis stream.

Adding a Trigger to the Kinesis Stream

  1. To add a trigger to the Kinesis stream, click the Triggers tab.
  2. Click Add trigger.
  3. From the Trigger configuration list, select the source as Kinesis.
  4. From the Kinesis stream list, select the required stream.
  5. In the Batch size box, enter 200.
  6. In the Batch window box, enter any value between 1 and 5.
  7. Click Add.
  8. To configure the retry behavior, navigate to the Lambda page.
  9. Click Event source mappings.
  10. Click the required Kinesis trigger.
  11. Click the Configuration tab.
  12. Enable the Bisect batch on function error feature.
  13. Set the Maximum retry attempts to 10 or more.
  14. Set the Maximum record age to a longer duration.
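
The trigger settings above can also be captured as data. The following Python sketch uses a hypothetical function name and returns a dictionary whose keys match the parameter names of the AWS Lambda CreateEventSourceMapping API. The starting position of LATEST and the default record age of 86400 seconds are assumptions for illustration, not values from this guide.

```python
def kinesis_trigger_settings(stream_arn: str, function_name: str,
                             batch_size: int = 200,
                             batch_window_sec: int = 5,
                             max_retries: int = 10,
                             max_record_age_sec: int = 86400) -> dict:
    """Return event source mapping settings for the Kinesis trigger."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not 1 <= batch_window_sec <= 5:
        raise ValueError("batch_window_sec should be between 1 and 5")
    if max_retries < 10:
        raise ValueError("max_retries should be 10 or more")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",  # assumption, not stated in this guide
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": batch_window_sec,
        "BisectBatchOnFunctionError": True,
        "MaximumRetryAttempts": max_retries,
        "MaximumRecordAgeInSeconds": max_record_age_sec,
    }
```

With boto3 available, the dictionary could be passed as keyword arguments to the Lambda client's create_event_source_mapping call.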

Providing the CA.pem File to the Lambda Function

The CA.pem file must be provided to the Lambda function. The Curl component requires these certificates for TLS verification. The recommended secure approach is to store the CA.pem file in AWS Secrets Manager.

Downloading the CA.pem File

  1. Log in to the ESA through a terminal having the required permissions.

  2. Navigate to the /etc/ksa/certificates/plug/ directory.

  3. Download the CA.pem file from this directory.

  4. After the certificate is downloaded, open the PEM file in any text editor.

  5. Replace every line break with the escaped newline sequence \n.

  6. To escape the line breaks from the command line, use one of the following commands depending on the operating system:

    For Linux:

    awk 'NF {printf "%s\\n",$0;}' CA.pem > output.txt
    

    For Windows PowerShell:

    (Get-Content '.\CA.pem') -join '\n' | Set-Content 'output.txt'
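
    On a machine where only Python is available, the same escaping can be done with a short sketch such as the following. The function names are hypothetical. The escape_pem function mirrors the awk command above by skipping blank lines and appending a literal \n after each remaining line, and unescape_pem shows how a consumer can restore the original line breaks.

```python
def escape_pem(pem_text: str) -> str:
    """Collapse a PEM file into one line with literal \\n separators."""
    lines = [line for line in pem_text.splitlines() if line.strip()]
    return "".join(line + "\\n" for line in lines)


def unescape_pem(escaped: str) -> str:
    """Turn literal \\n sequences back into real line breaks."""
    return escaped.replace("\\n", "\n")


if __name__ == "__main__":
    with open("CA.pem") as source, open("output.txt", "w") as target:
        target.write(escape_pem(source.read()))
```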
    

Storing the Certificates

  1. Log in to the AWS console.
  2. Navigate to the Secrets Manager page.
  3. Click Store a new secret.
  4. Under Secret type, select Other type of secret.
  5. In the Key box, enter ca_pem.
  6. In the Value box, enter the contents of the CA.pem file.
  7. Click Next.
  8. Enter a name to identify the secret.
  9. Click Next.
  10. Click Store.
  11. Note the Secret ARN.

Setting up the Lambda Function

To set up the Lambda function:

  1. Log in to the AWS console.
  2. Navigate to the Lambda page.
  3. Click the required function.
  4. Click the Code tab.
  5. Click the lambda_function.py file.
  6. Paste the code from the lambda_function.py file that was generated after executing the configurator script.
  7. Click Deploy.
  8. Click the Configuration tab.
  9. From the left pane, click Permissions.
  10. Click the Role name to open the Role page.
  11. From the Add permissions list, select Create inline policy.
  12. Under Policy editor, select JSON.
  13. Paste the following policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowGetSpecificSecret",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret"
          ],
          "Resource": "arn:aws:secretsmanager:<region_name>:<Account_ID>:secret:<secret_name>"
        }
      ]
    }
    
  14. Click Next.
  15. In the Policy name box, enter a name for the policy.
  16. Click Create.
  17. Navigate to the Lambda page.
  18. Click the required function.
  19. From the left pane, click Environment variables.
  20. Click Edit and add the following variables in the key:value format:
    ESA_BULK_URL = https://<ESA_IP_Address>:9200/pty_insight_audit/_bulk?pipeline=logs_pipeline
    ESA_CA_SECRET_ID = <ARN_of_the_Secret_from_Secret_Manager>
    ESA_CA_SECRET_JSON_KEY = ca_pem
    ONLY_MATCH_SUBSTRING = "logtype" (optional extra filter)
    BULK_MAX_BYTES = 5242880 (5MB)
    HTTP_TIMEOUT_SEC = 120
    
  21. To persist the changes, click Save.
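
For orientation, the following Python sketch shows one way a function could read these variables. It is not taken from the generated lambda_function.py file. The ForwarderConfig class and the load_config function are hypothetical, and the fallback values are assumptions that simply repeat the sample values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ForwarderConfig:
    bulk_url: str
    ca_secret_id: str
    ca_secret_json_key: str
    only_match_substring: Optional[str]
    bulk_max_bytes: int
    http_timeout_sec: int


def load_config(env: Mapping[str, str] = os.environ) -> ForwarderConfig:
    """Read the forwarder settings, failing fast if a required variable is missing."""
    return ForwarderConfig(
        bulk_url=env["ESA_BULK_URL"],              # required
        ca_secret_id=env["ESA_CA_SECRET_ID"],      # required
        ca_secret_json_key=env.get("ESA_CA_SECRET_JSON_KEY", "ca_pem"),
        only_match_substring=env.get("ONLY_MATCH_SUBSTRING") or None,
        bulk_max_bytes=int(env.get("BULK_MAX_BYTES", "5242880")),
        http_timeout_sec=int(env.get("HTTP_TIMEOUT_SEC", "120")),
    )
```

A missing ESA_BULK_URL or ESA_CA_SECRET_ID raises a KeyError at startup, which surfaces a configuration mistake in the first invocation instead of during a later bulk call.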

Troubleshooting

Validate each hop before moving to the next. Most issues are isolated to one hop.

Verify logs are reaching CloudWatch (EMR → CloudWatch)

Where to check:

  • CloudWatch Logs → Log groups → /aws/<log_group_name>
  • Open the latest log stream.

What to check:

  • New log events should appear while the EMR Serverless job is running.
  • If you do not see new events, the problem is upstream (EMR monitoring config or EMR execution role permissions).

If this fails:

  • Confirm the EMR Serverless job run has CloudWatch logging enabled.
  • Confirm the execution role attached to the job/application has permissions to write to the log group/streams.

Verify CloudWatch Subscription Filter is configured (CloudWatch → Kinesis)

Where to check:

  • CloudWatch Logs → Log groups → /aws/<log_group_name> → Subscription filters

What to check:

  • A subscription filter exists.
  • Destination is the correct Kinesis Data Stream.
  • The filter pattern matches your logs.

Recommended test:

  • Temporarily set a permissive filter (for testing):
    • Match all: ""
    • Or minimal match: “logtype”
  • Save and observe whether data begins flowing into Kinesis.

If this fails:

  • Most common cause is IAM permissions for CloudWatch Logs to write records into Kinesis (destination access role / resource policy).

Verify Kinesis is receiving events (Kinesis ingestion)

Where to check:

  • Kinesis → Data streams → <stream_name> → Monitoring

What to check:

  • IncomingRecords should be greater than 0 during active logging.
  • IncomingBytes should also increase.

If this fails:

  • CloudWatch subscription filter is not delivering. Possible causes can include incorrect stream, incorrect filter pattern, or missing permissions.

Verify Lambda Function is triggered (Kinesis → Lambda)

Where to check:

  • Lambda → <function_name> → Configuration → Triggers
  • Lambda → Monitor

What to check:

  • Kinesis trigger exists and is Enabled.
  • Monitor metrics:
    • Invocations should increase.
    • Errors should be 0 (or very low).

If this fails:

  • Trigger/event source mapping may be disabled, misconfigured, or pointing to the wrong stream.

Validate Lambda processing and payload (Lambda internal validation)

Where to check:

  • CloudWatch Logs → Log groups → /aws/lambda/<function_name>

What to check:

  • Confirm Lambda is actually parsing events:
    • docs_seen= should be > 0
    • bulk_calls= should be >= 1 when data exists
  • Confirm outbound calls:
    • Log should show ESA HTTP status=200
    • ESA bulk response should not show errors:true

Common failure patterns:

  • TLS/CA errors
    • NO_CERTIFICATE - indicates the CA.pem file loaded from Secrets Manager is empty/malformed.
    • CERTIFICATE_VERIFY_FAILED - indicates incorrect CA chain or wrong certificate for the ESA endpoint.
  • Filtering too strict
    • If docs_seen=0, your ONLY_MATCH_SUBSTRING or JSON-line parsing is skipping everything.

Validate ESA ingestion (Lambda → ESA)

Where to check:

  • Lambda log output for ESA bulk response.
  • ESA/OpenSearch logs (if accessible).
  • Index / pipeline configuration.

What to check:

  • Bulk response should show:
    • errors: false
    • Successful item status (2xx)
  • If errors: true, inspect first error item:
    • Strict mapping exceptions indicate you are sending fields that are not allowed by index mapping.
    • Pipeline errors indicate ingest pipeline expects different fields or types.
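
Inspecting the first error item can be automated. The following Python sketch uses a hypothetical function name and assumes the standard OpenSearch bulk response layout, in which each entry of items is keyed by the action name and failed entries carry an error object with type and reason fields.

```python
from typing import Optional


def first_bulk_error(response: dict) -> Optional[dict]:
    """Return a summary of the first failed item in a bulk response, or None."""
    if not response.get("errors"):
        return None
    for item in response.get("items", []):
        for action, result in item.items():
            error = result.get("error")
            if error:
                return {
                    "action": action,
                    "status": result.get("status"),
                    "type": error.get("type"),
                    "reason": error.get("reason"),
                }
    return None
```

Logging the returned type and reason is usually enough to tell a mapping problem from a pipeline problem.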

Quick Diagnosis Rules

  • CloudWatch log streams have events, but Kinesis IncomingRecords=0 → Subscription filter / IAM permissions / wrong destination stream.
  • Kinesis has IncomingRecords>0, but Lambda Invocations=0 → Kinesis trigger (event source mapping) disabled/misconfigured.
  • Lambda invokes, but ESA is not receiving logs → TLS/CA issue, ESA bulk endpoint issue, pipeline/mapping errors, or filter logic dropping events.
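
The rules above check the hops in order, so they can be expressed as a small decision function. The following Python sketch is illustrative only. The function name and its inputs are hypothetical, and the values would come from the CloudWatch, Kinesis, and Lambda metrics described in this section.

```python
def diagnose(log_events_seen: bool, incoming_records: int,
             lambda_invocations: int, esa_receiving: bool) -> str:
    """Return the first hop that looks broken, checking from EMR towards the ESA."""
    if not log_events_seen:
        return "EMR -> CloudWatch: check job logging configuration and execution role permissions"
    if incoming_records == 0:
        return "CloudWatch -> Kinesis: check subscription filter, IAM permissions, destination stream"
    if lambda_invocations == 0:
        return "Kinesis -> Lambda: check that the trigger (event source mapping) is enabled and correct"
    if not esa_receiving:
        return "Lambda -> ESA: check TLS/CA, bulk endpoint, pipeline/mapping errors, filter logic"
    return "All hops healthy"
```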

3.3 - Performing URP Operations

The instructions in this section are applicable only for the Serverless approach.

The Big Data Protector on the EMR Serverless architecture provides the following approaches to perform URP operations:

  • AWS Web UI - operations using this approach return only the driver logs.
  • AWS CLI - operations using this approach return both the driver and executor logs.

Creating the EMR Serverless Application for Spark

  1. Log in to the AWS console.
  2. Navigate to the EMR page.
  3. From the left pane, click EMR Serverless.
  4. Under Manage applications, select the required EMR studio.
  5. Click Manage applications.
  6. Click Create application.
  7. Under Application settings, specify a value for the following:
    1. Name
    2. Type
    3. Release version
  8. Under Application setup options, select the Use custom settings option.
  9. Under Custom image settings, select the Use the custom image with this application check box.
  10. Browse and select the required image from the Elastic Container Repository.
  11. Under Application logs and metrics, select the Deliver logs to Amazon CloudWatch check box.
  12. In the Log group name box, enter the name for the CloudWatch Log group. The name must be the same as that of the group created to fetch logs from the application.
  13. Under Interactive endpoint, select the Enable endpoint for EMR studio check box to analyze data in Jupyter notebooks on EMR Serverless. This is optional.
  14. Under Network connections, from the Virtual private cloud (VPC) list, select the required VPC.
  15. Select the required Subnets and the Security groups.
  16. Under Application behavior, set the required time to stop the application.
  17. Click Create and start application.

Submitting a Spark Job

  1. Create a Spark script using Protegrity functions.
  2. Upload the Spark script to the S3 bucket.
  3. Using the AWS CLI/CloudShell, submit the job. A sample command is listed below.
    aws emr-serverless start-job-run \
      --region <region_name> \
      --application-id <application_id> \
      --execution-role-arn arn:aws:iam::<Account_ID>:role/EMR-Servlerless-Execution-Role \
      --job-driver '{
        "sparkSubmit": {
          "entryPoint": "s3://<script_path>/<script_name>.py"
        }
      }' \
      --configuration-overrides '{
        "monitoringConfiguration": {
          "cloudWatchLoggingConfiguration": {
            "enabled": true,
            "logGroupName": "<log_group_name>",
            "logStreamNamePrefix": "emrs",
            "logTypes": {
              "SPARK_DRIVER": ["STDOUT","STDERR"],
              "SPARK_EXECUTOR": ["STDOUT","STDERR"]
            }
          }
        }
      }'