1 - Setting up for the Bootstrap Installer

Prepare the system for using the Bootstrap Installer

The procedures mentioned in this section are applicable only for the Bootstrap installer approach to prepare the environment for the Big Data Protector.

1.1 - Verifying the prerequisites

Verifying the Prerequisites for Installing the Big Data Protector

The content mentioned in this section is applicable only for the Bootstrap approach to install the Big Data Protector.

Ensure that the following prerequisites are met before installing the Big Data Protector on an Amazon EMR cluster:

  • Familiarity with the following is recommended:
    • The Amazon EMR environment
    • Storage bucket, used to store the Big Data Protector installation files
    • Bootstrap Action, used to invoke the installation of Big Data Protector
    • Amazon Virtual Private Cloud (VPC)
  • An ESA appliance v10.x.x is installed and running.
  • An S3 bucket is available to copy the Big Data Protector installation files, which are created using the Configurator script.

    For more information about creating an S3 bucket, refer to the Amazon documentation for creating the S3 bucket.

  • The following table depicts the list of ports that are configured on ESA and the nodes in the cluster, which will run the Big Data Protector:
    | Destination Port No. | Protocol | Source | Destination | Description |
    |---|---|---|---|---|
    | 8443 | TCP | RPAgent on the Big Data Protector cluster node | ESA | The RPAgent communicates with ESA through port 8443 to download a Policy. |
    | 9200 | TCP | Log Forwarder on the Big Data Protector cluster node | Protegrity Audit Store appliance | The Log Forwarder sends all the logs to the Protegrity Audit Store appliance through port 9200. |
    | 15780 | TCP | Protector on the Big Data Protector cluster node | Log Forwarder on the Big Data Protector cluster node | The Big Data Protector writes Audit Logs to localhost through port 15780. The RPAgent Application Logs are also written to localhost through port 15780. The Log Forwarder reads the logs from that socket. |
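
The port requirements above can be spot-checked from a cluster node before installation. The following is a minimal sketch, not part of the product: the function name check_port and the example hostnames are hypothetical, and it assumes that bash (with /dev/tcp support) and the coreutils timeout command are available on the node.

```shell
# Sketch: report whether a TCP port accepts connections from this node.
# check_port is a hypothetical helper, not a Protegrity utility.
check_port() {
  local host="$1" port="$2"
  # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>;
  # timeout stops the attempt after 3 seconds if the port is filtered.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Example calls (hostnames are placeholders):
#   check_port esa.example.internal 8443     # RPAgent -> ESA
#   check_port audit.example.internal 9200   # Log Forwarder -> Audit Store
#   check_port localhost 15780               # Protector -> local Log Forwarder
```

The check on port 15780 is only meaningful after the Log Forwarder is installed and running on the node, because that port is served locally.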

1.2 - Extracting the Big Data Protector Package

The steps mentioned in this section are applicable only for the Bootstrap approach to install the Big Data Protector.

After receiving the Big Data Protector installation package from Protegrity, copy it to any Amazon EC2 instance or any node that has connectivity to the ESA.

After downloading the Big Data Protector package, extract it so that you can:

  1. Access the Configurator script.
  2. Install the Big Data Protector on all the nodes in an Amazon EMR cluster.

To extract the Configurator script from the installation package:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.

  2. Copy the Big Data Protector package BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz to any directory.

    For example, /opt/protegrity/.

  3. To extract the contents of the package, run the following command:

    tar -xvf BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz
    
  4. Press ENTER.

    The command extracts the installer package and the signature files.

    BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz
    signatures/
    signatures/BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz_<BDP_version>.sig
    

    Verify the authenticity of the build using the signatures folder. For more information, refer to Verification of Signed Protector Build.

  5. To extract the configurator script, run the following command:

    tar -xvf BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz
    
  6. Press ENTER.

    The command extracts the configurator script.

    BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh
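
The listing in step 4 shows that the inner archive has the same file name as the package that contains it. One way to keep the two apart is to extract each level into its own directory. The following is a sketch under that assumption: extract_into and the directory names outer and inner are hypothetical, and only the standard tar options -t, -x, -f, and -C are used.

```shell
# Sketch: verify that an archive is readable, then extract it into a
# chosen directory. extract_into is a hypothetical helper.
extract_into() {
  local archive="$1" dest="$2"
  mkdir -p "$dest" || return 1
  # Listing the members first fails early on a missing or corrupt archive.
  tar -tf "$archive" >/dev/null || return 1
  tar -xf "$archive" -C "$dest"
}

# Example (placeholders kept as in the steps above):
#   PKG='BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz'
#   extract_into "$PKG" ./outer            # installer package and signatures/
#   extract_into "./outer/$PKG" ./inner    # configurator script
```

With this layout the original download is never overwritten, so the signature files can still be checked against it afterwards.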
    

1.3 - Executing the Configurator Script

The steps mentioned in this section are applicable only for the Bootstrap approach to install the Big Data Protector.

Execute the configurator script to create the installation files for installing the Big Data Protector on an Amazon EMR cluster. You can install the Big Data Protector on an Amazon EMR cluster using either of the following methods:

  • New EMR cluster: The configurator script will:
    • Download the certificates and key encryption files from ESA.
    • Create the Big Data Protector installation files for a new EMR cluster.
    • Create the bootstrap installer and classpath configurator script for a new EMR cluster.
    • Copy the Big Data Protector installation files, bootstrap installer, and the classpath configurator script to the S3 bucket.
  • Existing EMR cluster: The configurator script will generate the installation package to install the Big Data Protector on an existing EMR cluster.

To execute the configurator script:

  1. Log in to the staging environment.

  2. Navigate to the directory that contains the BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh script.

  3. To execute the configurator script, run the following command:

    ./BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh
    
  4. Press ENTER.

    The prompt to continue the installation of the Big Data Protector appears.

    ***********************************************************************
         Welcome to the Big Data Protector Configurator Wizard
    ***********************************************************************
    This will create the Big Data Protector Installation files for AWS EMR.
    Do you want to continue? [yes or no]:
    
  5. To continue, type yes.

  6. Press ENTER.

    The prompt to create the Big Data Protector installation package, depending on the EMR cluster, appears.

    Protegrity Big Data Protector Configurator started...
    
    Enter the EMR cluster for which the Big Data Protector installation package needs to be created:
    [ 1 ] : New EMR Cluster
    [ 2 ] : Existing EMR cluster
    [ 1 or 2 ]:
    
  7. Depending on your requirement, select any one of the following options:

    • To create the Big Data Protector installation package for a new EMR cluster, type 1.
    • To generate the Big Data Protector installation package, in a local directory, for an existing EMR cluster, type 2.
      For more information about installing the Big Data Protector on an existing EMR cluster, refer to Using the Static Installer.
  8. To create the Big Data Protector installation package for a new EMR cluster, type 1.

  9. Press ENTER.

    The prompt to enter the S3 URI to upload the Big Data Protector installation files appears.

    Generating Big Data Protector for a new EMR cluster......
    Enter the S3 URI where the BDP Installation files are to be uploaded.
    (E.g. s3://examplebucket/folder):
    
  10. Type the path of the S3 storage bucket.

    Ensure that the path of the S3 storage bucket is in the following format:

    s3://<bucket_name>/<folder_in_the_bucket>
    

    where,

    • <bucket_name> - specifies the name of the storage bucket.
    • <folder_in_the_bucket> - specifies the directory within the bucket.
  11. Press ENTER.

    The prompt to either upload the installation files to the S3 bucket or generate them locally appears.

    Choose one option among the following for BDP Installation files:
    [1] -> Upload files to 's3://<bucket_name>/<folder_in_the_bucket>' S3 URI.
    [2] -> Generate files locally to current working directory. (You would have to manually upload the files to the specified S3 URI)
    [ 1 or 2 ]:
    
  12. To upload the installation files to the S3 storage bucket, type 1.

  13. Press ENTER.

    The prompt to select the type of AWS access key appears.

    Choose the Type of AWS Access Keys from the following options:
    [1] -> IAM User Access Keys (Permanent access key id & secret access key)
    [2] -> Temporary Security Credentials (Temporary access key id, secret access key & session token)
    [ 1 or 2 ]:
    
  14. Depending on the type of AWS Access Keys you want to use, type 1 or 2. For example, to use the temporary security credentials, type 2.

  15. Press ENTER.

    The prompt to enter the access key ID appears.

    Enter the Access Key ID:
    
  16. Enter the access key ID.

  17. Press ENTER.

    The prompt to enter the secret access key appears.

    Enter the Secret Access Key:
    
  18. Enter the secret access key.

  19. Press ENTER.

    The prompt to enter the security session token appears.

    Enter the Security Session Token:
    
  20. Enter the Security Session Token.

  21. Press ENTER.

    The prompt to enter ESA hostname or IP address appears.

    Enter the ESA Hostname/IP Address:
    
  22. Enter the hostname or the IP address of ESA.

  23. Press ENTER.

    The prompt to enter the listening port for ESA appears.

    Enter ESA host listening port [8443]:
    
  24. Enter the listening port for ESA.

    Alternatively, to use the default listening port, press ENTER.

  25. Press ENTER.

    The prompt to enter the JWT token appears.

    If you have an existing ESA JSON Web Token (JWT) with Export Certificates role, enter it otherwise enter 'no':
    
  26. Enter the JWT. If you do not have one, type no.

  27. Press ENTER.

    The prompt to select the audit store type appears.

    Select the Audit Store type where Log Forwarder(s) should send logs to.
    
    [ 1 ] : Protegrity Audit Store
    [ 2 ] : External Audit Store
    [ 3 ] : Protegrity Audit Store + External Audit Store
    
    Enter the no.:
    
  28. Depending on the Audit Store type, select any one of the following options:

    | Option | Description |
    |---|---|
    | 1 | To use the default setting using the Protegrity Audit Store appliance, type 1. If you enter 1, then the default Fluent Bit configuration files are used and Fluent Bit will forward the logs to the Protegrity Audit Store appliances. |
    | 2 | To use an external audit store, type 2. If you enter 2, then the default Fluent Bit configuration files used for the Protegrity Audit Store (out.conf and upstream.cfg in the /opt/protegrity/fluent-bit/data/config.d/ directory) are renamed (out.conf.bkp and upstream.cfg.bkp) so that they will not be used by Fluent Bit. Additionally, the custom Fluent Bit configuration files for the external audit store are copied to the /opt/protegrity/fluent-bit/data/config.d/ directory. |
    | 3 | To use a combination of the default setting with an external audit store, type 3. If you enter 3, then the default Fluent Bit configuration files used for the Protegrity Audit Store (out.conf and upstream.cfg in the /opt/protegrity/fluent-bit/data/config.d/ directory) are not renamed. However, the custom Fluent Bit configuration files for the external audit store are copied to the /opt/protegrity/fluent-bit/data/config.d/ directory. |
  29. Press ENTER.

    The prompt to enter the comma-separated list of hostnames or IP addresses of the Protegrity Audit Store appears.

    Enter comma-separated list of Hostnames/IP Addresses and/or Ports of Protegrity Audit Store.
    Allowed Syntax: hostname[:port][,hostname[:port],hostname[:port]...] (Default Value - <ESA_IP_Address>:9200)
    Enter the list:
    
  30. Enter the comma-separated list of hostnames or IP addresses, with optional ports, in the syntax shown in the prompt.

  31. Press ENTER.

    The prompt to enter the local directory path that stores the custom Fluent Bit configuration file appears.

    Enter the local directory path on this node that stores the custom Fluent-Bit configuration files for External Audit Store:
    

    The configurator script will display this prompt only if you select option 2 or 3 in step 28. When you select option 2 or 3 in step 28, the custom configuration files are copied to the /<installation_directory>/fluent-bit/data/config.d/ directory during the execution of the bootstrap script on the EMR nodes.

  32. Enter the local directory path that stores the custom Fluent Bit configuration files.

  33. Press ENTER.

    The prompt to generate the application logs for the RPAgent appears.

    Do you want RPAgent's log to be generated in a file? [yes or no]:
    
  34. To generate the logs in a file, type yes.

  35. Press ENTER.

    The script generates the installation files and uploads them to the specified S3 bucket.

    RPAgent's log will be generated in a file.
    ************************************************************************************
                        Welcome to the RPAgent Setup Wizard.
    ************************************************************************************
    
    Unpacking...................
    Extracting files...
    Unpacked rpagent compressed file...
    Temporarily setting up rpagent directory structure on current node...
    Unpacking...
    Extracting files...
    Downloading certificates from <ESA_IP_Address>:8443...
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
    100 11264  100 11264    0     0   163k      0 --:--:-- --:--:-- --:--:--  164k
    
    Extracting certificates...
    Certificates successfully downloaded and stored in /<installation_dir>/rpagent/data
    
    Protegrity RPAgent installed in /<installation_dir>/rpagent.
    
    
    Retrieving the S3 bucket's AWS Region via AWS S3 REST API...
    Successfully retrieved S3 bucket's AWS region: <AWS_region_name>
    
    
    Started Uploading the generated installation files via AWS S3 REST API......
    
    Uploading bdp_bootstrap_installer.sh to the S3 bucket.
    File uploaded to s3://<bucket_name>/<folder_in_the_bucket>/bdp_bootstrap_installer.sh
    
    Uploading bdp_classpath_configurator.py to the S3 bucket.
    File uploaded to s3://<bucket_name>/<folder_in_the_bucket>/bdp_classpath_configurator.py
    
    Uploading BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz to the S3 bucket.
    File uploaded to s3://<bucket_name>/<folder_in_the_bucket>/BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz
    
    Successfully Uploaded BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz, bdp_bootstrap_installer.sh, bdp_classpath_configurator.py to S3 bucket 's3://<bucket_name>/<folder_in_the_bucket>'
    
    Successfully Generated installation files at ./Installation_Files/ directory.
    
    Successfully configured Big Data Protector for a new EMR cluster..
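
Two of the values requested in the steps above have a fixed format: the S3 URI in step 10 and the Audit Store host list in step 30. The following sketch checks both formats before they are typed into the configurator. The helper names is_s3_uri and is_host_list are hypothetical, the bucket pattern follows the general Amazon S3 bucket naming rules (3 to 63 lowercase letters, digits, dots, or hyphens), and IPv6 addresses are not handled.

```shell
# Sketch: format checks for two configurator inputs.
# Both helpers are hypothetical and return 0 (valid) or 1 (invalid).

# Expected form: s3://<bucket_name>/<folder_in_the_bucket>
is_s3_uri() {
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^s3://[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]/[^[:space:]]+$'
}

# Expected form: hostname[:port][,hostname[:port],...]
is_host_list() {
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^[A-Za-z0-9.-]+(:[0-9]{1,5})?(,[A-Za-z0-9.-]+(:[0-9]{1,5})?)*$'
}

# Example:
#   is_s3_uri 's3://examplebucket/folder' && echo "S3 URI looks valid"
#   is_host_list '10.0.0.5:9200,audit2.example.com' && echo "host list looks valid"
```

If the AWS CLI happens to be installed on the staging machine, listing the target folder with aws s3 ls after the run is one way to confirm that the three uploaded files named in the output are present.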
    

2 - Setting up for the Static Installer

Prepare the system for using the Static Installer

The procedures mentioned in this section are applicable only for the Static installer approach to prepare the environment for the Big Data Protector.

2.1 - Verifying the prerequisites for Static Installer

Verifying the Prerequisites for Installing the Big Data Protector using the Static Installer

The content mentioned in this section is applicable only for the Static installer approach to install the Big Data Protector.

Ensure that the following prerequisites are met before installing the Big Data Protector:

  • The EMR cluster is installed, configured, and running.

  • The ESA v10.0.x instance is installed, configured, and running.

  • The static installer for EMR uses utilities such as pssh (parallel ssh) and pscp (parallel scp). These utilities require Python to be installed on the Primary node. To verify whether Python is installed on the Primary node, run the following command:

    /usr/bin/env python --version
    

    The command returns the version of Python installed on the system.

    If Python is not detected on the Primary node, then install a compatible version of Python on it (preferably Python 3.x). Ensure that the utilities are able to detect the version of Python using the following command:

    /usr/bin/env python
    
  • A sudoer user account with privileges to perform the following tasks:

    • Update the system by modifying the configuration, permissions, or ownership of directories and files.
    • Perform third party configuration.
    • Create directories and files.
    • Modify the permissions and ownership for the created directories and files.
    • Set the required permissions to the created directories and files for the Protegrity Service Account.
    • Permissions for using the SSH service.
  • The following user accounts are present to perform the required tasks:

    • ADMINISTRATOR_USER: It is the sudoer user account that is responsible for installing and uninstalling the Big Data Protector on the cluster. This user account must have sudo access to install the product.
    • EXECUTOR_USER: It is a user that has ownership of all Protegrity files, directories, and services.
    • OPERATOR_USER: It is responsible for performing tasks such as starting or stopping tasks, monitoring services, updating the configuration, and maintaining the cluster while the Big Data Protector is installed on it. If you want to start, stop, or restart the Protegrity services, then you require sudoer privileges for this user to impersonate the EXECUTOR_USER.
    • Depending on the requirements, a single user on the system may perform multiple roles. If a single user is performing multiple roles, then ensure that the following conditions are met:
      • The user has the required permissions and privileges to impersonate the other user accounts, for performing their roles, and perform tasks as the impersonated user.
      • The user is assigned the highest set of privileges, from the required roles that it needs to perform, to execute the required tasks. For example, if a single user is performing tasks as ADMINISTRATOR_USER, EXECUTOR_USER, and OPERATOR_USER, then ensure that the user is assigned the privileges of the ADMINISTRATOR_USER.
  • A Private Key file (.pem file) for the sudoer user, which is used for enabling key-based authentication, and for communicating with all the nodes in the EMR cluster, is present on the Master node.

  • Because key-based authentication for the sudoer user is required for installing and using the Big Data Protector on the EMR cluster, ensure that the ADMINISTRATOR_USER or OPERATOR_USER has the value of the NOPASSWD parameter set to ALL in the sudoers file.

  • The management scripts provided by the installer in the cluster_utils directory should be run only by the user (OPERATOR_USER) having privileges to impersonate the EXECUTOR_USER.

    • If the value of the AUTOCREATE_PROTEGRITY_IT_USR parameter in the BDP.config file is set to No, then ensure that a service group containing a user for running the Protegrity services on all the nodes in the cluster already exists.
    • If the Hadoop cluster is configured with AD or LDAP for user management, then ensure that the AUTOCREATE_PROTEGRITY_IT_USR parameter in the BDP.config file is set to No and that the required service account user is created on all the nodes in the cluster.
  • The following table lists the ports required for the EMR cluster:

    | Destination Port No. | Protocol | Source | Destination | Description |
    |---|---|---|---|---|
    | 8443 | TCP | RPAgent on the Big Data Protector cluster node | ESA | The RPAgent communicates with ESA through port 8443 to download a Policy. |
    | 9200 | TCP | Log Forwarder on the Big Data Protector cluster node | Protegrity Audit Store appliance | The Log Forwarder sends all the logs to the Protegrity Audit Store appliance through port 9200. |
    | 15780 | TCP | Protector on the Big Data Protector cluster node | Log Forwarder on the Big Data Protector cluster node | The Big Data Protector writes Audit Logs to localhost through port 15780. The RPAgent Application Logs are also written to localhost through port 15780. The Log Forwarder reads the logs from that socket. |
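
Some of the prerequisites above can be checked from a shell on the Primary node. The following is a rough sketch only: has_cmd, can_sudo_nopasswd, and report_prereqs are hypothetical helper names, and the sudoers line in the comment shows the usual sudoers syntax rather than a line taken from the product.

```shell
# Sketch: quick prerequisite report for the static installer.

# Succeeds when a command can be resolved through PATH.
has_cmd() { command -v "$1" >/dev/null 2>&1; }

# Succeeds when the current user can run sudo without a password prompt.
# A typical sudoers entry that allows this looks like:
#   <ADMINISTRATOR_USER> ALL=(ALL) NOPASSWD: ALL
can_sudo_nopasswd() { sudo -n true 2>/dev/null; }

report_prereqs() {
  local cmd
  for cmd in python ssh; do
    if has_cmd "$cmd"; then
      echo "$cmd: found"
    else
      echo "$cmd: MISSING"
    fi
  done
  if can_sudo_nopasswd; then
    echo "sudo: passwordless"
  else
    echo "sudo: needs a password or is not permitted"
  fi
}

# Example:
#   report_prereqs
```

The check uses the name python, matching the /usr/bin/env python command above; on systems that only provide python3, it reports python as missing even though Python 3 is installed.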

2.2 - Extracting the Installation Package

Extracting the Installation Package for the Static Installer

The steps mentioned in this section are applicable only for the Static installer approach to install the Big Data Protector.

To extract the files from the installation package:

  1. Ensure that the installation package BigDataProtector_Linux-ALL-64_x86-64_EMR-<emr_version>-64_<BDP_version>.tgz is copied to the Master node on the EMR cluster in any temporary directory, such as /opt/protegrity/.

  2. To extract the files from the installation package, run the following command:

    tar -xvf BigDataProtector_Linux-ALL-64_x86-64_EMR-<emr_version>-64_<BDP_version>.tgz

  3. Press ENTER. The command extracts the following files:

    uninstall.sh
    ptyLogAnalyzer.sh
    ptyLog_Consolidator.sh
    PepHbaseProtector<HBase_version>Setup_Linux_emr-<emr_version>_<BDP_version>.sh
    bdp_classpath_deconfigurator.py
    PepSpark<Spark_version>Setup_Linux_emr-<emr_version>_<BDP_version>.sh
    JcoreLiteSetup_Linux_x64_<JcoreLite_version>.gadcc.release-<BDP_version>.sh
    PepPig<pig_version>Setup_Linux_emr-<emr_version>_<BDP_version>.sh
    bdp_common/
    bdp_common/bdp.properties.template
    bdp_common/config.ini.template
    Logforwarder_Setup_Linux_x64_<core_version>.sh
    node_uninstall.sh
    bdp_classpath_configurator.py
    RPAgent_Setup_Linux_x64_<core_version>.sh
    PepMapreduce<MapReduce_version>Setup_Linux_emr-<emr_version>_<BDP_version>.sh
    PepHive<Hive_version>Setup_Linux_emr-<emr_version>_<BDP_version>.sh
    BDP.config
    BdpInstallx.x.x_Linux_<BDP_version>.sh
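
After extraction, it can be useful to confirm that the files needed in the next steps are present. The following sketch uses a hypothetical helper, missing_files, which prints every expected name that does not exist in a directory. The names passed in the example are taken from the listing above.

```shell
# Sketch: print the expected files that are absent from a directory.
# missing_files is a hypothetical helper; it prints nothing when all exist.
missing_files() {
  local dir="$1" f
  shift
  for f in "$@"; do
    [ -e "$dir/$f" ] || echo "$f"
  done
}

# Example, run against the directory that holds the extracted package:
#   missing_files /opt/protegrity BDP.config uninstall.sh node_uninstall.sh \
#       bdp_classpath_configurator.py bdp_common/bdp.properties.template
```

Versioned file names such as the Pep*Setup scripts are left out of the example because their exact names depend on the package.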
    

2.3 - Updating the BDP.config File

Updating the BDP.config File for the Static Installer

The steps mentioned in this section are applicable only for the Static Installer approach to install the Big Data Protector.

Note: Ensure that the BDP.config file is updated before the Big Data Protector is installed.

Do not update the BDP.config file when the installation of the Big Data Protector is in progress.

To update the BDP.config file:

  1. Create a hosts file containing the IP addresses of all the nodes in the cluster, except the Lead node, and specify them in the BDP.config file.

    The installation script uses this file to install the Big Data Protector on the nodes.

  2. Open the BDP.config file in any text editor and modify the following parameter values:

    • HADOOP_DIR – is the installation home directory for the Hadoop distribution.

    • PROTEGRITY_DIR – is the directory where the Big Data Protector will be installed.

      The examples used in this document assume that the Big Data Protector is installed in the /opt/protegrity/ directory.

    • CLUSTERLIST_FILE – is the file that contains the host names or IP addresses of all the nodes in the cluster, except the Lead node, listing one host name or IP address per line.

      Ensure that you specify the file name with the complete path.

    • SPARK_PROTECTOR – Specifies one of the following values, as required:

      • Yes – Specifies to install the Spark protector. Set the value of this parameter to Yes, if the user wants to run Hive UDFs with Spark SQL, or use the Spark protector samples if the INSTALL_DEMO parameter is set to Yes.
      • No – Specifies to skip installing the Spark protector.
    • AUTOCREATE_PROTEGRITY_IT_USR – Determines the Protegrity service account. The service group and service user name specified in the PROTEGRITY_IT_USR_GROUP and PROTEGRITY_IT_USR parameters respectively will be created if this parameter is set to Yes. One of the following values can be specified, as required:

      • Yes – Instructs the installer to create the service group PROTEGRITY_IT_USR_GROUP containing the user PROTEGRITY_IT_USR for executing the Protegrity services on all the nodes in the cluster.

        If the service group or service user are already present, then the installer exits.

        If you uninstall the Big Data Protector, then the service group and the service user are deleted.

      • No – Instructs the installer to skip creating a service group PROTEGRITY_IT_USR_GROUP with the service user PROTEGRITY_IT_USR for executing the Protegrity services on all the nodes in the cluster.

    • PROTEGRITY_IT_USR_GROUP – is the service group required for running the Protegrity services on all the nodes in the cluster. All the Protegrity installation directories are owned by this service group.

    • PROTEGRITY_IT_USR – is the service account user required for running the Protegrity services on all the nodes in the cluster and is a part of the group PROTEGRITY_IT_USR_GROUP. All the Protegrity installation directories are owned by this service user.
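
The hosts file described in step 1 and referenced by CLUSTERLIST_FILE is a plain list with one host per line that leaves out the Lead node. The following sketch builds such a file. The helper name write_cluster_list, the output path, and the addresses in the example are hypothetical.

```shell
# Sketch: write one host per line to a file, skipping the Lead node.
# Usage: write_cluster_list <lead_host> <output_file> <host>...
write_cluster_list() {
  local lead="$1" out="$2" h
  shift 2
  : > "$out"                       # create or truncate the output file
  for h in "$@"; do
    [ "$h" = "$lead" ] || printf '%s\n' "$h" >> "$out"
  done
}

# Example (addresses are placeholders):
#   write_cluster_list 10.0.0.1 /opt/protegrity/cluster_hosts.txt \
#       10.0.0.1 10.0.0.2 10.0.0.3
```

The resulting file path, written in full, is the value that CLUSTERLIST_FILE expects.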

3 - Setting up for the EMR Serverless Installer

Prepare the system for using the EMR Serverless Installer

The procedures mentioned in this section are applicable only for the Serverless approach to prepare the environment for the Big Data Protector.

3.1 - Extracting the Big Data Protector Package

The steps mentioned in this section are applicable only for the Serverless approach to install the Big Data Protector.

After receiving the Big Data Protector installation package from Protegrity, copy it to any Amazon EC2 instance or any node that has connectivity to the ESA.

To extract the Configurator script from the installation package:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.

  2. Copy the Big Data Protector package BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz to any directory.

    For example, /opt/protegrity/.

  3. To extract the contents of the package, run the following command:

    tar -xvf BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz
    
  4. Press ENTER.

    The command extracts the installer package and the signature files.

    BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz
    signatures/
    signatures/BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz_<BDP_version>.sig
    

    Verify the authenticity of the build using the signatures folder. For more information, refer to Verification of Signed Protector Build.

  5. To extract the configurator script, run the following command:

    tar -xvf BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz
    
  6. Press ENTER.

    The command extracts the configurator script.

    BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh
    

3.2 - Executing the Configurator Script

The steps mentioned in this section are applicable only for the Serverless approach to install the Big Data Protector.

The Big Data Protector configurator script:

  1. Generates the config.json file.
  2. Generates the EMR Serverless deployment scripts.
  3. Provides the runtime artifacts and common utilities.
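
The values stored in config.json include an AWS Account ID, an ECR repository name, and a Docker image tag, and the configurator output later in this section states the rules for the first two. The following sketch checks those rules and the JSON syntax of a hand-edited file. All helper names are hypothetical, the image URI layout is the standard Amazon ECR form, and json_ok assumes that python3 is installed.

```shell
# Sketch: checks for values collected for EMR Serverless.
# All helpers are hypothetical and return 0 (valid) or 1 (invalid).

# AWS Account ID: exactly 12 digits.
is_account_id() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[0-9]{12}$'
}

# ECR repository name: 2-256 characters from lowercase letters, numbers,
# hyphens, underscores, and forward slashes (as listed in the prompt).
is_ecr_repo_name() {
  [ "${#1}" -ge 2 ] && [ "${#1}" -le 256 ] &&
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9/_-]+$'
}

# Standard ECR image URI; the tag defaults to "latest" as in the prompt.
ecr_image_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "${4:-latest}"
}

# JSON syntax check, useful after editing config.json in Silent Mode.
json_ok() { python3 -m json.tool "$1" >/dev/null 2>&1; }

# Example (values are placeholders):
#   is_account_id 123456789012 && is_ecr_repo_name bdp/spark &&
#     ecr_image_uri 123456789012 us-east-1 bdp/spark v1.0.0
#   json_ok ./config.json || echo "config.json is not valid JSON"
```

These checks cover format only. They do not confirm that the account, region, or repository exists.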

To execute the configurator script:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the script, run the following command:
    ./BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh
    
  4. Press ENTER.
    The Big Data Protector Configurator Wizard with the prompt to continue appears.
    ***********************************************************************
         Welcome to the Big Data Protector Configurator Wizard
    ***********************************************************************
    This will create the Big Data Protector Installation files for AWS EMR.
    
    Do you want to continue? [yes or no]:
    
  5. To continue, type yes.
  6. Press ENTER.
    The prompt to select the deployment type appears.
    Protegrity Big Data Protector Configurator started...
    Enter the EMR deployment type for Big Data Protector:
    [ 1 ] : New EMR Cluster (Bootstrap)
    [ 2 ] : Existing EMR Cluster (Static)
    [ 3 ] : EMR Serverless (Containerized)
    [ 1, 2, or 3 ]:
    
  7. To install the Big Data Protector using the Serverless approach, type 3.
  8. Press ENTER.
    The prompt to select the configuration mode appears.
    Generating Big Data Protector for EMR Serverless......
    
    ================================================================
        EMR Serverless - Configuration Setup
    ================================================================
    
    The EMR Serverless deployment requires configuration values to be
    stored in a config.json file. This file is used by Python scripts to:
    
    - Generate the Dockerfile with BDP components
    - Build and tag the Docker image
    - Push the image to AWS ECR
    - Configure certificate downloads from ESA
    
    You have two options to provide this configuration:
    
    ================================================================
    OPTION 1: Interactive Mode (Recommended)
    ================================================================
    - Guided prompts will collect all required information
    - Values are validated during input
    - config.json is automatically generated
    - Faster and less error-prone
    
    ================================================================
    OPTION 2: Silent Mode
    ================================================================
    - A template config.json file with placeholders is created
    - You manually edit the file and replace all placeholders
    - Useful if you prefer to script or automate configuration
    - Requires careful attention to JSON syntax
    
    ================================================================
    
    Select configuration mode:
    [ 1 ] : Interactive Mode (Guided prompts)
    [ 2 ] : Silent Mode (Edit config.json template)
    Enter your choice [1 or 2]:
    
  9. To use the interactive configuration mode, type 1.
  10. Press ENTER.
    The prompt to verify the prerequisites appears.
    [OK] Selected: Interactive Mode
    ================================================================
       EMR Serverless - Prerequisites Checklist
    ================================================================
    
    Before proceeding, please ensure you have the following information ready:
    
    [OK] ESA Configuration:
    - ESA Server Host/IP
    - ESA Port (default: 25400)
    - GetCertificates Port (default: 25400)
    - ESA Admin Username & Password (prompted during build)
    
    [OK] EMR Serverless Configuration:
    [1/6] EMR Release Label (e.g., emr-6.15.0, emr-7.0.0)
    [2/6] Runtime Selection (Spark or Hive)
    [3/6] AWS Account ID (12-digit number)
    [4/6] AWS Region (e.g., us-east-1, us-west-2)
    [5/6] ECR Repository Name (where Docker image will be stored)
    [6/6] Docker Image Tag (e.g., latest, v1.0.0)
    
    ================================================================
    
    Do you have all the required information to proceed? [yes/no]:
    
  11. If all the prerequisites are available, type yes.
  12. Press ENTER.
    The prompt to enter the ESA host name appears.
    [OK] Proceeding with interactive configuration...
    Enter the ESA Hostname/IP Address:
    
  13. Enter the ESA Hostname or IP address.
  14. Press ENTER.
    The prompt to enter the ESA listening port appears.
    Enter ESA host listening port [25400]:
    
  15. Enter the listening port.
  16. Press ENTER.
    The prompt to enter the GetCertificates port appears.
    Enter GetCertificates port [25400]:
    
  17. Enter the port to fetch the certificates from the ESA.
  18. Press ENTER.
    The prompt to enter the EMR release label appears.
    ================================================================
       EMR Serverless Configuration - Step by Step
    ================================================================
    
    ESA Server: <ESA_IP_Address>:<ESA_Port>
    GetCertificates Port: <ESA_Port>
    
    [1/6] EMR Release Label
    ------------------------------------------------------
    Specify the EMR release version you want to use.
    Note: Not all EMR versions have serverless images available.
    For available versions, visit AWS EMR Serverless documentation.
    Enter EMR Release Label (e.g., emr-7.12.0):
    
  19. Enter the EMR version.
  20. Press ENTER.
    The prompt to select the processing engine appears.
    [2/6] Runtime Selection
    ------------------------------------------------------
    Choose the processing engine for your EMR Serverless application.
    Spark: For data processing, ETL, and analytics
    Hive:  For SQL queries on large datasets
    
    Select Runtime:
    [ 1 ] : Spark
    [ 2 ] : Hive
    Enter your choice [1 or 2]:
    
  21. Type 1 for Spark or 2 for Hive, depending on the requirements.
  22. Press ENTER.
    The prompt to enter the AWS Account ID appears.
    [3/6] AWS Account ID
    ------------------------------------------------------
    Your 12-digit AWS Account ID is required to:
    • Access AWS ECR (Elastic Container Registry)
    • Identify your AWS resources
    
    Find it at: AWS Console > Account (top-right) > My Account
    Enter AWS Account ID (12 digits):
    
  23. Enter the AWS Account ID.
  24. Press ENTER.
    The prompt to enter the AWS region, where the EMR Serverless resources will be deployed, appears.
    [4/6] AWS Region
    ------------------------------------------------------
    Specify the AWS region where your EMR Serverless resources
    will be deployed (e.g., us-east-1, us-west-2, eu-west-1).
    
    Note:
    • Your ECR repository and EMR Serverless application must be in same region.
    
    Enter AWS Region (e.g., us-east-1):
    
  25. Enter the region name.
  26. Press ENTER.
    The prompt to enter the ECR Repository Name appears.
    [5/6] ECR Repository Name
    ------------------------------------------------------
    AWS ECR (Elastic Container Registry) repository where the
    BDP Docker image will be stored and pulled from.
    
    Repository naming rules:
    • Lowercase letters, numbers, hyphens, underscores, forward slashes
    • 2-256 characters long    
    Enter ECR Repository Name:
    
  27. Enter the ECR repository name.
  28. Press ENTER.
    The prompt to enter the Docker image tag appears.
    [6/6] Docker Image Tag
    ------------------------------------------------------
    Tag for the Docker image in ECR. This helps identify
    different versions of your BDP image.
    Enter Docker Image Tag [default: latest]:
    
  29. Enter the Docker image tag.
  30. Press ENTER.
    The script completes the EMR Serverless configuration.
    ================================================================
    [OK] EMR Serverless configuration completed successfully!
    ================================================================
    
    Generated config.json file successfully at /bdp/build/BigDataProtector/BigDataProtector/Installation_Files/config.json
    
    ================================================================
    [OK] Successfully configured Big Data Protector for EMR Serverless!
    ================================================================
    
    Generated Files in ./Installation_Files/ directory:
    - config.json                    - EMR Serverless configuration
    - scripts/                       - Python deployment CLIs
        +-- emr_serverless_setup_cli.py    - Main deployment CLI
        +-- lambda_function.py             - Lambda for ESA audit log forwarding
    - runtime/                       - BDP JAR files (Spark/Hive)
    - common/                        - JcoreLite, config.ini, GetCertificates.sh
    - BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz       - Complete package tarball
    
    ================================================================
    Using emr_serverless_setup_cli.py - Main Deployment Tool
    ================================================================
    
    This Python CLI provides commands to build and deploy BDP Docker images:
    
    AVAILABLE COMMANDS:
    validate            - Check prerequisites (Docker, AWS CLI, config.json)
    prepare-assets      - Update config.ini and GetCertificates.sh with ESA details
    generate-dockerfile - Create Dockerfile from config.json
    build               - Build Docker image locally (preserves manual edits)
    push                - Push existing image to AWS ECR
    deploy              - Full pipeline: validate -> prepare -> generate -> build -> push
    
    USAGE:
    cd ./Installation_Files/scripts
    python3 emr_serverless_setup_cli.py --config ../config.json <COMMAND>
    
    TYPICAL WORKFLOW:
    # Option 1: Full automated deployment
    python3 emr_serverless_setup_cli.py --config ../config.json deploy
    
    # Option 2: Step-by-step with manual edits
    python3 emr_serverless_setup_cli.py --config ../config.json validate
    python3 emr_serverless_setup_cli.py --config ../config.json prepare-assets
    python3 emr_serverless_setup_cli.py --config ../config.json generate-dockerfile
    # Manually edit Dockerfile if needed
    python3 emr_serverless_setup_cli.py --config ../config.json build
    python3 emr_serverless_setup_cli.py --config ../config.json push
    
    NOTES:
    - During 'deploy' or 'build', you'll be prompted for ESA credentials
    - Credentials are used during build only, NOT stored in image layers
    - ECR authentication is handled automatically by AWS CLI
    - Use 'build' command to preserve manual Dockerfile edits
    
    ================================================================
    Audit Logging Configuration
    ================================================================
    
    IMPORTANT: EMR Serverless uses stdout for audit log output.
    
    - All audit logs are written to standard output (stdout)
    - Logs are automatically captured by AWS CloudWatch Logs
    - CloudWatch logs are stored in your configured S3 bucket
    
    To access audit logs:
    1. Via CloudWatch: AWS Console -> CloudWatch -> Log Groups
    2. Via S3 Bucket: Check your EMR Serverless application's S3 logs location
    
    ================================================================
    lambda_function.py - ESA Audit Log Forwarder
    ================================================================
    
    For centralized audit log forwarding to ESA Audit Store, use the provided
    lambda_function.py - a ready-to-deploy AWS Lambda function.
    
    LOG FLOW:
    EMR Serverless (stdout) -> CloudWatch Logs -> Subscription Filter ->
    Kinesis Data Stream -> Lambda Function -> ESA OpenSearch Endpoint
    
    LAMBDA FUNCTION FEATURES:
    - Triggered by Kinesis Data Stream events
    - Decodes and parses CloudWatch log data from Kinesis records
    - Forwards logs to ESA using OpenSearch bulk API
    - TLS encryption with certificate-based authentication
    - Automatic batching, retries, and error recovery
    
    REQUIRED ENVIRONMENT VARIABLES:
    ESA_BULK_URL          - Full OpenSearch bulk API endpoint
                            Example: https://<ESA_IP_Address>:9200/pty_insight_audit/_bulk?pipeline=logs_pipeline
    ESA_CA_SECRET_ID      - AWS Secrets Manager ARN for CA certificate
    ESA_CA_SECRET_JSON_KEY- JSON key name in secret (default: ca_pem)
    HTTP_TIMEOUT_SEC      - HTTP timeout in seconds (default: 120)
    BULK_MAX_BYTES        - Max bulk request size (default: 5242880)
    ONLY_MATCH_SUBSTRING  - Filter logs by substring (e.g., "logtype")
    
    For detailed deployment steps, refer to the EMR Serverless documentation.
    
    ================================================================
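The log flow described in the output above can be illustrated with a short sketch. It is not the shipped lambda_function.py, whose internals are not shown here. It only demonstrates the general technique: CloudWatch Logs subscription data arrives in each Kinesis record as base64-encoded, gzip-compressed JSON, and the OpenSearch bulk API expects newline-delimited action and document lines. The function names, the shape of the forwarded document (`message` and `timestamp` fields), and the sample helper are assumptions made for illustration.

```python
import base64
import gzip
import json


def decode_kinesis_record(record):
    """Decode one Kinesis record that carries a CloudWatch Logs payload."""
    raw = base64.b64decode(record["kinesis"]["data"])
    return json.loads(gzip.decompress(raw))


def build_bulk_bodies(event, only_match_substring="", bulk_max_bytes=5242880):
    """Return NDJSON bulk bodies, starting a new body when the size cap is hit."""
    bodies, current, size = [], [], 0
    for record in event.get("Records", []):
        payload = decode_kinesis_record(record)
        if payload.get("messageType") != "DATA_MESSAGE":
            continue  # skip CloudWatch control messages
        for log_event in payload.get("logEvents", []):
            message = log_event["message"]
            if only_match_substring and only_match_substring not in message:
                continue  # same idea as ONLY_MATCH_SUBSTRING
            # Assumption: the document shape below is hypothetical.
            doc = json.dumps(
                {"message": message, "timestamp": log_event["timestamp"]}
            )
            # The index name comes from the bulk URL, so the action is empty.
            entry = ('{"index":{}}\n' + doc + "\n").encode("utf-8")
            if current and size + len(entry) > bulk_max_bytes:
                bodies.append(b"".join(current))
                current, size = [], 0
            current.append(entry)
            size += len(entry)
    if current:
        bodies.append(b"".join(current))
    return bodies


def make_sample_event(messages):
    """Build a fake Lambda event for local experiments (illustrative helper)."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logEvents": [
            {"id": str(i), "timestamp": 1700000000000 + i, "message": m}
            for i, m in enumerate(messages)
        ],
    }
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))
    return {"Records": [{"kinesis": {"data": data.decode("ascii")}}]}
```

Calling `build_bulk_bodies(make_sample_event(['{"logtype":"Protection"}', "noise"]), only_match_substring="logtype")` yields a single bulk body that contains only the first message. Each returned body would then be sent in one HTTPS POST to the URL configured in ESA_BULK_URL. The TLS and retry handling listed in the features above is out of scope for this sketch.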
    
    After the configurator script completes, the generated artifacts have the following directory structure.
    Installation_Files/
    ├── config.json
    ├── scripts/
    │   ├── emr_serverless_setup_cli.py
    │   └── lambda_function.py
    ├── runtime/
    │   ├── pephive-3.1.3_v<BDP_version>.jar
    │   └── pepspark-3.5.6_v<BDP_version>.jar
    ├── common/
    │   ├── jcorelite.jar
    │   ├── jcorelite.plm
    │   ├── GetCertificates.sh
    │   └── config.ini.template
    └── BigDataProtector_Linux-ALL-64_x86-64_EMR.Serverless-<EMR_version>-64_<BDP_version>.tgz
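Before running the deployment CLI, it can be useful to confirm that the layout above was actually generated. The following is a minimal sketch of such a check, assuming the file names shown in the directory structure. The function name `missing_artifacts` and the glob patterns for the versioned JAR and tarball names are illustrative choices, not part of the product.

```python
from pathlib import Path

# Fixed file names taken from the directory structure listed above.
EXPECTED_FILES = [
    "config.json",
    "scripts/emr_serverless_setup_cli.py",
    "scripts/lambda_function.py",
    "common/jcorelite.jar",
    "common/jcorelite.plm",
    "common/GetCertificates.sh",
    "common/config.ini.template",
]

# Versioned artifacts are matched by pattern because their names
# embed the BDP and EMR versions.
EXPECTED_GLOBS = [
    "runtime/pepspark-*.jar",
    "runtime/pephive-*.jar",
    "BigDataProtector_*.tgz",
]


def missing_artifacts(root):
    """Return the expected files or patterns that are absent under root."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [
        pattern for pattern in EXPECTED_GLOBS if not any(root.glob(pattern))
    ]
    return missing


if __name__ == "__main__":
    problems = missing_artifacts("./Installation_Files")
    if problems:
        print("Missing artifacts:", ", ".join(problems))
    else:
        print("All expected artifacts are present.")
```

An empty list means every expected artifact was found. Any entry in the list names the file or pattern to investigate before running the validate command of the deployment CLI.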
    
    A sample config.json file is shown below for reference.
    {
        "_comment": "EMR Serverless Big Data Protector Configuration - Generated by configurator.sh",
        "runtime": "spark",
        "region": "<region_name>",
        "registryHostname": "<AWS_Account_ID>.dkr.ecr.<region_name>.amazonaws.com",
        "defaults": {
            "syncHost": "<ESA_IP>",
            "syncPort": "25400",
            "getCertPort": "25400",
            "syncProtocol": "https",
            "syncCAFile": "/opt/esacert/CA.pem",
            "syncCertFile": "/opt/esacert/cert.pem",
            "syncKeyFile": "/opt/esacert/cert.key",
            "syncSecretFile": "/opt/esacert/secret.txt",
            "syncRequestTimeout": 60,
            "certResource": "pty/v1/cert",
            "repositoryName": "protegrity-emr-rest",
            "imageTag": "sparkv66",
            "commonCopy": [
                {
                    "source": "common/jcorelite.jar",
                    "destSpark": "/usr/lib/spark/jars/jcorelite.jar",
                    "destHive": "/usr/lib/hive/lib/jcorelite.jar"
                },
                {
                    "source": "common/jcorelite.plm",
                    "destSpark": "/usr/lib/spark/jars/jcorelite.plm",
                    "destHive": "/usr/lib/hive/lib/jcorelite.plm"
                },
                {
                    "source": "common/GetCertificates.sh",
                    "destSpark": "/opt/esacert/GetCertificates",
                    "destHive": "/opt/esacert/GetCertificates"
                },
                {
                    "source": "common/config.ini",
                    "destSpark": "/usr/lib/spark/data/config.ini",
                    "destHive": "/usr/lib/hive/data/config.ini"
                }
            ]
        },
        "runtimes": {
            "spark": {
                "baseImage": "public.ecr.aws/emr-serverless/spark/emr-7.12.0:latest",
                "contextDir": ".",
                "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
                "copy": [
                    {
                        "source": "runtime/pepspark-*.jar",
                        "dest": "/usr/lib/spark/jars/"
                    }
                ],
                "chown": [
                    "/usr/lib/spark/jars",
                    "/usr/lib/spark/lib",
                    "/usr/lib/spark/data",
                    "/opt/esacert"
                ],
                "user": "hadoop:hadoop"
            },
            "hive": {
                "baseImage": "public.ecr.aws/emr-serverless/hive/emr-7.12.0:latest",
                "contextDir": ".",
                "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
                "copy": [
                    {
                        "source": "runtime/pephive-*.jar",
                        "dest": "/usr/lib/hive/lib/"
                    }
                ],
                "chown": [
                    "/usr/lib/hive/lib",
                    "/usr/lib/hive/data",
                    "/opt/esacert"
                ],
                "user": "hadoop:hadoop"
            }
        }
    }
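The fields in the sample config.json can be sanity-checked before a build. The sketch below loads the file, applies the rules printed by the configurator prompts (12-digit account ID, matching regions, repository naming rules), and derives the image reference in the usual `<registry>/<repository>:<tag>` form. It is an illustration only. The function names are hypothetical, the repository rule mirrors the prompt text and may be narrower than what ECR accepts, and the way `destSpark` and `destHive` are selected is an assumption based on the key names, not confirmed CLI behavior.

```python
import json
import re

# <12-digit account>.dkr.ecr.<region>.amazonaws.com
REGISTRY_RE = re.compile(r"^(\d{12})\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com$")
# Lowercase letters, numbers, hyphens, underscores, forward slashes; 2-256 chars.
REPO_NAME_RE = re.compile(r"^[a-z0-9/_-]{2,256}$")


def load_config(path):
    """Read config.json from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_config(config):
    """Return the image URI, the copy plan, and any problems found."""
    runtime = config["runtime"]
    defaults = config["defaults"]
    problems = []

    match = REGISTRY_RE.match(config["registryHostname"])
    if not match:
        problems.append("registryHostname has an unexpected format")
    elif match.group(2) != config["region"]:
        problems.append("registryHostname region differs from region")
    if not REPO_NAME_RE.match(defaults["repositoryName"]):
        problems.append("repositoryName violates the naming rules")
    if runtime not in config["runtimes"]:
        problems.append("runtime has no entry under runtimes")

    image_uri = "{}/{}:{}".format(
        config["registryHostname"],
        defaults["repositoryName"],
        defaults["imageTag"],
    )
    # Assumption: the destination key is chosen by the selected runtime.
    dest_key = "destSpark" if runtime == "spark" else "destHive"
    copies = [
        (item["source"], item[dest_key]) for item in defaults["commonCopy"]
    ]
    return {"image_uri": image_uri, "copies": copies, "problems": problems}
```

With a hypothetical account ID of 123456789012 and region us-east-1, together with the repository name and tag from the sample, `summarize_config` returns the image URI `123456789012.dkr.ecr.us-east-1.amazonaws.com/protegrity-emr-rest:sparkv66` and an empty problem list. The placeholder values in the sample itself, such as `<region_name>`, are reported as problems until they are replaced with real values.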