Executing the Configurator Script

The steps in this section apply only when you install the Big Data Protector using the Serverless approach.

The Big Data Protector configurator script:

  1. Generates the config.json file.
  2. Generates the EMR Serverless deployment scripts.
  3. Provides the runtime artifacts and common utilities.
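
The procedure that follows must be run from a machine that has connectivity to the ESA, so it can help to confirm reachability first. The following Python sketch is not part of the product. The function name `can_reach` is hypothetical, and the port 25400 is the default shown in the configurator prompts. The check only confirms that a TCP connection opens; it does not verify TLS or credentials.

```python
import socket


def can_reach(host: str, port: int = 25400, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_reach("<ESA_IP_Address>")` returns `False` if a firewall or security group blocks the ESA listening port from the machine.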

To execute the configurator script:

  1. Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
  2. Navigate to the directory where the installation files are extracted.
  3. To execute the script, run the following command:
    ./BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh
    
  4. Press ENTER.
    The Big Data Protector Configurator Wizard with the prompt to continue appears.
    ***********************************************************************
         Welcome to the Big Data Protector Configurator Wizard
    ***********************************************************************
    This will create the Big Data Protector Installation files for AWS EMR.
    
    Do you want to continue? [yes or no]:
    
  5. To continue, type yes.
  6. Press ENTER.
    The prompt to select the deployment type appears.
    Protegrity Big Data Protector Configurator started...
    Enter the EMR deployment type for Big Data Protector:
    [ 1 ] : New EMR Cluster (Bootstrap)
    [ 2 ] : Existing EMR Cluster (Static)
    [ 3 ] : EMR Serverless (Containerized)
    [ 1, 2, or 3 ]:
    
  7. To install the Big Data Protector using the Serverless approach, type 3.
  8. Press ENTER.
    The prompt to select the configuration mode appears.
    Generating Big Data Protector for EMR Serverless......
    
    ================================================================
        EMR Serverless - Configuration Setup
    ================================================================
    
    The EMR Serverless deployment requires configuration values to be
    stored in a config.json file. This file is used by Python scripts to:
    
    - Generate the Dockerfile with BDP components
    - Build and tag the Docker image
    - Push the image to AWS ECR
    - Configure certificate downloads from ESA
    
    You have two options to provide this configuration:
    
    ================================================================
    OPTION 1: Interactive Mode (Recommended)
    ================================================================
    - Guided prompts will collect all required information
    - Values are validated during input
    - config.json is automatically generated
    - Faster and less error-prone
    
    ================================================================
    OPTION 2: Silent Mode
    ================================================================
    - A template config.json file with placeholders is created
    - You manually edit the file and replace all placeholders
    - Useful if you prefer to script or automate configuration
    - Requires careful attention to JSON syntax
    
    ================================================================
    
    Select configuration mode:
    [ 1 ] : Interactive Mode (Guided prompts)
    [ 2 ] : Silent Mode (Edit config.json template)
    Enter your choice [1 or 2]:
    
  9. To use the interactive configuration mode, type 1.
  10. Press ENTER.
    The prompt to verify the prerequisites appears.
    [OK] Selected: Interactive Mode
    ================================================================
       EMR Serverless - Prerequisites Checklist
    ================================================================
    
    Before proceeding, please ensure you have the following information ready:
    
    [OK] ESA Configuration:
    - ESA Server Host/IP
    - ESA Port (default: 25400)
    - GetCertificates Port (default: 25400)
    - ESA Admin Username & Password (prompted during build)
    
    [OK] EMR Serverless Configuration:
    [1/6] EMR Release Label (e.g., emr-6.15.0, emr-7.0.0)
    [2/6] Runtime Selection (Spark or Hive)
    [3/6] AWS Account ID (12-digit number)
    [4/6] AWS Region (e.g., us-east-1, us-west-2)
    [5/6] ECR Repository Name (where Docker image will be stored)
    [6/6] Docker Image Tag (e.g., latest, v1.0.0)
    
    ================================================================
    
    Do you have all the required information to proceed? [yes/no]:
    
  11. If all the prerequisites are available, type yes.
  12. Press ENTER.
    The prompt to enter the ESA host name appears.
    [OK] Proceeding with interactive configuration...
    Enter the ESA Hostname/IP Address:
    
  13. Enter the ESA hostname or IP address.
  14. Press ENTER.
    The prompt to enter the ESA listening port appears.
    Enter ESA host listening port [25400]:
    
  15. Enter the listening port.
  16. Press ENTER.
    The prompt to enter the GetCertificates port appears.
    Enter GetCertificates port [25400]:
    
  17. Enter the port to fetch the certificates from the ESA.
  18. Press ENTER.
    The prompt to enter the EMR release label appears.
    ================================================================
       EMR Serverless Configuration - Step by Step
    ================================================================
    
    ESA Server: <ESA_IP_Address>:<ESA_Port>
    GetCertificates Port: <ESA_Port>
    
    [1/6] EMR Release Label
    ------------------------------------------------------
    Specify the EMR release version you want to use.
    Note: Not all EMR versions have serverless images available.
    For available versions, visit AWS EMR Serverless documentation.
    Enter EMR Release Label (e.g., emr-7.12.0):
    
  19. Enter the EMR version.
  20. Press ENTER.
    The prompt to select the processing engine appears.
    [2/6] Runtime Selection
    ------------------------------------------------------
    Choose the processing engine for your EMR Serverless application.
    Spark: For data processing, ETL, and analytics
    Hive:  For SQL queries on large datasets
    
    Select Runtime:
    [ 1 ] : Spark
    [ 2 ] : Hive
    Enter your choice [1 or 2]:
    
  21. Depending on the requirements, type 1 or 2.
  22. Press ENTER.
    The prompt to enter the AWS Account ID appears.
    [3/6] AWS Account ID
    ------------------------------------------------------
    Your 12-digit AWS Account ID is required to:
    • Access AWS ECR (Elastic Container Registry)
    • Identify your AWS resources
    
    Find it at: AWS Console > Account (top-right) > My Account
    Enter AWS Account ID (12 digits):
    
  23. Enter the AWS Account ID.
  24. Press ENTER.
    The prompt to enter the AWS region where the EMR Serverless resources will be deployed appears.
    [4/6] AWS Region
    ------------------------------------------------------
    Specify the AWS region where your EMR Serverless resources
    will be deployed (e.g., us-east-1, us-west-2, eu-west-1).
    
    Note:
    • Your ECR repository and EMR Serverless application must be in same region.
    
    Enter AWS Region (e.g., us-east-1):
    
  25. Enter the region name.
  26. Press ENTER.
    The prompt to enter the ECR Repository Name appears.
    [5/6] ECR Repository Name
    ------------------------------------------------------
    AWS ECR (Elastic Container Registry) repository where the
    BDP Docker image will be stored and pulled from.
    
    Repository naming rules:
    • Lowercase letters, numbers, hyphens, underscores, forward slashes
    • 2-256 characters long    
    Enter ECR Repository Name:
    
  27. Enter the ECR repository name.
  28. Press ENTER.
    The prompt to enter the Docker image tag appears.
    [6/6] Docker Image Tag
    ------------------------------------------------------
    Tag for the Docker image in ECR. This helps identify
    different versions of your BDP image.
    Enter Docker Image Tag [default: latest]:
    
  29. Enter the Docker image tag.
  30. Press ENTER.
    The script completes the EMR Serverless configuration.
    ================================================================
    [OK] EMR Serverless configuration completed successfully!
    ================================================================
    
    Generated config.json file successfully at /bdp/build/BigDataProtector/BigDataProtector/Installation_Files/config.json
    
    ================================================================
    [OK] Successfully configured Big Data Protector for EMR Serverless!
    ================================================================
    
    Generated Files in ./Installation_Files/ directory:
    - config.json                    - EMR Serverless configuration
    - scripts/                       - Python deployment CLIs
        +-- emr_serverless_setup_cli.py    - Main deployment CLI
        +-- lambda_function.py             - Lambda for ESA audit log forwarding
    - runtime/                       - BDP JAR files (Spark/Hive)
    - common/                        - JcoreLite, config.ini, GetCertificates.sh
    - BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz       - Complete package tarball
    
    ================================================================
    Using emr_serverless_setup_cli.py - Main Deployment Tool
    ================================================================
    
    This Python CLI provides commands to build and deploy BDP Docker images:
    
    AVAILABLE COMMANDS:
    validate            - Check prerequisites (Docker, AWS CLI, config.json)
    prepare-assets      - Update config.ini and GetCertificates.sh with ESA details
    generate-dockerfile - Create Dockerfile from config.json
    build               - Build Docker image locally (preserves manual edits)
    push                - Push existing image to AWS ECR
    deploy              - Full pipeline: validate -> prepare -> generate -> build -> push
    
    USAGE:
    cd ./Installation_Files/scripts
    python3 emr_serverless_setup_cli.py --config ../config.json <COMMAND>
    
    TYPICAL WORKFLOW:
    # Option 1: Full automated deployment
    python3 emr_serverless_setup_cli.py --config ../config.json deploy
    
    # Option 2: Step-by-step with manual edits
    python3 emr_serverless_setup_cli.py --config ../config.json validate
    python3 emr_serverless_setup_cli.py --config ../config.json prepare-assets
    python3 emr_serverless_setup_cli.py --config ../config.json generate-dockerfile
    # Manually edit Dockerfile if needed
    python3 emr_serverless_setup_cli.py --config ../config.json build
    python3 emr_serverless_setup_cli.py --config ../config.json push
    
    NOTES:
    - During 'deploy' or 'build', you'll be prompted for ESA credentials
    - Credentials are used during build only, NOT stored in image layers
    - ECR authentication is handled automatically by AWS CLI
    - Use 'build' command to preserve manual Dockerfile edits
    
    ================================================================
    Audit Logging Configuration
    ================================================================
    
    IMPORTANT: EMR Serverless uses stdout for audit log output.
    
    - All audit logs are written to standard output (stdout)
    - Logs are automatically captured by AWS CloudWatch Logs
    - CloudWatch logs are stored in your configured S3 bucket
    
    To access audit logs:
    1. Via CloudWatch: AWS Console -> CloudWatch -> Log Groups
    2. Via S3 Bucket: Check your EMR Serverless application's S3 logs location
    
    ================================================================
    lambda_function.py - ESA Audit Log Forwarder
    ================================================================
    
    For centralized audit log forwarding to ESA Audit Store, use the provided
    lambda_function.py - a ready-to-deploy AWS Lambda function.
    
    LOG FLOW:
    EMR Serverless (stdout) -> CloudWatch Logs -> Subscription Filter ->
    Kinesis Data Stream -> Lambda Function -> ESA OpenSearch Endpoint
    
    LAMBDA FUNCTION FEATURES:
    - Triggered by Kinesis Data Stream events
    - Decodes and parses CloudWatch log data from Kinesis records
    - Forwards logs to ESA using OpenSearch bulk API
    - TLS encryption with certificate-based authentication
    - Automatic batching, retries, and error recovery
    
    REQUIRED ENVIRONMENT VARIABLES:
    ESA_BULK_URL          - Full OpenSearch bulk API endpoint
                            Example: https://<ESA_IP_Address>:9200/pty_insight_audit/_bulk?pipeline=logs_pipeline
    ESA_CA_SECRET_ID      - AWS Secrets Manager ARN for CA certificate
    ESA_CA_SECRET_JSON_KEY- JSON key name in secret (default: ca_pem)
    HTTP_TIMEOUT_SEC      - HTTP timeout in seconds (default: 120)
    BULK_MAX_BYTES        - Max bulk request size (default: 5242880)
    ONLY_MATCH_SUBSTRING  - Filter logs by substring (e.g., "logtype")
    
    For detailed deployment steps, refer to the EMR Serverless documentation.
    
    ================================================================
    
    After the configurator script runs, the generated artifacts are arranged in the following directory structure.
    Installation_Files/
    ├── config.json
    ├── scripts/
    │   ├── emr_serverless_setup_cli.py
    │   └── lambda_function.py
    ├── runtime/
    │   ├── pephive-3.1.3_v<BDP_version>.jar
    │   └── pepspark-3.5.6_v<BDP_version>.jar
    ├── common/
    │   ├── jcorelite.jar
    │   ├── jcorelite.plm
    │   ├── GetCertificates.sh
    │   └── config.ini.template
    └── BigDataProtector_Linux-ALL-64_x86-64_EMR.Serverless-<EMR_version>-64_<BDP_version>.tgz
    
    A sample config.json file is shown below for reference.
    {
        "_comment": "EMR Serverless Big Data Protector Configuration - Generated by configurator.sh",
        "runtime": "spark",
        "region": "<region_name>",
        "registryHostname": "<AWS_Account_ID>.dkr.ecr.<region_name>.amazonaws.com",
        "defaults": {
            "syncHost": "<ESA_IP>",
            "syncPort": "25400",
            "getCertPort": "25400",
            "syncProtocol": "https",
            "syncCAFile": "/opt/esacert/CA.pem",
            "syncCertFile": "/opt/esacert/cert.pem",
            "syncKeyFile": "/opt/esacert/cert.key",
            "syncSecretFile": "/opt/esacert/secret.txt",
            "syncRequestTimeout": 60,
            "certResource": "pty/v1/cert",
            "repositoryName": "protegrity-emr-rest",
            "imageTag": "sparkv66",
            "commonCopy": [
                {
                    "source": "common/jcorelite.jar",
                    "destSpark": "/usr/lib/spark/jars/jcorelite.jar",
                    "destHive": "/usr/lib/hive/lib/jcorelite.jar"
                },
                {
                    "source": "common/jcorelite.plm",
                    "destSpark": "/usr/lib/spark/jars/jcorelite.plm",
                    "destHive": "/usr/lib/hive/lib/jcorelite.plm"
                },
                {
                    "source": "common/GetCertificates.sh",
                    "destSpark": "/opt/esacert/GetCertificates",
                    "destHive": "/opt/esacert/GetCertificates"
                },
                {
                    "source": "common/config.ini",
                    "destSpark": "/usr/lib/spark/data/config.ini",
                    "destHive": "/usr/lib/hive/data/config.ini"
                }
            ]
        },
        "runtimes": {
            "spark": {
                "baseImage": "public.ecr.aws/emr-serverless/spark/emr-7.12.0:latest",
                "contextDir": ".",
                "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
                "copy": [
                    {
                        "source": "runtime/pepspark-*.jar",
                        "dest": "/usr/lib/spark/jars/"
                    }
                ],
                "chown": [
                    "/usr/lib/spark/jars",
                    "/usr/lib/spark/lib",
                    "/usr/lib/spark/data",
                    "/opt/esacert"
                ],
                "user": "hadoop:hadoop"
            },
            "hive": {
                "baseImage": "public.ecr.aws/emr-serverless/hive/emr-7.12.0:latest",
                "contextDir": ".",
                "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
                "copy": [
                    {
                        "source": "runtime/pephive-*.jar",
                        "dest": "/usr/lib/hive/lib/"
                    }
                ],
                "chown": [
                    "/usr/lib/hive/lib",
                    "/usr/lib/hive/data",
                    "/opt/esacert"
                ],
                "user": "hadoop:hadoop"
            }
        }
    }
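
In Silent Mode, every placeholder in the template config.json must be replaced before the deployment CLI is run. The following Python sketch shows one way to catch leftover placeholders. It assumes that placeholders use the angle-bracket form seen in this section, for example `<region_name>`; the actual template may mark them differently. The function name `find_placeholders` and the inline sample values are hypothetical.

```python
import json
import re

# Assumption: placeholders look like <region_name> or <AWS_Account_ID>.
PLACEHOLDER = re.compile(r"<[A-Za-z_][A-Za-z0-9_ ]*>")


def find_placeholders(node, path="$"):
    """Yield (json_path, value) for each string that still holds a placeholder."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_placeholders(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and PLACEHOLDER.search(node):
        yield path, node


# Inline fragment for illustration only. json.loads also raises
# json.JSONDecodeError if the edited file contains a JSON syntax error.
sample = json.loads("""
{
    "runtime": "spark",
    "region": "<region_name>",
    "defaults": {"syncHost": "10.0.0.5", "syncPort": "25400"}
}
""")

for json_path, value in find_placeholders(sample):
    print(f"{json_path}: unreplaced placeholder in {value!r}")
```

To check a real file, load it with `json.load` instead of the inline fragment and pass the result to `find_placeholders`.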
    
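The AWS Account ID, region, ECR repository name, and Docker image tag entered in the prompts combine into the image reference that is pushed to AWS ECR. The following Python sketch checks an account ID and a repository name against the rules shown in the prompts, and assembles the image reference from the `registryHostname`, `repositoryName`, and `imageTag` fields of the sample config.json. The function names are hypothetical, the account ID is a dummy value, and the checks are a simplification; the configurator and ECR may apply further constraints.

```python
import re

# Rule from the [3/6] prompt: exactly 12 digits.
ACCOUNT_ID = re.compile(r"[0-9]{12}")
# Rules as listed in the [5/6] prompt; ECR itself may apply further constraints.
REPO_NAME = re.compile(r"[a-z0-9/_-]{2,256}")


def is_valid_account_id(value: str) -> bool:
    return ACCOUNT_ID.fullmatch(value) is not None


def is_valid_repo_name(value: str) -> bool:
    return REPO_NAME.fullmatch(value) is not None


def image_reference(config: dict) -> str:
    """Join registry host, repository, and tag into <host>/<repo>:<tag>."""
    defaults = config["defaults"]
    return f'{config["registryHostname"]}/{defaults["repositoryName"]}:{defaults["imageTag"]}'


# Illustrative values only; 123456789012 is a dummy account ID.
config = {
    "registryHostname": "123456789012.dkr.ecr.us-east-1.amazonaws.com",
    "defaults": {"repositoryName": "protegrity-emr-rest", "imageTag": "sparkv66"},
}
print(image_reference(config))
```

Because the registry host name embeds both the account ID and the region, a mistake in either value surfaces as a push to a registry that does not exist.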

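The log flow described in the configurator output ends in a Lambda function that decodes CloudWatch log data from Kinesis records and forwards it using the OpenSearch bulk API. The following Python sketch illustrates only the decoding and batching idea. It is not the provided lambda_function.py, and the function names are hypothetical. It relies on two AWS behaviors: Lambda delivers Kinesis record data base64-encoded, and CloudWatch Logs subscription payloads are gzip-compressed JSON containing a `logEvents` list. Wrapping each log message in a `message` field is an assumption made for illustration, as is the sample log line.

```python
import base64
import gzip
import json


def decode_kinesis_record(record: dict) -> list:
    """Return the log messages carried by one Kinesis record of a Lambda event."""
    compressed = base64.b64decode(record["kinesis"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # CONTROL_MESSAGE payloads are connectivity checks and carry no log data.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [event["message"] for event in payload.get("logEvents", [])]


def build_bulk_body(messages: list, only_match: str = "") -> str:
    """Build an NDJSON bulk body: one action line and one source line per message."""
    lines = []
    for message in messages:
        if only_match and only_match not in message:
            continue  # same idea as the ONLY_MATCH_SUBSTRING filter
        lines.append(json.dumps({"index": {}}))
        # Assumption: each log line is sent as a single "message" field.
        lines.append(json.dumps({"message": message}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Build a fake record the way CloudWatch Logs and Kinesis would encode it.
fake_payload = {
    "messageType": "DATA_MESSAGE",
    "logEvents": [{"id": "1", "timestamp": 0, "message": '{"logtype": "Protection"}'}],
}
fake_record = {
    "kinesis": {
        "data": base64.b64encode(gzip.compress(json.dumps(fake_payload).encode())).decode()
    }
}
print(build_bulk_body(decode_kinesis_record(fake_record), only_match="logtype"))
```

The action line can stay empty here because the target index is already part of the bulk URL, as in the ESA_BULK_URL example.
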
Last modified : January 13, 2026