Executing the Configurator Script

The steps in this section apply only when installing the Big Data Protector using the Serverless approach.

The Big Data Protector configurator script:

Generates the config.json file.
Generates the EMR Serverless deployment scripts.
Provides the runtime artifacts and common utilities.

To execute the configurator script:

Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
Navigate to the directory where the installation files are extracted.
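
Before running the configurator, it can help to confirm that the machine actually reaches the ESA. The following is a minimal Python sketch and is not part of the Big Data Protector package. The function name `can_reach` is hypothetical, and port 25400 is assumed from the default shown in the prerequisites checklist later in this section.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (replace the placeholder with the ESA host name or IP address):
#   can_reach("<ESA_IP_Address>", 25400)
```

A `False` result only indicates that the TCP connection failed; it does not distinguish between a firewall rule, a routing issue, or a wrong port.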

To execute the script, run the following command:

./BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh

Press ENTER.
The Big Data Protector Configurator Wizard appears with a prompt to continue.

***********************************************************************
     Welcome to the Big Data Protector Configurator Wizard
***********************************************************************
This will create the Big Data Protector Installation files for AWS EMR.

Do you want to continue? [yes or no]:

To continue, type yes.

Press ENTER.
The prompt to select the deployment type appears.

Protegrity Big Data Protector Configurator started...
Enter the EMR deployment type for Big Data Protector:
[ 1 ] : New EMR Cluster (Bootstrap)
[ 2 ] : Existing EMR Cluster (Static)
[ 3 ] : EMR Serverless (Containerized)
[ 1, 2, or 3 ]:

To install the Big Data Protector using the Serverless approach, type 3.

Press ENTER.
The prompt to select the configuration mode appears.

Generating Big Data Protector for EMR Serverless......

================================================================
    EMR Serverless - Configuration Setup
================================================================

The EMR Serverless deployment requires configuration values to be
stored in a config.json file. This file is used by Python scripts to:

- Generate the Dockerfile with BDP components
- Build and tag the Docker image
- Push the image to AWS ECR
- Configure certificate downloads from ESA

You have two options to provide this configuration:

================================================================
OPTION 1: Interactive Mode (Recommended)
================================================================
- Guided prompts will collect all required information
- Values are validated during input
- config.json is automatically generated
- Faster and less error-prone

================================================================
OPTION 2: Silent Mode
================================================================
- A template config.json file with placeholders is created
- You manually edit the file and replace all placeholders
- Useful if you prefer to script or automate configuration
- Requires careful attention to JSON syntax

================================================================

Select configuration mode:
[ 1 ] : Interactive Mode (Guided prompts)
[ 2 ] : Silent Mode (Edit config.json template)
Enter your choice [1 or 2]:

To use the interactive configuration mode, type 1.

Press ENTER.
The prompt to verify the prerequisites appears.

[OK] Selected: Interactive Mode
================================================================
   EMR Serverless - Prerequisites Checklist
================================================================

Before proceeding, please ensure you have the following information ready:

[OK] ESA Configuration:
- ESA Server Host/IP
- ESA Port (default: 25400)
- GetCertificates Port (default: 25400)
- ESA Admin Username & Password (prompted during build)

[OK] EMR Serverless Configuration:
[1/6] EMR Release Label (e.g., emr-6.15.0, emr-7.0.0)
[2/6] Runtime Selection (Spark or Hive)
[3/6] AWS Account ID (12-digit number)
[4/6] AWS Region (e.g., us-east-1, us-west-2)
[5/6] ECR Repository Name (where Docker image will be stored)
[6/6] Docker Image Tag (e.g., latest, v1.0.0)

================================================================

Do you have all the required information to proceed? [yes/no]:

If all the prerequisites are available, type yes.

Press ENTER.
The prompt to enter the ESA host name appears.

[OK] Proceeding with interactive configuration...
Enter the ESA Hostname/IP Address:

Enter the ESA Hostname or IP address.
Press ENTER.
The prompt to enter the ESA listening port appears.
```
Enter ESA host listening port [25400]:
```
Enter the listening port.
Press ENTER.
The prompt to enter the GetCertificates port appears.
```
Enter GetCertificates port [25400]:
```
Enter the port to fetch the certificates from the ESA.

Press ENTER.
The prompt to enter the EMR release label appears.

================================================================
   EMR Serverless Configuration - Step by Step
================================================================

ESA Server: <ESA_IP_Address>:<ESA_Port>
GetCertificates Port: <ESA_Port>

[1/6] EMR Release Label
------------------------------------------------------
Specify the EMR release version you want to use.
Note: Not all EMR versions have serverless images available.
For available versions, visit AWS EMR Serverless documentation.
Enter EMR Release Label (e.g., emr-7.12.0):

Enter the EMR release label, for example, emr-7.12.0.
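
The release label entered here appears later in the `baseImage` value of the sample config.json. As an illustration only, the sketch below checks the shape of a label and builds the base image reference in the same pattern as that sample. The function name `base_image` is hypothetical, and the check does not confirm that a serverless image exists for the given release.

```python
import re

# Shape check only: "emr-" followed by major.minor.patch, as in emr-7.12.0.
RELEASE_LABEL = re.compile(r"emr-[0-9]+\.[0-9]+\.[0-9]+")


def base_image(release_label: str, runtime: str) -> str:
    """Build a base image reference following the pattern in the sample config.json."""
    if not RELEASE_LABEL.fullmatch(release_label):
        raise ValueError(f"unexpected release label: {release_label!r}")
    if runtime not in ("spark", "hive"):
        raise ValueError(f"runtime must be 'spark' or 'hive', got {runtime!r}")
    return f"public.ecr.aws/emr-serverless/{runtime}/{release_label}:latest"
```

For the authoritative list of releases that have serverless images, refer to the AWS EMR Serverless documentation, as the prompt itself advises.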

Press ENTER.
The prompt to select the processing engine appears.

[2/6] Runtime Selection
------------------------------------------------------
Choose the processing engine for your EMR Serverless application.
Spark: For data processing, ETL, and analytics
Hive:  For SQL queries on large datasets

Select Runtime:
[ 1 ] : Spark
[ 2 ] : Hive
Enter your choice [1 or 2]:

Type 1 for Spark or 2 for Hive, depending on your requirements.

Press ENTER.
The prompt to enter the AWS Account ID appears.

[3/6] AWS Account ID
------------------------------------------------------
Your 12-digit AWS Account ID is required to:
• Access AWS ECR (Elastic Container Registry)
• Identify your AWS resources

Find it at: AWS Console > Account (top-right) > My Account
Enter AWS Account ID (12 digits):

Enter the AWS Account ID.

Press ENTER.
The prompt to enter the AWS region for the EMR Serverless resources appears.

[4/6] AWS Region
------------------------------------------------------
Specify the AWS region where your EMR Serverless resources
will be deployed (e.g., us-east-1, us-west-2, eu-west-1).

Note:
• Your ECR repository and EMR Serverless application must be in same region.

Enter AWS Region (e.g., us-east-1):

Enter the region name.
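
The AWS Account ID and the region together determine the `registryHostname` value in the sample config.json. The sketch below is a hedged illustration of that relationship; the function name `registry_hostname` is hypothetical, and the region pattern is a loose shape check that does not confirm the region exists.

```python
import re


def registry_hostname(account_id: str, region: str) -> str:
    """Build the ECR registry host name in the pattern used by the sample config.json."""
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError("AWS Account ID must be exactly 12 digits")
    # Loose shape check, e.g. us-east-1 or eu-west-1.
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-[0-9]+", region):
        raise ValueError(f"unexpected region format: {region!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"
```

For example, `registry_hostname("123456789012", "us-east-1")` yields `123456789012.dkr.ecr.us-east-1.amazonaws.com`, where the account ID is a made-up value.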

Press ENTER.
The prompt to enter the ECR Repository Name appears.

[5/6] ECR Repository Name
------------------------------------------------------
AWS ECR (Elastic Container Registry) repository where the
BDP Docker image will be stored and pulled from.

Repository naming rules:
• Lowercase letters, numbers, hyphens, underscores, forward slashes
• 2-256 characters long
Enter ECR Repository Name:

Enter the ECR repository name.
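
The naming rules printed by the prompt can be checked ahead of time. The following sketch encodes only the two rules shown above (allowed characters and length); the function name `is_valid_repo_name` is hypothetical, and ECR itself may apply further restrictions, so a name that passes here could still be rejected.

```python
import re

# Character set taken from the rules printed by the configurator prompt:
# lowercase letters, numbers, hyphens, underscores, forward slashes.
_ALLOWED = re.compile(r"[a-z0-9_/-]+")


def is_valid_repo_name(name: str) -> bool:
    """Check the length (2-256) and character rules shown in the prompt."""
    return 2 <= len(name) <= 256 and _ALLOWED.fullmatch(name) is not None
```

With these rules, `protegrity-emr-rest` (the repository name in the sample config.json) passes, while a name containing uppercase letters does not.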

Press ENTER.
The prompt to enter the Docker image tag appears.

[6/6] Docker Image Tag
------------------------------------------------------
Tag for the Docker image in ECR. This helps identify
different versions of your BDP image.
Enter Docker Image Tag [default: latest]:

Enter the Docker image tag.
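
As a rough illustration, the sketch below resolves a tag the way the `[default: latest]` hint suggests and checks it against the general Docker tag grammar (up to 128 characters of letters, digits, underscores, periods, and hyphens, not starting with a period or hyphen). The function name `resolve_tag` is hypothetical, and falling back to `latest` on empty input is an assumption about the configurator's behavior.

```python
import re

# General Docker tag grammar: first character is a letter, digit, or underscore;
# up to 127 further characters may also include periods and hyphens.
_TAG = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def resolve_tag(answer: str, default: str = "latest") -> str:
    """Return the entered tag, or the default when the answer is empty."""
    tag = answer.strip() or default
    if not _TAG.fullmatch(tag):
        raise ValueError(f"invalid Docker image tag: {tag!r}")
    return tag
```

Using an explicit version tag such as `v1.0.0` rather than `latest` makes it easier to tell which Big Data Protector image a given EMR Serverless application is running.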

Press ENTER.
The script completes the EMR Serverless configuration.

================================================================
[OK] EMR Serverless configuration completed successfully!
================================================================

Generated config.json file successfully at /bdp/build/BigDataProtector/BigDataProtector/Installation_Files/config.json

================================================================
[OK] Successfully configured Big Data Protector for EMR Serverless!
================================================================

Generated Files in ./Installation_Files/ directory:
- config.json                    - EMR Serverless configuration
- scripts/                       - Python deployment CLIs
    +-- emr_serverless_setup_cli.py    - Main deployment CLI
    +-- lambda_function.py             - Lambda for ESA audit log forwarding
- runtime/                       - BDP JAR files (Spark/Hive)
- common/                        - JcoreLite, config.ini, GetCertificates.sh
- BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz       - Complete package tarball

================================================================
Using emr_serverless_setup_cli.py - Main Deployment Tool
================================================================

This Python CLI provides commands to build and deploy BDP Docker images:

AVAILABLE COMMANDS:
validate            - Check prerequisites (Docker, AWS CLI, config.json)
prepare-assets      - Update config.ini and GetCertificates.sh with ESA details
generate-dockerfile - Create Dockerfile from config.json
build               - Build Docker image locally (preserves manual edits)
push                - Push existing image to AWS ECR
deploy              - Full pipeline: validate -> prepare -> generate -> build -> push

USAGE:
cd ./Installation_Files/scripts
python3 emr_serverless_setup_cli.py --config ../config.json <COMMAND>

TYPICAL WORKFLOW:
# Option 1: Full automated deployment
python3 emr_serverless_setup_cli.py --config ../config.json deploy

# Option 2: Step-by-step with manual edits
python3 emr_serverless_setup_cli.py --config ../config.json validate
python3 emr_serverless_setup_cli.py --config ../config.json prepare-assets
python3 emr_serverless_setup_cli.py --config ../config.json generate-dockerfile
# Manually edit Dockerfile if needed
python3 emr_serverless_setup_cli.py --config ../config.json build
python3 emr_serverless_setup_cli.py --config ../config.json push

NOTES:
- During 'deploy' or 'build', you'll be prompted for ESA credentials
- Credentials are used during build only, NOT stored in image layers
- ECR authentication is handled automatically by AWS CLI
- Use 'build' command to preserve manual Dockerfile edits

================================================================
Audit Logging Configuration
================================================================

IMPORTANT: EMR Serverless uses stdout for audit log output.

- All audit logs are written to standard output (stdout)
- Logs are automatically captured by AWS CloudWatch Logs
- CloudWatch logs are stored in your configured S3 bucket

To access audit logs:
1. Via CloudWatch: AWS Console -> CloudWatch -> Log Groups
2. Via S3 Bucket: Check your EMR Serverless application's S3 logs location

================================================================
lambda_function.py - ESA Audit Log Forwarder
================================================================

For centralized audit log forwarding to ESA Audit Store, use the provided
lambda_function.py - a ready-to-deploy AWS Lambda function.

LOG FLOW:
EMR Serverless (stdout) → CloudWatch Logs → Subscription Filter →
Kinesis Data Stream → Lambda Function → ESA OpenSearch Endpoint

LAMBDA FUNCTION FEATURES:
- Triggered by Kinesis Data Stream events
- Decodes and parses CloudWatch log data from Kinesis records
- Forwards logs to ESA using OpenSearch bulk API
- TLS encryption with certificate-based authentication
- Automatic batching, retries, and error recovery

REQUIRED ENVIRONMENT VARIABLES:
ESA_BULK_URL          - Full OpenSearch bulk API endpoint
                        Example: https://<ESA_IP_Address>:9200/pty_insight_audit/_bulk?pipeline=logs_pipeline
ESA_CA_SECRET_ID      - AWS Secrets Manager ARN for CA certificate
ESA_CA_SECRET_JSON_KEY- JSON key name in secret (default: ca_pem)
HTTP_TIMEOUT_SEC      - HTTP timeout in seconds (default: 120)
BULK_MAX_BYTES        - Max bulk request size (default: 5242880)
ONLY_MATCH_SUBSTRING  - Filter logs by substring (e.g., "logtype")

For detailed deployment steps, refer to the EMR Serverless documentation.

================================================================
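
To make the log flow above more concrete, the following sketch shows how a Lambda function could decode a Kinesis record that carries CloudWatch Logs subscription data and turn the log events into an OpenSearch bulk request body. It is not the shipped lambda_function.py. The function names and the forwarded document shape (`message`, `timestamp`) are assumptions for illustration; the `only_match` parameter mirrors the idea behind `ONLY_MATCH_SUBSTRING`.

```python
import base64
import gzip
import json


def decode_kinesis_record(data_b64: str) -> dict:
    """Decode one Kinesis record: base64 -> gzip -> CloudWatch Logs JSON payload."""
    return json.loads(gzip.decompress(base64.b64decode(data_b64)))


def to_bulk_body(payload: dict, only_match: str = "") -> str:
    """Turn matching log events into a newline-delimited bulk request body."""
    lines = []
    for event in payload.get("logEvents", []):
        message = event["message"]
        if only_match and only_match not in message:
            continue  # Skip events that do not contain the filter substring.
        # Each document is preceded by an action line; the index name is
        # expected to come from the bulk URL path, so the action is empty.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps({"message": message, "timestamp": event["timestamp"]}))
    # The bulk format requires a trailing newline after the last line.
    return ("\n".join(lines) + "\n") if lines else ""
```

In a Lambda handler triggered by Kinesis, the base64 string would come from `record["kinesis"]["data"]` for each entry in `event["Records"]`. Sending the body to the ESA over TLS, batching, and retries are left out of this sketch.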

The directory structure of the artifacts after executing the configurator script is listed below.

Installation_Files/
├── config.json
├── scripts/
│   ├── emr_serverless_setup_cli.py
│   └── lambda_function.py
├── runtime/
│   ├── pephive-3.1.3_v<BDP_version>.jar
│   └── pepspark-3.5.6_v<BDP_version>.jar
├── common/
│   ├── jcorelite.jar
│   ├── jcorelite.plm
│   ├── GetCertificates.sh
│   └── config.ini.template
└── BigDataProtector_Linux-ALL-64_x86-64_EMR.Serverless-<EMR_version>-64_<BDP_version>.tgz
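
With these artifacts in place, the step-by-step workflow printed by the configurator can be scripted. The sketch below builds the same command lines shown in the configurator output and runs them in order. The helper names `build_commands` and `run_steps` are hypothetical, and the working directory is assumed to be the parent of Installation_Files.

```python
import subprocess
from typing import List, Sequence

# Step order taken from the "Step-by-step" workflow in the configurator output.
STEPS = ["validate", "prepare-assets", "generate-dockerfile", "build", "push"]


def build_commands(config_path: str = "../config.json",
                   steps: Sequence[str] = STEPS) -> List[List[str]]:
    """Return one argument list per CLI step, matching the documented usage."""
    return [
        ["python3", "emr_serverless_setup_cli.py", "--config", config_path, step]
        for step in steps
    ]


def run_steps(commands: List[List[str]],
              cwd: str = "Installation_Files/scripts") -> None:
    """Run each command in the scripts directory and stop at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, cwd=cwd, check=True)
```

Because the `build` step prompts for ESA credentials, this is best run from an interactive terminal. To edit the Dockerfile manually, run the first three steps, make the edits, and then run `build` and `push` separately.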

A sample config.json file is listed below for reference.

{
    "_comment": "EMR Serverless Big Data Protector Configuration - Generated by configurator.sh",
    "runtime": "spark",
    "region": "<region_name>",
    "registryHostname": "<AWS_Account_ID>.dkr.ecr.<region_name>.amazonaws.com",
    "defaults": {
        "syncHost": "<ESA_IP>",
        "syncPort": "25400",
        "getCertPort": "25400",
        "syncProtocol": "https",
        "syncCAFile": "/opt/esacert/CA.pem",
        "syncCertFile": "/opt/esacert/cert.pem",
        "syncKeyFile": "/opt/esacert/cert.key",
        "syncSecretFile": "/opt/esacert/secret.txt",
        "syncRequestTimeout": 60,
        "certResource": "pty/v1/cert",
        "repositoryName": "protegrity-emr-rest",
        "imageTag": "sparkv66",
        "commonCopy": [
        {
            "source": "common/jcorelite.jar",
            "destSpark": "/usr/lib/spark/jars/jcorelite.jar",
            "destHive": "/usr/lib/hive/lib/jcorelite.jar"
        },
        {
            "source": "common/jcorelite.plm",
            "destSpark": "/usr/lib/spark/jars/jcorelite.plm",
            "destHive": "/usr/lib/hive/lib/jcorelite.plm"
        },
        {
            "source": "common/GetCertificates.sh",
            "destSpark": "/opt/esacert/GetCertificates",
            "destHive": "/opt/esacert/GetCertificates"
        },
        {
            "source": "common/config.ini",
            "destSpark": "/usr/lib/spark/data/config.ini",
            "destHive": "/usr/lib/hive/data/config.ini"
        }
        ]
    },
    "runtimes": {
        "spark": {
        "baseImage": "public.ecr.aws/emr-serverless/spark/emr-7.12.0:latest",
        "contextDir": ".",
        "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
        "copy": [
            {
            "source": "runtime/pepspark-*.jar",
            "dest": "/usr/lib/spark/jars/"
            }
        ],
        "chown": [
            "/usr/lib/spark/jars",
            "/usr/lib/spark/lib",
            "/usr/lib/spark/data",
            "/opt/esacert"
        ],
        "user": "hadoop:hadoop"
        },
        "hive": {
        "baseImage": "public.ecr.aws/emr-serverless/hive/emr-7.12.0:latest",
        "contextDir": ".",
        "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
        "copy": [
            {
            "source": "runtime/pephive-*.jar",
            "dest": "/usr/lib/hive/lib/"
            }
        ],
        "chown": [
            "/usr/lib/hive/lib",
            "/usr/lib/hive/data",
            "/opt/esacert"
        ],
        "user": "hadoop:hadoop"
        }
    }
}
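
Before running the deployment CLI, it can be useful to inspect config.json programmatically, especially when it was edited by hand in Silent Mode. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, placeholders are assumed to use the `<NAME>` style seen in the sample above, and the target image is composed in the usual `registry/repository:tag` form, which may differ from how emr_serverless_setup_cli.py combines these fields internally.

```python
import json
import re


def load_config(path: str) -> dict:
    """Load config.json; json.load raises an error on invalid JSON syntax."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def target_image(config: dict) -> str:
    """Compose registry/repository:tag from the fields in config.json."""
    defaults = config["defaults"]
    registry = config["registryHostname"]
    return f"{registry}/{defaults['repositoryName']}:{defaults['imageTag']}"


def unresolved_placeholders(node, path=""):
    """Yield the paths of string values that still contain a <PLACEHOLDER>."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from unresolved_placeholders(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from unresolved_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and re.search(r"<[^<>]+>", node):
        yield path
```

An empty result from `list(unresolved_placeholders(load_config("config.json")))` suggests that every placeholder has been replaced; it does not validate the values themselves.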

Last modified: January 13, 2026