Executing the Configurator Script
The steps in this section apply only when installing the Big Data Protector using the Serverless approach.
The Big Data Protector configurator script:
- Generates the config.json file.
- Generates the EMR Serverless deployment scripts.
- Provides the runtime artifacts and common utilities.
To execute the configurator script:
- Log in to the CLI on a machine or an Amazon EC2 node that has connectivity to the ESA.
- Navigate to the directory where the installation files are extracted.
- To execute the script, run the following command:

    ./BDP_Configurator_EMR-<EMR_version>_<BDP_version>.sh

- Press ENTER.
The Big Data Protector Configurator Wizard with the prompt to continue appears.

    ***********************************************************************
    Welcome to the Big Data Protector Configurator Wizard
    ***********************************************************************
    This will create the Big Data Protector Installation files for AWS EMR.
    Do you want to continue? [yes or no]:

- To continue, type yes.
- Press ENTER.
The prompt to select the deployment type appears.

    Protegrity Big Data Protector Configurator started...
    Enter the EMR deployment type for Big Data Protector:
    [ 1 ] : New EMR Cluster (Bootstrap)
    [ 2 ] : Existing EMR Cluster (Static)
    [ 3 ] : EMR Serverless (Containerized)
    [ 1, 2, or 3 ]:

- To install the Big Data Protector using the Serverless approach, type 3.
- Press ENTER.
The prompt to select the configuration mode appears.

    Generating Big Data Protector for EMR Serverless......

    ================================================================
    EMR Serverless - Configuration Setup
    ================================================================

    The EMR Serverless deployment requires configuration values to be stored in a config.json file.
    This file is used by Python scripts to:
      - Generate the Dockerfile with BDP components
      - Build and tag the Docker image
      - Push the image to AWS ECR
      - Configure certificate downloads from ESA

    You have two options to provide this configuration:

    ================================================================
    OPTION 1: Interactive Mode (Recommended)
    ================================================================
      - Guided prompts will collect all required information
      - Values are validated during input
      - config.json is automatically generated
      - Faster and less error-prone

    ================================================================
    OPTION 2: Silent Mode
    ================================================================
      - A template config.json file with placeholders is created
      - You manually edit the file and replace all placeholders
      - Useful if you prefer to script or automate configuration
      - Requires careful attention to JSON syntax

    ================================================================
    Select configuration mode:
    [ 1 ] : Interactive Mode (Guided prompts)
    [ 2 ] : Silent Mode (Edit config.json template)
    Enter your choice [1 or 2]:

- To use the interactive configuration mode, type 1.
- Press ENTER.
The prompt to verify the prerequisites appears.

    [OK] Selected: Interactive Mode

    ================================================================
    EMR Serverless - Prerequisites Checklist
    ================================================================

    Before proceeding, please ensure you have the following information ready:

    [OK] ESA Configuration:
      - ESA Server Host/IP
      - ESA Port (default: 25400)
      - GetCertificates Port (default: 25400)
      - ESA Admin Username & Password (prompted during build)

    [OK] EMR Serverless Configuration:
      [1/6] EMR Release Label (e.g., emr-6.15.0, emr-7.0.0)
      [2/6] Runtime Selection (Spark or Hive)
      [3/6] AWS Account ID (12-digit number)
      [4/6] AWS Region (e.g., us-east-1, us-west-2)
      [5/6] ECR Repository Name (where Docker image will be stored)
      [6/6] Docker Image Tag (e.g., latest, v1.0.0)

    ================================================================
    Do you have all the required information to proceed? [yes/no]:

- If all the prerequisites are available, type yes.
- Press ENTER.
The prompt to enter the ESA host name appears.

    [OK] Proceeding with interactive configuration...
    Enter the ESA Hostname/IP Address:

- Enter the ESA Hostname or IP address.
- Press ENTER.
The prompt to enter the ESA listening port appears.

    Enter ESA host listening port [25400]:

- Enter the listening port.
- Press ENTER.
The prompt to enter the GetCertificates port appears.

    Enter GetCertificates port [25400]:

- Enter the port to fetch the certificates from the ESA.
- Press ENTER.
The prompt to enter the EMR release label appears.

    ================================================================
    EMR Serverless Configuration - Step by Step
    ================================================================
    ESA Server: <ESA_IP_Address>:<ESA_Port>
    GetCertificates Port: <ESA_Port>

    [1/6] EMR Release Label
    ------------------------------------------------------
    Specify the EMR release version you want to use.
    Note: Not all EMR versions have serverless images available.
    For available versions, visit AWS EMR Serverless documentation.

    Enter EMR Release Label (e.g., emr-7.12.0):

- Enter the EMR release label.
- Press ENTER.
The prompt to select the processing engine appears.

    [2/6] Runtime Selection
    ------------------------------------------------------
    Choose the processing engine for your EMR Serverless application.
    Spark: For data processing, ETL, and analytics
    Hive: For SQL queries on large datasets

    Select Runtime:
    [ 1 ] : Spark
    [ 2 ] : Hive
    Enter your choice [1 or 2]:

- Depending on the requirements, type 1 or 2.
- Press ENTER.
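The sample config.json in this section shows a base image of the form public.ecr.aws/emr-serverless/<runtime>/<release_label>:latest, which suggests how the release label and runtime answers are combined. The following is a minimal sketch of that mapping, not the configurator's own code. The function name `base_image` and the release label pattern are assumptions made for illustration.

```python
import re

# Assumed shape of a release label such as emr-7.12.0 (not taken from the configurator).
RELEASE_LABEL = re.compile(r"emr-[0-9]+\.[0-9]+\.[0-9]+")


def base_image(release_label: str, runtime: str) -> str:
    """Build a base image URI in the form shown in the sample config.json."""
    if not RELEASE_LABEL.fullmatch(release_label):
        raise ValueError(f"unexpected release label: {release_label!r}")
    if runtime not in ("spark", "hive"):
        raise ValueError(f"runtime must be 'spark' or 'hive', got {runtime!r}")
    return f"public.ecr.aws/emr-serverless/{runtime}/{release_label}:latest"


print(base_image("emr-7.12.0", "spark"))
```

As the prompt notes, not every EMR release has a serverless image, so a well-formed label can still point to an image that does not exist.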
The prompt to enter the AWS Account ID appears.

    [3/6] AWS Account ID
    ------------------------------------------------------
    Your 12-digit AWS Account ID is required to:
    • Access AWS ECR (Elastic Container Registry)
    • Identify your AWS resources
    Find it at: AWS Console > Account (top-right) > My Account

    Enter AWS Account ID (12 digits):

- Enter the AWS Account ID.
- Press ENTER.
The prompt to enter the AWS region where the EMR Serverless resources will be deployed appears.

    [4/6] AWS Region
    ------------------------------------------------------
    Specify the AWS region where your EMR Serverless resources will be deployed
    (e.g., us-east-1, us-west-2, eu-west-1).
    Note:
    • Your ECR repository and EMR Serverless application must be in same region.

    Enter AWS Region (e.g., us-east-1):

- Enter the region name.
- Press ENTER.
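The account ID and region together determine the registryHostname value that appears in the sample config.json, <AWS_Account_ID>.dkr.ecr.<region_name>.amazonaws.com. The sketch below checks the two inputs and builds that host name. It is an illustration only: the function name `registry_hostname` is hypothetical, and the region pattern is a loose shape check assumed for this example, not a list of valid AWS regions.

```python
import re


def registry_hostname(account_id: str, region: str) -> str:
    """Return the ECR registry host name in the form used by the sample config.json."""
    # The prompt asks for exactly 12 digits.
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError("AWS Account ID must be exactly 12 digits")
    # Assumed shape check for names such as us-east-1 or eu-west-1.
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-[0-9]", region):
        raise ValueError(f"unexpected region name: {region!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


# 123456789012 is a placeholder account ID used only for this example.
print(registry_hostname("123456789012", "us-east-1"))
```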
The prompt to enter the ECR Repository Name appears.

    [5/6] ECR Repository Name
    ------------------------------------------------------
    AWS ECR (Elastic Container Registry) repository where the BDP Docker image
    will be stored and pulled from.
    Repository naming rules:
    • Lowercase letters, numbers, hyphens, underscores, forward slashes
    • 2-256 characters long

    Enter ECR Repository Name:

- Enter the ECR repository name.
- Press ENTER.
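The naming rules quoted in the prompt can be checked before running the wizard. The sketch below encodes only those two rules (allowed characters and a length of 2 to 256). ECR itself may apply further constraints that are not listed in the prompt, and the helper name `is_valid_repo_name` is hypothetical.

```python
import re

# Lowercase letters, numbers, hyphens, underscores, forward slashes; 2-256 characters.
REPO_NAME = re.compile(r"[a-z0-9_/-]{2,256}")


def is_valid_repo_name(name: str) -> bool:
    """Check a repository name against the rules listed in the configurator prompt."""
    return REPO_NAME.fullmatch(name) is not None


for candidate in ("protegrity-emr-rest", "team/bdp_image", "Protegrity", "a"):
    print(candidate, is_valid_repo_name(candidate))
```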
The prompt to enter the Docker image tag appears.

    [6/6] Docker Image Tag
    ------------------------------------------------------
    Tag for the Docker image in ECR. This helps identify different versions
    of your BDP image.

    Enter Docker Image Tag [default: latest]:

- Enter the Docker image tag.
- Press ENTER.
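With the registry host name, repository name, and tag known, the full image reference follows the usual registry/repository:tag form. The sketch below assembles it and checks the tag against Docker's general tag syntax (up to 128 characters, not starting with a period or hyphen). This is an assumption about how the values fit together, not a description of what emr_serverless_setup_cli.py does internally, and `image_uri` is a hypothetical helper.

```python
import re

# Docker tag syntax: letters, digits, underscores, periods, hyphens; at most 128
# characters; the first character cannot be a period or a hyphen.
TAG = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def image_uri(registry_hostname: str, repository_name: str, image_tag: str = "latest") -> str:
    """Join the three values into registry/repository:tag."""
    if not TAG.fullmatch(image_tag):
        raise ValueError(f"invalid Docker image tag: {image_tag!r}")
    return f"{registry_hostname}/{repository_name}:{image_tag}"


# The account ID is a placeholder; the repository and tag come from the sample config.json.
print(image_uri("123456789012.dkr.ecr.us-east-1.amazonaws.com", "protegrity-emr-rest", "sparkv66"))
```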
The script completes the EMR Serverless configuration.

    ================================================================
    [OK] EMR Serverless configuration completed successfully!
    ================================================================
    Generated config.json file successfully at /bdp/build/BigDataProtector/BigDataProtector/Installation_Files/config.json

    ================================================================
    [OK] Successfully configured Big Data Protector for EMR Serverless!
    ================================================================

    Generated Files in ./Installation_Files/ directory:
      - config.json - EMR Serverless configuration
      - scripts/ - Python deployment CLIs
          +-- emr_serverless_setup_cli.py - Main deployment CLI
          +-- lambda_function.py - Lambda for ESA audit log forwarding
      - runtime/ - BDP JAR files (Spark/Hive)
      - common/ - JcoreLite, config.ini, GetCertificates.sh
      - BigDataProtector_Linux-ALL-64_x86-64_EMR-<EMR_version>-64_<BDP_version>.tgz - Complete package tarball

    ================================================================
    Using emr_serverless_setup_cli.py - Main Deployment Tool
    ================================================================

    This Python CLI provides commands to build and deploy BDP Docker images:

    AVAILABLE COMMANDS:
      validate - Check prerequisites (Docker, AWS CLI, config.json)
      prepare-assets - Update config.ini and GetCertificates.sh with ESA details
      generate-dockerfile - Create Dockerfile from config.json
      build - Build Docker image locally (preserves manual edits)
      push - Push existing image to AWS ECR
      deploy - Full pipeline: validate -> prepare -> generate -> build -> push

    USAGE:
      cd ./Installation_Files/scripts
      python3 emr_serverless_setup_cli.py --config ../config.json <COMMAND>

    TYPICAL WORKFLOW:
      # Option 1: Full automated deployment
      python3 emr_serverless_setup_cli.py --config ../config.json deploy

      # Option 2: Step-by-step with manual edits
      python3 emr_serverless_setup_cli.py --config ../config.json validate
      python3 emr_serverless_setup_cli.py --config ../config.json prepare-assets
      python3 emr_serverless_setup_cli.py --config ../config.json generate-dockerfile
      # Manually edit Dockerfile if needed
      python3 emr_serverless_setup_cli.py --config ../config.json build
      python3 emr_serverless_setup_cli.py --config ../config.json push

    NOTES:
      - During 'deploy' or 'build', you'll be prompted for ESA credentials
      - Credentials are used during build only, NOT stored in image layers
      - ECR authentication is handled automatically by AWS CLI
      - Use 'build' command to preserve manual Dockerfile edits

    ================================================================
    Audit Logging Configuration
    ================================================================

    IMPORTANT: EMR Serverless uses stdout for audit log output.
      - All audit logs are written to standard output (stdout)
      - Logs are automatically captured by AWS CloudWatch Logs
      - CloudWatch logs are stored in your configured S3 bucket

    To access audit logs:
      1. Via CloudWatch: AWS Console -> CloudWatch -> Log Groups
      2. Via S3 Bucket: Check your EMR Serverless application's S3 logs location

    ================================================================
    lambda_function.py - ESA Audit Log Forwarder
    ================================================================

    For centralized audit log forwarding to ESA Audit Store, use the provided
    lambda_function.py - a ready-to-deploy AWS Lambda function.

    LOG FLOW:
      EMR Serverless (stdout) → CloudWatch Logs → Subscription Filter →
      Kinesis Data Stream → Lambda Function → ESA OpenSearch Endpoint

    LAMBDA FUNCTION FEATURES:
      - Triggered by Kinesis Data Stream events
      - Decodes and parses CloudWatch log data from Kinesis records
      - Forwards logs to ESA using OpenSearch bulk API
      - TLS encryption with certificate-based authentication
      - Automatic batching, retries, and error recovery

    REQUIRED ENVIRONMENT VARIABLES:
      ESA_BULK_URL - Full OpenSearch bulk API endpoint
        Example: https://<ESA_IP_Address>:9200/pty_insight_audit/_bulk?pipeline=logs_pipeline
      ESA_CA_SECRET_ID - AWS Secrets Manager ARN for CA certificate
      ESA_CA_SECRET_JSON_KEY- JSON key name in secret (default: ca_pem)
      HTTP_TIMEOUT_SEC - HTTP timeout in seconds (default: 120)
      BULK_MAX_BYTES - Max bulk request size (default: 5242880)
      ONLY_MATCH_SUBSTRING - Filter logs by substring (e.g., "logtype")

    For detailed deployment steps, refer to the EMR Serverless documentation.
    ================================================================

The directory structure of the artifacts after executing the configurator script is listed below.

    Installation_Files/
    ├── config.json
    ├── scripts/
    │   ├── emr_serverless_setup_cli.py
    │   └── lambda_function.py
    ├── runtime/
    │   ├── pephive-3.1.3_v<BDP_version>.jar
    │   └── pepspark-3.5.6_v<BDP_version>.jar
    ├── common/
    │   ├── jcorelite.jar
    │   ├── jcorelite.plm
    │   ├── GetCertificates.sh
    │   └── config.ini.template
    └── BigDataProtector_Linux-ALL-64_x86-64_EMR.Serverless-<EMR_version>-64_<BDP_version>.tgz

A sample output of the config.json file is listed for reference.

    {
      "_comment": "EMR Serverless Big Data Protector Configuration - Generated by configurator.sh",
      "runtime": "spark",
      "region": "<region_name>",
      "registryHostname": "<AWS_Account_ID>.dkr.ecr.<region_name>.amazonaws.com",
      "defaults": {
        "syncHost": "<ESA_IP>",
        "syncPort": "25400",
        "getCertPort": "25400",
        "syncProtocol": "https",
        "syncCAFile": "/opt/esacert/CA.pem",
        "syncCertFile": "/opt/esacert/cert.pem",
        "syncKeyFile": "/opt/esacert/cert.key",
        "syncSecretFile": "/opt/esacert/secret.txt",
        "syncRequestTimeout": 60,
        "certResource": "pty/v1/cert",
        "repositoryName": "protegrity-emr-rest",
        "imageTag": "sparkv66",
        "commonCopy": [
          {
            "source": "common/jcorelite.jar",
            "destSpark": "/usr/lib/spark/jars/jcorelite.jar",
            "destHive": "/usr/lib/hive/lib/jcorelite.jar"
          },
          {
            "source": "common/jcorelite.plm",
            "destSpark": "/usr/lib/spark/jars/jcorelite.plm",
            "destHive": "/usr/lib/hive/lib/jcorelite.plm"
          },
          {
            "source": "common/GetCertificates.sh",
            "destSpark": "/opt/esacert/GetCertificates",
            "destHive": "/opt/esacert/GetCertificates"
          },
          {
            "source": "common/config.ini",
            "destSpark": "/usr/lib/spark/data/config.ini",
            "destHive": "/usr/lib/hive/data/config.ini"
          }
        ]
      },
      "runtimes": {
        "spark": {
          "baseImage": "public.ecr.aws/emr-serverless/spark/emr-7.12.0:latest",
          "contextDir": ".",
          "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
          "copy": [
            { "source": "runtime/pepspark-*.jar", "dest": "/usr/lib/spark/jars/" }
          ],
          "chown": [
            "/usr/lib/spark/jars",
            "/usr/lib/spark/lib",
            "/usr/lib/spark/data",
            "/opt/esacert"
          ],
          "user": "hadoop:hadoop"
        },
        "hive": {
          "baseImage": "public.ecr.aws/emr-serverless/hive/emr-7.12.0:latest",
          "contextDir": ".",
          "yumPackages": ["curl", "vim", "wget", "tar", "gzip"],
          "copy": [
            { "source": "runtime/pephive-*.jar", "dest": "/usr/lib/hive/lib/" }
          ],
          "chown": [
            "/usr/lib/hive/lib",
            "/usr/lib/hive/data",
            "/opt/esacert"
          ],
          "user": "hadoop:hadoop"
        }
      }
    }
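Before running the deployment CLI, it can help to confirm that config.json parses as JSON and that no placeholder was left unreplaced, which is the main risk in Silent Mode. The sketch below walks a parsed configuration and reports values that still contain a <name> style placeholder, then lists the step-by-step commands from the wizard output in order. The placeholder pattern is an assumption based on the sample file in this section, the actual Silent Mode template may use a different style, and `find_placeholders` and `workflow_commands` are hypothetical helpers.

```python
import json
import re

# Assumed placeholder style, for example <region_name> or <ESA_IP>.
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")

# Step-by-step commands in the order shown in the wizard output.
STEPS = ["validate", "prepare-assets", "generate-dockerfile", "build", "push"]


def find_placeholders(node, path="$"):
    """Return the JSON paths of string values that still contain a placeholder."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_placeholders(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and PLACEHOLDER.search(node):
        hits.append(path)
    return hits


def workflow_commands(config_path="../config.json"):
    """Build one argument list per step, suitable for subprocess.run."""
    return [
        ["python3", "emr_serverless_setup_cli.py", "--config", config_path, step]
        for step in STEPS
    ]


# A small inline document stands in for config.json; json.loads raises on bad syntax.
sample = json.loads(
    '{"region": "<region_name>", "defaults": {"syncHost": "<ESA_IP>", "syncPort": "25400"}}'
)
print("Unreplaced placeholders:", find_placeholders(sample))
for command in workflow_commands():
    print(" ".join(command))
```

To run the steps instead of printing them, each argument list could be passed to subprocess.run with check=True from the Installation_Files/scripts directory, stopping at the first failure. The build step prompts for ESA credentials, so it needs an interactive terminal.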