EMR Serverless architecture
Amazon EMR Serverless is an on-demand deployment option that removes the need to manage clusters for big data workloads. Unlike cluster-based EMR deployments, EMR Serverless provisions compute resources based on job requirements, so capacity scales with the workload without manual intervention.
At its core, EMR Serverless runs Spark or Hive applications in containerized executors within an isolated, secure environment. AWS orchestrates these containers and handles resource allocation and fault tolerance. The design supports Protegrity data protection integration, making it suitable for enterprise deployments where compliance and security are critical.
Key components include:
- Serverless Runtime: Supports Spark and Hive for analytics and ETL.
- Dynamic Scaling: Automatically adjusts resources to workload demands.
- Logging and Monitoring: Driver and executor logs are streamed to CloudWatch, with optional forwarding to external systems via Kinesis and Lambda for near real-time insights.
- Deployment Workflow: Applications are packaged as Docker images, stored in Amazon ECR, and executed in EMR Serverless environments for consistent and reproducible runs.
The architecture for the EMR Serverless distribution of the Big Data Protector is depicted in the image below.

The overall process of installing the Big Data Protector in the EMR Serverless environment is outlined below.
Step 1: Executing the Configurator Script
- Interactive prompt collects all the configuration parameters.
- Input: ESA host/ports, AWS account/region, EMR Serverless application type, and ECR repository names.
- Output: Installation_Files/ directory with config.json and all the required files.
- Files created: config.json, copied JARs, scripts, and the certificate scripts.
Note: For more information, refer to Executing the Configurator Script.
Step 2: Deploying the BDP Image
python3 emr_serverless_setup_cli.py --config ../config.json deploy
Note: For more information, refer to EMR Serverless Setup CLI.
Substep: Validating the Prerequisites
The script:
- Checks Docker, AWS CLI, credentials
- Verifies ECR repository exists
- Confirms all source files present
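The local part of these checks can be sketched in a few lines of Python. This is a minimal illustration, not the setup script's actual code: the function name `check_prerequisites` and the file list are hypothetical, and the credential and ECR repository checks are left out because they require calls to AWS.

```python
import shutil
from pathlib import Path


def check_prerequisites(required_tools, required_files):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    for path in required_files:
        if not Path(path).is_file():
            problems.append(f"missing file: {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical inputs; the real script decides which files it needs.
    for issue in check_prerequisites(["docker", "aws"], ["config.json"]):
        print(issue)
```

Collecting every problem before reporting, rather than stopping at the first one, lets all missing prerequisites be fixed in a single pass.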
Substep: Preparing the Assets
The script:
- Reads config.json and config.ini.template
- Generates config.ini with:
  - [sync] section: ESA policy server connection (host:25400)
  - [log] section: output=stdout
- Updates the GetCertificates.sh script with ESA host/port
Note: After preparing the assets, modify the config.ini file if required.
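To make the shape of the generated file concrete, the following sketch builds a config.ini with the two sections named above using Python's standard `configparser`. It is an assumption-laden illustration: the key names `host` and `port` in the [sync] section are hypothetical, the function `render_config_ini` is not part of the product, and the real script starts from config.ini.template rather than from an empty parser. Only `output=stdout` and port 25400 come from this page.

```python
import configparser
import io


def render_config_ini(esa_host, esa_port=25400):
    """Return config.ini text with a [sync] and a [log] section."""
    parser = configparser.ConfigParser()
    # "host" and "port" are hypothetical key names for the ESA connection.
    parser["sync"] = {"host": esa_host, "port": str(esa_port)}
    # output=stdout is the [log] setting described in this section.
    parser["log"] = {"output": "stdout"}
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    # esa.example.com is a placeholder host name.
    print(render_config_ini("esa.example.com"))
```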
Substep: Generating the Dockerfile
The script:
- Generates the Dockerfile using the values from the config.json file.
Note: After generating the Dockerfile, modify it if required.
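As a rough illustration of template-based Dockerfile generation, the sketch below fills a small template from a configuration dictionary. The keys `base_image`, `source_dir`, and `target_dir` are hypothetical, since the config.json schema is not shown on this page, and the generated content is far simpler than what the setup script produces. The switch to `root` for modifications and back to `hadoop:hadoop` at the end follows AWS guidance for EMR Serverless custom images.

```python
from string import Template

# Hypothetical template; the real script's Dockerfile contains more steps.
DOCKERFILE_TEMPLATE = Template(
    "FROM $base_image\n"
    "USER root\n"
    "COPY $source_dir/ $target_dir/\n"
    "USER hadoop:hadoop\n"
)


def render_dockerfile(config):
    """Render Dockerfile text; raises KeyError if a required key is absent."""
    return DOCKERFILE_TEMPLATE.substitute(
        base_image=config["base_image"],
        source_dir=config["source_dir"],
        target_dir=config["target_dir"],
    )


if __name__ == "__main__":
    # All three values are placeholders for illustration only.
    print(render_dockerfile({
        "base_image": "example/base:latest",
        "source_dir": "Installation_Files",
        "target_dir": "/opt/example",
    }))
```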
Substep: Building the Docker Image
The script:
- Prompts for ESA credentials (username/password or JWT token)
- Downloads the certificates from ESA:25400
- Builds the Docker image
Step 3: Pushing the Image to ECR
The script:
- Logs in to ECR using AWS CLI
- Pushes image to ECR repository
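The login and push steps map onto standard AWS CLI and Docker commands. The sketch below only assembles those command strings so they can be reviewed before running. The function names and sample values are hypothetical, and the registry host format assumes the standard AWS partition (`amazonaws.com`). The setup script may invoke these steps differently.

```python
def ecr_registry(account_id, region):
    """Registry host for a private ECR registry in the standard AWS partition."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_push_commands(account_id, region, repository, tag, local_image):
    """Return the shell commands that log in to ECR, tag, and push an image."""
    registry = ecr_registry(account_id, region)
    remote = f"{registry}/{repository}:{tag}"
    return [
        # The password is piped to docker login so it never appears in argv.
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker tag {local_image} {remote}",
        f"docker push {remote}",
    ]


if __name__ == "__main__":
    # Placeholder account, region, repository, and image names.
    for command in ecr_push_commands(
        "123456789012", "us-east-1", "example-repo", "latest", "example-image:latest"
    ):
        print(command)
```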
The Big Data Protector build provides an automated script to execute the above-mentioned steps. For more information, refer to EMR Serverless Setup CLI.
Understanding the Logging Architecture
- The driver/executor logs are written to the CloudWatch log group.
- The CloudWatch Logs subscription filter streams the matching log lines into Kinesis Data Streams.
- The Lambda function consumes the Kinesis batches, extracts only the Protegrity audit JSON lines, builds an OpenSearch Bulk (_bulk) payload, and invokes the ESA endpoint.
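The Lambda stage of this pipeline can be sketched as follows. CloudWatch Logs delivers subscription data to Kinesis as gzip-compressed JSON, and Lambda receives each Kinesis record base64-encoded, so the handler decodes in that order. Everything product-specific here is an assumption: the marker key `protegrity_audit` used to recognize audit lines and the index name `audit-logs` are hypothetical, and the final call to the ESA endpoint is omitted because its URL and authentication are not described on this page.

```python
import base64
import gzip
import json


def decode_kinesis_record(record):
    """Decode one Kinesis record that carries a CloudWatch Logs payload."""
    raw = base64.b64decode(record["kinesis"]["data"])
    return json.loads(gzip.decompress(raw))


def extract_audit_events(payload, marker_key="protegrity_audit"):
    """Yield JSON log lines containing marker_key (a hypothetical marker)."""
    if payload.get("messageType") != "DATA_MESSAGE":
        return  # control messages carry no log events
    for event in payload.get("logEvents", []):
        try:
            doc = json.loads(event["message"])
        except ValueError:
            continue  # not a JSON line, so not an audit record
        if isinstance(doc, dict) and marker_key in doc:
            yield doc


def build_bulk_payload(docs, index_name="audit-logs"):
    """Build a newline-delimited _bulk body; index_name is hypothetical."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The _bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


def handler(event, context):
    """Return the _bulk body; sending it to ESA is left out of this sketch."""
    docs = []
    for record in event.get("Records", []):
        docs.extend(extract_audit_events(decode_kinesis_record(record)))
    return build_bulk_payload(docs)
```

Filtering inside the function complements the subscription filter: the filter reduces the volume that reaches Kinesis, while the function discards any remaining lines that are not audit JSON.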
Note: For the CloudWatch subscription filter, provide a filter pattern that matches the type of logs that are generated.
Note: For more information, refer to Setting up the Log Forwarder.