Large File Processing
Overview
S3 Protector processes files uploaded to Amazon S3 by invoking an AWS Lambda function for
each ObjectCreated event. A single Lambda invocation is bounded by a hard 15-minute
timeout and a maximum of 10 240 MB RAM. Large-file processing removes those ceilings
by automatically splitting supported files into chunks, protecting each chunk in a separate
parallel Processor Lambda, and reassembling the results into a single output file.
Two complementary mechanisms handle large files:
| Mechanism | What it does |
|---|---|
| Row-level streaming | Reads the source file in row chunks instead of loading it entirely into RAM. Controlled by FILE_CHUNK_SIZE. Supported for CSV, Parquet, and JSON Lines (NDJSON). Not supported by JSON array files. |
| Byte-range dispatch | Splits files that meet or exceed a configurable byte threshold across N parallel Processor Lambdas, each handling one byte-range chunk. A Dispatcher coordinates the split; a Merger reassembles the parts. Controlled by LARGE_FILE_DISPATCH_THRESHOLD_BYTES and LARGE_FILE_CHUNK_SIZE_BYTES. Supported for CSV, Parquet, and both JSON Lines and JSON array. |
Dispatch and row-level streaming are complementary: a dispatched chunk is still streamed row-by-row inside each Processor Lambda, keeping per-invocation memory bounded.
How Dispatch Works: Step by Step
1 — Dispatcher
When a regular data file (.csv, .parquet, .json) arrives in the source bucket and its
size meets or exceeds LARGE_FILE_DISPATCH_THRESHOLD_BYTES, the Lambda acts as a
Dispatcher:
- Reads the file’s byte size.
- Splits the file into byte-range chunks of approximately
LARGE_FILE_CHUNK_SIZE_BYTESeach, aligned on natural record boundaries (newline for CSV and JSON Lines, row-group boundary for Parquet,}, {separator for JSON arrays). - Writes one manifest file per chunk into the source bucket. Each manifest is a small JSON object that describes the chunk’s byte range (or row-group indices for Parquet), the total chunk count, and a snapshot of the mapping configuration.
- Returns immediately. The Dispatcher never reads or transforms any record data.
2 — Processor (one per chunk)
Each manifest landing in S3 fires a separate ObjectCreated event that triggers a new
Processor Lambda invocation:
- Reads the manifest from S3.
- Fetches only the byte range (or row groups) assigned to this chunk.
- Protects (tokenises) the data by calling the Cloud API Lambda.
- Writes the protected output as a
.partfile into the staging parts directory in the source bucket. - Deletes the manifest file. Manifest deletion is the chunk-completion signal.
3 — Merger (inline, triggered by the last Processor)
After all chunks complete, the Merger:
- Acquires a merge lock by creating a
merge.lockfile in the staging parts directory. - Assembles all
.partfiles into the single final output object in the target bucket. - Deletes the
.partfiles from the staging directory on confirmed merge success. - Releases the merge lock.
- Optionally deletes the original source file when
DELETE_INPUT_FILES=true.
Staging Directory Layout
All transient dispatch artifacts are stored under a per-job staging directory in the
source bucket. The directory is named by appending .dispatch/ to the source file key:
{source_key}.dispatch/
manifests/
chunk_0000.manifest.json
chunk_0001.manifest.json
…
chunk_NNNN.manifest.json
parts/
chunk_0000.part
chunk_0001.part
…
chunk_NNNN.part
merge.lock ← transient lock, present only during final merge
Example — for a source file data/large.csv in bucket my-source-bucket:
s3://my-source-bucket/data/large.csv.dispatch/
manifests/chunk_0000.manifest.json
manifests/chunk_0001.manifest.json
parts/chunk_0000.part
parts/chunk_0001.part
parts/merge.lock
Manifest files (manifests/chunk_NNNN.manifest.json)
- Created by: the Dispatcher at job start.
- Contents: byte range (or row-group indices for Parquet), total chunk count, source and
output bucket/key, a snapshot of
mapping.jsoncaptured at dispatch time (so all Processors use identical column configuration even ifmapping.jsonis replaced mid-job), and the rawOUTPUT_FILE_FORMATvalue. - Removed by: the Processor Lambda responsible for that chunk, immediately after
writing its
.partfile. Manifest deletion signals chunk completion. - When NOT removed: if bucket doesn’t have a trigger for
.jsonfiles; if a Processor Lambda fails before completing its write, its manifest is left in place. The manifest count in CloudWatch Logs will not reach zero and the merge will not proceed.
Part files (parts/chunk_NNNN.part)
- Created by: each Processor Lambda, written to the source bucket’s staging directory.
- Contents: the protected output for the assigned byte range (or row groups). Format matches the resolved output format (CSV, Parquet, JSON Lines, or JSON array wrapper).
- Removed by: the Merger after all chunks have completed and the final output object
has been successfully assembled via
UploadPartCopy. - When NOT removed: if the merge step fails, the
.partfiles are left in place and the merge lock is released. Check CloudWatch Logs, fix the root cause, delete the entire.dispatch/directory, and restart file processing.
Merge lock (parts/merge.lock)
- Created by: the first Processor to reach the merge step.
- Removed by: the merging Processor, whether the merge succeeded or failed.
- Lifetime: seconds to a few minutes, the duration of the final merge operation.
When the staging directory is NOT cleaned up
The entire .dispatch/ staging directory remains if:
- Source bucket does not have a trigger for
.jsonfiles. - One or more Processor Lambdas failed before writing their
.partfile. - The merge step raised an error.
Recovery: Check CloudWatch Logs for the error, delete
s3://{source-bucket}/{source_key}.dispatch/ manually, then restart file processing.
Note: The Dispatcher will not proceed if the
.dispatch/directory already exists. Delete the staging directory before retrying.
Environment Variable Reference
Dispatch Variables
| Variable | Default | Range | Description |
|---|---|---|---|
LARGE_FILE_DISPATCH_THRESHOLD_BYTES | 262144000 (250 MB) | 100 MB – 100 GB | Size (in bytes), files of this size or larger are split across multiple Processor Lambdas. Omit or leave empty to disable dispatch entirely. Performance: lower values trigger dispatch sooner, increasing parallelism for medium-sized files at the cost of dispatch overhead. |
LARGE_FILE_CHUNK_SIZE_BYTES | 157286400 (150 MB) | 100 MB – 10 GB | Size (in bytes) of each byte-range chunk assigned to one Processor Lambda. Performance: smaller chunks → more parallel Lambdas → faster wall-clock time, but higher Lambda concurrency consumption. Larger chunks → fewer Lambdas → less concurrency pressure, but each invocation takes longer. The maximum supported number of chunks per job is 9999. |
Supported formats for dispatch: CSV, Parquet, JSON (both JSON Lines and JSON array).
XLSX dispatch is not supported.
Row-Level Streaming Variables
| Variable | Default | Range | Description |
|---|---|---|---|
FILE_CHUNK_SIZE | 100000 | 100 – 100 000 000 | Number of rows read from the source file at a time during streaming I/O. Performance: larger values reduce S3 read round-trips at the cost of higher peak RAM usage per invocation. Reduce if the Lambda runs out of memory on wide rows; increase to reduce latency on narrow rows. Applies to CSV, Parquet, and JSON Lines. Does not apply to JSON array files — those load the entire assigned byte range into RAM as one DataFrame. |
MAX_BATCH_SIZE | 25000 | 100 – 100 000 000 | Maximum number of values sent to the Cloud API Lambda in a single invocation payload. Performance: the AWS Lambda synchronous invocation payload limit is 6 MB. Increasing this reduces the number of Cloud API calls (lower latency), but risks hitting the 6 MB limit for wide columns. Keep MAX_BATCH_SIZE ≤ FILE_CHUNK_SIZE. |
Parallelism Variable
| Variable | Default | Range | Description |
|---|---|---|---|
MAX_PARALLEL_PROTECT_CALLS | 1000 | 0 – 10 000 | Maximum number of concurrent asyncio tasks that call the Cloud API Lambda in parallel within a single Processor Lambda invocation. Performance: higher values increase throughput by overlapping network I/O to the Cloud API, at the cost of more open connections and memory per task. |
Recommended Configuration for Large Files
Scenario A — Large CSV / JSON Lines files, row-level streaming only (no dispatch)
Suitable when files are up to ~1 GB and stay within the 15-minute Lambda timeout.
FILE_CHUNK_SIZE = 150000
MAX_BATCH_SIZE = 25000
MAX_PARALLEL_PROTECT_CALLS = 1000
OUTPUT_FILE_FORMAT = preserve_input_format
OUTPUT_FILE_COMPRESSION_TYPE = gzip
MIN_LOG_LEVEL = info
Memory footprint per invocation: approximately FILE_CHUNK_SIZE × avg_row_bytes × 2
(current chunk + output buffer).
Scenario B — Very large CSV / Parquet / JSON files, parallel dispatch enabled
Suitable for files from 256 MB to several GB.
LARGE_FILE_DISPATCH_THRESHOLD_BYTES = 1048576000 # 1 GB — enable dispatch above this
LARGE_FILE_CHUNK_SIZE_BYTES = 157286400 # 150 MB per Processor Lambda
FILE_CHUNK_SIZE = 150000
MAX_BATCH_SIZE = 25000
MAX_PARALLEL_PROTECT_CALLS = 1000
OUTPUT_FILE_FORMAT = preserve_input_format
OUTPUT_FILE_COMPRESSION_TYPE = gzip
MIN_LOG_LEVEL = info
Each Processor Lambda handles one ~150 MB byte range. A 1.5 GB file produces ~10 parallel invocations that each finish in a fraction of the total serial time.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Lambda times out on large files | File too large for serial processing | Enable dispatch for this file (LARGE_FILE_DISPATCH_THRESHOLD_BYTES) |
| Lambda runs out of memory | FILE_CHUNK_SIZE too large, or cross-format conversion | Reduce FILE_CHUNK_SIZE; use the same input and output format |
| Output file not produced after all chunks succeed | Processor Lambda(s) failed before writing .part files | Check CloudWatch Logs, delete the .dispatch/ staging directory and restart file processing |
.part files accumulate and are never deleted | Merge failed | Check CloudWatch Logs, delete the .dispatch/ staging directory and restart file processing |
RuntimeError: Dispatch directory already exists | Previous dispatch run did not complete | Check CloudWatch Logs, delete the .dispatch/ staging directory and restart file processing |
| Cloud API payload errors | MAX_BATCH_SIZE too large | Reduce MAX_BATCH_SIZE until request fits under 6 MB |
Check CloudWatch Logs
Use the following CloudWatch Logs Insights query to surface errors across all Lambda invocations for a job:
fields @timestamp, @message
| filter @message like /(?i)error/ or @message like /Status: timeout/
| sort @timestamp desc
| limit 1000
Restart File Processing
Before retrying, delete the entire .dispatch/ staging directory so the Dispatcher does not abort on startup.
Then trigger S3 Protector by either re-uploading the source file or invoking the Lambda directly with a synthetic S3 test event.
Replace my-source-bucket and data/large.csv with the values for your job:
{
"Records": [
{
"s3": {
"bucket": {
"name": "my-source-bucket",
"arn": "arn:aws:s3:::my-source-bucket"
},
"object": {
"key": "data/large.csv"
}
}
}
]
}
Feedback
Was this page helpful?