Large File Processing

Known product limitations.

Overview

S3 Protector processes files uploaded to Amazon S3 by invoking an AWS Lambda function for each ObjectCreated event. A single Lambda invocation is bounded by a hard 15-minute timeout and a maximum of 10 240 MB RAM. Large-file processing removes those ceilings by automatically splitting supported files into chunks, protecting each chunk in a separate parallel Processor Lambda, and reassembling the results into a single output file.

Two complementary mechanisms handle large files:

MechanismWhat it does
Row-level streamingReads the source file in row chunks instead of loading it entirely into RAM. Controlled by FILE_CHUNK_SIZE. Supported for CSV, Parquet, and JSON Lines (NDJSON). Not supported by JSON array files.
Byte-range dispatchSplits files that meet or exceed a configurable byte threshold across N parallel Processor Lambdas, each handling one byte-range chunk. A Dispatcher coordinates the split; a Merger reassembles the parts. Controlled by LARGE_FILE_DISPATCH_THRESHOLD_BYTES and LARGE_FILE_CHUNK_SIZE_BYTES. Supported for CSV, Parquet, and both JSON Lines and JSON array.

Dispatch and row-level streaming are complementary: a dispatched chunk is still streamed row-by-row inside each Processor Lambda, keeping per-invocation memory bounded.


How Dispatch Works: Step by Step

1 — Dispatcher

When a regular data file (.csv, .parquet, .json) arrives in the source bucket and its size meets or exceeds LARGE_FILE_DISPATCH_THRESHOLD_BYTES, the Lambda acts as a Dispatcher:

  1. Reads the file’s byte size.
  2. Splits the file into byte-range chunks of approximately LARGE_FILE_CHUNK_SIZE_BYTES each, aligned on natural record boundaries (newline for CSV and JSON Lines, row-group boundary for Parquet, }, { separator for JSON arrays).
  3. Writes one manifest file per chunk into the source bucket. Each manifest is a small JSON object that describes the chunk’s byte range (or row-group indices for Parquet), the total chunk count, and a snapshot of the mapping configuration.
  4. Returns immediately. The Dispatcher never reads or transforms any record data.

2 — Processor (one per chunk)

Each manifest landing in S3 fires a separate ObjectCreated event that triggers a new Processor Lambda invocation:

  1. Reads the manifest from S3.
  2. Fetches only the byte range (or row groups) assigned to this chunk.
  3. Protects (tokenises) the data by calling the Cloud API Lambda.
  4. Writes the protected output as a .part file into the staging parts directory in the source bucket.
  5. Deletes the manifest file. Manifest deletion is the chunk-completion signal.

3 — Merger (inline, triggered by the last Processor)

After all chunks complete, the Merger:

  1. Acquires a merge lock by creating a merge.lock file in the staging parts directory.
  2. Assembles all .part files into the single final output object in the target bucket.
  3. Deletes the .part files from the staging directory on confirmed merge success.
  4. Releases the merge lock.
  5. Optionally deletes the original source file when DELETE_INPUT_FILES=true.

Staging Directory Layout

All transient dispatch artifacts are stored under a per-job staging directory in the source bucket. The directory is named by appending .dispatch/ to the source file key:

{source_key}.dispatch/
    manifests/
        chunk_0000.manifest.json
        chunk_0001.manifest.json
        chunk_NNNN.manifest.json
    parts/
        chunk_0000.part
        chunk_0001.part
        chunk_NNNN.part
        merge.lock              ← transient lock, present only during final merge

Example — for a source file data/large.csv in bucket my-source-bucket:

s3://my-source-bucket/data/large.csv.dispatch/
    manifests/chunk_0000.manifest.json
    manifests/chunk_0001.manifest.json
    parts/chunk_0000.part
    parts/chunk_0001.part
    parts/merge.lock

Manifest files (manifests/chunk_NNNN.manifest.json)

  • Created by: the Dispatcher at job start.
  • Contents: byte range (or row-group indices for Parquet), total chunk count, source and output bucket/key, a snapshot of mapping.json captured at dispatch time (so all Processors use identical column configuration even if mapping.json is replaced mid-job), and the raw OUTPUT_FILE_FORMAT value.
  • Removed by: the Processor Lambda responsible for that chunk, immediately after writing its .part file. Manifest deletion signals chunk completion.
  • When NOT removed: if bucket doesn’t have a trigger for .json files; if a Processor Lambda fails before completing its write, its manifest is left in place. The manifest count in CloudWatch Logs will not reach zero and the merge will not proceed.

Part files (parts/chunk_NNNN.part)

  • Created by: each Processor Lambda, written to the source bucket’s staging directory.
  • Contents: the protected output for the assigned byte range (or row groups). Format matches the resolved output format (CSV, Parquet, JSON Lines, or JSON array wrapper).
  • Removed by: the Merger after all chunks have completed and the final output object has been successfully assembled via UploadPartCopy.
  • When NOT removed: if the merge step fails, the .part files are left in place and the merge lock is released. Check CloudWatch Logs, fix the root cause, delete the entire .dispatch/ directory, and restart file processing.

Merge lock (parts/merge.lock)

  • Created by: the first Processor to reach the merge step.
  • Removed by: the merging Processor, whether the merge succeeded or failed.
  • Lifetime: seconds to a few minutes, the duration of the final merge operation.

When the staging directory is NOT cleaned up

The entire .dispatch/ staging directory remains if:

  • Source bucket does not have a trigger for .json files.
  • One or more Processor Lambdas failed before writing their .part file.
  • The merge step raised an error.

Recovery: Check CloudWatch Logs for the error, delete s3://{source-bucket}/{source_key}.dispatch/ manually, then restart file processing.

Note: The Dispatcher will not proceed if the .dispatch/ directory already exists. Delete the staging directory before retrying.


Environment Variable Reference

Dispatch Variables

VariableDefaultRangeDescription
LARGE_FILE_DISPATCH_THRESHOLD_BYTES262144000 (250 MB)100 MB – 100 GBSize (in bytes), files of this size or larger are split across multiple Processor Lambdas. Omit or leave empty to disable dispatch entirely. Performance: lower values trigger dispatch sooner, increasing parallelism for medium-sized files at the cost of dispatch overhead.
LARGE_FILE_CHUNK_SIZE_BYTES157286400 (150 MB)100 MB – 10 GBSize (in bytes) of each byte-range chunk assigned to one Processor Lambda. Performance: smaller chunks → more parallel Lambdas → faster wall-clock time, but higher Lambda concurrency consumption. Larger chunks → fewer Lambdas → less concurrency pressure, but each invocation takes longer. The maximum supported number of chunks per job is 9999.

Supported formats for dispatch: CSV, Parquet, JSON (both JSON Lines and JSON array).
XLSX dispatch is not supported.

Row-Level Streaming Variables

VariableDefaultRangeDescription
FILE_CHUNK_SIZE100000100 – 100 000 000Number of rows read from the source file at a time during streaming I/O. Performance: larger values reduce S3 read round-trips at the cost of higher peak RAM usage per invocation. Reduce if the Lambda runs out of memory on wide rows; increase to reduce latency on narrow rows. Applies to CSV, Parquet, and JSON Lines. Does not apply to JSON array files — those load the entire assigned byte range into RAM as one DataFrame.
MAX_BATCH_SIZE25000100 – 100 000 000Maximum number of values sent to the Cloud API Lambda in a single invocation payload. Performance: the AWS Lambda synchronous invocation payload limit is 6 MB. Increasing this reduces the number of Cloud API calls (lower latency), but risks hitting the 6 MB limit for wide columns. Keep MAX_BATCH_SIZEFILE_CHUNK_SIZE.

Parallelism Variable

VariableDefaultRangeDescription
MAX_PARALLEL_PROTECT_CALLS10000 – 10 000Maximum number of concurrent asyncio tasks that call the Cloud API Lambda in parallel within a single Processor Lambda invocation. Performance: higher values increase throughput by overlapping network I/O to the Cloud API, at the cost of more open connections and memory per task.

Scenario A — Large CSV / JSON Lines files, row-level streaming only (no dispatch)

Suitable when files are up to ~1 GB and stay within the 15-minute Lambda timeout.

FILE_CHUNK_SIZE              = 150000
MAX_BATCH_SIZE               = 25000
MAX_PARALLEL_PROTECT_CALLS   = 1000
OUTPUT_FILE_FORMAT           = preserve_input_format
OUTPUT_FILE_COMPRESSION_TYPE = gzip
MIN_LOG_LEVEL                = info

Memory footprint per invocation: approximately FILE_CHUNK_SIZE × avg_row_bytes × 2 (current chunk + output buffer).

Scenario B — Very large CSV / Parquet / JSON files, parallel dispatch enabled

Suitable for files from 256 MB to several GB.

LARGE_FILE_DISPATCH_THRESHOLD_BYTES = 1048576000   # 1 GB — enable dispatch above this
LARGE_FILE_CHUNK_SIZE_BYTES         = 157286400    # 150 MB per Processor Lambda
FILE_CHUNK_SIZE                     = 150000
MAX_BATCH_SIZE                      = 25000
MAX_PARALLEL_PROTECT_CALLS          = 1000
OUTPUT_FILE_FORMAT                  = preserve_input_format
OUTPUT_FILE_COMPRESSION_TYPE        = gzip
MIN_LOG_LEVEL                       = info

Each Processor Lambda handles one ~150 MB byte range. A 1.5 GB file produces ~10 parallel invocations that each finish in a fraction of the total serial time.


Troubleshooting

SymptomLikely causeFix
Lambda times out on large filesFile too large for serial processingEnable dispatch for this file (LARGE_FILE_DISPATCH_THRESHOLD_BYTES)
Lambda runs out of memoryFILE_CHUNK_SIZE too large, or cross-format conversionReduce FILE_CHUNK_SIZE; use the same input and output format
Output file not produced after all chunks succeedProcessor Lambda(s) failed before writing .part filesCheck CloudWatch Logs, delete the .dispatch/ staging directory and restart file processing
.part files accumulate and are never deletedMerge failedCheck CloudWatch Logs, delete the .dispatch/ staging directory and restart file processing
RuntimeError: Dispatch directory already existsPrevious dispatch run did not completeCheck CloudWatch Logs, delete the .dispatch/ staging directory and restart file processing
Cloud API payload errorsMAX_BATCH_SIZE too largeReduce MAX_BATCH_SIZE until request fits under 6 MB

Check CloudWatch Logs

Use the following CloudWatch Logs Insights query to surface errors across all Lambda invocations for a job:

fields @timestamp, @message
| filter @message like /(?i)error/ or @message like /Status: timeout/
| sort @timestamp desc
| limit 1000

Restart File Processing

Before retrying, delete the entire .dispatch/ staging directory so the Dispatcher does not abort on startup.

Then trigger S3 Protector by either re-uploading the source file or invoking the Lambda directly with a synthetic S3 test event. Replace my-source-bucket and data/large.csv with the values for your job:

{
  "Records": [
    {
      "s3": {
        "bucket": {
          "name": "my-source-bucket",
          "arn": "arn:aws:s3:::my-source-bucket"
        },
        "object": {
          "key": "data/large.csv"
        }
      }
    }
  ]
}

Last modified : April 30, 2026