Architecture

Deployment architecture and connectivity

Deployment Architecture

The Protegrity S3 solution should be deployed in the customer’s Cloud account within the same AWS region as the Protegrity Cloud Protect API. The Cloud Protect API is required.

Two S3 Buckets are required for processing data with S3 Protector:

Source bucket for collecting data and triggering the protection job.
Target bucket for processed files.

The following diagram shows the high-level architecture of the S3 Protector.

The ruleset for processing a type of input dataset is defined by a metadata file called mapping.json. A dedicated folder with a mapping.json must be created in the S3 input bucket for each distinct file schema. The mapping.json provides:

processing instructions for each column of the input data file
specification for reading the input data file
specification for writing the processed data file

Input and output data file specifications are passed to the Pandas library as arguments, offering flexibility to handle diverse data file structures. Column instructions define the protect operation and the data element to apply to each column.
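As an illustration of the idea, the sketch below shows how a mapping file could carry reader and writer specifications that are forwarded to Pandas as keyword arguments. The key names (input, output, spec, columns, operation, data_element), the column names, and the data element names are hypothetical and chosen only for this example; refer to the product's mapping.json reference for the actual schema.

```python
import json

# Hypothetical mapping.json content. The layout and all names are assumptions.
MAPPING_JSON = """
{
  "input":  {"format": "csv", "spec": {"sep": ",", "header": 0, "dtype": "str"}},
  "output": {"format": "csv", "spec": {"sep": ",", "index": false}},
  "columns": {
    "email": {"operation": "protect", "data_element": "deEmail"},
    "ssn":   {"operation": "protect", "data_element": "deSSN"}
  }
}
"""


def load_mapping(text):
    """Parse the mapping text and check that the assumed sections exist."""
    mapping = json.loads(text)
    for section in ("input", "output", "columns"):
        if section not in mapping:
            raise ValueError(f"mapping is missing the '{section}' section")
    return mapping


def column_plan(mapping):
    """Return (column, operation, data element) tuples in file order."""
    return [
        (name, rule["operation"], rule["data_element"])
        for name, rule in mapping["columns"].items()
    ]


def read_input(path, mapping):
    """Forward the input spec to pandas.read_csv as keyword arguments."""
    import pandas as pd  # imported lazily so the helpers above run without pandas

    return pd.read_csv(path, **mapping["input"]["spec"])


def write_output(frame, path, mapping):
    """Forward the output spec to DataFrame.to_csv as keyword arguments."""
    frame.to_csv(path, **mapping["output"]["spec"])


mapping = load_mapping(MAPPING_JSON)
for column, operation, data_element in column_plan(mapping):
    print(f"{column}: {operation} with {data_element}")
```

Because the specifications are plain keyword arguments, a different delimiter or header layout only requires a change to the mapping file, not to the processing code.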

The Protegrity S3 Protector invokes the Cloud Protect API to execute the policy on the data. The processed data is saved to the specified target S3 bucket.

The target bucket can be the basis of a data lake or a staging area for loading databases. For example, Snowflake Snowpipe can be used to automatically ingest the processed (i.e., protected) data into Snowflake. Amazon Redshift provides a similar mechanism for bulk loading data from Amazon S3.

For more information about installing and managing the Cloud API component, refer to the Cloud API on AWS Protegrity documentation.

AWS Lambda Timeout

S3 Protector is an AWS Lambda function. Every AWS Lambda function has a maximum execution time called a ‘timeout’. When you install this product with the supplied CloudFormation template, its timeout is set to 5 minutes. The timeout can be raised to a maximum of 15 minutes. The function timeout limits the size of the file that this product can process. If S3 Protector runs out of time while processing a file, it fails with ‘Status: timeout’, which appears in the logs similar to the following:

REPORT RequestId: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa	Duration: 300000.00 ms	Billed Duration: 300000 ms	Memory Size: 1728 MB	Max Memory Used: 654 MB	Init Duration: 3868.74 ms	Status: timeout
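A log line of this shape can be checked for a timeout programmatically. The following is a minimal sketch that assumes the REPORT fields are tab-separated "Name: value" pairs, as in the sample above; the function names are illustrative and not part of the product.

```python
def parse_report_line(line):
    """Split a Lambda REPORT log line into a dict of field name to value."""
    body = line.strip()
    if not body.startswith("REPORT "):
        raise ValueError("not a Lambda REPORT line")
    fields = {}
    for part in body[len("REPORT "):].split("\t"):
        key, sep, value = part.partition(": ")
        if sep:  # skip fragments that are not 'Name: value' pairs
            fields[key.strip()] = value.strip()
    return fields


def timed_out(line):
    """Return True when the REPORT line carries 'Status: timeout'."""
    return parse_report_line(line).get("Status") == "timeout"


SAMPLE = (
    "REPORT RequestId: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa\t"
    "Duration: 300000.00 ms\tBilled Duration: 300000 ms\t"
    "Memory Size: 1728 MB\tMax Memory Used: 654 MB\t"
    "Init Duration: 3868.74 ms\tStatus: timeout"
)

if timed_out(SAMPLE):
    print("Timed out after", parse_report_line(SAMPLE)["Duration"])
```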

When S3 Protector runs out of time, it does not have an opportunity to mark the file upload as completed. Incomplete uploads do not appear in the S3 console; however, you are still charged for them. Review the AWS documentation on how to manage incomplete multipart uploads.
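One common way to limit these charges is an S3 lifecycle rule that aborts multipart uploads left incomplete for a number of days. The sketch below builds such a rule and applies it with boto3. The bucket name, the rule ID, and the 7-day window are assumptions for illustration. Note that put_bucket_lifecycle_configuration replaces the bucket's existing lifecycle configuration, so any existing rules should be merged in first.

```python
def abort_incomplete_uploads_rule(days=7, prefix=""):
    """Build a lifecycle rule that aborts stale multipart uploads.

    The rule ID and default of 7 days are illustrative choices.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ID": "abort-incomplete-multipart-uploads",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # an empty prefix covers the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


def apply_rule(bucket, days=7):
    """Apply the rule to a bucket. This overwrites existing lifecycle rules."""
    import boto3  # imported lazily so the rule builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [abort_incomplete_uploads_rule(days)]},
    )


# Example call with a hypothetical bucket name:
# apply_rule("my-protected-target-bucket", days=7)
```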

There is no way to automatically restart this product from the point where it timed out while processing a file. To reduce the likelihood of a timeout error, consider the following:

Increase function timeout to its maximum of 15 minutes
Split large files into multiple smaller ones and let S3 Protector process them individually
Increase ‘MaxBatchSize’ to increase protect operation throughput
Ensure sufficient concurrency of Cloud Protect API functions
Monitor Cloud Protect API and S3 Protector functions for performance and errors
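The file-splitting suggestion above can be sketched as follows. This example assumes a CSV file with a single header row and uses only the Python standard library; the function name, the part naming scheme, and the rows-per-file value are illustrative and should be tuned to what the function can process within its timeout.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, rows_per_file):
    """Split a CSV file into parts of at most rows_per_file data rows.

    The header row is repeated in every part so each part can be
    processed on its own. Returns the list of part paths in order.
    """
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with source.open(newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty file: nothing to split
            return parts
        out = None
        writer = None
        try:
            for index, row in enumerate(reader):
                if index % rows_per_file == 0:
                    # Start a new part file and repeat the header.
                    if out is not None:
                        out.close()
                    name = f"{source.stem}_part{len(parts) + 1:04d}{source.suffix}"
                    part_path = out_dir / name
                    out = part_path.open("w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    parts.append(part_path)
                writer.writerow(row)
        finally:
            if out is not None:
                out.close()
    return parts
```

Each resulting part can then be uploaded to the source bucket folder separately, so that every file triggers its own invocation with its own timeout budget.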

Parquet Timestamp

Parquet files define a file schema with a data type for each column. S3 Protector uses the Pandas library to process data in the source file. By default, Pandas represents timestamps as 64-bit integers counting nanoseconds since the UNIX epoch. The supported date range for this representation is between ‘1677-09-21 00:12:43.145224’ and ‘2262-04-11 23:47:16.854775’. To correctly handle timestamps outside of this range, S3 Protector treats every timestamp column in a source file as a string column. As a result, the schema of the protected file differs from that of the source file: every timestamp column is converted to a string column.
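The bounds quoted above follow from storing a signed 64-bit count of nanoseconds since the UNIX epoch. The sketch below reproduces them with the Python standard library only, truncated to microsecond precision because that is the resolution of datetime. It is an illustration of the arithmetic, not the product's code; in Pandas itself the same limits are exposed as Timestamp.min and Timestamp.max.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest signed 64-bit integer


def ns_to_datetime(ns):
    """Convert epoch nanoseconds to a datetime, flooring to microseconds."""
    return EPOCH + timedelta(microseconds=ns // 1000)


def to_epoch_ns(value):
    """Convert a naive datetime to exact nanoseconds since the epoch."""
    delta = value - EPOCH
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000


def fits_in_int64_ns(value):
    """Check whether a datetime fits in a 64-bit nanosecond timestamp.

    The value -2**63 is excluded because Pandas reserves it for NaT.
    """
    return -INT64_MAX <= to_epoch_ns(value) <= INT64_MAX


print(ns_to_datetime(-INT64_MAX))  # lower bound
print(ns_to_datetime(INT64_MAX))   # upper bound
print(fits_in_int64_ns(datetime(1500, 1, 1)))  # a date before the lower bound
```

A value such as a year-9999 placeholder date, which is common in source systems, falls outside this range, which is why keeping timestamp columns as strings avoids conversion errors.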