Amazon S3 Tunnel

About S3 tunnel fields.

Amazon Simple Storage Service (S3) is an online file storage web service. It lets you manage files through browser-based access as well as web services APIs. In the DSG, the S3 tunnel communicates with Amazon S3 cloud storage over the Amazon S3 REST API. The higher-layer S3 Service object, which sits above the tunnel object and is configured at the RuleSet level, processes the file contents retrieved from S3.

A sample S3 tunnel configuration is shown in the following figure.

Amazon S3 tunnel screen

Amazon S3 uses buckets to store data, and each piece of data is classified as an object identified by a unique key. Consider an example where john.doe is the bucket and incoming is a folder under it, and the requirement is that files landing in the incoming folder are picked up and processed by the DSG nodes. The data pulled from the AWS online storage is available in the incoming folder under the source bucket, and the Amazon S3 Service performs data security operations on this data.

Note: The DSG supports four levels of nested folders in an Amazon S3 bucket.
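As an illustration of this key-prefix model, the following sketch lists the objects a node would see, assuming the AWS boto3 SDK and the john.doe/incoming example above. The DSG does this internally; the snippet is not part of the product.

```python
# Illustrative sketch only: the DSG performs this internally.
# Assumes the boto3 SDK and the john.doe/incoming example above.
import boto3

s3 = boto3.client("s3")

# An S3 "folder" is just a key prefix; listing with Prefix="incoming/"
# returns every object whose key starts with that prefix.
response = s3.list_objects_v2(Bucket="john.doe", Prefix="incoming/")
for obj in response.get("Contents", []):
    # Keys nested more than four folders deep are not supported by the DSG.
    print(obj["Key"], obj["Size"])
```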

After the rules are executed, the processed data may be stored in a separate location, the target bucket (for example, a folder named outgoing under the same john.doe bucket). When the DSG nodes poll AWS for an uploaded file, whichever node accesses the file first places a lock on it. You can specify whether the lock files are stored in a separate bucket or under the source bucket. If a file is locked, the other DSG nodes stop trying to access it.

If the data operation on a locked file fails, the lock file can be viewed for detailed log and error information. The lock files are automatically deleted if the processing completes successfully.
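The locking pattern can be sketched generically as follows. This is a minimal illustration of check-then-create locking against S3 using boto3, not the DSG's internal implementation, and the check-then-write sequence shown here is not atomic.

```python
# Minimal sketch of the locking pattern, not the DSG's implementation.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def try_lock(bucket: str, lock_key: str, details: str) -> bool:
    """Return True if this node created the lock file, False if one exists."""
    try:
        s3.head_object(Bucket=bucket, Key=lock_key)
        return False  # another node already holds the lock
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise
    # No lock found: create one carrying log/error details for diagnostics.
    s3.put_object(Bucket=bucket, Key=lock_key, Body=details.encode())
    return True
```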

Consider the scenario where an incoming bucket contains two directories, Folder1 and Folder2.

The DSG allows multiprocessing of the files that are placed in the bucket, and a lock file is created for every file processed. In this scenario, the lock files are created as follows:

  • If the abc.csv file of Folder1 is processed, the lock file is created as Folder1.abc.csv.<hostname>.<Process ID>.lock.
  • If the pqr.csv file of Folder2 is processed, the lock file is created as Folder2.pqr.csv.<hostname>.<Process ID>.lock.

Consider the following figure where files are nested in the S3 bucket.

The lock files are created as follows:

  • If the abc.csv file of Folder1 is processed, the lock file is created as Folder1.abc.csv.<hostname>.<Process ID>.lock.
  • If the pqr.csv file of Folder2 is processed, the lock file is created as Folder1.Folder2.pqr.csv.<hostname>.<Process ID>.lock.
  • If the abc.csv file of Folder3 is processed, the lock file is created as Folder1.Folder2.Folder3.abc.csv.<hostname>.<Process ID>.lock.
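The naming convention above can be reproduced with a hypothetical helper (the function name is illustrative; the DSG derives these names internally): path separators become dots, followed by the hostname and process ID.

```python
# Hypothetical helper reproducing the lock-file naming shown above.
import os
import socket

def lock_file_name(object_key: str) -> str:
    # "Folder1/Folder2/pqr.csv" -> "Folder1.Folder2.pqr.csv.<hostname>.<pid>.lock"
    flattened = object_key.replace("/", ".")
    return f"{flattened}.{socket.gethostname()}.{os.getpid()}.lock"

print(lock_file_name("Folder1/Folder2/pqr.csv"))
```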

If the multiprocessing of files is to be discontinued, remove the enhanced-lock-filename flag from the features.json file available under System > Files on the DSG Web UI.
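As a sketch of that edit, assuming features.json holds a simple JSON list of flag names (the actual structure may differ), removing the flag might look like this:

```python
# Sketch only; assumes features.json is a JSON list of flag names.
# In practice, edit the file under System > Files on the DSG Web UI.
import json

with open("features.json") as f:
    features = json.load(f)

if "enhanced-lock-filename" in features:
    features.remove("enhanced-lock-filename")

with open("features.json", "w") as f:
    json.dump(features, f, indent=2)
```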

The following image illustrates the options available for an S3 tunnel.

S3 Tunnel Settings

The options specific to the S3 Protocol type are described as follows:

Bucket List Settings

1. Source Bucket Name: Name of the bucket, as defined in AWS, that contains the files to be processed.

2. Source File Name Pattern: Regex pattern for the file names to be processed. For example, .*\.csv matches all CSV files.

Rename Processed Files: Regex-based settings for renaming processed files, consisting of the following two fields.

3. Match Pattern: Regex pattern matched against the name of the processed source file; groups captured here can be referenced in the Replace Value.

4. Replace Value: Value to append, or name used to rename the original source file, based on the pattern and grouping defined in the Match Pattern.
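Match Pattern and Replace Value behave like a standard regex substitution with group references. A small sketch with example values (the pattern and replacement here are illustrative, not defaults):

```python
# Illustrative values only: rename abc.csv to abc_processed.csv.
import re

match_pattern = r"(.*)\.csv"       # group 1 captures the base file name
replace_value = r"\1_processed.csv"

print(re.sub(match_pattern, replace_value, "abc.csv"))  # abc_processed.csv
```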

5. Overwrite Target Object: Select to overwrite a file in the bucket with a newly processed file of the same name. Refer to Amazon S3 Object.

6. Lock Files Bucket: Name of a separate bucket in which to store the lock files. If not defined, the lock files are placed in the source bucket.

7. Interval: Time in seconds at which the DSG node polls AWS to pull files; the default value is 5. A cron expression can also be specified to schedule polling; refer to the Cron documentation. If you use the cron expression “* * * * *”, the DSG polls AWS at the minimum interval of one minute.
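To illustrate the two polling modes, the sketch below contrasts a fixed interval with the “* * * * *” cron schedule; it uses the third-party croniter package purely for demonstration.

```python
# Demonstration only; croniter is a third-party package, not part of the DSG.
from datetime import datetime
from croniter import croniter

interval_seconds = 5  # default fixed polling interval

# "* * * * *" fires once per minute, the minimum cron granularity.
schedule = croniter("* * * * *", datetime.now())
print("next cron poll:", schedule.get_next(datetime))
print("the one after :", schedule.get_next(datetime))
```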

AWS Settings

8. AWS Access Key Id: Access key ID used to make secure protocol requests to an AWS service API. Refer to the Amazon Web Services documentation.

9. AWS Secret Access Key: Secret access key paired with the access key ID. The two are used together to sign requests to AWS and provide access to resources. Refer to the Amazon Web Services documentation.

10. AWS Endpoint URL: Endpoint URL to use when connecting to an S3-compatible endpoint other than an Amazon S3 bucket, such as an on-premises S3 store or a Google Cloud Storage bucket. If not defined, the DSG connects to the Amazon S3 bucket.

11. Path to CA Bundle: Path to the CA bundle, required when the endpoint is other than an Amazon S3 bucket. If an on-premises S3 store is installed with a self-signed certificate, specify the path to its CA bundle here. If the endpoint URL is an Amazon S3 bucket, the default SSL certificate is used to connect to it.
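For reference, the equivalent settings in a boto3 client might look like the following sketch; the endpoint URL and CA bundle path are placeholders.

```python
# Sketch of equivalent boto3 settings; URL and path are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="ACCESS_KEY_ID",
    aws_secret_access_key="SECRET_ACCESS_KEY",
    endpoint_url="https://s3.on-prem.example.com",  # non-Amazon S3 endpoint
    verify="/path/to/ca-bundle.pem",                # CA bundle for a self-signed cert
)
```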

Advanced Settings

12. Advanced Settings: Set additional advanced options for the tunnel configuration, if required, as JSON in the following textbox. In a scenario where an ESA and two DSG nodes are in a cluster, the Selective Tunnel Loading functionality lets you load specific tunnel configurations on specific DSG nodes.

The following advanced settings can be configured for the S3 protocol:

  • SSECustomerAlgorithm: If server-side encryption with a customer-provided encryption key was requested, the response includes this header confirming the encryption algorithm used.
  • SSECustomerKey: Constructs a new customer-provided server-side encryption key.
  • SSECustomerKeyMD5: If server-side encryption with a customer-provided encryption key was requested, the response includes this header to provide round-trip message integrity verification of the customer-provided encryption key.
  • ACL: Controls the ownership of uploaded objects in an S3 bucket. For example, if the ACL (Access Control List) is set to “bucket-owner-full-control”, new objects uploaded by other AWS accounts are owned by the bucket owner. By default, objects uploaded by other AWS accounts are owned by those accounts.
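These option names mirror parameters of the S3 PUT Object API. A boto3 upload using them might look like the sketch below; the key material is generated on the fly for illustration, and boto3 base64-encodes the customer key and derives its MD5 digest automatically.

```python
# Illustrative upload using the SSE-C and ACL options listed above.
import os
import boto3

customer_key = os.urandom(32)  # 256-bit customer-provided key (example only)
s3 = boto3.client("s3")
s3.put_object(
    Bucket="john.doe",
    Key="outgoing/abc.csv",
    Body=b"processed,data\n",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,  # boto3 adds SSECustomerKeyMD5 automatically
    ACL="bucket-owner-full-control",
)
```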

Using S3 tunnel to access files on Google Cloud Storage

Similar to AWS buckets, data stored on Google Cloud Storage can also be protected. You can use the S3 tunnel to access the files on the GCP storage. The incoming and processed files must be placed in the same storage bucket, in separate folders. For example, a storage bucket named john.doe contains a folder incoming that holds the files to be picked up and processed by the DSG nodes; this acts as the source bucket. After the rules are executed, the processed data is stored in the processed folder. Ensure the following points are considered:

  • AWS Endpoint URL contains the URL of the Google Cloud storage.
  • AWS Access Key ID and AWS Secret Access Key contain the HMAC key's access ID and secret, respectively.

Refer to Google docs for information about Access ID and HMAC keys.
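A sketch of the corresponding boto3 configuration, with placeholder HMAC credentials and Google's S3-compatible XML API endpoint:

```python
# Sketch only; credentials are placeholders for a GCS HMAC key pair.
import boto3

gcs = boto3.client(
    "s3",
    aws_access_key_id="HMAC_ACCESS_ID",
    aws_secret_access_key="HMAC_SECRET",
    endpoint_url="https://storage.googleapis.com",  # GCS S3-compatible endpoint
)

response = gcs.list_objects_v2(Bucket="john.doe", Prefix="incoming/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```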