Extract
Describes the Extract rule for the DSG ruleset configuration.
The Extract action defines the payloads supported by the DSG. The following payloads are supported:
Adobe Action Message Format (AMF)
Binary
Character Separated Values (CSV)
Common Event Format (CEF)
eXtensible Markup Language (XML)
eXtensible Markup Language (XML) with Tree-of-Trees (ToT)
Fixed Width
HTML Form Media Type (X-WWW-FORM-URLENCODED)
HTTP Message
JavaScript Object Notation (JSON)
JavaScript Object Notation (JSON) with Tree-of-Trees (ToT)
Multipart Mime
Microsoft Office 2007 Excel Document
Microsoft Office 2013 Document
Adobe Portable Document Format (PDF)
Enhanced Adobe Portable Document Format (PDF)
Protocol Buffer (protobuf)
Secured File Transfer
Amazon S3 Object
SMTP Message
Text
Uniform Resource Locator
User Defined Extraction
Note: A JWT token that is Base64-encoded without padding characters can only be extracted using the UDF extraction rule.
ZIP Compressed File
1 - Adobe Action Message Format
Describes the AMF payload
This payload extracts AMF format from the request and lets you define regex to control precise extraction.
The fields for Adobe Action Message Format (AMF) payload are as seen in the following figure.
The properties for the AMF payload are explained in the following table.
Field | Description |
---|---
Method* | Specifies the method of extraction for AMF payloads. |
Pattern | Regular expression pattern to match and extract from the string value of the AMF payload. |
* The following options are available for the Method field.
- Serialize: Expose the AMF payload only in learn mode. This is useful for debugging while creating rules in learn mode.
- Serialized String Value: Configure the AMF payload as string and extract the matched Pattern.
- String Value: Configure the data using the matched Pattern. The data is not serialized before the pattern matching.
- String Value by Key Name: The data is expected to come in key-value pairs. The parameters are matched using the Pattern. The value for the matched parameter is extracted.
2 - Amazon S3 Object
Describes the Amazon S3 Object payload.
This payload extracts Amazon S3 object from the request and lets you define regex to control precise extraction. It is generally used with the Amazon S3 service.
The following figure illustrates the Amazon S3 Object payload fields.
The properties for the Amazon S3 Object payload are explained in the following table.
Properties | Description |
---|---
Object Key | Regex logic to identify source object key to be extracted. |
Target Object | Object attribute to be extracted, selected from the available options. |
3 - Binary Payload
Describes the binary payload.
This payload extracts binary data from the request and lets you define regex to control precise extraction.
The fields for Binary payload are as seen in the following figure.
The properties for the Binary payload are explained in the following table.
Field | Sub-Field | Description |
---|---|---
Prerequisite Match Pattern | | A regular expression that must match in the input before the extraction is applied. |
Pattern | | The regular expression pattern on which the extraction is applied. For example, if the text is “Hello World”, the pattern would be “\w+”. |
| Pattern Group Id | The group number to extract from the regular expression Pattern. For example, for the text “Hello World”, a Group Id of 0 extracts the characters of the entire match, as per regex group numbering. |
| Profile Name | Profile to be used to perform transform operations on the matched content. |
| User Comments | Additional information related to the action performed by the group processing. |
Encoding | | The encoding method used while extracting the binary payload. |
Prefix | | Prefix text to be added before the protected value. This helps distinguish protected text from clear text. |
Suffix | | Suffix text to be added after the protected value. Its use is the same as for Prefix. |
Padding Character | | Character added to pad the input to the minimum size required by the protection method. |
Minimum Input Length | | Number of characters below which the input is considered too short for the protection method and is padded with the Padding Character. |
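The Pattern and Pattern Group Id fields follow regular-expression semantics. The following minimal Python sketch is illustrative only and assumes standard regex group numbering, where group 0 denotes the entire match:

import re

# Illustrative sketch of the Pattern and Pattern Group Id semantics,
# assuming standard regex group numbering (group 0 is the entire match).
text = "Hello World"
pattern = re.compile(r"\w+")

for match in pattern.finditer(text):
    print(match.group(0))  # prints "Hello", then "World"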
The following table describes the fields for Encoding.
Field | Description |
---|---
Codec | Select the appropriate codec based on the Encoding selection. |
The following options are available for the Encoding field:
- No Encoding
- Standard
- External
- Proprietary
No Encoding
If the No Encoding option is selected, then no encoding is applied.
Standard
The Standard Encoding consists of built-in codecs of standard character encodings or mapping tables, including UTF-8, UTF-16, ASCII and more.
For more information about the complete list of encoding methods, refer to the section Standard Encoding Method List.
External
When external encoding is applied, you must select a codec.
The following table describes the codecs for the External encoding.
Codec | Description |
---|---
Base64 | Binary-to-text encoding to represent binary data in ASCII format. |
HTML Encoding | Replaces the special characters “&”, “<”, and “>” with HTML-safe sequences. |
JSON Escape | Escapes special JSON characters, such as the quote (") in JSON string values, to produce JSON-safe sequences. |
URI Encoding | Quotes each part of a URL, as required by RFC 2396 Uniform Resource Identifiers (URI). It does not encode ‘/’. |
URI Encoding Plus | Similar to URI Encoding, except that it replaces the space character with ‘+’. |
XML Encoding | Escapes &, <, and > in a string of data and quotes it for use as an attribute value, producing XML-safe sequences. |
Quoted Printable | Converts to/from the quoted-printable transport encoding as per RFC 1521. |
SQL Escape | Performs SQL statement string escaping by replacing a single quote (') with two single quotes ('') and a double quote (") with two double quotes (""). |
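As an illustration, the following Python sketch approximates several of these codecs using standard-library modules; the DSG's internal implementations may differ in detail.

import base64
import html
import json
import quopri
import urllib.parse

data = "Hello & <World> / 100%"

print(base64.b64encode(data.encode()).decode())     # Base64
print(html.escape(data))                            # HTML/XML-style escaping
print(json.dumps(data))                             # JSON Escape
print(urllib.parse.quote(data))                     # URI Encoding (does not encode '/')
print(urllib.parse.quote_plus(data))                # URI Encoding Plus (' ' becomes '+')
print(quopri.encodestring(data.encode()).decode())  # Quoted Printable (RFC 1521)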
Proprietary
When proprietary encoding is selected, the linked codecs are displayed.
The following table describes the codecs for the Proprietary encoding.
Codec | Description |
---|---
Base128 Unicode CJK | Base128 encoding in Chinese, Japanese and Korean characters. |
High ASCII | Character encodings of eight bits or larger. |
The following encryption methods are not supported for the High ASCII codec and the Base128 Unicode CJK codec:
- AES-128
- AES-256
- 3DES
- CUSP AES-128
- CUSP AES-256
- CUSP 3DES
- FPE NIST 800-38G Unicode (Basic Latin and Latin-1 Supplement Alpha)
- FPE NIST 800-38G Unicode (Basic Latin and Latin-1 Supplement Alpha-Numeric)
The following tokenization data types are not supported for the High ASCII codec and the Base128 Unicode CJK codec:
The input data for the Base128 Unicode CJK and High ASCII codecs must contain only ASCII characters. For example, if input data consisting of non-English characters is tokenized using the Alpha tokenization, then the Alpha tokenization treats the non-English characters as delimiters and the tokenized output will include the non-English characters. As a result, the protection or unprotection operation will fail.
4 - CSV Payload
Describes the CSV payload.
This payload extracts CSV format from the request and lets you define regex to control precise extraction.
With the Row and Column Index, you can define how the column positions are calculated. For example, consider a CSV input as provided in the following snippet. With 0-based indexing, the first column begins at the 0th field in the row, and fields are padded with commas until the next column, ex5, begins. If you choose to use 1-based indexing, the first column begins at 1 and subsequent fields are 2, 3, and so on. Based on these definitions, you can define the rule and its properties.
first, ex5, last, pick, ex6, city, ex1, ex2
John, wwww, Smith, mister, wwww, stamford, 333, 444
Adam, wwww, Martin, mister, wwww, fairfield, 333, 444
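As an illustration, the following Python sketch (assuming 0-based Row and Column Index, so the first column is index 0 and ex5 is index 1) shows how a configured column resolves to the values that would be extracted from the snippet above:

import csv
import io

sample = """first, ex5, last, pick, ex6, city, ex1, ex2
John, wwww, Smith, mister, wwww, stamford, 333, 444
Adam, wwww, Martin, mister, wwww, fairfield, 333, 444"""

reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
header = next(reader)         # the header row holds the column names
target = header.index("ex5")  # resolve a column name to its 0-based index
for row in reader:
    print(row[target])        # value that would be passed on for transformation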
It is recommended to use the External Base64 encoding method in the Transform action type for the CSV codec. If the Standard Base64 method is used, then additional newline feeds are generated in the output.
The CSV implementation in the DSG does not support the following:
- Fields that contain line breaks.
- Escaping double quotes that appear inside a field by preceding them with another double quote, when fields are enclosed in double quotes.
The fields for the CSV payload are as seen in the following figure.
If the CSV input includes non-ASCII or Unicode data, then the Binary extract rule must be used before using the CSV extract rule.
If the CSV input file includes non-printable special characters, then to transform the data successfully, the user must add the csv-bytes-parsing parameter in the features.json file.
To add the parameter in the features.json file, perform the following steps.
- Log in to the ESA Web UI.
- Navigate to Settings > System > Files.
- Open the features.json file for editing.
- Add the csv-bytes-parsing parameter in the features.json file. The csv-bytes-parsing parameter must be added in the following format:
{ "features": [ "csv-bytes-parsing" ] }
The properties for the CSV payload are explained in the following table.
Properties | Sub-Field | Description | Additional Information |
---|---|---|---
Line Separator | | Separator that defines where a new line begins. | |
Skip Lines Matching Pattern | | Regex pattern that defines the lines that need to be skipped. | For example, consider the following lines in the file: User, Admin, Full Access, Viewer; Partial Access, User, Viewer, Admin; No Access, Viewer, User, Root; No Access, Partial Access, Root, Admin.
- If you configure the regex as .*?User, lines 1, 2, and 3 will be skipped.
- If you configure the regex as User, the first line will be skipped and the remaining lines will be processed.
|
Preserve Number of Columns | | Select to check whether the number of columns is equal to the number of column headers in a CSV file. If there is a mismatch between the actual number of columns and the number of column headers, then the rule stops processing and an error appears in the log. If you clear this check box and a mismatch is detected, then the rule still continues to process the data. A warning appears in the log. | If the check box is selected, ensure that the data does not contain two or more consecutive Line Separators. For example, if the Line Separator is set to \n, the following syntax must be corrected by removing the consecutive occurrences of \n: name, city, pin\n Joe, NY, 10\n Smith, LN, 12\n \n |
Row and Column Index | | Select 0 if row and column counting begins at 0 or 1 if it begins at 1. | 0 |
Header Line Number | | Line number with the column headers. | Default: -1 when Row and Column Index is 0; 0 when Row and Column Index is 1. |
Data Starts at Line | | Line number from which the data begins. | Value calculated as Header Line Number +1 |
Column Separator | | Value by which the columns are separated. | |
Columns | | List of columns to be extracted and for which the values action is to be applied. For example, consider a .csv file with multiple columns, such as SSN and Name, that need to be processed. | |
| Column Name/Index | Column Name or index number of the column that will be processed. The index depends on the Row and Column Index setting: with 1-based indexing, if the name of the first column is “Name”, the value in Column Name/Index would be either 1 or Name; with 0-based indexing, it would be either 0 or Name. | |
| Profile Name | Profile to be used to perform transform operations on the matched content. | |
| User Comments | Additional information related to the action performed by the column processing. | |
Text Qualifier | | Pattern that allows cells to be combined. | |
Pattern | | Pattern that applies to the cells, after the lines and columns have been separated. | |
Advanced Settings | | Define the quote handling for unbalanced quotes in CSV records.
- Set {"quoteHandlingMode" : "DEFAULT"} to correct unbalanced quotes in records, such as single quotes, in the delimited CSV input file during data processing. For example, if the CSV includes unbalanced quotes, such as ',03/11/2020 or ",13/08/2020, and the Default mode is enabled, then during data processing, the DSG will correct the unbalanced quotes to '',03/11/2020 or "",13/08/2020 respectively.
- Set {"quoteHandlingMode" : "PASSIVE"} to retain unbalanced quotes in records, such as single quotes, in the delimited CSV input file during data processing. For example, if the CSV includes unbalanced quotes, such as ',03/11/2020 or ",13/08/2020, and the Passive mode is enabled, then during data processing, the DSG will retain the unbalanced quotes.
| If quoteHandlingMode is set to DEFAULT, the unbalanced quotes are balanced. However, if the quote is followed by a string, the unbalanced quotes are not corrected by the DSG. For example, in the following CSV text, the quotes are not balanced by the DSG: 'Joe,03/11/2024 or "Joe,13/11/2024. The output of this entry remains unchanged. |
5 - Common Event Format (CEF)
Describes the CEF payload.
If you want to protect fields that are part of a CEF log file, you can use the CEF payload to extract the required fields.
The properties for the Common Event Format (CEF) payload are explained in the following table.
Properties | Sub-Field | Description |
---|---|---
Line Separator | | Regex pattern to identify field separation. |
Fields | | CEF names and profile references must be selected. |
| Field Name | Comma separated list of CEF key names that need to be transformed (protected or unprotected). |
| Profile Name | Profile to be used to perform transform operations on the matched content. |
| User Comments | Additional information related to the action performed by the column processing. |
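For illustration, a CEF record consists of a pipe-separated header (version, device vendor, device product, device version, signature ID, name, and severity) followed by a key-value extension. In the following sample record (hypothetical values), a Field Name entry of suser,duser would target the suser and duser extension keys:
CEF:0|Acme|Gateway|1.0|100|Login detected|5|suser=jdoe duser=admin src=10.0.0.1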
6 - XML Payload
Describes the XML payload.
This payload extracts the XML format content from the request and lets you extract the exact XML element value with it.
The fields for the XML payload are as seen in the following figure.
The properties for the XML payload are explained in the following table.
Properties | Description |
---|---
XPath List | The XML element value to be extracted is specified in this field. Note: Ensure that you enter the XPath by following the proper syntax for extracting the XML element value. If you enter incorrect syntax, then the service that has this XML payload definition in the rule fails to load and process the request. |
Advance XML Parser options* | Configure advanced parsing parameter options for the XML payload. This field accepts parsing options in the JSON format. The parsing options are of the Boolean data type. For example, the parsing parameter remove_comments accepts the values true or false. |
* The Advance XML Parser options field provides the following parsing parameters that can be configured.
Options | Description | Default |
---|---|---
remove_blank_text | Boolean value used to remove the whitespaces for indentation in the XML payload. | False |
remove_comments | Boolean value used to remove comments from the XML payload. In the XML format, comments are entered in the <!-- --> tag. | False |
remove_pis | Boolean value used to remove Processing Instructions (PI) from the XML payload. In the XML format, processing instructions are entered in the <?...?> tag. | False |
strip_cdata | Boolean value used to replace content in the cdata, Character data, or tag by normal text content. | True |
resolve_entities | Boolean value used to replace the entity value by their textual data value. | False |
no_network | Boolean value used to prevent network access while searching for external documents. | True |
ns_clean | Boolean value used to remove redundant namespace declarations. | False |
Consider the following example to understand the Advance XML Parser options available in the XML codec. In this example, a request is sent from a client to remove the whitespaces between the XML tags from a sample XML payload in the message body of the HTTP/REST request. The following Ruleset is created for this example.
Create an extract rule for the HTTP message payload using the default RuleSet template defined under the REST API service.
Consider the following sample XML payload in the HTTP message body.
<?xml version = "1.0" encoding = "ASCII" ?>
<class_list>
<!--Students grades are uploaded by months-->
<student>
<name>John Doe</name>
<grade>A</grade>
</student>
</class_list>
In the example, a lot of white space is used for indentation. The payload contains spaces, carriage returns, and line feeds between the <class_list>, <student>, and <name> XML tags.
The extract rule for extracting the HTTP message body is as seen in the following figure.
Under the Extract rule, create another child rule to extract the XML payload from the HTTP Message.
In this child rule, provide /class_list/student/name in the XPath List field to parse the XML payload, and set the remove_blank_text parameter to true, in the JSON format, in the Advance XML Parser options field.
Under this extract rule, create another child rule to extract the sensitive data between the <name> and </name> tags. The fields for this child extract rule are as seen in the following figure.
Under the extract rule, create a transform rule to protect the sensitive data between the <name> and </name> tags using Regex Replace with the pattern xxxxx. The fields for the transform rule are as seen in the following figure.
Click Deploy or Deploy to Node Groups to apply the configuration changes.
When a request is sent to the configured URI, the DSG processes the request and the following response appears with the whitespaces removed from the XML payload. In addition, the sensitive data between the <name> and </name> tags is protected.
<?xml version='1.0' encoding='ASCII'?>
<class_list><!--Students grades are uploaded by months--><student><name>xxxxx xxxxx</name><grade>A</grade></student></class_list>
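The parser options described above correspond to options commonly exposed by XML parsers. The following Python sketch, assuming the lxml library (the DSG's internal parser may differ), illustrates the effect of remove_blank_text and the XPath used in this example:

from lxml import etree

xml = b"""<?xml version="1.0" encoding="ASCII"?>
<class_list>
    <!--Students grades are uploaded by months-->
    <student>
        <name>John Doe</name>
        <grade>A</grade>
    </student>
</class_list>"""

# remove_blank_text drops the indentation whitespace between the tags.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)

# The XPath from the example selects the <name> element for extraction.
for name in root.xpath("/class_list/student/name"):
    print(name.text)  # John Doe

print(etree.tostring(root).decode())  # serialized without the whitespace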
7 - Date Time Format
Describes the date time format payload.
The Datetime format payload is used to convert custom datetime formats, which are not supported by the tokenization datetime or date data elements, to a supported format that can be processed by the DSG.
Consider an example where you provide a datetime value in a format such as DD/MM/YYYY HH:MM:SS as input to an Extract rule with the Datetime payload. The given format is not supported by the datetime or date tokenization data element. The Extract rule converts the format to an acceptable format, and a transform rule protects the datetime. The Datetime payload then converts the protected value back to the input format and returns this value to the user.
When you request the DSG to unprotect the protected datetime value, an extract rule identifies the protected datetime value, and a subsequent transform rule unprotects the value and returns the original datetime format, which is DD/MM/YYYY HH:MM:SS.
Ensure that the input sent to the extract rule for Date Time extraction is in exactly the same format as configured in the rule. If you are unsure of the input that might be sent to the extract rule, then ensure that the Ruleset configuration is thoroughly checked before you roll out to production.
The following figure illustrates the Date Time format payload fields.
Before you begin:
Ensure that the following prerequisite is completed:
- The datetime data element defined in the policy on the ESA is used to perform the protect or unprotect operation.
The following table describes the fields for Datetime codec.
Field | Description |
---|
Input Date Time Format | Format in which the input is provided to the DSG. Note: This field accepts numeric values only in the input request sent to the DSG. |
Data Element Date Time Format | Format to which input must be converted. Note: Ensure that the Transform rule that follows the Extract rule uses the same data element that is used to configure the Date Time Format codec. |
Mode of Operation | Data security operation that needs to be performed. You can select Protect or Unprotect. Note: The mode of operation must be same as the data security operation that you want to perform in the Transform rule. |
DistinguishableDate* | Select this check box if the data element used to protect the datetime value includes this setting. |
*These fields appear only when Unprotect is selected as Mode of Operation.
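Conceptually, the codec performs a parse-and-reformat step in the protect direction and the reverse step in the unprotect direction. The following Python sketch illustrates this, assuming a hypothetical input format of DD/MM/YYYY HH:MM:SS and a hypothetical data-element format of YYYY-MM-DD HH:MM:SS:

from datetime import datetime

# Protect direction: parse the custom input format and re-emit it in the
# format supported by the datetime data element (hypothetical formats).
raw = "31/12/2024 10:20:30"  # DD/MM/YYYY HH:MM:SS
normalized = datetime.strptime(raw, "%d/%m/%Y %H:%M:%S").strftime("%Y-%m-%d %H:%M:%S")
print(normalized)  # 2024-12-31 10:20:30

# Unprotect direction: convert the stored format back to the input format.
restored = datetime.strptime(normalized, "%Y-%m-%d %H:%M:%S").strftime("%d/%m/%Y %H:%M:%S")
print(restored)  # 31/12/2024 10:20:30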
8 - XML with Tree-of-Trees (ToT)
Describes the XML ToT.
The XML with Tree-of-Trees (ToT) codec extracts the XML element defined in the XPath field. The XML with ToT codec allows you to process multiple XML elements in one extract rule.
The fields for the XML with ToT payload are as seen in the following figure.
To understand the XML with ToT payload, consider the following example where student details, such as name, age, subject, and gender, can be sent as a part of the request. In this example, the XML with ToT rule extracts and protects the name and the age elements.
<?xml version='1.0' encoding='UTF-8'?>
<students>
<student>
<name>Rick Grimes</name>
<age>35</age>
<subject>Maths</subject>
<gender>Male</gender>
</student>
<student>
<name>Daryl Dixon </name>
<age>33</age>
<subject>Science</subject>
<gender>Male</gender>
</student>
<student>
<name>Maggie</name>
<age>36</age>
<subject>Arts</subject>
<gender>Female</gender>
</student>
</students>
The following figure illustrates one extraction rule for multiple XML elements.
In Figure 14-25, the XML ToT Extract rule extracts two different XML elements, name and age. The /students/student/name path extracts the name element and protects it with the transform rule. Similarly, the /students/student/age path extracts the age element and protects it with the transform rule. The same data element is used to protect both XML elements; you can use different data elements to transform XML elements as per your requirements. It is recommended to use the Profile reference from the drop-down that appears in the Profile Name field. This helps process the extraction and transformation in one rule and reduces the transform overhead of defining one element at a time for the same XML file. If the Profile Name field is left empty, then the extracted value is passed to the child rule for transformation.
For more information about profile referencing, refer to the section Profile Reference.
The properties for the XML with ToT payload are explained in the following table.
Properties | Subfield | Description |
---|---|---
XPaths with Profile Reference | | Define the required XPath and Profile reference. Note: Ensure that you enter the XPath by following the required syntax for extracting the XML element value. For example, in Figure 14-25, the /students/student/name path is defined for the name element; follow the same syntax to extract the XML element. If you enter incorrect syntax, then the defined rule is disabled. |
| XPath | Define the required XML element. |
| Profile Name | Select the required transform rule. |
| User Comments | Add additional information for the action performed if required. |
Advance XML Parser options | | Configure advanced parsing parameter options for the XML payload. This field accepts parsing options in the JSON format. The parsing options are of the Boolean data type. For example, the parsing parameter remove_comments accepts the values true or false. Note: The Advance XML Parser options that apply to the XML codec also apply to the XML with ToT codec. For more information about the additional XML Parser options, refer to the Table: Advance XML Parser. |
9 - Fixed Width
Describes the fixed width payload.
In scenarios where the input data is sent to the DSG in a fixed width format, the Fixed Width codec is used. In a fixed width input, the data columns are specified in terms of the exact column start character offset and a fixed column width in number of characters.
For example, consider a fixed width input as provided in the following snippet. The Name column begins at the 0th character in a row, has a fixed width of 20 characters and is padded with spaces until the next column Number begins. The Number column begins at the 20th character in a row and has a fixed width of 12 characters.
With the Row and Column Index, you can define how the column positions are calculated. If you choose to use 1-based indexing, the Name column begins at 1 and, for a fixed width of 20 characters, the subsequent column begins at the 21st character. If you use 0-based indexing, the Name column begins at 0 and, for a fixed width of 20 characters, the subsequent column begins at the 20th character. Based on these definitions, you can define the rule and its properties.
Name                Number
John Smith          418-Y11-4111
Mary Hartford       319-Z19-4341
Evan Nolan          465-R45-4567
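As an illustration, the following Python sketch (assuming 0-based indexing: the Name column starts at position 0 with a width of 20, and the Number column at position 20 with a width of 12) shows how the column definitions map to slices of each row:

rows = [
    "John Smith          418-Y11-4111",
    "Mary Hartford       319-Z19-4341",
    "Evan Nolan          465-R45-4567",
]

NAME_POS, NAME_WIDTH = 0, 20
NUMBER_POS, NUMBER_WIDTH = 20, 12

for row in rows:
    name = row[NAME_POS:NAME_POS + NAME_WIDTH]          # slice to be transformed
    number = row[NUMBER_POS:NUMBER_POS + NUMBER_WIDTH]
    print(repr(name), repr(number))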
The fields for the Fixed Width payload are as seen in the following figure.
Note:
If the input file includes non-printable special characters, then to transform the data successfully, the user must add the fw-bytes-parsing parameter in the features.json file.
To add the parameter in the features.json file, perform the following steps.
- Log in to the ESA Web UI.
- Navigate to Settings > System > Files.
- Open the features.json file for editing.
- Add the fw-bytes-parsing parameter in the features.json file. The fw-bytes-parsing parameter must be added in the following format:
{ "features": [ "fw-bytes-parsing" ] }
The properties for the Fixed Width payload are explained in the following table.
Properties | Sub-Field | Description |
---|---|---
Line Separator | | Separator that defines where a new line begins. |
Skip Lines Matching Pattern | | Regex pattern that defines the lines that need to be skipped. For example, consider the following lines in the file: User, Admin, Full Access, Viewer; Partial Access, User, Viewer, Admin; No Access, Viewer, User, Root; No Access, Partial Access, Root, Admin.
- If you configure the regex as .*?User, lines 1, 2, and 3 will be skipped.
- If you configure the regex as User, the first line will be skipped and the remaining lines will be processed.
|
Preserve Input Length | | Select to perform a check on the input and output lengths. If a mismatch is detected, then the rule stops processing and an error appears in the log. If you clear this check box and a mismatch is detected, the rule still continues processing the data. A warning appears in the log. |
Row and Column Index | | Select 0 if row and column counting begins at 0 or 1 if it begins at 1. |
Data Starts at Line | | Line number from which the data begins. |
Fixed Width Columns | | |
| Column Position | Column position where the data begins. This value depends on the Row and Column Index defined. For example, with 0-based indexing, if you are protecting the first column with a 20-character fixed width, then the value in this field will be 0. |
| Column Width | The fixed width of the column that must be protected. For example, if you are protecting the first column with 20 characters fixed width, then the value in this field will be 20. |
| Profile Name | Profile to be used to perform transform operations on the matched content. Note: Ensure that the data element used to perform the transform operation is of the Length Preserving type. |
| User Comments | Additional information related to the action performed by the column processing. |
10 - HTML Form Media Payload
Describes the HTML form media payload.
This payload extracts HTML form media format from the request and lets you define regex to control precise extraction.
The fields for the HTML Form Media Payload (X-WWW-FORM-URLENCODED) payload are as seen in the following figure.
The properties for the X-WWW-FORM-URLENCODED payload are explained in the following table.
Properties | Description |
---|---
Name | The regular expression to match the parameter name is specified in this field. |
Value | The value to be extracted is specified in this field. |
Target Object | The parameter object to be extracted is specified in this field. |
Encoding Mode | Encoding mode that will be used for URI encoding handling. |
Encoding Reserve Characters | Characters beyond uppercase and lowercase alphabets, underscore, dot, and hyphen. |
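To illustrate with hypothetical values, consider the following request body.
name=John+Smith&ssn=123-45-6789
A rule with Name set to ssn and Value set to (.*) would match the ssn parameter and extract 123-45-6789 for transformation.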
11 - HTTP Message Payload
Describes the HTTP payload.
This payload extracts HTTP message format from the request and lets you define regex to control precise extraction.
The following figure illustrates the HTTP Message payload fields.
The properties for the HTTP Message payload are explained in the following table.
Properties | | Description |
---|---|---
HTTP Message Type | | Type of HTTP Message to be matched. |
Method | | The value to be extracted is specified in this field. |
Request URI | | The regular expression to be matched with the request URI is specified in this field. |
Request Headers | | The list of name and value as regular expression to be matched with the request headers is specified in this field. |
Message Body | | The parameter object to be extracted is specified in this field. |
Require Client Certificate | | If checked, the client must present a certificate for authentication. If no certificate is provided, a 401 or 403 response appears. |
Authentication | | Authentication rule required for the rule to execute. Authentication mode can be none or basic authentication. |
Target Object | | The target message body to be extracted is specified in this field. The following Target Object options are available:
- Message Body
- Cookie
- Message Header
- Message Headers
- Client Certificate*
- Uniform Resource Locator (URL)
*If the Client Certificate option is selected in the Target Object drop-down menu, the following fields are displayed: Attribute, Value, and Target Object.
|
| Attribute | The client certificate attributes to be extracted are specified in this field. The following attribute options are available:
- issuer
- notAfter
- notBefore
- serialNumber
- subject
- subjectAltName
- version
- crlDistributionPoints
- caIssuers
- OCSP
|
| Value | Regular expression to identify the client certificate attributes to be extracted. The default value is (.*). |
| Target Object | The value or the attribute of the client certificate to be extracted is specified in this field. The following Target Object options are available: |
12 - Enhanced Adobe PDF Codec
Describes the enhanced PDF codec.
The Enhanced Adobe PDF codec extracts the PDF payload from the request and lets you define Regex to control precise extraction. This payload is available when the Action type is selected as Extract.
As part of the ruleset construction for this codec, it is mandatory to include a child Text extract rule under the Enhanced Adobe PDF codec extract rule. You must not use any other rule apart from the child Text extract rule under the Enhanced Adobe PDF codec extract rule.
In the DSG, some font files are already added to the /opt/protegrity/alliance/config/pdf_fonts directory. By default, the following font file is set in the gateway.json file.
"pdf_codec_default_font":{
"name": "OpenSans-Regular.ttf"
}
Note: The Advanced Settings can be used to configure the default font file for a specific rule.
If you want to process a PDF file that contains custom fonts, then upload the font files to the /opt/protegrity/alliance/config/pdf_fonts directory. If the custom fonts are not uploaded to the mentioned directory, then the OpenSans-Regular.ttf font file will be used to process the PDF file.
For more information about how-to examples to detokenize a PDF, refer to the sections Using Amazon S3 to Detokenize a PDF and Using HTTP Tunnel to Detokenize a PDF in the Protegrity Data Security Gateway How-to Guide.
The following figure displays the Enhanced Adobe PDF payload fields.
The properties for the Enhanced Adobe PDF payload are explained in the following table.
Note: The configurations in the Advanced Settings are only applicable for that specific rule.
Properties | Description |
---|---
Pattern | The pattern to be matched is specified in this field. If no pattern is specified, then the whole input is considered for matching. |
Advanced Settings | Set the following additional configurations for the Enhanced Adobe PDF codec. Set the margins to determine whether text belongs to the same line or paragraph in the PDF file.
- Set {"layout_analysis_config" : {"char_margin": 0.1}} to check whether two characters are close enough together, within the margin set, to be considered part of the same line. The margin is determined by the width of the characters.
- Set {"layout_analysis_config" : {"line_margin": 0.1}} to check whether two lines are close enough together, within the margin set, to be considered part of the same paragraph. The margin is determined by the height of the lines.
Note: The {"layout_analysis_config" : {"char_margin": 0.1, "line_margin": 0.1}} settings can also be configured in the gateway.json file.
Set the default font file to process the PDF file.
- Set {"pdf_codec_default_font": {"name": "<name of font file>"}} to process the PDF file using this font file.
|
Known Limitations
The following list describes the known limitations for this release.
- The Enhanced Adobe PDF codec does not support detokenization for sensitive data that splits into multiple lines. It is expected that the prefix, data to be detokenized, and the suffix are in a single line and do not break into multiple lines.
- The embedded fonts are not supported. Ensure that when you are uploading the fonts, the entire character set for that font family is uploaded to the DSG.
- The prefix and suffix used to identify the data to be detokenized must be unique and not a part of the data.
- The PDFs created with the rotate operator are not supported for detokenization.
- The Enhanced Adobe PDF codec does not process password protected PDFs.
- The detokenized data appears spaced out with extra white spaces.
13 - JSON Payload
This codec extracts the JSON element from the JSON request as per the JSONPath defined.
Consider the following sample input that will be processed using the JSON codec to extract a unique JSON element:
{
"entities":[
{
"entity_type":"CostCenter",
"properties":{
"Id":"10097",
"LastUpdateTime":1455383881190,
"Currency":"USD",
"ApproveThresholdAmount":100000,
"AccountingCode":"5555",
"CostCenterAttachments":"{\"complexTypeProperties\":[]}"
}
}
],
"operation":"UPDATE"
}
In the Extract rule, assuming that the AccountingCode needs to be protected, the JSONPath that will be set is entities[*].properties.AccountingCode. Based on the input JSON structure, the JSONPath value differs.
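As an illustration, the following Python sketch navigates a trimmed copy of the sample request the way the JSONPath entities[*].properties.AccountingCode does (plain-Python navigation; the DSG's JSONPath engine is the authoritative implementation):

import json

request = json.loads("""{
  "entities": [
    {
      "entity_type": "CostCenter",
      "properties": {"Id": "10097", "AccountingCode": "5555"}
    }
  ],
  "operation": "UPDATE"
}""")

for entity in request["entities"]:                 # entities[*]
    print(entity["properties"]["AccountingCode"])  # .properties.AccountingCode -> 5555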
The following figure illustrates the JSON payload fields.
The properties for the JSON payload are explained in the following table.
Properties | Sub-Field | Description |
---|---|---
JSONPath | | The JSON element value to be extracted is specified in this field. Note: Ensure that you enter the JSONPath by following the proper syntax for extracting the JSON element value. If you enter incorrect syntax, then the service that has this JSON payload definition in the rule fails to load and process the request. |
Allow Empty String | | Enable to pass values that consist of only whitespace, such as "value": " ", that are part of the JSON payload, and continue processing the sequential rules. If this check box is disabled, then the Extract rule does not process values that consist of only whitespace. |
Preserve Element Order | | Select to preserve the order of key-value pairs in the JSON response. |
Fail Transaction | | Select to fail the transaction when an error is encountered during tokenization. The error might occur due to the use of an incorrect token element to protect the input data. For example, when handling integer input data, the accurate token element would be an integer token element. The DSG uses tokenization logic to perform data protection, and this logic works to its optimum only if the correct token element is used to protect the input data. It is therefore recommended that, before defining rules, you analyze the input data and identify the accurate token element to protect the data. For more information about identifying the token element that will best suit the input data, refer to the Protegrity Protection Methods Reference Guide 9.0.0.0. |
Minimize Output | | Select to minify the JSON response. The JSON response is displayed in a compact form as opposed to the indented JSON response that the DSG sends. Note: It is recommended that this option is selected when the JSON input includes deeply nested key-value pairs. |
Process Mode | | Select to parse JSON data types. This field includes the following three options:- Simple - Primitive
- Complex - Stringify
- Complex - Recurse
|
| Complex - Stringify | Select to process complex JSON data types, such as arrays and objects, by converting them to string values and serializing them to JSON before they are passed to the child rule. This option is displayed by default. |
| Simple - Primitive | Select to process primitive data types, namely string, int, float, and boolean. This mode does not support the processing of complex data types, such as arrays and objects; when it matches a complex JSON data type, the processing fails and an error message is displayed. |
| Complex - Recurse | Select to process the complex JSON data type and iterate through the JSON array or object recursively. |
The following table describes the additional configuration option for the Recurse mode.
Options | Description | Default |
---|---|---
recurseMaxDepth | Maximum recursion depth that can be set for iterating matched arrays or objects. CAUTION: This parameter comes into effect only when the Complex - Recurse mode is selected. It is not supported for the Complex - Stringify and the Simple - Primitive modes. | 25 |
JSONPath Examples
This section provides guidance on the type of JSONPath expressions that DSG understands. This guidance must be considered before you define the acceptable JSONPath to be extracted when using the JSON codec.
The DSG supports the following operators.
CAUTION:
The $ operator is not supported.
Operator | Description | Example |
---|---|---
* | Wildcard to select all elements in scope. | foo.*.baz, ["foo"][*]["baz"] |
.. | Skip any number of elements in path. | foo..baz |
[] | Access arrays or names with spaces in them. | ["foo"]["bar"]["baz"], array[-1].attr, [3] |
[start:stop:step] | Slicing arrays. | array[1:-1:2] |
=, >, <, >=, <= and != | Filter using these elements. | foo(bar.baz=true) foo.bar(baz>0).baz foo(bar="yawn").bar |
To understand the JSONPath, consider the following example JSON. The subsequent table provides JSONPath examples that can be used with the example JSON.
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "J. R. R. Rowling",
"title": "Harry Potter and Chamber of Secrets",
"isbn": "0-395-12345-8",
"price": 29.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien ",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
},
{
"category": "fiction",
"author": "Arthur Conan Doyle ",
"title": "Sherlock Homes",
"isbn": "0-795-19395-8",
"price": 9
}
]
}
}
The following table provides the JSONPath examples based on the JSON example.
JSONPath | Description | Notes |
---|---|---
store..title | All titles are displayed. | The given JSONPath examples are different in construct but provide the same result. |
store.book[*].title | | |
store.book..title | | |
["store"]["book"][*]["title"] | | |
store.book[0].title | The first title is displayed. | The given JSONPath examples are different in construct but provide the same result. |
["store"]["book"][0]["title"] | | |
store.book[1:-1].title | All titles except first and last title are displayed. | The given JSONPath examples are different in construct but provide the same result. |
["store"]["book"][1:-1]["title"] | | |
["store"]["book"](price>=9)["title"] | All titles with book price greater than or equal to 9 or 9.0. | |
["store"]["book"](price>9)["title"] | All titles with book price greater than 9 or 9.0. | |
["store"]["book"](price<9)["title"] | All titles with book price less than 9 or 9.0. | |
["store"]["book"](price<=9)["title"] | All titles with book price less than or equal to 9 or 9.0. | |
14 - JSON with Tree-of-Trees (ToT)
This section provides an overview of the JSON with Tree-of-Trees (ToT) payload. The JSON ToT payload allows you to use the advantages offered by Tree of Trees to extract the JSON payload from the request and provide protection according to the data element defined. A Profile Reference can also be used to process different elements of the JSON.
The following figure illustrates the JSON ToT fields.
The properties for the JSON ToT payload are explained in the following table:
Properties | Sub-Field | | Description |
---|---|---|---
Allow Empty String | | | Enable to pass values that consist of only whitespace, such as "value": " ", that are part of the JSON payload, and continue processing the sequential rules. If this check box is disabled, then the Extract rule does not process values that consist of only whitespace. |
JSON Paths with Profile Reference | | | JSON path and profile references must be selected. |
| JSON Path | | JSON path representing the JSON field targeted for extraction. |
| Profile Name | | Profile to be used to perform transform operations on the matched content. |
| User Comments | | Additional information related to the action performed by the group processing. |
| Process Mode | | Select to parse JSON data types. This field includes the following three options:- Simple - Primitive
- Complex - Stringify
- Complex - Recurse
|
| | Complex - Stringify | Select to process complex JSON data types, such as arrays and objects, by converting them to string values and serializing them to JSON before they are passed to the child rule. This option is displayed by default. |
| | Simple - Primitive | Select to process primitive data types, namely string, int, float, and boolean. This mode does not support the processing of complex data types, such as arrays and objects; when it matches a complex JSON data type, the processing fails and an error message is displayed. |
| | Complex - Recurse | Select to process the complex JSON data type and iterate through the JSON array or object recursively. |
Preserve Element Order | | | Select to preserve the order of key-value pairs in the JSON response. This option is selected by default when you create the JSON ToT rule. |
Fail Transaction | | | Select to fail the transaction when an error is encountered during tokenization. The error might occur due to the use of an incorrect token element to protect the input data. For example, when handling integer input data, the accurate token element would be an integer token element. The DSG uses tokenization logic to perform data protection, and this logic works to its optimum only if the correct token element is used to protect the input data. It is therefore recommended that, before defining rules, you analyze the input data and identify the accurate token element to protect the data. This option is selected by default when you create the JSON ToT rule. For more information about identifying the token element that will best suit the input data, refer to the Protegrity Protection Methods Reference Guide 9.0.0.0. |
Minimize Output | | | Select to minify the JSON response. The JSON response is displayed in a compact form as opposed to the indented JSON response that the DSG sends. This option is deselected by default when you create the JSON ToT rule. Note: It is recommended that this option is selected when the JSON input includes deeply nested key-value pairs. |
The following table describes the additional configuration option for the Recurse mode.
Options | Description | Default |
---|---|---
recurseMaxDepth | Maximum recursion depth that can be set for iterating matched arrays or objects. CAUTION: This parameter comes into effect only when the Complex - Recurse mode is selected. It is not supported for the Complex - Stringify and the Simple - Primitive modes. | 25 |
15 - Microsoft Office Documents
This payload extracts Microsoft Office documents from the request and lets you define regex to control precise extraction.
The following figure illustrates the MS Office payload fields.
The properties for the Microsoft Office documents payload are explained in the following table.
Properties | Sub-Field | Description |
---|---|---
Pattern | | The regular expression pattern on which the extraction is applied. For example, if the text is “Hello World”, the pattern would be “\w+”. |
| Pattern Group Id | The group number to extract from the regular expression Pattern. For example, for the text “Hello World”, a Group Id of 0 extracts the characters of the entire match, as per regex group numbering. |
| Profile Name | Profile to be used to perform transform operations on the matched content. |
| User Comments | Additional information related to the action performed by the group processing. |
Length Preservation | | Data transformation output is padded with spaces to make the output length equal to the input length. |
16 - Multipart Mime Payload
This payload extracts mime payload from the request and lets you define regex to control precise extraction.
The following figure illustrates the Multipart Mime payload.
The properties for the Multipart Mime payload are explained in the following table.
Properties | Description |
---|---
Headers | Name-Value pair of the headers to be intercepted. |
Message Body | Intercept the message matching the regular expression. |
Target Object | Target message to be extracted. |
17 - PDF Payload
Describes the PDF payload.
This payload extracts PDF payload from the request and lets you define regex to control precise extraction.
The following figure illustrates the PDF payload fields.
The properties for the PDF payload are explained in the following table.
Properties | Description |
---|---
Pattern | The pattern to be matched is specified in this field. If no pattern is specified, then the whole input is considered for matching. |
Note: The DSG PDF codec supports only text formats in PDFs.
For any assistance in supporting additional text formats, contact Protegrity Professional Services.
18 - Protocol Buffer Payload
The PBPath defines a way to address fields in binary-encoded Protocol Buffers messages. It uses field IDs to address messages or fields in a nested message hierarchy.
An example for the PBpath field is shown as follows:
1.101.2.201.301.2.401.701.2.802
In the DSG, Protocol Buffers version 2 (proto2) is used.
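Each segment of a PBPath selects a field by its numeric ID at the corresponding nesting level. The following hypothetical proto2 definition sketches how the leading segments of a path such as 1.101 would resolve; the message and field names are illustrative only.

syntax = "proto2";

// Hypothetical messages illustrating PBPath addressing by field ID.
message Envelope {
  optional Payload payload = 1;      // PBPath segment "1" selects this field
}

message Payload {
  optional string account_id = 101;  // "1.101" then selects this string field
}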
The following figure illustrates the Protocol Buffer (protobuf) payload fields.
The properties for the Protocol Buffer payload are explained in the following table.
Properties | Description |
---|---
PBPath List | The PB element value to be extracted is specified in the PBPath. Note: Ensure that you enter the PBPath by following the proper syntax for extracting the protobuf messages. If you enter incorrect syntax, then the service that has this protobuf payload definition in the rule fails to load and process the request. |
19 - Secure File Transfer Payload
Describes the Secure File Transfer payload.
This payload extracts the SFTP message from the request and allows further processing to be done on the files.
The following figure illustrates the Secure File Transfer payload fields.
The properties for the Secured File Transfer payload are explained in the following table.
Properties | Description |
---|---
File Name | Name of the file to be matched. If the field is left blank, then all the files are matched. |
Method | Rule to be applied on the download or the upload of files. |
20 - Shared File
This payload extracts a file from the request and lets you define regex to control precise extraction. It is generally used with mounted services, namely NFS and CIFS.
The following figure illustrates the NFS/CIFS share-related Shared File payload fields.
The properties for the Shared File payload are explained in the following table.
Properties | Description |
---|---
File Key | Regex logic to identify the source file key to be extracted. Note: Click Test Regex to verify that the regex expression is valid. |
Target File | Attribute that will be extracted from the payload. The options are: |
21 - SMTP Message Payload
This payload extracts SMTP payload from the request and lets you define regex to control precise extraction.
The following figure illustrates the SMTP message payload fields.
The properties for the SMTP payload are explained in the following table.
Properties | Description |
---|---
SMTP Message Type | Type of SMTP message to be intercepted. |
Method | Condition that determines whether matching is performed on the files that are uploaded or the files that are downloaded. |
Command | Regular expression to be matched with a command. |
Target Object | Attribute to be extracted. |
22 - Text Payload
This payload extracts text payload from the request and lets you define regex to control precise extraction.
The following figure illustrates the Text payload fields.
The properties for the Text payload are explained in the following table.
Properties | Sub-Field | Description |
---|---|---
Prerequisite Match Pattern | | Regular expression to be matched before the action is executed. |
Pattern | | The regular expression pattern on which the extraction is applied. For example, if the text is “Hello World”, the pattern would be “\w+”. |
| Pattern Group Id | The group number to extract from the regular expression Pattern. For example, for the text “Hello World”, a Group Id of 0 extracts the characters of the entire match, as per regex group numbering. |
| Profile Name | Profile to be used to perform transform operations on the matched content. |
| User Comments | Additional information related to the action performed by the group processing. |
Encoding | | Type of encoding to be used. |
Codec | | The encoding method used is specified in this field. For more information about codec types, refer to the section Standard Encoding Method List. |
23 - URL Payload
Describes the URL payload.
This payload extracts the URL payload from the request and extracts the precise object based on the selection.
The following figure illustrates the URL payload fields.
The properties for the URL payload are explained in the following table.
Properties | Description |
---|---
Target Object | Object attribute to be extracted. |
24 - User Defined Extraction Payload
Describes the user defined extraction payload.
This codec lets you define custom extraction logic and pass arguments to the next rule. The language that is currently supported for extraction is Python.
From DSG 3.0.0.0, the Python version is upgraded to Python 3. UDFs written in Python 2.7 will not be compatible with Python 3.10. To migrate UDFs from Python 2 to Python 3, refer to the section Migrating the UDFs to Python 3.
The following figure illustrates the User Defined Extraction payload fields.
The properties for the User Defined Extraction payload are explained in the following table.
Properties | Description |
---|---
Programming Language | Programming language used for data extraction. The language that is currently supported for extraction is Python. |
Source Code | Source code for the selected programming language. CAUTION: Ensure that the class name UserDefinedExtraction is not changed while creating the UDF. Note: For more information about the supported libraries apart from the default Python modules, refer to the section Supported Libraries. |
Initialization Arguments | The list of arguments passed to the constructor of the user defined extraction code is specified in this field. |
Rule Advanced Settings | Provide a specific blocked module that must be overruled. The module will be overruled only for that extract rule. The parameter must be set to the name of the module that must be overruled, in the following format: {"override_blocked_modules": ["<name of module>", "<name of module>"]} Note: Currently, methods cannot be overruled using the Advanced Settings. For more information about the allowed methods and modules, refer to the section Allowed Modules and Methods in UDF. Using the Rule Advanced Settings option, any module that is blocked can be overruled to be unblocked. For example, the following modules are allowed in the gateway.json file: "globalUDFSettings" : { "allowed_modules": ["bs4", "common.logger", "re", "gzip", "fromstring", "cStringIO", "struct", "traceback"] } The os module is not listed as part of the allowed_modules parameter in the gateway.json file, so it is blocked. To allow the use of the os module in the Source Code of UDF rules, set {"override_blocked_modules": ["os"]} in the Advanced Settings of the extract rule. Note: By overriding blocked modules, you risk introducing security risks to the DSG system. |
Note: The DSG supports the usage of the PyJWT Python library in custom UDF creations. PyJWT is a Python library that is used to implement Open Authentication (OAuth) using JSON Web Tokens (JWT). JSON Web Tokens (JWT) is an open standard that defines how to transmit information between a sender and a receiver as a JSON object. To authenticate JWT for OAuth, you must write a custom UDF. The PyJWT library version supported by the DSG is 1.7.1.
For more information about writing a custom UDF on the DSG, refer to the section User Defined Functions (UDF).
Note: The DSG supports the usage of the Kafka Python library in custom UDF creations. Kafka is a Python library that is used for storing, processing, and forwarding data for applications in a distributed environment. For example, the DSG uses the Kafka library to forward Transaction Metrics logs to external applications. The Kafka library version supported by the DSG is 2.0.2.
Note: The DSG supports the usage of the Openpyxl Python library in custom UDF creations. Openpyxl is a Python library that is used to parse Excel xlsx, xlsm, xltx, xltm files. This library enables column-based transformation for Microsoft Office Excel. The Openpyxl library version supported by the DSG is 2.6.4.
Note: The DSG uses the built-in tarfile Python module for custom UDF creation. This module is used in the DSG to parse .tar and .tgz packages. Using the tarfile module, you can extract and decompress .tar and .tgz packages.
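The following minimal skeleton illustrates the expected shape of a UDF. The class name UserDefinedExtraction must not be changed; the processing method name and signature shown here are assumptions for illustration and must be adapted to the interface described in the section User Defined Functions (UDF).

import re

class UserDefinedExtraction:
    def __init__(self, *args):
        # The rule's Initialization Arguments are passed to the constructor;
        # here the first argument is assumed to be a regex pattern (hypothetical).
        self.pattern = re.compile(args[0] if args else r".*")

    def extract(self, data):
        # Hypothetical processing hook: return the content matched by the
        # configured pattern so that the next rule can transform it.
        match = self.pattern.search(data)
        return match.group(0) if match else data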
25 - ZIP Compressed File Payload
Describes the ZIP file payload.
This payload extracts a ZIP file from the request and lets you extract the file name or the file content.
The following figure illustrates the ZIP Compressed File payload fields.
The properties for the ZIP payload are explained in the following table.
Properties | Description |
---|---
File Name | Name of the file on which action is to be performed. |
Target Object | File name or the file content to be extracted. |