Enhanced Adobe PDF Codec

Describes the enhanced PDF codec.

The Enhanced Adobe PDF codec extracts the PDF payload from the request and lets you define a regular expression (regex) to control exactly what is extracted. This payload is available when the Action type is set to Extract.
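
As a rough illustration of the kind of pattern that can narrow extraction, the following Python sketch matches a value enclosed between a prefix and a suffix on a single line. The `[[TOKEN:` and `]]` markers, the sample text, and the function name are hypothetical, and the regex dialect accepted by the DSG may differ from Python's `re` module.

```python
import re

# Hypothetical markers; choose a prefix and suffix that never occur in the data itself.
PREFIX = "[[TOKEN:"
SUFFIX = "]]"

# Match the shortest run of non-newline characters between the markers,
# so that a match never spans more than one line.
PATTERN = re.compile(re.escape(PREFIX) + r"([^\n]+?)" + re.escape(SUFFIX))


def find_protected_values(text: str) -> list:
    """Return every value found between PREFIX and SUFFIX on a single line."""
    return PATTERN.findall(text)


# The second value is broken across two lines, so it is not matched.
sample = "Name: [[TOKEN:a8Kq2Lm]]\nCard: [[TOKEN:9f3x\nZ1]]\n"
values = find_protected_values(sample)
```

Only the first value is returned here, because the second one is interrupted by a line break before its suffix.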

When you construct the ruleset for this codec, you must include a child Text extract rule under the Enhanced Adobe PDF codec extract rule. Do not use any rule other than the child Text extract rule under the Enhanced Adobe PDF codec extract rule.

In the DSG, some font files are already added to the /opt/protegrity/alliance/config/pdf_fonts directory. By default, the following font file is set in the gateway.json file.

"pdf_codec_default_font": {
    "name": "OpenSans-Regular.ttf"
}

Note: The Advanced Settings can be used to configure the default font file for a specific rule.

If you want to process a PDF file that contains custom fonts, upload the font files to the /opt/protegrity/alliance/config/pdf_fonts directory. If the custom fonts are not uploaded to this directory, the OpenSans-Regular.ttf font file is used to process the PDF file.
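
The fallback behavior described above can be modeled with a short Python sketch. This is not the DSG's implementation; the `resolve_font` function and the `MyCorp-Sans.ttf` and `NotUploaded.ttf` file names are hypothetical, and a temporary directory stands in for the real pdf_fonts directory.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Default font file named in gateway.json, per the text above.
DEFAULT_FONT = "OpenSans-Regular.ttf"


def resolve_font(font_dir: Path, requested: Optional[str]) -> Path:
    """Return the requested font if it was uploaded, otherwise the default font."""
    if requested:
        candidate = font_dir / requested
        if candidate.is_file():
            return candidate
    return font_dir / DEFAULT_FONT


# Demonstration against a temporary directory standing in for pdf_fonts.
with tempfile.TemporaryDirectory() as tmp:
    fonts = Path(tmp)
    (fonts / DEFAULT_FONT).touch()
    (fonts / "MyCorp-Sans.ttf").touch()  # hypothetical uploaded custom font
    uploaded = resolve_font(fonts, "MyCorp-Sans.ttf").name
    missing = resolve_font(fonts, "NotUploaded.ttf").name
```

A check like this can be run before submitting a PDF to confirm that the fonts it relies on are present, instead of discovering the fallback in the output.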

For how-to examples of detokenizing a PDF, refer to the sections Using Amazon S3 to Detokenize a PDF and Using HTTP Tunnel to Detokenize a PDF in the Protegrity Data Security Gateway How-to Guide.

The following figure displays the Enhanced Adobe PDF payload fields.

Figure: Enhanced PDF codec

The properties for the Enhanced Adobe PDF payload are explained in the following table.

Note: The configurations in the Advanced Settings apply only to the rule in which they are set.

Properties    Description
Pattern    The pattern to match. If no pattern is specified, then the whole input is considered for matching.
Advanced Settings    Set the following additional configurations for the Enhanced Adobe PDF codec.
Set the margins that determine whether text in the PDF file is treated as part of the same line or paragraph.
  • Set the {"layout_analysis_config": {"char_margin": 0.1}} setting to check whether two characters are closer together than the margin set, which determines whether they are part of the same line. The margin is determined by the width of the characters.
  • Set the {"layout_analysis_config": {"line_margin": 0.1}} setting to check whether two lines are closer together than the margin set, which determines whether they are part of the same paragraph. The margin is determined by the height of the lines.

Note: The {"layout_analysis_config": {"char_margin": 0.1, "line_margin": 0.1}} settings can also be configured in the gateway.json file.
Set the default font file to process the PDF file.
  • Set the {"pdf_codec_default_font": {"name": "<name of font file>"}} setting to process the PDF file using this font file.
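
The settings above can be assembled and checked as JSON with a short Python sketch. It assumes that both keys can be combined in one Advanced Settings object, which the table does not state explicitly. The `same_line` and `same_paragraph` helpers are hypothetical and assume that each margin is a multiple of the character width or line height; they illustrate the idea and do not describe the DSG's actual layout analysis.

```python
import json

# Assumption: both settings may be combined in one Advanced Settings object.
advanced_settings = {
    "layout_analysis_config": {
        "char_margin": 0.1,
        "line_margin": 0.1,
    },
    "pdf_codec_default_font": {
        "name": "OpenSans-Regular.ttf",
    },
}

# Serialize, then parse again to confirm the text is well-formed JSON.
settings_json = json.dumps(advanced_settings, indent=2)
parsed = json.loads(settings_json)


def same_line(gap: float, char_width: float, char_margin: float) -> bool:
    """Assumed rule: characters join a line if their gap is under margin * width."""
    return gap < char_margin * char_width


def same_paragraph(gap: float, line_height: float, line_margin: float) -> bool:
    """Assumed rule: lines join a paragraph if their gap is under margin * height."""
    return gap < line_margin * line_height


margins = parsed["layout_analysis_config"]
close_chars = same_line(0.5, 10.0, margins["char_margin"])        # gap 0.5 vs threshold 1.0
distant_chars = same_line(2.0, 10.0, margins["char_margin"])      # gap 2.0 vs threshold 1.0
close_lines = same_paragraph(1.0, 12.0, margins["line_margin"])   # gap 1.0 vs threshold 1.2
```

Under these assumptions, a larger margin groups more widely spaced characters or lines together, and a smaller margin splits them apart.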

Known Limitations

The following list describes the known limitations for this release.

  1. The Enhanced Adobe PDF codec does not support detokenization of sensitive data that is split across multiple lines. The prefix, the data to be detokenized, and the suffix must all be on a single line.
  2. Embedded fonts are not supported. When you upload fonts, ensure that the entire character set for that font family is uploaded to the DSG.
  3. The prefix and suffix used to identify the data to be detokenized must be unique and must not be part of the data.
  4. PDFs created with the rotate operator are not supported for detokenization.
  5. The Enhanced Adobe PDF codec does not process password-protected PDFs.
  6. The detokenized data appears spaced out, with extra white space.
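
Because the codec does not process password-protected PDFs, it can help to screen files before sending them. The following Python sketch is a rough heuristic and is not part of the DSG: it relies on the fact that an encrypted PDF declares an /Encrypt entry in its trailer, and it scans the raw bytes for that name. The function name and the sample byte strings are hypothetical, and the scan can report a false positive if the same bytes occur elsewhere in the file.

```python
def looks_password_protected(pdf_bytes: bytes) -> bool:
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary in the trailer."""
    return b"/Encrypt" in pdf_bytes


# Minimal, hypothetical trailer fragments used only to exercise the check.
plain = b"%PDF-1.7\ntrailer\n<< /Size 4 /Root 1 0 R >>\n%%EOF\n"
protected = b"%PDF-1.7\ntrailer\n<< /Size 5 /Root 1 0 R /Encrypt 4 0 R >>\n%%EOF\n"

plain_flag = looks_password_protected(plain)
protected_flag = looks_password_protected(protected)
```

For a real file, read its contents with `open(path, "rb").read()` and pass the bytes to the function. A dedicated PDF library gives a more reliable answer than this byte scan.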
Last modified : September 26, 2024