Building the Request using the REST API

Use the information provided in this section to build the REST API request for performing the Protegrity Anonymization transformation.

Identifying the source and target

The source dataset is the starting point of the transformation. In this step, you specify the source dataset that must be transformed and the target where the anonymized data is saved.

  • The following file formats are supported:
    • Comma separated values (CSV)
    • Columnar storage format: An optimized file format for large amounts of data that provides faster results, for example, Parquet (gzip and snappy compression).
  • The following types of data storage have been tested for Protegrity Anonymization:
    • Local File System
    • Amazon S3
  • The following types of data storage can also be used for Protegrity Anonymization:
    • Microsoft Azure Storage
      • Data Lake Storage
      • Blob Storage
    • S3 bucket Storage
    • Other S3 Compatible Services

Use the following code to specify the source:

Note: Modify the source and destination code for your provider.

For more cloud-related sample codes, refer to the section Samples for Cloud-related Source and Destination Files.

"source": {
      "type": "File",
      "file": {
        "name": "<Source_file_path>"
      }
}

Note: When uploading a file to the cloud service, wait until the entire source file is uploaded before running the anonymization job.

Similarly, specify the target file using the following code:

"target": {
    "type": "File",
    "file": {
      "name": "<Target_file_path>"
    }
}
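The source and target sections above share the same structure. The following sketch (in Python) assembles both into one request body; the file paths are placeholders, not part of the API.

```python
# A minimal sketch of assembling the "source" and "target" sections shown
# above. The paths are placeholders; substitute your own file locations.
import json

def build_file_endpoint(path):
    """Return the File structure used by both the source and target sections."""
    return {"type": "File", "file": {"name": path}}

request_body = {
    "source": build_file_endpoint("/data/in/patients.csv"),        # placeholder path
    "target": build_file_endpoint("/data/out/patients_anon.csv"),  # placeholder path
}

print(json.dumps(request_body, indent=2))
```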

Specify additional parameters for the source and target files, such as the character used to separate the values in the file, using the following properties attribute. If a property is not specified, then the default value shown here is used.

"props": {
    "sep": ",",
    "decimal": ".",
    "quotechar": "\"",
    "escapechar": "\\",
    "encoding": "utf-8",
    "line_terminator": "\n"
}
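The defaults above describe a standard CSV dialect. As an illustration, the following sketch parses a sample line with Python's csv module using the same separator, quote, and escape characters; the sample data is invented.

```python
# Parse a sample CSV line using the same dialect as the default props above:
# sep=",", quotechar='"', escapechar="\\". The sample line is illustrative.
import csv
import io

sample = 'id,"Smith, Jane",3.14\n'
reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"', escapechar="\\")
row = next(reader)
print(row)  # the quoted field keeps its embedded comma
```

Note how the quoted second field retains its embedded comma instead of splitting into two values.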

If the required files are in cloud storage, then specify the cloud-related access information using the following code:

"accessOptions": {
}

For more information about specifying the source and target files, refer to Dask remote data configuration.
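Since the documentation points to the Dask remote data configuration, the sketch below assumes accessOptions carries S3-style credential fields as used by Dask's storage_options ("key"/"secret"). The field names and values are assumptions; confirm the exact fields for your storage provider and deployment.

```python
# Hypothetical accessOptions fragment for S3-style storage. The "key" and
# "secret" field names are assumed from Dask's storage_options convention;
# the values are placeholders.
access_options = {
    "key": "<AWS_access_key_ID>",         # placeholder, assumed field name
    "secret": "<AWS_secret_access_key>",  # placeholder, assumed field name
}
request_fragment = {"accessOptions": access_options}
print(sorted(request_fragment["accessOptions"]))
```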

Note: If the target directory already exists, then the job fails. If the target file already exists, then the file will be overwritten. Additionally, some cloud services have limitations on the file size. If such a limitation exists, then you can set the single_file switch to no when writing large files to cloud storage. This saves the output as multiple files to avoid any errors related to saving large files to cloud storage.

Specifying the Transformation

For more information about specifying the transformation, refer to Specifying the Transformation.

Classifying the Fields

For more information about different fields classification, refer to Classifying the Fields.

The following data types are supported for working with the data in the fields:

  • Integer
  • Float
  • String
  • Date
  • Time
  • DateTime

Date: The following date formats are supported:

  • mm-dd-yyyy - This is the default format.
  • dd-mm-yyyy
  • dd-mm-yy
  • mm-dd-yy
  • dd.mm.yyyy
  • mm.dd.yyyy
  • dd.mm.yy
  • mm.dd.yy
  • dd/mm/yyyy
  • mm/dd/yyyy
  • dd/mm/yy
  • mm/dd/yy

Time: HH is used to specify time in the 24-hour format and hh is used to specify time in the 12-hour format. The following time formats are supported:

  • HH:mm:ss - This is the default format.
  • HH:mm:ss.ns
  • hh:mm:ss
  • hh:mm:ss.ns
  • hh:mm:ss.ns p - Here, p is the AM/PM period indicator for the 12-hour format.
  • HH:mm:ss.ns z - Here, z is the timezone offset from UTC, for example, +0000, +0530, -0230.
  • hh:mm:ss Z - Here, Z is the timezone name, for example, UTC, EST, CST.
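The HH/hh distinction above can be illustrated with Python's datetime. Note that Python's strptime uses different directives (%H for 24-hour, %I for 12-hour, %p for AM/PM); the mapping below is illustrative only, not part of this API.

```python
# Illustrating the 24-hour (HH) versus 12-hour (hh ... p) formats using
# Python's strptime directives, which differ from the tokens in this section.
from datetime import datetime

t24 = datetime.strptime("23:05:09", "%H:%M:%S")        # HH:mm:ss
t12 = datetime.strptime("11:05:09 PM", "%I:%M:%S %p")  # hh:mm:ss p
print(t24.hour, t12.hour)  # both represent the same hour of the day
```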

A few examples follow:

{
            "classificationType": "Non-Sensitive Attribute",
            "dataType": "Integer",
            "name": "index"
}
        
{
            "classificationType": "Sensitive Attribute",
            "dataType": "String",
            "name": "diagnosis_dup"
}

Note: The values present in the first row of the dataset are considered for determining the format for date, time, and datetime. You can override the detection using "props": {"dateformat": "<Specify_Format>"}.

Consider the following example for date with the mm/dd/yyyy format:

10/09/2020
12/24/2020
07/30/2020

In this case, the data is identified as dd/mm/yyyy, because the first row (10/09/2020) is also a valid dd/mm/yyyy date.

You can override the detected format using the following property:

"props": {"dateformat": "mm/dd/yyyy"}

Specifying the Privacy Model

For more information about Protegrity Anonymization methods for privacy model, refer to Specifying the Privacy Model.

Specifying the Hierarchy

For more information about how the information in the data set is handled for Protegrity Anonymization, refer to Specifying the Hierarchy.

Generalization

For more information about grouping data into sets having similar attributes, refer to Generalization.

Micro-Aggregation

For more information about the mathematical formulas used to group the data, refer to Micro-Aggregation.

Specifying Configurations

Additional configurations are available in the Protegrity Anonymization to enhance the anonymity of the information in the data set.

The following configurations are available:

"config": {
    "maxSuppression": 0.1 
    "suppressionData": "*"
    "redactOutliers": False
}
  • maxSuppression specifies the maximum fraction of rows that may be suppressed as outliers to obtain the anonymized data. The default is 0.1, that is, 10%.
  • suppressionData specifies the character or character set used to mask suppressed values in the anonymized data. The default is *.
  • redactOutliers specifies whether outlier rows are removed from the anonymized dataset. The default is false, which includes the outlier rows.
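The effect of maxSuppression can be sketched as follows. This mirrors the description above (a fraction of rows allowed to be suppressed as outliers); it is an illustration, not product code.

```python
# Sketch of the bound that maxSuppression places on outlier suppression,
# using the default of 0.1 (10% of rows). Illustrative only.
def max_suppressed_rows(total_rows, max_suppression=0.1):
    """Maximum number of rows that may be suppressed as outliers."""
    return int(total_rows * max_suppression)

print(max_suppressed_rows(1000))  # at most 100 of 1000 rows may be suppressed
```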

Last modified: March 24, 2026