Building the Protegrity Anonymization Request

Use the APIs provided with Protegrity Anonymization to create your request.
  • To use the APIs, you need to specify the source (file or data) that must be transformed. The source can be a single row or multiple rows of data sent in the request, or a file located in cloud storage.
  • Next, you need to specify the transformation that must be performed on the various columns in the table.
  • Finally, after the transformation is complete, you can save the output or use it for further processing.

The transformation request can be saved for processing further requests. It can also be used as an input in machine learning.

1 - Common Configurations for Building the Request

Use the information provided in this section to build the REST API and Python SDK requests for performing the Protegrity Anonymization transformation.

Specifying the Transformation

The data store consists of various fields. These fields must be identified for processing the data, and the type of transformation to perform on each field must be specified. Also specify the type of privacy model to use for anonymizing the data. While specifying the transformation rules, specify the importance of the data.

Classifying the Fields

Specify the type of information that the fields hold. Perform this classification carefully: leaving out important fields might lead to the anonymized data being of no value, while failing to classify data that can identify users poses a risk of the anonymization not being carried out properly.

The following four different classifications are available:

| Classification | Description | Function | Treatment |
|---|---|---|---|
| Direct Identifier | Used for fields that directly identify an individual, such as Name, SSN, phoneNo, and email. | Redact | Values will be removed. |
| Quasi Identifying Attribute | Used for fields that do not identify an individual directly but must be modified to avoid indirect identification, for example age, date of birth, and zip code. | Hierarchy models | Values will be transformed using the options specified. |
| Sensitive Attribute | Used for fields that do not identify an individual directly but hold sensitive information. This data needs to be preserved to support further analysis and retain the utility of the anonymized data. Records with this classification must be part of a larger group so that they no longer enable identification of an individual. | LDiv, TClose | No change in values, except extreme values that might identify an individual. Values will be generalized in case of t-closeness. |
| Non-Sensitive Attribute | Used for fields that do not identify an individual directly or indirectly. | Preserve | No change in values. |

Ensure that you identify the sensitive and the quasi-identifier fields for specifying the Protegrity Anonymization method for hiding individuals in the dataset.

Use the following code for specifying a quasi-identifier for REST API and Python SDK:

"classificationType": "Quasi Identifier",
e['<column>'] = asdk.Gen_Mask(maskchar='#', maxLength=3, maskOrder="L")

Specifying the privacy model

The privacy model transforms the dataset using one or several Protegrity Anonymization methods to achieve privacy.

The following anonymization techniques are available in Protegrity Anonymization:

K-anonymity

Ensures that each combination of quasi-identifier values occurs in at least k records. The information type is Quasi-Identifier.

Use the following code for specifying K-anonymity for REST API and Python SDK:

"privacyModel": {
    "k": {
    "kValue": 5
    }
}
e.config.k=asdk.K(2)
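The property that k-anonymity enforces can be sketched in plain pandas. This is an illustrative check only, not part of the anonsdk API:

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

df = pd.DataFrame({
    "gender":    ["Male", "Male", "Female", "Female", "Female"],
    "age_range": ["20-30", "20-30", "20-30", "20-30", "40-50"],
})

# The single ("Female", "40-50") row makes this table fail k=2.
print(is_k_anonymous(df, ["gender", "age_range"], 2))  # False
```

Anonymization generalizes or suppresses quasi-identifier values until every such group reaches size k.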

l-diversity

Ensures that the sensitive values within each group of k records are distributed and diverse enough to reduce the risk of identification. The information type is Sensitive Attribute.

Use the following code for specifying l-diversity for REST API and Python SDK:

"privacyModel": {
    "ldiversity": [
        {
        "lFactor": 2,
        "name": "sex",
        "lType": "Distinct-l-diversity"
        }
    ]
}
e["<column>"]=asdk.LDiv(lfactor=2)

t-closeness

Ensures intra-group diversity by requiring the distribution of every sensitive attribute within a group to be close to its distribution in the whole dataset. The information type is Sensitive Attribute.

Use the following code for specifying t-closeness for REST API and Python SDK:

"privacyModel": {
"tcloseness": [
    {
    "name": "salary-class",
    "emdType": "EMD with equal ground distance",
    "tFactor": 0.2
    }
  ]
}
e["<column>"]=asdk.TClose(tfactor=0.2)

Specifying the Hierarchy

The hierarchy specifies how the information in the dataset is handled for Protegrity Anonymization. These hierarchical transformations are performed on Quasi-Identifiers and Sensitive Attributes. The data can be generalized using transformations or aggregated using mathematical functions. As you go up the hierarchy, the data is anonymized more strongly; however, the quality of the data for further analysis decreases.

Global Recoding and Full Domain Generalization

Global recoding and full domain generalization are used for anonymizing the data. When data is anonymized, the quasi-identifier values are transformed to ensure that the data fulfils the required privacy requirements. This transformation is also called data recoding. In Protegrity Anonymization, data is anonymized using global recoding, that is, the same transformation rule is applied to all entries in the dataset.

Consider the data in the following tables:

| ID | Gender | Age | Race |
|---|---|---|---|
| 1 | Male | 45 | White |
| 2 | Female | 30 | White |
| 3 | Male | 25 | Black |
| 4 | Male | 30 | White |
| 5 | Female | 45 | Black |

| Level0 | Level1 | Level2 | Level3 | Level4 |
|---|---|---|---|---|
| 25 | 20-25 | 20-30 | 20-40 | * |
| 30 | 30-35 | 30-40 | 30-50 | * |
| 45 | 40-45 | 40-50 | 40-60 | * |

In the above example, when global recoding is used for a value such as 45, all occurrences of age 45 are generalized using exactly one of the following levels:

  • 40-45
  • 40-50
  • 40-60
  • *

Full-domain generalization means that all values of an attribute are generalized to the same level of the associated hierarchy. Thus, in the first table, if age 45 is generalized to 40-50, which is Level2, then all other age values are also generalized to Level2. Hence, the value 30 is generalized to 30-40.
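The age hierarchy above can be expressed as a lookup table. A minimal sketch of full-domain generalization with global recoding (illustrative only, not the SDK implementation):

```python
# Age hierarchy from the table above; index 0 is Level1, index 3 is Level4.
hierarchy = {
    25: ["20-25", "20-30", "20-40", "*"],
    30: ["30-35", "30-40", "30-50", "*"],
    45: ["40-45", "40-50", "40-60", "*"],
}

def full_domain_generalize(ages, level):
    """Recode every age at the same hierarchy level (1-4), as global recoding requires."""
    return [hierarchy[age][level - 1] for age in ages]

print(full_domain_generalize([45, 30, 45], 2))  # ['40-50', '30-40', '40-50']
```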

In addition to generalization, micro-aggregation is available for transforming the dataset. In generalization, the mathematical function is performed on all the values of the column. However, in micro-aggregation, the mathematical function is performed on all the values within an equivalence class.

Consider the following table with ages of five men and five women.

| Gender | Age |
|---|---|
| M | 20 |
| M | 20 |
| F | 20 |
| M | 22 |
| M | 22 |
| F | 22 |
| F | 22 |
| M | 28 |
| F | 28 |
| F | 28 |

The following output is obtained by performing a generalization aggregation on the Age using averages, by setting the Gender as QI and keeping the K value as 2.

| Gender | Age | Generalization |
|---|---|---|
| M | 20 | 23.2 |
| M | 20 | 23.2 |
| F | 20 | 23.2 |
| M | 22 | 23.2 |
| M | 22 | 23.2 |
| F | 22 | 23.2 |
| F | 22 | 23.2 |
| M | 28 | 23.2 |
| F | 28 | 23.2 |
| F | 28 | 23.2 |

In the table, the sum of all the ages is obtained and divided by the total number of records, that is, 10, to obtain the generalization value using the average.
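The same figure can be reproduced with pandas: generalization applies the mean over the whole column, so every row receives the same value.

```python
import pandas as pd

ages = pd.Series([20, 20, 20, 22, 22, 22, 22, 28, 28, 28])

# One mean over all 10 records: 232 / 10 = 23.2
generalized = ages.mean()
print(generalized)  # 23.2
```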

The following output is obtained by performing a micro-aggregation on the Age using averages, by setting the Gender as QI and keeping the K value as 2.

| Gender | Age | Micro-Aggregation |
|---|---|---|
| F | 20 | 24 |
| F | 22 | 24 |
| F | 22 | 24 |
| F | 28 | 24 |
| F | 28 | 24 |
| M | 20 | 22.4 |
| M | 20 | 22.4 |
| M | 22 | 22.4 |
| M | 22 | 22.4 |
| M | 28 | 22.4 |

In the table, two equivalence classes are formed based on gender. The sum of the ages in each group is obtained and divided by the number of records in that group, that is, 5, to obtain the micro-aggregation value using the average.
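The micro-aggregation result above can be reproduced with a grouped transform: the mean is computed per equivalence class (here, per gender) instead of over the whole column.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "M", "F", "M", "M", "F", "F", "M", "F", "F"],
    "age":    [20, 20, 20, 22, 22, 22, 22, 28, 28, 28],
})

# Each row gets the mean age of its own gender group.
df["micro_agg"] = df.groupby("gender")["age"].transform("mean")
print(sorted(df["micro_agg"].unique().tolist()))  # [22.4, 24.0]
```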

Generalization

In Generalization, the data is grouped into sets having similar attributes. The mathematical function is applied on the selected column by considering all the values in the dataset.

The following transformations are available:

  • Masking-Based: In this transformation, information is hidden by masking parts of the data to form similar sets. For example, masking the last three digits of the zip code could help group values, such as 54892 and 54231 both being transformed to 54###.

An example of masking-based transformation for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "String",
            "generalization": {
                "hierarchyType": "Rule",
                "rule": {
                    "masking": {
                        "maskOrder": "Right To Left",
                        "maskChar": "#",
                        "maxDomainSize": 5
                    }
                },
                "type": "Masking Based"
            },
            "name": "city"
}

Where:

  • maskOrder is the order for masking; use Right To Left to mask from the right and Left To Right to mask from the left.
  • maskChar is the placeholder character for masking.
  • maxDomainSize is the number of characters to mask. Default is the maximum length of the string in the column.
e["zip_code"] = asdk.Gen_Mask(maskchar="#",  maskOrder = "R", maxLength=5)

Where:

  • maskchar is the placeholder character for masking.
  • maskOrder is the order for masking; use R to mask from the right and L to mask from the left.
  • maxLength is the number of characters to mask. Default is the maximum length of the string in the column.
  • Tree-Based: In this transformation, data is aggregated by transformation to form similar sets using external knowledge. For example, in the case of address, the data can be anonymized based on the city, state, country, or continent, as required. You must specify the file containing the tree data. If the current level of aggregation does not provide adequate anonymization, then a higher level of aggregation is used. The higher the level of aggregation, the more the data is generalized. However, a higher level of generalization reduces the quality of data for further analysis.

An example of tree-based transformation for building a REST API and Python SDK is provided here.

{
          "classificationType": "Quasi Identifier",
          "dataTransformationType": "Generalization",
          "dataType": "String",
          "generalization": {
              "type": "Tree Based",
              "hierarchyType": "Data Store",
              "dataStore": {
                  "type": "File",
                  "file": {
                      "name": "adult_hierarchy_education.csv",
                      "props": {
                          "delimiter": ";",
                          "quotechar": "\"",
                          "header": null
                      }
                  },
                  "format": "CSV"
              }
          },
          "name": "education"
}
treeGen = {'lvl0': [11, 13, 14, 15, 27, 28, 20],
              'lvl1': ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
              'lvl2': ["<20", "<20", "<20", "<30", "<30", "<30", "<30"]}

e["bmi"] = asdk.Gen_Tree(pd.DataFrame(data=treeGen), ["Missing", "Might be <30", " Might be <30"])  

You can refer to an external file for specifying the parameters for the hierarchy tree.

education_df = pd.read_csv('D:\\WS\\data source\\hierarchy\\adult_hierarchy_education.csv', sep=';')
e['education'] = asdk.Gen_Tree(education_df)
  • Interval-Based: In this transformation, data is aggregated into groups according to a predefined interval specified.
    In addition, the lowerbound and upperbound values need to be specified for building the SDK API. Values below the lowerbound and values above the upperbound are excluded from range generation.

An example of interval-based transformation for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Integer",
            "generalization": {
                "hierarchyType": "Rule",
                "rule": {
                    "interval": {
                        "levels": [
                            "5",
                            "10",
                            "50",
                            "100"
                        ],
                        "lowerBound": "0"
                    }
                },
                "type": "Interval Based"
            },
            "name": "age"
}
asdk.Gen_Interval([<interval_level>],<lowerbound>,<upperbound>)

An example of interval-based transformation for building the SDK API is provided here.

e['age'] = asdk.Gen_Interval([5,10,15])
e['age'] = asdk.Gen_Interval([5,10,15],20,60)
  • Aggregation-Based: In this transformation, integer data is aggregated as per the conditions specified. The available options for aggregation are Mean and Mode.

    Note: Mean is applicable for Integer and Decimal data types.

    Mode is applicable for Integer, Decimal, and String data types.

An example of aggregation-based transformation for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Integer",
            "generalization": {
                "hierarchyType": "Aggregate",
                "type": "Aggregation Based",
                "aggregateFn": "Mean"
            },
            "name": "age"
}

An example of aggregation-based transformation using Mean is provided here.

e['age'] = asdk.Gen_Agg(asdk.AggregateFunction.Mean)

An example of aggregation-based transformation using Mode is provided here.

e['salary'] = asdk.Gen_Agg(asdk.AggregateFunction.Mode)
  • Date-Based: In this transformation, data is aggregated into groups according to the date.

An example of date-based interval and rounding for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Date",
            "generalization": {
                    "hierarchyType": "Rule",
                    "type": "Interval Based",
                    "rule": {
                      "daterange": {
                        "levels": [
                          "WD.M.Y",
                          "W.M.Y",
                          "FD.M.Y",
                          "M.Y",
                          "QTR.Y",
                          "Y",
                          "DEC",
                          "CEN"
                        ]
                      }
                    }
                  },
            "name": "date_of_birth"
}
It is not applicable for building Python SDK requests.
  • Time-Based: In this transformation, data is aggregated into groups according to the time. Time intervals are specified in seconds. The lowerBound and upperBound take values in the HH:MM:SS format.

An example of time-based interval and rounding for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Date",
            "generalization": {
                      "hierarchyType": "Rule",
                      "type": "Interval Based",
                      "rule": {
                            "interval": {
                                "levels": [
                                    "30",
                                    "60",
                                    "180",
                                    "240"
                                  ],
                        "lowerBound": "00:00:00",
                        "upperBound": "23:59:59"
                      }
                    }
                  },
            "name": "time_of_birth"
}
It is not applicable for building Python SDK request.
  • Rounding-Based: In this transformation, data is rounded to groups according to a predefined rounding factor specified.

Rounding-based transformation is not applicable for building the REST API request. The following examples are for the Python SDK.

An example of date-based transformation is provided here.

e['DateOfBirth'] = asdk.Gen_Rounding(["H.M4", "WD.M.Y", "M.Y"])

An example of numeric-based transformation is provided here.

e['Interest_Rate'] = asdk.Gen_Rounding([0.05,0.10,1])
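The masking-, interval-, and rounding-based rules above can each be sketched as a small pure function. These are conceptual illustrations only; the names and signatures below are not part of the anonsdk API:

```python
def mask_right(value, mask_char="#", max_length=3):
    """Mask the last max_length characters: 54892 -> '54###'."""
    s = str(value)
    n = min(max_length, len(s))
    return s[:len(s) - n] + mask_char * n

def interval_label(value, width, lower_bound=0):
    """Place a numeric value into a fixed-width interval: age 43, width 10 -> '40-50'."""
    start = lower_bound + ((value - lower_bound) // width) * width
    return f"{start}-{start + width}"

def round_to(value, factor):
    """Round to the nearest multiple of a rounding factor: 0.07 with factor 0.05 -> 0.05."""
    return round(round(value / factor) * factor, 10)

print(mask_right(54892))       # 54###
print(interval_label(43, 10))  # 40-50
print(round_to(0.07, 0.05))    # 0.05
```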

Micro-Aggregation

In Micro-Aggregation, mathematical formulas are used to group the data. This is used to achieve K-anonymity by forming small groups of data in the dataset.

The following aggregation functions are available for micro-aggregation in the Protegrity Anonymization:

  • For numeric data types (integer and decimal):
    • Arithmetic Mean

    • Geometric Mean

      Note: Micro-Aggregation using geometric mean is only supported for positive numbers.

    • Median

  • For all data types:
    • Mode

Note: Arithmetic Mean, Geometric Mean, and Median are applicable for Integer and Decimal data types.

Mode is applicable for Integer, Decimal, and String data types.

An example of micro-aggregation for building a REST API and Python SDK is provided here.

{
      "classificationType": "Quasi Identifier",
      "dataTransformationType": "Micro Aggregation",
      "dataType": "Decimal",
      "aggregateFn": "Median",
      "name": "age_ma_median"
}
e['income'] = asdk.MicroAgg(asdk.AggregateFunction.Mean)

2 - Building the Request using the REST API

Use the information provided in this section to build the REST API request for performing the Protegrity Anonymization transformation.

Identifying the source and target

The source dataset is the starting point of the transformation. In this step, you specify the source that must be transformed. Specify the target where the anonymized data will be saved.

  • The following file formats are supported:
    • Comma separated values (CSV)
    • Columnar storage format: This is an optimized file format for large amounts of data. Using this file format provides faster results. For example, Parquet (gzip and snappy).
  • The following types of data storage have been tested for Protegrity Anonymization:
    • Local File System
    • Amazon S3
  • The following types of data storage can also be used for Protegrity Anonymization:
    • Microsoft Azure Storage
      • Data Lake Storage
      • Blob Storage
    • S3 bucket Storage
    • Other S3 Compatible Services

Use the following code to specify the source:

Note: Modify the source and destination code for your provider.

For more cloud-related sample codes, refer to the section Samples for Cloud-related Source and Destination Files.

"source": {
      "type": "File",
      "file": {
        "name": "<Source_file_path>"
      }
}

Note: When uploading a file to the cloud service, wait until the entire source file is uploaded before running the anonymization job.

Similarly, specify the target file using the following code:

"target": {
    "type": "File",
    "file": {
      "name": "<Target_file_path>"
    }
}

Specify additional parameters about the source and target files, such as the character used to separate values in the file, using the following props attribute. If a property is not specified, then the default attribute shown here is used.

"props": {
    "sep": ",",
    "decimal": ".",
    "quotechar": "\"",
    "escapechar": "\\",
    "encoding": "utf-8",
    "line_terminator": "\n"
}
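The props block mirrors common CSV parsing options. For illustration, the same settings expressed as pandas read_csv arguments (the service applies them internally; this is not service code):

```python
import io
import pandas as pd

csv_text = 'id,name,salary\n1,"Smith, Jane",1000.5\n2,"Lee, Sam",2000.0\n'

# sep, decimal, quotechar, and escapechar match the default props above.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",
    decimal=".",
    quotechar='"',
    escapechar="\\",
)
print(df["name"].tolist())  # ['Smith, Jane', 'Lee, Sam']
```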

If the required files are in cloud storage, then specify the cloud-related access information using the following code:

"accessOptions": {
}

For more information about specifying the source and target files, refer to Dask remote data configuration.

Note: If the target directory already exists, then the job fails. If the target file already exists, then the file will be overwritten. Additionally, some cloud services have limitations on the file size. If such a limitation exists, then you can set the single_file switch to no when writing large files to cloud storage. This saves the output as multiple files to avoid any errors related to saving large files to cloud storage.

Specifying the Transformation

For more information about specifying the transformation, refer to Specifying the Transformation.

Classifying the Fields

For more information about different fields classification, refer to Classifying the Fields.

The following data types are supported for working with the data in the fields:

  • Integer
  • Float
  • String
  • Date
  • Time
  • DateTime

Date: The following date formats are supported:

  • mm-dd-yyyy - This is the default format.
  • dd-mm-yyyy
  • dd-mm-yy
  • mm-dd-yy
  • dd.mm.yyyy
  • mm.dd.yyyy
  • dd.mm.yy
  • mm.dd.yy
  • dd/mm/yyyy
  • mm/dd/yyyy
  • dd/mm/yy
  • mm/dd/yy

Time: HH is used to specify time in the 24-hour format and hh is used to specify time in the 12-hour format. The following time formats are supported:

  • HH:mm:ss - This is the default format.
  • HH:mm:ss.ns
  • hh:mm:ss
  • hh:mm:ss.ns
  • hh:mm:ss.ns p - Here, p is the 12 hour format with period AM/PM.
  • HH:mm:ss.ns z - Here, z is timezone info with +- from UTC, that is, +0000,+0530,-0230.
  • hh:mm:ss Z - Here, Z is the timezone info with the name, that is, UTC,EST, CST.

A few examples follow:

{
            "classificationType": "Non-Sensitive Attribute",
            "dataType": "Integer",
            "name": "index"
}
        
{
            "classificationType": "Sensitive Attribute",
            "dataType": "String",
            "name": "diagnosis_dup"
}

Note: The values present in the first row of the dataset are considered for determining the format for date, time, and datetime. You can override the detection using "props": {"dateformat": "<Specify_Format>"}.

Consider the following example for date with the mm/dd/yyyy format:

10/09/2020
12/24/2020
07/30/2020

In this case, the data will be identified as dd/mm/yyyy.

You can override this using the following property:

"props": {"dateformat": "mm/dd/yyyy"}

Specifying the Privacy Model

For more information about Protegrity Anonymization methods for privacy model, refer to Specifying the Privacy Model.

Specifying the Hierarchy

For more information about how the information in the data set is handled for Protegrity Anonymization, refer to Specifying the Hierarchy.

Generalization

For more information about grouping data into sets having similar attributes, refer to Generalization.

Micro-Aggregation

For more information about the mathematical formulas used to group the data, refer to Micro-Aggregation.

Specifying Configurations

Additional configurations are available in the Protegrity Anonymization to enhance the anonymity of the information in the data set.

The following configurations are available:

"config": {
    "maxSuppression": 0.1 
    "suppressionData": "*"
    "redactOutliers": False
}
  • maxSuppression specifies the percentage of rows allowed to be an outlier row to obtain the anonymized data. The default is 10%.
  • suppressionData specifies the character or character set to be used for suppressing the anonymized data. The default is *.
  • redactOutliers specifies if the outlier row should be part of the anonymized dataset or not. The default is included denoted by False.
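For example, a maxSuppression of 0.1 on a 2,500-row dataset allows at most 250 rows to be suppressed as outliers. A quick sketch of the arithmetic (the function name below is only for illustration):

```python
def max_suppressed_rows(row_count, max_suppression=0.1):
    """Largest number of rows that may be suppressed as outliers."""
    return int(row_count * max_suppression)

print(max_suppressed_rows(2500))        # 250
print(max_suppressed_rows(2500, 0.02))  # 50
```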

3 - Building the request using the Python SDK

Use the information provided in this section to build the request using the Python SDK environment for performing the Protegrity Anonymization transformation.

To build a Protegrity Anonymization request using the SDK, first import the anonsdk module using the following command.

import anonsdk as asdk

Creating the connection

You need to specify the connection to Protegrity Anonymization REST service to set up Protegrity Anonymization.

Note: If the administrator has not updated the DNS entry for the ANON REST API service, then map the hostname to the IP address of the Anon Service in the hosts file of the system.

For example, if the Protegrity Anonymization REST service is located at https://anon.protegrity.com, then you would create the following connection.

conn = asdk.Connection("https://anon.protegrity.com/")

Identifying the source and target

Protegrity Anonymization is built to anonymize the data in a Pandas dataframe and return the anonymized dataframe. However, you can also specify a CSV file from various source systems for the source data.

Use the following code to specify the source.

e = asdk.AnonElement(conn, dataframe)

If the source file is located at the same place where Protegrity Anonymization is installed, then use the following code to load the source file into a dataframe.

dataframe = pandas.read_csv("<file_path>")
  • The following types of data storage have been tested for Protegrity Anonymization:

    • Local File System

    • Amazon S3

      For example:

      asdk.FileDataStore("s3://<path>/<file_name>.csv", access_options={"key": "<value>","secret": "value"})
      
  • The following types of data storage can also be used for Protegrity Anonymization:

    • Microsoft Azure Storage

      • Data Lake Storage

        For example:

        asdk.FileDataStore("adl://<path>/<file_name>.csv", access_options={"tenant_id": "<value>", "client_id": "<value>", "client_secret": "<value>"})
        
      • Blob Storage

        For example:

        asdk.FileDataStore("abfs://<path>/<file_name>.csv", access_options={"account_name": "<value>", "account_key": "<value>"})
        
    • S3 bucket Storage

    • Other S3 Compatible Services

      Note: When uploading a file to the cloud service, wait until the entire source file is uploaded before running the anonymization job.

      For more information about using remote sources, refer to Connect to remote data.

If required, you can directly specify data in a dictionary using the following format:

d = {'<column1_name>': ['value1', 'value2', 'value3', ...],
     '<column2_name>': [number1, number2, number3, ...],
     '<column3_name>': ['value1', 'value2', 'value3', ...],
     ...}

For example:

d = {'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
         'occupation': ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Handlers-cleaners', 'Prof-specialty', 'Exec-managerial', 'Other-service', 'Exec-managerial', 'Prof-specialty'],
         'age': [39, 50, 38, 53, 28, 37, 49, 52, 31],
         'race': ['White', 'White', 'White', 'Black', 'Black', 'White', 'Black', 'White', 'White'],
         'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Married-civ-spouse', 'Never-married'],
         'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors', 'Masters', '9th', 'HS-grad', 'Masters'],
         'native-country': ['United-States', 'United-States', 'United-States', 'United-States', 'Cuba', 'United-States', 'Jamaica', 'United-States', 'United-States'],
         'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private', 'Private', 'Private', 'Private', 'Self-emp-not-inc', 'Private'],
         'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
         'bmi': [11.5, 12.5, 13.5, 14.5, 16.5, 16.5, 17.5, 18.5, 11.5] }

The anonymized data is returned to the user as Pandas dataframe. Optionally, you can specify the required target file system and provide the target using the following code.

asdk.anonymize(e, resultStore=<targetFile>)

Specify additional parameters about the source and target files, such as the character used to separate values in the file, using the properties attribute. If a property is not specified, then the default attributes are used.

Note: Some cloud services have limitations on the file size. If such a limitation exists, then you can set single_file to no when writing large files to cloud service. This saves the output as multiple files to avoid any errors related to saving large files to cloud storage.

For more information and help on specifying the source and target files, refer to Dask remote data configuration.

Specifying the transformation

For more information about specifying the transformation, refer to Specifying the Transformation.

Protegrity Anonymization uses Pandas to build and work with the data frame. You need to import the library for Pandas and store the source data that must be transformed in Pandas.

import pandas as pd

d = <source_data>
df = pd.DataFrame(data=d)

To build the transformation, you need to specify the AnonElement that holds the connection, data frame, and the source.

For example:

e = asdk.AnonElement(conn,df,source=datastore)

You need to specify the columns that must be included for processing the anonymization request and the column classification before performing the anonymization.

e["<column>"] = asdk.<transformation>

Where:

  • column: Specify the column name or column ID.
  • transformation: Specifies the processing to be applied for the column.

Note: By default, all the columns are set to ignore processing. The data is redacted and not included in the Protegrity Anonymization process. You need to manually set the column classification to include it in the Protegrity Anonymization process.

Specify multiple columns with assign, separating the column names with commas.

e.assign(["<column1>","<column2>"],asdk.Transformation())

You can view the configuration provided using the describe function.

e.describe()

Classifying the fields

For more information about different fields classification, refer to Classifying the Fields.

The following data types are supported for working with the data in the fields:

  • Integer
  • Float
  • String
  • DateTime

Specifying the privacy model

For more information about Protegrity Anonymization methods for privacy model, refer to Specifying the Privacy Model.

Specifying the Hierarchy

For more information about how the information in the data set is handled for Protegrity Anonymization, refer to Specifying the Hierarchy.

Generalization

For more information about grouping data into sets having similar attributes, refer to Generalization.

Micro-aggregation

For more information about the mathematical formulas used to group the data, refer to Micro-Aggregation.

Working with saved Protegrity Anonymization Requests

The save method provides interoperability with the REST API. It generates the required JSON payload that can be used as part of curl or any REST client.

Use the following command to save the Protegrity Anonymization request.

e.save("<file_path>\\fileName.json")

Applying Protegrity Anonymization to additional rows

You can use the applyAnon method to anonymize any additional rows using the saved request. Use the following command to anonymize using a previous anonymization job.

asdk.applyAnon(<conn>,job.id(), <single_row_data>)

Use this function to anonymize only a few rows. You need to specify the row information using the key-value pair and ensure that all the required columns are present.

An example of a single and multi row data is shown here.

single_row_data = [{'ID': '1', 'Name': 'Wilburt Daniel', 'Address': '4 Sachtjen Plaza', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '18-04-2008'}]
multi_row_data = [{'ID': '1', 'Name': 'Wilburt Daniel', 'Address': '4 Sachtjen Plaza', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '18-04-2008'}, {'ID': '2', 'Name': 'Jones Knight', 'Address': '25 Macadamia Street', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '25-11-1997'}]

Running a sample request

Run the sample code provided here in an SDK tool. This sample is also available at https://<IP_Address>:<Port>/sdkapi.

Import the Protegrity Anonymization and the Pandas package in the SDK tool.

import pandas as pd
import anonsdk as asdk

Create a variable d with the sample data.

#Sample data for Demo
d = {'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
         'occupation': ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Handlers-cleaners', 'Prof-specialty', 'Exec-managerial', 'Other-service', 'Exec-managerial', 'Prof-specialty'],
         'age': [39, 50, 38, 53, 28, 37, 49, 52, 31],
         'race': ['White', 'White', 'White', 'Black', 'Black', 'White', 'Black', 'White', 'White'],
         'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Married-civ-spouse', 'Never-married'],
         'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors', 'Masters', '9th', 'HS-grad', 'Masters'],
         'native-country': ['United-States', 'United-States', 'United-States', 'United-States', 'Cuba', 'United-States', 'Jamaica', 'United-States', 'United-States'],
         'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private', 'Private', 'Private', 'Private', 'Self-emp-not-inc', 'Private'],
         'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
         'bmi': [11.5, 12.5, 13.5, 14.5, 16.5, 16.5, 17.5, 18.5, 11.5] }

Load the data in a Pandas DataFrame.

df = pd.DataFrame(data=d)

Specify the additional data required per attribute to transform and obtain anonymized data. In this example, the Hierarchy Tree is specified.

treeGen = {'lvl0': [11, 13, 14, 15, 27, 28, 20],
               'lvl1': ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
               'lvl2': ["<20", "<20", "<20", "<30", "<30", "<30", "<30"]}

Build the connection to a running Protegrity Anonymization REST cluster instance. Ensure that the hosts file is configured and points to the REST cluster.

conn = asdk.Connection('https://anon.protegrity.com/')

Build the AnonElement passing the connection and the data as inputs for the Protegrity Anonymization request.

e = asdk.AnonElement(conn,df)

Use the following code sample to read data from an external file store.

e = asdk.AnonElement(conn, dataframe, <SourceFile>)

Specify the transformation that is required.

e['gender'] = asdk.Redact()
e['occupation'] = asdk.Redact()
e['age'] = asdk.Gen_Tree(pd.DataFrame(data=treeGen), ["Missing", "Might be <30", " Might be <30"])

e["bmi"] = asdk.Gen_Interval(['5', '10', '15'])

Specify the K-value, the L-Diversity, and the T-Closeness values.

e.config.k = asdk.K(2)
e["income"] = asdk.LDiv(lfactor=2)
e["income"] = asdk.TClose(tfactor=0.2)

Specify the max suppression.

e.config['maxSuppression'] = 0.7

Specify the importance for the required fields.

e["race"] = asdk.Gen_Mask(maskchar="*",importance=0.8)

View the details of the current configuration.

e.describe()

Anonymize the data.

job = asdk.anonymize(e)

If required, save the results to a file.

datastore=asdk.FileDataStore("s3://...",access_options={"key": "K...","secret": "S..."})
job = asdk.anonymize(e, resultStore=datastore)

View the job status.

job.status()

View the anonymized data.

result = job.result()
if result.df is not None:
    print("Anon Dataframe.")
    print(result.df.head())

View the utility and risk statistics of the data.

job.utilityStat() 
job.riskStat()

Save the job configuration with the updated source and target to a JSON file.

e.save("/file_path/file.json", store=datastore)

Optional: Apply the Protegrity Anonymization rules of previous jobs to new data.

anonData = asdk.applyAnon(conn,job.id(), [{'gender':'Male','age': '39', 'race': 'White', 'income': '<=50K','bmi':'12.5'}])
anonData