Building the request using the Python SDK
To build a Protegrity Anonymization request using the SDK, first import the anonsdk module using the following command.
import anonsdk as asdk
Creating the connection
You need to specify the connection to the Protegrity Anonymization REST service to set up Protegrity Anonymization.
Note: If the administrator has not updated the DNS entry for the Anon REST API service, then map the hostname to the IP address of the Anon Service in the hosts file of the system.
For example, if the Protegrity Anonymization REST service is located at https://anon.protegrity.com, then you would create the following connection.
conn = asdk.Connection("https://anon.protegrity.com/")
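The hosts-file mapping mentioned in the note can be derived from the service URL. The following is a minimal sketch that uses only the Python standard library and does not call the SDK. The helper name hosts_entry, the https check, and the IP address 10.0.0.5 are illustrative assumptions.

```python
from urllib.parse import urlparse


def hosts_entry(service_url: str, ip_address: str) -> str:
    """Return a hosts-file line that maps the service hostname to an IP address."""
    parsed = urlparse(service_url)
    # Assumption: the service is reached over https, as in the example above.
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError(f"Expected an https URL with a hostname, got: {service_url!r}")
    # A hosts-file entry has the form "<IP address> <hostname>".
    return f"{ip_address} {parsed.hostname}"


# 10.0.0.5 is a placeholder; use the IP address of your Anon Service.
entry = hosts_entry("https://anon.protegrity.com/", "10.0.0.5")
print(entry)
```

The printed line is what you would add to the hosts file of the system that runs the SDK.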
Identifying the source and target
Protegrity Anonymization is built to anonymize the data in a Pandas dataframe and return the anonymized dataframe. However, you can also specify a CSV file from various source systems for the source data.
Use the following code to specify the source.
e = asdk.AnonElement(conn, dataframe)
If the source file is located at the same place where Protegrity Anonymization is installed, then use the following code to load the source file into a dataframe.
import pandas
dataframe = pandas.read_csv("<file_path>")
The following types of data storage have been tested for Protegrity Anonymization:
Local File System
Amazon S3
For example:
asdk.FileDataStore("s3://<path>/<file_name>.csv", access_options={"key": "<value>", "secret": "<value>"})
The following types of data storage can also be used for Protegrity Anonymization:
Microsoft Azure Storage
Data Lake Storage
For example:
asdk.FileDataStore("adl://<path>/<file_name>.csv", access_options={"tenant_id": "<value>", "client_id": "<value>", "client_secret": "<value>"})
Blob Storage
For example:
asdk.FileDataStore("abfs://<path>/<file_name>.csv", access_options={"account_name": "<value>", "account_key": "<value>"})
S3 bucket Storage
Other S3 Compatible Services
Note: When uploading a file to the cloud service, wait till the entire source file is uploaded before running the anonymization job.
For more information about using remote sources, refer to Connect to remote data.
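The examples above use a different set of access_options keys for each URL scheme. As a rough sketch, the following standalone helper checks that the expected keys are present before a data store is created. It does not call the SDK; the function name and the REQUIRED_KEYS table are assumptions built only from the examples on this page.

```python
from urllib.parse import urlparse

# Access-option keys per URL scheme, copied from the examples above.
REQUIRED_KEYS = {
    "s3": {"key", "secret"},
    "adl": {"tenant_id", "client_id", "client_secret"},
    "abfs": {"account_name", "account_key"},
}


def missing_access_options(url: str, access_options: dict) -> set:
    """Return the access-option keys that the URL scheme expects but are absent."""
    scheme = urlparse(url).scheme
    if scheme not in REQUIRED_KEYS:
        raise ValueError(f"No key list recorded for scheme {scheme!r}")
    return REQUIRED_KEYS[scheme] - set(access_options)


# The bucket, container, and file names below are placeholders.
print(missing_access_options("s3://bucket/data.csv", {"key": "K", "secret": "S"}))
print(missing_access_options("abfs://container/data.csv", {"account_name": "acct"}))
```

An empty set means that every expected key was supplied.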
If required, you can directly specify the data as a dictionary that maps each column name to a list of values, using the following format:
d = {'<column1_name>': ['value1', 'value2', 'value3', ...],
'<column2_name>': [number1, number2, number3, ...],
'<column3_name>': ['value1', 'value2', 'value3', ...],
...}
For example:
d = {'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
'occupation': ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Handlers-cleaners', 'Prof-specialty', 'Exec-managerial', 'Other-service', 'Exec-managerial', 'Prof-specialty'],
'age': [39, 50, 38, 53, 28, 37, 49, 52, 31],
'race': ['White', 'White', 'White', 'Black', 'Black', 'White', 'Black', 'White', 'White'],
'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Married-civ-spouse', 'Never-married'],
'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors', 'Masters', '9th', 'HS-grad', 'Masters'],
'native-country': ['United-States', 'United-States', 'United-States', 'United-States', 'Cuba', 'United-States', 'Jamaica', 'United-States', 'United-States'],
'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private', 'Private', 'Private', 'Private', 'Self-emp-not-inc', 'Private'],
'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
'bmi': [11.5, 12.5, 13.5, 14.5, 16.5, 16.5, 17.5, 18.5, 11.5] }
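When data is specified this way, every column list must contain the same number of values, otherwise pandas raises an error when the dataframe is built. The following sketch, which is independent of the SDK, checks this up front. The function name and the shortened sample are illustrative.

```python
def check_equal_lengths(data: dict) -> int:
    """Return the common row count, or raise if the columns differ in length."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns have different lengths: {lengths}")
    # An empty dictionary has no columns and therefore zero rows.
    return next(iter(lengths.values()), 0)


# Shortened sample with three rows per column.
sample = {
    'gender': ['Male', 'Female', 'Female'],
    'age': [39, 28, 37],
    'bmi': [11.5, 16.5, 16.5],
}
rows = check_equal_lengths(sample)
print(rows)
```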
The anonymized data is returned to the user as a Pandas dataframe. Optionally, you can specify the required target file system and provide the target using the following code.
asdk.anonymize(e, resultStore=<targetFile>)
Specify additional parameters for the source and target files, such as the character used to separate the values in the file, using the available properties. If a property is not specified, then its default value is used.
Note: Some cloud services limit the file size. If such a limitation exists, then you can set single_file to no when writing large files to the cloud service. This saves the output as multiple files and avoids errors related to saving large files to cloud storage.
For more information and help on specifying the source and target files, refer to Dask remote data configuration.
Specifying the transformation
For more information about specifying the transformation, refer to Specifying the Transformation.
Protegrity Anonymization uses Pandas to build and work with the data frame. You need to import the library for Pandas and store the source data that must be transformed in Pandas.
import pandas as pd
d = <source_data>
df = pd.DataFrame(data=d)
To build the transformation, you need to specify the AnonElement that holds the connection, data frame, and the source.
For example:
e = asdk.AnonElement(conn,df,source=datastore)
You need to specify the columns that must be included for processing the anonymization request and the column classification before performing the anonymization.
e["<column>"] = asdk.<transformation>
Where:
- column: Specifies the column name or column ID.
- transformation: Specifies the processing to be applied to the column.
Note: By default, all the columns are set to ignore processing. The data is redacted and not included in the Protegrity Anonymization process. You need to manually set the column classification to include it in the Protegrity Anonymization process.
To apply the same transformation to multiple columns, pass the column names to assign as a comma-separated list.
e.assign(["<column1>","<column2>"],asdk.Transformation())
You can view the configuration provided using the describe function.
e.describe()
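Because columns are ignored by default, it can help to list the columns that have not yet been classified before submitting the request. The following sketch uses plain Python and does not call the SDK. The function name and the column lists are illustrative assumptions.

```python
def unclassified_columns(all_columns, classified_columns):
    """Return the columns, in their original order, that have no classification."""
    classified = set(classified_columns)
    return [column for column in all_columns if column not in classified]


# Columns in the source data and the subset that has been classified so far.
columns = ['gender', 'occupation', 'age', 'race', 'income', 'bmi']
classified = ['gender', 'occupation', 'age', 'bmi']

remaining = unclassified_columns(columns, classified)
print(remaining)
```

Any column that is still listed would be left at the default ignore setting.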
Classifying the fields
For more information about the different field classifications, refer to Classifying the Fields.
The following data types are supported for working with the data in the fields:
- Integer
- Float
- String
- DateTime
Specifying the privacy model
For more information about the Protegrity Anonymization methods for the privacy model, refer to Specifying the Privacy Model.
Specifying the Hierarchy
For more information about how the information in the data set is handled for Protegrity Anonymization, refer to Specifying the Hierarchy.
Generalization
For more information about grouping data into sets having similar attributes, refer to Generalization.
Micro-aggregation
For more information about the mathematical formulas used to group the data, refer to Micro-Aggregation.
Working with saved Protegrity Anonymization Requests
The save method provides interoperability with the REST API. It generates the required JSON payload that can be used as part of curl or any REST client.
Use the following command to save the Protegrity Anonymization request.
e.save("<file_path>\\fileName.json")
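Before passing a saved request to curl or another REST client, you may want to confirm that the file parses as JSON. The following sketch uses only the standard library. The file written in the demonstration is a stand-in and does not reflect the real payload schema; a real file is produced by e.save(). Building the path with pathlib also avoids escaping backslashes by hand.

```python
import json
import tempfile
from pathlib import Path


def load_saved_request(path):
    """Load a saved request file and return its content as a Python object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in file for demonstration only; it is NOT the real request schema.
with tempfile.TemporaryDirectory() as folder:
    request_file = Path(folder) / "fileName.json"
    request_file.write_text(json.dumps({"placeholder": "not the real schema"}),
                            encoding="utf-8")
    payload = load_saved_request(request_file)

print(sorted(payload))
```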
Applying Protegrity Anonymization to additional rows
You can use the applyAnon method to anonymize any additional rows using the saved request. Use the following command to anonymize using a previous anonymization job.
asdk.applyAnon(<conn>,job.id(), <single_row_data>)
Use this function to anonymize only a few rows. You need to specify the row information as key-value pairs and ensure that all the required columns are present.
An example of single-row and multi-row data is shown here.
single_row_data = [{'ID': '1', 'Name': 'Wilburt Daniel', 'Address': '4 Sachtjen Plaza', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '18-04-2008', '<column_name>': '9'}]
multi_row_data = [{'ID': '1', 'Name': 'Wilburt Daniel', 'Address': '4 Sachtjen Plaza', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '18-04-2008', '<column_name>': '9'}, {'ID': '2', 'Name': 'Jones Knight', 'Address': '25 Macadamia Street', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '25-11-1997', '<column_name>': '9'}]
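Since every row must contain all the required columns, a quick check before calling applyAnon can catch incomplete rows. The following is a standalone sketch that does not call the SDK. The function name, the list of required columns, and the deliberately incomplete second row are illustrative assumptions.

```python
def rows_missing_columns(rows, required_columns):
    """Map each row index to the sorted list of required columns it lacks."""
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(set(required_columns) - set(row))
        if missing:
            problems[index] = missing
    return problems


# Assumed set of required columns for this illustration.
required = ['ID', 'Name', 'Address', 'City_Name', 'Gender', 'date']
rows = [
    {'ID': '1', 'Name': 'Wilburt Daniel', 'Address': '4 Sachtjen Plaza',
     'City_Name': 'Ambato', 'Gender': 'Male', 'date': '18-04-2008'},
    # This row intentionally omits 'Address' and 'date'.
    {'ID': '2', 'Name': 'Jones Knight', 'City_Name': 'Ambato', 'Gender': 'Male'},
]

problems = rows_missing_columns(rows, required)
print(problems)
```

An empty dictionary means that every row is complete.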
Running a sample request
Run the sample code provided here in an SDK tool. This sample is also available at https://<IP_Address>:<Port>/sdkapi.
Import the Protegrity Anonymization and Pandas packages in the SDK tool.
import pandas as pd
import anonsdk as asdk
Create a variable d with the sample data.
#Sample data for Demo
d = {'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
'occupation': ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Handlers-cleaners', 'Prof-specialty', 'Exec-managerial', 'Other-service', 'Exec-managerial', 'Prof-specialty'],
'age': [39, 50, 38, 53, 28, 37, 49, 52, 31],
'race': ['White', 'White', 'White', 'Black', 'Black', 'White', 'Black', 'White', 'White'],
'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Married-civ-spouse', 'Never-married'],
'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors', 'Masters', '9th', 'HS-grad', 'Masters'],
'native-country': ['United-States', 'United-States', 'United-States', 'United-States', 'Cuba', 'United-States', 'Jamaica', 'United-States', 'United-States'],
'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private', 'Private', 'Private', 'Private', 'Self-emp-not-inc', 'Private'],
'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
'bmi': [11.5, 12.5, 13.5, 14.5, 16.5, 16.5, 17.5, 18.5, 11.5] }
Load the data in a Pandas DataFrame.
df = pd.DataFrame(data=d)
Specify the additional data required per attribute to transform and obtain anonymized data. In this example, the Hierarchy Tree is specified.
treeGen = {'lvl0': [11, 13, 14, 15, 27, 28, 20],
'lvl1': ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
'lvl2': ["<20", "<20", "<20", "<30", "<30", "<30", "<30"]}
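A hierarchy tree like the one above is easier to trust after a quick consistency check. The following sketch verifies that every level has the same number of entries and that each lvl0 value is below the bound named in its label. Reading a label such as "<15" as "less than 15" is an assumption about this sample only, not an SDK rule, and the function name is illustrative.

```python
treeGen = {'lvl0': [11, 13, 14, 15, 27, 28, 20],
           'lvl1': ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
           'lvl2': ["<20", "<20", "<20", "<30", "<30", "<30", "<30"]}


def check_hierarchy(tree):
    """Return (level, value, label) tuples where a value breaks its "<N" label."""
    lengths = {level: len(values) for level, values in tree.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Levels have different lengths: {lengths}")
    violations = []
    for level, labels in tree.items():
        if level == 'lvl0':
            continue  # lvl0 holds the raw values, not labels
        for value, label in zip(tree['lvl0'], labels):
            bound = int(label.lstrip('<'))
            if not value < bound:
                violations.append((level, value, label))
    return violations


print(check_hierarchy(treeGen))
```

An empty list means that no label contradicts its value.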
Build the connection to a running Protegrity Anonymization REST cluster instance. Ensure that the hosts file is configured and points to the REST cluster.
conn = asdk.Connection('https://anon.protegrity.com/')
Build the AnonElement passing the connection and the data as inputs for the Protegrity Anonymization request.
e = asdk.AnonElement(conn,df)
Use the following code sample to read data from an external file store.
e = asdk.AnonElement(conn, dataframe, <SourceFile>)
Specify the transformation that is required.
e['gender'] = asdk.Redact()
e['occupation'] = asdk.Redact()
e['age'] = asdk.Gen_Tree(pd.DataFrame(data=treeGen), ["Missing", "Might be <30", " Might be <30"])
e["bmi"] = asdk.Gen_Interval(['5', '10', '15'])
Specify the K-value, the L-Diversity, and the T-Closeness values.
e.config.k = asdk.K(2)
e["income"] = asdk.LDiv(lfactor=2)
e["income"] = asdk.TClose(tfactor=0.2)
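K-anonymity requires that every combination of quasi-identifier values is shared by at least k records. To illustrate what the K-value constrains, the following standalone sketch computes the smallest group size in a data set. It is not how the service evaluates the model; the function name and the choice of gender and race as quasi-identifiers are assumptions for this illustration.

```python
from collections import Counter


def smallest_group_size(data, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same values."""
    # Build one tuple per row from the selected columns, then count duplicates.
    rows = zip(*(data[column] for column in quasi_identifiers))
    counts = Counter(rows)
    return min(counts.values(), default=0)


# Two columns taken from the sample data on this page.
sample = {
    'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
    'race': ['White', 'White', 'White', 'Black', 'Black', 'White', 'Black', 'White', 'White'],
}

k = smallest_group_size(sample, ['gender', 'race'])
print(k)
```

A result below the requested K-value means that the raw data does not yet satisfy the model for those columns, so generalization or suppression is needed.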
Specify the max suppression.
e.config['maxSuppression'] = 0.7
Specify the importance for the required fields.
e["race"] = asdk.Gen_Mask(maskchar="*",importance=0.8)
View the details of the current configuration.
e.describe()
Anonymize the data.
job = asdk.anonymize(e)
If required, save the results to a file.
datastore=asdk.FileDataStore("s3://...",access_options={"key": "K...","secret": "S..."})
job = asdk.anonymize(e, resultStore=datastore)
View the job status.
job.status()
View the anonymized data.
result = job.result()
if result.df is not None:
    print("Anon Dataframe.")
    print(result.df.head())
View the utility and risk statistics of the data.
job.utilityStat()
job.riskStat()
Save the job configuration with the updated source and target to a JSON file.
e.save("/file_path/file.json", store=datastore)
Optional: Apply the Protegrity Anonymization rules of previous jobs to new data.
anonData = asdk.applyAnon(conn,job.id(), [{'gender':'Male','age': '39', 'race': 'White', 'income': '<=50K','bmi':'12.5'}])
anonData