Using Protegrity Anonymization

This section explains the REST APIs provided by Protegrity Anonymization. It also details the method for creating and running Protegrity Anonymization SDK requests.

1: Creating Protegrity Anonymization requests
2: Working with Protegrity Anonymization APIs

2.1: Understanding Protegrity Anonymization REST APIs
2.2: Understanding Protegrity Anonymization Python SDK Requests

1 - Creating Protegrity Anonymization requests

This section walks you through the process of creating Protegrity Anonymization requests to anonymize your data. It describes the steps for using the REST API and creating Protegrity Anonymization Python SDK requests.

A general overview of the process you need to follow to anonymize the data is shown in the following figure:

Identify the dataset that needs to be anonymized.
Analyze and classify the various fields available in the dataset. The following classifications are available:
- Direct Identifiers
- Quasi-Identifiers
- Sensitive Attributes
- Non-Sensitive Attributes
Determine the use case by specifying the data that is required for further analysis.
Specify the quasi-identifiers and other fields that are not required in the dataset.
Specify the required anonymization methods for the data. Some commonly used methods are as follows:
- Generalization
- Micro-Aggregation
Specify the acceptable statistics and risk levels for the data fields, and measure these statistics before running the anonymization job.

Note: For more information about different risk levels for the data fields, refer to Anonymization models.

Verify that the anonymized data satisfies the acceptable risk threshold level.
Measure the quality of the anonymized data by comparing it with the original data. If the quality does not meet standards, then work on the data or drop the output.
Save the anonymized data to an output file.

The anonymized data can now be used for further analysis and as input for machine learning software.
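
The two anonymization methods named in the steps above can be illustrated on a toy dataset. The following is a minimal sketch in plain Python and is not the Protegrity Anonymization implementation. The function names, the 10-year range width, and the group size k are assumptions chosen only for illustration.

```python
from statistics import mean


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range of `width` years."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def micro_aggregate(values, k=3):
    """Micro-aggregation: sort the values, form groups of at least k,
    and replace every value with the mean of its group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # A short tail group is merged into the previous group so that
    # every group holds at least k values.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = [None] * len(values)
    for group in groups:
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
    return result


ages = [23, 27, 31, 38, 45, 52, 58]
salaries = [30, 32, 34, 50, 52, 54, 90]

generalized = [generalize_age(a) for a in ages]
aggregated = micro_aggregate(salaries, k=3)
print(generalized)
print(aggregated)
```

Generalization keeps the column readable while hiding exact values. Micro-aggregation keeps the column numeric, which can be useful when the data is needed for statistical analysis.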

2 - Working with Protegrity Anonymization APIs

The various APIs provided with Protegrity Anonymization are described here.

For the Protegrity Anonymization Python SDK, import the anonsdk module to use it. The AnonElement object is an essential part of the Protegrity Anonymization Python SDK. For more information about the AnonElement object, refer to Understanding the AnonElement object.

The following table shows the list of REST APIs and Python SDK requests:

List of APIs	REST APIs	Python SDK
Anonymization Functions
Anonymize	Yes	Yes
Apply Anonymize	Yes	Yes
Measure	Yes	Yes
Task Monitoring APIs
Get Job IDs	Yes	Yes
Get Job Status	Yes	Yes
Get Metadata	Yes	Yes
Abort	Yes	Yes
Delete	Yes	Yes
Statistics APIs
Get Exploratory Statistics	Yes	Yes
Get Risk Metric	Yes	Yes
Get Utility Statistics	Yes	Yes
Detection APIs
Get Data Domains	Yes	No^*1
Detect Anonymization Information	Yes	No^*1
Detect Classification	Yes	No^*1
Detect Hierarchy	Yes	No^*1

^*1 - Not applicable to the Protegrity Anonymization Python SDK.

2.1 - Understanding Protegrity Anonymization REST APIs

The following APIs are available with Protegrity Anonymization REST API. You can run these APIs using the command line with the curl command. You can also run them using the Swagger UI or a tool like Postman.

Before running the anonymization jobs mentioned in the Protegrity Anonymization REST APIs section below, the following prerequisites must be completed:

Ensure that the Anonymization machine is set up and configured as "https://anon.protegrity.com/".
For more information about setting up and configuring an Anonymization machine for AWS and Azure, refer to AWS and Azure.
Ensure that the disk is not full and enough free space is available for saving the destination file.
Verify the destination file is not in use. Set the required permissions for creating and modifying the destination file.
Verify that the anonymization job exists.

You can use different sample requests to build and run the anonymization APIs. For more information about the sample requests for REST APIs, refer to Sample Requests for Protegrity Anonymization.

Anonymization Functions

The Anonymization Functions APIs are used to run the anonymization job.

Anonymize

The Anonymize API is used to start an anonymize operation.

For more information about the anonymize API, refer to Submit a new anonymization job.

Note: When you run the job, an empty destination file is created. This file is created during processing for verifying the necessary destination permissions. Avoid using this file till the anonymization job is complete.

Ensure that the anonymized data file and the logs generated are moved to a different system before deleting your environment.

If the source file is larger than the maximum limit that is allowed on the Cloud environment, then run the anonymization request with "additional_properties": { "single_file": "no" }.
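
The property above can be merged into a request body before the request is submitted. The following is a minimal sketch. Only the additional_properties fragment comes from this section. The helper name and the empty placeholder payload are assumptions, and the remaining job fields are intentionally left out.

```python
import json


def with_single_file_disabled(payload):
    """Return a copy of the request payload with single_file set to "no"."""
    updated = dict(payload)
    extra = dict(updated.get("additional_properties", {}))
    extra["single_file"] = "no"
    updated["additional_properties"] = extra
    return updated


# Placeholder payload: a real request carries the job configuration here.
request_body = with_single_file_disabled({})
print(json.dumps(request_body, indent=2))
```

The helper copies the payload instead of modifying it, so the same base configuration can be reused for requests that do not need this property.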

Apply Anonymize

The Apply Anonymize API is used as a template to anonymize additional entries. Using this API you can use the existing configuration to process additional data. This is especially useful in machine learning for training the system to anonymize new data points.

Note: In this API, privacy model parameters are ignored while performing the anonymization for the new entry.

For more information about the apply anonymize API, refer to Apply anonymization config to a given dataset.

Measure

The Measure API is used to measure or obtain anonymization result statistics for different configurations before the actual anonymization job.

For more information about the measure API, refer to Submit a new anonymization Measure job.

Task Monitoring APIs

The Task Monitoring APIs are used to monitor the anonymization job. Use these APIs to obtain the job status, retrieve a job, and abort a job.

Get Job IDs

The Get Job IDs API is used to get the job IDs of the last 20 anonymization operations that are running, in queue, or completed. You can then use the required job ID with the other APIs to work with the anonymization job.

For more information about the job ID API, refer to Obtain job ids.

Get Job Status

The Get Job Status API is used to get the status of an anonymize operation that is running, in queue, or complete. It shows the percentage of job completed. Use the information provided here to monitor if a job is running or stalled.

For more information about the job status API, refer to Obtain job status.

Get Job Status API Parameters

Use this API to get the status of an anonymize operation that is running. It shows the percentage of job completed. Use the information provided here to monitor if a job is running or stalled.

Monitor Job Information	Description
Function	status()
Parameters	None
Return Type	A string with the status information in JSON format. It contains the following keys. completed: Information about the job, such as data, statistics, summary, and time spent. id: The job ID. info: Information about the job being processed, such as the source and attributes for the job. running: The completion status of the job being processed, shown as the percentage of the job completed. status: The status of the job, such as running or completed. Note: This API displays the complete status of the job. To obtain the ID of a job, use job.id().
Sample Request	job.status()

Get Metadata

The Get Metadata API is used to retrieve the metadata for the existing job. This API is useful when you need to view the configuration available for a job. It displays the fields, configuration, and the data that is used to run the anonymization job.

For more information about the metadata API, refer to Obtain job metadata.

Retrieve Anonymized Data API Parameters

Use this API to retrieve the results of an anonymized job.

Retrieve Job Information	Description
Function	result()
Parameters	None
Return Type	Returns the AnonResult element, which provides the DataFrame for the anonymized data. Note: The result.df will be None if you have overridden the resultstore as part of the anonymize method.
Sample Request	job.result() Note: This is a blocking API and will stall processing till the job is complete.

Abort

The Abort API is used to abort a running anonymization job. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

For more information about the abort API, refer to Abort a running anonymization job.

Note: After aborting the task, it might take time before all the running processes are stopped.

Abort API Parameters

Use this API to abort a running anonymize operation. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

Abort Job Information	Description
Function	abort()
Parameters	None
Return Type	A string with the status of the abort request.
Sample Request	job.abort()

Delete

The Delete API is used to delete an existing job that is no longer required.

For more information about the delete API, refer to Delete a job.

Statistics APIs

The Statistics APIs are used to obtain information about the anonymization data. Use these APIs to obtain the risk and utility information about the anonymization. The user needs to access these APIs to measure the utility benefits and risk of publishing the anonymized data. If these configurations are not satisfactory, then the user can re-submit the anonymization job after modifying some parameters based on these results.

Get Exploratory Statistics

The Get Exploratory Statistics API is used to obtain data distribution statistics about a completed anonymization job.

For more information about the exploratory statistic API, refer to Obtain the exploratory statistics.

Get Exploratory Statistics API Parameters

This API provides data distribution statistics for both the source data and the target data.

Exploratory Statistics Information	Description
Function	exploratoryStats()
Parameters	None
Return Type	A Pandas dataframe with the exploratory information of the source data and the anonymized data.
Sample Request	job.exploratoryStats()

This provides the data distribution of each attribute, that is, all the unique values of the attribute and their occurrence counts. It can be used to build a data histogram of all attributes in the dataset. The following values appear for the source and result set:

Get Risk Metric

The Get Risk Metric API is used to ascertain the risk of the source data and the anonymized data.

For more information about the risk metric API, refer to Obtain the risk statistics.

Get Risk Metric API Parameters

It shows the risk of the data against attacks such as journalist, marketer, and prosecutor.

Risk Metric Information	Description
Function	riskStat()
Parameters	None
Return Type	A Pandas dataframe with the source data and the anonymized data privacy risk information. Note: You can customize the riskThreashold as part of AnonElement configuration.
Sample Request	job.riskStat()

The following values appear for the source and result set:

Values for Source and Result Set	Description
avgRecordIdentification	This value displays the average probability for identifying a record in the anonymized dataset. The risk is higher when the value is closer to the value 1.
maxProbabilityIdentification	This displays the maximum probability value that a record can be identified from the dataset. The risk is higher when the value is closer to the value 1.
riskAboveThreshold	This value displays the number of records that are at a risk above the risk threshold. The default threshold is 10%. The threshold is the maximum value set as a boundary. Any values beyond the threshold are a risk and might be easy to identify. For this result, the value 0 is preferred.
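
To make the three values above more concrete, the following sketch estimates them from equivalence class sizes. This is an illustration under stated assumptions and is not the formula used by Protegrity Anonymization, which this section does not specify. It assumes that the risk of a record is 1 divided by the number of records that share its quasi-identifier values. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def risk_metrics(rows, quasi_identifiers, threshold=0.10):
    """Estimate re-identification risk from equivalence class sizes.

    Assumption: the risk of a record is 1 / (size of its equivalence class).
    """
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    class_sizes = Counter(keys)
    record_risks = [1 / class_sizes[key] for key in keys]
    return {
        "avgRecordIdentification": sum(record_risks) / len(record_risks),
        "maxProbabilityIdentification": max(record_risks),
        # Number of records whose risk exceeds the threshold.
        "riskAboveThreshold": sum(1 for r in record_risks if r > threshold),
    }


rows = [
    {"age": "20-29", "zip": "100**"},
    {"age": "20-29", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
]
metrics = risk_metrics(rows, ["age", "zip"], threshold=0.5)
print(metrics)
```

In this toy dataset every record shares its quasi-identifier values with one other record, so no record exceeds the 0.5 threshold that is used in the example.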

Get Utility Statistics

The Get Utility Statistics API is used to check the usability of the anonymized data.

For more information about the utility statistics API, refer to Obtain the anonymization data utility statistics.

Get Utility Statistics API Parameters

It shows the information that was lost to gain privacy protection.

Utility Statistics Information	Description
Function	utilityStat()
Parameters	None
Return Type	A Pandas dataframe with the source and anonymized data utility information.
Sample Request	job.utilityStat()

The following values appear for the source and result set:

Values for Source and Result Set	Description
ambiguity	This value displays how well a record is hidden in all the records. This captures the ambiguity of records.
average_class_size	This measures the average size of groups of indistinguishable records. A smaller class size is more favourable for retaining the quality of the information. A larger class size increases anonymity at the cost of quality.
discernibility	This measures the size of groups of indistinguishable records, with a penalty for records that have been completely suppressed. The discernibility metric measures the cardinality of the equivalence class. It considers only the number of records in the equivalence class and does not capture the information loss caused by generalization.
generalization_intensity	Data transformation from the original records to anonymity is performed using generalization and suppression. This measures the concentration of generalization and suppression on attribute values.
infoLoss	This value displays the probability of information lost in the data transformation from the original records. The larger the value, the lower the quality for further analysis.
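
Two of the values above, average_class_size and discernibility, can be sketched directly from equivalence class sizes. The following example uses the common textbook definitions and is an assumption about how such metrics are typically computed, not a statement of how Protegrity Anonymization computes them. The function name, the suppressed parameter, and the sample rows are hypothetical.

```python
from collections import Counter


def utility_metrics(rows, quasi_identifiers, suppressed=0):
    """Compute class-size based utility metrics for anonymized rows.

    rows: the records that remain after anonymization.
    suppressed: the number of records that were completely suppressed.
    """
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    class_sizes = list(Counter(keys).values())
    total_records = len(rows) + suppressed
    return {
        "average_class_size": len(rows) / len(class_sizes),
        # Each kept record is penalized by the size of its class. Each
        # suppressed record is penalized by the size of the whole dataset.
        "discernibility": sum(s * s for s in class_sizes)
        + suppressed * total_records,
    }


rows = [
    {"age": "20-29"}, {"age": "20-29"}, {"age": "20-29"},
    {"age": "30-39"}, {"age": "30-39"},
]
utility = utility_metrics(rows, ["age"], suppressed=1)
print(utility)
```

Both values grow as classes get larger, which reflects the trade-off described above between anonymity and the quality of the data.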

Detection APIs

The Detection APIs are used to analyze and classify data in Protegrity Anonymization.

Get Data Domains

The Get Data Domains API is used to obtain a list of data domains supported.

For more information about obtaining the data domains API, refer to Get the supported data domains.

Detect Anonymization Information

The Detect Anonymization Information API is used to detect the data domain, classification type, hierarchy, and privacy models for the dataset.

For more information about the detect anonymization information API, refer to Data domain, Classification type, Hierarchy, and Privacy Models detection from a dataset.

Detect Classification

The Detect Classification API is used to detect the classification that will be used for the anonymization operation. Accordingly, you can modify the classification to match your requirements.

For more information about the detect classification API, refer to Classification type detection from a dataset.

Detect Hierarchy

The Detect Hierarchy API is used to detect the hierarchy type that will be used for the anonymization operation.

For more information about the detect hierarchy API, refer to Hierarchy Type detection from a dataset.

2.2 - Understanding Protegrity Anonymization Python SDK Requests

The following APIs are available with the Protegrity Anonymization Python SDK. You can import the SDK into your Python environment, pass the required parameters and data to the Protegrity Anonymization Python SDK requests, and retrieve and work with the anonymized output.

Before running the anonymization jobs mentioned in the Protegrity Anonymization SDK section below, the following prerequisites must be completed:

Ensure that the Anonymization machine is set up and configured as "https://anon.protegrity.com/".
For more information about setting up and configuring an Anonymization machine for AWS and Azure, refer to AWS and Azure.
Ensure that the disk is not full and enough free space is available for saving the destination file.
Verify the destination file is not in use. Set the required permissions for creating and modifying the destination file.
Verify that the anonymization job exists.
Verify that the Python SDK can be imported. For example, import anonsdk as asdk.

You can use different sample requests to build and run the anonymization APIs. For more information about the sample requests for Python SDK, refer to Sample Requests for Protegrity Anonymization.

Understanding the AnonElement object

The AnonElement is an essential part of the Protegrity Anonymization SDK. It holds all information that is required for processing the anonymization request. The AnonElement is a part of the anonsdk package.

The Protegrity Anonymization SDK processes a Pandas dataframe to anonymize data using the Protegrity Anonymization REST API. It is the AnonElement that accepts the parameters and passes the information to the REST API. The AnonElement accepts the connection to the REST API, the Pandas dataframe with the data that must be processed, and, optionally, the source location for processing the request.

Anonymization Functions

The Anonymization Functions APIs are used to run the anonymization job.

Anonymize

The Anonymize API is used to start an anonymize operation.

For more information about the anonymize API, refer to Submit a new anonymization job.

Note: When you run the job, an empty destination file is created. This file is created during processing for verifying the necessary destination permissions. Avoid using this file till the anonymization job is complete.

Ensure that the anonymized data file and the logs generated are moved to a different system before deleting your environment.

If the source file is larger than the maximum limit that is allowed on the Cloud environment, then run the anonymization request with "additional_properties": { "single_file": "no" }.

Apply Anonymize

Note: In this API, privacy model parameters are ignored while performing the anonymization for the new entry.

For more information about the apply anonymize API, refer to Apply anonymization config to a given dataset.

Apply Anonymize API Parameters

Use this API to start an anonymize operation.

Apply Anonymize Job Information	Description
Function	anonymize(anon_object, target_datastore, force, mode)
Parameters	anon_object: The object with the configuration for performing the anonymization request. target_datastore: The location to store the anonymized result. force: The boolean value to force the operation. Acceptable values: True and False. Set this flag to true to resubmit the same anonymized job without any modification. mode: The value to enable auto anonymization. Acceptable value: auto. Do not include this parameter to skip auto anonymization.
Return Type	A job object with which the task monitoring and task statistics can be obtained.
Sample Request	Without auto anonymization: job = asdk.anonymize(anon_object, target_datastore, force=True) With auto anonymization: job = asdk.anonymize(anon_object, target_datastore, force=True, mode="auto") Note: When you run the job, an empty destination file is created. This file is created during processing for verifying the necessary destination permissions. Avoid using this file till the anonymization job is complete.

For more information about using the Auto Anonymization, refer to Using the Auto Anonymizer.

Ensure that the anonymized data file and the logs generated are moved to a different system before deleting your environment.

If the source file is larger than the maximum limit that is allowed on the Cloud environment, then run the anonymization request with "additional_properties": { "single_file": "no" }.

If you want to bypass the Anon-Storage, then you can disable the pods by setting the pty_storage flag to False.
For example, use the following code to run the anonymization request without using the storage pods:

job=asdk.anonymize(anon_object, pty_storage=False)

Measure

The Measure API is used to measure or obtain anonymization result statistics for different configurations before the actual anonymization job.

For more information about the anonymize measure job API, refer to Submit a new anonymization Measure job.

Using Infer to Anonymize API Parameters

Use the Infer API to start auto-detecting the data domain, classification type, hierarchies, and anonymization configuration in Protegrity Anonymization. Any user-defined configuration, such as QI attribute assignments, hierarchy, and K value, is retained and considered while performing the auto anonymization.

Using Infer to Anonymize Information	Description
Function	infer(targetVariable)
Parameters	targetVariable: The field specified here is used as a focus point for performing the anonymization.
Return Type	It returns an anon element with all the detected classifications and hierarchies generated.
Sample Request	e.infer(targetVariable='income') Note: You can use e.measure() to modify the request and view different outcomes of the result set.

For more information about the Infer API, refer to Using Infer to Anonymize.

Task Monitoring APIs

The Task Monitoring APIs are used to monitor the anonymization job. Use these APIs to obtain the job status, retrieve a job, and abort a job.

Get Job IDs

For more information about the job ID API, refer to Obtain job ids.

Get Job Status

For more information about the job status API, refer to Obtain job status.

Get Job Status API Parameters

Use this API to get the status of an anonymize operation that is running. It shows the percentage of job completed. Use the information provided here to monitor if a job is running or stalled.

Monitor Job Information	Description
Function	status()
Parameters	None
Return Type	A string with the status information in JSON format. It contains the following keys. completed: Information about the job, such as data, statistics, summary, and time spent. id: The job ID. info: Information about the job being processed, such as the source and attributes for the job. running: The completion status of the job being processed, shown as the percentage of the job completed. status: The status of the job, such as running or completed. Note: This API displays the complete status of the job. To obtain the ID of a job, use job.id().
Sample Request	job.status()
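
Because status() returns a JSON string, it is usually parsed before use. The following is a minimal sketch that extracts the keys listed in the table above. The helper name and the sample string are assumptions. The exact value formats returned by the service are not specified in this section, so the sample values are illustrative only.

```python
import json


def summarize_status(status_json):
    """Parse the JSON string returned by job.status() into a short summary."""
    payload = json.loads(status_json)
    return {
        "id": payload.get("id"),
        "status": payload.get("status"),
        "running": payload.get("running"),
    }


# Illustrative response only. In real use, pass the value of job.status().
sample = (
    '{"id": "job-1", "status": "running", "running": "40%", '
    '"info": {}, "completed": {}}'
)
summary = summarize_status(sample)
print(summary)
```

Using get() instead of direct indexing keeps the helper from failing if a key is missing from a particular response.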

Get Metadata

For more information about the metadata API, refer to Obtain job metadata.

Retrieve Anonymized Data API Parameters

Use this API to retrieve the results of an anonymized job.

Retrieve Job Information	Description
Function	result()
Parameters	None
Return Type	Returns the AnonResult element, which provides the DataFrame for the anonymized data. Note: The result.df will be None if you have overridden the resultstore as part of the anonymize method.
Sample Request	job.result() Note: This is a blocking API and will stall processing till the job is complete.
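
Because result() blocks until the job is complete, one option is to run it in a worker thread so that the calling code can continue with other work. The following sketch uses the standard concurrent.futures module. The FakeJob class is a hypothetical stand-in for the SDK job object, and its return value is a placeholder and not a real AnonResult.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeJob:
    """Hypothetical stand-in for an SDK job; result() blocks briefly."""

    def result(self):
        time.sleep(0.1)  # simulates waiting for the job to complete
        return "anonymized-result"


job = FakeJob()
with ThreadPoolExecutor(max_workers=1) as pool:
    # The blocking call runs in a worker thread.
    future = pool.submit(job.result)
    # The main thread remains free for other work in the meantime.
    other_work_done = True
    # Wait here only when the anonymized data is actually needed.
    outcome = future.result(timeout=5)
print(outcome)
```

The timeout on future.result() limits how long the caller waits. It does not stop the job itself.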

Abort

The Abort API is used to abort a running anonymization job. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

For more information about the abort API, refer to Abort a running anonymization job.

Note: After aborting the task, it might take time before all the running processes are stopped.

Abort API Parameters

Use this API to abort a running anonymize operation. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

Abort Job Information	Description
Function	abort()
Parameters	None
Return Type	A string with the status of the abort request.
Sample Request	job.abort()
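
The status() and abort() calls can be combined to abort a job that takes longer than an acceptable time. The following is a minimal sketch. The helper name, the timeout values, and the FakeJob class are assumptions, with FakeJob standing in for the SDK job object. The sketch also assumes that the status key reads "completed" when the job is done.

```python
import json
import time


def wait_or_abort(job, timeout_seconds=60.0, poll_seconds=1.0):
    """Poll job.status() until the job completes; abort it on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = json.loads(job.status()).get("status")
        if state == "completed":
            return "completed"
        time.sleep(poll_seconds)
    job.abort()
    return "aborted"


class FakeJob:
    """Hypothetical stand-in that completes after a number of polls."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done
        self.aborted = False

    def status(self):
        self.remaining -= 1
        state = "completed" if self.remaining <= 0 else "running"
        return json.dumps({"status": state})

    def abort(self):
        self.aborted = True
        return "abort requested"


finished = wait_or_abort(FakeJob(2), timeout_seconds=1.0, poll_seconds=0.01)
stalled_job = FakeJob(10**9)
stalled = wait_or_abort(stalled_job, timeout_seconds=0.05, poll_seconds=0.01)
print(finished, stalled)
```

As noted above, it might take time before all the running processes are stopped after an abort, so the caller should not assume that resources are released immediately.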

Delete

The Delete API is used to delete an existing job that is no longer required.

For more information about the delete API, refer to Delete a job.

Statistics APIs

Get Exploratory Statistics

The Get Exploratory Statistics API is used to obtain data distribution statistics about a completed anonymization job. The statistics cover both the source and the target distributions.

For more information about the exploratory statistic API, refer to Obtain the exploratory statistics.
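
Data distribution statistics of this kind are, in essence, the occurrence counts of each unique value of an attribute. The following sketch computes such counts for a source column and an anonymized column with the standard library. It illustrates the concept only and does not call the SDK. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def value_counts(rows, attribute):
    """Count how often each distinct value of `attribute` occurs."""
    return Counter(row[attribute] for row in rows)


source_rows = [{"age": 23}, {"age": 27}, {"age": 31}, {"age": 23}]
anonymized_rows = [
    {"age": "20-29"}, {"age": "20-29"}, {"age": "30-39"}, {"age": "20-29"},
]

source_hist = value_counts(source_rows, "age")
result_hist = value_counts(anonymized_rows, "age")

# Print a simple text histogram for the anonymized column.
for value, count in sorted(result_hist.items()):
    print(f"{value:>6} {'#' * count}")
```

Comparing the two histograms shows how generalization merges distinct source values into fewer, larger groups.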

Get Risk Metric

The Get Risk Metric API is used to ascertain the risk of the anonymized data. It shows the risk of the data against attacks such as journalist, marketer, and prosecutor.

For more information about the risk metric API, refer to Obtain the risk statistics.

Get Utility Statistics

The Get Utility Statistics API is used to check the usability of the anonymized data.

For more information about the utility statistics API, refer to Obtain the anonymization data utility statistics.