Additional Information

Additional information to help you use the product.

1: Best practices when using Protegrity Anonymization
2: Protegrity Anonymization Risk Metrics

1 - Best practices when using Protegrity Anonymization

Suggestions for using Protegrity Anonymization efficiently.

Ensure that the source file is clean based on the following checks:
- Each column contains valid values for its data type. For example, a numeric field, such as salary, must not contain text values.
- The text in the file matches the selected encoding. The source file must not contain special characters or characters that cannot be processed.
Move the anonymized data file and the logs generated to a different system before deleting your environment.
The maximum size of a dataframe that can be attached to an anonymization job is 100 MB.
To process a larger dataset, use one of the available cloud storage options.
Run a maximum of 5 anonymization jobs in Protegrity Anonymization: A maximum of 5 jobs can be placed on the Protegrity Anonymization queue to ensure adequate utilization of resources. If more jobs are raised, the jobs after the initial 5 are rejected and are not processed. If required, increase the limit by updating the JOB_QUEUE_SIZE parameter in the config.yaml file. For Docker, update the config-docker.yaml file.
Protegrity Anonymization accepts a maximum of 60 requests per minute: If more than 60 requests are raised in a minute, the excess requests are rejected and are not processed. If required, increase the limit by updating the DEFAULT_API_RATE_LIMIT parameter in the config.yaml file. For Docker, update the config-docker.yaml file.
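
The source-file checks listed above can be sketched with the Python standard library before a file is submitted. This is a minimal illustration, not part of Protegrity Anonymization: the helper names `find_non_numeric` and `decodes_cleanly`, the sample data, and the choice of UTF-8 as the selected encoding are all assumptions made for this example.

```python
import csv
import io


def find_non_numeric(rows, column):
    """Return (row_number, value) pairs whose `column` value is not a number."""
    bad = []
    for row_number, row in enumerate(rows, start=1):
        value = row[column]
        try:
            float(value)
        except (TypeError, ValueError):
            # TypeError covers a missing field (None); ValueError covers text.
            bad.append((row_number, value))
    return bad


def decodes_cleanly(raw_bytes, encoding="utf-8"):
    """Return True if the bytes decode under the selected encoding."""
    try:
        raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample data; a real source file would be read from disk.
sample = "name,salary\nAnn,52000\nBob,unknown\nCy,61000.50\n"
rows = list(csv.DictReader(io.StringIO(sample)))
bad_salaries = find_non_numeric(rows, "salary")
print(bad_salaries)  # [(2, 'unknown')]
```

Rows reported by `find_non_numeric` would need to be corrected before the file is used as a source, and a file for which `decodes_cleanly` returns False contains bytes that do not match the chosen encoding.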

2 - Protegrity Anonymization Risk Metrics

This section describes how the risk metrics are derived. It provides the definitions and the equations used to calculate the risk.

Definitions

The following definitions are used for risk calculations:

Data Provider or Custodian: The custodian of the data, who is responsible for controlling the sharing process by anonymizing the data and by putting in place other controls that prevent the data from being misused or re-identified.
Data Recipient: The person or institution that receives the data from the data provider.
Dataset: The collection of all records containing the data on subjects.
Adversary: A data recipient who has the motive and the means to re-identify the data, and who intends to use the data in ways that may be harmful to the individuals contained in the dataset.
Target: The person whose details are in the dataset and whom the adversary has selected as the focus of the re-identification attempt.

Types of risks

Protegrity Anonymization uses the Prosecutor, Journalist, and Marketer risk models to assess the probability of re-identification attacks. A description of these risks is provided here.

Prosecutor Risk: The risk that applies when the adversary knows that the target is in the dataset. Knowing that the target is part of the dataset increases the risk of successful re-identification.
Journalist Risk: The risk that applies when the adversary does not know for certain that the target is in the dataset.
Marketer Risk: The risk that applies when the adversary attempts to re-identify as many subjects in the dataset as possible. If an individual subject can be re-identified, then multiple subjects can also be re-identified.

Relationship between the three risks

Prosecutor Risk >= Journalist Risk >= Marketer Risk

If the dataset is protected against the prosecutor risk or the journalist risk, whichever applies given the adversary's knowledge of the target's participation, then by default it is also protected against the marketer risk.

Measuring Risks

This section details the strategy used by Protegrity Anonymization to calculate risks.

For calculating risks, the population is the entire pool from which the sample dataset is drawn. In the current calculation of risk metrics, the population is considered to be the same as the sample. For the journalist risk calculation, it is preferable to use a population taken from the larger dataset from which the sample is drawn.

The following annotations are used in the calculations:

Ra is the proportion of records with a risk above the threshold, that is, the records at highest risk.
Rb is the maximum probability of re-identification, that is, the maximum risk.
Rc is the proportion of records that can be re-identified on average, that is, the success rate of re-identification.

As part of the risk calculations, the anonymization API calculates the following metrics:

pRa is the highest prosecutor risk.
pRb is the maximum prosecutor risk.
pRc is the success rate of prosecutor risk.
jRa is the highest journalist risk.
jRb is the maximum journalist risk.
jRc is the success rate of journalist risk.
mRc is the success rate of marketer risk.

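Risk Type	Equation	Notes
Prosecutor	$\mathrm{pRa} = \frac{1}{n} \sum_{j \in J} f_j \times I\left(\frac{1}{f_j} > T\right)$; $\mathrm{pRb} = \frac{1}{\min_j(f_j)}$; $\mathrm{pRc} = \frac{|J|}{n}$	$f_j$ is the size of equivalence class $j$ in the sample. $F_j$ is the size of equivalence class $j$ in the population. $f_j = F_j$ if the sample is the same as the population. $n$ is the number of records in the sample. $|J|$ is the number of equivalence classes. $I$ is the indicator function. $T$ is the risk threshold, which is the highest allowable probability of correctly re-identifying a single record. The default value of $T$ is 0.1. This value can be configured.
Journalist	$\mathrm{jRa} = \frac{1}{n} \sum_{j \in J} f_j \times I\left(\frac{1}{F_j} > T\right)$; $\mathrm{jRb} = \frac{1}{\min_j(F_j)}$; $\mathrm{jRc} = \max\left(\frac{|J|}{\sum_{j \in J} F_j}, \frac{1}{n} \sum_{j \in J} \frac{f_j}{F_j}\right)$	$f_j$ is the size of equivalence class $j$ in the sample. $F_j$ is the size of equivalence class $j$ in the population. $f_j = F_j$ if the sample is the same as the population. $n$ is the number of records in the sample. $|J|$ is the number of equivalence classes. $I$ is the indicator function. $T$ is the risk threshold. The default value of $T$ is 0.1. This value can be configured.
Marketer	$\mathrm{mRc} = \frac{1}{n} \sum_{j \in J} \frac{f_j}{F_j}$	$n$ is the number of records in the sample. $f_j$ is the size of equivalence class $j$ in the sample. $F_j$ is the size of equivalence class $j$ in the population.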
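
As an illustration of the prosecutor equations in the table above, the following sketch groups records into equivalence classes and computes pRa, pRb, and pRc. It is a hedged example only: the function name `prosecutor_risks`, the quasi-identifier tuples, and the sample records are hypothetical, and the code is not the product's implementation. It assumes the sample is the same as the population, so only the sample class sizes are needed.

```python
from collections import Counter


def prosecutor_risks(records, threshold=0.1):
    """Compute (pRa, pRb, pRc) from quasi-identifier tuples, one per record."""
    class_sizes = Counter(records)  # size of each equivalence class in the sample
    n = len(records)
    # pRa: share of records whose class has a re-identification probability above T.
    at_risk = sum(f for f in class_sizes.values() if 1 / f > threshold)
    p_ra = at_risk / n
    # pRb: the smallest class gives the highest probability for a single record.
    p_rb = 1 / min(class_sizes.values())
    # pRc: number of equivalence classes divided by number of records.
    p_rc = len(class_sizes) / n
    return p_ra, p_rb, p_rc


# Hypothetical records described by (age band, ZIP prefix).
records = (
    [("30-39", "941")] * 12
    + [("40-49", "941")] * 5
    + [("50-59", "100")] * 3
)
p_ra, p_rb, p_rc = prosecutor_risks(records)
print(round(p_ra, 3), round(p_rb, 3), round(p_rc, 3))  # 0.4 0.333 0.15
```

With the default threshold of 0.1, the class of 12 records is below the threshold (1/12 is about 0.083), while the classes of 5 and 3 records are above it, so 8 of the 20 records count toward pRa.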

Measuring Journalist Risk

For Journalist Risk to be applied, the published dataset should be a specific sample.

There are two general types of re-identification attacks under journalist risk:

The adversary is targeting a specific individual.
The adversary is targeting any individual.

In a journalist attack, the adversary matches the published dataset with another identification dataset, such as a voter registry or all patient data in a hospital.

The identification dataset represents the population of which the published dataset is a sample.

In the following example, the published sample dataset is drawn from the identification dataset.

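Derived Risk Metrics	Equation	Risk Value
jRa	$\frac{1}{n} \sum_{j \in J} f_j \times I\left(\frac{1}{F_j} > T\right)$	0
jRb	$\frac{1}{\min_j(F_j)}$	0.25
jRc	$\max\left(\frac{|J|}{\sum_{j \in J} F_j}, \frac{1}{n} \sum_{j \in J} \frac{f_j}{F_j}\right)$	0.13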

Calculation of jRa:

The value of T is 0.33. The sizes of the equivalence classes in the identification dataset are 10, 8, 14, 4, 2.
The indicator function returns 1 when the value 1/F_j is greater than T, and 0 otherwise.
For these classes, the indicator function returns 0, 0, 0, 0, 1.
The sizes of the equivalence classes in the sample are 4, 3, 2, 1.
The values of equivalence class size / number of records are 0.4, 0.3, 0.2, 0.1.
The products of the above values with the corresponding indicator function values are 0, 0, 0, 0.
The value of jRa is 0.

Calculation of jRb:

The minimum equivalence class size in the identification dataset, among the classes that appear in the sample, is 4.
The value of jRb is 1/4 = 0.25.

Calculation of jRc:

The number of equivalence classes in the identification dataset is 5.
The total number of records in the identification dataset is 38.
Number of equivalence classes / total records = 5/38 ≈ 0.131.
The values of equivalence class size in the sample / equivalence class size in the identification dataset are 0.4, 0.375, 0.142857, 0.25.
The total of the above values is approximately 1.16.
Above value / total records in the sample = 1.16 / 10 = 0.116.
Max (0.131, 0.116) = 0.131.
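
The journalist calculations above can be reproduced with a short sketch. The function name `journalist_risks` is hypothetical and the code is not the product's implementation. It makes two assumptions taken from the worked values: the identification class with 2 records has no records in the sample, and the minimum for jRb is taken only over classes that appear in the sample.

```python
def journalist_risks(sample_sizes, population_sizes, threshold):
    """Compute (jRa, jRb, jRc) from paired equivalence class sizes."""
    n = sum(sample_sizes)
    pairs = list(zip(sample_sizes, population_sizes))
    # jRa: share of sample records in classes where 1/F_j exceeds the threshold.
    j_ra = sum(f for f, F in pairs if 1 / F > threshold) / n
    # jRb: assumption, only classes present in the sample are considered.
    j_rb = 1 / min(F for f, F in pairs if f > 0)
    # jRc: the larger of |J| / sum(F_j) and the average of f_j / F_j per record.
    j_rc = max(
        len(pairs) / sum(population_sizes),
        sum(f / F for f, F in pairs) / n,
    )
    return j_ra, j_rb, j_rc


sample_sizes = [4, 3, 2, 1, 0]        # sample classes, 0 for the absent class
population_sizes = [10, 8, 14, 4, 2]  # identification dataset classes
j_ra, j_rb, j_rc = journalist_risks(sample_sizes, population_sizes, threshold=0.33)
print(j_ra, j_rb, round(j_rc, 4))  # 0.0 0.25 0.1316
```

The unrounded jRc is 5/38, which is about 0.1316. The walkthrough lists it as 0.131 and the table as 0.13.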

Measuring Marketer Risk

The use case for deriving the marketer risk is shown here.

Derived Risk Metrics	Equation	Risk Value
mRc	$\frac{1}{n} \sum_{j \in J} \frac{f_j}{F_j}$	0.116

Calculation of mRc:

The values of equivalence class size in the sample / equivalence class size in the identification dataset are 0.4, 0.375, 0.142857, 0.25.
The total of the above values is approximately 1.16.
Above value / total records in the sample = 1.16 / 10 = 0.116.
The value of the marketer risk is 0.116.
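
The marketer calculation above can be checked with a few lines of Python. This is a minimal sketch under the same example values, and the function name `marketer_risk` is hypothetical rather than a product API.

```python
def marketer_risk(sample_sizes, population_sizes):
    """Compute mRc as the sum of f_j / F_j divided by the sample record count."""
    n = sum(sample_sizes)
    ratios = [f / F for f, F in zip(sample_sizes, population_sizes)]
    return sum(ratios) / n


# Class sizes from the example: sample classes and their identification classes.
m_rc = marketer_risk([4, 3, 2, 1], [10, 8, 14, 4])
print(round(m_rc, 4))  # 0.1168
```

The unrounded result is about 0.1168, which the walkthrough above lists as 0.116.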