Common Configurations for Building the Request
Specifying the Transformation
The data store consists of various fields. Identify the fields that must be processed, the type of transformation to perform on each field, and the privacy model to use for anonymizing the data. While specifying the transformation rules, also specify the importance of the data.
Classifying the Fields
Specify the type of information that each field holds. Perform this classification carefully: leaving out important fields might make the anonymized data worthless, while including data that can identify users poses a risk of anonymization not being carried out properly.
The following four classifications are available:
| Classification | Description | Function | Treatment |
|---|---|---|---|
| Direct Identifier | This classification is used for the data in fields that directly identify an individual, such as Name, SSN, phoneNo, email, and so on. | Redact | Values will be removed. |
| Quasi Identifying Attribute | This classification is used for the data in fields that do not identify an individual directly. However, the data needs to be modified to avoid indirect identification. For example, age, date of birth, zip code, and so on. | Hierarchy models | Values will be transformed using the options specified. |
| Sensitive Attribute | This classification is used for the data in fields that do not identify an individual directly. However, the data needs to be modified to avoid indirect identification. This data needs to be preserved to support further analysis and retain the utility of anonymized data. In addition, records with this classification must be part of a larger group so that they no longer enable identification of an individual. | LDiv, TClose | No change in values, except for extreme values that might identify an individual. Values will be generalized in case of t-closeness. |
| Non-Sensitive Attribute | This classification is used for the data in fields that do not identify an individual directly or indirectly. | Preserve | No change in values. |
Ensure that you identify the sensitive and the quasi-identifier fields for specifying the Protegrity Anonymization method for hiding individuals in the dataset.
Use the following code for specifying a quasi-identifier. The first line is for the REST API and the second line is for the Python SDK:
"classificationType": "Quasi Identifier",
e['<column>'] = asdk.Gen_Mask(maskchar='#', maxLength=3, maskOrder="L")
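To illustrate how a classification drives the treatment of a field, the following plain-Python sketch redacts the fields classified as Direct Identifier and leaves the others untouched. It does not use the Protegrity SDK. The column names, the `CLASSIFICATION` map, and the helper `redact_direct_identifiers` are assumptions made for illustration only.

```python
# Hypothetical column-to-classification map; the column names are illustrative.
CLASSIFICATION = {
    "name": "Direct Identifier",
    "zip_code": "Quasi Identifying Attribute",
    "salary": "Sensitive Attribute",
    "favorite_color": "Non-Sensitive Attribute",
}


def redact_direct_identifiers(record, classification):
    """Return a copy of the record without the direct-identifier fields."""
    return {
        column: value
        for column, value in record.items()
        if classification.get(column) != "Direct Identifier"
    }


record = {"name": "Jane Doe", "zip_code": "54892", "salary": 52000, "favorite_color": "blue"}
redacted = redact_direct_identifiers(record, CLASSIFICATION)
print(redacted)
```

Quasi-identifiers and sensitive attributes are not handled in this sketch because their treatment depends on the hierarchy and the privacy model that are configured.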
Specifying the Privacy Model
The privacy model transforms the dataset using one or several Protegrity Anonymization methods to achieve privacy.
The following anonymization techniques are available in Protegrity Anonymization:
K-anonymity
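Each quasi-identifier tuple, that is, each combination of quasi-identifier values, occurs in at least k records. The information type is Quasi-Identifier.
Use the following code for specifying K-anonymity. The JSON fragment is for the REST API and the last line is for the Python SDK: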
"privacyModel": {
"k": {
"kValue": 5
}
}
e.config.k=asdk.K(2)
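The K-anonymity condition itself can be checked with a few lines of plain Python. The following is a minimal sketch that does not use the Protegrity SDK; the helper name `is_k_anonymous` and the sample records are assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier tuple occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


# Illustrative records that have already been generalized.
records = [
    {"gender": "M", "age": "20-30"},
    {"gender": "M", "age": "20-30"},
    {"gender": "F", "age": "20-30"},
    {"gender": "F", "age": "20-30"},
    {"gender": "F", "age": "30-40"},
]

print(is_k_anonymous(records, ["gender"], 2))         # True
print(is_k_anonymous(records, ["gender", "age"], 2))  # False: (F, 30-40) occurs once
```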
l-diversity
Ensures that the sensitive values within each group of k records are diverse enough to reduce the risk of identification. The information type is Sensitive Attribute.
Use the following code for specifying l-diversity. The JSON fragment is for the REST API and the last line is for the Python SDK:
"privacyModel": {
"ldiversity": [
{
"lFactor": 2,
"name": "sex",
"lType": "Distinct-l-diversity"
}
]
}
e["<column>"]=asdk.LDiv(lfactor=2)
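As a rough illustration of the distinct l-diversity condition, the following plain-Python sketch checks that every group of records sharing the same quasi-identifier values contains at least l distinct sensitive values. It does not use the Protegrity SDK; the helper name, the column names, and the sample data are assumptions.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if each equivalence class has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())


# Illustrative records: the "F" group holds only one distinct diagnosis.
records = [
    {"gender": "M", "diagnosis": "flu"},
    {"gender": "M", "diagnosis": "cold"},
    {"gender": "F", "diagnosis": "flu"},
    {"gender": "F", "diagnosis": "flu"},
]

print(is_distinct_l_diverse(records, ["gender"], "diagnosis", 2))  # False
print(is_distinct_l_diverse(records, ["gender"], "diagnosis", 1))  # True
```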
t-closeness
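Limits how far the distribution of each sensitive attribute within a group may differ from its distribution in the whole dataset. The information type is Sensitive Attribute.
Use the following code for specifying t-closeness. The JSON fragment is for the REST API and the last line is for the Python SDK: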
"privacyModel": {
"tcloseness": [
{
"name": "salary-class",
"emdType": "EMD with equal ground distance",
"tFactor": 0.2
}
]
}
e["<column>"]=asdk.TClose(tfactor=0.2)
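The following plain-Python sketch illustrates the idea behind t-closeness with equal ground distance, where the Earth Mover's Distance between two distributions reduces to half the sum of the absolute differences of their probabilities. It does not use the Protegrity SDK and is not necessarily how the product computes the distance; the helper names and sample data are assumptions.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def emd_equal_distance(p, q):
    """EMD with equal ground distance: half the L1 distance between p and q."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_class_distance(records, quasi_identifiers, sensitive):
    """Largest distance between any equivalence class and the whole dataset."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return max(emd_equal_distance(distribution(v), overall) for v in groups.values())


records = (
    [{"gender": "M", "salary_class": c} for c in ["low", "low", "high", "high"]]
    + [{"gender": "F", "salary_class": c} for c in ["low", "low", "low", "high"]]
)

distance = max_class_distance(records, ["gender"], "salary_class")
print(distance)         # 0.125
print(distance <= 0.2)  # True: the data satisfies t-closeness for t = 0.2
```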
Specifying the Hierarchy
The hierarchy specifies how the information in the dataset is handled for Protegrity Anonymization. These hierarchical transformations are performed on Quasi-Identifiers and Sensitive Attributes. Accordingly, the data can be generalized using transformations or aggregated using mathematical functions. Higher levels of the hierarchy anonymize the data more strongly but reduce the quality of the data for further analysis.
Global Recoding and Full Domain Generalization
Global recoding and full domain generalization are used for anonymizing the data. When data is anonymized, the quasi-identifier values are transformed to ensure that the data fulfills the required privacy requirements. This transformation is also called data recoding. In Protegrity Anonymization, data is anonymized using global recoding, that is, the same transformation rule is applied to all entries in the dataset.
Consider the data in the following tables:
| ID | Gender | Age | Race |
|---|---|---|---|
| 1 | Male | 45 | White |
| 2 | Female | 30 | White |
| 3 | Male | 25 | Black |
| 4 | Male | 30 | White |
| 5 | Female | 45 | Black |
The following hierarchy is defined for the Age column:
| Level0 | Level1 | Level2 | Level3 | Level4 |
|---|---|---|---|---|
| 25 | 20-25 | 20-30 | 20-40 | * |
| 30 | 30-35 | 30-40 | 30-50 | * |
| 45 | 40-45 | 40-50 | 40-60 | * |
In the above example, when global recoding is used for a value such as 45, all occurrences of age 45 are generalized to exactly one of the following levels:
- 40-45
- 40-50
- 40-60
- *
Full-domain generalization means that all values of an attribute are generalized to the same level of the associated hierarchy. Thus, in the first table, if age 45 is generalized to 40-50, which is Level2, then all other age values are also generalized to Level2. Hence, the value 30 is generalized to 30-40.
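The combination of global recoding and full-domain generalization can be sketched in plain Python by looking up every age at the same level of the hierarchy table. The dictionary below copies the hierarchy from the table; the helper name `generalize_full_domain` is an assumption, and the sketch does not use the Protegrity SDK.

```python
# Age hierarchy copied from the table: index 0 is Level0, index 4 is Level4.
AGE_HIERARCHY = {
    25: ["25", "20-25", "20-30", "20-40", "*"],
    30: ["30", "30-35", "30-40", "30-50", "*"],
    45: ["45", "40-45", "40-50", "40-60", "*"],
}


def generalize_full_domain(values, hierarchy, level):
    """Generalize every value to the same hierarchy level."""
    return [hierarchy[value][level] for value in values]


ages = [45, 30, 25, 30, 45]  # Age column of the first table
print(generalize_full_domain(ages, AGE_HIERARCHY, 2))
# ['40-50', '30-40', '20-30', '30-40', '40-50']
```

Because the lookup depends only on the value and the level, equal ages always receive the same generalized value, which is the global recoding property.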
In addition to generalization, micro-aggregation is available for transforming the dataset. In generalization, the mathematical function is performed on all the values of the column. However, in micro-aggregation, the mathematical function is performed on all the values within an equivalence class.
Consider the following table with ages of five men and five women.
| Gender | Age |
|---|---|
| M | 20 |
| M | 20 |
| F | 20 |
| M | 22 |
| M | 22 |
| F | 22 |
| F | 22 |
| M | 28 |
| F | 28 |
| F | 28 |
The following output is obtained by performing a generalization aggregation on the Age using averages, by setting the Gender as QI and keeping the K value as 2.
| Gender | Age | Generalization |
|---|---|---|
| M | 20 | 23.2 |
| M | 20 | 23.2 |
| F | 20 | 23.2 |
| M | 22 | 23.2 |
| M | 22 | 23.2 |
| F | 22 | 23.2 |
| F | 22 | 23.2 |
| M | 28 | 23.2 |
| F | 28 | 23.2 |
| F | 28 | 23.2 |
In the table, the sum of all the ages is divided by the total number of records, that is, 10, to obtain the generalization value using average.
The following output is obtained by performing a micro-aggregation on the Age using averages, by setting the Gender as QI and keeping the K value as 2.
| Gender | Age | Micro-Aggregation |
|---|---|---|
| F | 20 | 24 |
| F | 22 | 24 |
| F | 22 | 24 |
| F | 28 | 24 |
| F | 28 | 24 |
| M | 20 | 22.4 |
| M | 20 | 22.4 |
| M | 22 | 22.4 |
| M | 22 | 22.4 |
| M | 28 | 22.4 |
In the table, two equivalence classes are formed based on the gender. The sum of the ages in each group is divided by the number of records in that group, that is, 5, to obtain the micro-aggregation value using average.
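The two sets of values in the tables above can be reproduced with a short plain-Python calculation. This sketch only repeats the arithmetic of the example and does not use the Protegrity SDK; the variable names are assumptions.

```python
from collections import defaultdict

# (Gender, Age) rows from the example table.
rows = [("M", 20), ("M", 20), ("F", 20), ("M", 22), ("M", 22),
        ("F", 22), ("F", 22), ("M", 28), ("F", 28), ("F", 28)]

# Generalization: one average over the whole column.
ages = [age for _, age in rows]
generalized = sum(ages) / len(ages)

# Micro-aggregation: one average per equivalence class (here, per gender).
by_gender = defaultdict(list)
for gender, age in rows:
    by_gender[gender].append(age)
micro = {gender: sum(values) / len(values) for gender, values in by_gender.items()}

print(generalized)  # 23.2
print(micro)        # {'M': 22.4, 'F': 24.0}
```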
Generalization
In Generalization, the data is grouped into sets having similar attributes. The mathematical function is applied on the selected column by considering all the values in the dataset.
The following transformations are available:
- Masking-Based: In this transformation, information is hidden by masking parts of the data to form similar sets. For example, masking the last three numbers in the zip code could help group them, such as 54892 and 54231 both being transformed as 54###.
An example of masking-based transformation for building a REST API and a Python SDK request is provided here.
{
"classificationType": "Quasi Identifier",
"dataTransformationType": "Generalization",
"dataType": "String",
"generalization": {
"hierarchyType": "Rule",
"rule": {
"masking": {
"maskOrder": "Right To Left",
"maskChar": "#",
"maxDomainSize": 5
}
},
"type": "Masking Based"
},
"name": "city"
}
Where:
- maskOrder is the order for masking. Use Right To Left to mask from the right and Left To Right to mask from the left.
- maskChar is the placeholder character for masking.
- maxDomainSize is the number of characters to mask. Default is the maximum length of the string in the column.
e["zip_code"] = asdk.Gen_Mask(maskchar="#", maskOrder = "R", maxLength=5)
Where:
- maskchar is the placeholder character for masking.
- maskOrder is the order for masking. Use R to mask from the right and L to mask from the left.
- maxLength is the number of characters to mask. Default is the maximum length of the string in the column.
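The effect of masking on the zip code example can be sketched in plain Python as follows. The helper `mask_value` and its parameters are assumptions that loosely mirror the options above; it does not use the Protegrity SDK and is not the SDK's implementation.

```python
def mask_value(value, mask_char="#", mask_order="R", count=3):
    """Mask `count` characters from the right ("R") or from the left ("L")."""
    count = min(count, len(value))
    if mask_order == "R":
        return value[:len(value) - count] + mask_char * count
    return mask_char * count + value[count:]


print(mask_value("54892"))                  # 54###
print(mask_value("54231"))                  # 54###
print(mask_value("54892", mask_order="L"))  # ###92
```

Both zip codes map to the same masked value, so the two records fall into the same group.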
- Tree-Based: In this transformation, data is aggregated by transformation to form similar sets using external knowledge. For example, in the case of address, the data can be anonymized based on the city, state, country, or continent, as required. You must specify the file containing the tree data. If the current level of aggregation does not provide adequate anonymization, then a higher level of aggregation is used. The higher the level of aggregation, the more the data is generalized. However, a higher level of generalization reduces the quality of data for further analysis.
An example of tree-based transformation for building a REST API and a Python SDK request is provided here.
{
"classificationType": "Quasi Identifier",
"dataTransformationType": "Generalization",
"dataType": "String",
"generalization": {
"type": "Tree Based",
"hierarchyType": "Data Store",
"dataStore": {
"type": "File",
"file": {
"name": "adult_hierarchy_education.csv",
"props": {
"delimiter": ";",
"quotechar": "\"",
"header": null
}
},
"format": "CSV"
}
},
"name": "education"
}
treeGen = {'lvl0': [11, 13, 14, 15, 27, 28, 20],
'lvl1': ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
'lvl2': ["<20", "<20", "<20", "<30", "<30", "<30", "<30"]}
e["bmi"] = asdk.Gen_Tree(pd.DataFrame(data=treeGen), ["Missing", "Might be <30", "Might be <30"])
You can refer to an external file for specifying the parameters for the hierarchy tree.
education_df = pd.read_csv('D:\\WS\\data source\\hierarchy\\adult_hierarchy_education.csv', sep=';')
e['education'] = asdk.Gen_Tree(education_df)
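To show how a hierarchy tree is read, the following plain-Python sketch looks up a value in the lvl0 column and returns the entry at the same position in a higher level. The data is copied from the treeGen example above; the helper `generalize_with_tree` is an assumption and does not use the Protegrity SDK.

```python
# Hierarchy copied from the treeGen example: each position describes one value.
TREE = {
    "lvl0": [11, 13, 14, 15, 27, 28, 20],
    "lvl1": ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
    "lvl2": ["<20", "<20", "<20", "<30", "<30", "<30", "<30"],
}


def generalize_with_tree(value, tree, level):
    """Return the generalized form of `value` at the requested level."""
    index = tree["lvl0"].index(value)  # raises ValueError for unknown values
    return tree[level][index]


print(generalize_with_tree(14, TREE, "lvl1"))  # <15
print(generalize_with_tree(14, TREE, "lvl2"))  # <20
print(generalize_with_tree(20, TREE, "lvl1"))  # <25
```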
- Interval-Based: In this transformation, data is aggregated into groups according to a predefined interval.
In addition, the lowerbound and upperbound values need to be specified for building the SDK API. Values below the lowerbound and values above the upperbound are excluded from range generation.
An example of interval-based transformation for building a REST API and a Python SDK request is provided here.
{
"classificationType": "Quasi Identifier",
"dataTransformationType": "Generalization",
"dataType": "Integer",
"generalization": {
"hierarchyType": "Rule",
"rule": {
"interval": {
"levels": [
"5",
"10",
"50",
"100"
],
"lowerBound": "0"
}
},
"type": "Interval Based"
},
"name": "age"
}
asdk.Gen_Interval([<interval_level>],<lowerbound>,<upperbound>)
An example of interval-based transformation for building the SDK API is provided here.
e['age'] = asdk.Gen_Interval([5,10,15])
e['age'] = asdk.Gen_Interval([5,10,15],20,60)
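The grouping of a value into an interval can be sketched in plain Python as follows. The helper `to_interval`, the range label format, and the handling of the last range at the upper bound are assumptions; the sketch does not use the Protegrity SDK and the labels it produces might differ from the product's output.

```python
def to_interval(value, width, lower_bound=0, upper_bound=None):
    """Return the range of the given width that contains `value`.

    Values outside the bounds are excluded and yield None.
    """
    if value < lower_bound:
        return None
    if upper_bound is not None and value > upper_bound:
        return None
    start = lower_bound + ((value - lower_bound) // width) * width
    return f"{start}-{start + width}"


print(to_interval(37, 5))           # 35-40
print(to_interval(37, 10))          # 30-40
print(to_interval(37, 15, 20, 60))  # 35-50
print(to_interval(15, 5, 20, 60))   # None: below the lower bound
```

Each wider interval level places more values into the same range, which trades data quality for stronger anonymization.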
- Aggregation-Based: In this transformation, data is aggregated as per the conditions specified. The available options for aggregation are Mean and Mode.
Note: Mean is applicable for Integer and Decimal data types. Mode is applicable for Integer, Decimal, and String data types.
An example of aggregation-based transformation for building a REST API and a Python SDK request is provided here.
{
"classificationType": "Quasi Identifier",
"dataTransformationType": "Generalization",
"dataType": "Integer",
"generalization": {
"hierarchyType": "Aggregate",
"type": "Aggregation Based",
"aggregateFn": "Mean"
},
"name": "age"
}
An example of aggregation-based transformation using Mean is provided here.
e['age'] = asdk.Gen_Agg(asdk.AggregateFunction.Mean)
An example of aggregation-based transformation using Mode is provided here.
e['salary'] = asdk.Gen_Agg(asdk.AggregateFunction.Mode)
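The difference between Mean and Mode, including the use of Mode for String data, can be illustrated with the Python standard library. The helper `aggregate_column` and the sample columns are assumptions; the sketch does not use the Protegrity SDK.

```python
from statistics import mean, mode


def aggregate_column(values, function):
    """Replace every value in the column with one aggregate of the whole column."""
    result = function(values)
    return [result] * len(values)


ages = [20, 20, 20, 22, 22, 22, 22, 28, 28, 28]
education = ["BSc", "MSc", "BSc", "PhD", "BSc"]

mean_ages = aggregate_column(ages, mean)            # numeric data only
mode_education = aggregate_column(education, mode)  # also works for strings

print(mean_ages[0])       # 23.2
print(mode_education[0])  # BSc
```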
- Date-Based: In this transformation, data is aggregated into groups according to the date.
An example of date-based interval and rounding for building a REST API request is provided here.
{
"classificationType": "Quasi Identifier",
"dataTransformationType": "Generalization",
"dataType": "Date",
"generalization": {
"hierarchyType": "Rule",
"type": "Interval Based",
"rule": {
"daterange": {
"levels": [
"WD.M.Y",
"W.M.Y",
"FD.M.Y",
"M.Y",
"QTR.Y",
"Y",
"DEC",
"CEN"
]
}
}
},
"name": "date_of_birth"
}
The date-based transformation is not applicable for building Python SDK requests.
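The following plain-Python sketch shows what generalizing a date to progressively coarser levels could look like. The meaning assigned to the level codes here (M.Y as month and year, QTR.Y as quarter and year, Y as year, DEC as decade, CEN as century) and the output formats are assumptions inferred from the code names, not confirmed behavior. Only these five levels are covered, and the sketch does not use the Protegrity SDK.

```python
from datetime import date


def generalize_date(d, level):
    """Generalize a date to an assumed meaning of the given level code."""
    if level == "M.Y":
        return f"{d.month:02d}.{d.year}"
    if level == "QTR.Y":
        return f"Q{(d.month - 1) // 3 + 1}.{d.year}"
    if level == "Y":
        return str(d.year)
    if level == "DEC":
        return f"{d.year // 10 * 10}s"
    if level == "CEN":
        return f"{d.year // 100 * 100}s"
    raise ValueError(f"Level not covered by this sketch: {level}")


birth = date(1987, 8, 14)
for level in ["M.Y", "QTR.Y", "Y", "DEC", "CEN"]:
    print(level, generalize_date(birth, level))
```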
- Time-Based: In this transformation, data is aggregated into groups according to the time. The time intervals are specified in seconds. The lowerBound and upperBound take values in the format HH:MM:SS.
An example of time-based interval and rounding for building a REST API request is provided here.
{
"classificationType": "Quasi Identifier",
"dataTransformationType": "Generalization",
"dataType": "Date",
"generalization": {
"hierarchyType": "Rule",
"type": "Interval Based",
"rule": {
"interval": {
"levels": [
"30",
"60",
"180",
"240"
],
"lowerBound": "00:00:00",
"upperBound": "23:59:59"
}
}
},
"name": "time_of_birth"
}
The time-based transformation is not applicable for building Python SDK requests.
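Grouping a time into intervals expressed in seconds can be sketched in plain Python as follows. The helper `to_time_bucket` and the choice to label each group by its start time are assumptions; the sketch does not use the Protegrity SDK.

```python
def to_time_bucket(hhmmss, interval_seconds):
    """Return the start (HH:MM:SS) of the interval that contains the given time."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    total = hours * 3600 + minutes * 60 + seconds
    start = total // interval_seconds * interval_seconds
    return f"{start // 3600:02d}:{start % 3600 // 60:02d}:{start % 60:02d}"


# Interval levels taken from the REST example: 30, 60, 180, and 240 seconds.
for interval in [30, 60, 180, 240]:
    print(interval, to_time_bucket("10:15:42", interval))
```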
- Rounding-Based: In this transformation, data is rounded to groups according to a predefined rounding factor.
The rounding-based transformation is not applicable for building the REST API request. Examples for building a Python SDK request are provided here.
An example of date-based rounding is provided here.
e['DateOfBirth'] = asdk.Gen_Rounding(["H.M4", "WD.M.Y", "M.Y"])
An example of numeric-based transformation is provided here.
e['Interest_Rate'] = asdk.Gen_Rounding([0.05,0.10,1])
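Numeric rounding to a factor can be sketched in plain Python as follows. The helper `round_to_factor` and its tie-breaking rule are assumptions and might differ from the SDK's behavior; `Decimal` is used so that factors such as 0.05 do not introduce binary floating-point noise.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_factor(value, factor):
    """Round value to the nearest multiple of factor (ties round away from zero)."""
    v = Decimal(str(value))
    f = Decimal(str(factor))
    steps = (v / f).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return steps * f


# Rounding factors taken from the Interest_Rate example.
for factor in [0.05, 0.10, 1]:
    print(factor, round_to_factor(3.27, factor))
```

With larger factors, more distinct interest rates collapse to the same rounded value.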
Micro-Aggregation
In Micro-Aggregation, mathematical formulas are used to group the data. This is used to achieve K-anonymity by forming small groups of data in the dataset.
The following aggregation functions are available for micro-aggregation in Protegrity Anonymization:
- For numeric data types (Integer and Decimal):
  - Arithmetic Mean
  - Geometric Mean
    Note: Micro-Aggregation using geometric mean is only supported for positive numbers.
  - Median
- For all data types (Integer, Decimal, and String):
  - Mode
An example of micro-aggregation for building a REST API and a Python SDK request is provided here.
{
"classificationType": "Quasi Identifier",
"dataTransformationType": "Micro Aggregation",
"dataType": "Decimal",
"aggregateFn": "Median",
"name": "age_ma_median"
}
e['income'] = asdk.MicroAgg(asdk.AggregateFunction.Mean)
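As a rough illustration of micro-aggregation with a function other than the arithmetic mean, the following plain-Python sketch replaces each value with the median of its equivalence class. The helper `micro_aggregate`, the group keys, and the income values are assumptions; the sketch does not use the Protegrity SDK.

```python
from collections import defaultdict
from statistics import median


def micro_aggregate(rows, aggregate):
    """Append the aggregate of each row's equivalence class to the row."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return [(key, value, aggregate(groups[key])) for key, value in rows]


# Illustrative (equivalence class, income) rows.
rows = [("A", 30000), ("A", 50000), ("A", 90000), ("B", 40000), ("B", 42000)]

for row in micro_aggregate(rows, median):
    print(row)
```

The median of class A is 50000 and the median of class B is 41000.0, so an outlier such as 90000 has less influence than it would have on a mean.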