<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Introduction on</title><link>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/</link><description>Recent content in Introduction on</description><generator>Hugo</generator><language>en</language><lastBuildDate>Fri, 06 Feb 2026 09:26:52 +0000</lastBuildDate><atom:link href="https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/index.xml" rel="self" type="application/rss+xml"/><item><title>Business cases</title><link>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_intro_buscase/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_intro_buscase/</guid><description>&lt;p>Consider the following business cases:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Case 1&lt;/strong>: A hospital wants to share patient data with a third-party research lab. However, patient privacy must be preserved.&lt;/li>
&lt;li>&lt;strong>Case 2&lt;/strong>: An organization requires customer data from several credit unions to create training data. The data will be used to train machine learning models in search of new insights. The customers, however, have not consented to their data being used.&lt;/li>
&lt;li>&lt;strong>Case 3&lt;/strong>: An organization that must comply with GDPR, CCPA, or other privacy regulations needs to retain some information beyond the period that those regulations allow.&lt;/li>
&lt;li>&lt;strong>Case 4&lt;/strong>: An organization requires raw data to train its machine learning software.&lt;/li>
&lt;/ul>
&lt;p>In all these cases, data is an integral input to the ongoing business process or analysis. Moreover, only &lt;em>what was done&lt;/em> is required; &lt;em>who did it&lt;/em> adds no value to the data. The personal information about individual users can therefore be removed from the dataset. This removes the personal factor from the data while retaining its value from the business point of view. Because the resulting data contains no private information, it also falls outside the scope of the legal requirements that govern personal data.&lt;/p></description></item><item><title>Data security and data privacy</title><link>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_data_priv/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_data_priv/</guid><description>&lt;p>Most organizations understand the need to secure access to personally identifiable information. Sensitive values in records are often protected at rest (storage), in transit (network), and in use (fine-grained access control), through a process known as &lt;em>de-identification&lt;/em>. De-identification is a spectrum along which data security and data privacy must be balanced against data usability.&lt;/p>
&lt;p>&lt;img src="https://docs.protegrity.com/anonymization/1.4.0/docs/images/hide/hide_intromap.jpg" alt="" title="Data Protection Spectrum">&lt;/p>
&lt;h2 id="pseudonymization">Pseudonymization&lt;/h2>
&lt;p>Pseudonymization is the process of de-identification by substituting sensitive values with a consistent, non-sensitive value. This is most often accomplished through encryption, tokenization, or dynamic data masking. Access to the process for re-identification (decryption, detokenization, unmasking) is controlled, so that only users with a business requirement see the sensitive values.&lt;/p></description></item><item><title>Importance and types of data</title><link>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_intro_datatypes/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_intro_datatypes/</guid><description>&lt;p>These records might be linked with other records, such as income statements or medical records, to provide valuable information. The fields as a whole, called a record, are private and user-centric. However, the individual fields may or may not be personal. Accordingly, based on their privacy level, the following data classifications are available:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Direct Identifier&lt;/strong>: Identity attributes can identify an individual by their value alone. These attributes are unique to an individual within a dataset, and at times even in the world. They are personal and private to the user. Examples include name, passport number, Social Security Number (SSN), and mobile number.&lt;/li>
&lt;li>&lt;strong>Quasi-Identifier or Indirect Identifier&lt;/strong>: Quasi-identifying attributes are identifying characteristics of a data subject. However, a quasi-identifier alone cannot identify an individual. Examples include a date of birth or an address. Moreover, the individual pieces of data in a quasi-identifier might not be enough to identify a single individual. Take date of birth: a birth year is common to many individuals, so it is difficult to narrow down to a single person. If the dataset is small, however, it might be easy to identify an individual using this information.&lt;/li>
&lt;li>&lt;strong>Data about data subject&lt;/strong>: Data about the data subject is typically the data that is being analyzed. This data might exist in the same table or in a related table of the dataset. It provides valuable information about the dataset and is very helpful for analysis. This data may or may not be private to an individual. Examples include salary, account balance, or credit limit. However, like quasi-identifiers, in a small dataset this data might be unique to an individual. Additionally, this data can be classified as follows:
&lt;ul>
&lt;li>&lt;strong>Sensitive Attributes&lt;/strong>: This data may disclose something, such as a health condition, that in a small result set may identify a single individual.&lt;/li>
&lt;li>&lt;strong>Insensitive Attributes&lt;/strong>: This data carries no privacy risk and is common information, such as whether a bank account belongs to an individual or a business.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
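&lt;p>The classification above can be made concrete with a short sketch. The following Python snippet is illustrative only, not a Protegrity API: the field names (name, ssn, birth_date, salary) and the classification sets are hypothetical. It suppresses direct identifiers and generalizes a quasi-identifier (date of birth to birth year) while leaving data about the subject usable for analysis.&lt;/p>

```python
# Minimal sketch: suppress direct identifiers and generalize a quasi-identifier.
# Field names and classifications here are hypothetical, not a Protegrity schema.

DIRECT_IDENTIFIERS = {"name", "ssn", "mobile"}   # removed entirely
QUASI_IDENTIFIERS = {"birth_date"}               # generalized, not removed

def anonymize_record(record):
    """Return a copy of the record with identifying fields treated."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue                       # drop: identifies a person on its own
        elif field in QUASI_IDENTIFIERS:
            out["birth_year"] = value[:4]  # keep only the year of "1984-07-02"
        else:
            out[field] = value             # data about the subject stays usable
    return out

patient = {"name": "Jane Doe", "ssn": "123-45-6789",
           "birth_date": "1984-07-02", "salary": 52000}
print(anonymize_record(patient))  # {'birth_year': '1984', 'salary': 52000}
```

&lt;p>Whether a field counts as a direct or quasi-identifier depends on the dataset, so in practice these sets are part of the protection policy rather than fixed in code.&lt;/p>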
&lt;p>A sample dataset is shown in the following figure:&lt;/p></description></item><item><title>Data anonymization techniques</title><link>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_priv_types/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_priv_types/</guid><description>&lt;h2 id="important-terminology">Important terminology&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>De-identification&lt;/strong>:
General term for any process of removing the association between a set of identifying data and the data subject.&lt;/li>
&lt;li>&lt;strong>Pseudonymization&lt;/strong>:
Particular type of data de-identification that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.&lt;/li>
&lt;li>&lt;strong>Anonymization&lt;/strong>:
Process that removes the association between the identifying dataset and the data subject.
Anonymization is another subcategory of de-identification. Unlike pseudonymization, it does not provide a means by which the information can be linked to the same person across multiple data records or information systems.
Hence, re-identification of anonymized data is not possible.&lt;/li>
&lt;/ul>
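&lt;p>The distinction between pseudonymization and anonymization in the terminology above can be sketched in a few lines of Python. This is an illustrative sketch, not Protegrity's implementation: the keyed hash, the key value, and the function names are all assumptions. The point is that a pseudonym is consistent, so the holder of the key can still link records, whereas an anonymized value supports no linkage at all.&lt;/p>

```python
import hashlib
import hmac

# Illustrative sketch only; the key and the token scheme are hypothetical.
SECRET_KEY = b"demo-key"  # held by the data controller, enabling linkage

def pseudonymize(ssn):
    """Consistent token: the same SSN always maps to the same pseudonym,
    so records remain linkable across systems by whoever holds the key."""
    return hmac.new(SECRET_KEY, ssn.encode(), hashlib.sha256).hexdigest()[:12]

def anonymize(ssn):
    """Irreversible suppression: the value is removed outright, so no
    linkage back to the data subject is possible."""
    return None

# The pseudonym is repeatable (linkable); the anonymized value is not.
print(pseudonymize("123-45-6789") == pseudonymize("123-45-6789"))  # True
print(anonymize("123-45-6789"))                                    # None
```

&lt;p>Real pseudonymization schemes (tokenization, format-preserving encryption) preserve more structure than a hash prefix, but the linkability property shown here is the defining difference.&lt;/p>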
&lt;p>&lt;strong>Note&lt;/strong>: As defined in ISO/TS 25237:2008.&lt;/p></description></item><item><title>How Protegrity Anonymization Works</title><link>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_work_how/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.protegrity.com/anonymization/1.4.0/docs/introduction/hide_work_how/</guid><description>
&lt;p>Protegrity Anonymization is a software solution that processes data by removing personal information and transforming the remaining details to protect privacy.&lt;/p></description></item></channel></rss>