Importance and types of data
Importance of data classification to reduce re-identification risk while preserving data utility.
A record consists of all the information pertaining to a user. It is made up of different fields, such as first name, last name, address, telephone number, age, and so on. A record might be linked with other records, such as income statements or medical records, to provide valuable information. The record as a whole is private and user-centric. However, the individual fields may or may not be personal. Accordingly, based on the privacy level, the following data classifications are available:
- Direct Identifier: Identity attributes can identify an individual by their value alone. These attributes are unique to an individual within a dataset and, at times, even in the world. They are personal and private to the user. Examples include name, passport number, Social Security Number (SSN), and mobile number.
- Quasi-Identifier or Indirect Identifier: Quasi-identifying attributes are identifying characteristics of a data subject, such as date of birth or address. However, a quasi-identifier alone usually cannot identify an individual, and its individual parts are even less likely to do so. Take date of birth as an example: the year is common to many individuals, so it is difficult to narrow it down to a single person. However, if the dataset is small, or if several quasi-identifiers are combined, it might be easy to identify an individual using this information.
- Data about the data subject: This is typically the data that is being analyzed. It might exist in the same table or in a different, related table of the dataset. It provides valuable information about the dataset and is very helpful for analysis. This data may or may not be private to an individual. Examples include salary, account balance, and credit limit. However, like quasi-identifiers, this data might be unique to an individual in a small dataset. This data can be further classified as follows:
  - Sensitive Attributes: This data may disclose something private, such as a health condition, which in a small result set may identify a single individual.
  - Insensitive Attributes: This data is not associated with a privacy risk and is common information, such as the type of bank account (individual or business).
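The point about quasi-identifiers becoming identifying in a small dataset can be illustrated with a minimal Python sketch. The records, the field names `date_of_birth` and `state`, and the helper `unique_count` are all hypothetical and invented for illustration. They are not part of any product API.

```python
from collections import Counter

# Hypothetical records; the values are invented for illustration only.
records = [
    {"date_of_birth": "1985-03-12", "state": "NY"},
    {"date_of_birth": "1985-07-30", "state": "NY"},
    {"date_of_birth": "1985-03-12", "state": "CA"},
    {"date_of_birth": "1990-11-02", "state": "NY"},
]


def unique_count(rows, key_fn):
    """Return how many rows have a key value that no other row shares."""
    counts = Counter(key_fn(row) for row in rows)
    return sum(1 for row in rows if counts[key_fn(row)] == 1)


# Birth year alone is shared by most rows, so few individuals stand out.
by_year = unique_count(records, lambda r: r["date_of_birth"][:4])

# The full date of birth narrows the groups further.
by_dob = unique_count(records, lambda r: r["date_of_birth"])

# Combining two quasi-identifiers isolates every row in this small dataset.
by_dob_state = unique_count(records, lambda r: (r["date_of_birth"], r["state"]))

print(by_year, by_dob, by_dob_state)
```

In this toy dataset, the number of uniquely identifiable rows grows as the key becomes more specific, which is why quasi-identifiers are assessed in combination rather than one at a time.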
A sample dataset is shown in the following figure:

Based on the type of data, the columns in the sample dataset can be classified as follows:
| Type | Field Names | Description |
|---|---|---|
| Direct Identifier | First Name, Last Name, Address with city and state, E-Mail Address, SSN / NID | The data in these fields is enough to identify an individual. |
| Quasi-Identifier | City, State, Date of Birth | The data in these fields could be the same for more than one individual. Note: Address could be a direct identifier if only a single individual in the dataset is from a particular state. |
| Sensitive Attribute | Account Balance, Credit Limit, Medical Code | The data is important for analysis. However, in a small dataset, it is easy to re-identify an individual. |
| Insensitive Attribute | Type | The data is general information, making it difficult to re-identify an individual. |
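
The classification in the table above could be captured in code as a simple lookup. This is only a sketch: the `Classification` enum, the `FIELD_CLASSIFICATION` mapping, and the `fields_of` helper are hypothetical names chosen for illustration, and the field names are taken from the table.

```python
from enum import Enum


class Classification(Enum):
    DIRECT_IDENTIFIER = "Direct Identifier"
    QUASI_IDENTIFIER = "Quasi-Identifier"
    SENSITIVE_ATTRIBUTE = "Sensitive Attribute"
    INSENSITIVE_ATTRIBUTE = "Insensitive Attribute"


# Hypothetical mapping of the sample dataset's columns to their classification.
FIELD_CLASSIFICATION = {
    "First Name": Classification.DIRECT_IDENTIFIER,
    "Last Name": Classification.DIRECT_IDENTIFIER,
    "Address": Classification.DIRECT_IDENTIFIER,
    "E-Mail Address": Classification.DIRECT_IDENTIFIER,
    "SSN / NID": Classification.DIRECT_IDENTIFIER,
    "City": Classification.QUASI_IDENTIFIER,
    "State": Classification.QUASI_IDENTIFIER,
    "Date of Birth": Classification.QUASI_IDENTIFIER,
    "Account Balance": Classification.SENSITIVE_ATTRIBUTE,
    "Credit Limit": Classification.SENSITIVE_ATTRIBUTE,
    "Medical Code": Classification.SENSITIVE_ATTRIBUTE,
    "Type": Classification.INSENSITIVE_ATTRIBUTE,
}


def fields_of(kind):
    """Return the field names assigned to the given classification."""
    return [name for name, c in FIELD_CLASSIFICATION.items() if c is kind]


quasi_identifiers = fields_of(Classification.QUASI_IDENTIFIER)
print(quasi_identifiers)
```

Keeping the classification in one mapping makes it easy to review and to adjust when a dataset changes, for example when a field such as Address needs to be treated as a direct identifier.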