About Protegrity Anonymization

Protegrity Anonymization, developed by Protegrity, assesses the reidentification risk of datasets containing personal data.

1: Protegrity Anonymization Architecture
2: Understanding Protegrity Anonymization Components

Protegrity Anonymization processes datasets through generalization so that the risk of reidentification stays within tolerable thresholds. For example, instead of recording a data subject as 32 years old, the anonymization process might generalize the age to a range of 30-35 years. Generalization reduces data utility, but Protegrity Anonymization optimizes this fundamental privacy-utility trade-off to preserve as much data quality as the privacy goals allow. The trade-off can be tuned further with the importance parameter, which is described later.
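The age example above can be illustrated with a small, self-contained Python sketch. It is not the Protegrity Anonymization algorithm or API; the function name `generalize_age`, the fixed range width of 5, and the sample ages are assumptions made for illustration. The sketch only shows the general idea: exact values are replaced by ranges, and more records then share the same generalized value.

```python
from collections import Counter


def generalize_age(age: int, width: int = 5) -> str:
    """Map an exact age to a fixed-width range label such as '30-35'.

    Hypothetical helper for illustration only. Each range is treated as
    half-open, so an age of 35 falls into '35-40', not '30-35'.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


# Sample ages (made up for this sketch).
ages = [32, 31, 34, 47, 45, 33]

# Replace each exact age with its range.
generalized = [generalize_age(a) for a in ages]

# Count how many records share each generalized value. Larger groups mean
# an individual record is harder to single out by age alone.
group_sizes = Counter(generalized)

print(generalized)
print(dict(group_sizes))
```

A wider range makes the groups larger but the ages less precise, which is the privacy-utility trade-off described above.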

Protegrity Anonymization uses Kubernetes to anonymize data at scale, and provides instructions and support for deployment and usage on AWS EKS and Microsoft Azure AKS.

Note: Currently, Protegrity Anonymization has been tested only on AWS EKS and Microsoft Azure AKS.

1 - Protegrity Anonymization Architecture

Communication between Protegrity Anonymization, the Dask Scheduler, and Dask Workers is detailed in this section.

An overview of the communication is shown in the following figure.

Protegrity Anonymization runs as several pods on Kubernetes. The first pod contains the Dask Scheduler, which connects to the Dask Worker pod over TLS. If a dataset requires more processing capacity, additional Dask Worker pods can be added, depending on the configuration. The Protegrity Anonymization Web Server performs the processing and uses an internal Database Server to hold the data securely.

A Protegrity Anonymization request is received by the Nginx-Ingress component, which forwards it to the Anon-App. The Anon-App processes the request and submits tasks to the Dask Cluster, where the Dask Scheduler schedules the tasks on the Dask Workers. The Anon-App stores metadata about the job in the Anon-DB container. The Dask Workers then read, write, and process the data that is stored in the Anon-Storage, the request stream, or the cloud storage. The Anon-Storage uses an S3 bucket for storing data.

The Dask Scheduler handles the communication between itself and the Dask Workers. The Dask Workers run on random ports.

The user accesses Protegrity Anonymization using HTTPS over port 443. The user requests are directed to an Ingress Controller, and the controller in turn communicates with the required pods using the following ports:

8090: Ingress controller and the Protegrity Anonymization API Web Service
8786: Ingress controller and the Dask Scheduler
8100: Ingress controller and S3 bucket


2 - Understanding Protegrity Anonymization Components

Protegrity Anonymization components are leveraged to anonymize datasets.

Protegrity Anonymization is composed of the following main components:

Protegrity Anonymization REST Server: This core component exposes a REST interface through which clients interact with the Protegrity Anonymization service. It uses an in-memory task queue and stores anonymized datasets and their metadata on persistent storage. Anonymization tasks are submitted to the queue and handled in first-in, first-out (FIFO) order. The server invokes the Dask Scheduler to perform each anonymization task.

Note: Only one anonymization task is executed at a time in Protegrity Anonymization.

REST Client: The client connects to the Protegrity Anonymization REST Server using an API tool, such as Postman, to create and send Protegrity Anonymization requests and to receive the responses. A Swagger interface that details the available APIs is also provided. The Swagger interface can also be used as a REST client for raising API requests.
Python SDK: The Python programmatic interface used to communicate with the REST server.
Anon-Storage: This component reads data from and writes data to the storage. It uses the S3 bucket framework to perform file operations.
Anon-DB: A PostgreSQL database that stores metadata related to Protegrity Anonymization jobs.
Dask Scheduler: This component analyzes the workload and distributes processing of the dataset to one or more Dask Workers. The scheduler can invoke additional workers or reduce the number of workers required for processing the task. The Dask Scheduler analyzes the dataset as a whole and allocates a small chunk of the dataset to each worker.
Dask Worker: This component is registered with the Dask Scheduler and processes the dataset. Each Dask Worker works on a subset of the entire data. The Dask library handles the interaction with the datasets and the storage. Protegrity Anonymization supports cloud storage, S3 buckets, and other storage compatible with Kubernetes. The repository can also be kept outside the container.
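
The way these components cooperate can be sketched in plain Python. This is a conceptual illustration only: it does not use Dask, the Protegrity Anonymization REST API, or the Python SDK. All names (`FifoJobQueue`, `run_job`, `split_into_chunks`, `process_chunk`) are hypothetical, and a thread pool stands in for the Dask Workers. It shows two behaviors described above: jobs are taken from an in-memory queue one at a time in first-in, first-out order, and each job's dataset is split into chunks that are handed to separate workers.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(rows) or 1))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Stand-in for per-worker work: round each value down to a multiple of 5."""
    return [(value // 5) * 5 for value in chunk]


def run_job(rows, n_workers=3):
    """Process one job by handing its chunks to a pool of workers."""
    chunks = split_into_chunks(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        parts = list(pool.map(process_chunk, chunks))
    return [value for part in parts for value in part]


class FifoJobQueue:
    """In-memory queue that runs submitted jobs one at a time, oldest first."""

    def __init__(self):
        self._pending = deque()

    def submit(self, job_id, rows):
        self._pending.append((job_id, rows))

    def run_all(self):
        completed = []
        while self._pending:
            job_id, rows = self._pending.popleft()  # first in, first out
            completed.append((job_id, run_job(rows)))
        return completed


# Made-up sample jobs for this sketch.
jobs = FifoJobQueue()
jobs.submit("job-a", [32, 47, 18, 51, 29])
jobs.submit("job-b", [64, 23])
results = jobs.run_all()
print(results)
```

In the sketch, the second job does not start until the first has finished, mirroring the note that only one anonymization task is executed at a time, while the work inside a single job is spread across several workers.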