Anonymization

Protegrity Anonymization, developed by Protegrity, assesses the reidentification risk of datasets containing personal data.

Protegrity Anonymization allows processing of the datasets, via generalization, to ensure the risk of reidentification is within tolerable thresholds. For a meaningful anonymization of a dataset, direct identifiers and quasi-identifiers need to be correctly identified and specified in the configuration of an anonymization job. If direct identifiers and quasi-identifiers are not correctly specified, the risk metrics do not reflect the true risks of reidentification of that anonymized dataset.

1 - Introduction

Learn about data privacy.

Organizations today collect vast amounts of personal data, providing valuable insights into individuals’ habits, purchasing trends, health, and preferences. This information helps businesses refine their strategies, develop products, and drive success. However, much of this data is highly sensitive and private, requiring organizations to implement robust protection measures that align with compliance requirements and business needs.

To safeguard personal data, pseudonymization can be used to replace direct identifiers with encrypted or tokenized values, allowing data to be processed while minimizing direct exposure to sensitive attributes. Because pseudonymized data can be re-identified with authorized access to the decryption or tokenization mechanism, it enables controlled data usage while maintaining privacy. However, as more fields—particularly quasi-identifiers—are pseudonymized to prevent re-identification, the overall utility of the data may decrease. Attributes like ZIP codes, birthdates, or demographic details may not be personally identifiable on their own, but when combined, they can reveal an individual’s identity. Protecting these fields strengthens privacy but may also limit their analytical value. Striking the right balance between security and usability is essential for compliance while preserving meaningful insights.

For scenarios requiring a higher level of privacy protection, anonymization provides an additional layer of security by ensuring that not only PII but also quasi-identifiers are generalized, redacted, or transformed. This prevents re-identification even when multiple data points are analyzed together. Anonymization techniques include removing or obfuscating key attributes and generalizing data to broader categories (e.g., replacing an exact address with just the city or state). By implementing anonymization, organizations can retain the analytical value of data while eliminating the risk of re-identification, ensuring compliance with privacy regulations and ethical data practices.
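
To make the generalization idea concrete, here is a minimal Python sketch (illustrative only, not Protegrity's implementation) that maps exact values to broader categories:

```python
# Illustrative sketch of generalization: precise quasi-identifier
# values are replaced with broader categories.

def generalize_age(age: int, width: int = 5) -> str:
    """Map an exact age to a range, e.g. 32 -> '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_address(address: str) -> str:
    """Keep only the coarsest component (here, the last one).
    Assumes a 'street, city, state' layout for illustration."""
    return address.split(",")[-1].strip()

record = {"age": 32, "address": "12 Main St, Springfield, IL"}
anonymized = {
    "age": generalize_age(record["age"]),
    "address": generalize_address(record["address"]),
}
print(anonymized)  # {'age': '30-34', 'address': 'IL'}
```

Real anonymization engines choose the coarseness of such mappings automatically, so that a privacy model is satisfied while information loss stays as low as possible.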

1.1 - Business cases

A few business cases to understand more about the importance of data privacy.

Consider the following business cases:

  • Case 1: A hospital wants to share patient data with a third-party research lab. The privacy of the patient, however, must be preserved.
  • Case 2: An organization requires customer data from several credit unions to create training data. The data will be used to train machine learning models looking for new insights. The customers, however, have not agreed for their data to be used.
  • Case 3: An organization which must be compliant with GDPR, CCPA, or other privacy regulations needs to keep some information beyond the retention period that those regulations allow.
  • Case 4: An organization requires raw data to train their software for machine learning.

In all these cases, data forms an integral part of the source for continuing the business process or analysis. Additionally, in all the cases only what was done is required; who did it has no value in the data. The personal information about the individual users can therefore be removed from the dataset. This removes the personal factor from the data while retaining its value from the business point of view. Since this data no longer contains any private information, it also falls outside the legal requirements governing personal data.

Thus, revisiting the business cases, the data in each case can be valuable after processing it in the following ways:

  • In case 1, all private information can be removed from the data and sent to the research lab for analysis.
  • In case 2, all private information must be scrubbed from the data before the data can be used. After scrubbing, the data will be generalized in such a way that the data can be used for machine learning, since no one will be able to identify individuals in the anonymized dataset.
  • In case 3, by anonymizing the data, the Data Subject is removed, and the data is no longer in scope for privacy compliance.
  • In case 4, a generalized form of the data can be obtained.

Manually removing private information would take a lot of time and effort, especially if the dataset consists of millions of records with file sizes of several GBs. Running a find-and-replace or simply deleting columns might remove important fields and make the dataset useless for further analysis. Additionally, a combination of the remaining attributes (such as date of birth, postcode, and gender) may be enough to re-identify the data subject.

Protegrity Anonymization applies various privacy models to the data, removing direct identifiers and applying generalization to the remaining indirect identifiers, to ensure that no single data subject can be identified.

1.2 - Data security and data privacy

Understand the difference between data security and data privacy.

Most organizations understand the need to secure access to personally identifiable information. Sensitive values in records are often protected at rest (storage), in transit (network), and in use (fine-grained access control) through a process known as de-identification. De-identification is a spectrum along which data security and data privacy must be balanced with data usability.

Pseudonymization

Pseudonymization is the process of de-identification by substituting sensitive values with a consistent, non-sensitive value. This is most often accomplished through encryption, tokenization, or dynamic data masking. Access to the process for re-identification (decryption, detokenization, unmasking) is controlled, so that only users with a business requirement will see the sensitive values.

Advantages:

  • The original data can be obtained again.
  • Only authorized users can view the original data from protected data.
  • It processes each record and cell (intersection of a record and column) individually.
  • This process is faster than anonymization.

Disadvantages:

  • Access-Control Dependency: Pseudonymized data remains linkable to its original form if authorized users have access to the decryption or tokenization mechanism, which requires strict security controls.

  • Regulatory Considerations: Since pseudonymization allows re-identification under controlled access, it may not meet the same compliance exemptions as anonymization under certain privacy regulations.

  • Increased Security Overhead: Additional security measures are needed to protect the tokenization keys and manage access controls, ensuring only authorized users can reverse the process.

  • Limited Protection for Quasi-Identifiers: While direct identifiers are typically tokenized, quasi-identifiers (e.g., birthdates, ZIP codes) may still pose a re-identification risk if not generalized or redacted.

  • Using tokenized data might make analysis incorrect or less useful (e.g., changing time-related attributes).

  • The tokenized data is still private from the user's perspective.

  • Further processing is required to retrieve the original data.

  • Additional security is required to secure the data and the keys used for working with the data.

Anonymization

Anonymization is the process of de-identification that irreversibly redacts, aggregates, and generalizes identifiable information on all data subjects in a dataset. This method ensures that while the data retains value for various use cases (analytics, data democratization, sharing with third parties, and so on), the individual data subject can no longer be identified in the dataset.

Advantages:

  • Anonymized datasets can be used for analysis with typically low information loss.
  • An individual user cannot be identified from the anonymized dataset.
  • Enables compliance with privacy regulation.

Disadvantages:

  • Being an irreversible process, the original data cannot be obtained again, even though some use cases require it.
  • This process is slower than pseudonymization because multiple passes must be made on the set to anonymize it.

1.3 - Importance and types of data

A record consists of all the information pertaining to a user. This record consists of different fields of information, such as first name, last name, address, telephone number, age, and so on.

These records might be linked with other records, such as income statements or medical records, to provide valuable information. The fields as a whole, called a record, are private and user-centric. However, the individual fields may or may not be personal. Accordingly, based on the privacy level, the following data classifications are available:

  • Direct Identifier: Identity attributes that can identify an individual by the value alone. These attributes are unique to an individual in a dataset, and at times even in the world. They are personal and private to the user. For example, name, passport number, Social Security Number (SSN), or mobile number.
  • Quasi-Identifier or Indirect Identifier: Quasi-identifying attributes are identifying characteristics of a data subject. However, you cannot identify an individual with a quasi-identifier alone. For example, a date of birth or an address. Moreover, the individual pieces of data in a quasi-identifier might not be enough to identify a single individual. Take the example of date of birth: the birth year is common to many individuals and would be difficult to narrow down to a single individual. However, if the dataset is small, it might be enough to identify an individual.
  • Data about data subject: Data about the data subject is typically the data that is being analyzed. This data might exist in the same table or a different related table of the dataset. It provides valuable information about the dataset and is very helpful for analysis. This data may or may not be private to an individual. For example, salary, account balance, or credit limit. However, like quasi-identifiers, in a small dataset this data might be unique to an individual. Additionally, this data can be classified as follows:
    • Sensitive Attributes: This data may disclose something like a health condition, which in a small result set may identify a single individual.
    • Insensitive Attributes: This data is not associated with a privacy risk and is common information, such as the type of bank account (individual or business).
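
As a concrete illustration, the classification of a sample dataset's columns could be sketched as a simple Python mapping. The field names and structure below are hypothetical and do not reflect the actual Protegrity Anonymization job configuration format:

```python
# Hypothetical sketch of a column classification; the real Protegrity
# Anonymization job configuration format may differ.
classification = {
    "direct_identifiers":     ["first_name", "last_name", "email", "ssn"],
    "quasi_identifiers":      ["city", "state", "date_of_birth"],
    "sensitive_attributes":   ["account_balance", "credit_limit", "medical_code"],
    "insensitive_attributes": ["account_type"],
}

# Sanity check: no column should appear in more than one class.
all_columns = [c for cols in classification.values() for c in cols]
assert len(all_columns) == len(set(all_columns))
```

Getting this mapping right matters: as noted earlier, if direct and quasi-identifiers are misclassified, the computed risk metrics will not reflect the true re-identification risk.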

A sample dataset is shown in the following figure:

Based on the type of data, the columns in the above table can be classified as follows:

Type | Field Names | Description
Direct Identifier | First Name, Last Name, Address with city and state, E-Mail Address, SSN / NID | The data in these fields is enough to identify an individual.
Quasi-Identifier | City, State, Date of Birth | The data in these fields could be the same for more than one individual. Note: Address could be a direct identifier if only a single individual is present from a particular state.
Sensitive Attribute | Account Balance, Credit Limit, Medical Code | The data is important for analysis; however, in a small dataset it may be enough to re-identify an individual.
Insensitive Attribute | Type | The data is general information, making it difficult to re-identify an individual.

1.4 - Data anonymization techniques

The privacy models are techniques for anonymizing data.

Important terminology

  • De-identification: General term for any process of removing the association between a set of identifying data and the data subject.
  • Pseudonymization: Particular type of data de-identification that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.
  • Anonymization: Process that removes the association between the identifying dataset and the data subject. Anonymization is another subcategory of de-identification. Unlike pseudonymization, it does not provide a means by which the information may be linked to the same person across multiple data records or information systems. Hence reidentification of anonymized data is not possible.

Note: As defined in ISO/TS 25237:2008.

Anonymization models

  • k-anonymity: k-anonymity can be described as "hiding in the crowd". In a dataset with k-anonymity, each quasi-identifier tuple occurs in at least k records. Because each individual is part of a group of at least k records, any record in that group could correspond to any of those individuals.

  • l-diversity: The l-diversity model is an extension of k-anonymity that adds intra-group diversity for sensitive values to the anonymization mechanism. It addresses a weakness of the k-anonymity model: protecting identities to the level of k individuals does not protect the corresponding sensitive values, especially when the sensitive values within a group are homogeneous.

  • t-closeness: t-closeness is a further refinement of l-diversity. The t-closeness model extends l-diversity by also taking into account the distribution of values of a sensitive attribute, requiring the distribution within each group to be close to the distribution in the overall dataset.
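
As a rough sketch of the first two models (illustrative only, not Protegrity's algorithms): k is the size of the smallest group of records sharing a quasi-identifier tuple, and l is the smallest number of distinct sensitive values within such a group:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """k of a dataset: size of the smallest equivalence class
    of quasi-identifier tuples."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def l_diversity(records, quasi_identifiers, sensitive):
    """l of a dataset: minimum number of distinct sensitive values
    within any quasi-identifier equivalence class."""
    groups = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(r[sensitive])
    return min(len(values) for values in groups.values())

records = [
    {"zip": "120", "age": "30-34", "diagnosis": "flu"},
    {"zip": "120", "age": "30-34", "diagnosis": "cold"},
    {"zip": "130", "age": "35-39", "diagnosis": "flu"},
    {"zip": "130", "age": "35-39", "diagnosis": "flu"},
]
print(k_anonymity(records, ["zip", "age"]))               # 2
print(l_diversity(records, ["zip", "age"], "diagnosis"))  # 1
```

In this toy dataset the group (zip 130, age 35-39) is 2-anonymous but only 1-diverse: both of its records share the same diagnosis, illustrating the homogeneity weakness that l-diversity addresses.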

1.5 - How Protegrity Anonymization Works

Protegrity Anonymization takes a dataset as input, removes direct identifiers, transforms quasi-identifiers, applies privacy models, and outputs an anonymized dataset. The privacy models are also used to calculate the risk of re-identification.

Protegrity Anonymization is a software solution that processes data by removing personal information and transforming the remaining details to protect privacy.

In simple terms, it takes raw data as input, applies techniques like generalization and summarization, and outputs anonymized data that can still be used for analysis—without revealing individual identities. The following figure illustrates this process.

As shown in the above image, a sample table is fed as input into Protegrity Anonymization. The private data that can be used to identify a particular individual is removed from the table. The final table with anonymized information is provided as output. The output table shows data loss due to column and row removals during anonymization. This data loss is necessary to mitigate the risk of re-identification.

The anonymized data is used for analytics and data sharing. However, a standard set of attacks is defined to assess the effectiveness of anonymization against different attack vectors. The re-identification attacks can come from a prosecutor, a journalist, or a marketer. The prosecutor's attack is known as the worst-case attack because the target individual is known.

  • In the prosecutor attack, the attacker has prior knowledge about a specific person whose information is present in the dataset. The attacker matches this pre-existing information with the information in the dataset to identify the individual.
  • In the journalist attack, the attacker uses prior information that is available. However, this information might not be enough to identify a person in the dataset. Here, the attacker might find additional information about the person in public records and narrow down the records to re-identify the individual.
  • In the marketer attack, the attacker tries to re-identify as many people as possible from the dataset. This is a hit-or-miss strategy, and many of the matches might be incorrect. However, even if most matches are incorrect, it is an issue if even a few individuals are correctly identified.
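
These attack models are commonly quantified from the sizes of quasi-identifier equivalence classes. As a simplified sketch (not Protegrity's risk metrics implementation), the worst-case prosecutor risk can be taken as 1 divided by the size of the smallest such class:

```python
from collections import Counter

def prosecutor_risk(records, quasi_identifiers):
    """Worst-case prosecutor re-identification risk:
    1 / (size of the smallest quasi-identifier equivalence class)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1 / min(groups.values())

records = [
    {"zip": "120", "age": "30-34"},
    {"zip": "120", "age": "30-34"},
    {"zip": "130", "age": "35-39"},
]
print(prosecutor_risk(records, ["zip", "age"]))  # 1.0
```

A risk of 1.0 means at least one record is unique on its quasi-identifiers, so a prosecutor who knows the target is in the dataset can identify that record with certainty; generalizing until every class has at least k members caps this risk at 1/k.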

For more information about risk metrics, refer to Protegrity Anonymization Risk Metrics.

2 - About Protegrity Anonymization

Protegrity Anonymization, developed by Protegrity, assesses the reidentification risk of datasets containing personal data.

Protegrity Anonymization allows processing of the datasets via generalization, to ensure the risk of reidentification is within tolerable thresholds. An example of this generalization process is that instead of a data subject being 32 years old, the anonymization process might need to generalize age to be a range between 30-35 years old. The anonymization process will have an impact on data utility, but Protegrity Anonymization optimizes this fundamental privacy-utility trade-off to ensure maximum data quality within the privacy goals. This trade-off can be further optimized via the importance parameter, which is described later.

Protegrity Anonymization leverages Kubernetes for data anonymization at scale and provides instructions and support for deployment and usage on AWS EKS and Microsoft Azure AKS.

Note: Currently, Protegrity Anonymization has been tested only on AWS EKS and Microsoft Azure AKS.

2.1 - Protegrity Anonymization Architecture

Communication between Protegrity Anonymization, the Dask Scheduler, and Dask Workers is detailed in this section.

An overview of the communication is shown in the following figure.

Protegrity Anonymization runs as several pods on Kubernetes. The first pod contains the Dask Scheduler, which connects to the Dask Worker pods over TLS. If Protegrity Anonymization requires more processing power for a dataset, additional Dask Worker pods can be added based on the configuration.

The Protegrity Anonymization Web Server performs the processing, using an internal Database Server to hold the data securely. The anonymization request is received by the Nginx-Ingress component, which forwards it to the Anon-App. The Anon-App processes the request and submits the tasks to the Dask Cluster, where the Dask Scheduler schedules tasks on the Dask Workers. The Anon-App stores the metadata about the job in the Anon-DB container. The Dask Workers then read, write, and process the data that is stored in the Anon-Storage, the request stream, or the Cloud storage. Anon-Storage uses MinIO for storing data. The Anon-workstation comprises the Jupyter notebook environment with Anon preinstalled. Communication between the Dask Scheduler and the Dask Workers is handled by the Dask Scheduler, and the Dask Workers run on random ports.

The user accesses Protegrity Anonymization using HTTPS over the port 443. The user requests are directed to an Ingress Controller, and the controller in turn communicates with the required pods using the following ports:

  • 8090: Ingress controller and the Protegrity Anonymization API Web Service
  • 8786: Ingress controller and the Dask Scheduler
  • 8100: Ingress controller and MinIO
  • 8888: Ingress controller and the Jupyter Lab service


2.2 - Understanding Protegrity Anonymization Components

Protegrity Anonymization components are leveraged to anonymize datasets.

Protegrity Anonymization is composed of the following main components:

  • Protegrity Anonymization REST Server: This core component exposes a REST interface through which clients can interact with the anonymization service. Protegrity Anonymization uses an in-memory task queue and stores anonymized datasets and their respective metadata on persistent storage. Anonymization tasks are submitted to a queue and handled in first-in, first-out fashion. Protegrity Anonymization invokes the Dask Scheduler to perform the anonymization task.

Note: Only one anonymization task is executed at a time in Protegrity Anonymization.

  • REST Client: The client connects to the Protegrity Anonymization REST Server using an API tool, such as Postman, to create, send, and receive the anonymization request. It also provides a Swagger interface detailing the APIs available. The Swagger interface can also be used as a REST client for raising API requests.
  • Python SDK: It is the Python programmatic interface used to communicate with the REST server.
  • Anon-Storage: It is used to read data from and write data to the storage. It uses the MinIO framework to perform file operations.
  • Anon-DB: It is a PostgreSQL database that is used to store metadata related to anonymization jobs.
  • Dask Scheduler: This component analyzes the workload and distributes processing of the dataset to one or more Dask Workers. The scheduler can invoke additional workers or reduce the number of workers required for processing the task. The Dask Scheduler analyzes the dataset as a whole and allocates a small chunk of the dataset to each worker.
  • Dask Worker: This component is registered with the Dask Scheduler and processes the dataset. The Dask library handles the interaction and interface with the datasets and the storage. Protegrity Anonymization supports cloud storage, MinIO, and other storages compatible with Kubernetes. The repository can also be kept outside the container. Each Dask Worker works on a subset of the entire data.
  • Jupyter Lab Workstation: The Jupyter Lab notebook provides a ready environment to run an anonymization request using Protegrity Anonymization with minimum configuration. To use the notebook, you open the notebook, update the required parameters in the notebook, and run the request.
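
For example, a client could submit an anonymization job to the REST Server along these lines. The endpoint path, payload fields, and configuration keys below are illustrative assumptions, not the documented Protegrity Anonymization API:

```python
# Hypothetical sketch of a programmatic REST client using only the
# standard library; the endpoint path and payload shape are illustrative
# assumptions, not the documented Protegrity Anonymization API.
import json
import urllib.request

def submit_job(base_url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"{base_url}/jobs",                      # hypothetical endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

job = {
    "dataset": "s3://bucket/customers.csv",      # illustrative location
    "direct_identifiers": ["ssn", "email"],
    "quasi_identifiers": ["zip", "date_of_birth"],
    "privacy_model": {"k_anonymity": {"k": 5}},
}
# response = submit_job("https://anon.example.com", job)
```

The Swagger interface mentioned above documents the actual endpoints and payloads; treat this snippet only as a shape for how a client such as the Python SDK or Postman interacts with the server.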

3 - Installing Protegrity Anonymization

Protegrity Anonymization is available as a REST API that can be installed and run from Kubernetes environments on AWS and Azure. After installing the REST API, you can use the Protegrity Anonymization API to anonymize your data. A local Docker deployment mode is also available.

3.1 - Prerequisites for Deploying the Protegrity Anonymization API

Prerequisites to install the Protegrity Anonymization REST API.

The Protegrity Anonymization API is provided as a Docker image. Prepare your system to run the commands for setting up the basic Kubernetes services required by the Protegrity Anonymization API. Additionally, ensure that the following prerequisites are met to install the Protegrity Anonymization REST API in your Cloud environment.

  • The user should be well versed in using a container orchestration service, such as Kubernetes, on different cloud services.

  • Access as an Admin user is available for the cloud service used.

  • A minimum of 2 nodes with the following minimum configuration:

    • RAM: 16 GB
    • CPU: 8 core
    • Hard Disk: Unlimited
  • Extract the ANON-API_DEB-ALL-64_x86-64_Docker-ALL-64_1.4.0.x.tgz and ANON-SDK_ALL-ALL-64_x86-64_PY-3-64_1.4.0.x.tgz files from the .tgz archive and verify the contents of the package.

    • ANON-REST-API_1.4.0.x.tgz – Installation package for the Protegrity Anonymization API. This package contains the following files:
    Files | Description
    ANON-API_1.4.0.x.tar.gz | Image used to create the Protegrity Anonymization API Docker container.
    cluster-aws.yaml | Template configuration file for creating the cluster in the AWS Cloud environment.
    ANON-API_HELM_1.4.0.x.tgz | Helm chart used to deploy the Protegrity Anonymization API application on the Kubernetes cluster.
    Anon_logs.sh | Script for extracting the logs from the Protegrity Anonymization API container.
    README.txt | Readme containing information about the Protegrity Anonymization API.
    Contractual.csv | List of libraries used in the Protegrity Anonymization API.
    docker/docker-compose.yaml | File used to deploy the API in Docker containers.
    docker/nginx.conf | File used to configure nginx for Docker.
    docker/cert/cert.pem | Default self-signed certificate for the Docker container.
    docker/cert/key.pem | Key for the Docker container.
    aws-terraform/main.tf | Template file used to deploy the API in AWS using Terraform.
    aws-terraform/vars.tf | File used for specifying the cluster configuration information.
    rbac/kubconfigcmd.txt | Commands for working with RBAC, including retrieving tokens and assigning access to the service account.
    rbac/anon-service-account.yaml | Template file containing the RBAC namespace configuration information.
    rbac/anon-role-and-rolebinding.yaml | Template file containing the RBAC configuration for roles and role binding.
    rbac/anon-clusterrolebinding.yaml | File containing the RBAC configuration for binding the roles to the cluster.
  • ANON-NOTEBOOK_1.4.0.x.tgz - Docker image for the Protegrity Anonymization API Notebook workstation. Do not extract or modify the contents of this file.

  • ANON-SDK_ALL-ALL-64_x86-64_PY-3-64_1.4.0.x.tgz - Contains the Anonsdk-wheel file that is used to install anonsdk in the Python environment.

  • If required, a REST client for accessing the REST services, such as Postman.

3.2 - Using Cloud Services

Configure the Protegrity Anonymization API in the different cloud services.

The Protegrity Anonymization API can be hosted in the Kubernetes service provided by various cloud platforms, such as AWS and Azure.

  • Anonymizing Using Amazon Elastic Kubernetes Service (EKS)
  • Anonymizing Using Azure Kubernetes Service (AKS)

Note: The Protegrity Anonymization API may be compatible with other Cloud providers. However, that compatibility has not been tested.

3.2.1 - Anonymizing Using Amazon Elastic Kubernetes Service (EKS)

3.2.1.1 - Verifying the Prerequisites

Prerequisites for configuring Protegrity Anonymization API on Amazon Elastic Kubernetes Service (EKS).

Ensure that the following prerequisites are met:

  • Base machine - A Linux machine instance that is used to communicate with the Kubernetes cluster. This instance can be on-premise or on AWS. Ensure that Helm is installed on this Linux instance. You must also install Docker on this Linux instance to communicate with the Container Registry, where you want to upload the Docker images.

    For more information about the minimum hardware requirements, refer to the section Prerequisites for Deploying the Protegrity Anonymization API.

  • Access to an AWS account.

  • Permissions to create a Kubernetes cluster.

  • IAM user:

    • Required to create the Kubernetes cluster. This user requires the following policy permissions managed by AWS:

      • AmazonEC2FullAccess
      • AmazonEKSClusterPolicy
      • AmazonS3FullAccess
      • AmazonSSMFullAccess
      • AmazonEKSServicePolicy
      • AmazonEKS_CNI_Policy
      • AWSCloudFormationFullAccess
      • Custom policy that allows the user to create a new role and an instance profile, retrieve information regarding a role and an instance profile, attach a policy to the specified IAM role, and so on. The following actions must be permitted on the IAM service:
        • GetInstanceProfile
        • GetRole
        • AddRoleToInstanceProfile
        • CreateInstanceProfile
        • CreateRole
        • PassRole
        • AttachRolePolicy
      • Custom policy that allows the user to delete a role and an instance profile, detach a policy from a specified role, delete a policy from the specified role, remove an IAM role from the specified EC2 instance profile, and so on. The following actions must be permitted on the IAM service:
        • GetOpenIDConnectProvider
        • CreateOpenIDConnectProvider
        • DeleteInstanceProfile
        • DeleteRole
        • RemoveRoleFromInstanceProfile
        • DeleteRolePolicy
        • DetachRolePolicy
        • PutRolePolicy
      • Custom policy that allows the user to manage EKS clusters. The following actions must be permitted on the EKS service:
        • ListClusters
        • ListNodegroups
        • ListTagsForResource
        • ListUpdates
        • DescribeCluster
        • DescribeNodegroup
        • DescribeUpdate
        • CreateCluster
        • CreateNodegroup
        • DeleteCluster
        • DeleteNodegroup
        • UpdateClusterConfig
        • UpdateClusterVersion
        • UpdateNodegroupConfig
        • UpdateNodegroupVersion

      For more information about creating an IAM user, refer to Creating an IAM User in Your AWS Account. Contact your system administrator for creating the IAM users.

      For more information about the AWS-specific permissions, refer to API Reference document for Amazon EKS.

  • Access to the Amazon Elastic Kubernetes Service (EKS) to create a Kubernetes cluster.

  • Access to the AWS Elastic Container Registry (ECR) to upload the Protegrity Anonymization API image.

3.2.1.2 - Preparing the Base Machine

Steps to prepare the base machine for working with the EKS cluster.

The steps provided here install the software required for running the various EKS commands for setting up and working with the Protegrity Anonymization API cluster.

  1. Log in to your system as an administrator.

  2. Open a command prompt with administrator privileges.

  3. Install the following tools to get started with creating the EKS cluster.

    1. Install AWS CLI 2, which provides a set of command line tools for the AWS Cloud Platform.

      For more information about installing the AWS CLI 2, refer to Installing or updating to the latest version of the AWS CLI.

    2. Configure AWS CLI on your machine by running the following command.

      aws configure
      

      You are prompted to enter the AWS Access Key ID, Secret Access Key, AWS Region, and the default output format where these results are formatted.

      For more information about configuring AWS CLI, refer to Configuring settings for the AWS CLI.

      You need to specify the credentials of IAM User created in the section Verifying the Prerequisites to create the Kubernetes cluster.

      AWS Access Key ID [None]: <AWS Access Key ID of the IAM User 1>
      AWS Secret Access Key [None]: <AWS Secret Access Key of the IAM User 1>
      Default region name [None]: <Region where you want to deploy the Kubernetes cluster>
      Default output format [None]: json
      
    3. Install Kubectl version 1.22, which is the command line interface for Kubernetes.

      Kubectl enables you to run commands from the Linux instance so that you can communicate with the Kubernetes cluster.

      For more information about installing kubectl, refer to Set up kubectl and eksctl in the AWS documentation.

    4. Install one of the following command line tools for creating the Kubernetes cluster on AWS (EKS):

      • eksctl: Install eksctl which is a command line utility to create and manage Kubernetes clusters on Amazon Elastic Kubernetes Service (Amazon EKS).

        For more information about installing eksctl on the Linux instance, refer to Set up to use Amazon EKS.

      • Terraform/OpenTofu: Optionally, install Terraform or OpenTofu, which are command line tools for creating and managing Kubernetes clusters. Use the terraform version command in the CLI to verify that Terraform or OpenTofu is installed.

        For more information about installing Terraform or OpenTofu, refer to Install Terraform.

    5. Install the Helm client version 3.8.2 for working with Kubernetes clusters.

      For more information about installing the Helm client, refer to Installing Helm.

3.2.1.3 - Creating the EKS Cluster

Steps to create the EKS cluster.

Complete the steps provided here to create the EKS cluster by running commands on the machine for the Protegrity Anonymization API.

Note: The steps listed in this procedure for creating the EKS cluster are for reference use. If you have an existing EKS cluster or want to create an EKS cluster based on your own requirements, then you can directly navigate to the section Accessing the EKS Cluster to connect your EKS cluster and the Linux instance.

To create an EKS cluster:

  1. Log in to the Linux machine.

  2. Obtain and extract the Protegrity Anonymization API files to a directory on your system.

    1. Download and extract the ANON-API_DEB-ALL-64_x86-64_Docker-ALL-64_1.4.0.x.tgz file.
    2. Verify that the following files are available in the package:
      • ANON-REST-API_1.4.0.x.tgz: The files for working with Protegrity Anonymization REST API.
      • ANON-NOTEBOOK_1.4.0.x.tgz: This file contains the image for the Anon-workstation.
    3. Extract the contents of the ANON-REST-API_1.4.0.x.tgz and ANON-NOTEBOOK_1.4.0.x.tgz files to a directory.
  3. Add the Cloud-related settings in the configuration files using one of the following options:

    Note: Use the checklist at AWS Checklist to update the YAML files.

    • For eksctl: Update the cluster-aws.yaml template file with the EKS authentication values for creating the EKS cluster.

      • Update the following placeholder information in the cluster-aws.yaml file.

          apiVersion: eksctl.io/v1alpha5
          kind: ClusterConfig
          metadata:
            name: <cluster_name>   #(provide an appropriate name for your cluster)
            region: <Region where you want to deploy Kubernetes Cluster>   #(specify the region to be used)
            version: "1.27"
          vpc:
            id: "#Update_vpc_here#"   # (enter the vpc id to be used)
            subnets:             # (In this section specify the subnet region and subnet id accordingly)
              private:
                <Availability zone for the region where you want to deploy your Kubernetes cluster>:
                  id: "#Update_id_here#"
                <Availability zone for the region where you want to deploy your Kubernetes cluster>:
                  id: "#Update_id_here#"
          nodeGroups:
            - name: <Name of your Node Group>
              instanceType: t3a.xlarge
              minSize: 2
              maxSize: 4        # (Set max node size according to load to be processed, for cluster-autoscaling)
              desiredCapacity: 3
              privateNetworking: true
              iam:
                attachPolicyARNs:
                  - "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
                  - "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
                  - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
                withAddonPolicies:
                  autoScaler: true
                  awsLoadBalancerController: true
                  ebs: true
              securityGroups:
                withShared: true
                withLocal: true
                attachIDs: ['#Update_security_group_id_linked_to_your_VPC_here#']
              tags:
                #Add required tags (Product, name, etc.) here
                k8s.io/cluster-autoscaler/<cluster_name>: "owned"       # (Update your cluster name in this line. These tags are required for cluster-autoscaling.)
                k8s.io/cluster-autoscaler/enabled: "true"
                Product: "Anonymization"
              ssh:
                publicKeyName: '<EC2 Key Pair>'        # SSH key to log in to nodes in the cluster if needed.
        

        Note: In the ssh/publicKeyName parameter, you must specify the name of the key pair that you have created.

        For more information about creating the EC2 key pair, refer to Amazon EC2 key pairs and Amazon EC2 instances.

        The AmazonEKSWorkerNodePolicy policy allows Amazon EKS worker nodes to connect to Amazon EKS Clusters. For more information about the policy, refer to Amazon EKS Worker Node Policy.

        For more information about the attached role arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy in the nodegroup, refer to Amazon EKS node IAM role.

        The ARN of the AmazonEKS_CNI_Policy policy is a default AWS policy that enables the Amazon VPC CNI Plugin to modify the IP address configuration on your EKS nodes. For more information about this policy, refer to Amazon EKS CNI Policy.

        For more information about the attached role arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy in the nodegroup, refer to Configure Amazon VPC CNI plugin to use IRSA.

    • For Terraform: Update the following placeholder information in the aws-terraform/vars.tf file with the Terraform values for creating the cluster.

      variable "cluster_name" {
        default = "<Cluster_name>"       ## Name for your EKS cluster.
      }
      variable "cluster_version" {
        default = "1.27"
      }
      variable "aws_region" {
        default = "<Region>"             ## Region in which the EKS cluster will be created.
      }
      variable "role_arn" {
        default = "<Specify Role_arn>"   ## ARN of the IAM role that provides permissions for the Kubernetes control plane to make calls to AWS API operations on your behalf.
      }
      variable "security_group_id" {
        default = ["<Specify security group id>"]     ## Security group ID for your VPC.
      }
      variable "subnet_ids" {
        default = ["<subnet-1 id>", "<subnet-2 id>"]  ## Subnet IDs. Ensure that the subnets are in different Availability Zones.
      }
      variable "node_group_name" {
        default = "<Nodegroup Name>"     ## Name of the node group that joins the EKS cluster.
      }
      variable "node_role_arn" {
        default = "<IAM-Node ROLE ARN>"  ## ARN of the IAM role that provides permissions for the EKS node group.
      }
      variable "instance_type" {
        default = ["<instance_type>"]    ## Type of the nodes in the EKS cluster, for example, t3a.xlarge.
      }
      variable "desired_nodes_count" {
        default = "<Desired node count>" ## Desired number of nodes running in the EKS cluster.
      }
      variable "max_nodes" {
        default = "<Max node count>"     ## Maximum number of nodes that the EKS cluster can autoscale to.
      }
      variable "min_nodes" {
        default = "<Min node count>"     ## Minimum number of nodes in the EKS cluster.
      }
      variable "ssh_key" {
        default = "<EC2-SSH-key>"        ## EC2 SSH key pair for SSH access to the cluster nodes.
      }
      output "endpoint" {
        value = aws_eks_cluster.eks_Anon.endpoint
      }
  4. Run one of the following commands to create the Kubernetes cluster. The cluster creation process might take 10 to 15 minutes to complete:

    • For eksctl:

      eksctl create cluster -f cluster-aws.yaml
      
    • For Terraform:

      terraform init
      terraform plan
      terraform apply
      
  5. Deploy the Cluster Autoscaler component to enable the autoscaling of nodes in the EKS cluster.

    For more information about deploying the Cluster Autoscaler, refer to the Deploy the Cluster Autoscaler section in the Amazon EKS documentation.

  6. Install the Metrics Server to enable the horizontal autoscaling of pods in the Kubernetes cluster.

    For more information about installing the Metrics Server, refer to the Horizontal Pod Autoscaler section in the Amazon EKS documentation.

3.2.1.4 - Accessing the EKS Cluster

Steps to access the EKS cluster.

Connect to the cloud service using the steps in this section.

  1. Run the following command to connect your Linux instance to the Kubernetes cluster.

    aws eks update-kubeconfig --name <Name of Kubernetes cluster> --region <Region in which the cluster is created>
    
  2. Run the following command to verify that the nodes are deployed.

    kubectl get nodes
    

    Note: You can also verify that the nodes are deployed in AWS from the EKS Kubernetes Cluster dashboard.

3.2.1.5 - Uploading the Image to AWS Container Registry (ECR)

Steps to upload the Protegrity Anonymization API image.

Use the information in this section to upload the Protegrity Anonymization API image to the AWS container registry (ECR) for running the Protegrity Anonymization API in EKS.

Ensure that you have set up your Container Registry.

Note: The steps listed in this section for uploading the container images to the Amazon Elastic Container Repository (ECR) are for reference use. You can choose to use a different Container Registry for uploading the container images.

For more information about setting up Amazon ECR, refer to Moving an image through its lifecycle in Amazon ECR.
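If you have not yet created a repository in ECR, a hedged sketch of the setup follows. The account ID and region values are placeholders; the repository name anon matches the tag examples later in this section. The aws ecr create-repository call is shown as a comment because it requires a configured AWS CLI:

```shell
# Placeholders below are assumptions; substitute your own account ID and region.
AWS_ACCOUNT_ID="123456789012"
AWS_REGION="eu-west-1"

# The registry path has the form <account>.dkr.ecr.<region>.amazonaws.com.
REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
echo "Registry path: ${REGISTRY}/anon"

# Create the repository once per image name (requires a configured AWS CLI):
#   aws ecr create-repository --repository-name anon --region "$AWS_REGION"
```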

To install the Protegrity Anonymization API:

  1. Log in to the machine as an administrator to install the Protegrity Anonymization API.

  2. Install Docker using the steps provided at https://docs.docker.com/engine/install/.

  3. Configure Docker to push the Protegrity Anonymization API images to the AWS Container Registry (ECR) by running the following command:

    aws ecr get-login-password --region <Region> | docker login --username AWS --password-stdin <AWS_account_ID>.dkr.ecr.<Region>.amazonaws.com
    
  4. Obtain and extract the Protegrity Anonymization files to a directory on your system.

    1. Download and extract the ANON-API_DEB-ALL-64_x86-64_Docker-ALL-64_1.4.0.x.tgz file.

    2. Extract the contents of the ANON-REST-API_1.4.0.x.tgz and ANON-NOTEBOOK_1.4.0.x.tgz files to a directory.

      Note: Do not extract the ANON-API_1.4.0.x.tar.gz package obtained after this extraction. You run the docker load command directly on this package.

  5. Navigate to the directory where the ANON-API_1.4.0.x.tar.gz file is saved.

  6. Load the Docker image into Docker by using the following command:

    docker load < ANON-API_1.4.0.x.tar.gz
    
  7. List the images that are loaded by using the following command:

    docker images
    
  8. Tag the image to the ECR repository by using the following command:

    docker tag <Container image>:<Tag> <Container registry path>/<Container image>:<Tag>
    

    For example:

    docker tag ANON-API_1.4.0.x:anon_EKS <account_name>.dkr.ecr.region.amazonaws.com/anon:anon_EKS
    
  9. Push the tagged image to the ECR by using the following command:

    docker push <Container_registry_path>/<Container_image>:<Tag>
    

    For example:

    docker push <account_name>.dkr.ecr.region.amazonaws.com/anon:anon_EKS
    
  10. Extract ANON-NOTEBOOK_1.4.0.x.tgz to obtain the ANON-NOTEBOOK_1.4.0.x.tar.gz file and then repeat the steps 5 to 9 for ANON-NOTEBOOK_1.4.0.x.tar.gz.

    The images are loaded to the ECR and are ready for deployment.

    For more information about pushing container images to the ECR, refer to Moving an image through its lifecycle in Amazon ECR.
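The load, tag, and push steps above can be collected into one small script. This is a sketch only, assuming the registry path and tag from the examples; the docker commands are left commented so that the string handling can be checked independently of a real registry:

```shell
# Assumed values based on the examples above; adjust to your registry.
REGISTRY="123456789012.dkr.ecr.eu-west-1.amazonaws.com"  # placeholder registry path
IMAGE="ANON-API_1.4.0.x"                                 # image name reported by `docker images`
TAG="anon_EKS"

# Compose the fully qualified target reference for the ECR repository "anon".
TARGET="${REGISTRY}/anon:${TAG}"
echo "docker tag  ${IMAGE}:${TAG} ${TARGET}"
echo "docker push ${TARGET}"

# Uncomment to run against a real registry:
# docker load < "${IMAGE}.tar.gz"
# docker tag  "${IMAGE}:${TAG}" "${TARGET}"
# docker push "${TARGET}"
```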

3.2.1.6 - Setting up NGINX Ingress Controller

Steps to install the NGINX Ingress Controller.

Complete the steps provided here for installing the NGINX Ingress Controller on the base machine.

  1. Login to the base machine and open a command prompt.

  2. Create a namespace where the NGINX Ingress Controller needs to be deployed using the following command.

    kubectl create namespace <Namespace name>
    

    For example,

    kubectl create namespace nginx
    
  3. Add the repository from where the Helm charts for installing the NGINX Ingress Controller must be fetched using the following command.

    helm repo add stable https://charts.helm.sh/stable
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    
  4. Install the NGINX Ingress Controller using Helm charts using the following command.

    helm install nginx-ingress --namespace <Namespace name> --set controller.replicaCount=1 --set controller.nodeSelector."beta\.kubernetes\.io/os"=linux --set defaultBackend.nodeSelector."beta\.kubernetes\.io/os"=linux ingress-nginx/ingress-nginx --set controller.publishService.enabled=true --set controller.ingressClassResource.name=<NGINX ingress class name> --set podSecurityPolicy.enabled=true --set rbac.create=true --set controller.extraArgs.enable-ssl-passthrough="true" --set controller.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-internal"=\"true\" --set controller.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-connection-idle-timeout"=\"300\" --version 4.3.0
    

    For example,

    helm install nginx-ingress --namespace nginx --set controller.replicaCount=1 --set controller.extraArgs.enable-ssl-passthrough="true" --set controller.nodeSelector."beta\.kubernetes\.io/os"=linux --set defaultBackend.nodeSelector."beta\.kubernetes\.io/os"=linux ingress-nginx/ingress-nginx --set controller.publishService.enabled=true --set controller.ingressClassResource.name=nginx-anon --set podSecurityPolicy.enabled=true --set rbac.create=true --set controller.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-internal"=\"true\" --set controller.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-connection-idle-timeout"=\"300\" --version 4.3.0
    

    For more information about the various configuration parameters for installing the NGINX Ingress Helm charts, refer to values.yaml file.

  5. Check the status of the nginx-ingress release and verify that all the deployments are running accurately using the following command.

    kubectl get pods -n <Namespace name>
    

    For example,

    kubectl get pods -n nginx
    

    Note: Make a note of the pod name. It is required as a parameter in the next step.

  6. View the logs on the Ingress pod using the following command.

    kubectl logs pod/<pod-name> -n <Namespace name>
    
  7. Obtain the external IP of the nginx service by executing the following command.

    kubectl get service --namespace <Namespace name>
    

    For example,

    kubectl get service -n nginx
    

    Note: Make a note of the external IP. It is required for communicating with the Protegrity Anonymization API.
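If you prefer to capture the external address programmatically, the sketch below picks it out of the kubectl get service output with awk. The service names and address in the sample variable are illustrative only, and the column positions assume the default kubectl table format:

```shell
# Sample output of `kubectl get service -n nginx --no-headers` (illustrative).
services='nginx-ingress-ingress-nginx-controller       LoadBalancer   10.0.5.1   a1b2.elb.amazonaws.com   80:30080/TCP
nginx-ingress-ingress-nginx-defaultbackend   ClusterIP      10.0.5.2   <none>                   80/TCP'

# Column 2 is the service type; column 4 is the EXTERNAL-IP.
EXTERNAL_IP=$(printf '%s\n' "$services" | awk '$2 == "LoadBalancer" { print $4 }')
echo "Ingress address: ${EXTERNAL_IP}"

# Against a live cluster, replace the sample with:
#   services=$(kubectl get service -n <Namespace name> --no-headers)
```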

3.2.1.7 - Using Custom Certificates in Ingress

Steps to use your custom certificates with the Ingress Controller.

Protegrity Anonymization API uses certificates for secure communication with the client. You can use the certificates provided by Protegrity or use your own certificates. Complete the configurations provided in this section to use your custom certificates with the Ingress Controller.

Ensure that the certificates and keys are in the .pem format.

Note: Skip the steps provided in this section if you want to use the default Protegrity certificates for the Protegrity Anonymization API.

  1. Log in to the Base Machine where Ingress is configured and open a command prompt.

  2. Copy your certificates to the Base Machine.

    Note: Verify the certificates using the commands provided in the section Working with Certificates.

  3. Create a Kubernetes secret of the server certificate using the following command. The namespace used must be the same where the Protegrity Anonymization API application is to be deployed.

    kubectl create secret --namespace <namespace-name> generic <secret-name> --from-file=tls.crt=<path_to_certificate>/<certificate-name> --from-file=tls.key=<path_to_certificate>/<certificate-key>
    

    For example,

    kubectl create secret --namespace anon-ns generic anon-protegrity-tls --from-file=tls.crt=/tmp/cust_cert/anon-server-cert.pem --from-file=tls.key=/tmp/cust_cert/anon-server-key.pem
    
  4. Create a Kubernetes secret of the CA certificate using the following command. The namespace used must be the same where the Protegrity Anonymization API application is to be deployed.

    kubectl create secret --namespace <namespace-name> generic <secret-name> --from-file=ca.crt=<path_to_certificate>/<certificate-name>
    

    For example,

    kubectl create secret --namespace anon-ns generic ca-protegrity --from-file=ca.crt=/tmp/cust_cert/anon-ca-cert.pem
    
  5. Open the values.yaml file.

  6. Add the following host and secret code for the Ingress configuration at the end of the values.yaml file.

    ## Refer section in documentation for setting up and configuring NGINX-INGRESS before deploying the application.
    ingress:
      ## Add a host section with the hostname used as the CN while creating the server certificates.
      ## While creating the certificates, you can use *.protegrity.com as the CN and SAN, as in the example below.
      host: anon.protegrity.com                  # Update the host according to your server certificates.
    
      ## To terminate TLS on the Ingress Controller load balancer,
      ## a K8s TLS Secret containing the certificate and key must also be provided.
      secret: anon-protegrity-tls                # Update the secretName according to your secretName.
    
      ## To validate the client certificate against the above server certificate,
      ## create the secret of the CA certificate used to sign both the server and client certificates as shown in the example below.
      ca_secret: ca-protegrity                   # Update the ca-secretName according to your secretName.
    
      ingress_class: nginx-anon
    

    Note: Ensure that you replace the host, secret, and ca_secret attributes in the values.yaml file with the values as per your certificate.

    For more information about using custom certificates, refer to Updating the Configuration Files.
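Before creating the secrets, it is worth confirming that the server certificate and key actually form a matching pair. The openssl sketch below compares the moduli of the two files; for illustration it generates a throwaway self-signed pair, so point CERT and KEY at your real .pem files instead:

```shell
# For illustration only: generate a throwaway self-signed certificate and key.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=anon.protegrity.com" \
  -keyout "$dir/anon-server-key.pem" -out "$dir/anon-server-cert.pem" 2>/dev/null

CERT="$dir/anon-server-cert.pem"   # replace with your server certificate
KEY="$dir/anon-server-key.pem"     # replace with your server key

# An RSA certificate and key pair share the same modulus.
cert_mod=$(openssl x509 -noout -modulus -in "$CERT" | openssl md5)
key_mod=$(openssl rsa -noout -modulus -in "$KEY" | openssl md5)
if [ "$cert_mod" = "$key_mod" ]; then
  echo "certificate and key match"
else
  echo "MISMATCH: certificate and key do not pair"
fi
```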

3.2.1.8 - Updating the Configuration Files

Steps to update the Configuration Files.

Use the template files provided to specify the EKS settings for the Protegrity Anonymization API.

  1. Extract and update the files in the ANON-API_HELM_1.4.0.x.tgz package.

    The ANON-API_HELM_1.4.0.x.tgz package contains the values.yaml file that must be modified as per your requirements. It also contains the templates directory with yaml files.

    Note: Ensure that the necessary permissions for updating the files are assigned to the .yaml files.

  2. Navigate to the <path_to_helm>/templates directory and delete the anon-db-storage-aws.yaml file.

  3. Update the values.yaml file.

    Note: For more information about the values.yaml file, refer to values.yaml.

    1. Specify a namespace for the pods.

      namespace:
        name: anon-ns
      
    2. Specify the node name and zone information for the node as a prerequisite for the database pod and the Anon-Storage(MinIO) pod. Use the node name which is running in the same zone where the EBS is created.

      ## Prerequisite for setting up Database and Minio Pod.
      ## This is to handle any new DB pod getting created that uses the same persistence storage in case the running Database pod gets disrupted.
      ## This persistence also helps persist Anon-storage data.
      persistence:
        ## 1. Get the list of nodes in the cluster. CMD: kubectl get nodes
        ## 2. Get the node name which is running in the same zone where the external-storage is created. CMD: kubectl describe nodes
        nodename: "<Node_name>"                    # Update the Node name
      
        ## Fetch the zone in which the node is running using the `kubectl describe node/<nodename>` command or the following command.
        ## CMD: ` kubectl describe node/<nodename> | grep topology.kubernetes.io/zone | grep -oP 'topology.kubernetes.io/zone=\K[^ ]+' `
        zone: "<Zone in which the above node is running>"
      
        ## For an EKS cluster, supply the volumeID of the AWS EBS volume.
        ## For an AKS cluster, supply the subscriptionID of the Azure disk.
        dbstorageId: "<Provide dbstorage ID>"           # To persist database schemas.
        anonstorageId: "<Provide anonstorage ID>"       # To persist anonymized data.
      
    3. Update the repository information in the file. The Anon-Storage pod uses the MinIO Docker image quay.io/minio/minio:RELEASE.2022-10-29T06-21-33Z, which is pulled from the Public repository.

      image:
        minio_repo: quay.io/minio/minio                    # Public repo path for Minio Image.
        minio_tag: RELEASE.2022-10-29T06-21-33Z            # Tag name for Minio image.
      
        repository: <Repo_path>                            # Repo path for the Container Registry in Azure, GCP, AWS.
        anonapi_tag: <AnonImage_tag>                       # Tag name of the ANON-API Image.
        anonworkstation_tag: <WorkstationImage_tag>        # Tag name of the ANON-Workstation Image.
      
        pullPolicy: Always
      

      Note: Ensure that you update the repository, anonapi_tag, and anonworkstation_tag according to your container registry.

    4. MinIO uses an access key and a secret key for performing file operations. Protegrity provides a default set of credentials that are stored as part of the storage-creds secret. If you are creating your own secret, update the existingSecret parameter.

      anonstorage:
        ## Refer the following command for creating your own secret.
        ## CMD: kubectl create secret generic my-minio-secret --from-literal=rootUser=foobarbaz --from-literal=rootPassword=foobarbazqux
        existingSecret: ""                # Supply your secret Name for ignoring below default credentials.
        bucket_name: "anonstorage"        # Default bucket name for minio
        secret:
          name: "storage-creds"           # Secret to access minio-server
          access_key: "anonuser"          # Access key for minio-server
          secret_key: "protegrity"        # Secret key for minio-server
      
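The node name and zone required by the persistence section can be looked up with kubectl, as the comments in the file indicate. In the sketch below the kubectl commands are shown as comments, and the label parsing runs on a sample string so that the extraction itself can be verified (the zone value is illustrative):

```shell
# Against a live cluster:
#   kubectl get nodes
#   kubectl describe node/<nodename> | grep topology.kubernetes.io/zone

# Sample label output from `kubectl describe node` (illustrative).
labels='topology.kubernetes.io/zone=eu-west-1a other.label=x'

# Extract the value after "topology.kubernetes.io/zone=".
zone=$(printf '%s\n' "$labels" | grep -oE 'topology.kubernetes.io/zone=[^ ]+' | cut -d= -f2)
echo "zone: $zone"
```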

3.2.1.9 - Deploying the Protegrity Anonymization API to the EKS Cluster

Steps to deploy the Protegrity Anonymization API on the EKS cluster.

Complete the following steps to deploy the Protegrity Anonymization API on the EKS cluster.

  1. Navigate to the <path_to_helm>/templates directory and delete the anon-dbpvc-azure.yaml and the anon-storagepvc-azure.yaml files.

  2. Create the Protegrity Anonymization API namespace using the following command.

    kubectl create namespace <name>
    

    Note: Use the namespace name from the values.yaml file that is present in the Helm chart used in the previous section.

  3. Run the following command to deploy the pods.

    helm install <helm-name> /<path_to_helm> -n <namespace>
    
  4. Verify that the necessary pods and services are configured and running.

    1. Run the following command to verify the information for accessing the Protegrity Anonymization API externally on the cluster. The port mapping for accessing the UI is displayed after running the command.

      kubectl get service -n <namespace>
      
    2. Run the following command to verify the deployment.

      kubectl get deployment -n <namespace>
      
    3. Run the following command to verify the pods created.

      kubectl get pods -n <namespace>
      
    4. Run the following command to verify the pods.

      kubectl get pods -o wide -n <namespace>
      
  5. If you customized the values.yaml file, update the configuration using the following command.

    helm upgrade <helm name> /path/to/helmchart -n <namespace>
    
  6. If required, configure logging using the steps provided in the section Setting Up Logging for the Protegrity Anonymization API.

  7. Execute the following command to obtain the IP address of the service.

    kubectl get ingress -n <namespace>
    

3.2.1.10 - Viewing Protegrity Anonymization API Using REST

Steps to view the Protegrity Anonymization API service.

Use the URLs provided here for viewing the Protegrity Anonymization API service and pod details after you have successfully deployed the Protegrity Anonymization API.

In the hosts file, map the IP address of the Ingress to the hostname set in the Ingress configuration.

For more information about updating the hosts file, refer to step 2 of the section Enabling Custom Certificates From SDK.

Optionally, update the hostname of the Elastic Load Balancer (ELB) that is created by the NGINX Ingress Controller using the section Creating a DNS Entry for the ELB Hostname in Route53.

For more information about configuring the DNS, refer to the section Creating a DNS Entry for the ELB Hostname in Route53.

  1. Open a web browser.

  2. Use the following URL to view basic information about the Protegrity Anonymization API.

    https://anon.protegrity.com/

  3. Use the following URL to view the Swagger UI. The various Protegrity Anonymization APIs are visible on this page.

    https://anon.protegrity.com/anonymization/api/v1/ui

  4. Use the following URL to view the contractual information for the Protegrity Anonymization API.

    https://anon.protegrity.com/about
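The hosts-file mapping described at the top of this section can be sketched as follows. The IP address is a placeholder, and the lines that modify /etc/hosts and probe the service are commented out because they require root access and a reachable deployment:

```shell
# Replace with the address obtained from `kubectl get ingress -n <namespace>`.
INGRESS_IP="203.0.113.10"

# Compose the hosts-file entry mapping the ingress address to the
# hostname used in the Ingress configuration.
HOSTS_LINE="${INGRESS_IP} anon.protegrity.com"
echo "$HOSTS_LINE"

# Apply the mapping and probe the service (requires root and a live deployment):
#   sudo sh -c "echo '$HOSTS_LINE' >> /etc/hosts"
#   curl -k https://anon.protegrity.com/about
```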

3.2.1.11 - Creating Kubernetes Service Accounts and Kubeconfigs for Anonymization Cluster

Steps to create a Kubernetes service account and the role-based access control (RBAC) configuration.

A service account in the anonymization cluster namespace has access to the anonymization namespace. It might also have access to the whole cluster. These permissions for the service account allow the user to create, read, update, and delete objects in the anonymization Kubernetes cluster or the namespace. Additionally, the kubeconfig is required to access the service account using a token.

In this section, you create a Kubernetes service account and the role-based access control (RBAC) configuration manually using kubectl.

Ensure that the user has access to permissions for creating and updating the following resources in the Kubernetes cluster:

  • Kubernetes Service Accounts

  • Kubernetes Roles and Rolebindings

  • Optional: Kubernetes ClusterRoles and Rolebindings

  • Use the steps provided in the following link to create the namespace and assign the required permissions to the cluster.
    Creating the Service Account

  • Complete the steps provided in the following link to retrieve the tokens for the Protegrity Anonymization API service account and to create a kubeconfig with access to the service account.
    Obtaining the Tokens for the Service Account

Obtaining the Tokens for the Service Account

Complete the steps provided in this section to retrieve the tokens for the Protegrity Anonymization API service account and to create a kubeconfig with access to the service account.

  1. Open a command line interface on the base machine for running the configuration commands.

    Note: A copy of the commands is available in the kubconfigcmd.txt file in the rbac directory of the Protegrity Anonymization API package. Use the code from the file to run the commands.

  2. Set the environment variables for running the configuration commands using the following command.

    SERVICE_ACCOUNT_NAME=anon-service-account
    CONTEXT=$(kubectl config current-context)
    NAMESPACE=anon-namespace
    NEW_CONTEXT=anon-context
    
    SECRET_NAME=$(kubectl get serviceaccount ${SERVICE_ACCOUNT_NAME} -n ${NAMESPACE} --context ${CONTEXT} --namespace ${NAMESPACE} -o jsonpath='{.secrets[0].name}')
    TOKEN_DATA=$(kubectl get secret ${SECRET_NAME} -n ${NAMESPACE} --context ${CONTEXT} --namespace ${NAMESPACE} -o  jsonpath='{.data.token}')
    TOKEN=$(echo ${TOKEN_DATA} | base64 -d)
    

    Note: Ensure that you use the appropriate values as per your configuration in the above command.

  3. Set the token in the config credentials using the following command.

    kubectl config set-credentials <username> --token=$TOKEN
    

    For example,

    kubectl config set-credentials test-user --token=$TOKEN
    
  4. Retrieve the cluster name using the following command.

    kubectl config get-clusters
    
  5. Set the context in kubeconfig using the following command.

    kubectl config set-context ${NEW_CONTEXT} --cluster=<name of your cluster> --user=test-user
    
  6. Set the current context to use the new anonymization config using the following command.

    kubectl config use-context ${NEW_CONTEXT}
    
  7. Verify the new context using the following command.

    kubectl config current-context
    
  8. Verify the status of the pods using the following command.

    kubectl get pods -n <namespace>
    

Creating the Service Account

Use the steps provided in this section to create the namespace and assign the required permissions to the cluster.

  1. Create the Kubernetes Service Account using the following steps.

    1. Navigate to the rbac directory of the extracted Protegrity Anonymization API package.

    2. Open the anon-service-account.yaml file using a text editor.

    3. Update the namespace as per your configuration in the anon-service-account.yaml file.

    4. Save and close the file.

    5. From a command prompt, navigate to the rbac directory and run the following command to create the service account.

      kubectl apply -f anon-service-account.yaml
      
  2. Grant the appropriate permission to the service account using one of the following options.

    • Grant cluster-admin permissions for the service account to all the namespaces using the following steps.

      Note: You need to run this step only if you want to grant the service account access to all namespaces in your cluster.

      A Kubernetes ClusterRoleBinding is available at the cluster level, but the subject of the ClusterRoleBinding exists in a single namespace. Hence, you must specify the namespace for the service account.

      1. Navigate to the rbac directory of the extracted Protegrity Anonymization API package.

      2. Open the anon-clusterrolebinding.yaml file using a text editor.

      3. Update the namespace as per your configuration in the anon-clusterrolebinding.yaml file.

      4. Save and close the file.

      5. From a command prompt, navigate to the rbac directory and run the following command to assign the appropriate permissions.

        kubectl apply -f anon-clusterrolebinding.yaml
        
    • Grant namespace-specific permissions to the service account using the following steps.

      Note: You need to run this step only if you want to grant the service account access to just the Protegrity Anonymization API namespace.

      Ensure that you create a role with a set of permissions and rolebinding for attaching the role to the service account.

      1. Navigate to the rbac directory of the extracted Protegrity Anonymization API package.

      2. Open the anon-role-and-rolebinding.yaml file using a text editor.

      3. Update the namespace, role, and service account name as per your configuration in the anon-role-and-rolebinding.yaml file.

      4. Save and close the file.

      5. From a command prompt, navigate to the rbac directory and run the following command to assign the appropriate permissions.

        kubectl apply -f anon-role-and-rolebinding.yaml
        
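The anon-service-account.yaml file ships in the rbac directory and is not reproduced in this guide. Purely as a hedged illustration, a minimal service-account manifest of the kind step 1 edits might look like the following; the name and namespace match the environment variables used in Obtaining the Tokens for the Service Account:

```yaml
# Illustrative sketch only; the packaged anon-service-account.yaml is authoritative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: anon-service-account
  namespace: anon-namespace
```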

3.2.2 - Anonymizing Using Azure Kubernetes Service (AKS)

3.2.2.1 - Set up Anonymization API on Azure Kubernetes Service (AKS)

Steps to set up Anonymization API on Azure Kubernetes Service (AKS).

To set up and use the Protegrity Anonymization API on Azure, follow the steps provided in this section.

3.2.2.2 - Preparing the Base Machine

Steps to prepare the base machine for working with Azure Kubernetes Service (AKS).

Install the Azure CLI and login to your account to work with Protegrity Anonymization API on the Azure Cloud.

  1. Install and initialize the Azure CLI on your system.

    For more information about the installation steps, refer to How to install the Azure CLI.

  2. Login to your account using the following command from a command prompt.

    az login
    
  3. Sign in to your account.

    The configuration complete message appears.

    Azure Configuration

  4. Install kubectl version 1.22, the command-line interface for Kubernetes.

    Kubectl enables you to run commands from the Linux instance so that you can communicate with the Kubernetes cluster.

    For more information about installing kubectl, refer to Set up Kubernetes tools on your computer.

  5. Install the Helm client version 3.8.2 for working with Kubernetes clusters.

    For more information about installing the Helm client, refer to Installing Helm.

3.2.2.3 - Creating a Kubernetes Cluster

Steps to create a Kubernetes Cluster on Azure.

This section describes how to create a Kubernetes Cluster on Azure.

Note: The steps listed in this procedure for creating a Kubernetes cluster are provided for reference. If you have an existing Kubernetes cluster or want to create a Kubernetes cluster based on your own requirements, then you can directly navigate to the section Accessing the AKS Cluster to connect your Kubernetes cluster and the Linux instance.

To create a Kubernetes cluster:

  1. Log in to the Azure environment.

  2. Click the Portal menu icon.

    The Portal menu appears.

  3. Navigate to All Services > Kubernetes services.

    The Kubernetes Services screen appears.

    Kubernetes Services screen

  4. Click Add.

    The Create Kubernetes cluster screen appears.

    Create Kubernetes cluster screen

  5. In the Resource group field, select the required resource group.

  6. In the Kubernetes cluster name field, specify a name for your Kubernetes cluster.

    Retain the default values for the remaining settings.

  7. Click Review + create to validate the configuration.

  8. Click Create to create the Kubernetes cluster.

    The Kubernetes cluster is created.

3.2.2.4 - Accessing the AKS Cluster

Steps to access the Kubernetes Cluster.

Connect to the cloud service using the steps in this section.

  1. Log in to the Linux instance and run the following command to connect your Base machine to the Kubernetes cluster.

    az aks get-credentials --resource-group <Name_of_Resource_group> --name <Name_of_Kubernetes_Cluster>
    

    The Base machine is now connected with the Kubernetes cluster. You can now run commands using the Kubernetes command line interface (kubectl) to control the nodes on the Kubernetes cluster.

  2. Validate whether the cluster is up by running the following command.

    kubectl get nodes
    

    The command lists the Kubernetes nodes available in your cluster.

3.2.2.5 - Uploading the Image to the Azure Container Registry

Steps to upload the Docker image to the Azure Container Registry (ACR).

Use the information in this section to upload the Docker image to the Azure Container Registry (ACR) for running the Protegrity Anonymization API in AKS.

Note: For more information about creating the Azure Container Registry, refer to Create an Azure container registry using the Azure portal.

To install the Protegrity Anonymization API:

  1. Log in to the machine as an administrator to install the Protegrity Anonymization API.

  2. Install Docker using the steps provided at https://docs.docker.com/engine/install/.

  3. Configure Docker to push the Protegrity Anonymization API images to the Azure Container Registry (ACR) by running the following command:

    docker login <Container_registry_name>.azurecr.io
    
  4. Obtain and extract the Protegrity Anonymization API files to a directory on your system.

    1. Download and extract the ANON-API_DEB-ALL-64_x86-64_Docker-ALL-64_1.4.0.x.tgz file.

    2. Open the directory and extract the ANON-API_DEB-ALL-64_x86-64_Docker-ALL-64_1.4.0.x.tar file.

    3. Extract the contents of the ANON-REST-API_1.4.0.x.tgz file to a directory.

      Note: Do not extract the ANON-API_1.4.0.x.tar.gz package obtained after this extraction. The docker load command is run directly on this package.

  5. Navigate to the directory where the ANON-API_1.4.0.x.tar.gz file is saved.

  6. Load the Docker image into Docker by using the following command:

    docker load < ANON-API_1.4.0.x.tar.gz
    
  7. List the images that are loaded by using the following command:

    docker images
    
  8. Tag the image to the ACR repository by using the following command:

    docker tag <Container_image>:<Tag> <Container_registry_path>/<Container_image>:<Tag>
    

    For example:

    docker tag ANON-API_1.4.0.x:anon_AZ <container_registry_name>.azurecr.io/anon:anon_AZ
    
  9. Push the tagged image to the ACR by using the following command:

    docker push <Container_registry_path>/<Container_image>:<Tag>
    

    For example:

    docker push <container_registry_name>.azurecr.io/anon:anon_AZ
    

    Note: Ensure that the appropriate path for the image registry along with the tag is updated in the values.yaml file.

  10. Extract ANON-NOTEBOOK_1.4.0.x.tgz to obtain the ANON-NOTEBOOK_1.4.0.x.tar.gz file, and then repeat steps 5 to 9 for the ANON-NOTEBOOK_1.4.0.x.tar.gz file.

The image is loaded to the ACR and is ready for deployment.

3.2.2.6 - Creating an Azure Disk

Steps to create an Azure disk.

Complete the steps provided here to create an Azure disk and obtain the subscription ID.

To create the Azure disk:

  1. Refer to Create and use a volume with Azure Disks in Azure Kubernetes Service (AKS) and complete the steps provided in the section Create an Azure disk.

    The command for creating the Azure disk is provided here; update the values according to your setup:

    az disk create \
      --resource-group <Resource_Group_Name> \
      --name <Disk_Name> \
      --size-gb 20 \
      --location <Location_of_any_node_in_cluster> \
      --zone <Zone_of_the_node_in_cluster> \
      --query id --output tsv
    

    Note: Ensure that you create two disks, one for database persistence and one for Anon-Storage.

  2. Note the subscription IDs of the Azure disks that you created. The subscription IDs are required later for configuring the persistent disks.

3.2.2.7 - Setting up NGINX Ingress Controller

Steps to install the NGINX Ingress Controller.

Complete the steps provided here for installing the NGINX Ingress Controller on the base machine.

  1. Log in to the base machine and open a command prompt.

  2. Create a namespace where the NGINX Ingress Controller needs to be deployed using the following command.

    kubectl create namespace <Namespace name>
    

    For example,

    kubectl create namespace nginx
    
  3. Add the repositories from which the Helm charts for installing the NGINX Ingress Controller are fetched using the following commands.

    helm repo add stable https://charts.helm.sh/stable
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    
  4. Install the NGINX Ingress Controller with Helm charts using the following command.

    helm install nginx-ingress --namespace <Namespace name> --set controller.replicaCount=1 --set controller.nodeSelector."beta\.kubernetes\.io/os"=linux --set defaultBackend.nodeSelector."beta\.kubernetes\.io/os"=linux ingress-nginx/ingress-nginx --set controller.publishService.enabled=true --set controller.ingressClassResource.name=<NGINX ingress class name> --set podSecurityPolicy.enabled=true --set rbac.create=true --set controller.extraArgs.enable-ssl-passthrough="true" --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-internal"=\"true\" --version 4.3.0
    

    For example,

    helm install nginx-ingress --namespace nginx --set controller.replicaCount=1 --set controller.extraArgs.enable-ssl-passthrough="true" --set controller.nodeSelector."beta\.kubernetes\.io/os"=linux --set defaultBackend.nodeSelector."beta\.kubernetes\.io/os"=linux ingress-nginx/ingress-nginx --set controller.publishService.enabled=true --set controller.ingressClassResource.name=nginx-anon --set podSecurityPolicy.enabled=true --set rbac.create=true --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-internal"=\"true\" --version 4.3.0
    

    For more information about the various configuration parameters for installing the NGINX Ingress Helm charts, refer to values.yaml file.

  5. Check the status of the nginx-ingress release and verify that all the deployments are running correctly using the following command.

    kubectl get pods -n <Namespace name>
    

    For example,

    kubectl get pods -n nginx
    

    Note: Record the pod name. It is required as a parameter in the next step.

  6. View the logs on the Ingress pod using the following command.

    kubectl logs pod/<pod-name> -n <Namespace name>
    
  7. Obtain the external IP of the nginx service by executing the following command.

    kubectl get service --namespace <Namespace name>
    

    For example,

    kubectl get service -n nginx
    

    Note: Record the IP address. It is required for configuring the Protegrity Anonymization API SDK.

3.2.2.8 - Using Custom Certificates in Ingress

Steps to use custom certificates with the Ingress Controller.

Protegrity Anonymization API uses certificates for secure communication with the client. You can use the certificates provided by Protegrity or use your own certificates. Complete the configurations provided in this section to use your custom certificates with the Ingress Controller.

Ensure that the certificates and keys are in the .pem format.

Note: Skip the steps provided in this section if you want to use the default Protegrity certificates for the Protegrity Anonymization API.

  1. Log in to the Base Machine where Ingress is configured and open a command prompt.

  2. Copy your certificates to the Base Machine.

    Note: Verify the certificates using the commands provided in the section Working with Certificates.

  3. Create a Kubernetes secret for the server certificate using the following command. The namespace used must be the same as the one where the Protegrity Anonymization API application is to be deployed.

    kubectl create secret --namespace <namespace-name> generic <secret-name> --from-file=tls.crt=<path_to_certificate>/<certificate-name> --from-file=tls.key=<path_to_certificate>/<certificate-key>
    

    For example,

    kubectl create secret --namespace anon-ns generic anon-protegrity-tls --from-file=tls.crt=/tmp/cust_cert/anon-server-cert.pem --from-file=tls.key=/tmp/cust_cert/anon-server-key.pem
    
  4. Create a Kubernetes secret for the CA certificate using the following command. The namespace used must be the same as the one where the Protegrity Anonymization API application is to be deployed.

    kubectl create secret --namespace <namespace-name> generic <secret-name> --from-file=ca.crt=<path_to_certificate>/<certificate-name>
    

    For example,

    kubectl create secret --namespace anon-ns generic ca-protegrity --from-file=ca.crt=/tmp/cust_cert/anon-ca-cert.pem
    
  5. Open the values.yaml file.

  6. Add the following host and secret configuration for the Ingress at the end of the values.yaml file.

    ## Refer section in documentation for setting up and configuring NGINX-INGRESS before deploying the application.
    ingress:
      ## Add host section with the hostname used as CN while creating server certificates.
      ## While creating the certificates you can use *.protegrity.com as CN and SAN used in below example
      host: anon.protegrity.com                  # Update the host according to your server certificates.

      ## To terminate TLS on the Ingress Controller Load Balancer.
      ## K8s TLS Secret containing the certificate and key must also be provided.
      secret: anon-protegrity-tls                # Update the secretName according to your secretName.

      ## To validate the client certificate with the above server certificate
      ## Create the secret of the CA certificate used to sign both the server and client certificate as shown in example below
      ca_secret: ca-protegrity                    # Update the ca-secretName according to your secretName.
    
      ingress_class: nginx-anon
    

    Note: Ensure that you replace the host, secret, and ca_secret attributes in the values.yaml file with the values as per your certificate.

    For more information about using custom certificates, refer to Updating the Configuration Files.

3.2.2.9 - Updating the Configuration Files

Steps to update configuration files.

Use the template files provided to specify the AKS settings for the Protegrity Anonymization API.

  1. Create the Protegrity Anonymization API namespace using the following command.

    kubectl create namespace <name>
    

    Note: Update and use the namespace name from the values.yaml file that is present in the Helm chart.

  2. Extract and update the files in the ANON-API_HELM_1.4.0.x.tgz package.

    The ANON-API_HELM_1.4.0.x.tgz package contains the values.yaml file that must be modified as per your requirements. It also contains the templates directory with yaml files.

    Note: Ensure that you have the necessary permissions to update the .yaml files.

  3. Navigate to the <path_to_helm>/templates directory and delete the anon-dbpvc-aws.yaml and the anon-storagepvc-aws.yaml files.

  4. Update the values.yaml file.

    Note: For more information about the values.yaml file, refer to values.yaml.

    1. Specify a namespace for the pods.

      namespace:
        name: anon-ns
      
    2. Specify the node name and zone information for the node as a prerequisite for the database pod and the Anon-Storage (MinIO) pod. Use a node that is running in the same zone where the AKS cluster is created.

      ## Prerequisite for setting up Database and Minio Pod.
      ## This is to handle any new DB pod getting created that uses the same persistence storage in case the running Database pod gets disrupted.
      ## This persistence also helps persist Anon-storage data.
      persistence:
        ## 1. Get the list of nodes in the cluster. CMD: kubectl get nodes
        ## 2. Get the node name which is running in the same zone where the external-storage is created. CMD: kubectl describe nodes
        nodename: "<Node_name>"                    # Update the Node name
      
        ## Fetch the zone in which the node is running using the `kubectl describe node/nodename` command or the following command.
        ## CMD: ` kubectl describe node/<nodename> | grep topology.kubernetes.io/zone | grep -oP 'topology.kubernetes.io/zone=\K[^ ]+' `
        zone: "<Zone in which above Node is running>"
      
        ## For EKS cluster, supply the volumeID of the aws-ebs
        ## For AKS cluster, supply the subscriptionID of the azure-disk
        dbstorageId: "<Provide dbstorage ID>"           # To persist database schemas.
        anonstorageId: "<Provide anonstorage ID>"       # To persist Anonymized data.
      
    3. Update the repository information in the file. The Anon-Storage pod uses the MinIO Docker image quay.io/minio/minio:RELEASE.2022-10-29T06-21-33Z, which is pulled from the Public repository.

      image:
        minio_repo: quay.io/minio/minio                    # Public repo path for Minio Image.
        minio_tag: RELEASE.2022-10-29T06-21-33Z            # Tag name for Minio image.
      
        repository: <Repo_path>                            # Repo path for the Container Registry in Azure, GCP, AWS.
        anonapi_tag: <AnonImage_tag>                       # Tag name of the ANON-API Image.
        anonworkstation_tag: <WorkstationImage_tag>        # Tag name of the ANON-Workstation Image.
      
        pullPolicy: Always
      

      Note: Ensure that you update the repository, anonapi_tag, and anonworkstation_tag according to your container registry.

    4. MinIO uses an access key and a secret key for performing file operations. Protegrity provides a default set of credentials that are stored as part of the secret storage-creds. If you are creating your own secret, then update the existingSecret section.

      anonstorage:
        ## Refer the following command for creating your own secret.
        ## CMD: kubectl create secret generic my-minio-secret --from-literal=rootUser=foobarbaz --from-literal=rootPassword=foobarbazqux
        existingSecret: ""                # Supply your secret Name for ignoring below default credentials.
        bucket_name: "anonstorage"        # Default bucket name for minio
        secret:
          name: "storage-creds"           # Secret to access minio-server
          access_key: "anonuser"          # Access key for minio-server
          secret_key: "protegrity"        # Secret key for minio-server
      
  5. Extract the values.yaml Helm chart from the package.

  6. Uncomment the following parameters and update the secret name in the values.yaml file.

    ## This section is required if the image is getting pulled from the Azure Container Registry
    ## create image pull secrets and specify the name here.
    ## remove the [] after 'imagePullSecrets:' once you specify the secrets
    #imagePullSecrets: []
    #  - name: regcred
    
  7. Perform the following steps to enable communication between the Kubernetes cluster and the Azure Container Registry.

    1. Run the following command from a command prompt to log in.

      docker login
      
    2. Specify your ACR access credentials.

  8. Create the secret for Azure by using the following command.

    kubectl create secret generic regcred --from-file=.dockerconfigjson=<PATH_TO_DOCKER_CONFIG>/config.json --type=kubernetes.io/dockerconfigjson --namespace <NAMESPACE>
    

3.2.2.10 - Deploying the Protegrity Anonymization API to the AKS Cluster

Steps to deploy the Protegrity Anonymization API to the AKS cluster.

Deploy the pods using the steps in the following section.

  1. Run the following command to deploy the pods.

    helm install <helm-name> /<path_to_helm> -n <namespace>
    
  2. Verify that the necessary pods and services are configured and running.

    1. Run the following command to view the information for accessing the Protegrity Anonymization API externally on the cluster. The port mapping for accessing the UI is displayed after running the command.

      kubectl get service -n <namespace>
      
    2. Run the following command to verify the deployment.

      kubectl get deployment -n <namespace>
      
    3. Run the following command to verify the pods created.

      kubectl get pods -n <namespace>
      
    4. Run the following command to view detailed pod information, including the nodes on which the pods are running.

      kubectl get pods -o wide -n <namespace>
      
  3. Execute the following command to obtain the IP address of the service.

    kubectl get ingress -n <namespace>
    

The container is now ready to process Protegrity Anonymization API requests.

3.2.2.11 - Viewing Protegrity Anonymization API Using REST

Steps to view the Protegrity Anonymization API service and pod details.

Use the URLs provided here for viewing the Protegrity Anonymization API service and pod details after you have successfully deployed the Protegrity Anonymization API.

You need to map the IP address of the Ingress to the host name set in the Ingress configuration in your hosts file.

For more information about updating the hosts file, refer to step 2 of the section Enabling Custom Certificates From SDK.

  1. Open a web browser.

  2. Use the following URL to view basic information about the Protegrity Anonymization API.

    https://anon.protegrity.com/

  3. Use the following URL to view the Swagger UI. The various Protegrity Anonymization APIs are visible on this page.

    https://anon.protegrity.com/anonymization/api/v1/ui

  4. Use the following URL to view the contractual information for the Protegrity Anonymization API.

    https://anon.protegrity.com/about

3.3 - Installing Using Docker Containers

Deploy the Protegrity Anonymization API using Docker Containers.

Complete the following steps to run the Protegrity Anonymization API on a host machine.

Ensure that you have completed the following prerequisites before deploying the Protegrity Anonymization API.

  1. Install Docker using the steps provided at https://docs.docker.com/engine/install/.
  2. Install Docker Compose using the steps provided at https://docs.docker.com/compose/install/.

To install the Protegrity Anonymization API:

  1. Log in to the machine as an administrator to install the Protegrity Anonymization API.

  2. Obtain and extract the Protegrity Anonymization API files to a directory on your system.

    1. Download and extract the ANON-API_DEB-ALL-64_x86-64_Docker-ALL-64_1.4.0.x.tgz file.
    2. Verify that the following files are available in the package:
      • ANON-REST-API_1.4.0.x.tgz: The files for working with the Protegrity Anonymization REST API.
      • ANON-NOTEBOOK_1.4.0.x.tgz: The files for the Protegrity Anonymization API notebook.
  3. Extract the ANON-REST-API_1.4.0.x.tgz file.

  4. Run the following command to load the API container:

    docker load < ANON-API_1.4.0.x.tar.gz
    
  5. Verify that the image is successfully loaded using the following command:

    docker images
    
  6. Navigate to the directory where the ANON-NOTEBOOK_1.4.0.x.tgz file is saved.

  7. Extract the ANON-NOTEBOOK_1.4.0.x.tgz file.

  8. Run the following command to load the container:

    docker load < ANON-NOTEBOOK_1.4.0.x.tar.gz
    
  9. Verify that the image is successfully loaded using the following command:

    docker images
    

    Note the image ID for the ANON-API and ANON-NOTEBOOK containers.

  10. Navigate to the directory where the contents of the ANON-REST-API_1.4.0.x.tgz file are extracted.

  11. Update the docker/docker-compose.yaml file for the configuration that you require, such as the image ID.

    Update the image tags for scheduler, anon, anondb, and pty-worker with the details of the Anon API Image.

    Update the image tags for minio with the details of the Anon-Storage Image and workstation with the details of the Anon Workstation Image.

    Note: If required, increase the replicas parameter in the pty-worker section.

    An extract of the docker-compose.yaml file with the details updated is provided here as an example. Update the file based on your configuration.

    version: "3.1"
    
    services:
      anonstorage:
        image: quay.io/minio/minio:RELEASE.2022-10-29T06-21-33Z       # Minio Image pulled from Public repo
    
    .
    . <existing configuration>
    .
    
        environment:                           # Protegrity default credentials for communicating with MinIO
          MINIO_ROOT_USER: anonuser
          MINIO_ROOT_PASSWORD: protegrity
    
    .
    . <existing configuration>
    .
    
    
      scheduler:
        image: anonapi-1.4.0.x:latest
    
    .
    . <existing configuration>
    .
    
      anon:
        image: anonapi-1.4.0.x:latest
    
    . 
    . <existing configuration>
    .
    
      pty-worker:
        image: anonapi-1.4.0.x:latest
    
    . 
    . <existing configuration>
    .
    
      anondb:
        image: anonapi-1.4.0.x:latest
    
    .
    . <existing configuration>
    .
    
      nginx-proxy:
        image: nginx:1.20.1
    
    .
    . <existing configuration>
    .
    
      workstation:
        image: anonworkstation-1.4.0.x:latest
        restart: unless-stopped
        hostname: workstation
        container_name: pty-workstation
        # extra_hosts: #### Uncomment and edit this section for using jupyter-workstation to send requests to Protegrity Anonymization-API
        # - "anon.protegrity.com: <IP_of_host_machine>"
    .
    . <existing configuration>
    .
    

    Note: You can specify the IMAGE ID instead of the REPOSITORY:TAG for the image attribute.

  12. Configure the Protegrity Anonymization API to use your custom SSL certificates, if required.

    Note: The Protegrity Anonymization API provides its own set of certificates for SSL communication. Complete this step only to use custom certificates. Ensure you have the trusted CA .pem file, server certificate, and server key. The server certificate must be signed by the trusted CA.

    Only .pem files are supported by the Protegrity Anonymization API.

    Docker Compose mounts the certificate files from the cert directory relative to the compose file, under the nginx-proxy section, as shown here.

    ./cert:/.cert/:Z
    

    You can mount the directory where you have obtained the trusted CA files or you can replace the certificates in the default directory.

  13. Deploy the Protegrity Anonymization API to Docker using the following command.

    docker-compose -f /path/to/docker-compose.yaml up -d
    
  14. Verify that the Docker containers are running using the following command.

    docker ps
    
  15. Update the hosts file with an entry mapping the IP address to anon.protegrity.com.

    Alternatively, update the server_name property in the nginx.conf file.

    server_name anon.protegrity.com;
    
  16. Update the host name to match the host name provided in the nginx-proxy configuration and your certificate.

  17. Update the hosts file with the following code.

    <IP of Docker Host> <host name from nginx.conf>
    

    For example,

    192.168.1.120 anon.protegrity.com
    

The Protegrity Anonymization API is now visible using the Swagger UI. Use the URLs provided here to view the Protegrity Anonymization API using REST.

  • Use the following URL to view basic information about the Protegrity Anonymization API.

    https://<Hostname>/

    Note: The default Hostname is anon.protegrity.com. Ensure that you use the Hostname that you provided to access the Protegrity Anonymization API.

  • Use the following URL to view the Swagger UI. The various Protegrity Anonymization APIs are visible on this page.

    https://<Hostname>/anonymization/api/v1/ui

  • Use the following URL to view the contractual information for the Protegrity Anonymization API.

    https://<Hostname>/about

4 - Using Protegrity Anonymization

This section explains the REST APIs provided by Protegrity Anonymization. It also details the method for creating and running Protegrity Anonymization SDK requests.

4.1 - Creating Protegrity Anonymization requests

This section walks you through the process of creating Protegrity Anonymization requests to anonymize your data. It describes the steps for using the REST API and creating Protegrity Anonymization Python SDK requests.

A general overview of the process you need to follow to anonymize the data is provided in the following steps:

  1. Identify the dataset that needs to be anonymized.
  2. Analyze and classify the various fields available in the dataset. The following classifications are available:
    • Direct Identifiers
    • Quasi-Identifiers
    • Sensitive Attributes
    • Non-Sensitive Attributes
  3. Determine the use case by specifying the data that is required for further analysis.
  4. Specify the quasi-identifiers and other fields that are not required in the dataset.
  5. Specify the required anonymization methods for the data. Some commonly used methods are as follows:
    • Generalization
    • Micro-Aggregation
  6. Specify the acceptable statistics and risk levels for the data fields, and measure them before running the anonymization job.

Note: For more information about different risk levels for the data fields, refer to Anonymization models.

  7. Verify that the anonymized data satisfies the acceptable risk threshold level.
  8. Measure the quality of the anonymized data by comparing it with the original data. If the quality does not meet your standards, refine the configuration and rerun the job, or discard the output.
  9. Save the anonymized data to an output file.

The anonymized data can now be used for further analysis and as input for machine learning software.
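The classification step above can be sketched as a small configuration structure. This is an illustrative sketch only: the key names and field names below are hypothetical and do not reflect the actual Protegrity request schema.

```python
# Hypothetical sketch of classifying dataset fields for an anonymization
# request; key names are illustrative, not the Protegrity request schema.
request_config = {
    "source": "patients.csv",
    "fields": {
        "ssn": {"classification": "direct_identifier"},
        "zip_code": {"classification": "quasi_identifier", "method": "generalization"},
        "birthdate": {"classification": "quasi_identifier", "method": "generalization"},
        "diagnosis": {"classification": "sensitive_attribute"},
        "visit_notes": {"classification": "non_sensitive_attribute"},
    },
}

def quasi_identifiers(config):
    """Return the field names classified as quasi-identifiers."""
    return sorted(
        name
        for name, spec in config["fields"].items()
        if spec["classification"] == "quasi_identifier"
    )

print(quasi_identifiers(request_config))  # ['birthdate', 'zip_code']
```

As the introduction notes, risk metrics are only meaningful if this classification is correct, so it is worth reviewing before submitting a job.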

4.2 - Working with Protegrity Anonymization APIs

The various APIs provided with Protegrity Anonymization are described here.

To use the Protegrity Anonymization Python SDK, install it and import the anonsdk module. The AnonElement is an essential part of the Protegrity Anonymization Python SDK. For more information about the AnonElement object, refer to Understanding the AnonElement object.

The following table shows the list of REST APIs and Python SDK requests:

List of APIs                             REST APIs    Python SDK
Anonymization Functions
Anonymize                                Yes          Yes
Apply Anonymize                          Yes          Yes
Measure                                  Yes          Yes
Task Monitoring APIs
Get Job IDs                              Yes          Yes
Get Job Status                           Yes          Yes
Get Metadata                             Yes          Yes
Abort                                    Yes          Yes
Delete                                   Yes          Yes
Statistics APIs
Get Exploratory Statistics               Yes          Yes
Get Risk Metric                          Yes          Yes
Get Utility Statistics                   Yes          Yes
Detection APIs
Get Data Domains                         Yes          No *1
Detect Anonymization Information         Yes          No *1
Detect Classification                    Yes          No *1
Detect Hierarchy                         Yes          No *1

*1 - It is not applicable for Protegrity Anonymization Python SDK.

4.2.1 - Understanding Protegrity Anonymization REST APIs

The following APIs are available with Protegrity Anonymization REST API. You can run these APIs using the command line with the curl command. You can also run them using the Swagger UI or a tool like Postman.

Before running the anonymization jobs mentioned in the Protegrity Anonymization REST APIs section below, the following prerequisites must be completed:

  • Ensure that the Anonymization machine is set up and is configured as "https://anon.protegrity.com/".
    For more information about setting up and configuring an Anonymization machine for AWS and Azure, refer to AWS and Azure.
  • Ensure that the disk is not full and enough free space is available for saving the destination file.
  • Verify the destination file is not in use. Set the required permissions for creating and modifying the destination file.
  • Verify that the anonymization job exists.

You can use different sample requests to build and run the anonymization APIs. For more information about the sample requests for REST APIs, refer to Sample Requests for Protegrity Anonymization.
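The endpoints used throughout this guide can be assembled from the configured host, which makes switching to a custom hostname a one-line change. The hostname below is the default from this guide; the commented request at the end is illustrative only.

```python
# Build the Protegrity Anonymization API URLs from the configured host,
# so a custom hostname only needs to be changed in one place.
BASE = "https://anon.protegrity.com"   # default host from this guide

endpoints = {
    "info": f"{BASE}/",
    "swagger_ui": f"{BASE}/anonymization/api/v1/ui",
    "about": f"{BASE}/about",
}

for name, url in endpoints.items():
    print(f"{name}: {url}")

# To actually probe the service (illustrative; requires network access
# and the CA certificate used by your deployment):
# import requests
# requests.get(endpoints["about"], verify="anon-ca-cert.pem")
```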

Anonymization Functions

The Anonymization Functions APIs are used to run the anonymization job.

Anonymize

The Anonymize API is used to start an anonymize operation.

For more information about the anonymize API, refer to Submit a new anonymization job.

Note: When you run the job, an empty destination file is created. This file is created during processing for verifying the necessary destination permissions. Avoid using this file until the anonymization job is complete.

Ensure that the anonymized data file and the logs generated are moved to a different system before deleting your environment.

If the source file is larger than the maximum limit that is allowed on the Cloud environment, then run the anonymization request with "additional_properties": { "single_file": "no" }.
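In a request payload, the setting above is a nested JSON object. Only the additional_properties entry comes from this guide; the surrounding payload keys in this sketch are illustrative.

```python
import json

# Sketch of the payload fragment for large source files: single_file "no"
# (from the note above) lets the output be split across multiple files.
# The "source" key here is illustrative, not the documented schema.
payload = {
    "source": {"file": "large_dataset.csv"},          # illustrative
    "additional_properties": {"single_file": "no"},   # from the note above
}

print(json.dumps(payload["additional_properties"]))
```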

Apply Anonymize

The Apply Anonymize API is used as a template to anonymize additional entries. This API lets you reuse an existing configuration to process additional data. This is especially useful in machine learning for training the system to anonymize new data points.

Note: In this API, privacy model parameters are ignored while performing the anonymization for the new entry.

For more information about the apply anonymize API, refer to Apply anonymization config to a given dataset.

Measure

The Measure API is used to measure or obtain anonymization result statistics for different configurations before running the actual anonymization job.

For more information about the Measure API, refer to Submit a new anonymization Measure job.

Task Monitoring APIs

The Task Monitoring APIs are used to monitor the anonymization job. Use these APIs to obtain the job status, retrieve a job, and abort a job.

Get Job IDs

The Get Job IDs API is used to get the job IDs of the last 20 anonymization operations that are running, in queue, or completed. You can then use the required job ID with the other APIs to work with the anonymization job.

For more information about the job ID API, refer to Obtain job ids.

Get Job Status

The Get Job Status API is used to get the status of an anonymize operation that is running, in queue, or completed. It shows the percentage of the job that is completed. Use this information to monitor whether a job is running or stalled.

For more information about the job status API, refer to Obtain job status.

Get Job Status API Parameters

Use this API to get the status of an anonymize operation that is running. It shows the percentage of the job completed. Use this information to monitor whether a job is running or stalled.

Function: status()
Parameters: None
Return Type: A string with the status information in the JSON format. The string contains the following fields:

  • completed: Information about the job, such as data, statistics, summary, and time spent.
  • id: The job ID.
  • info: Information about the job being processed, such as the source and attributes for the job.
  • running: The completion status of the jobs being processed. It shows the percentage of the job completed.
  • status: The status of the job, such as running or completed.

Note: This API displays all the status information of the job. To obtain the ID of a job, use job.id().
Sample Request: job.status()

Get Metadata

The Get Metadata API is used to retrieve the metadata for the existing job. This API is useful when you need to view the configuration available for a job. It displays the fields, configuration, and the data that is used to run the anonymization job.

For more information about the metadata API, refer to Obtain job metadata.

Retrieve Anonymized Data API Parameters

Use this API to retrieve the results of an anonymized job.

Function: result()
Parameters: None
Return Type: Returns the AnonResult element, which provides the DataFrame for the anonymized data.

Note: The result.df will be None if you have overridden the resultstore as part of the anonymize method.

Sample Request: job.result()

Note: This is a blocking API and will stall processing until the job is complete.

Abort

The Abort API is used to abort a running anonymization job. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

For more information about the abort API, refer to Abort a running anonymization job.

Note: After aborting the task, it might take time before all the running processes are stopped.

Abort API Parameters

Use this API to abort a running anonymize operation. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

Function: abort()
Parameters: None
Return Type: A string with the status of the abort request.
Sample Request: job.abort()

Delete

The Delete API is used to delete an existing job that is no longer required.

For more information about the delete API, refer to Delete a job.

Statistics APIs

The Statistics APIs are used to obtain information about the anonymization data. Use these APIs to obtain the risk and utility information about the anonymized data and to measure the utility benefits and risks of publishing it. If the results are not satisfactory, modify the relevant parameters based on these statistics and resubmit the anonymization job.

Get Exploratory Statistics

The Get Exploratory Statistics API is used to obtain data distribution statistics about a completed anonymization job.

For more information about the exploratory statistic API, refer to Obtain the exploratory statistics.

Get Exploratory Statistics API Parameters

It provides data distribution statistics for both the source and the target data.

Function: exploratoryStats()
Parameters: None
Return Type: A Pandas dataframe with the exploratory information of the source data and the anonymized data.
Sample Request: job.exploratoryStats()

This provides the data distribution of each attribute, that is, all unique values of an attribute and their occurrence counts. This can be used to build a data histogram of all attributes in the dataset.
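The distribution described above, unique values mapped to occurrence counts, can be sketched in plain Python. This is a conceptual illustration only, not the SDK's implementation:

```python
from collections import Counter

def value_distribution(column):
    """Map each unique value of an attribute to its occurrence count."""
    return dict(Counter(column))

# A small age column: 30 and 45 each occur twice, 25 once.
ages = [25, 30, 45, 30, 45]
dist = value_distribution(ages)
```

Plotting such counts per attribute yields the histograms mentioned above.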

Get Risk Metric

The Get Risk Metric API is used to ascertain the risk of the source data and the anonymized data.

For more information about the risk metric API, refer to Obtain the risk statistics.

Get Risk Metric API Parameters

It shows the risk of the data against attacks such as journalist, marketer, and prosecutor.

Function: riskStat()
Parameters: None
Return Type: A Pandas dataframe with the privacy risk information of the source data and the anonymized data.

Note: You can customize the riskThreshold as part of the AnonElement configuration.

Sample Request: job.riskStat()

The following values appear for the source and result set:

  • avgRecordIdentification: Displays the average probability of identifying a record in the anonymized dataset. The risk is higher when the value is closer to 1.
  • maxProbabilityIdentification: Displays the maximum probability that a record can be identified from the dataset. The risk is higher when the value is closer to 1.
  • riskAboveThreshold: Displays the number of records that are at a risk above the risk threshold. The default threshold is 10%. The threshold is the maximum value set as a boundary. Any values beyond the threshold are a risk and might be easy to identify. For this result, the value 0 is preferred.
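To illustrate how such record-level risk values relate to equivalence classes (a conceptual sketch, not the product's internal computation), each record's identification probability can be taken as 1 divided by the size of the group of records sharing its quasi-identifier values:

```python
from collections import Counter

def risk_metrics(qi_tuples, threshold=0.10):
    """Record-level reidentification risk from quasi-identifier tuples.

    A record's identification probability is 1 / (size of the group of
    records that share its quasi-identifier values).
    """
    sizes = Counter(qi_tuples)
    probs = [1 / sizes[t] for t in qi_tuples]
    return {
        "avgRecordIdentification": sum(probs) / len(probs),
        "maxProbabilityIdentification": max(probs),
        "riskAboveThreshold": sum(p > threshold for p in probs),
    }

# Two records share their QI values (probability 0.5 each);
# the third record is unique (probability 1.0).
records = [("M", "20-30"), ("M", "20-30"), ("F", "40-50")]
metrics = risk_metrics(records)
```

The unique third record drives maxProbabilityIdentification to 1.0, which is why unique quasi-identifier combinations are the main source of risk.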

Get Utility Statistics

The Get Utility Statistics API is used to check the usability of the anonymized data.

For more information about the utility statistics API, refer to Obtain the anonymization data utility statistics.

Get Utility Statistics API Parameters

It shows the information that was lost to gain privacy protection.

Function: utilityStat()
Parameters: None
Return Type: A Pandas dataframe with the utility information of the source data and the anonymized data.
Sample Request: job.utilityStat()

The following values appear for the source and result set:

  • ambiguity: Displays how well a record is hidden among all the records. This captures the ambiguity of records.
  • average_class_size: Measures the average size of groups of indistinguishable records. A smaller class size is more favourable for retaining the quality of the information. A larger class size increases anonymity at the cost of quality.
  • discernibility: Measures the size of groups of indistinguishable records, with a penalty for records that have been completely suppressed. The discernibility metric measures the cardinality of the equivalence class. It considers only the number of records in the equivalence class and does not capture information loss caused by generalization.
  • generalization_intensity: Data transformation from the original records to anonymity is performed using generalization and suppression. This measures the concentration of generalization and suppression on attribute values.
  • infoLoss: Displays the proportion of information lost in the data transformation from the original records. The larger the value, the lower the quality for further analysis.
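As an illustration of one of these metrics, average_class_size can be computed directly from the quasi-identifier values. This is a conceptual sketch, not the SDK's implementation:

```python
from collections import Counter

def average_class_size(qi_tuples):
    """Average size of the equivalence classes, i.e. the groups of
    records that are indistinguishable on their quasi-identifier values."""
    class_sizes = Counter(qi_tuples).values()
    return sum(class_sizes) / len(class_sizes)

# Two equivalence classes of sizes 4 and 2 give an average of 3.0.
rows = [("M", "20-30")] * 4 + [("F", "40-50")] * 2
avg = average_class_size(rows)
```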

Detection APIs

The Detection APIs are used to analyze and classify data in the Protegrity Anonymization.

Get Data Domains

The Get Data Domains API is used to obtain a list of data domains supported.

For more information about obtaining the data domains API, refer to Get the supported data domains.

Detect Anonymization Information

The Detect Anonymization Information API is used to detect the data domain, classification type, hierarchy, and privacy models for the dataset.

For more information about the detect anonymization information API, refer to Data domain, Classification type, Hierarchy, and Privacy Models detection from a dataset.

Detect Classification

The Detect Classification API is used to detect the classification that will be used for the anonymization operation. Accordingly, you can modify the classification to match your requirements.

For more information about the detect classification API, refer to Classification type detection from a dataset.

Detect Hierarchy

The Detect Hierarchy API is used to detect the hierarchy type that will be used for the anonymization operation.

For more information about the detect hierarchy API, refer to Hierarchy Type detection from a dataset.

4.2.2 - Understanding Protegrity Anonymization Python SDK Requests

The following APIs are available with Protegrity Anonymization. You can import Protegrity Anonymization in your Python SDK environment, pass the required parameters and data to the Protegrity Anonymization Python SDK requests, and retrieve and work with the anonymized output.

Before running the anonymization jobs mentioned in the Protegrity Anonymization SDK section below, the following prerequisites must be completed:

  • Ensure that the Anonymization machine is set up and is configured as "https://anon.protegrity.com/".
    For more information about setting up and configuring an Anonymization machine for AWS and Azure, refer to AWS and Azure.
  • Ensure that the disk is not full and enough free space is available for saving the destination file.
  • Verify the destination file is not in use. Set the required permissions for creating and modifying the destination file.
  • Verify that the anonymization job exists.
  • Verify the import of the Pythonic SDK. For example, import anonsdk as asdk.

You can use different sample requests to build and run the anonymization APIs. For more information about the sample requests for Python SDK, refer to Sample Requests for Protegrity Anonymization.

Understanding the AnonElement object

The AnonElement is an essential part of the Protegrity Anonymization SDK. It holds all information that is required for processing the anonymization request. The AnonElement is a part of the anonsdk package.

Protegrity Anonymization SDK processes a Pandas dataframe to anonymize data using the Protegrity Anonymization REST API. It is the AnonElement that accepts the parameters and passes the information to the REST API. The AnonElement accepts the connection to the REST API, the Pandas dataframe with the data that must be processed, and, optionally, the source location for processing the request.

Anonymization Functions

The Anonymization Functions APIs are used to run the anonymization job.

Anonymize

The Anonymize API is used to start an anonymize operation.

For more information about the anonymize API, refer to Submit a new anonymization job.

Note: When you run the job, an empty destination file is created. This file is created during processing for verifying the necessary destination permissions. Avoid using this file till the anonymization job is complete.

Ensure that the anonymized data file and the logs generated are moved to a different system before deleting your environment.

If the source file is larger than the maximum limit allowed on the Cloud environment, run the anonymization request with "additional_properties": { "single_file": "no" }.

Apply Anonymize

The Apply Anonymize API is used as a template to anonymize additional entries. It lets you reuse an existing configuration to process additional data, which is especially useful in machine learning for training the system to anonymize new data points.

Note: In this API, privacy model parameters are ignored while performing the anonymization for the new entry.

For more information about the apply anonymize API, refer to Apply anonymization config to a given dataset.

Apply Anonymize API Parameters

Use this API to start an anonymize operation.

Function: anonymize(anon_object, target_datastore, force, mode)
Parameters:

  • anon_object: The object with the configuration for performing the anonymization request.
  • target_datastore: The location to store the anonymized result.
  • force: The boolean value to force the operation. Acceptable values: True and False. Set this flag to True to resubmit the same anonymization job without any modification.
  • mode: The value to enable auto anonymization. Acceptable value: auto. Do not include this parameter to skip auto anonymization.

Return Type: A job object with which the task monitoring and task statistics can be obtained.
Sample Request:

Without auto anonymization: job = asdk.anonymize(anon_object, target_datastore, force=True)

With auto anonymization: job = asdk.anonymize(anon_object, target_datastore, force=True, mode="auto")

Note: When you run the job, an empty destination file is created. This file is created during processing to verify the necessary destination permissions. Avoid using this file until the anonymization job is complete.

For more information about using the Auto Anonymization, refer to Using the Auto Anonymizer.

Ensure that the anonymized data file and the logs generated are moved to a different system before deleting your environment.

If the source file is larger than the maximum limit allowed on the Cloud environment, run the anonymization request with "additional_properties": { "single_file": "no" }.

If you want to bypass the Anon-Storage, you can disable the pods by setting the pty_storage flag to False.
For example, use the following code to run the anonymization request without using the storage pods:

job=asdk.anonymize(anon_object, pty_storage=False)

Measure

The Measure API is used to obtain anonymization result statistics for different configurations before running the actual anonymization job.

For more information about the Measure API, refer to Submit a new anonymization Measure job.

Using Infer to Anonymize API Parameters

Use the Infer API to start auto-detecting the data domain, classification type, hierarchies, and anonymization configuration in Protegrity Anonymization. Any user-defined configuration, such as QI attribute assignments, hierarchy, and K value, is retained and considered while performing the auto anonymization.

Function: infer(targetVariable)
Parameters: targetVariable: The field specified here is used as a focus point for performing the anonymization.
Return Type: An anon element with all the detected classifications and the generated hierarchies.
Sample Request: e.infer(targetVariable='income')

Note: You can use e.measure() to modify the request and view different outcomes of the result set.

For more information about the Infer API, refer to Using Infer to Anonymize.

Task Monitoring APIs

The Task Monitoring APIs are used to monitor the anonymization job. Use these APIs to obtain the job status, retrieve a job, and abort a job.

Get Job IDs

The Get Job ID API is used to get the job IDs of the last 20 anonymization operations that are running, in queue, or completed. You can then use the required job ID with the other APIs to work with the anonymization job.

For more information about the job ID API, refer to Obtain job ids.

Get Job Status

The Get Job Status API is used to get the status of an anonymize operation that is running, in queue, or completed. It shows the percentage of the job completed. Use this information to monitor whether a job is running or stalled.

For more information about the job status API, refer to Obtain job status.

Get Job Status API Parameters

Use this API to get the status of an anonymize operation that is running. It shows the percentage of the job completed. Use this information to monitor whether a job is running or stalled.

Function: status()
Parameters: None
Return Type: A string with the status information in the JSON format. The string contains the following fields:

  • completed: Information about the job, such as data, statistics, summary, and time spent.
  • id: The job ID.
  • info: Information about the job being processed, such as the source and attributes for the job.
  • running: The completion status of the jobs being processed. It shows the percentage of the job completed.
  • status: The status of the job, such as running or completed.

Note: This API displays all the status information of the job. To obtain the ID of a job, use job.id().
Sample Request: job.status()

Get Metadata

The Get Metadata API is used to retrieve the metadata for the existing job. This API is useful when you need to view the configuration available for a job. It displays the fields, configuration, and the data that is used to run the anonymization job.

For more information about the metadata API, refer to Obtain job metadata.

Retrieve Anonymized Data API Parameters

Use this API to retrieve the results of an anonymized job.

Function: result()
Parameters: None
Return Type: Returns the AnonResult element, which provides the DataFrame for the anonymized data.

Note: The result.df will be None if you have overridden the resultstore as part of the anonymize method.

Sample Request: job.result()

Note: This is a blocking API and will stall processing until the job is complete.

Abort

The Abort API is used to abort a running anonymization job. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

For more information about the abort API, refer to Abort a running anonymization job.

Note: After aborting the task, it might take time before all the running processes are stopped.

Abort API Parameters

Use this API to abort a running anonymize operation. You can abort jobs if you need to modify the parameters or if the job is stalled or taking too much time or resources to process.

Function: abort()
Parameters: None
Return Type: A string with the status of the abort request.
Sample Request: job.abort()

Delete

The Delete API is used to delete an existing job that is no longer required.

For more information about the delete API, refer to Delete a job.

Statistics APIs

The Statistics APIs are used to obtain information about the anonymization data. Use these APIs to obtain the risk and utility information about the anonymized data and to measure the utility benefits and risks of publishing it. If the results are not satisfactory, modify the relevant parameters based on these statistics and resubmit the anonymization job.

Get Exploratory Statistics

The Get Exploratory Statistics API is used to obtain data distribution statistics about a completed anonymization job. It includes statistics for both the source and the target data distributions.

For more information about the exploratory statistic API, refer to Obtain the exploratory statistics.

Get Risk Metric

The Get Risk Metric API is used to ascertain the risk of the anonymized data. It shows the risk of the data against attacks such as journalist, marketer, and prosecutor.

For more information about the risk metric API, refer to Obtain the risk statistics.

Get Utility Statistics

The Get Utility Statistics API is used to check the usability of the anonymized data.

For more information about the utility statistics API, refer to Obtain the anonymization data utility statistics.

5 - Building the Anonymization request

Use the APIs provided with the Protegrity Anonymization to create your request.
  • To use the APIs, you need to specify the source (file or data) that must be transformed. The source can be a single row of data or multiple rows of data sent in the request, or it could be a file located on the Cloud storage.
  • Next, you need to specify the transformation that must be performed on the various columns in the table.
  • Finally, after the transformation is complete, you can save the output or use it for further processing.

The transformation request can be saved for processing further requests. It can also be used as an input in machine learning.

5.1 - Common Configurations for building the request

Use the information provided in this section to build the REST APIs and Python SDK request for performing the Protegrity Anonymization transformation.

Specifying the Transformation

The data store consists of various fields that need to be identified for processing data. Specify the type of transformation that must be performed on each field and the privacy model that must be used for anonymizing the data. While specifying the rules for transformation, also specify the importance of the data.

Classifying the Fields

Specify the type of information that the fields hold. This classification must be performed carefully: leaving out important fields might render the anonymized data of no value, while including data that can identify individuals poses a risk of the anonymization not being carried out properly.

The following four classifications are available:

Direct Identifier
  Description: Data in fields that directly identify an individual, such as Name, SSN, phoneNo, and email.
  Function: Redact
  Treatment: Values will be removed.

Quasi Identifying Attribute
  Description: Data in fields that does not identify an individual directly but needs to be modified to avoid indirect identification, for example, age, date of birth, and zip code.
  Function: Hierarchy models
  Treatment: Values will be transformed using the options specified.

Sensitive Attribute
  Description: Data in fields that does not identify an individual directly but needs to be modified to avoid indirect identification. This data needs to be preserved to ensure further analysis or to obtain utility out of the anonymized data. In addition, ensure that records with this classification are part of a herd or group where they lose the ability to identify an individual.
  Function: LDiv, TClose
  Treatment: No change in values, except extreme values that might identify an individual. Values will be generalized in case of t-closeness.

Non-Sensitive Attribute
  Description: Data in fields that does not identify an individual directly or indirectly.
  Function: Preserve
  Treatment: No change in values.

Ensure that you identify the sensitive and the quasi-identifier fields for specifying the anonymization method for hiding individuals in the dataset.

Use the following code for specifying a quasi-identifier for REST API and Python SDK:

"classificationType": "Quasi Identifier",
e['<column>'] = asdk.Gen_Mask(maskchar='#', maxLength=3, maskOrder="L")

Specifying the privacy model

The privacy model transforms the dataset using one or several anonymization methods to achieve privacy.

The following anonymization techniques are available in the Protegrity Anonymization:

K-anonymity

Ensures that each combination of quasi-identifier values occurs in at least k records. The information type is Quasi-Identifier.

Use the following code for specifying K-anonymity for REST API and Python SDK:

"privacyModel": {
    "k": {
    "kValue": 5
    }
}
e.config.k=asdk.K(2)
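The k-anonymity property itself can be checked with a few lines of plain Python, independent of the SDK; this sketch only illustrates the definition:

```python
from collections import Counter

def satisfies_k_anonymity(qi_tuples, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    return all(size >= k for size in Counter(qi_tuples).values())

rows = [("M", "20-30"), ("M", "20-30"), ("F", "20-30"), ("F", "20-30")]
ok_for_2 = satisfies_k_anonymity(rows, 2)  # each QI tuple appears twice
ok_for_3 = satisfies_k_anonymity(rows, 3)  # fails: classes have only 2 records
```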

l-diversity

Ensures that the sensitive values within each group of k indistinguishable records are distributed and diverse enough to reduce the risk of identification. The information type is Sensitive Attribute.

Use the following code for specifying l-diversity for REST API and Python SDK:

"privacyModel": {
    "ldiversity": [
        {
        "lFactor": 2,
        "name": "sex",
        "lType": "Distinct-l-diversity"
        }
    ]
}
e["<column>"]=asdk.LDiv(lfactor=2)

t-closeness

Ensures that the distribution of each sensitive attribute within a group stays close to its distribution in the complete dataset. The information type is Sensitive Attribute.

Use the following code for specifying t-closeness for REST API and Python SDK:

"privacyModel": {
"tcloseness": [
    {
    "name": "salary-class",
    "emdType": "EMD with equal ground distance",
    "tFactor": 0.2
    }
  ]
}
e["<column>"]=asdk.TClose(tfactor=0.2)

Specifying the Hierarchy

The hierarchy specifies how the information in the dataset is handled for anonymization. These hierarchical transformations are performed on Quasi-Identifiers and Sensitive Attributes. Accordingly, the data can be generalized using transformations or aggregated using mathematical functions. As you go up the hierarchy, the data is anonymized better; however, the quality of the data for further analysis reduces.

Global Recoding and Full Domain Generalization

Global recoding and full domain generalization are used for anonymizing the data. When data is anonymized, the quasi-identifier values are transformed to ensure that the data fulfils the required privacy requirements. This transformation is also called data recoding. In Protegrity Anonymization, data is anonymized using global recoding, that is, the same transformation rule is applied to all entries in the dataset.

Consider the data in the following tables:

ID | Gender | Age | Race
1  | Male   | 45  | White
2  | Female | 30  | White
3  | Male   | 25  | Black
4  | Male   | 30  | White
5  | Female | 45  | Black

Level0 | Level1 | Level2 | Level3 | Level4
25     | 20-25  | 20-30  | 20-40  | *
30     | 30-35  | 30-40  | 30-50  | *
45     | 40-45  | 40-50  | 40-60  | *

In the above example, when global recoding is used for a value such as 45, all occurrences of age 45 are generalized using the same single level, that is, to one of the following:

  • 40-45
  • 40-50
  • 40-60
  • *

Full-domain generalization means that all values of an attribute are generalized to the same level of the associated hierarchy level. Thus, in the first table, if age 45 gets generalized to 40-50 which is Level2, then all age values are also generalized to Level2 only. Hence, the value 30 will be generalized to 30-40.
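Using the Age hierarchy from the tables above, full-domain generalization (global recoding) can be sketched as applying one hierarchy level to every value. This is an illustration of the concept, not the product's internals:

```python
# Age hierarchy from the example: index 0 is the raw value, index 4 is "*".
hierarchy = {
    25: ["25", "20-25", "20-30", "20-40", "*"],
    30: ["30", "30-35", "30-40", "30-50", "*"],
    45: ["45", "40-45", "40-50", "40-60", "*"],
}

def full_domain_generalize(ages, level):
    """Global recoding: the same hierarchy level is applied to all values."""
    return [hierarchy[age][level] for age in ages]

# Generalizing 45 to Level2 (40-50) forces every other age to Level2 too.
generalized = full_domain_generalize([45, 30, 25], 2)
```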

In addition to generalization, micro-aggregation is available for transforming the dataset. In generalization, the mathematical function is performed on all the values of the column. However, in micro-aggregation, the mathematical function is performed on all the values within an equivalence class.

Consider the following table with ages of five men and five women.

Gender | Age
M      | 20
M      | 20
F      | 20
M      | 22
M      | 22
F      | 22
F      | 22
M      | 28
F      | 28
F      | 28

The following output is obtained by performing generalization aggregation on the Age using the average, with the Gender set as a QI and the K value kept as 2.

Gender | Age | Generalization
M      | 20  | 23.2
M      | 20  | 23.2
F      | 20  | 23.2
M      | 22  | 23.2
M      | 22  | 23.2
F      | 22  | 23.2
F      | 22  | 23.2
M      | 28  | 23.2
F      | 28  | 23.2
F      | 28  | 23.2

In the table, the sum of all the ages is obtained and divided by the total number of records, that is, 10, to obtain the generalization value using the average.

The following output is obtained by performing micro-aggregation on the Age using the average, with the Gender set as a QI and the K value kept as 2.

Gender | Age | Micro-Aggregation
F      | 20  | 24
F      | 22  | 24
F      | 22  | 24
F      | 28  | 24
F      | 28  | 24
M      | 20  | 22.4
M      | 20  | 22.4
M      | 22  | 22.4
M      | 22  | 22.4
M      | 28  | 22.4

In the table, two equivalence classes are formed based on the gender. The sum of the ages in each group is obtained and divided by the number of records in that group, that is, 5, to obtain the micro-aggregation value using the average.
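The two computations above can be reproduced in plain Python. This is a conceptual sketch of the arithmetic, not the SDK:

```python
from collections import defaultdict

rows = [("M", 20), ("M", 20), ("F", 20), ("M", 22), ("M", 22),
        ("F", 22), ("F", 22), ("M", 28), ("F", 28), ("F", 28)]

# Generalization: one average over the entire Age column.
overall_avg = sum(age for _, age in rows) / len(rows)   # 232 / 10 = 23.2

# Micro-aggregation: one average per equivalence class (here, per Gender).
classes = defaultdict(list)
for gender, age in rows:
    classes[gender].append(age)
class_avgs = {g: sum(ages) / len(ages) for g, ages in classes.items()}
# class_avgs == {"M": 22.4, "F": 24.0}
```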

Generalization

In Generalization, the data is grouped into sets having similar attributes. The mathematical function is applied on the selected column by considering all the values in the dataset.

The following transformations are available:

  • Masking-Based: In this transformation, information is hidden by masking parts of the data to form similar sets. For example, masking the last three digits of the zip code groups values together, such as 54892 and 54231 both being transformed to 54###.

An example of masking-based transformation for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "String",
            "generalization": {
                "hierarchyType": "Rule",
                "rule": {
                    "masking": {
                        "maskOrder": "Right To Left",
                        "maskChar": "#",
                        "maxDomainSize": 5
                    }
                },
                "type": "Masking Based"
            },
            "name": "city"
}

Where:

  • maskOrder is the order for masking; use Right To Left to mask from the right and Left To Right to mask from the left.
  • maskChar is the placeholder character for masking.
  • maxDomainSize is the number of characters to mask. The default is the maximum length of the string in the column.

e["zip_code"] = asdk.Gen_Mask(maskchar="#",  maskOrder = "R", maxLength=5)

Where:

  • maskchar is the placeholder character for masking.
  • maskOrder is the order for masking; use R to mask from the right and L to mask from the left.
  • maxLength is the number of characters to mask. The default is the maximum length of the string in the column.
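As a plain-Python illustration of the right-to-left masking behaviour described above (not the SDK's implementation):

```python
def mask_right(value, maskchar="#", count=3):
    """Replace the rightmost `count` characters with the mask character."""
    if count >= len(value):
        return maskchar * len(value)
    return value[:-count] + maskchar * count

# Both zip codes fall into the same generalized group: "54###".
masked = {mask_right(z) for z in ["54892", "54231"]}
```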
  • Tree-Based: In this transformation, data is aggregated by transformation to form similar sets using external knowledge. For example, in the case of address, the data can be anonymized based on the city, state, country, or continent, as required. You must specify the file containing the tree data. If the current level of aggregation does not provide adequate anonymization, then a higher level of aggregation is used. The higher the level of aggregation, the more the data is generalized. However, a higher level of generalization reduces the quality of data for further analysis.

An example of tree-based transformation for building a REST API and Python SDK is provided here.

{
          "classificationType": "Quasi Identifier",
          "dataTransformationType": "Generalization",
          "dataType": "String",
          "generalization": {
              "type": "Tree Based",
              "hierarchyType": "Data Store",
              "dataStore": {
                  "type": "File",
                  "file": {
                      "name": "adult_hierarchy_education.csv",
                      "props": {
                          "delimiter": ";",
                          "quotechar": "\"",
                          "header": null
                      }
                  },
                  "format": "CSV"
              }
          },
          "name": "education"
}
treeGen = {'lvl0': [11, 13, 14, 15, 27, 28, 20],
              'lvl1': ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
              'lvl2': ["<20", "<20", "<20", "<30", "<30", "<30", "<30"]}

e["bmi"] = asdk.Gen_Tree(pd.DataFrame(data=treeGen), ["Missing", "Might be <30", " Might be <30"])  

You can refer to an external file for specifying the parameters for the hierarchy tree.

education_df = pd.read_csv('D:\\WS\\data source\\hierarchy\\adult_hierarchy_education.csv', sep=';')
e['education'] = asdk.Gen_Tree(education_df)
  • Interval-Based: In this transformation, data is aggregated into groups according to the predefined intervals specified.
    In addition, the lowerbound and upperbound values can be specified when building the SDK request. Values below the lowerbound and above the upperbound are excluded from range generation.

An example of interval-based transformation for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Integer",
            "generalization": {
                "hierarchyType": "Rule",
                "rule": {
                    "interval": {
                        "levels": [
                            "5",
                            "10",
                            "50",
                            "100"
                        ],
                        "lowerBound": "0"
                    }
                },
                "type": "Interval Based"
            },
            "name": "age"
}
asdk.Gen_Interval([<interval_level>],<lowerbound>,<upperbound>)

An example of interval-based transformation for building the SDK API is provided here.

e['age'] = asdk.Gen_Interval([5,10,15])
e['age'] = asdk.Gen_Interval([5,10,15],20,60)
  • Aggregation-Based: In this transformation, integer data is aggregated as per the conditions specified. The available options for aggregation are Mean and Mode.

    Note: Mean is applicable for Integer and Decimal data types.

    Mode is applicable for Integer, Decimal, and String data types.

An example of aggregation-based transformation for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Integer",
            "generalization": {
                "hierarchyType": "Aggregate",
                "type": "Aggregation Based",
                "aggregateFn": "Mean"
            },
            "name": "age"
}

An example of aggregation-based transformation using Mean is provided here.

e['age'] = asdk.Gen_Agg(asdk.AggregateFunction.Mean)

An example of aggregation-based transformation using Mode is provided here.

e['salary'] = asdk.Gen_Agg(asdk.AggregateFunction.Mode)
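For intuition, Mean replaces every value in a group with the arithmetic mean, and Mode with the most frequent value; Python's statistics module illustrates the two functions (an illustrative sketch, not the anonsdk implementation):

```python
from statistics import mean, mode

ages = [39, 50, 38, 53]               # Integer: Mean or Mode applies
incomes = ['<=50K', '<=50K', '>50K']  # String: only Mode applies

# Mean: every value in the group is replaced by the arithmetic mean (45).
aggregated_ages = [mean(ages)] * len(ages)
# Mode: every value in the group is replaced by the most frequent value.
aggregated_income = mode(incomes)
print(aggregated_ages, aggregated_income)
```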
  • Date-Based: In this transformation, data is aggregated into groups according to the date.

An example of date-based interval and rounding for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Date",
            "generalization": {
                    "hierarchyType": "Rule",
                    "type": "Interval Based",
                    "rule": {
                      "daterange": {
                        "levels": [
                          "WD.M.Y",
                          "W.M.Y",
                          "FD.M.Y",
                          "M.Y",
                          "QTR.Y",
                          "Y",
                          "DEC",
                          "CEN"
                        ]
                      }
                    }
                  },
            "name": "date_of_birth"
}
It is not applicable for building Python SDK requests.
  • Time-Based: In this transformation, data is aggregated into groups according to the time, with time intervals specified in seconds. The lowerBound and upperBound take values in the HH:MM:SS format.

An example of time-based interval and rounding for building a REST API and Python SDK is provided here.

{
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "dataType": "Date",
            "generalization": {
                      "hierarchyType": "Rule",
                      "type": "Interval Based",
                      "rule": {
                            "interval": {
                                "levels": [
                                    "30",
                                    "60",
                                    "180",
                                    "240"
                                  ],
                        "lowerBound": "00:00:00",
                        "upperBound": "23:59:59"
                      }
                    }
                  },
            "name": "time_of_birth"
}
It is not applicable for building Python SDK requests.
  • Rounding-Based: In this transformation, data is rounded into groups according to a predefined rounding factor.

Rounding-based transformation is not applicable for building REST API requests. Examples for building the Python SDK request are provided here.

An example of date-based transformation is provided here.

e['DateOfBirth'] = asdk.Gen_Rounding(["H.M4", "WD.M.Y", "M.Y"])

An example of numeric-based transformation is provided here.

e['Interest_Rate'] = asdk.Gen_Rounding([0.05,0.10,1])

Micro-Aggregation

In micro-aggregation, mathematical functions are used to group the data and replace each value with its group's aggregate. This is used to achieve K-anonymity by forming small groups of similar records in the dataset.

The following aggregation functions are available for micro-aggregation in the Protegrity Anonymization:

  • For numeric data types (integer and decimal):
    • Arithmetic Mean

    • Geometric Mean

      Note: Micro-Aggregation using geometric mean is only supported for positive numbers.

    • Median

  • For all data types:
    • Mode

Note: Arithmetic Mean, Geometric Mean, and Median are applicable for Integer and Decimal data types.

Mode is applicable for Integer, Decimal, and String data types.

An example of micro-aggregation for building a REST API and Python SDK is provided here.

{
      "classificationType": "Quasi Identifier",
      "dataTransformationType": "Micro Aggregation",
      "dataType": "Decimal",
      "aggregateFn": "Median",
      "name": "age_ma_median"
}
e['income'] = asdk.MicroAgg(asdk.AggregateFunction.Mean)
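The idea behind micro-aggregation can be sketched in a few lines of plain Python: sort the values, form groups of at least k records, and replace each member with its group's aggregate, so every output value is shared by at least k records. This is a simplified illustration, not the anonsdk algorithm:

```python
from statistics import mean

def micro_aggregate(values, k):
    """Simplified sketch: partition sorted values into groups of at least k
    and replace each value with its group's arithmetic mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())  # fold a short tail into the previous group
    out = [None] * len(values)
    for g in groups:
        m = mean(values[i] for i in g)
        for i in g:
            out[i] = m
    return out

# Each output value is shared by at least k=2 records.
print(micro_aggregate([39, 50, 38, 53], 2))  # [38.5, 51.5, 38.5, 51.5]
```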

5.2 - Building the request using the REST API

Use the information provided in this section to build the REST API request for performing the Protegrity Anonymization transformation.

Identifying the source and target

The source dataset is the starting point of the transformation. In this step, you specify the source that must be transformed and the target where the anonymized data will be saved.

  • The following file formats are supported:
    • Comma separated values (CSV)
    • Columnar storage format: This is an optimized file format for large amounts of data. Using this file format provides faster results. For example, Parquet (gzip and snappy).
  • The following data storages have been tested for the Protegrity Anonymization:
    • Local File System
    • Amazon S3
  • The following data storages can also be used for the Protegrity Anonymization:
    • Microsoft Azure Storage
      • Data Lake Storage
      • Blob Storage
    • MinIO Storage
    • Other S3 Compatible Services

Use the following code to specify the source:

Note: Modify the source and destination code for your provider.

For more cloud-related sample codes, refer to the section Samples for Cloud-related Source and Destination Files.

"source": {
      "type": "File",
      "file": {
        "name": "<Source_file_path>"
      }
}

Note: When uploading a file to the Cloud service, wait until the entire source file is uploaded before running the anonymization job.

Similarly, specify the target file using the following code:

"target": {
    "type": "File",
    "file": {
      "name": "<Target_file_path>"
    }
}

Specify additional parameters about the source and target file, such as the character used to separate the values in the file, using the following props attribute. If a property is not specified, then the default attribute shown here is used.

"props": {
    "sep": ",",
    "decimal": ".",
    "quotechar": "\"",
    "escapechar": "\\",
    "encoding": "utf-8",
    "line_terminator": "\n"
}
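These properties correspond to common CSV dialect options. As a rough illustration (using the Python standard library, not the Protegrity service), the separator, quotechar, and escapechar interact as follows when a file is parsed:

```python
import csv
import io

# The quotechar protects separators inside a field; the escapechar
# protects the quote character itself.
text = 'name,city\n"Doe, Jane",Oslo\n'
rows = list(csv.reader(io.StringIO(text), delimiter=',', quotechar='"', escapechar='\\'))
print(rows[1])  # ['Doe, Jane', 'Oslo'] -- the quoted comma is data, not a separator
```

If the props do not match how the file was actually written, fields split or merge incorrectly, so these defaults should only be overridden to match the real file dialect.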

If the required files are on a cloud storage, then specify the cloud-related access information using the following code:

"accessOptions": {
}

For more information about specifying the source and target files, refer to Dask remote data configuration.

Note: If the target directory already exists, then the job fails. If the target file already exists, then the file is overwritten. Additionally, some Cloud services have limitations on the file size. If such a limitation exists, then you can set the single_file switch to no when writing large files to the Cloud storage. This saves the output as multiple files and avoids errors related to saving large files to the Cloud storage.

Specifying the Transformation

For more information about specifying the transformation, refer to Specifying the Transformation.

Classifying the Fields

For more information about different fields classification, refer to Classifying the Fields.

The following data types are supported for working with the data in the fields:

  • Integer
  • Float
  • String
  • Date
  • Time
  • DateTime

Date: The following date types are supported:

  • mm-dd-yyyy - This is the default format.
  • dd-mm-yyyy
  • dd-mm-yy
  • mm-dd-yy
  • dd.mm.yyyy
  • mm.dd.yyyy
  • dd.mm.yy
  • mm.dd.yy
  • dd/mm/yyyy
  • mm/dd/yyyy
  • dd/mm/yy
  • mm/dd/yy

Time: HH is used to specify time in the 24-hour format and hh is used to specify time in the 12-hour format. The following time formats are supported:

  • HH:mm:ss - This is the default format.
  • HH:mm:ss.ns
  • hh:mm:ss
  • hh:mm:ss.ns
  • hh:mm:ss.ns p - Here, p is the 12 hour format with period AM/PM.
  • HH:mm:ss.ns z - Here, z is timezone info with +- from UTC, that is, +0000,+0530,-0230.
  • hh:mm:ss Z - Here, Z is the timezone info with the name, that is, UTC,EST, CST.

Here are a few examples:

{
            "classificationType": "Non-Sensitive Attribute",
            "dataType": "Integer",
            "name": "index"
}
        
{
            "classificationType": "Sensitive Attribute",
            "dataType": "String",
            "name": "diagnosis_dup"
}

Note: The values present in the first row of the dataset are considered for determining the format for date, time, and datetime. You can override the detection using "props": {"dateformat": "<Specify_Format>"}.

Consider the following example for date with the mm/dd/yyyy format:

10/09/2020
12/24/2020
07/30/2020

In this case, the data will be identified as dd/mm/yyyy, because only the first row is inspected and 10/09/2020 is valid in either format.

You can override this using the following property:

"props": {"dateformat": "mm/dd/yyyy"}

Specifying the Privacy Model

For more information about anonymization methods for privacy model, refer to Specifying the Privacy Model.

Specifying the Hierarchy

For more information about how the information in the data set is handled for anonymization, refer to Specifying the Hierarchy.

Generalization

For more information about grouping data into sets having similar attributes, refer to Generalization.

Micro-Aggregation

For more information about the mathematical formulas used to group the data, refer to Micro-Aggregation.

Specifying Configurations

Additional configurations are available in the Protegrity Anonymization to enhance the anonymity of the information in the data set.

The following configurations are available:

"config": {
    "maxSuppression": 0.1 
    "suppressionData": "*"
    "redactOutliers": False
}
  • maxSuppression specifies the maximum percentage of rows that may be suppressed as outlier rows to obtain the anonymized data. The default is 0.1 (10%).
  • suppressionData specifies the character or character set used for suppressing values in the anonymized data. The default is *.
  • redactOutliers specifies whether outlier rows are excluded from the anonymized dataset. The default is false, which means outlier rows are included.
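A rough sketch of what these settings mean, using simplified, assumed semantics for illustration:

```python
# Simplified, assumed semantics of the configuration options, for illustration only.
max_suppression = 0.1    # up to 10% of rows may be suppressed as outliers
suppression_char = "*"
redact_outliers = False  # False: outlier rows stay in the output, with values suppressed

n_rows = 2500
allowed_outliers = round(n_rows * max_suppression)
print(allowed_outliers)  # 250

# An outlier row kept in the output has its values replaced by the suppression character:
outlier_row = {"age": 93, "zip": "99950"}
suppressed_row = {column: suppression_char for column in outlier_row}
print(suppressed_row)  # {'age': '*', 'zip': '*'}
```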

5.3 - Building the request using the Python SDK

Use the information provided in this section to build the request using the Python SDK environment for performing the Protegrity Anonymization transformation.

To build an anonymization request using the SDK, the user first needs to import the anonsdk module using the following command.

import anonsdk as asdk

Creating the connection

You need to specify the connection to the Protegrity Anonymization REST service to set up the Protegrity Anonymization.

Note: If the administrator has not updated the DNS entry for the ANON REST API service, then map the hostname to the IP address of the Anon Service in the hosts file of the system.

For example, if the Protegrity Anonymization REST service is located at https://anon.protegrity.com, then you would create the following connection.

conn = asdk.Connection("https://anon.protegrity.com/")

Identifying the source and target

Protegrity Anonymization is built to anonymize the data in a Pandas dataframe and return the anonymized dataframe. However, you can also specify a CSV file from various source systems as the source data.

Use the following code to specify the source.

e = asdk.AnonElement(conn, dataframe)

If the source file is located at the same place where Protegrity Anonymization is installed, then use the following code to load the source file into a dataframe.

dataframe = pandas.read_csv("<file_path>")
  • The following data storages have been tested for Protegrity Anonymization:

    • Local File System

    • Amazon S3

      For example:

      asdk.FileDataStore("s3://<path>/<file_name>.csv", access_options={"key": "<value>","secret": "value"})
      
  • The following data storages can also be used for Protegrity Anonymization:

    • Microsoft Azure Storage

      • Data Lake Storage

        For example:

        asdk.FileDataStore("adl://<path>/<file_name>.csv", access_options={"tenant_id": "<value>", "client_id": "<value>", "client_secret": "<value>"})

      • Blob Storage

        For example:

        asdk.FileDataStore("abfs://<path>/<file_name>.csv", access_options={"account_name": "<value>", "account_key": "<value>"})

    • MinIO Storage
    • Other S3 Compatible Services

Note: When uploading a file to the Cloud service, wait until the entire source file is uploaded before running the anonymization job.

For more information about using remote sources, refer to Connect to remote data (https://docs.dask.org/en/latest/how-to/connect-to-remote-data.html).

If required, you can directly specify data in a list using the following format:

d = {'<column1_name>': ['value1', 'value2', 'value3', ...],
     '<column2_name>': [number1, number2, number3, ...],
     '<column3_name>': ['value1', 'value2', 'value3', ...],
     ...}

For example:

d = {'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
         'occupation': ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Handlers-cleaners', 'Prof-specialty', 'Exec-managerial', 'Other-service', 'Exec-managerial', 'Prof-specialty'],
         'age': [39, 50, 38, 53, 28, 37, 49, 52, 31],
         'race': ['White', 'White', 'White', 'Black', 'Black', 'White', 'Black', 'White', 'White'],
         'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Married-civ-spouse', 'Never-married'],
         'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors', 'Masters', '9th', 'HS-grad', 'Masters'],
         'native-country': ['United-States', 'United-States', 'United-States', 'United-States', 'Cuba', 'United-States', 'Jamaica', 'United-States', 'United-States'],
         'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private', 'Private', 'Private', 'Private', 'Self-emp-not-inc', 'Private'],
         'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
         'bmi': [11.5, 12.5, 13.5, 14.5, 16.5, 16.5, 17.5, 18.5, 11.5] }

The anonymized data is returned to the user as a Pandas dataframe. Optionally, you can specify the required target file system and provide the target using the following code.

asdk.anonymize(e, resultStore=<targetFile>)

Specify additional parameters about the source and target file, such as the character used to separate the values in the file, using the various properties attributes. If a property is not specified, then the default attributes are used.

Note: Some Cloud services have limitations on the file size. If such a limitation exists, then you can set single_file to no when writing large files to the Cloud service. This saves the output as multiple files and avoids errors related to saving large files to the Cloud storage.

For more information and help on specifying the source and target files, refer to Dask remote data configuration.

Specifying the transformation

For more information about specifying the transformation, refer to Specifying the Transformation.

Protegrity Anonymization uses Pandas to build and work with the data frame. You need to import the Pandas library and store the source data that must be transformed in a Pandas data frame.

import pandas as pd

d = <source_data>
df = pd.DataFrame(data=d)

To build the transformation, you need to specify the AnonElement that holds the connection, data frame, and the source.

For example:

e = asdk.AnonElement(conn,df,source=datastore)

You need to specify the columns that must be included for processing the anonymization request and the column classification before performing the anonymization.

e["<column>"] = asdk.<transformation>

Where:

  • column: Specify the column name or column ID.
  • transformation: Specifies the processing to be applied for the column.

Note: By default, all the columns are set to ignore processing. The data is redacted and not included in the anonymization process. You need to manually set the column classification to include it in the anonymization process.

To apply the same transformation to multiple columns, pass the column names to the assign method, separated by commas.

e.assign(["<column1>","<column2>"],asdk.Transformation())

You can view the configuration provided using the describe function.

e.describe()

Classifying the fields

For more information about different fields classification, refer to Classifying the Fields.

The following data types are supported for working with the data in the fields:

  • Integer
  • Float
  • String
  • DateTime

Specifying the privacy model

For more information about anonymization methods for privacy model, refer to Specifying the Privacy Model.

Specifying the Hierarchy

For more information about how the information in the data set is handled for anonymization, refer to Specifying the Hierarchy.

Generalization

For more information about grouping data into sets having similar attributes, refer to Generalization.

Micro-aggregation

For more information about the mathematical formulas used to group the data, refer to Micro-Aggregation.

Working with saved Anonymization requests

The save method provides interoperability with the REST API. It generates the required JSON payload that can be used as part of curl or any REST client.

Use the following command to save the anonymization request.

e.save("<file_path>\\fileName.json")

Applying Anonymization to additional rows

You can use the applyAnon method to anonymize any additional rows using the saved request. Use the following command to anonymize using a previous anonymization job.

asdk.applyAnon(<conn>,job.id(), <single_row_data>)

Use this function to anonymize only a few rows. You need to specify the row information using the key-value pair and ensure that all the required columns are present.

Examples of single-row and multi-row data are shown here.

single_row_data = [{'ID': '1', 'Name': 'Wilburt Daniel', 'Address': '4 Sachtjen Plaza', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '18-04-2008'}]
multi_row_data = [{'ID': '1', 'Name': 'Wilburt Daniel', 'Address': '4 Sachtjen Plaza', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '18-04-2008'}, {'ID': '2', 'Name': 'Jones Knight', 'Address': '25 Macadamia Street', 'City_Name': 'Ambato', 'Gender': 'Male', 'date': '25-11-1997'}]

Running a sample request

Run the sample code provided here in an SDK environment. This sample is also available at https://<IP_Address>:<Port>/sdkapi.

Import the Protegrity Anonymization and the Pandas package in the SDK tool.

import pandas as pd
import anonsdk as asdk

Create a variable d with the sample data.

#Sample data for Demo
d = {'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
         'occupation': ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Handlers-cleaners', 'Prof-specialty', 'Exec-managerial', 'Other-service', 'Exec-managerial', 'Prof-specialty'],
         'age': [39, 50, 38, 53, 28, 37, 49, 52, 31],
         'race': ['White', 'White', 'White', 'Black', 'Black', 'White', 'Black', 'White', 'White'],
         'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Married-civ-spouse', 'Never-married'],
         'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors', 'Masters', '9th', 'HS-grad', 'Masters'],
         'native-country': ['United-States', 'United-States', 'United-States', 'United-States', 'Cuba', 'United-States', 'Jamaica', 'United-States', 'United-States'],
         'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private', 'Private', 'Private', 'Private', 'Self-emp-not-inc', 'Private'],
         'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
         'bmi': [11.5, 12.5, 13.5, 14.5, 16.5, 16.5, 17.5, 18.5, 11.5] }

Load the data in a Pandas DataFrame.

df = pd.DataFrame(data=d)

Specify the additional data required per attribute to transform and obtain anonymized data. In this example, the Hierarchy Tree is specified.

treeGen = {'lvl0': [11, 13, 14, 15, 27, 28, 20],
               'lvl1': ["<15", "<15", "<15", "<20", "<30", "<30", "<25"],
               'lvl2': ["<20", "<20", "<20", "<30", "<30", "<30", "<30"]}

Build the connection to a running Protegrity Anonymization REST cluster instance. Ensure that the hosts file is configured and points to the REST cluster.

conn = asdk.Connection('https://anon.protegrity.com/')

Build the AnonElement passing the connection and the data as inputs for the anonymization request.

e = asdk.AnonElement(conn,df)

Use the following code sample to read data from an external file store.

e = asdk.AnonElement(conn, dataframe, <SourceFile>)

Specify the transformation that is required.

e['gender'] = asdk.Redact()
e['occupation'] = asdk.Redact()
e['age'] = asdk.Gen_Tree(pd.DataFrame(data=treeGen), ["Missing", "Might be <30", "Might be <30"])

e["bmi"] = asdk.Gen_Interval(['5', '10', '15'])

Specify the K-value, the L-Diversity, and the T-Closeness values.

e.config.k = asdk.K(2)
e["income"] = asdk.LDiv(lfactor=2)
e["income"] = asdk.TClose(tfactor=0.2)

Specify the max suppression.

e.config['maxSuppression'] = 0.7

Specify the importance for the required fields.

e["race"] = asdk.Gen_Mask(maskchar="*",importance=0.8)

View the details of the current configuration.

e.describe()

Anonymize the data.

job = asdk.anonymize(e)

If required, save the results to a file.

datastore=asdk.FileDataStore("s3://...",access_options={"key": "K...","secret": "S..."})
job = asdk.anonymize(e, resultStore=datastore)

View the job status.

job.status()

View the anonymized data.

result = job.result()
if result.df is not None:
    print("Anon Dataframe.")
    print(result.df.head())

View the utility and risk statistics of the data.

job.utilityStat() 
job.riskStat()

Save the job configuration with the updated source and target to a JSON file.

e.save("/file_path/file.json", store=datastore)

Optional: Apply the anonymization rules of previous jobs to new data.

anonData = asdk.applyAnon(conn,job.id(), [{'gender':'Male','age': '39', 'race': 'White', 'income': '<=50K','bmi':'12.5'}])
anonData

6 - Using the Auto Anonymizer

The Auto Anonymizer feature of Protegrity Anonymization is a powerful feature for performing anonymization. It processes the data to generate a template for completing anonymization requests.

The Auto Anonymizer feature is simple and easy to configure. Moreover, it is built to analyze the data and produce an output that has a balance of both, generalization and value. The output of the auto anonymizer should always be verified by a human with dataset knowledge. The output is merely a suggestion and should not be used without further inspection.

Protegrity Anonymization analyzes a sample of the data from the dataset. This sample is then analyzed to build a template for performing the anonymization. The template building takes time, based on the size of the dataset and the nature of the data itself.

You can specify parameters, such as the fields to redact, for anonymizing the data. You can use the Auto Anonymizer feature to automatically analyze the data and perform the required anonymization. This feature can also scan the data and select the best optimization for providing high-quality anonymized data. The various parameters used for performing auto anonymization are configurable and can be tuned to suit your business needs or requirements. Additionally, frequently used field configurations can be created and stored to enable you to build the anonymization request faster and with minimal information before runtime.

A brief flow of the steps for auto anonymization is shown in the following figure.

The user provides the data, column identification, and anonymization parameters, if required. Protegrity Anonymization analyzes the provided parameters and the dataset. Various anonymization models are generated and evaluated. Parameters such as the K, l, and t values, along with the data available in the dataset, are used for processing the request. The results are compared and, finally, the dataset is processed using the model and parameters that produce the best anonymization output.

Consider the following sample graph.

Protegrity Anonymization will first auto-assign the privacy levels for the various columns in the dataset. Direct identifiers will be redacted from the dataset. Next, models will be created using different values for K-anonymity, l-diversity, and t-closeness. The values will be analyzed and the best values selected, such as the values at point b in the graph. The dataset will then be anonymized using the determined values to complete the anonymization request.

The user can specify the values that must be used, if required. Protegrity Anonymization will consider the values specified by the user and continue to auto generate the remaining values accordingly.

Note: Because auto anonymization runs the same request using different values, the anonymization request takes more time to complete than a regular anonymization request.

You can use measure, mode, and Infer for Auto Anonymization.

For more information about the measure API, refer to Measure API.

The difference between using mode and Infer is provided in the following table.

| Mode | Infer |
| --- | --- |
| Analyzes the dataset and performs the anonymization job. | Only analyzes the dataset. |
| The result set is the output. | Updates the models used for performing the anonymization job. |
| You cannot retrieve the attributes for the job. | You can view the auto-generated job attribute values, such as K-anonymity, that will be used for performing the job, using the describe method. |
| You can specify target variables for focusing the anonymization job with the anonymization function. | You can specify target variables for focus before performing the anonymization job, or modify the model after performing it. |

6.1 - Using mode to Auto Anonymize

Set the mode to Auto to auto anonymize. The auto anonymization auto-detects the data-domain, classification type, hierarchies, and anonymization configuration in Protegrity Anonymization.

Any user-defined configuration, such as, QI attribute assignments, hierarchy, and K value, are retained and considered while performing the auto anonymization. You can also specify the targetVariable that must be considered for obtaining the best possible result set in terms of quality data while performing the anonymization job.

Ensure that you complete the following checks before starting the anonymization job:

  • Verify that the destination file is not in use and that the required permissions are set for creating and modifying the destination file.
  • Ensure that the disk is not full and enough free space is available for saving the destination file.
  • Verify that you have imported the Pythonic SDK, for example, import anonsdk as asdk.

The following table shows the auto anonymization information.

| Using mode to Auto Anonymize Information | Description |
| --- | --- |
| Function | job = asdk.anonymize(e, targetVariable="targetVariable", mode="Auto") |
| Parameters | targetVariable: The field specified here is used as a focus point for performing the anonymization. |
| Return Type | Returns the result set after performing the anonymization job. |
| Sample Request | job = asdk.anonymize(e, targetVariable="date", mode="Auto") |

For more sample requests that you can use, refer to Sample Requests for Protegrity Anonymization.

Note: You can use e.measure() to modify the request and view different outcomes of the result set.

For more information about the measure API, refer to Measure API.

6.2 - Using Infer to Anonymize

Use the Infer API to start auto-detecting the data-domain, classification type, hierarchies, and anonymization configuration in Protegrity Anonymization.

Any user-defined configuration, such as, QI attribute assignments, hierarchy, and K value, are retained and considered while performing the auto anonymization.

Ensure that you complete the following checks before starting the anonymization job:

  • Verify that the destination file is not in use and that the required permissions are set for creating and modifying the destination file.
  • Ensure that the disk is not full and enough free space is available for saving the destination file.
  • Verify that you have imported the Pythonic SDK, for example, import anonsdk as asdk.

The following table shows the auto anonymization information.

| Using Infer to Anonymize Information | Description |
| --- | --- |
| Function | infer(targetVariable) |
| Parameters | targetVariable: The field specified here is used as a focus point for performing the anonymization. |
| Return Type | Returns an anon element with all the detected classifications and hierarchies generated. |
| Sample Request | e.infer(targetVariable='income') |

For more sample requests that you can use, refer to Sample Requests for Protegrity Anonymization.

Note: You can use e.measure() to modify the request and view different outcomes of the result set.

For more information about the measure API, refer to Measure API.

7 - Using Sample Anonymization Jobs

Sample anonymization jobs that you can use for working with and testing Protegrity Anonymization.

7.1 - Sample Data Sets

Use the following dataset to test Protegrity Anonymization. This dataset is comprehensive and can give you thorough insights about working with Protegrity Anonymization.

Adult Dataset: Here is an extract of the dataset; the complete dataset can be found in the adult.csv file in the samples directory.

sex;age;race;marital-status;education;native-country;citizenSince;weight;workclass;occupation;salary-class
Male;39;White;Never-married;Bachelors;United-States;08-01-1971;185.38;State-gov;Adm-clerical;<=50K
Male;50;White;Married-civ-spouse;Bachelors;United-States;19-04-1960;176.32;Self-emp-not-inc;Exec-managerial;<=50K
Male;38;White;Divorced;HS-grad;United-States;07-12-1971;159.13;Private;Handlers-cleaners;<=50K
Male;53;Black;Married-civ-spouse;11th;United-States;22-05-1957;170.45;Private;Handlers-cleaners;<=50K
Female;28;Black;Married-civ-spouse;Bachelors;Cuba;03-02-1982;178.79;Private;Prof-specialty;<=50K
Female;37;White;Married-civ-spouse;Masters;United-States;06-12-1972;161.65;Private;Exec-managerial;<=50K
Female;49;Black;Married-spouse-absent;9th;Jamaica;18-04-1961;162.73;Private;Other-service;<=50K
Male;52;White;Married-civ-spouse;HS-grad;United-States;21-05-1958;171.75;Self-emp-not-inc;Exec-managerial;>50K
Female;31;White;Never-married;Masters;United-States;31-12-1978;164.03;Private;Prof-specialty;>50K
Male;42;White;Married-civ-spouse;Bachelors;United-States;11-02-1968;186.33;Private;Exec-managerial;>50K
Male;37;Black;Married-civ-spouse;Some-college;United-States;06-12-1972;189.49;Private;Exec-managerial;>50K
Male;30;Asian-Pac-Islander;Married-civ-spouse;Bachelors;India;01-02-1980;178.70;State-gov;Prof-specialty;>50K
Female;23;White;Never-married;Bachelors;United-States;08-04-1987;183.22;Private;Adm-clerical;<=50K
Male;32;Black;Never-married;Assoc-acdm;United-States;01-01-1978;156.63;Private;Sales;<=50K
Male;34;Amer-Indian-Eskimo;Married-civ-spouse;7th-8th;Mexico;03-12-1975;173.41;Private;Transport-moving;<=50K
Male;25;White;Never-married;HS-grad;United-States;06-03-1985;170.72;Self-emp-not-inc;Farming-fishing;<=50K
Male;32;White;Never-married;HS-grad;United-States;01-01-1978;174.91;Private;Machine-op-inspct;<=50K
Male;38;White;Married-civ-spouse;11th;United-States;07-12-1971;176.47;Private;Sales;<=50K
Female;43;White;Divorced;Masters;United-States;12-02-1967;179.88;Self-emp-not-inc;Exec-managerial;>50K
Male;40;White;Married-civ-spouse;Doctorate;United-States;09-01-1970;170.80;Private;Prof-specialty;>50K
Female;54;Black;Separated;HS-grad;United-States;23-06-1956;171.61;Private;Other-service;<=50K
Male;35;Black;Married-civ-spouse;9th;United-States;04-12-1974;183.71;Federal-gov;Farming-fishing;<=50K
Male;43;White;Married-civ-spouse;11th;United-States;12-02-1967;158.63;Private;Transport-moving;<=50K
Female;59;White;Divorced;HS-grad;United-States;28-07-1951;181.64;Private;Tech-support;<=50K
Male;56;White;Married-civ-spouse;Bachelors;United-States;25-06-1954;171.80;Local-gov;Tech-support;>50K
Male;19;White;Never-married;HS-grad;United-States;12-05-1991;172.74;Private;Craft-repair;<=50K
Male;39;White;Divorced;HS-grad;United-States;08-01-1971;159.41;Private;Exec-managerial;<=50K
Male;49;White;Married-civ-spouse;HS-grad;United-States;18-04-1961;176.76;Private;Craft-repair;<=50K
Male;23;White;Never-married;Assoc-acdm;United-States;08-04-1987;164.43;Local-gov;Protective-serv;<=50K
Male;20;Black;Never-married;Some-college;United-States;11-05-1990;157.60;Private;Sales;<=50K
Male;45;White;Divorced;Bachelors;United-States;14-03-1965;176.38;Private;Exec-managerial;<=50K
Male;30;White;Married-civ-spouse;Some-college;United-States;01-02-1980;160.60;Federal-gov;Adm-clerical;<=50K
Male;22;Black;Married-civ-spouse;Some-college;United-States;09-04-1988;173.41;State-gov;Other-service;<=50K
Male;48;White;Never-married;11th;Puerto-Rico;17-04-1962;189.50;Private;Machine-op-inspct;<=50K
Male;21;White;Never-married;Some-college;United-States;10-05-1989;162.76;Private;Machine-op-inspct;<=50K
Female;19;White;Married-AF-spouse;HS-grad;United-States;12-05-1991;158.42;Private;Adm-clerical;<=50K
Male;48;White;Married-civ-spouse;Assoc-acdm;United-States;17-04-1962;160.75;Self-emp-not-inc;Prof-specialty;<=50K
Male;31;White;Married-civ-spouse;9th;United-States;31-12-1978;172.10;Private;Machine-op-inspct;<=50K
Male;53;White;Married-civ-spouse;Bachelors;United-States;22-05-1957;189.74;Self-emp-not-inc;Prof-specialty;<=50K
Male;24;White;Married-civ-spouse;Bachelors;United-States;07-04-1986;170.08;Private;Tech-support;<=50K
Female;49;White;Separated;HS-grad;United-States;18-04-1961;173.71;Private;Adm-clerical;<=50K
Male;25;White;Never-married;HS-grad;United-States;06-03-1985;160.52;Private;Handlers-cleaners;<=50K
Male;57;Black;Married-civ-spouse;Bachelors;United-States;26-07-1953;178.12;Federal-gov;Prof-specialty;>50K
Male;53;White;Married-civ-spouse;HS-grad;United-States;22-05-1957;186.11;Private;Machine-op-inspct;<=50K
Female;44;White;Divorced;Masters;United-States;13-02-1966;162.80;Private;Exec-managerial;<=50K
Male;41;White;Married-civ-spouse;Assoc-voc;United-States;10-01-1969;172.39;State-gov;Craft-repair;<=50K
Male;29;White;Never-married;Assoc-voc;United-States;02-02-1981;168.83;Private;Prof-specialty;<=50K
Female;25;Other;Married-civ-spouse;Some-college;United-States;06-03-1985;179.12;Private;Exec-managerial;<=50K
Female;47;White;Married-civ-spouse;Prof-school;Honduras;16-03-1963;163.02;Private;Prof-specialty;>50K
Male;50;White;Divorced;Bachelors;United-States;19-04-1960;172.18;Federal-gov;Exec-managerial;>50K

7.2 - Sample Requests for Protegrity Anonymization

Use the sample requests provided here as templates or guidelines for building your own requests; modify them as needed to anonymize your dataset.

Tree-based Aggregation for Attributes with k-Anonymity

This sample uses the following attributes:

  • Source: Local file system
  • Target: Amazon S3 bucket
  • Data set: 1 Quasi Identifier
  • Suppression: 0.01
  • Privacy Model: k-anonymity with a k value of 50

In this example, the data has custom delimiters.

{
    "source": {
        "type": "File",
        "file": {
            "name": "samples/adult.csv",
            "props": {
                "sep": ";"
            }
        }
    },
    "attributes": [
        {
            "name": "age",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Masking Based",
                "hierarchyType": "Rule",
                "rule": {
                    "masking": {
                        "maskOrder": "Right To Left",
                        "maskChar": "*",
                        "maxDomainSize": 2
                    }
                }
            }
        }
    ],
    "privacyModel": {
        "k": {
            "kValue": 50
        }
    },
    "config": {
        "maxSuppression": 0.01
    },
    "target": {
        "type": "File",
        "file": {
            "name": "s3://<Your-S3-BucketName>/anon-adult-e1.csv",
            "props": {
                "lineterminator": "\n"
            },
            "accessOptions": {
                "key": "<Your-S3-API Key>",
                "secret": "<Your-S3-API Secret>"
            }
        }
    }
}
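The masking rule in this request generalizes a value by replacing characters from the right with the mask character, one additional character per hierarchy level. A minimal sketch of that behavior (an illustration of the concept, not the service's implementation):

```python
def mask_right_to_left(value: str, level: int, mask_char: str = "*") -> str:
    """Mask `level` characters of the value, starting from the right end."""
    if level <= 0:
        return value
    keep = max(len(value) - level, 0)
    return value[:keep] + mask_char * (len(value) - keep)

# For the two-character "age" values in the sample dataset:
# level 0 keeps "39", level 1 yields "3*", level 2 yields "**"
```

With maxDomainSize 2, the coarsest level masks both characters, collapsing all ages into a single value.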
# import the anonsdk library
import anonsdk as asdk
import pandas as pd

# s3 bucket credentials (replace with your own values)
s3_key = "<AWS_Key>"
s3_secret = "<AWS_Secret>"

#set the source path for anonymization
# dataset path
source_csv_path = "adult.csv"
# create Store Object source_datastore
source_datastore = asdk.FileDataStore(source_csv_path)

#Set the target path for anonymized result
# anonymized file path
target_csv_path = "s3://target/anon-adult-e1.csv"
# create Store Object target_datastore
target_datastore = asdk.FileDataStore(target_csv_path, access_options={"key": s3_key,"secret": s3_secret})

# Create connection Object with Rest API server
conn = asdk.Connection("https://anon.protegrity.com/")
df = pd.read_csv(source_csv_path,sep=";")
df.head()

# create AnonObject with connection, dataframe metadata and source path
anon_object = asdk.AnonElement(conn, df, source_datastore)
# configure masking of string datatype
anon_object["age"] = asdk.Gen_Mask(maskchar="*",maskOrder="R",maxLength=2)

#Configure K-anonymity , suppression in the dataset allowed
anon_object.config.k = asdk.K(50)
anon_object.config['maxSuppression'] = 0.01

# Send Anonymization request with Transformation Configuration with the target store
job = asdk.anonymize(anon_object,target_datastore ,force=True)

# check the status of the job; poll iteratively until it reports 'status': 'Completed'
job.status()

# check the comparative risk statistics from the source and result dataset
job.riskStat()

# check the comparative utility statistics from the source and result dataset
job.utilityStat()
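The k-anonymity guarantee requested above can also be sanity-checked independently on the result: every combination of quasi-identifier values must occur at least k times. A minimal sketch of that check using pandas (a hypothetical helper, not part of anonsdk):

```python
import pandas as pd

def min_group_size(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the smallest equivalence-class size over the quasi-identifiers.
    The dataset satisfies k-anonymity for any k <= this value."""
    return int(df.groupby(quasi_identifiers).size().min())

# Example: three records whose generalized age forms two equivalence classes
df = pd.DataFrame({"age": ["3*", "3*", "5*"]})
print(min_group_size(df, ["age"]))  # the smallest class ("5*") has 1 record
```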

Tree-based Aggregation for Attributes with k-Anonymity, l-Diversity, and t-Closeness

This sample uses the following attributes:

  • Source: Local file system
  • Target: Amazon S3 bucket
  • Data set: 4 Quasi Identifiers, 2 Sensitive Attributes
  • Suppression: 0.10
  • Privacy Model: K with value 3, T-closeness with value 0.2, and L-diversity with value 2

In this example, the generalization hierarchy for the race attribute is embedded directly in the request; the other hierarchies are read from files.

{
    "source": {
        "type": "File",
        "file": {
            "name": "samples/adult.csv",
            "props": {
                "sep": ";",
                "decimal": ",",
                "quotechar": "\"",
                "escapechar": "\\",
                "encoding": "utf-8"
            }
        }
    },
    "attributes": [
        {
            "name": "marital-status",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Tree Based",
                "hierarchyType": "Data Store",
                "dataStore": {
                    "type": "File",
                    "format": "CSV",
                    "file": {
                        "name": "samples/hierarchy/adult_hierarchy_marital-status.csv",
                        "props": {
                            "delimiter": ";",
                            "quotechar": "\"",
                            "header": null
                        }
                    }
                }
            }
        },
        {
            "name": "native-country",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Tree Based",
                "hierarchyType": "Data Store",
                "dataStore": {
                    "type": "File",
                    "format": "CSV",
                    "file": {
                        "name": "samples/hierarchy/adult_hierarchy_native-country.csv",
                        "props": {
                            "delimiter": ";",
                            "quotechar": "\"",
                            "header": null
                        }
                    }
                }
            }
        },
        {
            "name": "occupation",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Tree Based",
                "hierarchyType": "Data Store",
                "dataStore": {
                    "type": "File",
                    "format": "CSV",
                    "file": {
                        "name": "samples/hierarchy/adult_hierarchy_occupation.csv",
                        "props": {
                            "delimiter": ";",
                            "quotechar": "\"",
                            "header": null
                        }
                    }
                }
            }
        },
        {
            "name": "race",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Tree Based",
                "hierarchyType": "Data",
                "data": {
                    "hierarchy": [
                        [
                            "White",
                            "*"
                        ],
                        [
                            "Asian-Pac-Islander",
                            "*"
                        ],
                        [
                            "Amer-Indian-Eskimo",
                            "*"
                        ],
                        [
                            "Black",
                            "*"
                        ]
                    ],
                    "defaultHierarchy": [
                        "Other",
                        "*"
                    ]
                }
            }
        },
        {
            "name": "sex",
            "dataType": "String",
            "classificationType": "Sensitive Attribute"
        },
        {
            "name": "salary-class",
            "dataType": "String",
            "classificationType": "Sensitive Attribute"
        }
    ],
    "config": {
        "maxSuppression": 0.10
    },
    "privacyModel": {
        "k": {
            "kValue": 3
        },
        "tcloseness": [
            {
                "name": "salary-class",
                "emdType": "EMD with equal ground distance",
                "tFactor": 0.2
            }
        ],
        "ldiversity": [
            {
                "name": "sex",
                "lFactor": 2,
                "lType": "Distinct-l-diversity"
            }
        ]
    },
    "target": {
        "type": "File",
        "file": {
            "name": "s3://<Your-S3-BucketName>/anon-adult_klt.csv",
            "props": {
                "lineterminator": "\n"
            },
            "accessOptions": {
                "key": "<Your-S3-API Key>",
                "secret": "<Your-S3-API Secret>"
            }
        }
    }
}
# import the anonsdk library
import anonsdk as asdk
import pandas as pd

# s3 bucket credentials (replace with your own values)
s3_key = "<AWS_Key>"
s3_secret = "<AWS_Secret>"

#set the source path for anonymization
# dataset path
source_csv_path = "adult.csv"
# create Store Object source_datastore
source_datastore = asdk.FileDataStore(source_csv_path)

#Set the target path for anonymized result
# anonymized file path
target_csv_path = "s3://target/anon-adult_klt.csv"

# create Store Object target_datastore
target_datastore = asdk.FileDataStore(target_csv_path, access_options={"key": s3_key,"secret": s3_secret})

# Create connection Object with Rest API server
conn = asdk.Connection("https://anon.protegrity.com/")

# create AnonObject with connection, dataframe metadata and source path
df = pd.read_csv(source_csv_path,sep=";")
df.head()
anon_object = asdk.AnonElement(conn, df, source_datastore)

# configuration
hierarchy_marital_status_path = "samples/hierarchy/adult_hierarchy_marital-status.csv"
df_ms = pd.read_csv(hierarchy_marital_status_path, sep=";")
print(df_ms)
anon_object['marital-status'] = asdk.Gen_Tree(df_ms)

hierarchy_native_country_path = "samples/hierarchy/adult_hierarchy_native-country.csv"
df_nc = pd.read_csv(hierarchy_native_country_path, sep=";")
print(df_nc)
anon_object['native-country'] = asdk.Gen_Tree(df_nc)

hierarchy_occupation_path = "samples/hierarchy/adult_hierarchy_occupation.csv"
df_occ = pd.read_csv(hierarchy_occupation_path, sep=";")
print(df_occ)
anon_object['occupation'] = asdk.Gen_Tree(df_occ)

df_race = pd.DataFrame(data={"lvl0": ["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Black", "Other"], "lvl1": ["*", "*", "*", "*", "*"]})
anon_object['race']=asdk.Gen_Tree(df_race)

#Configure K-anonymity , suppression allowed in the dataset
anon_object.config.k = asdk.K(3)
anon_object.config['maxSuppression'] = 0.10

#Configure L-diversity and T-closeness
anon_object["sex"]=asdk.LDiv(lfactor=2)
anon_object["salary-class"]=asdk.TClose(tfactor=0.2)

# Send Anonymization request with Transformation Configuration with the target store
job = asdk.anonymize(anon_object,target_datastore ,force=True)

# check the status of the job
job.status()

# check the comparative risk statistics from the source and result dataset
job.riskStat()

# check the comparative utility statistics from the source and result dataset
job.utilityStat()
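Distinct l-diversity, as configured above for the sex attribute, requires that every equivalence class contains at least l distinct values of the sensitive attribute. A small sketch of that check on the anonymized output (a hypothetical helper, assuming a pandas DataFrame):

```python
import pandas as pd

def distinct_l(df: pd.DataFrame, quasi_identifiers: list, sensitive: str) -> int:
    """Minimum number of distinct sensitive values over all equivalence classes.
    The dataset satisfies distinct l-diversity for any l <= this value."""
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# One equivalence class (race fully generalized) holding two distinct sex values
df = pd.DataFrame({
    "race": ["*", "*", "*", "*"],
    "sex": ["Male", "Female", "Male", "Male"],
})
print(distinct_l(df, ["race"], "sex"))  # prints 2
```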

Micro-Aggregation and Generalization with Aggregates

This sample uses the following attributes:

  • Source: Local file system
  • Target: Amazon S3 bucket
  • Data set: 2 Quasi Identifiers, 1 Aggregation-based Quasi Identifier, 2 Micro Aggregations, and 2 Sensitive Attributes
  • Suppression: 0.50
  • Privacy Model: K with value 5, T-closeness with value 0.2, and L-diversity with value 2
{
    "source": {
        "type": "File",
        "file": {
            "name": "samples/adult.csv",
            "props": {
                "sep": ";"
            }
        }
    },
    "attributes": [
        {
            "name": "age",
            "dataType": "Integer",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Micro Aggregation",
            "aggregateFn": "GMean"
        },
        {
            "name": "marital-status",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Micro Aggregation",
            "aggregateFn": "Mode"
        },
        {
            "name": "native-country",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Tree Based",
                "hierarchyType": "Data Store",
                "dataStore": {
                    "type": "File",
                    "format": "CSV",
                    "file": {
                        "name": "samples/hierarchy/adult_hierarchy_native-country.csv",
                        "props": {
                            "delimiter": ";",
                            "quotechar": "\"",
                            "header": null
                        }
                    }
                }
            }
        },
        {
            "name": "occupation",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Tree Based",
                "hierarchyType": "Data Store",
                "dataStore": {
                    "type": "File",
                    "format": "CSV",
                    "file": {
                        "name": "samples/hierarchy/adult_hierarchy_occupation.csv",
                        "props": {
                            "delimiter": ";",
                            "quotechar": "\"",
                            "header": null
                        }
                    }
                }
            }
        },
        {
            "name": "race",
            "dataType": "String",
            "classificationType": "Quasi Identifier",
            "dataTransformationType": "Generalization",
            "generalization": {
                "type": "Aggregation Based",
                "hierarchyType": "Aggregate",
                "aggregateFn": "Mode"
            }
        },
        {
            "name": "sex",
            "classificationType": "Sensitive Attribute",
            "dataType": "String"
        },
        {
            "name": "salary-class",
            "classificationType": "Sensitive Attribute",
            "dataType": "String"
        }
    ],
    "config": {
        "maxSuppression": 0.50
    },
    "privacyModel": {
        "k": {
            "kValue": 5
        },
        "tcloseness": [
            {
                "name": "salary-class",
                "emdType": "EMD with equal ground distance",
                "tFactor": 0.2
            }
        ],
        "ldiversity": [
            {
                "name": "sex",
                "lType": "Distinct-l-diversity",
                "lFactor": 2
            }
        ]
    },
    "target": {
        "type": "File",
        "file": {
            "name": "s3://<Your-S3-BucketName>/anon-adult_micro.csv",
            "props": {
                "lineterminator": "\n"
            },
            "accessOptions": {
                "key": "<Your-S3-API Key>",
                "secret": "<Your-S3-API Secret>"
            }
        }
    }
}
# import the anonsdk library
import anonsdk as asdk
import pandas as pd

# s3 bucket credentials (replace with your own values)
s3_key = "<AWS_Key>"
s3_secret = "<AWS_Secret>"

#set the source path for anonymization
# dataset path
source_csv_path = "adult.csv"
# create Store Object source_datastore
source_datastore = asdk.FileDataStore(source_csv_path)

#Set the target path for anonymized result
# anonymized file path
target_csv_path = "s3://target/anon-adult_micro.csv"
# create Store Object target_datastore
target_datastore = asdk.FileDataStore(target_csv_path, access_options={"key": s3_key,"secret": s3_secret})

# Create connection Object with Rest API server
conn = asdk.Connection("https://anon.protegrity.com/")
df = pd.read_csv(source_csv_path,sep=";")
df.head()

# create AnonObject with connection, dataframe metadata and source path
anon_object = asdk.AnonElement(conn, df, source_datastore)

# configuration
hierarchy_native_country_path = "samples/hierarchy/adult_hierarchy_native-country.csv"
df_nc = pd.read_csv(hierarchy_native_country_path, sep=";")
print(df_nc)
anon_object['native-country'] = asdk.Gen_Tree(df_nc)

hierarchy_occupation_path = "samples/hierarchy/adult_hierarchy_occupation.csv"
df_occ = pd.read_csv(hierarchy_occupation_path, sep=";")
print(df_occ)
anon_object['occupation'] = asdk.Gen_Tree(df_occ)

# applying aggregation rules
anon_object['age']=asdk.MicroAgg(asdk.AggregateFunction.GMean)
anon_object['race']=asdk.Gen_Agg(asdk.AggregateFunction.Mode)

# applying micro-aggregation rule
anon_object['marital-status']=asdk.MicroAgg(asdk.AggregateFunction.Mode)

#Configure K-anonymity , suppression in the dataset allowed
anon_object.config.k = asdk.K(5)
anon_object.config['maxSuppression'] = 0.50

#Configure L-diversity and T-closeness
anon_object["sex"]=asdk.LDiv(lfactor=2)
anon_object["salary-class"]=asdk.TClose(tfactor=0.2)

# Send Anonymization request with Transformation Configuration with the target store
job = asdk.anonymize(anon_object,target_datastore ,force=True)

# check the status of the job
job.status()

# check the comparative risk statistics from the source and result dataset
job.riskStat()

# check the comparative utility statistics from the source and result dataset
job.utilityStat()
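T-closeness with "EMD with equal ground distance", as used for salary-class above, bounds how far each equivalence class's sensitive-value distribution may drift from the overall distribution. When all ground distances are equal, the earth mover's distance reduces to half the L1 (total variation) distance, which can be sketched directly (an illustration, not the service's implementation):

```python
from collections import Counter

def tv_distance(class_values, all_values):
    """Half the L1 distance between the class and overall value distributions.
    This equals the EMD when every pair of distinct values has ground distance 1."""
    pc = Counter(class_values)
    pa = Counter(all_values)
    nc, na = len(class_values), len(all_values)
    support = set(pc) | set(pa)
    return 0.5 * sum(abs(pc[v] / nc - pa[v] / na) for v in support)

# A class containing only "<=50K" against a 70/30 overall split
overall = ["<=50K"] * 7 + [">50K"] * 3
print(tv_distance(["<=50K", "<=50K"], overall))  # 0.3, so t=0.2 would be violated
```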

Parquet File Format

This sample uses the following attributes:

  • Source: Local file system
  • Target: Amazon S3 bucket in the Parquet format
  • Data set: 4 Quasi Identifiers, 1 Aggregation-based Quasi Identifier, 1 Micro Aggregation, and 1 Sensitive Attribute
  • Suppression: 0.4
  • Privacy Model: K with value 350 and L-diversity with value 2

In this example, rule-based generalization hierarchies (interval, date-range, and masking rules) are defined directly in the request.
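The "Rounding" rule for age in this request defines interval widths (5, 10, 50, 100) as generalization levels: coarser levels map a value into wider ranges. A minimal sketch of interval generalization (the service's exact output format may differ):

```python
def round_to_interval(value: int, width: int) -> str:
    """Generalize a number into the half-open interval of the given width
    that contains it, e.g. 39 with width 10 -> '[30-40)'."""
    lower = (value // width) * width
    return f"[{lower}-{lower + width})"

for width in (5, 10, 50, 100):
    print(width, round_to_interval(39, width))
# width 5 -> "[35-40)", width 10 -> "[30-40)",
# width 50 -> "[0-50)",  width 100 -> "[0-100)"
```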

    {
        "source": {
            "type": "File",
            "file": {
                "name": "samples/adult.csv",
                "props": {
                    "sep": ";",
                    "decimal": ",",
                    "quotechar": "\"",
                    "escapechar": "\\",
                    "encoding": "utf-8"
                }
            }
        },
        "attributes": [
            {
                "name": "age",
                "dataType": "Integer",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "generalization": {
                    "hierarchyType": "Rule",
                    "type": "Rounding",
                    "rule": {
                        "interval": {
                            "levels": [
                                "5",
                                "10",
                                "50",
                                "100"
                            ],
                            "lowerBound":"5",
                            "upperBound":"100"
                        }
                    }
                }
            },
            {
                "name": "marital-status",
                "dataType": "String",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Micro Aggregation",
                "aggregateFn": "Mode"
            },
            {
                "name": "citizenSince",
                "dataType": "Date",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "generalization": {
                    "type": "Rounding",
                    "hierarchyType": "Rule",
                    "rule": {
                        "daterange": {
                            "levels": [
                                "WD.M.Y",
                                "FD.M.Y",
                                "QTR.Y",
                                "Y"
                            ]
                        }
                    }
                },
                "props": {
                    "dateformat": "dd-mm-yyyy"
                }
            },
            {
                "name": "occupation",
                "dataType": "String",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "generalization": {
                    "type": "Tree Based",
                    "hierarchyType": "Data Store",
                    "dataStore": {
                        "type": "File",
                        "format": "CSV",
                        "file": {
                            "name": "samples/hierarchy/adult_hierarchy_occupation.csv",
                            "props": {
                                "delimiter": ";",
                                "quotechar": "\"",
                                "header": null
                            }
                        }
                    }
                }
            },
            {
                "name": "race",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "dataType": "String",
                "generalization": {
                    "type": "Aggregation Based",
                    "hierarchyType": "Aggregate",
                    "aggregateFn": "Mode"
                }
            },
            {
                "name": "salary-class",
                "dataType": "String",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "generalization": {
                    "type": "Masking Based",
                    "hierarchyType": "Rule",
                    "rule": {
                        "masking": {
                            "maskOrder": "Left To Right",
                            "maskChar": "*",
                            "maxDomainSize": 3
                        }
                    }
                }
            },
            {
                "name": "sex",
                "dataType": "String",
                "classificationType": "Sensitive Attribute"
            }
        ],
        "config": {
            "maxSuppression": 0.4,
            "redactOutliers": true,
            "suppressionData": "Any"
        },
        "privacyModel": {
            "k": {
                "kValue": 350
            },
            "ldiversity": [
                {
                    "name": "sex",
                    "lType": "Distinct-l-diversity",
                    "lFactor": 2
                }
            ]
        },
        "target": {
            "type": "File",
            "file": {
                "name": "s3://<Your-S3-BucketName>/anon-adult-rules",
                "format": "Parquet",
                "accessOptions": {
                    "key": "<Your-S3-API Key>",
                    "secret": "<Your-S3-API Secret>"
                }
            }
        }
    }
This example is not applicable to the SDK functions.

Retaining and Redacting

This sample uses the following attributes:

  • Source: Local file system
  • Target: Amazon S3 bucket in the Parquet format
  • Data set: 2 Quasi Identifiers, 1 Aggregation-based Quasi Identifier, 1 Micro Aggregation, 1 Non-Sensitive Attribute, 1 Identifying Attribute, and 2 Sensitive Attributes
  • Suppression: 0.10
  • Privacy Model: K with value 200 and L-diversity with value 2

In this example, the generalization hierarchy for the age attribute is defined as a rule directly in the request.
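This request mixes attributes that are retained unchanged in the output with attributes that are redacted entirely. A minimal sketch of the two operations (an illustration of the concept, not the service's implementation):

```python
def redact(value, mask_char: str = "*") -> str:
    """Replace a value entirely, preserving only its length."""
    return mask_char * len(str(value))

def retain(value):
    """Pass a value through to the output unchanged."""
    return value

print(redact("08-01-1971"))  # "**********" (10 characters)
print(retain("Bachelors"))   # "Bachelors"
```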

    {
        "source": {
            "type": "File",
            "file": {
                "name": "samples/adult.csv",
                "props": {
                    "sep": ";",
                    "decimal": ",",
                    "quotechar": "\"",
                    "escapechar": "\\",
                    "encoding": "utf-8"
                }
            }
        },
        "attributes": [
            {
                "name": "age",
                "dataType": "Integer",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "generalization": {
                    "type": "Rounding",
                    "hierarchyType": "Rule",
                    "rule": {
                        "interval": {
                            "levels": [
                                "5",
                                "10",
                                "50",
                                "100"
                            ]
                        }
                    }
                }
            },
            {
                "name": "marital-status",
                "dataType": "String",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Micro Aggregation",
                "aggregateFn": "Mode"
            },
            {
                "name": "occupation",
                "dataType": "String",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "generalization": {
                    "type": "Tree Based",
                    "hierarchyType": "Data Store",
                    "dataStore": {
                        "type": "File",
                        "format": "CSV",
                        "file": {
                            "name": "samples/hierarchy/adult_hierarchy_occupation.csv",
                            "props": {
                                "delimiter": ";",
                                "quotechar": "\"",
                                "header": null
                            }
                        }
                    }
                }
            },
            {
                "name": "race",
                "dataType": "String",
                "classificationType": "Quasi Identifier",
                "dataTransformationType": "Generalization",
                "generalization": {
                    "type": "Aggregation Based",
                    "hierarchyType": "Aggregate",
                    "aggregateFn": "Mode"
                }
            },
            {
                "name": "citizenSince",
                "dataType": "Date",
                "classificationType": "Identifying Attribute"
            },
            {
                "name": "education",
                "dataType": "String",
                "classificationType": "Non-Sensitive Attribute"
            },
            {
                "name": "salary-class",
                "dataType": "String",
                "classificationType": "Sensitive Attribute"
            },
            {
                "name": "sex",
                "dataType": "String",
                "classificationType": "Sensitive Attribute"
            }
        ],
        "config": {
            "maxSuppression": 0.10,
            "suppressionData": "Any"
        },
        "privacyModel": {
            "k": {
                "kValue": 200
            },
            "ldiversity": [
                {
                    "name": "sex",
                    "lType": "Distinct-l-diversity",
                    "lFactor": 2
                },
                {
                    "name": "salary-class",
                    "lType": "Distinct-l-diversity",
                    "lFactor": 2
                }
            ]
        },
        "target": {
            "type": "File",
            "file": {
                "name": "s3://<Your-S3-BucketName>/anon-adult_retd",
                "format": "Parquet",
                "accessOptions": {
                    "key": "<Your-S3-API Key>",
                    "secret": "<Your-S3-API Secret>"
                }
            }
        }
    }
# import the anonsdk library
import anonsdk as asdk
import pandas as pd

# s3 bucket credentials
s3_key = "<AWS_Key>"
s3_secret = "<AWS_Secret>"

# set the source path for anonymization
# dataset path
source_csv_path = "adult.csv"
# create Store Object source_datastore
source_datastore = asdk.FileDataStore(source_csv_path)

# Set the target path for anonymized result
# anonymized file path
target_csv_path = "s3://target/anon-adult_retd"

# create Store Object target_datastore
target_datastore = asdk.FileDataStore(target_csv_path, access_options={"key": s3_key, "secret": s3_secret})

# Create connection Object with Rest API server
conn = asdk.Connection("https://anon.protegrity.com/")
df = pd.read_csv(source_csv_path, sep=";")
df.head()

# create AnonObject with connection, dataframe metadata and source path
anon_object = asdk.AnonElement(conn, df, source_datastore)

# configuration
hierarchy_occupation_path = "samples/hierarchy/adult_hierarchy_occupation.csv"
df_occ = pd.read_csv(hierarchy_occupation_path, sep=";")
print(df_occ)
anon_object['occupation'] = asdk.Gen_Tree(df_occ)
anon_object['marital-status'] = asdk.MicroAgg(asdk.AggregateFunction.Mode)
anon_object['race'] = asdk.Gen_Agg(asdk.AggregateFunction.Mode)
anon_object['age'] = asdk.Gen_Interval([5, 10, 50, 100])
anon_object['citizenSince'] = asdk.Preserve()
anon_object['education'] = asdk.Preserve()
anon_object['salary-class'] = asdk.Redact()
anon_object['sex'] = asdk.Redact()

# Configure K-anonymity , suppression in the dataset allowed
anon_object.config.k = asdk.K(200)
anon_object.config['maxSuppression'] = 0.10

# Configure L-diversity
anon_object["sex"] = asdk.LDiv(lfactor=2)
anon_object["salary-class"] = asdk.LDiv(lfactor=2)

# Send Anonymization request with Transformation Configuration with the target store
job = asdk.anonymize(anon_object, target_datastore, force=True)

# check the status of the job
job.status()

# check the comparative risk statistics from the source and result dataset
job.riskStat()

# check the comparative utility statistics from the source and result dataset
job.utilityStat()

7.3 - Samples for cloud-related source and destination files

Code for specifying the source and destination for AWS and Azure.
For AWS S3:

"source": {
      "type": "File",
      "file": {
        "name": "s3://<path_to_dataset>",
        "accessOptions": {
            "key": "API Key",
            "secret": "Secret Key"
        }
      }
    }
  
For Azure Data Lake Storage (adl):

"source": {
      "type": "File",
      "file": {
        "name": "adl://<path-to-dataset>",
        "accessOptions": {
            "tenant_id": "<Tenant_ID>",
            "client_id": "<Client_ID>",
            "client_secret": "<Client_Secret_Key>"
        }
      }
    }
  
For Azure Blob File System (abfs):

"source": {
    "type": "File",
    "file": {
      "name": "abfs://<path_to_source_file>",
      "accessOptions": {
        "account_name": "<account_name>",
        "account_key": "<Account_key>"
      }
    },
    "format": "CSV"
  }
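
For destinations, a target block mirrors the source structure. The following S3 example matches the target block from the job configuration shown earlier in this guide:

```json
"target": {
    "type": "File",
    "file": {
        "name": "s3://<Your-S3-BucketName>/anon-adult_retd",
        "format": "Parquet",
        "accessOptions": {
            "key": "<Your-S3-API Key>",
            "secret": "<Your-S3-API Secret>"
        }
    }
}
```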

8 - Additional Information

Additional information to help you using the product.

8.1 - Best practices when using Protegrity Anonymization

Suggestions for using Protegrity Anonymization efficiently.
  • Ensure that the source file is clean based on the following checks:

    • Each column contains valid values of the expected type. For example, a numeric field, such as salary, must not contain text values.
    • The text matches the selected encoding. Special characters, or characters that cannot be processed, must not be present in the source file.
  • Move the anonymized data file and the logs generated to a different system before deleting your environment.

  • The maximum dataframe size that can be attached to an anonymization job is 100 MB.

    To process a larger dataset, use one of the supported cloud storage options.

  • Run a maximum of 5 anonymization jobs in Protegrity Anonymization: A maximum of 5 jobs can be placed on the Protegrity Anonymization queue for adequate utilization of resources. If more jobs are submitted, jobs beyond the initial 5 are rejected and not processed. If required, increase the maximum limit for the JOB_QUEUE_SIZE parameter in the config.yaml file. For Docker, update the config-docker.yaml file.

  • Protegrity Anonymization accepts a maximum of 60 requests per minute: Protegrity Anonymization can accept a maximum of 60 requests per minute. If more than 60 requests are raised, the excess requests are rejected and not processed. If required, increase the maximum limit for the DEFAULT_API_RATE_LIMIT parameter in the config.yaml file. For Docker, update the config-docker.yaml file.
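
The first cleanliness check above can be automated before submitting a job. This is an illustrative, standalone sketch using pandas; the salary column and its values are placeholders, not part of the product:

```python
import pandas as pd

# Illustrative pre-flight check: find text values in a column that should be numeric.
# "salary" and its values are placeholders; adapt them to your dataset.
df = pd.DataFrame({"salary": ["52000", "61000", "unknown", "48000"]})

# Coercing to numeric turns non-numeric entries into NaN, which exposes the bad rows.
numeric = pd.to_numeric(df["salary"], errors="coerce")
bad_rows = df[numeric.isna() & df["salary"].notna()]
print(bad_rows.index.tolist())  # rows to clean before anonymization
```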

8.2 - Protegrity Anonymization Risk Metrics

This section describes how the risk metrics are derived. It details the descriptions and the equations used to calculate the risk.

Definitions

The following definitions are used for risk calculations:

  • Data Provider or Custodian: The custodian of the data, responsible for controlling the sharing process by anonymizing the data and putting in place other controls that prevent the data from being misused or re-identified.
  • Data Recipient: The person or institution who receives the data from the data provider.
  • Dataset: The collection of all records containing the data on subjects.
  • Adversary: A data recipient who has the motive to attempt, and the means to succeed at, re-identifying the data, and who intends to use the data in ways that may be harmful to the individuals in the dataset.
  • Target: The person in the dataset whom the adversary has selected as the focus of the re-identification attempt.

Types of risks

Protegrity Anonymization uses the Prosecutor, Journalist, and Marketer risk models to assess the probability of re-identification attacks. These risks are described here.

  • Prosecutor Risk: The adversary knows that the target is in the dataset. The fact that the target is part of the dataset increases the risk of successful re-identification.
  • Journalist Risk: The adversary does not know for certain that the target is in the dataset.
  • Marketer Risk: The adversary attempts to re-identify as many subjects in the dataset as possible. If an individual subject can be re-identified, then multiple subjects can also be re-identified.

Relationship between the three risks

Prosecutor Risk >= Journalist Risk >= Marketer Risk

If the dataset is protected against the prosecutor or the journalist risk, whichever applies given the adversary's knowledge of the target's participation, then it is by implication also protected against the marketer risk.

Measuring Risks

This section details the strategy used by Protegrity Anonymization to calculate risks.

For calculating risks, the population is the entire pool from which the sample dataset is drawn. In the current calculation of the risk metrics, the population considered is the same as the sample. For the journalist calculation, it is preferable to take the population from the larger dataset from which the sample is drawn.

The following annotations are used in the calculations:

  • Ra is the proportion of records with a risk above the threshold, that is, the records at highest risk.
  • Rb is the maximum probability of re-identification, that is, the maximum risk.
  • Rc is the proportion of records that can be re-identified on average, that is, the success rate of re-identification.

As part of the risk calculations, anonymization API calculates the following metrics:

  • pRa is the prosecutor risk for the records at highest risk.
  • pRb is the maximum prosecutor risk.
  • pRc is the prosecutor success rate.
  • jRa is the journalist risk for the records at highest risk.
  • jRb is the maximum journalist risk.
  • jRc is the journalist success rate.
  • mRc is the marketer success rate.

Risk Type: Prosecutor

Equations:

pRa = (1/n) Σj fj · I(1/fj > T)
pRb = 1 / min(fj)
pRc = |J| / n

Notes:

  • fj is the size of equivalence class j in the sample.
  • Fj is the size of equivalence class j in the population.
  • fj = Fj if the sample is the same as the population.
  • n is the number of records in the sample.
  • |J| is the number of equivalence classes in the sample.
  • I is the indicator function, which returns 1 when its condition holds and 0 otherwise.
  • T is the risk threshold, which is the highest allowable probability of correctly re-identifying a single record. The value of T is 0.1 by default and can be configured.

Risk Type: Journalist

Equations:

jRa = (1/n) Σj fj · I(1/Fj > T)
jRb = 1 / min(Fj)
jRc = max(|J| / N, (1/n) Σj fj / Fj)

Notes:

  • fj is the size of equivalence class j in the sample.
  • Fj is the size of equivalence class j in the population.
  • fj = Fj if the sample is the same as the population.
  • n is the number of records in the sample; N is the number of records in the population.
  • |J| is the number of equivalence classes in the population.
  • T is the risk threshold. The value of T is 0.1 by default and can be configured.

Risk Type: Marketer

Equation:

mRc = (1/n) Σj fj / Fj

Notes:

  • n is the number of records in the sample.
  • fj is the size of equivalence class j in the sample.
  • Fj is the size of equivalence class j in the population.
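
The formulas above can be sketched in Python. This is an illustrative implementation, not part of the product; it assumes the equivalence-class sizes fj (sample) and Fj (population) have already been computed and are passed as aligned lists:

```python
# Sketch of the prosecutor, journalist, and marketer risk formulas described above.
# f: equivalence-class sizes in the sample; F: the matching class sizes in the population.
def risk_metrics(f, F, T=0.1):
    n = sum(f)  # number of records in the sample
    N = sum(F)  # number of records in the population
    return {
        "pRa": sum(fj for fj in f if 1 / fj > T) / n,   # proportion of records above threshold
        "pRb": 1 / min(f),                              # maximum re-identification probability
        "pRc": len(f) / n,                              # success rate: |J| / n
        "jRa": sum(fj for fj, Fj in zip(f, F) if 1 / Fj > T) / n,
        "jRb": 1 / min(F),
        "jRc": max(len(F) / N, sum(fj / Fj for fj, Fj in zip(f, F)) / n),
        "mRc": sum(fj / Fj for fj, Fj in zip(f, F)) / n,
    }

# When the sample is the same as the population (fj = Fj), the prosecutor and
# journalist metrics coincide, consistent with the notes above.
print(risk_metrics([4, 3, 2, 1], [4, 3, 2, 1], T=0.1))
```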

Measuring Journalist Risk

For Journalist Risk to apply, the published dataset must be a sample drawn from a larger population.

There are two general types of re-identification attacks under journalist risk:

  • The adversary is targeting a specific individual.
  • The adversary is targeting any individual.

In a journalist attack, the adversary matches the published dataset against another identification dataset, such as a voter registry or all patient data in a hospital.

The identification dataset represents the population of which the published dataset is a sample.

In the following example, the published sample dataset is drawn from the identification dataset.

Derived Risk Metric | Equation | Risk Value
jRa | (1/n) Σj fj · I(1/Fj > T) | 0
jRb | 1 / min(Fj) | 0.25
jRc | max(|J| / N, (1/n) Σj fj / Fj) | 0.13

Calculation of jRa:

  1. The T value is 0.33. The sizes of the equivalence classes in the identification dataset are 10, 8, 14, 4, and 2.
  2. The indicator function returns 1 when the value 1/Fj is greater than T, and 0 otherwise.
  3. The indicator function returns 0, 0, 0, 0, 1.
  4. The equivalence-class sizes in the sample are 4, 3, 2, and 1; the identification-dataset class of size 2 is not represented in the sample.
  5. The values of equivalence-class size / number of records in the sample are 0.4, 0.3, 0.2, and 0.1.
  6. The products of these values with the corresponding indicator values are 0, 0, 0, and 0.
  7. The value of jRa is 0.

Calculation of jRb:

  1. The minimum equivalence-class size in the identification dataset, among the classes that appear in the sample, is 4.
  2. The value of jRb is 1/4 = 0.25.

Calculation of jRc:

  1. The number of equivalence classes in the identification dataset is 5.
  2. The total number of records in the identification dataset is 38.
  3. Number of equivalence classes / total records = 5/38 = 0.131.
  4. The ratios of equivalence-class sizes in the sample to those in the identification dataset are 0.4, 0.375, 0.142857, and 0.25.
  5. The total of these values is approximately 1.16.
  6. This total divided by the total records in the sample = 1.16 / 10 = 0.116.
  7. Max(0.131, 0.116) = 0.131.

Measuring Marketer Risk

The use case for deriving the marketer risk is shown here.

Derived Risk Metric | Equation | Risk Value
mRc | (1/n) Σj fj / Fj | 0.116

Calculation of mRc:

  1. The ratios of equivalence-class sizes in the sample to those in the identification dataset are 0.4, 0.375, 0.142857, and 0.25.
  2. The total of these values is approximately 1.16.
  3. This total divided by the total records in the sample = 1.16 / 10 = 0.116.
  4. The value of the marketer risk is 0.116.
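
The journalist and marketer figures above can be reproduced with a few lines of arithmetic. The class sizes and T value are taken from the worked example (sample classes 4, 3, 2, 1; identification-dataset classes 10, 8, 14, 4, plus one unsampled class of size 2):

```python
f = [4, 3, 2, 1]       # equivalence-class sizes in the published sample (n = 10)
F = [10, 8, 14, 4]     # matching class sizes in the identification dataset
F_all = F + [2]        # the identification dataset has one further, unsampled class
n, N, T = sum(f), sum(F_all), 0.33

# jRa: no matched class is small enough for 1/Fj to exceed T, so the sum is 0.
jRa = sum(fj for fj, Fj in zip(f, F) if 1 / Fj > T) / n
# jRb: the minimum class size among the classes that appear in the sample is 4.
jRb = 1 / min(F)
# jRc: max of (classes / population records) and the average sample/population ratio.
jRc = max(len(F_all) / N, sum(fj / Fj for fj, Fj in zip(f, F)) / n)
# mRc: the average sample/population ratio on its own.
mRc = sum(fj / Fj for fj, Fj in zip(f, F)) / n

print(jRa, jRb, round(jRc, 3), round(mRc, 3))
```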

8.3 - AWS Checklist

List of variables to configure AWS account.

Update the table with values from your AWS account to configure the Protegrity Anonymization API.

Table: CLI Installation

Variable | Value | Obtain from
AWS Access Key ID |  | AWS > IAM > Users > <user_name> > Security credentials > Access key ID
AWS Secret Access Key |  | https://aws.amazon.com/blogs/security/how-to-find-updateaccess-keys-password-mfa-awsmanagement-console/
Default region name |  | AWS > EC2 > Region name from the upper-right corner
Default output format | json |
metadata |  | AWS > EC2 > Region name from the upper-right corner
  name |  | Specify a name
  region |  |
  vpc |  |
    id |  | AWS > EC2 > Instance_Id > Networking > VPC ID
    cidr |  | AWS > EC2 > Instance_Id > VPC_Id > IPv4 CIDR
  subnets |  |
    private |  |
      us-east-1a |  | AWS > VPC > Subnets > Subnet > Availability Zone
        id |  | AWS > VPC > Subnets > Subnet > Subnet ID
        cidr |  | AWS > VPC > Subnets > Subnet > IPv4 CIDR
      us-east-1b |  | AWS > VPC > Subnets > Subnet > Availability Zone
        id |  | AWS > VPC > Subnets > Subnet > Subnet ID
        cidr |  | AWS > VPC > Subnets > Subnet > IPv4 CIDR
  nodeGroups |  |
    securityGroups |  |
      attachIDs |  | AWS > VPC > Security Groups > security_group > Security group ID

8.4 - Working with Certificates

Commands to work with and troubleshoot certificate-related issues.

Use the commands provided in this section to work with and troubleshoot any certificate-related issues.

  • Verify the certificate and view the certificate information.

    openssl verify -verbose -CAfile cacert.pem server.crt
    
  • Check a certificate and view information about the certificate, such as, signing authority, expiration date, and other certificate-related information.

    openssl x509 -in server.crt -text -noout
    
  • Check the SSL key and verify the key for consistency.

    openssl rsa -in server.key -check
    
  • Verify the CSR and view the CSR data that was entered when generating the certificate.

    openssl req -text -noout -verify -in server.csr
    
  • Verify that the certificate and the corresponding key match by displaying the md5 checksums of their moduli. The checksums can then be compared: if they are identical, the certificate and key match.

    openssl x509 -noout -modulus -in server.crt | openssl md5
    openssl rsa -noout -modulus -in server.key | openssl md5
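
As a quick sanity check, those two checksum commands can be wrapped in a comparison. The sketch below generates a throwaway key and a self-signed certificate purely for demonstration; substitute your own server.crt and server.key:

```shell
# Generate a throwaway key and a self-signed certificate from it (demo only).
openssl genrsa -out /tmp/demo.key 2048 2>/dev/null
openssl req -new -x509 -key /tmp/demo.key -out /tmp/demo.crt -days 1 -subj "/CN=demo"

# The certificate and key match when the md5 digests of their moduli are identical.
crt_md5=$(openssl x509 -noout -modulus -in /tmp/demo.crt | openssl md5)
key_md5=$(openssl rsa -noout -modulus -in /tmp/demo.key | openssl md5)
[ "$crt_md5" = "$key_md5" ] && echo "certificate and key match" || echo "MISMATCH"
```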
    

8.5 - values.yaml

Configuration for setting up the Protegrity Anonymization API.

The values.yaml file contains the configuration for setting up the Protegrity Anonymization API. Use the template provided with the Protegrity Anonymization API or copy the following code to a .yaml file and modify it as per your requirements before running it.

## PREREQUISITES
## Create separate namespace. Eg: kubectl create ns anon-ns. Update your namespace name in values.yaml.

## Running all pods in the namespace specific for Protegrity Anonymization API
namespace:
  name: anon-ns                           # Update the namespace if required.

## Prerequisite for setting up Database and Minio Pod.
## This is to handle any new DB pod getting created that uses the same persistence storage in case the running Database pod gets disrupted.
## This persistence also helps persist Anon-storage data.
persistence:
  ## 1. Get the list of nodes in the cluster. CMD: kubectl get nodes
  ## 2. Get the node name which is running in the same zone where the external-storage is created. CMD: kubectl describe nodes
  nodename: "<Node_name>"                    # Update the Node name

  ## Fetch the zone in which the node is running using the `kubectl describe node/nodename` command or the following command.
  ## CMD: ` kubectl describe node/<nodename> | grep topology.kubernetes.io/zone | grep -oP 'topology.kubernetes.io/zone=\K[^ ]+' `
  zone: "<Zone in which above Node is running>"

  ## For EKS cluster, supply the volumeID of the aws-ebs
  ## For AKS cluster, supply the subscriptionID of the azure-disk
  dbstorageId: "<Provide dbstorage ID>"           # To persist database schemas.
  anonstorageId: "<Provide anonstorage ID>"       # To persist Anonymized data.
  notebookstorageId: "<Provide Notebookstorage ID>" # To persist User created notebooks.

  fsType: ext4

anonstorage:
  ## Refer the following command for creating your own secret.
  ## CMD: kubectl create secret generic my-minio-secret --from-literal=rootUser=foobarbaz --from-literal=rootPassword=foobarbazqux
  existingSecret: ""                # Supply your secret Name for ignoring below default credentials.
  bucket_name: "anonstorage"        # Default bucket name for minio
  secret:
    name: "storage-creds"           # Secret to access minio-server
    access_key: "anonuser"          # Access key for minio-server
    secret_key: "protegrity"        # Secret key for minio-server

## This section is required if the image is getting pulled from the Azure Container Registry
## create image pull secrets and specify the name here.
## remove the [] after 'imagePullSecrets:' once you specify the secrets
#imagePullSecrets: []
#  - name: regcred

image:
  minio_repo: quay.io/minio/minio                    # Public repo path for Minio Image.
  minio_tag: RELEASE.2022-10-29T06-21-33Z            # Tag name for Minio image.

  repository: <Repo_path>                            # Repo path for the Container Registry in Azure, GCP, AWS.
  anonapi_tag: <AnonImage_tag>                       # Tag name of the ANON-API Image.
  anonworkstation_tag: <WorkstationImage_tag>        # Tag name of the ANON-Workstation Image.
  syndataapi_tag: <SyntheticDataImage_tag>           # Tag name for synthetic Image.
  mlflow_tag: <MlflowImage_tag>                       # Tag name for Mlflow Image.

  pullPolicy: Always

## Refer to the section in the documentation for setting up and configuring NGINX-INGRESS before deploying the application.
ingress:
  ## Add the host section with the hostname used as CN while creating server certificates.
  ## While creating the certificates you can use *.protegrity.com as CN and SAN as used in the below example
  anonhost: anon.protegrity.com                  # Update the host according to your server certificates.
  sdatahost: syndata.protegrity.com

  ## To terminate TLS on the Ingress Controller Load Balancer.
  ## K8s TLS Secret containing the certificate and key must be provided.
  secret: anon-protegrity-tls                # Update the secretName according to your secretName.

  ## To validate the client certificate with the above server certificate
  ## Create the secret of the CA certificate used to sign both the server and client certificate as shown in the example below
  ca_secret: ca-protegrity                    # Update the ca-secretName according to your secretName.

  ingress_class: nginx-anon
  ## IP Address of Ingress Server
  ## CMD: kubectl get service -n nginx
  ingressIP: <IP Address of Ingress Server>       # Specify the external IP address obtained from above command.
  ## ingress connection timeout (connect/read/send time out interval)
  timeout: 600
## Typically the deployment includes checksums of secrets/config,
## So that when these change on a subsequent helm install, the deployment/statefulset
## is restarted, so set to "true" to disable this behaviour.
ignoreChartChecksums: false

####################### WORKER CONFIGURATIONS #########################
## Increase the number of worker pods as per your requirement
workers:
  hpa: anon-worker-hpa
  labels:
    app: dask-worker
  replicaCount: 1

## Resources defined for the worker pod
  worker_resources:
    requests:
      cpu: 2
      memory: 6Gi
    limits:
      cpu: 2
      memory: 6Gi

## Specs with which worker container should start
  containerSpecs:
    memLimit: "6G"
    nthreads: 2

## Worker pod env to read values from configMap manifest.
## A config Map(wrkr-specs) is used to set these values.
  workerPodEnv:
    - name: worker_mem_limit
      valueFrom:
        configMapKeyRef:
          name: wrkr-specs
          key: worker-mem-limit
    - name: num_threads
      valueFrom:
        configMapKeyRef:
          name: wrkr-specs
          key: num-threads

  autoscaling:
    minReplicas: 1                        # Min number of worker pods which will be running when the cluster starts.
    maxReplicas: 3                        # Max number of worker pods which will autoscale in the cluster.
    targetMemoryThreshold: 4Gi            # Threshold memory-load beyond which worker pods will autoscale.

## FOR MORE INFO ABOUT PROCESSING LARGE DATASETS REFER TO THE DOCUMENTATION
########################################################################

## Create the volumes and specify the names here.
## remove the [] after 'volumes:' once you specify volumes
volumes: []
  #- name: gcs-secret             ##This secret is used when user wants to read and write data to a Google cloud storage Refer DOC.
    #secret:
      #secretName: adc-gcs-creds

## Create the volumeMounts and specify the names here.
## remove the [] after 'volumeMounts:' once you specify volumeMounts
volumeMounts: []
  #- name: gcs-secret
    #mountPath: /home/anonuser/gcs

## Creating a service account for Anonymization
serviceaccount:
  name: anon-service-account

## Setting the pod security context
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

# Configure the delays for Liveness Probe here
livenessProbe:
  initialDelaySeconds: 50
  periodSeconds: 40

#Configure the delays for Readiness Probe here
readinessProbe:
  initialDelaySeconds: 15
  periodSeconds: 20

## MLFLOW-APP ##
mlflow:
  name: mlflow-depl
  service:
    name: mlflow-svc
    mlflowPort: 8200
    labels:
      appname: mlflow

## SYNDATA-APP ##
syndataapp:
  name: syndata-app-depl
  service:
    name: syndata-app-svc
    syndataPort: 8095
    labels:
      appname: syndataapp

## ANON-APP ##
anonapp:
  name: anon-app-depl
  service:
    name: anon-app-svc
    anonPort: 8090
    labels:
      appname: anonapp
  loglevel: INFO                            # To get logs at DEBUG: Set loglevel to DEBUG and do helm upgrade

## ANON-DATABASE ##
database:
  name: anon-db-depl
  labels:
    app: anon-db
  service:
    name: anon-db-svc
    dbport: 5432
  persistence:    ## Persistence Volume size
    pvName: anon-db-pv
    pvcName: anon-db-pvc
    accessMode: ReadWriteOnce
    storageDB:
      size: 20Gi

## ANON-WORKSTATION ##
anonlab:
  name: anon-workstation-depl
  labels:
    app: anon-lab
  service:
    name: anon-lab-svc
    labport: 8888
  persistence:
    pvName: anon-nb-pv
    pvcName: anon-nb-pvc
    accessMode: ReadWriteOnce
    size: 2Gi

## ANON-DASK ##
dask:
  scheduler:
    name: anon-scheduler-depl
  worker:
    name: anon-worker-depl
  service:
    name: anon-dask-svc
    daskMasterPort: 8786
    daskUiPort: 8787
    labels:
      appname: dask

## ANON-STORAGE ##
storage:
  persistence:
    ## Path where PV would be mounted on the MinIO Pod
    mountPath: "/data"
    volumeName: "anon-storage-pv"
    claimName: "anon-storage-pvc"
    accessMode: ReadWriteOnce
    size: 20Gi
  service:
    name: anon-minio-svc
    port: 8100
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"
  resources:
    requests:
      memory: 2Gi
      cpu: 1
  certsPath: "/etc/minio/certs/"
  configPathmc: "/etc/minio/mc/"

8.6 - Setting up logging for the Protegrity Anonymization API

Steps to set up logging for the Protegrity Anonymization API.

Logging helps you see the tasks being performed on the system. It is especially helpful for tracing and resolving configuration errors, and for confirming that the software is processing a request and has not stalled. Set up logging for the Protegrity Anonymization API if you require it. When logging is enabled, the Protegrity Anonymization API captures its internal processing in a log file that you can view for further analysis. Update and use the script files provided here as per your requirements.

Note: This is an alternative way of obtaining logs.

  1. Navigate to the machine where the Protegrity Anonymization API is set up.

  2. Use the Anon_logs.sh script to pull the logs for the task being performed in the Protegrity Anonymization API pod.

  3. Assign the appropriate permissions and run the Anon_logs.sh script.

    chmod +x Anon_logs.sh
    ./<path_to_script>/Anon_logs.sh
    

8.7 - Enabling Custom Certificates from SDK

Steps to set up the certificates.

Protegrity Anonymization API uses certificates for secure communication with the client. You can use the certificates provided by Protegrity or use your own certificates. Complete the configurations provided in this section to use your custom certificates with the SDK.

Before you begin

Ensure that the certificates and keys are in the .pem format.

Note: If you want to use the default Protegrity certificates for the Protegrity Anonymization API, then skip the steps to set up the certificates provided in this section.

  1. Complete the configuration on the machine where the Protegrity Anonymization API SDK will be used.
    a. Create a directory named .pty_anon in the directory from which the SDK will run.
    b. Create a certs directory in the .pty_anon directory.
    c. Create a generated-certs directory in the certs directory.
    d. Create a ca-cert directory in the generated-certs directory.
    e. Create a cert directory in the generated-certs directory.
    f. Create a key directory in the generated-certs directory.
    g. Copy the client certificates and key to the respective directories in the .pty_anon/certs/generated-certs directory.
    The directory structure will be as follows:

    .pty_anon/certs/generated-certs/ca-cert/CA-xyz-cert.pem
    .pty_anon/certs/generated-certs/key/xyz-key.pem
    .pty_anon/certs/generated-certs/cert/xyz-cert.pem
    

    Make sure that you are using valid certificates. Users can validate the certificates using the commands provided in the section Working with certificates.

    h. Create a config.yaml file in the .pty_anon directory with the following Ingress Endpoint defined under CLUSTER_ENDPOINT. The BUCKET_NAME, ACCESS_KEY, and SECRET_KEY are the default details that are used to communicate with the MinIO container for reading and writing files from SDK.

    STORAGE:
      CLUSTER_ENDPOINT: https://anon.protegrity.com/
      BUCKET_NAME: 'anonstorage'
      ACCESS_KEY: 'anonuser'
      SECRET_KEY: 'protegrity'
    

    Note: Ensure that you replace anon.protegrity.com with your host name specified in values.yaml. Also, ensure that you update the default credentials if you have used your own secret.

  2. Update the hosts file.
    a. Login to the machine where the Protegrity Anonymization API SDK will be used.
    b. Update the hosts file with the following code according to your setup.

    For Kubernetes:

    <LB-IP of Ingress> <host defined for ingress in values.yaml>
    

    For Docker:

    <LB-IP of Ingress> <server_name defined in nginx.conf>
    

    For example,

    XX.XX.XX.XX anon.protegrity.com
    

The URL can now be used when creating the Connection object in the SDK, for example, conn = anonsdk.Connection("https://anon.protegrity.com/").

8.8 - Creating a DNS entry for the ELB hostname in Route53

Steps to configure hostnames specified in the values.yaml file.

This section describes the steps to configure hostnames specified in the values.yaml file of the Helm chart for resolving the hostname of the Elastic Load Balancer (ELB) that is created by the NGINX Ingress Controller.

  1. Configure Route53 for DNS resolution.

    • Create a private hosted zone in the Route53 service.
    • In our case, the domain name for the hosted zone is protegrity.com.
    • Select the VPC where the Kubernetes cluster is created.
  2. Create a hostname for the ELB in the private hosted zone created in step 1.

    • Create a Record Set with type A - Ipv4 address
    • Select Alias as yes
    • Specify the Alias Target to the ELB created by the Nginx Ingress Controller
  3. Save the record.

  4. Create an Inbound endpoint for DNS queries from a network to the hosted VPC used in Kubernetes.

    • Select Configure endpoints in the Route53 Resolver service.
    • Select Inbound Only endpoint.
    • Give a name to the endpoint.
    • Select the VPC used in the Kubernetes cluster and Route53 private hosted zone.
    • Select the availability zone as per the subnet.
    • Review and create the endpoint.
    • Note the IP addresses from the Inbound endpoint page.
    • Send CURL request to the hostname created using the Route 53 service
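
The alias record from step 2 can also be created from the AWS CLI with aws route53 change-resource-record-sets --hosted-zone-id <zone_id> --change-batch file://alias-record.json. The change batch below is illustrative; the hosted-zone ID and ELB DNS name are placeholders for your own values:

```json
{
  "Comment": "Alias record pointing the ingress hostname at the NGINX ELB",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "anon.protegrity.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "<ELB_hosted_zone_ID>",
          "DNSName": "<ELB_DNS_name>",
          "EvaluateTargetHealth": false
        }
      }
    }
  ]
}
```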

For more information about Amazon Route53, refer to Amazon Route53 Documentation.