Synthetic Data

The Protegrity Synthetic Data solution is an artificial data generator: it produces synthetic data that mimics the structure and statistical properties of real data.

Synthetic Data is created using an advanced Generative Adversarial Network (GAN) model. It mimics the properties of real data, such as data types, ranges, correlations, and distributions, but contains no actual personal information.

1 - Introduction

Learn about Synthetic Data.

Synthetic Data unlocks the full potential of AI and analytics by creating entirely new data that mirrors the patterns of your original datasets. This new data contains no sensitive information. You can train and test AI models without risk. You can also scale these models without exposure or compliance violations.

Advantages of Synthetic Data over Anonymized Data:

  • Preserves utility for analytics, machine learning, and testing while minimizing privacy risks.
  • Can simulate rare events or edge cases in data.
  • Has no 1:1 mapping to real records.
  • Is generally not subject to privacy regulations and can be generated to mitigate bias.
  • Cannot be traced back to any individual.

Use Cases

Synthetic Data is used for:

  • Training machine learning models without exposing sensitive data.
  • Sharing data across teams or vendors while maintaining compliance.
  • Replacing expensive or hard-to-source real-world data collection.
  • Testing and development environments that replicate real-world complexity without privacy risks.
  • Monetizing data and evaluating vendors.

1.1 - Privacy-Preserving Characteristics

The privacy-preserving characteristics of Synthetic Data.

Synthetic Data is generated from learned patterns in real datasets but does not contain any actual personal records. This ensures:

  • No 1:1 mapping between synthetic and real data.
  • No re-identification risk, even when used in sensitive domains, such as healthcare or finance.

Compliance with Privacy Regulations

  • General Data Protection Regulation (GDPR): Synthetic Data is considered anonymous under GDPR because it lacks identifiable links to real individuals.
  • Health Insurance Portability and Accountability Act (HIPAA): It qualifies under the Safe Harbor and Expert Determination methods, making it suitable for healthcare data use without being classified as Protected Health Information (PHI).

Built-In Privacy Safeguards

Protegrity’s Synthetic Data solution includes multiple privacy-enhancing features:

  • Privacy Measurement Tools: Evaluate the privacy robustness of the generated data.
  • Automated De-identification: Removes sensitive attributes while preserving data utility.
  • Support for Tabular Data: Enables realistic simulation of structured datasets for analytics and AI training.
  • On-demand Generation Capabilities: Allow developers to invoke Synthetic Data generation through an API and integrate it into pipelines with minimal effort.

1.2 - Comparison with Other Privacy-Enhancing Technologies

Understand the difference between Synthetic Data and other data protection methods.

The following list compares Synthetic Data with other data protection methods.

  • Pseudonymization replaces real data with tokens for certain attributes, such as Personally Identifiable Information (PII). However, this method still uses real data, and the analytical value remains intact only for the attributes that are not tokenized.

  • Anonymization reduces the risk of reidentification by transforming quasi-identifiers. However, careful balancing of utility and privacy is needed to minimize the impact on downstream usage.

  • Synthetic Data closely resembles real data. It does not contain real records and typically results in less information loss compared to Anonymization.

Advantages

  • It can be used for analytics and advanced analytics with minimal impact.
  • It ensures that no real individual can be re-identified.
  • It is generated with privacy safeguards and can typically be used without individual consent.
  • It can be viewed by any user once generated.
  • It is produced by processing all records together.
  • It does not require additional security measures.
  • It can be generated on demand.
  • It can be considered anonymous data within the context of GDPR.
  • It can be generated in a manner that avoids being subject to HIPAA regulations.

Disadvantages

  • It is slower than Pseudonymization or Anonymization.
  • It is not suitable for use cases where re-identification is necessary.
  • It requires a minimum amount of data to work reliably, and the amount needed increases with data complexity.

1.3 - Synthetic Data Overview

An overview of key characteristics of Synthetic Data and its role in privacy compliance.

Synthetic Data is a privacy-enhancing technology that uses real datasets to create artificial data. It does not represent real individuals and has no connection to real people. However, it still provides strong analytical utility and preserves relationships between variables.

Key Characteristics of Synthetic Data

Feature | Synthetic Data
Represents real people | False. It has no direct link to real individuals.
Closeness to real individuals | Low. It preserves the relationships between variables found in the real data.
Analytics and advanced analytics | High utility. It is suitable for ML, forecasting, and testing.
Maintains data types | Guaranteed. It preserves schema compatibility.
Internal and external sharing | Possible. It is compliant with privacy regulations like GDPR and HIPAA.
Simulating rare scenarios | Possible. It simulates rare scenarios, fraud patterns, or edge cases not present in production.
Risk of re-identification | Low. It minimizes the risk of re-identification compared to Anonymization or Pseudonymization.
Data progression | Possible. It can be used to create data trends that might change over time.
Cost | Moderate. It incurs varying costs depending on the complexity of the data and the synthesis methods used.
Scalability | High. It can be generated in large volumes as needed.
Maintenance | Moderate. It requires periodic updates to reflect changes in real data.

Synthetic Data is a powerful tool for privacy compliance. It:

  • Does not represent real individuals, eliminating direct privacy risks.
  • Preserves analytical utility, making it suitable for machine learning, forecasting, and testing.
  • Maintains statistical relationships between variables without exposing personal information.

1.4 - How Synthetic Data is Generated

Describes how Synthetic Data generation works.

Synthetic Data is a privacy-enhancing technology that creates artificial datasets. It works by learning from the structure and statistical properties of real data. It is designed to preserve analytical utility while protecting individual privacy. The process involves three key stages:

Stage 1: Extract Characteristics from Original Data

The system analyzes the original dataset to understand its structure and relationships:

Characteristics | Examples
Column types | string, integer, categorical
Value distributions | age ranges, frequency of pet types
Relationships between variables | age and pet ownership patterns
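
The Stage 1 analysis can be sketched in Python. This is an illustrative approximation only, not Protegrity's implementation; the toy dataset and the `profile` helper are invented for the example:

```python
from collections import Counter

# Toy dataset mirroring the structure analyzed in Stage 1.
rows = [
    {"name": "Jack", "age": 42, "pet": "Dog"},
    {"name": "Jane", "age": 25, "pet": "Cat"},
    {"name": "Bill", "age": 18, "pet": "Dog"},
]

def profile(rows):
    """Extract per-column characteristics: type, range or value frequencies."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, int) for v in values):
            summary[col] = {"type": "integer", "range": (min(values), max(values))}
        else:
            summary[col] = {"type": "categorical", "frequencies": Counter(values)}
    return summary

print(profile(rows)["age"])  # → {'type': 'integer', 'range': (18, 42)}
```

A real profiler would also capture cross-column relationships (for example, age versus pet ownership); this sketch covers only per-column characteristics.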

Stage 2: Generate Fictional Records

Based on the extracted characteristics, synthetic records are created using advanced modeling techniques:

  • Generative Algorithms: Generative Adversarial Networks (GANs) or other statistical models.
  • Privacy Assurance: These records are entirely fictional and do not correspond to real individuals.

Stage 3: Validate Privacy

The Synthetic Data dataset undergoes rigorous validation to ensure privacy protection:

  • Re-identification Risk Analysis: It ensures that no original entries can be inferred or reconstructed.
  • Privacy Techniques Applied: It includes methods like privacy risk scoring to quantify and mitigate risks.

Table: Original Dataset

Name | Surname | Age | Pet Owned
Jack | Dawson | 42 | Dog
Jane | Dawson | 25 | Cat
Bill | Carvalho | 18 | Dog
Jennie | Philip | 53 | Hamster

Table: Synthetic Data Dataset

Name | Surname | Age | Pet Owned
Scott | Vaz | 48 | Dog
Anna | Rodriguez | 21 | Cat
Hank | Summers | 19 | Dog
Jean | Vaz | 51 | Hamster
Bill | Diaz | 58 | Dog
Sean | Young | 34 | Dog
Carrie | Lewis | 24 | Hamster
Perry | Macanzie | 42 | Cat
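
To see how such a synthetic table preserves statistical properties, one can compare category frequencies between the two tables. The sketch below (illustrative only) computes the total variation distance between the pet distributions of the original and synthetic datasets above:

```python
from collections import Counter

# Pet columns taken from the two example tables above.
original_pets = ["Dog", "Cat", "Dog", "Hamster"]
synthetic_pets = ["Dog", "Cat", "Dog", "Hamster", "Dog", "Dog", "Hamster", "Cat"]

def relative_freq(values):
    """Relative frequency of each category."""
    counts = Counter(values)
    return {k: counts[k] / len(values) for k in counts}

orig = relative_freq(original_pets)
synt = relative_freq(synthetic_pets)

# Total variation distance between the two categorical distributions:
# 0 means identical distributions, 1 means completely disjoint.
tvd = 0.5 * sum(abs(orig.get(k, 0) - synt.get(k, 0)) for k in set(orig) | set(synt))
print(round(tvd, 3))  # → 0.0
```

Here the synthetic table reproduces the original pet distribution exactly (50% Dog, 25% Cat, 25% Hamster) even though no row matches a real record.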

2 - About Protegrity Synthetic Data

Summary of the Protegrity Synthetic Data architecture, including its components, communication flow, access methods, and hosting options.

Protegrity’s Synthetic Data solution is a Synthetic Data generator that produces artificial data that is realistic, statistically accurate, and privacy-safe. This data unlocks the full potential of AI and analytics. By creating entirely new data that mirrors the patterns of your original datasets but contains no sensitive information, you can train and test AI models without risk. You can also scale these models without exposure or compliance violations.

2.1 - Protegrity Synthetic Data Architecture

Communication between Protegrity Synthetic Data, the Dask Scheduler, and Dask Workers is detailed in this section.

An overview of the communication is shown in the following figure.

Synthetic Data Components

The Synthetic Data system includes the following core components:

Key Pods and Services

  • Synthetic Data App Pod

    • Orchestrates Synthetic Data generation.
  • MLFlow Pod

    • Captures model training and evaluation.
    • Hosted in containers for scalability.
  • MinIO Pod

    • Stores models, model artifacts, and generated reports.
    • Used by both MLFlow and Synthetic Data App pods.
  • SQL Database Server Pod

    • Provides storage for MLFlow experiments metadata.

Data Generation Interfaces

Synthetic Data can be generated using:

  • REST APIs
  • Swagger UI

These interfaces allow developers and data scientists to interact with the system programmatically or visually.

Access and Networking

Users access Protegrity Synthetic Data over HTTP on default port 8095; other services use the following ports:

Port | Communication Path
5000 | MLFlow pod
5432 | SQL Database Server
8095 | Protegrity Synthetic Data Service
9000 | MinIO

Cloud Hosting Options

Like the Protegrity Anonymization API, the entire Synthetic Data API can be hosted using any cloud-provided Kubernetes service, including:

  • Amazon Elastic Kubernetes Service (EKS)
  • Google Kubernetes Engine (GKE)
  • Microsoft Azure Kubernetes Service (AKS)
  • Red Hat OpenShift
  • Other Kubernetes platforms

This flexibility allows organizations to scale Synthetic Data generation securely across environments.

3 - Installing Protegrity Synthetic Data

This section describes how to install Protegrity Synthetic Data using Docker containers.

Synthetic Data enables innovation and testing without compromising privacy, ethics, or security. Protegrity’s solution provides a containerized lab environment for exploring Synthetic Data datasets interactively.

3.1 - Installing Synthetic Data Environment

Set up Synthetic Data environment.

Follow these steps to set up your system using Docker containers.

Before You Begin

Ensure the following prerequisites are completed on your base machine:

  • Docker Engine is installed.
  • Docker Compose V2 is installed.
  • Verify that administrator access is available on the host machine to install Protegrity Synthetic Data.

3.2 - Installing Protegrity Synthetic Data Using Docker Containers

Complete the following steps to install the Protegrity Synthetic Data using Docker Containers.

Installation Steps

  1. Log in to the machine as an administrator.

  2. Download and extract the API package using the following command:

    tar -xvzf SYNTHETIC-DATA_RHUBI-ALL-64_x86-64_Generic.DOCKER_1.0.0.x.tgz
    
  3. Load the API container using the following command:

    docker load <synthetic-data.tgz
    
  4. Load the dependent container using the following command:

    docker load <dependent_images.tgz
    
  5. Verify that the image is successfully loaded using the following command:

    docker images
    
  6. Navigate to the directory where the contents are extracted.

  7. Change the working directory to docker and update the docker/docker-compose.yaml file for the configuration that you require, such as the image ID.

    Note: You can specify the IMAGE ID instead of the REPOSITORY:TAG for the image attribute.

  8. Deploy the Protegrity Synthetic Data to Docker using the following command.

    docker-compose -f docker-compose.yaml up -d
    
  9. Verify that the Docker containers are running using the following command.

    docker ps
    

Protegrity Synthetic Data is ready to use. Use the following URLs to access Protegrity Synthetic Data through the REST API.

Protegrity Synthetic Data using REST | URL
Basic information about the Protegrity Synthetic Data | http://<Hostname>:<port>/pty/synthetic-data/v1/
Contractual information for the Protegrity Synthetic Data | http://<Hostname>:<port>/pty/synthetic-data/v1/about
Protegrity Synthetic Data APIs | http://<Hostname>:<port>/pty/synthetic-data/v1/docs
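
For scripting convenience, these URLs can be derived from the host and port. The helper below is illustrative; the hostname used is a placeholder, not a real deployment:

```python
def endpoint_urls(hostname, port):
    """Build the documented Synthetic Data REST URLs for a given host and port."""
    base = f"http://{hostname}:{port}/pty/synthetic-data/v1"
    return {
        "info": f"{base}/",        # basic information
        "about": f"{base}/about",  # contractual information
        "docs": f"{base}/docs",    # API documentation (Swagger UI)
    }

urls = endpoint_urls("syndata.example.com", 8095)
print(urls["docs"])  # → http://syndata.example.com:8095/pty/synthetic-data/v1/docs
```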

4 - Using Protegrity Synthetic Data

The REST endpoints and methods for generating Synthetic Data.

This section explains the various APIs provided by Protegrity Synthetic Data and the method for generating Synthetic Data.

4.1 - Understanding the Protegrity Synthetic Data REST APIs

List of APIs available with Protegrity Synthetic Data.

The following APIs are available with the Protegrity Synthetic Data REST API for generating and analyzing Synthetic Data. You can call these APIs from the command line using curl, or submit requests through the provided Swagger UI or a tool such as Postman.

Scheduling APIs

This API is used to start a Synthetic Data job. By default, the job is configured to generate the best possible Synthetic Data; however, parameters are available for you to customize generation to your requirements.

Generate API

Use this API to start the Synthetic Data job. You can specify the number of rows required and run the job using the configuration. The default configuration analyzes the source data and determines the best configuration for processing the request. It learns from the source data and trains a model that is used for generating Synthetic Data. Alternatively, you can modify the configuration according to your requirements before running the job. The API returns a job ID that you can use to work with the job and monitor its status.

For more information about the generate API, refer to Generate API.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /generate

Parameters:

  • n: Specify the number of records to generate.
  • name: Specify a name for the job.
  • tag: Specify the tag details for the job. The tag contains the domain and sub domain information.
  • source: Specify the source file details.
  • target: Specify the destination file details.
  • other params: Specify the required details for generating the Synthetic Data.

Method: POST

Sample Request:

curl -X 'POST' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/generate?n=10' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "name": "Generate Data-Test", "tag": { "domain": "Domain Name", "subdomain": "Sub-Domain Name" } }'
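
The same request can be prepared programmatically. The sketch below only constructs the URL and JSON body; the hostname comes from the sample above, and the use of Python's urllib for URL encoding is illustrative, not part of the product:

```python
import json
from urllib.parse import urlencode

base_url = "http://syndata.protegrity.com/pty/synthetic-data/v1"

# Query parameter: number of records to generate.
url = f"{base_url}/generate?{urlencode({'n': 10})}"

# Request body with the mandatory tag (domain and subdomain).
payload = {
    "name": "Generate Data-Test",
    "tag": {"domain": "Domain Name", "subdomain": "Sub-Domain Name"},
}
body = json.dumps(payload)

print(url)  # → http://syndata.protegrity.com/pty/synthetic-data/v1/generate?n=10
```

An HTTP client would then POST `body` to `url` with `Content-Type: application/json`, exactly as the curl sample shows.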

Monitoring APIs

These APIs are used to monitor Synthetic Data jobs. Use these APIs to obtain the job status, retrieve a job, abort a job, and delete a job.

Get Job IDs

Use this API to get the job IDs of the operations that are running, in queue, or completed. You can then use the required job ID with the other APIs to work with the job.

For more information about the job ID API, refer to Get Job ID.

Base URL: https://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /jobs

Parameters:

  • Running: Displays a list of jobs that are running.
  • InQueue: Displays a list of jobs that are queued.
  • Completed: Displays a list of jobs that are complete.
  • Failed: Displays a list of jobs that have failed.
  • Aborted: Displays a list of jobs that have been aborted.

Method: GET

Sample Request:

curl -X 'GET' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/jobs?status=Running|InQueue|Completed&limit=10'

Get Job Status

Use this API to get the status of a job. It shows the percentage of job completion. Use the information provided here to monitor if a job is running.

For more information about the job status API, refer to Get Job Status.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /jobs/{jobId}

Parameter:

  • jobId: A string that specifies the ID of the job whose status must be retrieved. The job ID is returned in the response body when a job is scheduled.

Method: GET

Sample Request:

curl -X 'GET' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/jobs/842aac73-1ec9-4450-af26-935e33791216'
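
A client typically polls this endpoint until the job completes. The sketch below illustrates the pattern with a stubbed status fetcher; the `percentage` field mirrors the completion percentage described above, but the exact response schema is an assumption:

```python
def poll_until_done(fetch_status, max_polls=10):
    """Call fetch_status() repeatedly until the job reports 100% completion."""
    for _ in range(max_polls):
        status = fetch_status()
        if status["percentage"] >= 100:
            return status
    raise TimeoutError("job did not complete within the polling budget")

# Stub simulating successive responses from GET /jobs/{jobId}.
responses = iter([{"percentage": 40}, {"percentage": 80}, {"percentage": 100}])
result = poll_until_done(lambda: next(responses))
print(result)  # → {'percentage': 100}
```

In practice `fetch_status` would issue the HTTP GET shown in the sample request and a short sleep between polls would avoid hammering the service.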

Delete

Use this API to delete an existing job that is no longer required.

For more information about the delete API, refer to Delete an Existing Job.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /jobs/{jobId}

Parameter:

  • jobId: a string that specifies the ID of the job that must be deleted.

Method: DELETE

Sample Request:

curl -X 'DELETE' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/jobs/842aac73-1ec9-4450-af26-935e33791216'

Get Metadata

Use this API to retrieve the metadata for the existing job. This API is useful when you need to view the configuration available for a job. It displays the fields and configuration that is used to run the job.

For more information about the metadata API, refer to Get Metadata.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /jobs/{jobId}/meta

Parameter:

  • jobId: a string that specifies the ID of the job for which the metadata must be retrieved.

Method: GET

Sample Request:

curl -X 'GET' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/jobs/6a1b1a3a-0ea3-4b93-ab4c-ffc6d404b4e6/meta'

Abort

Use this API to abort a running job. You can abort jobs if you need to modify the parameters or if the job takes too much time or resources to conclude.

For more information about the abort API, refer to Abort a Running Job.

Note: After aborting the task, it might take time before all the running processes are stopped.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /jobs/{jobId}/abort?body=PLACEHOLDER

Parameter:

  • jobId: a string that specifies the ID of the job that must be aborted.

Method: POST

Sample Request:

curl -X 'POST' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/jobs/ce5793b2-4434-4104-acff-26651de018d8/abort?body=PLACEHOLDER'

Reporting APIs

These APIs provide analysis information about the job. Use these APIs to validate and view the quality of the job and the Synthetic Data that is generated.

Get Report

Use this API to get the report of the job using the job ID.

For more information about the get report API, refer to Get Report.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /reports/{job_id}

Parameters:

  • job_id: Specify the job ID for retrieving the report.
  • format: Specify the format in which the report must be retrieved: html or pdf.

Method: GET

Sample Request:

curl -X 'GET' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/reports/a1b2c3d4' \
  -H 'accept: application/json'

Get Evaluation Metrics

Use this API to get the evaluation metrics of the job using the job ID.

For more information about the get evaluation metric API, refer to Get Evaluation Metrics.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /reports/{job_id}/metrics

Parameter:

  • job_id: Specify the job ID for retrieving the evaluation metrics.

Method: GET

Sample Request:

curl -X 'GET' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/reports/a1b2c3d4/metrics' \
  -H 'accept: application/json'

Get Domains

Use this API to get all the domain and sub domain information. In addition, the API returns details such as the last updated date, the model ID, and the URL for viewing the report data.

For more information about the get domains API, refer to Get Domains.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /domains

Parameters:

  • domainFilter: Specify the domain for retrieving data.
  • subDomainFilter: Specify the sub domain for retrieving data.

Method: GET

Sample Request:

curl -X 'GET' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/domains?domainFilter=domain_name&subDomainFilter=subdomain_name' \
  -H 'accept: application/json'

Model Management APIs

These APIs are used to manage the models used for generating Synthetic Data. The source data is analyzed, and a model is trained. This model is used for generating Synthetic Data. Use these APIs to list information about the model or to archive a model so that it is not used in production.

Get Model Details

Use this API to retrieve the model details using the domain and sub domain details. The API retrieves the configuration information that was used for training the model.

For more information about the Models API, refer to Get Model Details.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1

Path: /models

Parameters:

  • domain: Specify the domain name for retrieving the model.
  • subdomain: Specify the sub domain name for retrieving the model.

Method: GET

Sample Request:

curl -X 'GET' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/models/?domain=america&subdomain=north' \
  -H 'accept: application/json'

Delete Model API

Use this API to remove the model that is generated. This API ensures that the model is not used in production.

For more information about removing the model API, refer to Delete Model API.

Base URL: http://syndata.protegrity.com/pty/synthetic-data/v1/

Path: /models

Parameters:

  • domain: Specify the domain name for deleting the model.
  • subdomain: Specify the sub domain name for deleting the model.
  • includeProdModels: Specify true to include production models in the deletion; if false or omitted, production models are retained.

Method: DELETE

Sample Request:

curl -X 'DELETE' \
  'http://syndata.protegrity.com/pty/synthetic-data/v1/models/?domain=america&subdomain=north&includeProdModels=true' \
  -H 'accept: application/json'

5 - Building the Protegrity Synthetic Data Request

Use the APIs to build and configure a Protegrity Synthetic Data request.

It is possible to generate Synthetic Data with minimal, default configuration. It is also possible to control several aspects of the generation process, ranging from privacy controls to enforcing business logic. This section details the various configurations available for advanced usage.

5.1 - High-Level Workflow

High-level workflow of the Synthetic Data generation process.

The Protegrity Synthetic Data follows a structured pipeline to generate Synthetic Data:

  1. Configuration Validation
  2. Optimal Real Data Usage
  3. Automatic Data Preprocessing
  4. Training of Synthetic Data Generator Model
  5. Evaluation Against Real Data
  6. Synthetic Data Generation
  7. Machine Learning Operations

Configuration Validation

Training Synthetic Data generators is a slow process, taking from a couple of minutes to several hours depending on the configurations used. To optimize compute time, several validations are proactively done to ensure a valid configuration before any training takes place. If any violations are found, descriptive exceptions are returned to the user.

  • Existence Validation: Ensures that the specified column exists in the real data.
  • Data Type Validation: Ensures that required types (for example, categorical or integer) are present for features like bias customization.
  • Unique ID Validation: Ensures unique identifiers are not used inappropriately, for example, bias customization on a unique identifier.

Optimal Real Data Usage

The performance of any machine learning model is influenced by the size of its training data, as captured by a learning curve. The API estimates a learning curve from the real data and may randomly sample the data to reduce its size.

  • required_groups Parameter: Ensures downsampling includes all unique values in a specified categorical column.
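
The downsampling behavior of `required_groups` can be illustrated as follows; this is a sketch of the idea (one guaranteed row per unique category value), not the API's actual algorithm:

```python
import random

def downsample(rows, group_col, target_size, seed=0):
    """Randomly sample rows while keeping at least one row per unique value of group_col."""
    rng = random.Random(seed)
    # Guarantee one representative per unique group value.
    keep = {}
    for row in rows:
        keep.setdefault(row[group_col], row)
    required = list(keep.values())
    rest = [r for r in rows if r not in required]
    # Fill the remaining budget with a random sample.
    extra = max(0, target_size - len(required))
    return required + rng.sample(rest, min(extra, len(rest)))

rows = [{"pet": p, "i": i} for i, p in enumerate(["Dog", "Dog", "Cat", "Hamster", "Dog", "Cat"])]
sample = downsample(rows, "pet", target_size=4)
# Every unique pet value survives the downsampling.
assert {r["pet"] for r in sample} == {"Dog", "Cat", "Hamster"}
```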

Automatic Data Preprocessing

No manual preprocessing is necessary. The API automatically performs all required preprocessing.

  • Optional Data Type Specification: It is preferred for users to pass the data types of each column to ensure that the generated data respects them (for example, integer instead of float). Users can also specify data types for only some columns. However, if the user does not provide data types, the system will automatically infer them. This is particularly useful when a column appears to be an integer but encodes a categorical variable.
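
A heuristic for the inference described above might treat a low-cardinality integer column as categorical. The sketch and the threshold value below are invented examples, not the product's actual rules:

```python
def infer_column_type(values, categorical_threshold=0.1):
    """Guess a column type; integers with few distinct values are treated as categorical."""
    if all(isinstance(v, int) for v in values):
        distinct_ratio = len(set(values)) / len(values)
        return "categorical" if distinct_ratio <= categorical_threshold else "integer"
    if all(isinstance(v, float) for v in values):
        return "float"
    return "categorical"

# A 0/1 flag column looks numeric but encodes a category.
flags = [0, 1, 0, 0, 1] * 10   # 50 values, 2 distinct → ratio 0.04
ages = list(range(18, 68))     # 50 distinct values → ratio 1.0
print(infer_column_type(flags), infer_column_type(ages))  # → categorical integer
```

Passing explicit data types, as the documentation recommends, avoids relying on such heuristics.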

Training of Synthetic Data Generator Model

Two training modes are available:

  • Default Mode: Uses default configurations for modest to high fidelity results.
  • Autolearn Mode: Performs hyperparameter optimization. Requires:
    • Time budget specification
    • Option to start from scratch or continue previous tuning session

Evaluation Against Real Data

The API evaluates Synthetic Data against real data using default metrics and charts:

  • Correlation measures
  • Composite score
  • Information preserved
  • Similarity

An HTML and PDF evaluation report is returned. Real data is also evaluated against samples of real data to assess theoretical limits of closeness.

Synthetic Data Generation

A job represents a single event of training and generation or generation only, if a model already exists.

Machine Learning Operations

Organizations may have multiple data domains with distinct requirements. The API manages Synthetic Data generators using:

  • Data Domains and Subdomains: It is useful for auditing and lifecycle management.
  • Model in Production: It indicates generators and artifacts are stored and ready for future use.

5.2 - Building the Request Using the REST API

Build the REST API request for performing the Protegrity Synthetic Data.

Identifying the Source and Target

In this step, you specify the source real dataset from which you wish to produce Synthetic Data and a target, where corresponding Synthetic Data will be saved.

The following file formats are supported:

  • Comma separated values (CSV)

The following data storages have been tested for Protegrity Synthetic Data:

  • Local File System
  • Amazon S3

The following data storage types can also be used for the Protegrity Synthetic Data:

  • Google Cloud Storage
  • Microsoft Azure Storage
  • MinIO Storage
  • Other S3 Compatible Services

Use the following code to specify the source:

Note: Modify the source and destination code for your provider.

For more cloud-related sample codes, refer to the section Samples for cloud-related source and destination files.

"source": {
    "file": {
      "name": "string",
      "format": "CSV",
      "props": {},
      "access_options": {}
    }
  }

Note: When uploading a file to the cloud service, wait until the entire source file is uploaded before running the Synthetic Data job.

Similarly, specify the target file using the following code:

"target": {
    "file": {
      "name": "string",
      "format": "CSV",
      "props": {},
      "access_options": {}
    }
  }

Specify additional parameters for the source and target files, such as the character used to separate values in the file, using the following props attribute. If a property is not specified, the default value shown here is used.

"props": {
    "sep": ",",
    "decimal": ".",
    "quotechar": "\"",
    "escapechar": "\\",
    "encoding": "utf-8",
    "line_terminator": "\n"
}
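
These properties correspond to common CSV dialect options. The sketch below shows an equivalent parse using Python's csv module; the mapping is illustrative, and options such as decimal, encoding, and line_terminator have no direct csv.reader counterpart here:

```python
import csv
import io

# Equivalent of the default props when parsing CSV input.
props = {"sep": ",", "quotechar": '"', "escapechar": "\\"}

data = 'name,age\n"Jack",42\n"Jane",25\n'
reader = csv.reader(
    io.StringIO(data),
    delimiter=props["sep"],
    quotechar=props["quotechar"],
    escapechar=props["escapechar"],
)
rows = list(reader)
print(rows[1])  # → ['Jack', '42']
```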

If the required files are in cloud storage, then specify the cloud-related access information using the following code:

"access_options": {
}

For more information and help on specifying the source and target files, refer to Dask remote data configuration.

Note: If the target file already exists, then the file will be overwritten. Additionally, some cloud services have limitations on the file size. If such a limitation exists, then you can set the single_file switch to no when writing large files to the cloud storage. This saves the output as multiple files to avoid any errors related to saving large files to cloud storage.

Tagging the Job

You are required to tag a job by specifying the domain and subdomain. The other parameters are optional. Use the following structure to specify tag information:

"tag": {
    "domain": "card_transactions",
    "subdomain": "fraud_detection"
  },

Specifying Learning Configuration

This optional learning configuration section allows you to control learning aspects, that is, hyperparameter tuning and model operations. Use the following structure to specify learning parameters:

"learning": {
    "autolearn": {
        "minutes": int,
        "continue_previous_exploration": bool
    },
    "relearn": bool
}
  • autolearn: This is an optional parameter.
    • minutes: This parameter is mandatory if autolearn is enabled. It is the amount of time to invest in hyperparameter tuning. The total execution time is approximate: if you specify 60 minutes and a model is still in training at that point, that training session is allowed to complete. Intermediate checkpoints are saved to optimize compute time in case of system failure.
    • continue_previous_exploration: This is an optional parameter. Whenever hyperparameter tuning is triggered, a dedicated database is created, which enables resuming the process later. This parameter controls whether that database is reused. Set it to false when hyperparameter tuning cannot find better models after significant processing (resetting the search space can overcome this), or when the data schema is the same but the underlying behavior captured by the data is distinct, so resetting the search space makes sense.
  • relearn: This is an optional parameter, set to false by default. If set to true, the model trained in the current job is pushed to production irrespective of the model metric comparison and is used to start producing Synthetic Data. This is useful when the data schema remains the same but the underlying behavior captured by the data changes, for example, online purchasing behaviors during COVID-19.

The following figure shows the different combinations of these settings and their implication.

Learning configurations
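
The combinations in the figure can be summarized as a small decision function; this is an interpretation of the behavior described above, not product code, and the step names are invented:

```python
def plan_job(autolearn=False, continue_previous=True, relearn=False):
    """Summarize what a job will do given the learning flags described above."""
    steps = []
    if autolearn:
        steps.append("resume tuning database" if continue_previous
                     else "start tuning from scratch")
        steps.append("hyperparameter tuning within time budget")
    else:
        steps.append("train with default configuration")
    steps.append("push new model to production (forced)" if relearn
                 else "promote new model only if metrics improve")
    return steps

print(plan_job(autolearn=True, continue_previous=False, relearn=True))
```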

Specifying Evaluation Parameters

The optional evaluation configuration section allows you to compute additional metrics. Use the following structure to specify evaluation parameters:

"evaluation": {
    "metrics": ["association_risk", "membership_inference_attack", "sensitive_attribute_reconstruction"],
    "synthetic_fairness": ["religion"]
}
  • association_risk: Measures the association between real records and synthetic records.
  • membership_inference_attack: Calculates the probability of success of a membership inference attack, that is, the probability of an attacker inferring that a real person was included in the training data that produced a given Synthetic Data output.
  • sensitive_attribute_reconstruction: Measures the success of an attacker guessing a real sensitive attribute based on Synthetic Data and real quasi-identifiers.
  • synthetic_fairness: Provides insights into data fidelity asymmetries between categories. This is useful, for example, to understand if synthetic people from distinct nationalities have the same data fidelity. This is a key ingredient in ensuring algorithmic fairness.
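
The intuition behind `synthetic_fairness` — checking whether data fidelity is uniform across categories — can be illustrated with a toy per-category comparison. The records, attribute, and gap metric below are invented for the example and are not the product's actual metric:

```python
from collections import defaultdict

# (category, numeric attribute) pairs for real and synthetic records.
real = [("A", 30.0), ("A", 34.0), ("B", 50.0), ("B", 54.0)]
synthetic = [("A", 31.0), ("A", 33.0), ("B", 40.0), ("B", 44.0)]

def mean_by_group(pairs):
    """Mean of the numeric attribute per category."""
    sums = defaultdict(lambda: [0.0, 0])
    for group, value in pairs:
        sums[group][0] += value
        sums[group][1] += 1
    return {g: s / n for g, (s, n) in sums.items()}

real_means = mean_by_group(real)
synt_means = mean_by_group(synthetic)

# Absolute gap between real and synthetic means, per category.
fidelity_gap = {g: abs(real_means[g] - synt_means[g]) for g in real_means}
print(fidelity_gap)  # → {'A': 0.0, 'B': 10.0}
```

Here category B is modeled far less faithfully than category A: exactly the kind of asymmetry a fairness check is meant to surface.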

Specifying Supervise Parameters

The optional supervise parameters allow you to filter and change real data and Synthetic Data. Use the following structure to specify supervise parameters:

"supervise": {
    "input_filters": {
        "outlier_detection": {"outlier_frac": float}
    },
    "output_filters": {
        "fabricate_random_data": {
            "date": ["col_name"],
            "word": ["col_name"],
            "sentence": ["col_name"],
            "datetime": ["col_name"],
            "time": ["col_name"],
            "timestamp": ["col_name"],
            "email": ["col_name"],
            "float": ["col_name"],
            "positive_integer": ["col_name"]
        },
        "association_risk": {"threshold": float},
        "bias_customization": {"balance": ["col_a", "col_b"]},
        "business_rules": {
            "intervals": {"numeric_column": [numeric lower bound, numeric upper bound]},
            "unique_combinations": [
                ["categorical_col_a", "categorical_col_b"],
                ["categorical_col_c", "categorical_col_d"]
            ]
        },
        "sample_from_external": {
            "col_a": ["a", "b", "c", "d", "e"],
            "col_b": [0.0, 0.0, 0.0, 0.0, 0.0],
            "col_c": ["06611", "06611", "06611", "06611", "06611"]
        }
    }
}

The following parameters are available for this configuration:

  • input_filters: Specifies filters applied to real data before training a Synthetic Data generator.

The following option is available for input_filters:

  • outlier_detection: Removes a percentage of outliers from real data before training a Synthetic Data generator. The outlier_frac value must be between 0 and 0.5.
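The detection algorithm itself is not documented. As an intuition only, removing an outlier fraction can be sketched by ranking records by their distance from the mean and discarding the most extreme outlier_frac share (a hypothetical stand-in for the engine's actual detector):

```python
def drop_outlier_fraction(values, outlier_frac):
    """Drop the outlier_frac share of values farthest from the mean."""
    if not 0 <= outlier_frac <= 0.5:
        raise ValueError("outlier_frac must be between 0 and 0.5")
    mean = sum(values) / len(values)
    n_drop = int(len(values) * outlier_frac)
    # Keep the values closest to the mean; discard the n_drop farthest ones.
    kept = sorted(values, key=lambda v: abs(v - mean))
    return kept[:len(values) - n_drop] if n_drop else list(values)

data = [10, 11, 9, 10, 12, 11, 10, 250, 9, 10]  # 250 is an obvious outlier
cleaned = drop_outlier_fraction(data, 0.1)       # drop 10%, i.e. 1 record
print(250 in cleaned)  # False: the extreme value was removed
```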

  • output_filters: Specifies the filters applied to Synthetic Data before it is returned to the user. In certain scenarios, the engine might generate slightly more records than requested, due to generation retries needed to adhere to certain filters. If the filters are too restrictive and the generation retries are exhausted, fewer records might be returned.

The following options are available for output_filters:

  • fabricate_random_data: For generating random data, that is, with no connection to the real data of a given type. The following datatypes are available:
    • date
    • datetime
    • email
    • float
    • positive_integer
    • sentence
    • time
    • timestamp
    • word
  • sample_from_external: Substitutes attributes of the Synthetic Data output with values from an external data source. Before the external data is used, its validity is checked and, if applicable, it is validated against the business rules filters specified. The following checks are performed on the external data:

    • A data type check to ensure that the type of each external attribute matches the original data.
    • An existence check to ensure that the attributes exist.
    • A business rules interval conformity check.
    • A business rules unique combinations conformity check.
  • bias_customization: A validity check is performed on the user-specified attribute and settings. In this release, only the balance option is supported, and only on a single column. The balance option reruns the data generation process for a certain number of retries to ensure that all categories of the specified column are equally proportioned. However, given the nature of the dataset, especially when a category has an extremely small representation, it might not be possible to balance the dataset. The following checks are performed:

    • An existence check ensures that the attribute exists.
    • A unique id check ensures that unique identifiers are not used.
    • A categorical check ensures that float attributes are not used.
    • A balance check for a given column, if one is specified.
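As an intuition for the balance check (the engine's actual implementation is not documented), category proportions can be compared against a uniform target:

```python
from collections import Counter

def is_balanced(values, tolerance=0.05):
    """Check whether each category's share is within tolerance of a uniform split."""
    counts = Counter(values)
    target = 1 / len(counts)
    return all(
        abs(count / len(values) - target) <= tolerance
        for count in counts.values()
    )

print(is_balanced(["a", "b", "a", "b"]))  # True: 50/50 split
print(is_balanced(["a", "a", "a", "b"]))  # False: 75/25 split
```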
  • association_risk: Uses the association risk metric to filter out Synthetic Data records that have a high risk of being associated with real data records. Given a user-specified threshold, synthetic records whose association risk exceeds that threshold are removed.

  • business_rules: Ensures that business logic is observed in the Synthetic Data. The following configurations can be used:

    • unique_combinations: Maintains relationships between several categorical variables, for example, country and city columns. If the combinations of values in those two columns are always valid in the original data, the generated Synthetic Data will never contain invalid combinations.
    • intervals: The minimum and maximum numeric values allowed for an attribute.
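A hypothetical sketch of how such output filters could act on generated records; the record fields and rule values are illustrative, not the engine's API:

```python
# Illustrative business-rules filter: keep only records that respect
# a numeric interval and a set of valid categorical combinations.
rules = {
    "intervals": {"age": (18, 90)},
    "unique_combinations": {("United-States", "New-York"), ("Mexico", "Cancun")},
}

records = [
    {"age": 35, "country": "United-States", "city": "New-York"},
    {"age": 35, "country": "Mexico", "city": "New-York"},   # invalid combination
    {"age": 120, "country": "Mexico", "city": "Cancun"},    # outside the interval
]

def passes_rules(record):
    lo, hi = rules["intervals"]["age"]
    in_interval = lo <= record["age"] <= hi
    valid_combo = (record["country"], record["city"]) in rules["unique_combinations"]
    return in_interval and valid_combo

filtered = [r for r in records if passes_rules(r)]
print(len(filtered))  # 1: only the first record survives both rules
```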

6 - Using Sample Synthetic Data Jobs

Sample Synthetic Data jobs that you can use for working with and testing Protegrity Synthetic Data.

Sample Data Sets

Use the following dataset to test Protegrity Synthetic Data. This dataset is comprehensive and can give you thorough insights into working with Protegrity Synthetic Data.

Adult Dataset: Here is an extract of the dataset; the complete dataset can be found in the adult.csv file in the samples directory.

sex;age;race;marital-status;education;native-country;citizenSince;weight;workclass;occupation;salary-class
Male;39;White;Never-married;Bachelors;United-States;08-01-1971;185.38;State-gov;Adm-clerical;<=50K
Male;50;White;Married-civ-spouse;Bachelors;United-States;19-04-1960;176.32;Self-emp-not-inc;Exec-managerial;<=50K
Male;38;White;Divorced;HS-grad;United-States;07-12-1971;159.13;Private;Handlers-cleaners;<=50K
Male;53;Black;Married-civ-spouse;11th;United-States;22-05-1957;170.45;Private;Handlers-cleaners;<=50K
Female;28;Black;Married-civ-spouse;Bachelors;Cuba;03-02-1982;178.79;Private;Prof-specialty;<=50K
Female;37;White;Married-civ-spouse;Masters;United-States;06-12-1972;161.65;Private;Exec-managerial;<=50K
Female;49;Black;Married-spouse-absent;9th;Jamaica;18-04-1961;162.73;Private;Other-service;<=50K
Male;52;White;Married-civ-spouse;HS-grad;United-States;21-05-1958;171.75;Self-emp-not-inc;Exec-managerial;>50K
Female;31;White;Never-married;Masters;United-States;31-12-1978;164.03;Private;Prof-specialty;>50K
Male;42;White;Married-civ-spouse;Bachelors;United-States;11-02-1968;186.33;Private;Exec-managerial;>50K
Male;37;Black;Married-civ-spouse;Some-college;United-States;06-12-1972;189.49;Private;Exec-managerial;>50K
Male;30;Asian-Pac-Islander;Married-civ-spouse;Bachelors;India;01-02-1980;178.70;State-gov;Prof-specialty;>50K
Female;23;White;Never-married;Bachelors;United-States;08-04-1987;183.22;Private;Adm-clerical;<=50K
Male;32;Black;Never-married;Assoc-acdm;United-States;01-01-1978;156.63;Private;Sales;<=50K
Male;34;Amer-Indian-Eskimo;Married-civ-spouse;7th-8th;Mexico;03-12-1975;173.41;Private;Transport-moving;<=50K
Male;25;White;Never-married;HS-grad;United-States;06-03-1985;170.72;Self-emp-not-inc;Farming-fishing;<=50K
Male;32;White;Never-married;HS-grad;United-States;01-01-1978;174.91;Private;Machine-op-inspct;<=50K
Male;38;White;Married-civ-spouse;11th;United-States;07-12-1971;176.47;Private;Sales;<=50K
Female;43;White;Divorced;Masters;United-States;12-02-1967;179.88;Self-emp-not-inc;Exec-managerial;>50K
Male;40;White;Married-civ-spouse;Doctorate;United-States;09-01-1970;170.80;Private;Prof-specialty;>50K
Female;54;Black;Separated;HS-grad;United-States;23-06-1956;171.61;Private;Other-service;<=50K
Male;35;Black;Married-civ-spouse;9th;United-States;04-12-1974;183.71;Federal-gov;Farming-fishing;<=50K
Male;43;White;Married-civ-spouse;11th;United-States;12-02-1967;158.63;Private;Transport-moving;<=50K
Female;59;White;Divorced;HS-grad;United-States;28-07-1951;181.64;Private;Tech-support;<=50K
Male;56;White;Married-civ-spouse;Bachelors;United-States;25-06-1954;171.80;Local-gov;Tech-support;>50K
Male;19;White;Never-married;HS-grad;United-States;12-05-1991;172.74;Private;Craft-repair;<=50K
Male;39;White;Divorced;HS-grad;United-States;08-01-1971;159.41;Private;Exec-managerial;<=50K
Male;49;White;Married-civ-spouse;HS-grad;United-States;18-04-1961;176.76;Private;Craft-repair;<=50K
Male;23;White;Never-married;Assoc-acdm;United-States;08-04-1987;164.43;Local-gov;Protective-serv;<=50K
Male;20;Black;Never-married;Some-college;United-States;11-05-1990;157.60;Private;Sales;<=50K
Male;45;White;Divorced;Bachelors;United-States;14-03-1965;176.38;Private;Exec-managerial;<=50K
Male;30;White;Married-civ-spouse;Some-college;United-States;01-02-1980;160.60;Federal-gov;Adm-clerical;<=50K
Male;22;Black;Married-civ-spouse;Some-college;United-States;09-04-1988;173.41;State-gov;Other-service;<=50K
Male;48;White;Never-married;11th;Puerto-Rico;17-04-1962;189.50;Private;Machine-op-inspct;<=50K
Male;21;White;Never-married;Some-college;United-States;10-05-1989;162.76;Private;Machine-op-inspct;<=50K
Female;19;White;Married-AF-spouse;HS-grad;United-States;12-05-1991;158.42;Private;Adm-clerical;<=50K
Male;48;White;Married-civ-spouse;Assoc-acdm;United-States;17-04-1962;160.75;Self-emp-not-inc;Prof-specialty;<=50K
Male;31;White;Married-civ-spouse;9th;United-States;31-12-1978;172.10;Private;Machine-op-inspct;<=50K
Male;53;White;Married-civ-spouse;Bachelors;United-States;22-05-1957;189.74;Self-emp-not-inc;Prof-specialty;<=50K
Male;24;White;Married-civ-spouse;Bachelors;United-States;07-04-1986;170.08;Private;Tech-support;<=50K
Female;49;White;Separated;HS-grad;United-States;18-04-1961;173.71;Private;Adm-clerical;<=50K
Male;25;White;Never-married;HS-grad;United-States;06-03-1985;160.52;Private;Handlers-cleaners;<=50K
Male;57;Black;Married-civ-spouse;Bachelors;United-States;26-07-1953;178.12;Federal-gov;Prof-specialty;>50K
Male;53;White;Married-civ-spouse;HS-grad;United-States;22-05-1957;186.11;Private;Machine-op-inspct;<=50K
Female;44;White;Divorced;Masters;United-States;13-02-1966;162.80;Private;Exec-managerial;<=50K
Male;41;White;Married-civ-spouse;Assoc-voc;United-States;10-01-1969;172.39;State-gov;Craft-repair;<=50K
Male;29;White;Never-married;Assoc-voc;United-States;02-02-1981;168.83;Private;Prof-specialty;<=50K
Female;25;Other;Married-civ-spouse;Some-college;United-States;06-03-1985;179.12;Private;Exec-managerial;<=50K
Female;47;White;Married-civ-spouse;Prof-school;Honduras;16-03-1963;163.02;Private;Prof-specialty;>50K
Male;50;White;Divorced;Bachelors;United-States;19-04-1960;172.18;Federal-gov;Exec-managerial;>50K
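Note that the adult extract is semicolon-delimited. A minimal sketch for loading it with the Python standard library; the header and one row are inlined here to stand in for the samples/adult.csv file:

```python
import csv
import io

# Inline sample standing in for samples/adult.csv (semicolon-delimited).
sample = (
    "sex;age;race;marital-status;education;native-country;"
    "citizenSince;weight;workclass;occupation;salary-class\n"
    "Male;39;White;Never-married;Bachelors;United-States;"
    "08-01-1971;185.38;State-gov;Adm-clerical;<=50K\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
print(rows[0]["age"], rows[0]["salary-class"])  # 39 <=50K
```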

Bank Customer Churn Dataset: Here is an extract of the dataset; the complete dataset can be found in the bank_churn.csv file in the samples directory.

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
4,15701354,Boni,699,France,Female,39,1,0,2,0,0,93826.63,0
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
6,15574012,Chu,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
7,15592531,Bartlett,822,France,Male,50,7,0,2,1,1,10062.8,0
8,15656148,Obinna,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1
9,15792365,He,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
10,15592389,H?,684,France,Male,27,2,134603.88,1,1,1,71725.73,0
11,15767821,Bearce,528,France,Male,31,6,102016.72,2,0,0,80181.12,0
12,15737173,Andrews,497,Spain,Male,24,3,0,2,1,0,76390.01,0
13,15632264,Kay,476,France,Female,34,10,0,2,1,0,26260.98,0
14,15691483,Chin,549,France,Female,25,5,0,2,0,0,190857.79,0
15,15600882,Scott,635,Spain,Female,35,7,0,2,1,1,65951.65,0
16,15643966,Goforth,616,Germany,Male,45,3,143129.41,2,0,1,64327.26,0
17,15737452,Romeo,653,Germany,Male,58,1,132602.88,1,1,0,5097.67,1
18,15788218,Henderson,549,Spain,Female,24,9,0,2,1,1,14406.41,0
19,15661507,Muldrow,587,Spain,Male,45,6,0,1,0,0,158684.81,0
20,15568982,Hao,726,France,Female,24,6,0,2,1,1,54724.03,0
21,15577657,McDonald,732,France,Male,41,8,0,2,1,1,170886.17,0
22,15597945,Dellucci,636,Spain,Female,32,8,0,2,1,0,138555.46,0
23,15699309,Gerasimov,510,Spain,Female,38,4,0,1,1,0,118913.53,1
24,15725737,Mosman,669,France,Male,46,3,0,2,0,1,8487.75,0
25,15625047,Yen,846,France,Female,38,5,0,1,1,1,187616.16,0
26,15738191,Maclean,577,France,Male,25,3,0,2,0,1,124508.29,0
27,15736816,Young,756,Germany,Male,36,2,136815.64,1,1,1,170041.95,0
28,15700772,Nebechi,571,France,Male,44,9,0,2,0,0,38433.35,0
29,15728693,McWilliams,574,Germany,Female,43,3,141349.43,1,1,1,100187.43,0
30,15656300,Lucciano,411,France,Male,29,0,59697.17,2,1,1,53483.21,0
31,15589475,Azikiwe,591,Spain,Female,39,3,0,3,1,0,140469.38,1
32,15706552,Odinakachukwu,533,France,Male,36,7,85311.7,1,0,1,156731.91,0
33,15750181,Sanderson,553,Germany,Male,41,9,110112.54,2,0,0,81898.81,0
34,15659428,Maggard,520,Spain,Female,42,6,0,2,1,1,34410.55,0
35,15732963,Clements,722,Spain,Female,29,9,0,2,1,1,142033.07,0
36,15794171,Lombardo,475,France,Female,45,0,134264.04,1,1,0,27822.99,1
37,15788448,Watson,490,Spain,Male,31,3,145260.23,1,0,1,114066.77,0
38,15729599,Lorenzo,804,Spain,Male,33,7,76548.6,1,0,1,98453.45,0
39,15717426,Armstrong,850,France,Male,36,7,0,1,1,1,40812.9,0
40,15585768,Cameron,582,Germany,Male,41,6,70349.48,2,0,1,178074.04,0
41,15619360,Hsiao,472,Spain,Male,40,4,0,1,1,0,70154.22,0
42,15738148,Clarke,465,France,Female,51,8,122522.32,1,0,0,181297.65,1
43,15687946,Osborne,556,France,Female,61,2,117419.35,1,1,1,94153.83,0
44,15755196,Lavine,834,France,Female,49,2,131394.56,1,0,0,194365.76,1
45,15684171,Bianchi,660,Spain,Female,61,5,155931.11,1,1,1,158338.39,0
46,15754849,Tyler,776,Germany,Female,32,4,109421.13,2,1,1,126517.46,0
47,15602280,Martin,829,Germany,Female,27,9,112045.67,1,1,1,119708.21,1
48,15771573,Okagbue,637,Germany,Female,39,9,137843.8,1,1,1,117622.8,1
49,15766205,Yin,550,Germany,Male,38,2,103391.38,1,0,1,90878.13,0
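The bank churn extract is comma-delimited, with churn recorded in the Exited column. A small sketch computing the churn rate over a few rows inlined here to stand in for the samples/bank_churn.csv file:

```python
import csv
import io

# Three inlined rows standing in for samples/bank_churn.csv.
sample = (
    "RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,"
    "Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited\n"
    "1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1\n"
    "2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0\n"
    "3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
churn_rate = sum(int(r["Exited"]) for r in rows) / len(rows)
print(round(churn_rate, 2))  # 0.67: two of the three sampled customers exited
```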