High-Level Workflow

High-level workflow of the Synthetic Data generation process.

Protegrity Synthetic Data follows a structured pipeline to generate synthetic data:

Configuration Validation
Optimal Real Data Usage
Automatic Data Preprocessing
Training of Synthetic Data Generator Model
Evaluation Against Real Data
Synthetic Data Generation
Machine Learning Operations

Configuration Validation

Training Synthetic Data generators can be slow, taking from a couple of minutes to several hours depending on the configuration used. To avoid wasting compute time, the API validates the configuration before any training takes place. If a violation is found, a descriptive exception is returned to the user.

Existence Validation: Ensures that the specified column exists in the real data.
Data Type Validation: Ensures that columns have the data type a feature requires, for example, categorical or integer for bias customization.
Unique ID Validation: Ensures unique identifiers are not used inappropriately, for example, bias customization on a unique identifier.
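
The three checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the names ConfigValidationError and validate_bias_column are hypothetical, the real data is represented as a plain dictionary of column lists, and a column is treated as a unique identifier when all of its values are distinct.

```python
from typing import Any


class ConfigValidationError(ValueError):
    """Hypothetical exception for a configuration that fails a pre-training check."""


def validate_bias_column(data: dict[str, list[Any]], column: str) -> None:
    """Check a column requested for bias customization before training starts.

    `data` maps column names to lists of values (a stand-in for the real data).
    """
    # Existence validation: the column must be present in the real data.
    if column not in data:
        raise ConfigValidationError(f"Column '{column}' does not exist in the real data.")

    values = data[column]

    # Data type validation: this sketch treats strings as categorical and also
    # accepts integers; everything else (including bool and float) is rejected.
    def is_allowed(value: Any) -> bool:
        return isinstance(value, (int, str)) and not isinstance(value, bool)

    if not all(is_allowed(v) for v in values):
        raise ConfigValidationError(
            f"Column '{column}' must be categorical or integer for bias customization."
        )

    # Unique ID validation (assumed heuristic): a column whose values are all
    # distinct behaves like an identifier and cannot carry a meaningful bias.
    if len(values) > 1 and len(set(values)) == len(values):
        raise ConfigValidationError(
            f"Column '{column}' looks like a unique identifier and cannot be used "
            "for bias customization."
        )
```

Running all checks before training means a misconfigured request fails in milliseconds instead of after a long training run.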

Optimal Real Data Usage

The performance of any machine learning model depends on the size of its training data, a relationship described by the learning curve. The API estimates a learning curve from the real data and may randomly sample the data to reduce its size.

required_groups Parameter: Ensures downsampling includes all unique values in a specified categorical column.
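
As a rough illustration of what the required_groups guarantee implies, the sketch below downsamples a list of rows while keeping at least one row for every unique value of the chosen column. The function name, its signature, and the row representation are assumptions made for this example; how the API actually samples is not described here.

```python
import random
from typing import Any


def downsample_with_required_groups(
    rows: list[dict[str, Any]],
    target_size: int,
    required_groups: str,
    seed: int = 0,
) -> list[dict[str, Any]]:
    """Randomly downsample `rows` while keeping every unique value of one column."""
    rng = random.Random(seed)

    # Group row indices by the value they hold in the required column.
    indices_by_value: dict[Any, list[int]] = {}
    for index, row in enumerate(rows):
        indices_by_value.setdefault(row[required_groups], []).append(index)

    # Keep one randomly chosen row per unique value so no group disappears.
    keep = {rng.choice(indices) for indices in indices_by_value.values()}

    # Fill the remaining slots with a plain random sample of the other rows.
    remaining = [i for i in range(len(rows)) if i not in keep]
    extra = max(0, min(target_size - len(keep), len(remaining)))
    keep.update(rng.sample(remaining, extra))

    # Preserve the original row order in the result.
    return [rows[i] for i in sorted(keep)]
```

In this sketch, if the column has more unique values than target_size, the result is larger than target_size, because keeping every group takes priority over the size target.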

Automatic Data Preprocessing

No manual preprocessing is necessary. The API automatically performs all required preprocessing.

Optional Data Type Specification: Users are encouraged to pass the data type of each column so that the generated data respects it (for example, integer instead of float). This is particularly useful when a column appears to be an integer but encodes a categorical variable. Data types can also be specified for only some columns. For any column without a specified data type, the system infers it automatically.
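
The interplay between user-specified and inferred data types can be sketched as follows. The type labels ("integer", "float", "categorical") and the functions infer_type and resolve_types are hypothetical and serve only to show the idea of a partial specification with an inference fallback.

```python
from typing import Any, Optional


def infer_type(values: list[Any]) -> str:
    """Infer a coarse data type from the values of a column (simplified rule)."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "float"
    return "categorical"


def resolve_types(
    data: dict[str, list[Any]],
    user_types: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Use the user's type where one is given and infer the rest."""
    user_types = user_types or {}
    unknown = set(user_types) - set(data)
    if unknown:
        raise KeyError(f"Types given for columns not in the data: {sorted(unknown)}")
    return {
        column: user_types.get(column, infer_type(values))
        for column, values in data.items()
    }


data = {
    "age": [34, 51, 27],
    "zip_code": [10001, 94105, 60601],  # looks numeric but is really a category
    "income": [52000.0, 61000.5, 48000.0],
}

# Without a hint, zip_code is inferred as an integer.
inferred = resolve_types(data)

# A partial specification overrides only the columns it names.
resolved = resolve_types(data, {"zip_code": "categorical"})
```

Here the zip_code column is the case described above: its values are integers, but arithmetic on them is meaningless, so declaring it categorical prevents the generator from producing values that are not valid codes.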

Training of Synthetic Data Generator Model

Two training modes are available:

Default Mode: Uses default configurations for modest to high fidelity results.
Autolearn Mode: Performs hyperparameter optimization. Requires:
- Time budget specification
- Option to start from scratch or continue previous tuning session
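
To make the two Autolearn requirements concrete, the sketch below runs a hyperparameter search until a time budget expires and can either start from scratch or resume a previous session. It uses plain random search as a stand-in, since the optimization method used by the product is not stated here. TuningSession, autolearn, and the hyperparameter names learning_rate and epochs are all hypothetical.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TuningSession:
    """State that must survive between runs to continue a tuning session."""
    best_params: Optional[dict] = None
    best_score: float = float("-inf")
    trials: int = 0


def autolearn(
    objective: Callable[[dict], float],
    time_budget_seconds: float,
    previous: Optional[TuningSession] = None,
    seed: int = 0,
) -> TuningSession:
    """Random search that stops when the time budget is used up."""
    # Start from scratch unless a previous session is passed in.
    session = previous if previous is not None else TuningSession()
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_seconds

    while time.monotonic() < deadline:
        # Sample one candidate configuration (hypothetical hyperparameters).
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "epochs": rng.randint(10, 200),
        }
        score = objective(params)  # higher is better
        session.trials += 1
        if score > session.best_score:
            session.best_score = score
            session.best_params = params

    return session
```

Because the best result so far is stored in the session, continuing a previous session can never end with a worse score than it started with, while starting from scratch discards that history.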

Evaluation Against Real Data

The API evaluates Synthetic Data against real data using default metrics and charts:

Correlation measures
Composite score
Information preserved
Similarity

The evaluation report is returned in HTML and PDF formats. Real data is also evaluated against samples of real data to estimate the theoretical limit of how closely synthetic data can match the real data.
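
The idea of a real-versus-real baseline can be illustrated with a single correlation measure. The sketch below is an assumption-laden example, not the API's metric: it compares the Pearson correlation of two columns between datasets, then computes the same gap between two random halves of the real data. The function names are hypothetical.

```python
import math
import random


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_gap(a: dict, b: dict, col_x: str, col_y: str) -> float:
    """Absolute difference in the (col_x, col_y) correlation between two datasets."""
    return abs(pearson(a[col_x], a[col_y]) - pearson(b[col_x], b[col_y]))


def real_vs_real_baseline(real: dict, col_x: str, col_y: str, seed: int = 0) -> float:
    """Correlation gap between two random halves of the real data."""
    n = len(real[col_x])
    order = list(range(n))
    random.Random(seed).shuffle(order)
    half = n // 2
    first = {c: [real[c][i] for i in order[:half]] for c in (col_x, col_y)}
    second = {c: [real[c][i] for i in order[half:]] for c in (col_x, col_y)}
    return correlation_gap(first, second, col_x, col_y)
```

A synthetic dataset whose gap to the real data is close to the real-versus-real baseline is about as similar as another sample of real data would be, so a smaller gap is not a realistic target.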

Synthetic Data Generation

A job represents a single run of either training followed by generation or, if a model already exists, generation only.

Machine Learning Operations

Organizations may have multiple data domains with distinct requirements. The API manages Synthetic Data generators using:

Data Domains and Subdomains: Group generators by domain, which is useful for auditing and lifecycle management.
Model in Production: Indicates that the generators and artifacts are stored and ready for future use.

Last modified : November 10, 2025