High-Level Workflow
Protegrity Synthetic Data follows a structured pipeline to generate Synthetic Data:
- Configuration Validation
- Optimal Real Data Usage
- Automatic Data Preprocessing
- Training of Synthetic Data Generator Model
- Evaluation Against Real Data
- Synthetic Data Generation
- Machine Learning Operations
Configuration Validation
Training Synthetic Data generators is slow: it can take from a couple of minutes to several hours, depending on the configuration. To avoid wasting compute time, the API validates the configuration before any training starts. If a check fails, a descriptive exception is returned to the user. The checks are:
- Existence Validation: Ensures that the specified column exists in the real data.
- Data Type Validation: Ensures that the required data types (for example, categorical or integer) are present for features such as bias customization.
- Unique ID Validation: Ensures that unique identifiers are not used inappropriately, for example, as the target of bias customization.
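The three checks above can be sketched in plain Python. This is a minimal illustration, not the product's implementation: the function `validate_bias_column`, the exception `ConfigValidationError`, and the rule "a column whose values never repeat is treated as a unique identifier" are all assumptions made for the example.

```python
from typing import Any


class ConfigValidationError(ValueError):
    """Hypothetical exception raised before training when a check fails."""


def validate_bias_column(rows: list[dict[str, Any]], column: str) -> None:
    """Run the three checks for a column intended for bias customization."""
    if not rows:
        raise ConfigValidationError("real data is empty")

    # Existence validation: the column must be present in the real data.
    if column not in rows[0]:
        raise ConfigValidationError(f"column '{column}' does not exist in the real data")

    values = [row[column] for row in rows]

    # Data type validation: accept only categorical (str) or integer values.
    # bool is rejected explicitly because it is a subclass of int in Python.
    if not all(isinstance(v, (str, int)) and not isinstance(v, bool) for v in values):
        raise ConfigValidationError(f"column '{column}' must be categorical or integer")

    # Unique ID validation (assumed heuristic): values that never repeat
    # suggest the column is an identifier rather than a feature.
    if len(values) > 1 and len(set(values)) == len(values):
        raise ConfigValidationError(f"column '{column}' looks like a unique identifier")


rows = [
    {"customer_id": 1, "gender": "F", "income": 52000.0},
    {"customer_id": 2, "gender": "M", "income": 48000.5},
    {"customer_id": 3, "gender": "F", "income": 61000.0},
]
validate_bias_column(rows, "gender")  # passes all three checks
```

With this sample data, calling the function on `age` (missing), `income` (float), or `customer_id` (never repeats) raises `ConfigValidationError` with a message naming the failed check.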
Optimal Real Data Usage
The performance of any machine learning model depends on the amount of training data, a relationship described by its learning curve. The API estimates a learning curve from the real data and may randomly sample the data to reduce its size.
- required_groups parameter: Ensures that downsampling retains every unique value in a specified categorical column.
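One way to picture downsampling that respects `required_groups` is sketched below. The function `downsample` and its parameters `target_size` and `seed` are hypothetical, and the strategy shown (reserve one random row per unique value, then fill the rest at random) is an assumption; the source does not describe how the API performs the sampling.

```python
import random
from typing import Any, Optional


def downsample(
    rows: list[dict[str, Any]],
    target_size: int,
    required_groups: Optional[str] = None,
    seed: int = 0,
) -> list[dict[str, Any]]:
    """Randomly reduce rows to target_size, keeping every value of required_groups."""
    if target_size >= len(rows):
        return list(rows)

    rng = random.Random(seed)
    kept: set[int] = set()

    if required_groups is not None:
        # Reserve one randomly chosen row for each unique value in the column.
        by_value: dict[Any, list[int]] = {}
        for index, row in enumerate(rows):
            by_value.setdefault(row[required_groups], []).append(index)
        for indices in by_value.values():
            kept.add(rng.choice(indices))

    # Fill the remaining slots with a uniform random sample of the other rows.
    # If there are more unique values than target_size, nothing is added and
    # the result is larger than target_size.
    remaining = [i for i in range(len(rows)) if i not in kept]
    n_extra = max(0, target_size - len(kept))
    kept.update(rng.sample(remaining, n_extra))

    return [rows[i] for i in sorted(kept)]


# 95 "north" rows, 4 "south" rows, and a single "west" row.
rows = [
    {"region": "north" if i < 95 else "south" if i < 99 else "west", "value": i}
    for i in range(100)
]
sample = downsample(rows, target_size=10, required_groups="region")
```

Without the reservation step, a plain random sample of 10 rows would often miss the single `west` row.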
Automatic Data Preprocessing
No manual preprocessing is necessary. The API automatically performs all required preprocessing.
- Optional Data Type Specification: Users should preferably pass the data type of each column so that the generated data respects it (for example, integer instead of float). Data types can also be specified for only some columns; the system automatically infers any that are not provided. Explicit types are particularly useful when a column appears to be an integer but encodes a categorical variable.
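The interplay between user-supplied and inferred types can be sketched as follows. The function names `infer_type` and `resolve_types`, the type labels `"integer"`, `"float"`, and `"categorical"`, and the inference rules are assumptions for illustration only; the API's actual inference logic is not described in the source.

```python
from typing import Any, Optional


def infer_type(values: list[Any]) -> str:
    """Guess a column type from its values (simplified, assumed rules)."""
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "float"
    return "categorical"


def resolve_types(
    rows: list[dict[str, Any]],
    user_types: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Use the user's type where given and fall back to inference otherwise."""
    user_types = user_types or {}
    resolved = {}
    for column in rows[0]:
        if column in user_types:
            resolved[column] = user_types[column]
        else:
            resolved[column] = infer_type([row[column] for row in rows])
    return resolved


rows = [
    {"age": 34, "zip_code": 10001, "income": 52000.0},
    {"age": 51, "zip_code": 94105, "income": 48000.5},
]
# zip_code holds integers but is really a category, so the user overrides it.
types = resolve_types(rows, user_types={"zip_code": "categorical"})
```

Here only `zip_code` is specified by the user; `age` and `income` are inferred. Without the override, `zip_code` would be inferred as an integer, which is the case the bullet above warns about.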
Training of Synthetic Data Generator Model
Two training modes are available:
- Default Mode: Uses default configurations for moderate- to high-fidelity results.
- Autolearn Mode: Performs hyperparameter optimization. Requires:
  - Time budget specification
  - Option to start from scratch or continue a previous tuning session
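The two Autolearn inputs, a time budget and the choice between a fresh start and a resumed session, can be illustrated with a time-boxed random search. This is only a sketch: the source does not say which optimization algorithm the API uses, and the function `autolearn`, the session dictionary, and the hyperparameters `learning_rate` and `epochs` are hypothetical.

```python
import random
import time
from typing import Any, Callable, Optional


def autolearn(
    evaluate: Callable[[dict[str, Any]], float],
    time_budget_seconds: float,
    previous_session: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Random search that stops when the time budget is spent."""
    if previous_session is None:
        # Start from scratch.
        session = {"best_score": float("-inf"), "best_params": None, "trials": 0}
    else:
        # Continue a previous tuning session without mutating the caller's copy.
        session = dict(previous_session)

    rng = random.Random()
    deadline = time.monotonic() + time_budget_seconds
    while time.monotonic() < deadline:
        # Hypothetical search space for the example.
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "epochs": rng.randint(10, 200),
        }
        score = evaluate(params)
        session["trials"] += 1
        if score > session["best_score"]:
            session["best_score"] = score
            session["best_params"] = params
    return session


def toy_score(params: dict[str, Any]) -> float:
    """Stand-in for training plus evaluation; peaks at learning_rate = 0.01."""
    return -abs(params["learning_rate"] - 0.01)


first = autolearn(toy_score, time_budget_seconds=0.05)
resumed = autolearn(toy_score, time_budget_seconds=0.05, previous_session=first)
```

Because the resumed call starts from the earlier best score, its result can only match or improve on the first session.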
Evaluation Against Real Data
The API evaluates Synthetic Data against real data using default metrics and charts:
- Correlation measures
- Composite score
- Information preserved
- Similarity
An evaluation report is returned in HTML and PDF formats. The real data is also evaluated against samples of itself to estimate the theoretical limit of closeness, that is, the score that data drawn from the same source achieves.
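The idea of a real-versus-real baseline can be sketched with a simple stand-in metric. The metric used here (1 minus the total variation distance between category frequencies) is an assumption chosen for brevity; it is not one of the API's documented metrics, and the function names are hypothetical.

```python
import random
from collections import Counter


def similarity(a: list[str], b: list[str]) -> float:
    """1 minus the total variation distance between two categorical samples."""
    freq_a, freq_b = Counter(a), Counter(b)
    categories = set(freq_a) | set(freq_b)
    tvd = 0.5 * sum(
        abs(freq_a[c] / len(a) - freq_b[c] / len(b)) for c in categories
    )
    return 1.0 - tvd


def real_vs_real_baseline(real: list[str], n_splits: int = 20, seed: int = 0) -> float:
    """Average similarity between random halves of the real data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = real[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores.append(similarity(shuffled[:half], shuffled[half:]))
    return sum(scores) / len(scores)


real = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
synthetic = ["A"] * 55 + ["B"] * 35 + ["C"] * 10

synthetic_score = similarity(synthetic, real)
baseline = real_vs_real_baseline(real)
```

Comparing `synthetic_score` with `baseline` puts the result in context: a synthetic score near the baseline means the Synthetic Data is about as close to the real data as another sample of real data would be.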
Synthetic Data Generation
A job represents a single run of either training followed by generation, or generation only if a model already exists.
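The two kinds of job can be sketched as a single function that trains only when no stored model is available. Everything here is hypothetical: the `Job` class, the `run_job` function, the dictionary used as a model store, and the toy `train` and `generate` callables are illustrations, not the API's interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Job:
    """One run: train and generate, or generate only."""
    model_id: str
    n_rows: int
    steps: list[str] = field(default_factory=list)


def run_job(
    job: Job,
    model_store: dict[str, Any],
    train: Callable[[list[Any]], Any],
    generate: Callable[[Any, int], list[Any]],
    real_data: Optional[list[Any]] = None,
) -> list[Any]:
    """Train first only if the store has no model for this job."""
    if job.model_id not in model_store:
        if real_data is None:
            raise ValueError("no stored model and no real data to train on")
        model_store[job.model_id] = train(real_data)
        job.steps.append("train")
    job.steps.append("generate")
    return generate(model_store[job.model_id], job.n_rows)


def toy_train(data: list[int]) -> list[int]:
    """Toy 'model': the sorted unique values seen in training."""
    return sorted(set(data))


def toy_generate(model: list[int], n_rows: int) -> list[int]:
    """Toy generator: cycle through the model's values."""
    return [model[i % len(model)] for i in range(n_rows)]


store: dict[str, Any] = {}
first_job = Job(model_id="claims", n_rows=4)
first_rows = run_job(first_job, store, toy_train, toy_generate, real_data=[3, 1, 2])

second_job = Job(model_id="claims", n_rows=2)
second_rows = run_job(second_job, store, toy_train, toy_generate)
```

The first job records both a training and a generation step, while the second reuses the stored model and records generation only.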
Machine Learning Operations
Organizations may have multiple data domains with distinct requirements. The API manages Synthetic Data generators using:
- Data Domains and Subdomains: Useful for auditing and lifecycle management.
- Model in Production: Indicates that the generators and their artifacts are stored and ready for future use.