Introduction

Learn about Protegrity Synthetic Data.

1: Privacy-Preserving Characteristics
2: Comparison with Other Privacy-Enhancing Technologies
3: Protegrity Synthetic Data Overview
4: How Protegrity Synthetic Data is Generated

Protegrity Synthetic Data unlocks the full potential of AI and analytics by creating entirely new data that mirrors the patterns of your original datasets. This new data contains no sensitive information. You can train and test AI models without risk. You can also scale these models without exposure or compliance violations.

Advantages of Protegrity Synthetic Data over Anonymized Data:

Preserves utility for analytics, machine learning, and testing while minimizing privacy risks.
Can simulate rare events or edge cases in data.
Does not have a 1:1 mapping to real records.
Is not regulated or biased.
Cannot be traced back to any individual.

Use Cases

Protegrity Synthetic Data is used for:

Training machine learning models without exposing sensitive data.
Sharing data across teams or vendors while maintaining compliance.
Replacing expensive or hard-to-source real-world data collection.
Testing and development environments that replicate real-world complexity without privacy risks.
Monetizing data and evaluating vendors.

1 - Privacy-Preserving Characteristics

Characteristics of Protegrity Synthetic Data that preserve privacy.

No Direct Link to Real Individuals

Protegrity Synthetic Data is generated from learned patterns in real datasets but does not contain any actual personal records. This ensures:

No 1:1 mapping between synthetic and real data.
No re-identification risk, even when used in sensitive domains, such as healthcare or finance.

Compliance with Privacy Regulations

General Data Protection Regulation (GDPR): Synthetic Data is considered anonymous under GDPR. It lacks identifiable links to real individuals.
Health Insurance Portability and Accountability Act (HIPAA): It qualifies under Safe Harbor and Expert Determination methods. This makes it suitable for healthcare data use, without being classified as Protected Health Information (PHI).

Built-In Privacy Safeguards

Protegrity’s Synthetic Data solution includes multiple privacy-enhancing features:

Privacy Measurement Tools: They evaluate how robust the generated data is against privacy risks.
Automated De-identification: It removes sensitive attributes while preserving data utility.
Support for Tabular Data: It enables realistic simulation of structured datasets for analytics and AI training.
On-demand Generation Capabilities: They allow developers to invoke Synthetic Data generation through an API and integrate it into pipelines with minimal effort.

2 - Comparison with Other Privacy-Enhancing Technologies

Understand the difference between Protegrity Synthetic Data and other data protection methods.

The following list compares Protegrity Synthetic Data with other data protection methods.

Pseudonymization replaces real data with tokens for certain attributes, such as Personally Identifiable Information (PII). However, this method still uses real data. The analytical value is fully preserved unless attributes needed for analysis are also tokenized.
Anonymization reduces the risk of re-identification by transforming quasi-identifiers. However, careful balancing of utility and privacy is needed to minimize the impact on downstream usage.
Synthetic Data closely resembles real data. It does not contain real records and typically results in less information loss compared to Anonymization.
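
To make the contrast between the first two methods concrete, the following minimal sketch applies both to one record from the example dataset used later in this document. It is an illustration under stated assumptions, not Protegrity's implementation: the keyed hash stands in for a tokenization service, and the helper names, the key, and the 10-year age band are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real deployment would manage keys securely.
SECRET_KEY = b"demo-key-not-for-production"

record = {"Name": "Jack", "Surname": "Dawson", "Age": 42, "Pet Owned": "Dog"}


def pseudonymize(rec, fields=("Name", "Surname")):
    """Replace selected attributes with tokens; all other values stay real."""
    out = dict(rec)
    for field in fields:
        digest = hmac.new(SECRET_KEY, str(rec[field]).encode("utf-8"), hashlib.sha256)
        out[field] = "tok_" + digest.hexdigest()[:12]
    return out


def anonymize(rec, band=10):
    """Drop direct identifiers and generalize the Age quasi-identifier into a band."""
    out = {k: v for k, v in rec.items() if k not in ("Name", "Surname")}
    low = (rec["Age"] // band) * band
    out["Age"] = f"{low}-{low + band - 1}"
    return out


print(pseudonymize(record))  # Age and Pet Owned are unchanged real values
print(anonymize(record))     # Age becomes a range, so some precision is lost
```

The pseudonymized record keeps the exact age, which is why its analytical value is intact but the data is still real. The anonymized record trades precision for privacy. Synthetic Data avoids both trade-offs by generating new records instead of transforming existing ones.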

Advantages

It can be used for analytics and advanced analytics with minimal impact.
It ensures that no real individual can be re-identified.
It is generated with privacy safeguards and can be used without user approval.
It can be viewed by any user once generated.
It is produced by processing all records together.
It does not require additional security measures.
It can be generated on demand.
It can be considered anonymous data within the context of GDPR.
It can be generated in a manner that avoids being subject to HIPAA regulations.

Disadvantages

It is slower than Pseudonymization or Anonymization.
It is not suitable for use cases where re-identification is necessary.
It requires a minimum amount of data to work reliably. The amount of data needed increases with data complexity.

3 - Protegrity Synthetic Data Overview

An overview of key characteristics of Protegrity Synthetic Data and its role in privacy compliance.

Protegrity Synthetic Data is a privacy-enhancing technology that uses real datasets to create artificial data. It does not represent real individuals and has no connection to real people. However, it still provides strong analytical utility and preserves relationships between variables.

Key Characteristics of Protegrity Synthetic Data

Feature	Synthetic Data
Represents real people	False. It has no direct link to real individuals.
Closeness to real individuals	Low. It preserves the relationships between variables found in real data, not individual records.
Analytics and advanced analytics	High utility. It is suitable for ML, forecasting, and testing.
Maintain data types	Guaranteed. It preserves schema compatibility.
Internal and external sharing	Possible. It is compliant with privacy regulations like GDPR and HIPAA.
Simulating rare scenarios	Possible. It simulates rare scenarios, fraud patterns, or edge cases not present in production.
Risk of re-identification	Low. It reduces the risk of re-identification compared to Anonymization or Pseudonymization.
Data progression	Possible. It can be used to create data trends that might change over time.
Cost	Moderate. It incurs varying costs depending on the complexity of the data and the synthesis methods used.
Scalability	High. It can be generated in large volumes as needed.
Maintenance	Moderate. It requires periodic updates to reflect changes in real data.

Protegrity Synthetic Data is a powerful tool for privacy compliance. It:

Does not represent real individuals, eliminating direct privacy risks.
Preserves analytical utility, making it suitable for machine learning, forecasting, and testing.
Maintains statistical relationships between variables without exposing personal information.

4 - How Protegrity Synthetic Data is Generated

Describes how Protegrity Synthetic Data generation works.

Protegrity Synthetic Data is a privacy-enhancing technology that creates artificial datasets. It works by learning from the structure and statistical properties of real data. It is designed to preserve analytical utility while protecting individual privacy. The process involves three key stages:

Stage 1: Extract Characteristics from Original Data

The system analyzes the original dataset to understand its structure and relationships:

Characteristics	Examples
Column types	string, integer, categorical
Value distributions	age ranges, frequency of pet types
Relationships between variables	age and pet ownership patterns
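
As a rough illustration of this stage, the sketch below profiles the four rows of the example Original Dataset shown later in this section. It uses only the Python standard library. The function name and the shape of the returned profile are assumptions made for this example and do not describe the product's internal analysis.

```python
from collections import Counter, defaultdict
from statistics import mean

# Rows copied from the example "Original Dataset" table in this section.
original = [
    {"Name": "Jack", "Surname": "Dawson", "Age": 42, "Pet Owned": "Dog"},
    {"Name": "Jane", "Surname": "Dawson", "Age": 25, "Pet Owned": "Cat"},
    {"Name": "Bill", "Surname": "Carvalho", "Age": 18, "Pet Owned": "Dog"},
    {"Name": "Jennie", "Surname": "Philip", "Age": 53, "Pet Owned": "Hamster"},
]


def extract_characteristics(rows):
    """Hypothetical helper: summarize column types, distributions, and one relationship."""
    # Column types, inferred from the first row.
    column_types = {col: type(val).__name__ for col, val in rows[0].items()}

    # Value distributions: the age range and how often each pet type occurs.
    ages = [r["Age"] for r in rows]
    pet_frequency = Counter(r["Pet Owned"] for r in rows)

    # Relationship between variables: mean age per pet type.
    ages_by_pet = defaultdict(list)
    for r in rows:
        ages_by_pet[r["Pet Owned"]].append(r["Age"])

    return {
        "column_types": column_types,
        "age_range": (min(ages), max(ages)),
        "pet_frequency": dict(pet_frequency),
        "mean_age_by_pet": {pet: mean(vals) for pet, vals in ages_by_pet.items()},
    }


profile = extract_characteristics(original)
print(profile)
```

The profile holds only aggregate statistics, not the rows themselves, which is what allows later stages to work from patterns instead of individual records.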

Stage 2: Generate Fictional Records

Based on the extracted characteristics, synthetic records are created using advanced modeling techniques:

Generative Algorithms: Generative Adversarial Networks (GANs) or other statistical models.
Privacy Assurance: These records are entirely fictional and do not correspond to real individuals.
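
The sketch below shows the general idea with a deliberately simple statistical sampler. It is not a GAN and is not the product's generator. The profile values are derived by hand from the example Original Dataset in this section, while the age spread, the age limits, the name pools, and the function name are hypothetical choices made for illustration.

```python
import random

# Assumed Stage 1 output, derived by hand from the example "Original Dataset" table.
PET_FREQUENCY = {"Dog": 2, "Cat": 1, "Hamster": 1}
MEAN_AGE_BY_PET = {"Dog": 30, "Cat": 25, "Hamster": 53}

# Hypothetical settings chosen for illustration only.
AGE_SPREAD = 8             # standard deviation used when sampling ages
MIN_AGE, MAX_AGE = 18, 90  # sampled ages are clamped to this range
FIRST_NAMES = ["Scott", "Anna", "Hank", "Jean", "Sean", "Carrie"]
SURNAMES = ["Vaz", "Rodriguez", "Summers", "Diaz", "Young", "Lewis"]


def generate_records(n, seed=0):
    """Sample n fictional records from the profile instead of copying real rows."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pets = list(PET_FREQUENCY)
    weights = [PET_FREQUENCY[p] for p in pets]
    records = []
    for _ in range(n):
        # Pet type follows the observed frequencies.
        pet = rng.choices(pets, weights=weights, k=1)[0]
        # Age is drawn around the mean age for that pet type, then clamped.
        age = round(rng.gauss(MEAN_AGE_BY_PET[pet], AGE_SPREAD))
        age = max(MIN_AGE, min(MAX_AGE, age))
        records.append({
            "Name": rng.choice(FIRST_NAMES),
            "Surname": rng.choice(SURNAMES),
            "Age": age,
            "Pet Owned": pet,
        })
    return records


synthetic = generate_records(8)
for row in synthetic:
    print(row)
```

Because every value is sampled from the profile or from a fictional name pool, the generator can produce more rows than the original dataset contains. A production generator would model many more columns and their joint relationships.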

Stage 3: Validate Privacy

The Protegrity Synthetic Data dataset undergoes rigorous validation to ensure privacy protection:

Re-identification Risk Analysis: It checks that no original entries can be inferred or reconstructed.
Privacy Techniques Applied: It includes methods like privacy risk scoring to quantify and mitigate risks.

Table: Original Dataset

Name	Surname	Age	Pet Owned
Jack	Dawson	42	Dog
Jane	Dawson	25	Cat
Bill	Carvalho	18	Dog
Jennie	Philip	53	Hamster

Table: Synthetic Data Dataset

Name	Surname	Age	Pet Owned
Scott	Vaz	48	Dog
Anna	Rodriguez	21	Cat
Hank	Summers	19	Dog
Jean	Vaz	51	Hamster
Bill	Diaz	58	Dog
Sean	Young	34	Dog
Carrie	Lewis	24	Hamster
Perry	Macanzie	42	Cat
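
The two tables above can be used to sketch the kind of checks that privacy validation involves. The example below is a simplified illustration and not the product's risk scoring: the three function names and the choice of Age and Pet Owned as quasi-identifiers are assumptions made for this example.

```python
# Rows copied from the two example tables: (Name, Surname, Age, Pet Owned).
original = [
    ("Jack", "Dawson", 42, "Dog"),
    ("Jane", "Dawson", 25, "Cat"),
    ("Bill", "Carvalho", 18, "Dog"),
    ("Jennie", "Philip", 53, "Hamster"),
]
synthetic = [
    ("Scott", "Vaz", 48, "Dog"),
    ("Anna", "Rodriguez", 21, "Cat"),
    ("Hank", "Summers", 19, "Dog"),
    ("Jean", "Vaz", 51, "Hamster"),
    ("Bill", "Diaz", 58, "Dog"),
    ("Sean", "Young", 34, "Dog"),
    ("Carrie", "Lewis", 24, "Hamster"),
    ("Perry", "Macanzie", 42, "Cat"),
]


def exact_matches(orig, synth):
    """Synthetic rows that are identical to an original row."""
    orig_set = set(orig)
    return [row for row in synth if row in orig_set]


def quasi_identifier_overlap(orig, synth, cols=(2, 3)):
    """Share of synthetic rows whose (Age, Pet Owned) pair also occurs in the original."""
    orig_keys = {tuple(row[c] for c in cols) for row in orig}
    hits = sum(1 for row in synth if tuple(row[c] for c in cols) in orig_keys)
    return hits / len(synth)


def closest_age_gap(orig, synth):
    """Smallest age difference between a synthetic and an original row with the same pet."""
    gaps = [abs(s[2] - o[2]) for s in synth for o in orig if s[3] == o[3]]
    return min(gaps)


print("Exact matches:", exact_matches(original, synthetic))
print("Quasi-identifier overlap:", quasi_identifier_overlap(original, synthetic))
print("Closest age gap (same pet):", closest_age_gap(original, synthetic))
```

For these tables, no synthetic row copies an original row and no (Age, Pet Owned) pair is shared. The closest pair differs by one year of age, which is the kind of near match a risk score would weigh against the size and diversity of the dataset.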