Introduction
Learn about Protegrity Synthetic Data.
Protegrity Synthetic Data unlocks the full potential of AI and analytics by creating entirely new data that mirrors the patterns of your original datasets. This new data contains no sensitive information. You can train and test AI models without risk. You can also scale these models without exposure or compliance violations.
Advantages of Protegrity Synthetic Data over Anonymized Data:
- Preserves utility for analytics, machine learning, and testing while minimizing privacy risks.
- Can simulate rare events or edge cases in data.
- Does not have a 1:1 mapping to real records.
- Is not regulated or biased.
- Cannot be traced back to any individual.
Use Cases
Protegrity Synthetic Data is used for:
- Training machine learning models without exposing sensitive data.
- Sharing data across teams or vendors while maintaining compliance.
- Replacing expensive or hard-to-source real-world data collection.
- Testing and development environments that replicate real-world complexity without privacy risks.
- Monetizing data and evaluating vendors.
1 - Privacy-Preserving Characteristics
Lists the privacy-preserving characteristics of Protegrity Synthetic Data.
No Direct Link to Real Individuals
Protegrity Synthetic Data is generated from learned patterns in real datasets but does not contain any actual personal records. This ensures:
- No 1:1 mapping between synthetic and real data.
- No re-identification risk, even when used in sensitive domains, such as healthcare or finance.
Compliance with Privacy Regulations
- General Data Protection Regulation (GDPR): Synthetic Data is considered anonymous under GDPR. It lacks identifiable links to real individuals.
- Health Insurance Portability and Accountability Act (HIPAA): It qualifies under Safe Harbor and Expert Determination methods. This makes it suitable for healthcare data use, without being classified as Protected Health Information (PHI).
Built-In Privacy Safeguards
Protegrity’s Synthetic Data solution includes multiple privacy-enhancing features:
- Privacy Measurement Tools: It evaluates the robustness of data.
- Automated De-identification: It removes sensitive attributes while preserving data utility.
- Support for Tabular Data: It enables realistic simulation of structured datasets for analytics and AI training.
- On-demand Generation Capabilities: It allows developers to invoke Synthetic Data generation using API and integrate it into pipelines with minimal effort.
2 - Comparison with Other Privacy-Enhancing Technologies
Understand the difference between Protegrity Synthetic Data and other data protection methods.
The following section compares Protegrity Synthetic Data with Pseudonymization and Anonymization.
Pseudonymization replaces real values with tokens for certain attributes, such as Personally Identifiable Information (PII). However, this method still uses real data for the remaining attributes, and the analytical value is fully preserved unless attributes needed for analysis are also tokenized.
Anonymization reduces the risk of reidentification by transforming quasi-identifiers. However, careful balancing of utility and privacy is needed to minimize the impact on downstream usage.
Synthetic Data closely resembles real data. It does not contain real records and typically results in less information loss compared to Anonymization.
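The difference between the first two methods can be sketched on a single record. This is a minimal illustration only: the record, the key, the `tok_` prefix, and the helper names `pseudonymize` and `anonymize` are all hypothetical, and a keyed hash stands in for tokenization. It is not how Protegrity products implement either method.

```python
import hashlib
import hmac

# Hypothetical key and record, for illustration only.
SECRET_KEY = b"demo-key-not-for-production"
record = {"name": "Jack", "surname": "Dawson", "age": 42, "pet": "Dog"}

def pseudonymize(rec, fields=("name", "surname")):
    """Replace direct identifiers with keyed tokens; other attributes stay real."""
    out = dict(rec)
    for field in fields:
        digest = hmac.new(SECRET_KEY, str(rec[field]).encode(), hashlib.sha256)
        out[field] = "tok_" + digest.hexdigest()[:8]
    return out

def anonymize(rec, bucket=10):
    """Drop direct identifiers and generalize the age quasi-identifier to a range."""
    low = (rec["age"] // bucket) * bucket
    return {"age": f"{low}-{low + bucket - 1}", "pet": rec["pet"]}

pseudo = pseudonymize(record)
anon = anonymize(record)
print(pseudo)  # identifiers tokenized, age and pet unchanged
print(anon)    # identifiers removed, age reduced to a 10-year range
```

In the pseudonymized record the age and pet values are still the real ones, while the anonymized record loses precision on age. A synthetic record would instead be a newly generated row that follows the same patterns.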
Advantages
- It can be used for analytics and advanced analytics with minimal impact.
- It ensures that no real individual can be re-identified.
- It is generated with privacy safeguards and can be used without user approval.
- It can be viewed by any user once generated.
- It is produced by processing all records together.
- It does not require additional security measures.
- It can be generated on demand.
- It can be considered anonymous data within the context of GDPR.
- It can be generated in a manner that avoids being subject to HIPAA regulations.
Disadvantages
- It is slower than Pseudonymization or Anonymization.
- It is not suitable for use cases where re-identification is necessary.
- It requires a minimum amount of data to work reliably. The amount of data needed increases with data complexity.
3 - Protegrity Synthetic Data Overview
An overview of key characteristics of Protegrity Synthetic Data and its role in privacy compliance.
Protegrity Synthetic Data is a privacy-enhancing technology that uses real datasets to create artificial data. It does not represent real individuals and has no connection to real people. However, it still provides strong analytical utility and preserves relationships between variables.
Key Characteristics of Protegrity Synthetic Data
| Feature | Synthetic Data |
|---|---|
| Represents real people | False. It has no direct link to real individuals. |
| Closeness to real individuals | Low. It preserves the relationships between variables found in real data, not individual records. |
| Analytics and advanced analytics | High utility. It is suitable for ML, forecasting, and testing. |
| Maintain data types | Guaranteed. It preserves the schema compatibility. |
| Internal and external sharing | Possible. It is compliant with privacy regulations like GDPR and HIPAA. |
| Simulating rare scenarios | Possible. It simulates rare scenarios, fraud patterns, or edge cases not present in production. |
| Risk of re-identification | Low. It minimizes the risk of re-identification compared to Anonymization or Pseudonymization. |
| Data progression | Possible. It can be used to create data trends that might change over time. |
| Cost | Moderate. It incurs varying costs depending on the complexity of the data and the synthesis methods used. |
| Scalability | High. It can be generated in large volumes as needed. |
| Maintenance | Moderate. It requires periodic updates to reflect changes in real data. |
Protegrity Synthetic Data is a powerful tool for privacy compliance. It:
- Does not represent real individuals, eliminating direct privacy risks.
- Preserves analytical utility, making it suitable for machine learning, forecasting, and testing.
- Maintains statistical relationships between variables without exposing personal information.
4 - How Protegrity Synthetic Data is Generated
Describes how Protegrity Synthetic Data generation works.
Protegrity Synthetic Data is a privacy-enhancing technology that creates artificial datasets. It works by learning from the structure and statistical properties of real data. It is designed to preserve analytical utility while protecting individual privacy. The process involves three key stages:
Stage 1: Analyze the Original Dataset
The system analyzes the original dataset to understand its structure and relationships:
| Characteristics | Examples |
|---|---|
| Column types | string, integer, categorical |
| Value distributions | age ranges, frequency of pet types |
| Relationships between variables | age and pet ownership patterns |
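The kind of analysis listed in the table can be sketched with the four-row example dataset used later in this section. The `profile` function and its output keys are hypothetical names chosen for illustration. The product's actual analysis is not documented here and is likely far richer.

```python
from collections import Counter

# Example original dataset from this section (column names as in the table).
original = [
    {"Name": "Jack", "Surname": "Dawson", "Age": 42, "Pet Owned": "Dog"},
    {"Name": "Jane", "Surname": "Dawson", "Age": 25, "Pet Owned": "Cat"},
    {"Name": "Bill", "Surname": "Carvalho", "Age": 18, "Pet Owned": "Dog"},
    {"Name": "Jennie", "Surname": "Philip", "Age": 53, "Pet Owned": "Hamster"},
]

def profile(rows):
    """Summarize column types, value distributions, and one relationship."""
    column_types = {col: type(val).__name__ for col, val in rows[0].items()}
    ages = [r["Age"] for r in rows]
    pet_counts = Counter(r["Pet Owned"] for r in rows)
    # Relationship between variables: average age per pet type.
    ages_by_pet = {}
    for r in rows:
        ages_by_pet.setdefault(r["Pet Owned"], []).append(r["Age"])
    mean_age_by_pet = {pet: sum(a) / len(a) for pet, a in ages_by_pet.items()}
    return {
        "column_types": column_types,
        "age_range": (min(ages), max(ages)),
        "pet_counts": dict(pet_counts),
        "mean_age_by_pet": mean_age_by_pet,
    }

stats = profile(original)
for key, value in stats.items():
    print(key, value)
```

For this dataset the profile reports an age range of 18 to 53, two dog owners, one cat owner, and one hamster owner.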
Stage 2: Generate Fictional Records
Based on the extracted characteristics, synthetic records are created using advanced modeling techniques:
- Generative Algorithms: Generative Adversarial Networks (GANs) or other statistical models.
- Privacy Assurance: These records are entirely fictional and do not correspond to real individuals.
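As a rough intuition for this stage, the sketch below draws new records from simple statistics of the example dataset. It deliberately uses plain conditional sampling rather than a GAN, and it is not the product's algorithm. The `generate` function, the name pools, and the `jitter` parameter are hypothetical.

```python
import random

original = [
    {"Name": "Jack", "Surname": "Dawson", "Age": 42, "Pet Owned": "Dog"},
    {"Name": "Jane", "Surname": "Dawson", "Age": 25, "Pet Owned": "Cat"},
    {"Name": "Bill", "Surname": "Carvalho", "Age": 18, "Pet Owned": "Dog"},
    {"Name": "Jennie", "Surname": "Philip", "Age": 53, "Pet Owned": "Hamster"},
]

# Hypothetical pools of fictional names that do not occur in the original data.
FICTIONAL_NAMES = ["Scott", "Anna", "Hank", "Jean"]
FICTIONAL_SURNAMES = ["Vaz", "Rodriguez", "Summers", "Diaz"]

def generate(rows, n, seed=0, jitter=5):
    """Sample n fictional records that follow the observed pet and age patterns."""
    rng = random.Random(seed)
    # Choosing from the raw list reproduces the observed pet frequencies.
    pets = [r["Pet Owned"] for r in rows]
    ages_by_pet = {}
    for r in rows:
        ages_by_pet.setdefault(r["Pet Owned"], []).append(r["Age"])
    synthetic = []
    for _ in range(n):
        pet = rng.choice(pets)
        # Keep the age/pet relationship, but perturb the age.
        age = rng.choice(ages_by_pet[pet]) + rng.randint(-jitter, jitter)
        synthetic.append({
            "Name": rng.choice(FICTIONAL_NAMES),
            "Surname": rng.choice(FICTIONAL_SURNAMES),
            "Age": max(age, 0),
            "Pet Owned": pet,
        })
    return synthetic

synthetic = generate(original, n=8)
for row in synthetic:
    print(row)
```

The number of generated records is independent of the size of the original dataset, which is why a synthetic dataset can be larger than its source.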
Stage 3: Validate Privacy
The Protegrity Synthetic Data dataset undergoes rigorous validation to ensure privacy protection:
- Re-identification Risk Analysis: It ensures that no original entries can be inferred or reconstructed.
- Privacy Techniques Applied: It includes methods like privacy risk scoring to quantify and mitigate risks.
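Two very simple checks in the spirit of this stage are sketched below, using the two example tables from this section. They are only an illustration: the function names are hypothetical, and real privacy risk scoring, including the product's, involves considerably more than exact-match counting.

```python
# Rows are (Name, Surname, Age, Pet Owned), copied from the example tables.
original = [
    ("Jack", "Dawson", 42, "Dog"),
    ("Jane", "Dawson", 25, "Cat"),
    ("Bill", "Carvalho", 18, "Dog"),
    ("Jennie", "Philip", 53, "Hamster"),
]
synthetic = [
    ("Scott", "Vaz", 48, "Dog"),
    ("Anna", "Rodriguez", 21, "Cat"),
    ("Hank", "Summers", 19, "Dog"),
    ("Jean", "Vaz", 51, "Hamster"),
    ("Bill", "Diaz", 58, "Dog"),
    ("Sean", "Young", 34, "Dog"),
    ("Carrie", "Lewis", 24, "Hamster"),
    ("Perry", "Macanzie", 42, "Cat"),
]

def exact_copy_rate(orig, synth):
    """Share of synthetic rows that are verbatim copies of an original row."""
    orig_set = set(orig)
    return sum(row in orig_set for row in synth) / len(synth)

def quasi_identifier_match_rate(orig, synth, qi=(2, 3)):
    """Share of synthetic rows whose (Age, Pet Owned) pair occurs in the original."""
    orig_qi = {tuple(row[i] for i in qi) for row in orig}
    return sum(tuple(row[i] for i in qi) in orig_qi for row in synth) / len(synth)

print(exact_copy_rate(original, synthetic))
print(quasi_identifier_match_rate(original, synthetic))
```

Both rates are 0.0 for the example tables: no synthetic row copies an original row, and no synthetic row shares an age and pet combination with an original row.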
Table: Original Dataset
| Name | Surname | Age | Pet Owned |
|---|---|---|---|
| Jack | Dawson | 42 | Dog |
| Jane | Dawson | 25 | Cat |
| Bill | Carvalho | 18 | Dog |
| Jennie | Philip | 53 | Hamster |
Table: Synthetic Data Dataset
| Name | Surname | Age | Pet Owned |
|---|---|---|---|
| Scott | Vaz | 48 | Dog |
| Anna | Rodriguez | 21 | Cat |
| Hank | Summers | 19 | Dog |
| Jean | Vaz | 51 | Hamster |
| Bill | Diaz | 58 | Dog |
| Sean | Young | 34 | Dog |
| Carrie | Lewis | 24 | Hamster |
| Perry | Macanzie | 42 | Cat |
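To see how the synthetic table mirrors the original one, the sketch below compares two simple statistics of the example tables. The helper names `pet_shares` and `mean_age` are hypothetical, and these two measures are chosen only for illustration.

```python
from collections import Counter

# (Age, Pet Owned) pairs taken from the two example tables above.
original = [(42, "Dog"), (25, "Cat"), (18, "Dog"), (53, "Hamster")]
synthetic = [
    (48, "Dog"), (21, "Cat"), (19, "Dog"), (51, "Hamster"),
    (58, "Dog"), (34, "Dog"), (24, "Hamster"), (42, "Cat"),
]

def pet_shares(rows):
    """Fraction of records per pet type."""
    counts = Counter(pet for _, pet in rows)
    return {pet: n / len(rows) for pet, n in counts.items()}

def mean_age(rows):
    """Average age across all records."""
    return sum(age for age, _ in rows) / len(rows)

print(pet_shares(original), mean_age(original))
print(pet_shares(synthetic), mean_age(synthetic))
```

Both tables have the same pet shares (50% dogs, 25% cats, 25% hamsters), and the mean ages are close (34.5 in the original and 37.125 in the synthetic table), even though the synthetic table has twice as many rows and contains none of the original records.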