Introduction

Learn about Protegrity Synthetic Data.

Protegrity Synthetic Data unlocks the full potential of AI and analytics by creating entirely new data that mirrors the patterns of your original datasets. This new data contains no sensitive information. You can train and test AI models without risk. You can also scale these models without exposure or compliance violations.

Advantages of Protegrity Synthetic Data over Anonymized Data:

  • Preserves utility for analytics, machine learning, and testing while minimizing privacy risks.
  • Can simulate rare events or edge cases in data.
  • Does not have a 1:1 mapping to real records.
  • Is not regulated or biased.
  • Cannot be traced back to any individual.

Use Cases

Protegrity Synthetic Data is used for:

  • Training machine learning models without exposing sensitive data.
  • Sharing data across teams or vendors while maintaining compliance.
  • Replacing expensive or hard-to-source real-world data collection.
  • Testing and development environments that replicate real-world complexity without privacy risks.
  • Monetizing data and evaluating vendors.

1 - Privacy-Preserving Characteristics

The characteristics that make Protegrity Synthetic Data privacy-preserving.

Protegrity Synthetic Data is generated from learned patterns in real datasets but does not contain any actual personal records. This ensures:

  • No 1:1 mapping between synthetic and real data.
  • No re-identification risk, even when used in sensitive domains, such as healthcare or finance.

Compliance with Privacy Regulations

  • General Data Protection Regulation (GDPR): Synthetic Data is considered anonymous under GDPR. It lacks identifiable links to real individuals.
  • Health Insurance Portability and Accountability Act (HIPAA): It qualifies under Safe Harbor and Expert Determination methods. This makes it suitable for healthcare data use, without being classified as Protected Health Information (PHI).

Built-In Privacy Safeguards

Protegrity’s Synthetic Data solution includes multiple privacy-enhancing features:

  • Privacy Measurement Tools: They evaluate how robust the generated data is against privacy risks.
  • Automated De-identification: It removes sensitive attributes while preserving data utility.
  • Support for Tabular Data: It enables realistic simulation of structured datasets for analytics and AI training.
  • On-demand Generation Capabilities: They allow developers to invoke Synthetic Data generation through an API and integrate it into pipelines with minimal effort.

2 - Comparison with Other Privacy-Enhancing Technologies

Understand the difference between Protegrity Synthetic Data and other data protection methods.

The following section provides details about Protegrity Synthetic Data and other data protection methods.

  • Pseudonymization replaces real data with tokens for certain attributes, such as Personally Identifiable Information (PII). However, this method still uses real data. The analytical value is fully preserved unless the attributes needed for analysis are also tokenized.

  • Anonymization reduces the risk of re-identification by transforming quasi-identifiers. However, utility and privacy must be balanced carefully to minimize the impact on downstream usage.

  • Synthetic Data closely resembles real data. It does not contain real records and typically results in less information loss compared to Anonymization.
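To make the contrast with Pseudonymization concrete, the following is a minimal sketch and not Protegrity's tokenization. It assumes a keyed hash (HMAC-SHA256) as a stand-in for a token, which is one-way, whereas production tokenization is typically reversible. The key, the `tokenize` helper, and the field names are hypothetical. The point it illustrates is that the non-tokenized attributes remain real values.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real deployment would manage keys securely.
SECRET_KEY = b"example-key-not-for-production"

def tokenize(value, key=SECRET_KEY):
    """Stand-in token: a truncated keyed hash of the value (one-way, deterministic)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]

# One record in the style of the example dataset used in this documentation.
record = {"name": "Jack", "surname": "Dawson", "age": 42, "pet": "Dog"}

# Only the identifying attributes are replaced; age and pet are still the real values.
pseudonymized = {
    **record,
    "name": tokenize(record["name"]),
    "surname": tokenize(record["surname"]),
}

print(pseudonymized)
```

Because `age` and `pet` pass through unchanged, analytics on those columns is unaffected, but the record still describes a real person. A synthetic record, by contrast, has no such counterpart.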

Advantages

  • It can be used for analytics and advanced analytics with minimal impact.
  • It ensures that no real individual can be re-identified.
  • It is generated with privacy safeguards and can be used without user approval.
  • It can be viewed by any user once generated.
  • It is produced by processing all records together.
  • It does not require additional security measures.
  • It can be generated on demand.
  • It can be considered anonymous data within the context of GDPR.
  • It can be generated in a manner that avoids being subject to HIPAA regulations.

Disadvantages

  • It is slower than Pseudonymization or Anonymization.
  • It is not suitable for use cases where re-identification is necessary.
  • It requires a minimum amount of data to work reliably. The amount of data needed increases with data complexity.

3 - Protegrity Synthetic Data Overview

An overview of key characteristics of Protegrity Synthetic Data and its role in privacy compliance.

Protegrity Synthetic Data is a privacy-enhancing technology that uses real datasets to create artificial data. It does not represent real individuals and has no connection to real people. However, it still provides strong analytical utility and preserves relationships between variables.

Key Characteristics of Protegrity Synthetic Data

| Feature | Synthetic Data |
| --- | --- |
| Represents real people | False. It has no direct link to real individuals. |
| Closeness to real individuals | Low. It preserves relationships between variables and real data. |
| Analytics and advanced analytics | High utility. It is suitable for ML, forecasting, and testing. |
| Maintain data types | Guaranteed. It preserves the schema compatibility. |
| Internal and external sharing | Possible. It is compliant with privacy regulations like GDPR and HIPAA. |
| Simulating rare scenarios | Possible. It simulates rare scenarios, fraud patterns, or edge cases not present in production. |
| Risk of re-identification | Low. It minimizes the risk of re-identification compared to Anonymization or Pseudonymization. |
| Data progression | Possible. It can be used to create data trends that might change over time. |
| Cost | Moderate. It incurs varying costs depending on the complexity of the data and the synthesis methods used. |
| Scalability | High. It can be generated in large volumes as needed. |
| Maintenance | Moderate. It requires periodic updates to reflect changes in real data. |

Protegrity Synthetic Data is a powerful tool for privacy compliance. It:

  • Does not represent real individuals, eliminating direct privacy risks.
  • Preserves analytical utility, making it suitable for machine learning, forecasting, and testing.
  • Maintains statistical relationships between variables without exposing personal information.

4 - How Protegrity Synthetic Data is Generated

Describes how Protegrity Synthetic Data generation works.

Protegrity Synthetic Data is a privacy-enhancing technology that creates artificial datasets. It works by learning from the structure and statistical properties of real data. It is designed to preserve analytical utility while protecting individual privacy. The process involves three key stages:

Stage 1: Extract Characteristics from Original Data

The system analyzes the original dataset to understand its structure and relationships:

| Characteristics | Examples |
| --- | --- |
| Column types | string, integer, categorical |
| Value distributions | age ranges, frequency of pet types |
| Relationships between variables | age and pet ownership patterns |
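As an illustration of the kind of profile Stage 1 describes, the sketch below extracts column types, value distributions, and one relationship from the four-row original dataset shown in this section. It is a simplified illustration in plain Python, not the product's actual analysis. The function name `extract_characteristics` and the dictionary keys are hypothetical.

```python
from collections import Counter, defaultdict

# Rows copied from the "Original Dataset" table in this section.
original = [
    {"name": "Jack", "surname": "Dawson", "age": 42, "pet": "Dog"},
    {"name": "Jane", "surname": "Dawson", "age": 25, "pet": "Cat"},
    {"name": "Bill", "surname": "Carvalho", "age": 18, "pet": "Dog"},
    {"name": "Jennie", "surname": "Philip", "age": 53, "pet": "Hamster"},
]

def extract_characteristics(rows):
    """Summarize column types, value distributions, and one relationship."""
    # Column types: the Python type name of each column, taken from the first row.
    column_types = {col: type(val).__name__ for col, val in rows[0].items()}

    # Value distributions: the age range and the frequency of each pet type.
    ages = [row["age"] for row in rows]
    pet_frequencies = Counter(row["pet"] for row in rows)

    # Relationship between variables: which ages occur with which pet type.
    ages_by_pet = defaultdict(list)
    for row in rows:
        ages_by_pet[row["pet"]].append(row["age"])

    return {
        "column_types": column_types,
        "age_range": (min(ages), max(ages)),
        "pet_frequencies": dict(pet_frequencies),
        "ages_by_pet": dict(ages_by_pet),
    }

profile = extract_characteristics(original)
print(profile)
```

A real profile would cover many more columns and use richer statistics, but the three categories in the table map directly onto `column_types`, the distribution entries, and `ages_by_pet`.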

Stage 2: Generate Fictional Records

Based on the extracted characteristics, synthetic records are created using advanced modeling techniques:

  • Generative Algorithms: Generative Adversarial Networks (GANs) or other statistical models.
  • Privacy Assurance: These records are entirely fictional and do not correspond to real individuals.
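The generation step can be sketched with a deliberately simple statistical sampler. This is not a GAN and not the product's generator. It assumes a hypothetical profile (pet frequencies and ages per pet type, taken from the example dataset) and hypothetical pools of fictional names. Adding random jitter to ages is for illustration only and is not, on its own, a privacy guarantee.

```python
import random

# Hypothetical profile of the kind Stage 1 might produce for the example dataset.
pet_frequencies = {"Dog": 2, "Cat": 1, "Hamster": 1}
ages_by_pet = {"Dog": [42, 18], "Cat": [25], "Hamster": [53]}

# Hypothetical pools of fictional names, chosen independently of the original records.
first_names = ["Scott", "Anna", "Hank", "Jean"]
surnames = ["Vaz", "Rodriguez", "Summers", "Diaz"]

def generate_records(n, seed=0, age_jitter=8):
    """Sample n fictional records that follow the profiled distributions."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pets = list(pet_frequencies)
    weights = [pet_frequencies[pet] for pet in pets]
    records = []
    for _ in range(n):
        # Draw the pet type in proportion to its observed frequency.
        pet = rng.choices(pets, weights=weights, k=1)[0]
        # Draw an age near those observed for that pet type (keeps the relationship).
        base_age = rng.choice(ages_by_pet[pet])
        age = max(0, base_age + rng.randint(-age_jitter, age_jitter))
        records.append({
            "name": rng.choice(first_names),
            "surname": rng.choice(surnames),
            "age": age,
            "pet": pet,
        })
    return records

synthetic = generate_records(8)
for row in synthetic:
    print(row)
```

The design choice to sample the pet type first and the age conditionally on it is what keeps the age and pet ownership pattern intact. A generative model such as a GAN learns such dependencies across all columns rather than having them coded by hand.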

Stage 3: Validate Privacy

The Protegrity Synthetic Data dataset undergoes rigorous validation to ensure privacy protection:

  • Re-identification Risk Analysis: It ensures that no original entries can be inferred or reconstructed.
  • Privacy Techniques Applied: It includes methods like privacy risk scoring to quantify and mitigate risks.
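Two basic checks in the spirit of this stage can be sketched as follows, using the rows from the example tables in this section. They are illustrative only and do not represent the product's privacy risk scoring. The function names and the choice of (age, pet) as quasi-identifiers are assumptions.

```python
# Rows copied from the "Original Dataset" and "Synthetic Data Dataset" tables.
original = [
    ("Jack", "Dawson", 42, "Dog"),
    ("Jane", "Dawson", 25, "Cat"),
    ("Bill", "Carvalho", 18, "Dog"),
    ("Jennie", "Philip", 53, "Hamster"),
]
synthetic = [
    ("Scott", "Vaz", 48, "Dog"),
    ("Anna", "Rodriguez", 21, "Cat"),
    ("Hank", "Summers", 19, "Dog"),
    ("Jean", "Vaz", 51, "Hamster"),
    ("Bill", "Diaz", 58, "Dog"),
    ("Sean", "Young", 34, "Dog"),
    ("Carrie", "Lewis", 24, "Hamster"),
    ("Perry", "Macanzie", 42, "Cat"),
]

def exact_copies(original_rows, synthetic_rows):
    """Return synthetic rows that are identical to an original row."""
    originals = set(original_rows)
    return [row for row in synthetic_rows if row in originals]

def quasi_identifier_match_rate(original_rows, synthetic_rows):
    """Share of synthetic rows whose (age, pet) pair also occurs in the original."""
    original_pairs = {(age, pet) for _, _, age, pet in original_rows}
    hits = sum(1 for _, _, age, pet in synthetic_rows if (age, pet) in original_pairs)
    return hits / len(synthetic_rows)

copies = exact_copies(original, synthetic)
match_rate = quasi_identifier_match_rate(original, synthetic)
print("exact copies:", copies)
print("quasi-identifier match rate:", match_rate)
```

For the example tables, no synthetic row copies an original row and no (age, pet) pair is shared, so both checks come back empty. Passing such checks is necessary but not sufficient evidence of low re-identification risk.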

Table: Original Dataset

| Name | Surname | Age | Pet Owned |
| --- | --- | --- | --- |
| Jack | Dawson | 42 | Dog |
| Jane | Dawson | 25 | Cat |
| Bill | Carvalho | 18 | Dog |
| Jennie | Philip | 53 | Hamster |

Table: Synthetic Data Dataset

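| Name | Surname | Age | Pet Owned |
| --- | --- | --- | --- |
| Scott | Vaz | 48 | Dog |
| Anna | Rodriguez | 21 | Cat |
| Hank | Summers | 19 | Dog |
| Jean | Vaz | 51 | Hamster |
| Bill | Diaz | 58 | Dog |
| Sean | Young | 34 | Dog |
| Carrie | Lewis | 24 | Hamster |
| Perry | Macanzie | 42 | Cat |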
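One way to see that the synthetic dataset mirrors the original is to compare simple statistics across the two tables. The sketch below is an illustrative utility check, not a product feature. It compares pet-type proportions using total variation distance (a swapped-in metric chosen for simplicity) and compares mean ages.

```python
from collections import Counter
from statistics import mean

# (age, pet) pairs copied from the two tables above.
original = [(42, "Dog"), (25, "Cat"), (18, "Dog"), (53, "Hamster")]
synthetic = [
    (48, "Dog"), (21, "Cat"), (19, "Dog"), (51, "Hamster"),
    (58, "Dog"), (34, "Dog"), (24, "Hamster"), (42, "Cat"),
]

def pet_proportions(rows):
    """Fraction of rows for each pet type."""
    counts = Counter(pet for _, pet in rows)
    return {pet: count / len(rows) for pet, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

orig_props = pet_proportions(original)
synth_props = pet_proportions(synthetic)
tvd = total_variation(orig_props, synth_props)
orig_mean_age = mean(age for age, _ in original)
synth_mean_age = mean(age for age, _ in synthetic)

print("pet proportions (original): ", orig_props)
print("pet proportions (synthetic):", synth_props)
print("total variation distance:", tvd)
print("mean age:", orig_mean_age, "vs", synth_mean_age)
```

In this example, the pet-type proportions are identical in both tables (50% Dog, 25% Cat, 25% Hamster), so the distance is 0. The mean ages are 34.5 for the original and 37.125 for the synthetic dataset, which are close but not equal, as expected for data that follows the same patterns without copying records.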