1 - Classify

Identify, classify and locate sensitive data.

1.1 - Classify Text API

Classify unstructured plain-text data.

Method

POST

URL

http://{Host Address}/pty/data-discovery/v2/classify/text

Query Parameters

score_threshold

  • Type: float
  • Description: Optional. Exclude results with a score lower than this threshold.
  • Values: Minimum 0, Maximum 1.0
  • Default: 0.7

Body

  • Content type must be text/plain, encoded in UTF-8.

  • Body length is limited to 10K bytes.
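
The limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the product: the helper name check_request is hypothetical, and it assumes that "10K Bytes" means 10,000 bytes (the service may enforce 10,240 instead).

```python
# Assumption: "10K Bytes" is read as 10,000 bytes; adjust if the service uses 10,240.
MAX_BODY_BYTES = 10_000


def check_request(text, score_threshold=0.7):
    """Return the UTF-8 encoded body, or raise ValueError if a documented limit is violated."""
    if not 0.0 <= score_threshold <= 1.0:
        raise ValueError("score_threshold must be between 0 and 1.0")
    body = text.encode("utf-8")  # the limit applies to bytes, not characters
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"body is {len(body)} bytes; limit is {MAX_BODY_BYTES}")
    return body


body = check_request("You can reach Dave Elliot by phone 203-555-1286", 0.85)
print(len(body), "bytes")
```

Measuring the encoded bytes rather than the character count matters for non-ASCII text, where one character can occupy several bytes.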

Sample Request

curl -X POST "http://<Host_address>/pty/data-discovery/v2/classify/text?score_threshold=0.85" \
          -H "Content-Type: text/plain" \
          --data "You can reach Dave Elliot by phone 203-555-1286"

import requests

# Replace <Host_address> with the address of the Data Discovery host.
url = "http://<Host_address>/pty/data-discovery/v2/classify/text"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/plain"}
data = "You can reach Dave Elliot by phone 203-555-1286"

# verify=False disables TLS certificate checks; it has no effect on plain http URLs.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
print("Response JSON:", response.json())

URL: POST `http://<Host_address>/pty/data-discovery/v2/classify/text`
   Query Parameters:
   - score_threshold (optional), float between 0.0 and 1.0, default: 0.7
   Headers:
   - Content-Type: text/plain
   Body:
   - You can reach Dave Elliot by phone 203-555-1286

Sample Response

{
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "...",
            "status": 200,
            "elapsed_time": 0.028261899948120117,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "...",
            "status": 200,
            "elapsed_time": 0.040960073471069336,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "PERSON": [
            {
                "score": 0.9238499879837037,
                "location": {
                    "start_index": 14,
                    "end_index": 25
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "PERSON",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9976999759674072,
                        "original_entity": "NAME",
                        "details": {}
                    }
                ]
            }
        ],
        "PHONE_NUMBER": [
            {
                "score": 0.9995999932289124,
                "location": {
                    "start_index": 35,
                    "end_index": 47
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9995999932289124,
                        "original_entity": "PHONE",
                        "details": {}
                    }
                ]
            }
        ]
    }
}

Response Fields Description

Providers Section

Name | Example Response | Description
providers | Array | Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name | Pattern Classification Provider | Product name of the provider.
providers[n].version | 2.0.0 | Version of the provider.
providers[n].status | 200 | HTTP response code returned by the provider.
providers[n].elapsed_time | 0.028 | Time, in seconds, taken by the provider to process the request.
providers[n].config_provider | Object | Object containing configuration details for each provider.
providers[n].config_provider.name | Pattern | Internal name of the provider.
providers[n].config_provider.address | http://pattern_provider_service:8051 | Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types | [] | Array of supported content types. An empty array indicates support for all content types.

Classifications Section

Name | Example Response | Description
classifications | Dictionary | A dictionary mapping entity types (e.g., “PERSON”, “PHONE_NUMBER”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details.
classifications[’entity’][n].score | 0.9238 | The confidence score for the detected entity, aggregated from all contributing classifiers.
classifications[’entity’][n].location | Object | An object specifying the location of the entity within the input text.
classifications[’entity’][n].location.start_index | 14 | The starting index of the entity in the input text.
classifications[’entity’][n].location.end_index | 25 | The ending index of the entity in the input text.
classifications[’entity’][n].classifiers | Array | An array of classifier objects that contributed to the entity detection.
classifications[’entity’][n].classifiers[m].provider_index | 0 | The index of the provider in the top-level providers array.
classifications[’entity’][n].classifiers[m].name | SpacyRecognizer | The name of the classifier. A provider may have multiple classifiers.
classifications[’entity’][n].classifiers[m].score | 0.85 | The score assigned by the classifier for the entity detection.
classifications[’entity’][n].classifiers[m].original_entity | PERSON | The original entity type detected by the classifier. See Harmonization for details.
classifications[’entity’][n].classifiers[m].details | Object | Optional. Additional key-value details provided by the classifier.
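
As an illustration of how these fields fit together, the sketch below walks the classifications dictionary and slices each detected value out of the original text. The function name extract_entities is hypothetical. The code assumes that start_index and end_index are character offsets with an exclusive end, which is what the sample response suggests (14 to 25 covers "Dave Elliot").

```python
def extract_entities(text, response):
    """Return (entity_type, matched_text, score) tuples ordered by position in the text."""
    found = []
    for entity_type, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            start = occ["location"]["start_index"]
            end = occ["location"]["end_index"]  # assumption: exclusive end offset
            found.append((start, entity_type, text[start:end], occ["score"]))
    return [(entity, value, score) for _, entity, value, score in sorted(found)]


# Trimmed copy of the sample response shown earlier in this section.
sample_text = "You can reach Dave Elliot by phone 203-555-1286"
sample_response = {
    "classifications": {
        "PHONE_NUMBER": [
            {"score": 0.9995999932289124, "location": {"start_index": 35, "end_index": 47}}
        ],
        "PERSON": [
            {"score": 0.9238499879837037, "location": {"start_index": 14, "end_index": 25}}
        ],
    }
}

for entity, value, score in extract_entities(sample_text, sample_response):
    print(f"{entity}: {value!r} (score {score:.4f})")
```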

Response Codes

Response Code | Description
200 | Successful response.
206 | Partial Content. Only some providers classified data successfully.
400 | Bad Request. Invalid input parameters or content.
413 | Payload too large.
415 | Unsupported media type.
422 | Untrusted input. For more information, refer to Input Validation.
502 | Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598 | Unexpected internal server error. Check server logs.
599 | Internal server error. Check server logs.
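
A client can branch on these codes before it tries to read the response body. The sketch below is one possible approach, not documented service behavior: the name interpret_status and the decision to treat 200 and 206 as the only codes that carry usable classifications are assumptions.

```python
# Messages paraphrase the response code table above.
# The boolean marks whether a client might expect usable classifications (an assumption).
RESPONSE_CODES = {
    200: (True, "Successful response."),
    206: (True, "Partial content: only some providers classified data successfully."),
    400: (False, "Bad request: invalid input parameters or content."),
    413: (False, "Payload too large."),
    415: (False, "Unsupported media type."),
    422: (False, "Untrusted input: see Input Validation."),
    502: (False, "Bad gateway: all upstream providers failed."),
    598: (False, "Unexpected internal server error: check server logs."),
    599: (False, "Internal server error: check server logs."),
}


def interpret_status(status_code):
    """Return (usable, message) for a documented code, or (False, ...) for any other code."""
    return RESPONSE_CODES.get(status_code, (False, f"Undocumented status code {status_code}."))


usable, message = interpret_status(206)
print(usable, message)
```

For a 206 response, the status field of each entry in the providers array shows which provider failed.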

1.2 - Classify Tabular API

Classify structured tabular data.

Method

POST

URL

http://{Host Address}/pty/data-discovery/v2/classify/tabular

Query Parameters

score_threshold

  • Type: float
  • Description: Optional. Exclude results with a score lower than this threshold.
  • Values: Minimum 0, Maximum 1.0
  • Default: 0.7

has_headers

  • Type: boolean
  • Description: Optional. Indicates whether the first row represents the column header.
  • Values: true/false
  • Default: true

column_delimiter

  • Type: char
  • Description: Optional. Delimiter to separate the columns.
  • Default: ,

quote_char

  • Type: char
  • Description: Optional. Character used to quote fields that contain special characters, such as the column_delimiter or newline characters.
  • Default: "

Body

  • Content type should be text/csv, encoded in UTF-8.

  • Body size is limited to 10K bytes.
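
To show how column_delimiter and quote_char relate to the body, the sketch below builds a CSV payload with Python's standard csv module. The helper name build_csv_body is hypothetical, and the use of "\n" as the row terminator is an assumption based on the sample request.

```python
import csv
import io


def build_csv_body(header, rows, column_delimiter=",", quote_char='"'):
    """Serialize rows to UTF-8 CSV bytes, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=column_delimiter,
        quotechar=quote_char,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",  # assumption: newline-terminated rows, as in the sample request
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


# A field containing the delimiter is wrapped in the quote character.
body = build_csv_body(["Name", "Address"], [["Dave", "15 Main st, Hamden"]])
print(body.decode("utf-8"))
```

If a non-default delimiter or quote character is used to build the body, the same values would be passed in the column_delimiter and quote_char query parameters.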

Sample Request

curl -X POST "http://<Host_address>/pty/data-discovery/v2/classify/tabular?score_threshold=0.85" \
     --header 'Content-Type: text/csv' \
     --data-raw 'Social Security Number,Credit Card Number,IBAN,Phone Number
589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371'

import requests

# Replace <Host_address> with the address of the Data Discovery host.
url = "http://<Host_address>/pty/data-discovery/v2/classify/tabular"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/csv"}

# CSV rows start at column 0 so that no leading whitespace ends up in the data.
data = """Social Security Number,Credit Card Number,IBAN,Phone Number
589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371
"""

# verify=False disables TLS certificate checks; it has no effect on plain http URLs.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    # The body was not valid JSON, so print it as raw text instead.
    print("Response Text:", response.text)

URL: POST `http://<Host_address>/pty/data-discovery/v2/classify/tabular`
   Query Parameters:
   - score_threshold (optional), float between 0.0 and 1.0, default: 0.7
   - has_headers (optional), indicates whether the first row represents the column header.
   - column_delimiter (optional), delimiter to separate the columns.
   - quote_char (optional), character used to quote fields that contain special characters, such as the column_delimiter or newline characters.
   Headers:
   - Content-Type: text/csv
   Body:
Social Security Number,Credit Card Number,IBAN,Phone Number
589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371

Sample Response

{
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "...",
            "status": 200,
            "elapsed_time": 0.31273603439331055,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "...",
            "status": 200,
            "elapsed_time": 1.1383004188537598,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "SOCIAL_SECURITY_ID": [
            {
                "score": 0.9994888835483127,
                "rows_processed": 9,
                "location": {
                    "column_name": "Social Security Number",
                    "column_index": 0
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9994888835483127,
                        "details": {}
                    }
                ]
            }
        ],
        "CREDIT_CARD": [
            {
                "score": 0.9986333317226834,
                "rows_processed": 9,
                "location": {
                    "column_name": "Credit Card Number",
                    "column_index": 1
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9986333317226834,
                        "details": {}
                    }
                ]
            }
        ],
        "BANK_ACCOUNT": [
            {
                "score": 0.7901234567901234,
                "rows_processed": 9,
                "location": {
                    "column_name": "IBAN",
                    "column_index": 2
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "IbanRecognizer",
                        "rows_with_classification": 8,
                        "total_classifications": 8,
                        "score": 0.8888888888888888,
                        "details": {}
                    }
                ]
            }
        ],
        "PHONE_NUMBER": [
            {
                "score": 0.9961333341068692,
                "rows_processed": 9,
                "location": {
                    "column_name": "Phone Number",
                    "column_index": 3
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9961333341068692,
                        "details": {}
                    }
                ]
            }
        ]
    }
}

Response Fields Description

Providers Section

Name | Example Response | Description
providers | Array | Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name | Pattern Classification Provider | Product name of the provider.
providers[n].version | 2.0.0 | Version of the provider.
providers[n].status | 200 | HTTP response code returned by the provider.
providers[n].elapsed_time | 0.028 | Time, in seconds, taken by the provider to process the request.
providers[n].config_provider | Object | Object containing configuration details for each provider.
providers[n].config_provider.name | Pattern | Internal name of the provider.
providers[n].config_provider.address | http://pattern_provider_service:8051 | Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types | [] | Array of supported content types. An empty array indicates support for all content types.

Classifications Section

Name | Example Response | Description
classifications | Dictionary | A dictionary mapping entity types (e.g., “SOCIAL_SECURITY_ID”, “CREDIT_CARD”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location, classifier, and row details.
classifications[’entity’][n].score | 0.9995 | The confidence score for the detected entity, aggregated and calculated from all contributing classifiers and their reported scores.
classifications[’entity’][n].rows_processed | 9 | The number of rows passed to and processed by the classification request.
classifications[’entity’][n].location | Object | An object specifying the location of the entity within the tabular data.
classifications[’entity’][n].location.column_name | Social Security Number | The name of the column in which the entity was detected.
classifications[’entity’][n].location.column_index | 0 | The index of the column in which the entity was detected.
classifications[’entity’][n].classifiers | Array | An array of classifier objects that contributed to the entity detection.
classifications[’entity’][n].classifiers[m].provider_index | 1 | The index of the provider in the top-level providers array.
classifications[’entity’][n].classifiers[m].name | context | The name of the classifier. A provider may have multiple classifiers.
classifications[’entity’][n].classifiers[m].score | 0.9995 | The score assigned by the classifier for the entity detection.
classifications[’entity’][n].classifiers[m].rows_with_classification | 9 | The number of rows in which the entity was classified by this classifier.
classifications[’entity’][n].classifiers[m].total_classifications | 9 | The total number of classifications made by this classifier in this location. A single column can contain multiple entities, for example, date and time or a complex address.
classifications[’entity’][n].classifiers[m].details | Object | Optional. Additional key-value details provided by the classifier.
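
The sketch below shows one way a client might turn these fields into a per-column summary. The function name summarize_columns and the derived "coverage" value (rows_with_classification divided by rows_processed) are illustrative assumptions, not fields returned by the service.

```python
def summarize_columns(response):
    """Return one summary dict per detected column entity, ordered by column index."""
    summary = []
    for entity_type, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            location = occ["location"]
            rows_processed = occ["rows_processed"]
            # Highest row count reported by any contributing classifier.
            classified = max(c["rows_with_classification"] for c in occ["classifiers"])
            summary.append({
                "column_index": location["column_index"],
                "column_name": location.get("column_name"),
                "entity": entity_type,
                "score": occ["score"],
                # Hypothetical client-side metric, not part of the response.
                "coverage": classified / rows_processed if rows_processed else None,
            })
    return sorted(summary, key=lambda row: row["column_index"])


# Trimmed copy of the BANK_ACCOUNT entry from the sample response above.
sample_response = {
    "classifications": {
        "BANK_ACCOUNT": [{
            "score": 0.7901234567901234,
            "rows_processed": 9,
            "location": {"column_name": "IBAN", "column_index": 2},
            "classifiers": [{"provider_index": 0, "name": "IbanRecognizer",
                             "rows_with_classification": 8, "total_classifications": 8,
                             "score": 0.8888888888888888, "details": {}}],
        }]
    }
}

for row in summarize_columns(sample_response):
    print(row["column_name"], row["entity"], round(row["coverage"], 3))
```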

Response Codes

Response Code | Description
200 | Successful response.
206 | Partial Content. Only some providers classified data successfully.
400 | Bad Request. Invalid input parameters or content.
413 | Payload too large.
415 | Unsupported media type.
422 | Untrusted input. For more information, refer to Input Validation.
502 | Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598 | Unexpected internal server error. Check server logs.
599 | Internal server error. Check server logs.

1.3 - Input Validation

Rejecting unsanitized data.

The Classification service in Data Discovery provides an input validation security feature that rejects invalid input. Data that is malformed or non-normalized, or that contains homoglyphs, hieroglyphs, mixed Unicode variants, or control characters, is considered unsanitized. Such data is rejected and is not classified.

The following are a few examples of data that will be rejected:

  • 𝓉𝑒𝓍𝓉
  • Pep

Before invoking the Classification endpoint, ensure that the input text is normalized: replace invalid characters with their corresponding normalized plain-text characters. If the input text contains any invalid character, the service returns status code 422 with the message Untrusted input.
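
A minimal sketch of client-side normalization follows, using Python's standard unicodedata module. The validator's exact rules are not specified here, so the choice of the NFKC form, the decision to keep newline, carriage return, and tab characters, and the helper name normalize_input are all assumptions. Input cleaned this way may still be rejected, for example when it contains homoglyphs from another script, which NFKC does not replace.

```python
import unicodedata

# Assumption: these whitespace control characters are acceptable to the service.
ALLOWED_CONTROLS = "\n\r\t"


def normalize_input(text):
    """Apply NFKC normalization, then drop remaining control characters."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
    )


# Mathematical script and italic letters fold to plain ASCII under NFKC.
styled = "\U0001d4c9\U0001d452\U0001d4cd\U0001d4c9"
print(normalize_input(styled))
```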

For security purposes, the application rejects unsanitized data by default. It is recommended that this feature remains enabled. However, to override this feature, perform the following steps.

1.4 - Harmonization

Aggregate responses under a similar category.

Depending on their detection logic, the Pattern and Context classification providers might classify the same data under different labels. The classification service standardizes provider outputs into a unified response.

Consider the example, You can visit our office located in New York City.

  • Context provider might categorize New York City as CITY.
  • Pattern provider might categorize New York City as LOCATION.

This can cause an inconsistency in the outputs generated across the providers.

Data Discovery ensures standardization of responses by aggregating similar outputs of the providers under a common classification name. In the example shown, the classification service will categorize New York City under the category LOCATION.

For a complete reference, see the supported classification entities and their harmonization categories.

Harmonization Process

The following points describe the harmonization process in detail.

Providers Mapping Entities

Each provider is responsible for mapping its identified entities to harmonized classification entities that are consistent with those used by other providers. This ensures that the classification service can accurately aggregate and interpret responses across multiple providers. When a provider’s classification is harmonized, the response must include the originally identified entity alongside the harmonized classification.

The following snippet shows how the Context classification provider initially classified the entity as CITY, which was then harmonized into the category LOCATION.

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9222000122070313,
        "location": {
          "start_index": 36,
          "end_index": 49
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "SpacyRecognizer",
            "score": 0.85,
            "original_entity": "LOCATION",
            "details": {}
          },
          {
            "provider_index": 1,
            "name": "context",
            "score": 0.9944000244140625,
            "original_entity": "CITY",
            "details": {}
          }
        ]
      }
    ]
  }
}
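
As a short illustration, a client can read the original_entity values to see which provider labels were folded into each harmonized category. The function name original_labels is hypothetical; the input is a trimmed copy of the snippet above.

```python
def original_labels(response):
    """Map each harmonized category to the set of original labels reported by classifiers."""
    labels = {}
    for category, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            for classifier in occ.get("classifiers", []):
                original = classifier.get("original_entity")
                if original:
                    labels.setdefault(category, set()).add(original)
    return labels


sample_response = {
    "classifications": {
        "LOCATION": [{
            "score": 0.9222000122070313,
            "location": {"start_index": 36, "end_index": 49},
            "classifiers": [
                {"provider_index": 0, "name": "SpacyRecognizer", "score": 0.85,
                 "original_entity": "LOCATION", "details": {}},
                {"provider_index": 1, "name": "context", "score": 0.9944000244140625,
                 "original_entity": "CITY", "details": {}},
            ],
        }]
    }
}

print(original_labels(sample_response))
```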

Grouping by Matching Indexes

Entities are grouped together only if the provider responses contain the same start_index and end_index and their entities map to the same harmonized classification. If the start_index and end_index differ, the entities will not be grouped together.

As shown in the following snippet, the Pattern and Context providers classify the data as IT_IDENTITY_CARD and ID_CARD respectively. These are then grouped under the NATIONAL_ID category by the classification service.

{
  "providers": "...",
  "classifications": {
    "NATIONAL_ID": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 14,
          "end_index": 25
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_classification",
            "score": 0.85,
            "original_entity": "IT_IDENTITY_CARD" 
          }, {
            "provider_index": 1,
            "name": "context_classification",
            "score": 0.9972000122070312,
            "original_entity": "ID_CARD" 
          }
        ]
      }
    ]
  }
}
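
The grouping rule can be sketched as a dictionary keyed on the harmonized category and the exact span. This is an illustration of the rule as described, not the service's implementation; the function name group_detections and the flat input format are hypothetical.

```python
from collections import defaultdict


def group_detections(detections):
    """Group detections that share the same harmonized category, start_index, and end_index."""
    groups = defaultdict(list)
    for detection in detections:
        key = (detection["category"], detection["start_index"], detection["end_index"])
        groups[key].append({
            "provider_index": detection["provider_index"],
            "original_entity": detection["original_entity"],
            "score": detection["score"],
        })
    return dict(groups)


# Hypothetical, already-harmonized provider outputs.
detections = [
    {"category": "NATIONAL_ID", "start_index": 14, "end_index": 25,
     "provider_index": 0, "original_entity": "IT_IDENTITY_CARD", "score": 0.85},
    {"category": "NATIONAL_ID", "start_index": 14, "end_index": 25,
     "provider_index": 1, "original_entity": "ID_CARD", "score": 0.9972},
    {"category": "LOCATION", "start_index": 0, "end_index": 35,
     "provider_index": 0, "original_entity": "LOCATION", "score": 0.85},
    {"category": "LOCATION", "start_index": 0, "end_index": 17,
     "provider_index": 1, "original_entity": "STREET", "score": 0.9972},
]

groups = group_detections(detections)
for key, classifiers in groups.items():
    print(key, [c["original_entity"] for c in classifiers])
```

The two NATIONAL_ID detections share a span and form one group, while the two LOCATION detections have different spans and stay separate.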

Non-Matching Indexes

If the start_index and end_index values reported by the providers differ, the entities are not grouped together. However, they still appear under a common classification name.

The following table illustrates a common classification name for multiple providers.

Provider | Original Entity Labels | Common Classification Name
Pattern Provider | LOCATION | LOCATION
Context Provider | CITY, STATE, COUNTRY, COUNTY, ZIP_CODE, STREET, BUILDING, GEO_COORDINATE | LOCATION

The following snippet shows a sample response for this case.

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 35
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_provider",
            "score": 0.85,
            "original_entity": "LOCATION"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 17
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "STREET"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 20,
          "end_index": 22
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "BUILDING"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 25,
          "end_index": 31
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "ZIP_CODE"
          }
        ]
      }
    ]
  }
}
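
Because ungrouped occurrences can overlap, a client may want to know which narrower detections fall inside a wider one. The sketch below is an optional client-side step under that assumption; the function name nested_spans is hypothetical, and the spans are taken from the snippet above.

```python
def nested_spans(occurrences):
    """Return (inner, outer) index pairs where one occurrence's span lies within a wider one."""
    spans = [
        (occ["location"]["start_index"], occ["location"]["end_index"])
        for occ in occurrences
    ]
    pairs = []
    for i, (inner_start, inner_end) in enumerate(spans):
        for j, (outer_start, outer_end) in enumerate(spans):
            contained = outer_start <= inner_start and inner_end <= outer_end
            if i != j and contained and spans[i] != spans[j]:
                pairs.append((i, j))
    return pairs


# LOCATION occurrences from the snippet above, reduced to their spans.
location_occurrences = [
    {"location": {"start_index": 0, "end_index": 35}},
    {"location": {"start_index": 0, "end_index": 17}},
    {"location": {"start_index": 20, "end_index": 22}},
    {"location": {"start_index": 25, "end_index": 31}},
]

print(nested_spans(location_occurrences))
```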

1.5 - Supported Classification Entities

A list of the entities classified by Data Discovery.

Supported Entity Types

PII entities supported by Data Discovery with their Harmonized Categories.

Harmonized Category | Entity Name | Description
ACCOUNT_NAME | ACCOUNTNAME | Name associated with a financial account.
ACCOUNT_NUMBER | ACCOUNTNUMBER | Bank account number used to identify financial accounts.
AGE | AGE | Age information used to identify individuals.
AMOUNT | AMOUNT | Specific amount of money, which can be linked to financial transactions.
BANK_ACCOUNT | BIC | Bank Identifier Code used to identify financial institutions.
BANK_ACCOUNT | IBAN | International Bank Account Number used to identify bank accounts globally.
BANK_ACCOUNT | IBAN_CODE | International Bank Account Number used to identify bank accounts globally.
BANK_ACCOUNT | US_BANK_NUMBER | Bank account number used to identify financial accounts in the United States.
BANK_ROUTING_CODE | ABA_ROUTING_NUMBER | It identifies a bank/branch for routing payments, not an individual bank account.
BANK_ROUTING_CODE | BIC | SWIFT/BIC is a bank identifier, not an account number.
CREDIT_CARD | CCN | Credit card number used for financial transactions.
CREDIT_CARD | CREDIT_CARD | Credit card number used for financial transactions.
SECURITY_CODE | CREDIT_CARD_CVV | CVVs are security codes for payment authentication, not passwords.
CRYPTO_ADDRESS | BITCOINADDRESS | Bitcoin wallet address used for digital transactions.
CRYPTO_ADDRESS | CRYPTO | Cryptocurrency wallet address used for digital transactions.
CRYPTO_ADDRESS | ETHEREUMADDRESS | Ethereum wallet address used for digital transactions.
CRYPTO_ADDRESS | LITECOINADDRESS | Litecoin wallet address used for digital transactions.
CURRENCY_CODE | CURRENCYCODE | Code representing currency used in financial transactions.
CURRENCY_NAME | CURRENCY | Currency information used in financial transactions.
CURRENCY_NAME | CURRENCYNAME | Name of currency used in financial transactions.
CURRENCY_SYMBOL | CURRENCYSYMBOL | Symbol representing currency, sometimes linked to financial transactions.
DATETIME | DATE | Specific date that can be linked to personal activities.
DATETIME | DATE_TIME | Specific date and time that can be linked to personal activities.
DATETIME | TIME | Specific time that can be linked to personal activities.
DRIVER_LICENSE | DRIVERLICENSE | Driver’s license number used to identify individuals.
DRIVER_LICENSE | IT_DRIVER_LICENSE | Driver’s license number used to identify individuals in Italy.
DRIVER_LICENSE | US_DRIVER_LICENSE | Driver’s license number used to identify individuals in the United States.
EMAIL_ADDRESS | EMAIL | Email address used for communication and identification.
EMAIL_ADDRESS | EMAIL_ADDRESS | Email address used for communication and identification.
GENDER | GENDER | Gender information used to identify individuals.
HEALTH_CARE_ID | AU_MEDICARE | Medicare number used to identify individuals for healthcare services in Australia.
HEALTH_CARE_ID | MEDICAL_LICENSE | License number used to identify medical professionals.
HEALTH_CARE_ID | UK_NHS | National Health Service number used to identify individuals for healthcare services in the United Kingdom.
IN_VEHICLE_REGISTRATION | IN_VEHICLE_REGISTRATION | Vehicle registration number used to identify vehicles in India.
IN_VOTER | IN_VOTER | Voter ID number used to identify registered voters in India.
IP_ADDRESS | IP | Internet Protocol address used to identify devices on a network.
IP_ADDRESS | IP_ADDRESS | Internet Protocol address used to identify devices on a network.
LOCATION | BUILDING | Building information used to identify specific locations.
LOCATION | CITY | City information used to identify geographic locations.
LOCATION | COUNTRY | Country information used to identify geographic locations.
LOCATION | COUNTY | County information used to identify geographic locations.
LOCATION | GEOCOORD | Geographic coordinates used to identify specific locations.
LOCATION | LOCATION | Specific location or address that can be linked to an individual.
LOCATION | ADDRESS | Information used to uniquely identify a physical location.
LOCATION | SECADDRESS | Additional address information used to identify locations.
LOCATION | SECONDARYADDRESS | Additional address information used to identify locations.
LOCATION | STATE | State information used to identify geographic locations.
LOCATION | STREET | Street address used to identify specific locations.
LOCATION | ZIPCODE | Postal code used to identify specific geographic areas.
MAC_ADDRESS | MAC | Media Access Control address used to identify devices on a network.
BUSINESS_ID | AU_ACN | ACN is an Australian company identifier, not a personal national ID.
BUSINESS_ID | SG_UEN | UEN is a company/entity registration number, not a personal national ID.
NATIONAL_ID | ES_NIE | Foreigner Identification Number used to identify non-residents in Spain.
NATIONAL_ID | FI_PERSONAL_IDENTITY_CODE | Personal identity code used to identify individuals in Finland.
NATIONAL_ID | IDCARD | Identity card number used to identify individuals.
NATIONAL_ID | IN_AADHAAR | Unique identification number used to identify residents in India.
NATIONAL_ID | IT_IDENTITY_CARD | Identity card number used to identify individuals in Italy.
NATIONAL_ID | PL_PESEL | Personal Identification Number used to identify individuals in Poland.
NATIONAL_ID | SG_NRIC_FIN | National Registration Identity Card number used to identify residents in Singapore.
ORGANIZATION | COMPANYNAME | Name of a company used to identify businesses.
PASSWORD | CREDITCARDCVV | Card Verification Value used to secure credit card transactions.
PASSWORD | PASSWORD | Password used to secure access to personal accounts.
SECURITY_CODE | PIN | PINs are short numeric codes for authentication, not passwords.
PASSPORT | IN_PASSPORT | Passport number used to identify individuals in India.
PASSPORT | IT_PASSPORT | Passport number used to identify individuals in Italy.
PASSPORT | PASSPORT | Passport number used to identify individuals.
PASSPORT | US_PASSPORT | Passport number used to identify individuals in the United States.
PERSON | NAME | Name or identifier used to identify an individual.
PERSON | PERSON | Name or identifier used to identify an individual.
PHONE_NUMBER | PHONE | Number used to contact or identify an individual.
PHONE_NUMBER | PHONE_NUMBER | Number used to contact or identify an individual.
SOCIAL_SECURITY_ID | SSN | Social Security Number used to identify individuals.
SOCIAL_SECURITY_ID | UK_NINO | National Insurance Number used to identify individuals in the United Kingdom.
SOCIAL_SECURITY_ID | US_SSN | Social Security Number used to identify individuals in the United States.
BUSINESS_TAX_ID | AU_ABN | ABN is used for tax and business registration, specific to organizations.
BUSINESS_TAX_ID | IT_VAT_CODE | VAT codes are business tax identifiers, not personal tax IDs.
TAX_ID | AU_TFN | Tax File Number used to identify taxpayers in Australia.
TAX_ID | ES_NIF | Tax Identification Number used to identify taxpayers in Spain.
TAX_ID | IN_PAN | Permanent Account Number used to identify taxpayers in India.
TAX_ID | IT_FISCAL_CODE | Fiscal code used to identify taxpayers in Italy.
TAX_ID | US_ITIN | Individual Taxpayer Identification Number used to identify taxpayers in the United States.
TITLE | TITLE | Title or honorific used to identify individuals.
URL | URL | Web address that can sometimes contain personal information.
USER_NAME | USERNAME | Username used to identify individuals in online systems.
KR_RRN | KR_RRN | The Korean Resident Registration Number (RRN) is a 13-digit number issued to all Korean residents.
IN_GSTIN | IN_GSTIN | The Indian Goods and Services Tax Identification Number (GSTIN) is a 15-character identifier with state code (01-37), PAN, registration number, ‘Z’, and checksum.
DATE_OF_BIRTH | DOB | Date of Birth. Standard personal-identification detail that specifies the exact day, month, and year a person was born.
TH_TNIN | TH_TNIN | The Thai National ID Number (TNIN) is a unique 13-digit number issued to all Thai residents.
IP_ADDRESS | IPV4 | Internet Protocol address that identifies a device on a network and provides its location, enabling proper routing of data.
IP_ADDRESS | IPV6 | Internet Protocol address that identifies a device on a network and provides its location, enabling proper routing of data.

2 - Transform

Identify, classify, and transform sensitive data.

2.1 - Label Text API

Identify and classify sensitive data in plain text, then replace each sensitive value with the label of its classified data type, such as [CREDIT_CARD].

Method

POST

URL

http://{Host Address}/pty/data-discovery/v2/transform/label

Query Parameters

score_threshold

  • Type: float
  • Description: Optional. Label results where the score is greater than this threshold.
  • Values: Minimum 0, Maximum 1.0
  • Default: 0.7

include_providers

  • Type: binary
  • Description: Optional. Include details of the service providers in the response.
  • Values: Yes / No
  • Default: No

include_classification_details

  • Type: binary
  • Description: Optional. Include classification details in the response.
  • Values: Yes / No
  • Default: No

Body

  • The content type must be text/plain, and the body must be encoded as UTF-8.

  • The body size is limited to 10K bytes.
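
Because the limit applies to bytes and not characters, a client may want to check the encoded size before sending a request. The following is a minimal sketch of such a check. The helper name `encode_body` is hypothetical, and the default of 10,000 bytes is an assumption: it is the more conservative reading of "10K", which could also mean 10,240 bytes.

```python
# Assumption: "10K bytes" is read conservatively as 10,000 bytes.
MAX_BODY_BYTES = 10_000


def encode_body(text: str, limit: int = MAX_BODY_BYTES) -> bytes:
    """Encode text as UTF-8 and fail early if it exceeds the size limit."""
    body = text.encode("utf-8")
    # Compare the encoded length: non-ASCII characters take more than one byte.
    if len(body) > limit:
        raise ValueError(f"body is {len(body)} bytes; the limit is {limit}")
    return body
```

The returned bytes can be passed as the request body. Text that contains many non-ASCII characters can exceed the limit even when its character count is well below it.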

Sample Request

curl -X POST "http://<Host_address>/pty/data-discovery/v2/transform/label?score_threshold=0.85" \
          -H "Content-Type: text/plain" \
          --data "Jake lives at 15 Main st, Hamden 06517, Connecticut."
import requests

url = "http://<Host_address>/pty/data-discovery/v2/transform/label"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/plain"}
# Encode explicitly so that the body is sent as UTF-8.
data = "Jake lives at 15 Main st, Hamden 06517, Connecticut.".encode("utf-8")

# verify=False disables TLS certificate checks; it only matters for https URLs.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
print("Response JSON:", response.json())
URL: POST `http://<Host_address>/pty/data-discovery/v2/transform/label`
   Query Parameters:
   -score_threshold (optional), float between 0.0 and 1.0, default: 0.7
   Headers:
   -Content-Type: text/plain
   Body:
   -Jake lives at 15 Main st, Hamden 06517, Connecticut.

Sample Responses

{
    "transform": {
        "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
    },
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "...",
            "status": 200,
            "elapsed_time": 0.011328935623168945,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "...",
            "status": 200,
            "elapsed_time": 0.03895401954650879,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "LOCATION": [
            {
                "score": 0.85,
                "location": {
                    "start_index": 17,
                    "end_index": 24
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9240000128746033,
                "location": {
                    "start_index": 26,
                    "end_index": 32
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9980000257492065,
                        "original_entity": "CITY",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9244499981403351,
                "location": {
                    "start_index": 40,
                    "end_index": 51
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9988999962806702,
                        "original_entity": "STATE",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9958999752998352,
                "location": {
                    "start_index": 14,
                    "end_index": 16
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9958999752998352,
                        "original_entity": "BUILDING",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9983999729156494,
                "location": {
                    "start_index": 33,
                    "end_index": 38
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9983999729156494,
                        "original_entity": "ZIPCODE",
                        "details": {}
                    }
                ]
            }
        ],
        "PERSON": [
            {
                "score": 0.8819000124931335,
                "location": {
                    "start_index": 0,
                    "end_index": 4
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.8819000124931335,
                        "original_entity": "NAME",
                        "details": {}
                    }
                ]
            }
        ]
    }
}
The fields for the transform section are described as follows:

| Name | Example Response | Description |
| --- | --- | --- |
| transform.text | [PERSON] lives at [LOCATION].. | The labeled input text, with each classified entity replaced by the name of its entity type in place of the original sensitive data. |

The fields for the providers section are described as follows:

| Name | Example Response | Description |
| --- | --- | --- |
| providers | Array | Array of provider objects that participated in the request, including their respective success or failure codes. |
| providers[n].name | Pattern Classification Provider | Product name of the provider. |
| providers[n].version | 2.0.0 | Version of the provider. |
| providers[n].status | 200 | HTTP response code returned by the provider. |
| providers[n].elapsed_time | 0.028 | Time, in seconds, taken by the provider to process the request. |
| providers[n].config_provider | Object | Object containing configuration details for each provider. |
| providers[n].config_provider.name | Pattern | Internal name of the provider. |
| providers[n].config_provider.address | http://pattern_provider_service:8051 | Network address or endpoint of the provider. |
| providers[n].config_provider.supported_content_types | [] | Array of supported content types. An empty array indicates support for all content types. |

The fields for the classifications section are described as follows:

| Name | Example Response | Description |
| --- | --- | --- |
| classifications | Dictionary | A dictionary mapping entity types (e.g., "PERSON", "PHONE_NUMBER") to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details. |
| classifications['entity'][n].score | 0.9238 | The confidence score for the detected entity, aggregated from all contributing classifiers. |
| classifications['entity'][n].location | Object | An object specifying the location of the entity within the input text. |
| classifications['entity'][n].location.start_index | 14 | The starting index of the entity in the input text. |
| classifications['entity'][n].location.end_index | 25 | The ending index of the entity in the input text. |
| classifications['entity'][n].classifiers | Array | An array of classifier objects that contributed to the entity detection. |
| classifications['entity'][n].classifiers[m].provider_index | 0 | The index of the provider in the top-level providers array. |
| classifications['entity'][n].classifiers[m].name | SpacyRecognizer | The name of the classifier. A provider may have multiple classifiers. |
| classifications['entity'][n].classifiers[m].score | 0.85 | The score assigned by the classifier for the entity detection. |
| classifications['entity'][n].classifiers[m].original_entity | PERSON | The original entity type detected by the classifier. See Harmonization for details. |
| classifications['entity'][n].classifiers[m].details | Object | Optional. Additional key-value details provided by the classifier. |
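
To illustrate how the location fields relate to the input text, the following sketch recovers the matched text for each occurrence. The function name `extract_entities` is hypothetical, and the data is a trimmed subset of the sample response above. It assumes that `end_index` is exclusive, which the sample response suggests (for example, indices 0 to 4 cover "Jake") but the field descriptions do not state.

```python
from typing import Dict, List, Tuple


def extract_entities(text: str, classifications: Dict[str, list]) -> List[Tuple[str, str, float]]:
    """Return (entity type, matched text, score) tuples ordered by position."""
    found = []
    for entity, occurrences in classifications.items():
        for occurrence in occurrences:
            start = occurrence["location"]["start_index"]
            end = occurrence["location"]["end_index"]
            # Assumption: end_index is exclusive, as the sample response suggests.
            found.append((start, entity, text[start:end], occurrence["score"]))
    return [(entity, value, score) for _, entity, value, score in sorted(found)]


# Trimmed subset of the sample response for the sample request.
text = "Jake lives at 15 Main st, Hamden 06517, Connecticut."
classifications = {
    "LOCATION": [
        {"score": 0.85, "location": {"start_index": 17, "end_index": 24}},
        {"score": 0.9983999729156494, "location": {"start_index": 33, "end_index": 38}},
    ],
    "PERSON": [
        {"score": 0.8819000124931335, "location": {"start_index": 0, "end_index": 4}},
    ],
}

for entity, value, score in extract_entities(text, classifications):
    print(f"{entity:<10} {value!r:<12} {score:.4f}")
```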

Response Codes

| Response Code | Description |
| --- | --- |
| 200 | Successful Response. |
| 206 | Partial Content. Only some providers classified data successfully. |
| 400 | Bad Request. Invalid input parameters or content. |
| 413 | Payload too large. |
| 415 | Unsupported media type. |
| 422 | Untrusted input. For more information, refer to Input Validation. |
| 502 | Bad Gateway. All upstream providers failed; no successful data aggregation possible. |
| 598 | Unexpected internal server error. Check server logs. |
| 599 | Internal server error. Check server logs. |
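
A client will typically branch on these codes, in particular to tell a complete result (200) from a partial one (206). The following sketch shows one possible grouping. The function name `describe_status` and the category names are illustrative assumptions, not part of the API.

```python
def describe_status(code: int) -> str:
    """Group a response code from the table above into a coarse category."""
    if code == 200:
        return "success"
    if code == 206:
        # Only some providers succeeded, so the result may be incomplete.
        return "partial"
    if 400 <= code < 500:
        # 400, 413, 415, 422: the request itself needs to change.
        return "client_error"
    if code >= 500:
        # 502, 598, 599: a provider or server failure; check the server logs.
        return "server_error"
    return "unexpected"
```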

2.1.1 - Handling Overlapping Conflicts

Resolving conflicts between entities that label sensitive data.

While classifying data, the providers may label the same text, or overlapping text, as two different entities. These conflicts arise because the classifiers use different detection strategies. Data Discovery resolves them by applying a fixed set of rules to the conflicting entities.

The rules for handling the conflicting entities are as follows:

  • No overlap: If the two entities do not overlap, both results are retained in their original form.

    For example, Jake Filbert lives in Connecticut. If only Jake Filbert is identified, the result will be labeled as [NAME] lives in Connecticut.

  • Full overlap: If both entities cover exactly the same text, the following logic is applied:

    • Select the entity with the higher confidence score.
    • If both entities have the same confidence score, select the first entity.

    For example, Jake Filbert lives in Connecticut. Here, the name is recognized as [USER] with a score 0.7 and [NAME] with a score 0.9. As [NAME] has a higher score, the result will be labeled as [NAME] lives in Connecticut.

  • One entity contained in other: If one entity is completely contained in the other, select the entity with the longer text.

    For example, jake@email.com. Here, the classifiers may recognize the text as [NAME] and [EMAIL]. As [EMAIL] is the longer text, the result will be labeled as [EMAIL].

  • Partial intersection: If the two entities overlap partially, the result is a combination of both.

    For example, 092-33445. Here, the classifiers may recognize the text as [PHONE_NUMBER] and [SSN]. The result will be labeled as [PHONE_NUMBER&SSN].
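
The four rules above can be sketched for a single pair of entities as follows. This is an illustration under stated assumptions, not the service's implementation: the `Span` type and `resolve` function are hypothetical, `end` is treated as exclusive, and for a partial intersection the combined entity is assumed to cover the union of both spans, to join the labels in order of start position, and to carry the higher of the two scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    label: str
    start: int   # inclusive
    end: int     # exclusive (assumption)
    score: float


def resolve(a: Span, b: Span) -> List[Span]:
    """Apply the four conflict rules to a pair of detected entities."""
    # No overlap: keep both results unchanged.
    if a.end <= b.start or b.end <= a.start:
        return sorted([a, b], key=lambda s: s.start)
    # Full overlap: the higher score wins; on a tie, the first entity wins.
    if (a.start, a.end) == (b.start, b.end):
        return [b if b.score > a.score else a]
    # One entity contained in the other: keep the longer one.
    if a.start <= b.start and b.end <= a.end:
        return [a]
    if b.start <= a.start and a.end <= b.end:
        return [b]
    # Partial intersection: combine both (extent, label order, and score are assumptions).
    first, second = sorted([a, b], key=lambda s: s.start)
    return [Span(f"{first.label}&{second.label}", first.start,
                 max(a.end, b.end), max(a.score, b.score))]


# "jake@email.com": [NAME] covers "jake", [EMAIL] covers the whole text.
print(resolve(Span("NAME", 0, 4, 0.8), Span("EMAIL", 0, 14, 0.9))[0].label)
```

The sketch handles one pair at a time. How the service orders the pairwise comparisons when more than two entities conflict is not described here.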

2.1.2 - Sample Response Default

Sample response returned with the default query parameters.

{
    "transform": {
        "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
    }
}

The fields are described as follows:

| Name | Example Response | Description |
| --- | --- | --- |
| transform.text | [PERSON] lives at [LOCATION].. | The labeled input text, with each classified entity replaced by the name of its entity type in place of the original sensitive data. |

3 - Common APIs

Standard operational endpoints available on the service.

These endpoints provide operational capabilities such as retrieving the API specification, managing log levels, checking version information, and monitoring service health.

3.1 - API Specification

Returns the OpenAPI specification for the Data Discovery API.

Method

GET

URL

http://{Host Address}/doc

Query Parameters

None

Sample Request

curl -X GET "http://<Host_address>/doc"
     
import requests

url = "http://<Host_address>/doc"
response = requests.get(url, verify=False)

print("Status code:", response.status_code)
print("Response YAML:", response.text)

URL: GET `http://<Host_address>/doc`

Sample Response

Returns the OpenAPI specification in YAML format. The following shows a partial example:

openapi: 3.0.3
info:
  title: Protegrity Classification Service API
  version: v2
servers:
- url: /pty/data-discovery/v2
components:
  schemas:
    TextAggregatedResponse:
      allOf:
      # ... (abbreviated)
paths:
  /classify/text:
    post:
      summary: Classify free-form text input.
      tags: [Classify]
  /classify/tabular:
    post:
      summary: Classify tabular CSV input.
      tags: [Classify]
  /version:
    get:
      summary: Returns runtime version information.
      tags: [Common]
  /log:
    get:
      summary: Get current runtime log level.
      tags: [Common]
  # ... (full specification continues)

Response Codes

| Code | Description |
| --- | --- |
| 200 | The OpenAPI specification is returned in YAML format. |

3.2 - Health Probes

Kubernetes-style health probe endpoints for monitoring the state of the service.

The following are the health probe endpoints that can be used on platforms such as Kubernetes.

| Endpoint | Purpose |
| --- | --- |
| Liveness (/live) | Indicates that the service can handle HTTP requests. |
| Readiness (/ready) | Indicates that the service is initialized and ready to serve requests. |
| Health (/health) | Indicates that the service is running and all components are functioning properly. |

3.2.1 - Liveness Probe

Indicates that the service is running and can handle HTTP requests.

Method

GET

URL

http://{Host Address}/live

Used by Kubernetes as a liveness probe.

Query Parameters

None

Sample Request

curl -X GET "http://<Host_address>/live"
     
import requests

url = "http://<Host_address>/live"
response = requests.get(url, verify=False)

# The probe returns no body; the status code carries the result.
print("Status code:", response.status_code)

URL: GET `http://<Host_address>/live`

Response Codes

| Code | Description |
| --- | --- |
| 204 | Service can handle requests. |

3.2.2 - Readiness Probe

Indicates the service is initialized and ready to serve requests.

Method

GET

URL

http://{Host Address}/ready

This is used by Kubernetes as a readiness probe.

Query Parameters

None

Sample Request

curl -X GET "http://<Host_address>/ready"
     
import requests

url = "http://<Host_address>/ready"
response = requests.get(url, verify=False)

# 204 means ready; 503 means the service is still initializing.
print("Status code:", response.status_code)

URL: GET `http://<Host_address>/ready`

Response Codes

| Code | Description |
| --- | --- |
| 204 | Service is fully initialized and can handle requests. |
| 503 | Service is not yet ready to serve requests. |
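
Outside Kubernetes, a script that starts the service may need to wait for readiness itself. The following sketch polls until the probe returns 204. The function name `wait_until_ready`, the retry count, and the delay are hypothetical choices. The probe is passed in as a callable so that the polling logic does not depend on a particular HTTP client.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], int], attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll a readiness probe until it returns 204 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 204:
            return True
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Possible usage with requests (requires a reachable host):
# import requests
# ready = wait_until_ready(lambda: requests.get("http://<Host_address>/ready").status_code)
```

If the service is not yet accepting connections, the HTTP client raises an exception instead of returning 503, so a real probe callable would also need to catch connection errors.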

3.2.3 - Health Check

Indicates that the service is running and all components are functioning correctly.

Method

GET

URL

http://{Host Address}/health

Returns service health status including individual component-level checks.

Query Parameters

None

Sample Request

curl -X GET "http://<Host_address>/health"
     
import requests

url = "http://<Host_address>/health"
response = requests.get(url, verify=False)

print("Status code:", response.status_code)
# Fall back to the raw text if the body is not valid JSON.
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response Text:", response.text)

URL: GET `http://<Host_address>/health`

Sample Response

{
    "isHealthy": true,
    "checks": [
        {
            "isHealthy": true,
            "output": {
                "isHealthy": true,
                "checks": [
                    {
                        "passed": true,
                        "output": "Pattern Classifier found",
                        "componentType": "engine",
                        "componentName": "Pattern Classifier"
                    },
                    {
                        "passed": true,
                        "output": "Pattern Classifier engine initialized",
                        "componentType": "engine",
                        "componentName": "Pattern Classifier"
                    },
                    {
                        "passed": true,
                        "output": "Dummy classification is responsive",
                        "componentType": "engine",
                        "componentName": "Pattern Classifier"
                    }
                ]
            },
            "componentType": "classification-provider",
            "componentName": "Pattern"
        },
        {
            "isHealthy": true,
            "output": {
                "isHealthy": true,
                "checks": [
                    {
                        "passed": true,
                        "output": "PII Classifier model initialized",
                        "componentType": "model",
                        "componentName": "PII Classifier"
                    },
                    {
                        "passed": true,
                        "output": "Dummy classification is responsive",
                        "componentType": "engine",
                        "componentName": "Context Classifier"
                    }
                ]
            },
            "componentType": "classification-provider",
            "componentName": "Context"
        }
    ]
}

Response Fields Description

| Name | Type | Description |
| --- | --- | --- |
| isHealthy | boolean | true if all components are functioning properly. |
| checks | array | List of component health checks. |
| checks[].isHealthy | boolean | true if this component is healthy. |
| checks[].componentType | string | Type of the component (e.g., classification-provider). |
| checks[].componentName | string | Name of the component (e.g., Pattern). |
| checks[].output | object | Detailed output for this component’s checks. |
| checks[].output.isHealthy | boolean | true if all of this component’s internal checks passed. |
| checks[].output.checks | array | List of individual sub-checks for this component. |
| checks[].output.checks[].passed | boolean | true if this sub-check passed. |
| checks[].output.checks[].output | string | Description of the sub-check result. |
| checks[].output.checks[].componentType | string | Type of the element checked. |
| checks[].output.checks[].componentName | string | Name of the element checked. |
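
When the service is unhealthy, the nested structure makes it possible to pinpoint the failing sub-check. The following sketch walks the response and collects every sub-check whose `passed` field is not true. The function name `failed_checks` is hypothetical, the failure message in the example is invented for illustration, and the sketch assumes that an unhealthy response keeps the same shape as the healthy sample.

```python
from typing import Any, Dict, List


def failed_checks(health: Dict[str, Any]) -> List[str]:
    """Describe every sub-check that did not pass, grouped by component name."""
    failures = []
    for component in health.get("checks", []):
        for check in component.get("output", {}).get("checks", []):
            if not check.get("passed", False):
                failures.append(
                    f'{component["componentName"]} / {check["componentName"]}: {check["output"]}'
                )
    return failures


# Hypothetical unhealthy response; the "output" message is made up for this example.
example = {
    "isHealthy": False,
    "checks": [
        {
            "isHealthy": False,
            "output": {
                "isHealthy": False,
                "checks": [
                    {"passed": True, "output": "Pattern Classifier found",
                     "componentType": "engine", "componentName": "Pattern Classifier"},
                    {"passed": False, "output": "engine not initialized",
                     "componentType": "engine", "componentName": "Pattern Classifier"},
                ],
            },
            "componentType": "classification-provider",
            "componentName": "Pattern",
        }
    ],
}
print(failed_checks(example))
```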

Response Codes

| Code | Description |
| --- | --- |
| 200 | Service is running normally. |
| 503 | Service is unhealthy. Its components may be initializing or may need a restart. |

3.3 - Log Level API

Retrieve or update the runtime log level.

3.3.1 - Log Level API

Retrieve the runtime log level.

Method

GET

URL

http://{Host Address}/log

Returns the current runtime logging level.

Query Parameters

None

Sample Request

curl -X GET "http://<Host_address>/log"
     
import requests

url = "http://<Host_address>/log"
response = requests.get(url, verify=False)

print("Status code:", response.status_code)
# Fall back to the raw text if the body is not valid JSON.
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response Text:", response.text)
    
URL: GET `http://<Host_address>/log`

Sample Response

{
  "level": "info"
}

Response Fields Description

| Name | Description |
| --- | --- |
| level | The current log level. Possible values: debug, info, warn. |

Response Codes

| Code | Description |
| --- | --- |
| 200 | Log level information retrieved successfully. |

3.3.2 - Log Level API

Update the runtime log level.

Method

POST

URL

http://{Host Address}/log

Updates the runtime logging level.

Request Body

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| level | string | Yes | The log level to set. Possible values: debug, info, warn. |

Sample Request

curl -X POST "http://<Host_address>/log" \
       -H "Content-Type: application/json" \
       -d '{"level": "debug"}'
     
import requests

url = "http://<Host_address>/log"
payload = {"level": "debug"}
headers = {"Content-Type": "application/json"}
response = requests.post(url, json=payload, headers=headers, verify=False)

print("Status code:", response.status_code)
# Fall back to the raw text if the body is not valid JSON.
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response Text:", response.text)

URL: POST `http://<Host_address>/log`
   Body (JSON): {
       "level": "debug"
   }

Sample Response

{
    "level": "debug"
}

Response Fields Description

| Name | Description |
| --- | --- |
| level | The updated log level. |

Response Codes

| Code | Description |
| --- | --- |
| 200 | Log level updated successfully. |
| 500 | An error occurred (e.g., invalid log level specified). |

Note: The service currently returns 500 for invalid log level values, whereas the OpenAPI spec defines 400 for this case. This is a known discrepancy that will be addressed in a future release.
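
Because an invalid level currently surfaces as a 500, a client may prefer to validate the value before sending it. The following sketch builds the request body only for the documented values. The helper name `log_level_payload` is hypothetical, and trimming and lowercasing the input is an assumption about what callers would want, not documented service behavior.

```python
import json

# Values listed in the request body table.
ALLOWED_LEVELS = ("debug", "info", "warn")


def log_level_payload(level: str) -> str:
    """Return the JSON request body for a valid log level, or raise ValueError."""
    # Assumption: normalizing case and whitespace is acceptable on the client side.
    normalized = level.strip().lower()
    if normalized not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level {level!r}; expected one of {ALLOWED_LEVELS}")
    return json.dumps({"level": normalized})


print(log_level_payload("debug"))
```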

3.4 - Version API

View runtime version information.

Method

GET

URL

http://{Host Address}/version

Query Parameters

None

Sample Request

curl -X GET "http://<Host_address>/version"
     
import requests

url = "http://<Host_address>/version"
response = requests.get(url, verify=False)

print("Status code:", response.status_code)
# Fall back to the raw text if the body is not valid JSON.
try:
    print("Response JSON:", response.json())
except ValueError:
    print("Response Text:", response.text)

URL: GET `http://<Host_address>/version`

Sample Response

{
  "version": "2.0.0",
  "buildVersion": "2.0.0.374.8047721c"
}

Response Fields Description

| Name | Description |
| --- | --- |
| version | Semantic version of Data Discovery in the MAJOR.MINOR.PATCH format. |
| buildVersion | Full build version string in the MAJOR.MINOR.PATCH.BUILD.COMMITHASH format. |
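
The buildVersion format can be split into its parts when a script needs to compare builds. The following is a minimal sketch. The `BuildVersion` type and `parse_build_version` function are hypothetical, and the sketch assumes that the first four parts are always numeric, as they are in the sample response.

```python
from typing import NamedTuple


class BuildVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    build: int
    commit: str


def parse_build_version(value: str) -> BuildVersion:
    """Split a MAJOR.MINOR.PATCH.BUILD.COMMITHASH string into typed parts."""
    parts = value.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated parts, got {len(parts)}: {value!r}")
    # Assumption: the first four parts are integers, as in the sample response.
    major, minor, patch, build = (int(part) for part in parts[:4])
    return BuildVersion(major, minor, patch, build, parts[4])


print(parse_build_version("2.0.0.374.8047721c"))
```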

Response Codes

| Code | Description |
| --- | --- |
| 200 | Version information retrieved successfully. |
