APIs

APIs and supporting information.

1: Classify

1.1: Classify Text API

1.2: Classify CSV API

2: Transform

2.1: Label Text API

2.1.1: Handling Overlapping Conflicts
2.1.2: Sample Response Default
2.1.3: Sample Response with Detail

3: Harmonizing Provider Outputs
4: Input Validation

1 - Classify

Identify, classify and locate sensitive data.

1.1 - Classify Text API

Classify plain text unstructured data.

POST https://{Host Address}/pty/data-discovery/v1.1/classify

Query Parameters

score_threshold

Type: float
Description: Optional. Exclude results with a score lower than this threshold.
Values: Minimum 0, Maximum 1.0
Default: 0.00

Body

Content type must be text/plain, encoded in UTF-8.
Body size is limited to 10K Bytes.
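
A client can check these limits before sending a request. The sketch below assumes that "10K Bytes" means 10,000 bytes of UTF-8 encoded data; the exact server-side limit is not stated here, and `MAX_BODY_BYTES` and `check_body` are hypothetical names used only for illustration.

```python
MAX_BODY_BYTES = 10_000  # assumption: "10K Bytes" read as 10,000 bytes


def check_body(text: str) -> bytes:
    """Encode text as UTF-8 and reject bodies over the assumed size limit."""
    body = text.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"body is {len(body)} bytes; limit is {MAX_BODY_BYTES}")
    return body


body = check_body("You can reach Dave Elliot by phone 203-555-1286")
print(len(body))
```

The size is measured after encoding because multi-byte characters make the byte count larger than the character count.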

Sample Request

curl -X POST "https://<SERVER_IP>/pty/data-discovery/v1.1/classify?score_threshold=0.85" \
          -H "Content-Type: text/plain" \
          --data "You can reach Dave Elliot by phone 203-555-1286"

import requests

url = "https://<SERVER_IP>/pty/data-discovery/v1.1/classify"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/plain"}
data = "You can reach Dave Elliot by phone 203-555-1286"

# verify=False skips TLS certificate verification; use it only in test environments.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
print("Response JSON:", response.json())

URL: POST `https://<SERVER_IP>/pty/data-discovery/v1.1/classify`
   Query Parameters:
   -score_threshold (optional), float between 0.0 and 1.0, default: 0.
   Headers:
   -Content-Type: text/plain
   Body:
   -You can reach Dave Elliot by phone 203-555-1286

Sample Response

{
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.028261899948120117,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.040960073471069336,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "PERSON": [
            {
                "score": 0.9238499879837037,
                "location": {
                    "start_index": 14,
                    "end_index": 25
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "PERSON",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9976999759674072,
                        "original_entity": "NAME",
                        "details": {}
                    }
                ]
            }
        ],
        "PHONE_NUMBER": [
            {
                "score": 0.9995999932289124,
                "location": {
                    "start_index": 35,
                    "end_index": 47
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9995999932289124,
                        "original_entity": "PHONE",
                        "details": {}
                    }
                ]
            }
        ]
    }
}

Response Fields Description

Providers Section

Name	Example Response	Description
providers	Array	Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name	Pattern Classification Provider	Product name of the provider.
providers[n].version	1.1.0	Version of the provider.
providers[n].status	200	HTTP response code returned by the provider.
providers[n].elapsed_time	0.028	Time, in seconds, taken by the provider to process the request.
providers[n].config_provider	Object	Object containing configuration details for each provider.
providers[n].config_provider.name	Pattern	Internal name of the provider.
providers[n].config_provider.address	http://pattern_provider_service:8051	Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types	[]	Array of supported content types. An empty array indicates support for all content types.

Classifications Section

Name	Example Response	Description
classifications	Dictionary	A dictionary mapping entity types (e.g., “PERSON”, “PHONE_NUMBER”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details.
classifications['entity'][n].score	0.9238	The confidence score for the detected entity, aggregated from all contributing classifiers.
classifications['entity'][n].location	Object	An object specifying the location of the entity within the input text.
classifications['entity'][n].location.start_index	14	The starting index of the entity in the input text.
classifications['entity'][n].location.end_index	25	The ending index of the entity in the input text.
classifications['entity'][n].classifiers	Array	An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index	0	The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name	SpacyRecognizer	The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score	0.85	The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].original_entity	PERSON	The original entity type detected by the classifier. See Harmonization for details.
classifications['entity'][n].classifiers[m].details	Object	Optional. Additional key-value details provided by the classifier.
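
To show how the location fields map back to the input, the following sketch slices the sample request text with the indices from the sample response. In that sample, `end_index` behaves as an exclusive bound (14 to 25 yields "Dave Elliot"). Whether indices count characters or bytes for non-ASCII input is not stated here, so treat that as an open assumption. The helper name `extract_entities` is hypothetical, and the `sample` dictionary is trimmed to the fields the helper reads.

```python
def extract_entities(text, response):
    """Return (entity, matched_text, score) tuples ordered by start index."""
    found = []
    for entity, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            loc = occ["location"]
            # In the sample response, end_index acts as an exclusive bound.
            matched = text[loc["start_index"]:loc["end_index"]]
            found.append((loc["start_index"], entity, matched, occ["score"]))
    return [(e, m, s) for _, e, m, s in sorted(found)]


text = "You can reach Dave Elliot by phone 203-555-1286"
sample = {
    "classifications": {
        "PERSON": [{"score": 0.9238499879837037,
                    "location": {"start_index": 14, "end_index": 25}}],
        "PHONE_NUMBER": [{"score": 0.9995999932289124,
                          "location": {"start_index": 35, "end_index": 47}}],
    }
}
for entity, matched, score in extract_entities(text, sample):
    print(f"{entity}: {matched!r} ({score:.4f})")
```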

Response Codes

Response Code	Description
200	Successful Response.
206	Partial Content. Only some providers classified data successfully.
400	Bad Request. Invalid input parameters or content.
413	Payload too large.
415	Unsupported media type.
422	Untrusted input. For more information, refer to Input Validation.
502	Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598	Unexpected internal server error. Check server logs.
599	Internal server error. Check server logs.
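
When a request returns 206, the per-provider `status` field can indicate which providers failed. The sketch below assumes that a 206 body has the same shape as the 200 sample response, which is not shown in this section; the failing status value and the helper name `failed_providers` are invented for illustration.

```python
def failed_providers(response):
    """List (name, status) for providers whose status is not 200."""
    return [
        (p["name"], p["status"])
        for p in response.get("providers", [])
        if p["status"] != 200
    ]


# Hypothetical 206 body: the Context provider status is made up for this example.
partial = {
    "providers": [
        {"name": "Pattern Classification Provider", "status": 200},
        {"name": "Context Classification Provider", "status": 500},
    ],
    "classifications": {},
}
print(failed_providers(partial))
```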

1.2 - Classify CSV API

Classify structured CSV data.

POST https://{Host Address}/pty/data-discovery/v1.1/classify

Query Parameters

score_threshold

Type: float
Description: Optional. Exclude results with a score lower than this threshold.
Values: Minimum 0, Maximum 1.0
Default: 0.00

has_headers

Type: boolean
Description: Optional. Indicates whether the first row represents the column header.
Values: true/false
Default: true

column_delimiter

Type: char
Description: Optional. Delimiter to separate the columns.
Values: , |
Default: ,

quote_char

Type: char
Description: Optional. Character used to quote fields that contain special characters, such as the column_delimiter or new-line characters.
Values: ""

Body

Content type should be text/csv and in UTF-8 format.
Body size is limited to 10K Bytes.
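
As an illustration of how the CSV parameters relate to the body, the sketch below writes a small pipe-delimited body with Python's `csv` module and builds a matching query string. The rows are hypothetical data, and passing the boolean as the literal string `true` follows the "true/false" values listed above; treat both as assumptions rather than confirmed behavior.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical rows; the second field of the data row contains the delimiter.
rows = [
    ["Name", "Note"],
    ["Dave Elliot", "prefers phone | email"],
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", quotechar='"', lineterminator="\n")
writer.writerows(rows)  # fields containing the delimiter are quoted automatically
body = buf.getvalue().encode("utf-8")

# Query parameters describing how the body was written.
query = urlencode({
    "score_threshold": 0.85,
    "has_headers": "true",
    "column_delimiter": "|",
    "quote_char": '"',
})
print(body.decode("utf-8"))
print(query)
```

The delimiter and quote character are percent-encoded in the query string (`%7C` and `%22`), which avoids shell and URL parsing problems.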

Sample Request

curl -X POST "https://<SERVER_IP>/pty/data-discovery/v1.1/classify?score_threshold=0.85" \
     --header 'Content-Type: text/csv' \
     --data-raw 'Social Security Number,Credit Card Number,IBAN,Phone Number
589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371'

import requests

url = "https://<SERVER_IP>/pty/data-discovery/v1.1/classify"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/csv"}
# Rows start at column 0 so that no leading whitespace ends up in the CSV fields.
data = """Social Security Number,Credit Card Number,IBAN,Phone Number
589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371
"""

# verify=False skips TLS certificate verification; use it only in test environments.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
try:
    print("Response JSON:", response.json())
except ValueError:
    # The body was not valid JSON; print it as received.
    print("Response Text:", response.text)

URL: POST `https://<SERVER_IP>/pty/data-discovery/v1.1/classify`
      Query Parameters:
      -score_threshold (optional), float between 0.0 and 1.0, default: 0.
      -has_headers (optional), Indicates whether the first row represents the column header.
      -column_delimiter (optional), Delimiter to separate the columns.
      -quote_char (optional), Character to quote fields containing special characters, such as, the column_delimiter or new-line characters.
      Headers:
      -Content-Type: text/csv
      Body:
      -Social Security Number,Credit Card Number,IBAN,Phone Number
     589-25-1068,349384370543801,FR43 9255 4858 47BG 3EBG U4OK O18,(483) 9440301
     636-36-3077,4041594844904,AL50 8947 4215 KAEY GAPM NLYC FNZG,(113) 5143119
     748-82-2375,3558175715821800,AT34 4082 9269 0841 5702,(763) 5136237
     516-62-9861,560221027976015000,FR22 0068 7181 11FB UG8H ECEM 306,(726) 6031636
     121-49-9409,374283320982549,DK37 5687 8459 8060 79,(624) 9205200
     838-73-3299,5558216060144900,CR54 8952 8144 6403 4765 0,(356) 9479541
     439-11-5310,5048376143641900,RS76 6213 4824 0184 8983 74,(544) 5623326
     564-06-8466,3543299511845640,EE51 6882 3443 7863 4703,(702) 6093849
     518-54-5443,3543019452249540,IT65 D000 3874 2801 Z15I LNLL OOX,(584) 8618371

Sample Response

{
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.31273603439331055,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 1.1383004188537598,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "SOCIAL_SECURITY_ID": [
            {
                "score": 0.9994888835483127,
                "rows_processed": 9,
                "location": {
                    "column_name": "Social Security Number",
                    "column_index": 0
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9994888835483127,
                        "details": {}
                    }
                ]
            }
        ],
        "CREDIT_CARD": [
            {
                "score": 0.9986333317226834,
                "rows_processed": 9,
                "location": {
                    "column_name": "Credit Card Number",
                    "column_index": 1
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9986333317226834,
                        "details": {}
                    }
                ]
            }
        ],
        "BANK_ACCOUNT": [
            {
                "score": 0.7901234567901234,
                "rows_processed": 9,
                "location": {
                    "column_name": "IBAN",
                    "column_index": 2
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "IbanRecognizer",
                        "rows_with_classification": 8,
                        "total_classifications": 8,
                        "score": 0.8888888888888888,
                        "details": {}
                    }
                ]
            }
        ],
        "PHONE_NUMBER": [
            {
                "score": 0.9961333341068692,
                "rows_processed": 9,
                "location": {
                    "column_name": "Phone Number",
                    "column_index": 3
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "rows_with_classification": 9,
                        "total_classifications": 9,
                        "score": 0.9961333341068692,
                        "details": {}
                    }
                ]
            }
        ]
    }
}

Response Fields Description

Providers Section

Name	Example Response	Description
providers	Array	Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name	Pattern Classification Provider	Product name of the provider.
providers[n].version	1.1.0	Version of the provider.
providers[n].status	200	HTTP response code returned by the provider.
providers[n].elapsed_time	0.028	Time, in seconds, taken by the provider to process the request.
providers[n].config_provider	Object	Object containing configuration details for each provider.
providers[n].config_provider.name	Pattern	Internal name of the provider.
providers[n].config_provider.address	http://pattern_provider_service:8051	Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types	[]	Array of supported content types. An empty array indicates support for all content types.

Classifications Section

Name	Example Response	Description
classifications	Dictionary	A dictionary mapping entity types (e.g., “SOCIAL_SECURITY_ID”, “CREDIT_CARD”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location, classifier, and row details.
classifications['entity'][n].score	0.9995	The confidence score for the detected entity, aggregated and calculated from all contributing classifiers and their reported scores.
classifications['entity'][n].rows_processed	9	The number of rows passed to and processed by the classification request.
classifications['entity'][n].location	Object	An object specifying the location of the entity within the CSV data.
classifications['entity'][n].location.column_name	Social Security Number	The name of the column in which the entity was detected.
classifications['entity'][n].location.column_index	0	The index of the column in which the entity was detected.
classifications['entity'][n].classifiers	Array	An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index	1	The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name	context	The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score	0.9995	The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].rows_with_classification	9	The number of rows in which the entity was classified by this classifier.
classifications['entity'][n].classifiers[m].total_classifications	9	The total number of classifications made by this classifier in this location. A single column can contain multiple entities, for example a date and time or a complex address.
classifications['entity'][n].classifiers[m].details	Object	Optional. Additional key-value details provided by the classifier.
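
The row counters make it possible to see how much of a column a classifier matched. The sketch below builds a per-column summary from a response trimmed to the fields it reads. The `coverage` value and the helper name `summarize_columns` are illustrative additions, not fields returned by the API, and the sketch does not attempt to reproduce how the service computes `score`.

```python
def summarize_columns(response):
    """Return one summary dict per detected column, ordered by column_index."""
    summary = []
    for entity, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            # Use the classifier that matched the most rows in this column.
            best = max(occ["classifiers"], key=lambda c: c["rows_with_classification"])
            rows = occ["rows_processed"]
            summary.append({
                "column_index": occ["location"]["column_index"],
                "column_name": occ["location"]["column_name"],
                "entity": entity,
                "score": occ["score"],
                # Share of processed rows matched; 0.0 if no rows were processed.
                "coverage": best["rows_with_classification"] / rows if rows else 0.0,
            })
    return sorted(summary, key=lambda s: s["column_index"])


sample = {
    "classifications": {
        "BANK_ACCOUNT": [{
            "score": 0.7901234567901234, "rows_processed": 9,
            "location": {"column_name": "IBAN", "column_index": 2},
            "classifiers": [{"name": "IbanRecognizer", "rows_with_classification": 8}],
        }],
        "SOCIAL_SECURITY_ID": [{
            "score": 0.9994888835483127, "rows_processed": 9,
            "location": {"column_name": "Social Security Number", "column_index": 0},
            "classifiers": [{"name": "context", "rows_with_classification": 9}],
        }],
    }
}
for row in summarize_columns(sample):
    print(row["column_name"], row["entity"], round(row["coverage"], 3))
```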

Response Codes

Response Code	Description
200	Successful Response.
206	Partial Content. Only some providers classified data successfully.
400	Bad Request. Invalid input parameters or content.
413	Payload too large.
415	Unsupported media type.
422	Untrusted input. For more information, refer to Input Validation.
502	Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598	Unexpected internal server error. Check server logs.
599	Internal server error. Check server logs.

2 - Transform

Identify, classify, and transform sensitive data.

2.1 - Label Text API

Identify and classify sensitive data in plain text, then replace each detected value with a label naming its classified data type, such as [CREDIT_CARD].

POST https://{Host Address}/pty/data-discovery/v1.1/transform/label

Query Parameters

score_threshold

Type: float
Description: Optional. Label results where the score is greater than this threshold.
Values: Minimum 0, Maximum 1.0
Default: 0.7

include_providers

Type: binary
Description: Optional. Include details of the service providers in the response.
Values: Yes / No
Default: No

include_classification_details

Type: binary
Description: Optional. Include classification details in the response.
Values: Yes / No
Default: No

Body

Content type must be text/plain and in UTF-8 format.
Body size is limited to 10K Bytes.

Sample Request

curl -X POST "https://<SERVER_IP>/pty/data-discovery/v1.1/transform/label?score_threshold=0.85" \
          -H "Content-Type: text/plain" \
          --data "Jake lives at 15 Main st, Hamden 06517, Connecticut."

import requests

url = "https://<SERVER_IP>/pty/data-discovery/v1.1/transform/label"
params = {"score_threshold": 0.85}
headers = {"Content-Type": "text/plain"}
data = "Jake lives at 15 Main st, Hamden 06517, Connecticut."

# verify=False skips TLS certificate verification; use it only in test environments.
response = requests.post(url, params=params, headers=headers, data=data, verify=False)

print("Status code:", response.status_code)
print("Response JSON:", response.json())

URL: POST `https://<SERVER_IP>/pty/data-discovery/v1.1/transform/label`
   Query Parameters:
   -score_threshold (optional), float between 0.0 and 1.0, default: 0.7.
   Headers:
   -Content-Type: text/plain
   Body:
   -Jake lives at 15 Main st, Hamden 06517, Connecticut.

Sample Responses

Sample Response Default

{
    "transform": {
        "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
    }
}

The fields are described as follows:

Name	Example Response	Description
transform.text	[PERSON] lives at [LOCATION]...	The labeled input text, with the name of each classified entity in place of the original sensitive data.

Sample Response with Detail

{
        "transform": {
            "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
        },
        "providers": [
            {
                "name": "Pattern Classification Provider",
                "version": "1.1.0",
                "status": 200,
                "elapsed_time": 0.011328935623168945,
                "config_provider": {
                    "name": "Pattern",
                    "address": "http://pattern_provider_service:8051",
                    "supported_content_types": []
                }
            },
            {
                "name": "Context Classification Provider",
                "version": "1.1.0",
                "status": 200,
                "elapsed_time": 0.03895401954650879,
                "config_provider": {
                    "name": "Context",
                    "address": "http://context_provider_service:8052",
                    "supported_content_types": []
                }
            }
        ],
        "classifications": {
            "LOCATION": [
                {
                    "score": 0.85,
                    "location": {
                        "start_index": 17,
                        "end_index": 24
                    },
                    "classifiers": [
                        {
                            "provider_index": 0,
                            "name": "SpacyRecognizer",
                            "score": 0.85,
                            "original_entity": "LOCATION",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9240000128746033,
                    "location": {
                        "start_index": 26,
                        "end_index": 32
                    },
                    "classifiers": [
                        {
                            "provider_index": 0,
                            "name": "SpacyRecognizer",
                            "score": 0.85,
                            "original_entity": "LOCATION",
                            "details": {}
                        },
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9980000257492065,
                            "original_entity": "CITY",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9244499981403351,
                    "location": {
                        "start_index": 40,
                        "end_index": 51
                    },
                    "classifiers": [
                        {
                            "provider_index": 0,
                            "name": "SpacyRecognizer",
                            "score": 0.85,
                            "original_entity": "LOCATION",
                            "details": {}
                        },
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9988999962806702,
                            "original_entity": "STATE",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9958999752998352,
                    "location": {
                        "start_index": 14,
                        "end_index": 16
                    },
                    "classifiers": [
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9958999752998352,
                            "original_entity": "BUILDING",
                            "details": {}
                        }
                    ]
                },
                {
                    "score": 0.9983999729156494,
                    "location": {
                        "start_index": 33,
                        "end_index": 38
                    },
                    "classifiers": [
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.9983999729156494,
                            "original_entity": "ZIPCODE",
                            "details": {}
                        }
                    ]
                }
            ],
            "PERSON": [
                {
                    "score": 0.8819000124931335,
                    "location": {
                        "start_index": 0,
                        "end_index": 4
                    },
                    "classifiers": [
                        {
                            "provider_index": 1,
                            "name": "context",
                            "score": 0.8819000124931335,
                            "original_entity": "NAME",
                            "details": {}
                        }
                    ]
                }
            ]
        }
    }

The fields for the transform section are described as follows:

Name	Example Response	Description
transform.text	[PERSON] lives at [LOCATION]...	The labeled input text, with the name of each classified entity in place of the original sensitive data.

The fields for the providers section are described as follows:

Name	Example Response	Description
providers	Array	Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name	Pattern Classification Provider	Product name of the provider.
providers[n].version	1.1.0	Version of the provider.
providers[n].status	200	HTTP response code returned by the provider.
providers[n].elapsed_time	0.028	Time, in seconds, taken by the provider to process the request.
providers[n].config_provider	Object	Object containing configuration details for each provider.
providers[n].config_provider.name	Pattern	Internal name of the provider.
providers[n].config_provider.address	http://pattern_provider_service:8051	Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types	[]	Array of supported content types. An empty array indicates support for all content types.

The fields for the classifications section are described as follows:

Name	Example Response	Description
classifications	Dictionary	A dictionary mapping entity types (e.g., “PERSON”, “PHONE_NUMBER”) to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details.
classifications['entity'][n].score	0.9238	The confidence score for the detected entity, aggregated from all contributing classifiers.
classifications['entity'][n].location	Object	An object specifying the location of the entity within the input text.
classifications['entity'][n].location.start_index	14	The starting index of the entity in the input text.
classifications['entity'][n].location.end_index	25	The ending index of the entity in the input text.
classifications['entity'][n].classifiers	Array	An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index	0	The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name	SpacyRecognizer	The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score	0.85	The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].original_entity	PERSON	The original entity type detected by the classifier. See Harmonization for details.
classifications['entity'][n].classifiers[m].details	Object	Optional. Additional key-value details provided by the classifier.
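
The detailed response carries enough information to reproduce the labeled text on the client side. The sketch below applies the spans from the detailed sample response to the sample request text and arrives at the same `transform.text`. It is an illustration of how the fields relate, not the service's implementation; it assumes the spans do not overlap, and `apply_labels` is a hypothetical helper name.

```python
def apply_labels(text, classifications):
    """Replace each classified span with its [ENTITY] label."""
    spans = [
        (occ["location"]["start_index"], occ["location"]["end_index"], entity)
        for entity, occurrences in classifications.items()
        for occ in occurrences
    ]
    # Replace from the right so that earlier indices stay valid.
    for start, end, entity in sorted(spans, reverse=True):
        text = text[:start] + f"[{entity}]" + text[end:]
    return text


def span(start, end):
    """Build a minimal occurrence object holding only the location."""
    return {"location": {"start_index": start, "end_index": end}}


text = "Jake lives at 15 Main st, Hamden 06517, Connecticut."
classifications = {
    "LOCATION": [span(17, 24), span(26, 32), span(40, 51), span(14, 16), span(33, 38)],
    "PERSON": [span(0, 4)],
}
print(apply_labels(text, classifications))
```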

Response Codes

Response Code	Description
200	Successful Response.
206	Partial Content. Only some providers classified data successfully.
400	Bad Request. Invalid input parameters or content.
413	Payload too large.
415	Unsupported media type.
422	Untrusted input. For more information, refer to Input Validation.
502	Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598	Unexpected internal server error. Check server logs.
599	Internal server error. Check server logs.

2.1.1 - Handling Overlapping Conflicts

Resolving conflicts between entities that label sensitive data.

While classifying data, the providers may label the same text as two different entities. These differences arise from the detection strategies that the classifiers adopt. Data Discovery resolves the conflicts by applying a set of rules to the conflicting entities.

The rules for handling the conflicting entities are as follows:

- No overlap: If the two entities do not conflict, the results are retained in their original form.
  For example, Jake Filbert lives in Connecticut. If only Jake Filbert is identified, the result is labeled as [NAME] lives in Connecticut.
- Full overlap: If both entities cover the same text, the following logic is applied:
  - Select the entity with the higher confidence score.
  - If both entities have the same confidence score, select the first entity.
  For example, Jake Filbert lives in Connecticut. Here, the name is recognized as [USER] with a score of 0.7 and as [NAME] with a score of 0.9. Because [NAME] has the higher score, the result is labeled as [NAME] lives in Connecticut.
- One entity contained in the other: If one entity is completely contained in the other, select the entity with the longer text.
  For example, jake@email.com. Here, the classifiers may recognize the text as [NAME] and [EMAIL]. Because [EMAIL] covers the longer text, the result is labeled as [EMAIL].
- Partial intersection: If the two entities overlap partially, the result is a combination of both.
  For example, 092-33445. Here, the classifiers may recognize the text as [PHONE_NUMBER] and [SSN]. The result is labeled as [PHONE_NUMBER&SSN].
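
The four rules can be sketched for a pair of candidate entities as follows. This is a minimal illustration under stated assumptions, not the Data Discovery implementation: the `Span` type and `resolve` function are hypothetical, the character offsets in the usage lines are invented, and for a partial intersection the sketch assumes the combined label covers the union of both spans and carries the higher of the two scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    label: str
    start: int    # inclusive
    end: int      # exclusive
    score: float


def resolve(a: Span, b: Span) -> list:
    """Apply the four rules to two candidates; `a` is the first entity."""
    # No overlap: keep both results unchanged.
    if a.end <= b.start or b.end <= a.start:
        return [a, b]
    # Full overlap: identical span; higher score wins, ties go to the first.
    if (a.start, a.end) == (b.start, b.end):
        return [a if a.score >= b.score else b]
    # One entity contained in the other: keep the longer text.
    if a.start <= b.start and b.end <= a.end:
        return [a]
    if b.start <= a.start and a.end <= b.end:
        return [b]
    # Partial intersection: combine labels over the union (assumption).
    return [Span(f"{a.label}&{b.label}", min(a.start, b.start),
                 max(a.end, b.end), max(a.score, b.score))]


# "Jake Filbert" seen as USER (0.7) and NAME (0.9): NAME wins.
print(resolve(Span("USER", 0, 12, 0.7), Span("NAME", 0, 12, 0.9)))
# "jake@email.com": NAME covers "jake", EMAIL covers everything: EMAIL wins.
print(resolve(Span("NAME", 0, 4, 0.8), Span("EMAIL", 0, 14, 0.8)))
# "092-33445" with partially overlapping spans: labels are combined.
print(resolve(Span("PHONE_NUMBER", 0, 7, 0.8), Span("SSN", 4, 9, 0.8)))
```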

2.1.2 - Sample Response Default

Sample Response Default.

{
    "transform": {
        "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
    }
}

The fields are described as follows:

Name	Example Response	Description
transform.text	[PERSON] lives at [LOCATION]...	The labeled input text, with the name of each classified entity in place of the original sensitive data.

2.1.3 - Sample Response with Detail

Sample Response with Detail.

{
    "transform": {
        "text": "[PERSON] lives at [LOCATION] [LOCATION], [LOCATION] [LOCATION], [LOCATION]."
    },
    "providers": [
        {
            "name": "Pattern Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.011328935623168945,
            "config_provider": {
                "name": "Pattern",
                "address": "http://pattern_provider_service:8051",
                "supported_content_types": []
            }
        },
        {
            "name": "Context Classification Provider",
            "version": "1.1.0",
            "status": 200,
            "elapsed_time": 0.03895401954650879,
            "config_provider": {
                "name": "Context",
                "address": "http://context_provider_service:8052",
                "supported_content_types": []
            }
        }
    ],
    "classifications": {
        "LOCATION": [
            {
                "score": 0.85,
                "location": {
                    "start_index": 17,
                    "end_index": 24
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9240000128746033,
                "location": {
                    "start_index": 26,
                    "end_index": 32
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9980000257492065,
                        "original_entity": "CITY",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9244499981403351,
                "location": {
                    "start_index": 40,
                    "end_index": 51
                },
                "classifiers": [
                    {
                        "provider_index": 0,
                        "name": "SpacyRecognizer",
                        "score": 0.85,
                        "original_entity": "LOCATION",
                        "details": {}
                    },
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9988999962806702,
                        "original_entity": "STATE",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9958999752998352,
                "location": {
                    "start_index": 14,
                    "end_index": 16
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9958999752998352,
                        "original_entity": "BUILDING",
                        "details": {}
                    }
                ]
            },
            {
                "score": 0.9983999729156494,
                "location": {
                    "start_index": 33,
                    "end_index": 38
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.9983999729156494,
                        "original_entity": "ZIPCODE",
                        "details": {}
                    }
                ]
            }
        ],
        "PERSON": [
            {
                "score": 0.8819000124931335,
                "location": {
                    "start_index": 0,
                    "end_index": 4
                },
                "classifiers": [
                    {
                        "provider_index": 1,
                        "name": "context",
                        "score": 0.8819000124931335,
                        "original_entity": "NAME",
                        "details": {}
                    }
                ]
            }
        ]
    }
}

The fields for the transform section are described as follows:

Name	Example Response	Description
transform.text	[PERSON] lives at [LOCATION]..	The labeled input text, with each classified entity shown by name in place of the original sensitive data.

The fields for the providers section are described as follows:

Name	Example Response	Description
providers	Array	Array of provider objects that participated in the request, including their respective success or failure codes.
providers[n].name	Pattern Classification Provider	Product name of the provider.
providers[n].version	1.0.0	Version of the provider.
providers[n].status	200	HTTP response code returned by the provider.
providers[n].elapsed_time	0.028	Time, in seconds, taken by the provider to process the request.
providers[n].config_provider	Object	Object containing configuration details for each provider.
providers[n].config_provider.name	Pattern	Internal name of the provider.
providers[n].config_provider.address	http://pattern_provider_service:8051	Network address or endpoint of the provider.
providers[n].config_provider.supported_content_types	[]	Array of supported content types. An empty array indicates support for all content types.

The fields for the classifications section are described as follows:

Name	Example Response	Description
classifications	Dictionary	A dictionary mapping entity types (e.g., "PERSON", "PHONE_NUMBER") to arrays of occurrence objects. Each key is an entity type, and its value is a list of detected occurrences, each containing location and classifier details.
classifications['entity'][n].score	0.9238	The confidence score for the detected entity, aggregated from all contributing classifiers.
classifications['entity'][n].location	Object	An object specifying the location of the entity within the input text.
classifications['entity'][n].location.start_index	14	The starting index of the entity in the input text.
classifications['entity'][n].location.end_index	25	The ending index of the entity in the input text.
classifications['entity'][n].classifiers	Array	An array of classifier objects that contributed to the entity detection.
classifications['entity'][n].classifiers[m].provider_index	0	The index of the provider in the top-level providers array.
classifications['entity'][n].classifiers[m].name	SpacyRecognizer	The name of the classifier. A provider may have multiple classifiers.
classifications['entity'][n].classifiers[m].score	0.85	The score assigned by the classifier for the entity detection.
classifications['entity'][n].classifiers[m].original_entity	PERSON	The original entity type detected by the classifier. See Harmonization for details.
classifications['entity'][n].classifiers[m].details	Object	Optional. Additional key-value details provided by the classifier.
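
As an illustration of how these fields fit together, the following Python sketch walks a detailed response and resolves each provider_index against the top-level providers array. The SAMPLE payload is a trimmed-down version of the response shown above, and the flatten helper is a hypothetical name used only for this example.

```python
import json

# Trimmed-down response that follows the documented field layout.
SAMPLE = json.loads("""
{
  "providers": [
    {"name": "Pattern Classification Provider", "status": 200},
    {"name": "Context Classification Provider", "status": 200}
  ],
  "classifications": {
    "PERSON": [
      {
        "score": 0.8819,
        "location": {"start_index": 0, "end_index": 4},
        "classifiers": [
          {"provider_index": 1, "name": "context",
           "score": 0.8819, "original_entity": "NAME", "details": {}}
        ]
      }
    ]
  }
}
""")


def flatten(response: dict) -> list[dict]:
    """Return one row per classifier hit, with the provider name resolved."""
    providers = response.get("providers", [])
    rows = []
    for entity, occurrences in response.get("classifications", {}).items():
        for occ in occurrences:
            for clf in occ["classifiers"]:
                rows.append({
                    "entity": entity,
                    "start": occ["location"]["start_index"],
                    "end": occ["location"]["end_index"],
                    "provider": providers[clf["provider_index"]]["name"],
                    "classifier": clf["name"],
                    "original_entity": clf["original_entity"],
                })
    return rows


for row in flatten(SAMPLE):
    print(row)
```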

3 - Harmonizing Provider Outputs

Aggregate responses under a similar category.

Depending on their detection logic, the Pattern and Context classification providers might classify the same data under different labels. The classification service standardizes provider outputs into a unified response.

Consider the following example: You can visit our office located in New York City.

Context provider might categorize New York City as CITY.
Pattern provider might categorize New York City as LOCATION.

This can cause an inconsistency in the outputs generated across the providers.

Data Discovery ensures standardization of responses by aggregating similar outputs of the providers under a common classification name. In the example shown, the classification service will categorize New York City under the category LOCATION.

Harmonization Process

The following points describe the harmonization process in detail.

Providers Mapping Entities

Each provider is responsible for mapping its identified entities to harmonized classification entities that are consistent with those used by other providers. This ensures that the classification service can accurately aggregate and interpret responses across multiple providers. When a provider’s classification is harmonized, the response must include the originally identified entity alongside the harmonized classification.

The following snippet shows how the Context classification provider initially classified the entity as CITY, which was then harmonized into the category LOCATION.

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9222000122070313,
        "location": {
          "start_index": 36,
          "end_index": 49
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "SpacyRecognizer",
            "score": 0.85,
            "original_entity": "LOCATION",
            "details": {}
          },
          {
            "provider_index": 1,
            "name": "context",
            "score": 0.9944000244140625,
            "original_entity": "CITY",
            "details": {}
          }
        ]
      }
    ]
  }
}

Grouping by Matching Indexes

Entities are grouped together only if the responses from the providers contain the same start_index, the same end_index, and a similar classification entity. If the start_index or end_index differs, the entities are not grouped together.

As shown in the following snippet, the Pattern and Context providers classify the data as IT_IDENTITY_CARD and ID_CARD respectively. These are then grouped under the NATIONAL_ID category by the classification service.

{
  "providers": "...",
  "classifications": {
    "NATIONAL_ID": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 14,
          "end_index": 25
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_classification",
            "score": 0.85,
            "original_entity": "IT_IDENTITY_CARD" 
          }, {
            "provider_index": 1,
            "name": "context_classification",
            "score": 0.9972000122070312,
            "original_entity": "ID_CARD" 
          }
        ]
      }
    ]
  }
}
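
The grouping rule can be sketched in Python as follows. This is an illustration only: the hits list and the group_hits function are hypothetical, and the hits are assumed to be already mapped to a harmonized entity. The aggregated scores in the samples (for example, 0.85 and 0.9972 yielding 0.9236) are consistent with a simple average, so the sketch uses one; the actual aggregation formula is not stated here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-provider hits, already mapped to a harmonized entity.
hits = [
    {"entity": "NATIONAL_ID", "start": 14, "end": 25, "provider_index": 0,
     "original_entity": "IT_IDENTITY_CARD", "score": 0.85},
    {"entity": "NATIONAL_ID", "start": 14, "end": 25, "provider_index": 1,
     "original_entity": "ID_CARD", "score": 0.9972},
]


def group_hits(hits):
    """Group hits that share the same harmonized entity and exact span."""
    groups = defaultdict(list)
    for hit in hits:
        groups[(hit["entity"], hit["start"], hit["end"])].append(hit)

    result = defaultdict(list)
    for (entity, start, end), members in groups.items():
        result[entity].append({
            # Assumption: the aggregate score is the mean of member scores.
            "score": mean([h["score"] for h in members]),
            "location": {"start_index": start, "end_index": end},
            "classifiers": [
                {"provider_index": h["provider_index"],
                 "score": h["score"],
                 "original_entity": h["original_entity"]}
                for h in members
            ],
        })
    return dict(result)


grouped = group_hits(hits)
print(len(grouped["NATIONAL_ID"]), round(grouped["NATIONAL_ID"][0]["score"], 4))
```

Hits whose indexes differ land in separate occurrence objects under the same entity key, which matches the non-matching case described next.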

Non-Matching Indexes

If the start_index and end_index values differ across the provider responses, the entities are not grouped together. However, they still appear under a common classification name.

The following table illustrates a common classification name for multiple providers.

Provider	Original Entity Labels	Common Classification Name
Pattern Provider	LOCATION	LOCATION
Context Provider	CITY, STATE, COUNTRY, COUNTY, ZIP_CODE, STREET, BUILDING, GEO_COORDINATE	LOCATION

The following snippet shows a sample response.

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 35
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_provider",
            "score": 0.85,
            "original_entity": "LOCATION"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 17
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "STREET"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 20,
          "end_index": 22
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "BUILDING"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 25,
          "end_index": 31
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "ZIP_CODE"
          }
        ]
      }
    ]
  }
}

Harmonization Fields

The following table lists the original entities and their corresponding harmonized classifications.

Original Provider Entity	Harmonized/Common Classification
US_BANK_NUMBER	BANK_ACCOUNT
IBAN_CODE	BANK_ACCOUNT
IBAN	BANK_ACCOUNT
BIC	BANK_ACCOUNT
CRYPTO	CRYPTO_ADDRESS
BITCOINADDRESS	CRYPTO_ADDRESS
ETHEREUMADDRESS	CRYPTO_ADDRESS
LITECOINADDRESS	CRYPTO_ADDRESS
IT_DRIVER_LICENSE	DRIVER_LICENSE
US_DRIVER_LICENSE	DRIVER_LICENSE
DRIVERLICENSE	DRIVER_LICENSE
US_PASSPORT	PASSPORT
IN_PASSPORT	PASSPORT
IT_PASSPORT	PASSPORT
PASSPORT	PASSPORT
IT_IDENTITY_CARD	NATIONAL_ID
FI_PERSONAL_IDENTITY_CODE	NATIONAL_ID
IN_AADHAAR	NATIONAL_ID
ES_NIE	NATIONAL_ID
SG_NRIC_FIN	NATIONAL_ID
PL_PESEL	NATIONAL_ID
SG_UEN	NATIONAL_ID
AU_ACN	NATIONAL_ID
IDCARD	NATIONAL_ID
US_ITIN	TAX_ID
AU_TFN	TAX_ID
IN_PAN	TAX_ID
ES_NIF	TAX_ID
IT_FISCAL_CODE	TAX_ID
AU_ABN	TAX_ID
IT_VAT_CODE	TAX_ID
US_SSN	SOCIAL_SECURITY_ID
UK_NINO	SOCIAL_SECURITY_ID
SSN	SOCIAL_SECURITY_ID
MEDICAL_LICENSE	HEALTH_CARE_ID
AU_MEDICARE	HEALTH_CARE_ID
UK_NHS	HEALTH_CARE_ID
DATE_TIME	DATETIME
DATE	DATETIME
TIME	DATETIME
EMAIL	EMAIL_ADDRESS
IP	IP_ADDRESS
IPV4	IP_ADDRESS
IPV6	IP_ADDRESS
NAME	PERSON
PHONE	PHONE_NUMBER
PIN	PASSWORD
PASSWORD	PASSWORD
CREDITCARDCVV	PASSWORD
BUILDING	LOCATION
COUNTRY	LOCATION
CITY	LOCATION
COUNTY	LOCATION
GEOCOORD	LOCATION
SECADDRESS	LOCATION
SECONDARYADDRESS	LOCATION
STATE	LOCATION
STREET	LOCATION
ZIPCODE	LOCATION
CCN	CREDIT_CARD
COMPANYNAME	ORGANIZATION
MAC	MAC_ADDRESS
ACCOUNTNAME	ACCOUNT_NAME
ACCOUNTNUMBER	ACCOUNT_NUMBER
CURRENCYCODE	CURRENCY_CODE
CURRENCYNAME	CURRENCY_NAME
CURRENCYSYMBOL	CURRENCY_SYMBOL
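
On the client side, the table can be treated as a simple lookup. The following Python sketch copies a few rows into a dictionary; the HARMONIZED name and the harmonize function are hypothetical, and returning unknown labels unchanged is an assumption of this example, not documented service behavior.

```python
# A subset of the rows from the table above.
HARMONIZED = {
    "US_SSN": "SOCIAL_SECURITY_ID",
    "UK_NINO": "SOCIAL_SECURITY_ID",
    "SSN": "SOCIAL_SECURITY_ID",
    "EMAIL": "EMAIL_ADDRESS",
    "NAME": "PERSON",
    "PHONE": "PHONE_NUMBER",
    "CITY": "LOCATION",
    "STATE": "LOCATION",
    "ZIPCODE": "LOCATION",
    "CCN": "CREDIT_CARD",
}


def harmonize(original_entity: str) -> str:
    """Map a provider label to its harmonized name.

    Labels without a table entry are returned unchanged (an assumption).
    """
    return HARMONIZED.get(original_entity, original_entity)


print(harmonize("CITY"))
print(harmonize("LOCATION"))
```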

4 - Input Validation

Rejecting unsanitized data.

The Classification service in Data Discovery offers an input validation security feature that rejects invalid input data. Data that is malformed, is not normalized, or contains homoglyphs, hieroglyphs, mixed Unicode variants, or control characters is considered unsanitized, or invalid. Such data is rejected and is not classified.

The following are a few examples of data that will be rejected:

Ⅷ
𝓉𝑒𝓍𝓉
Ｐｅｐ

Before invoking the Classification endpoint, ensure that the input text is normalized. Replace invalid characters with their corresponding normalized plaintext characters. If the input text contains any invalid character, a status code of 422 and the message Untrusted input are returned.
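
One possible way to normalize text before sending it is Unicode compatibility normalization (NFKC) from Python's standard unicodedata module, combined with dropping control characters. This is a sketch under assumptions: the exact validation rules of the service are not documented here, so NFKC may not be sufficient in every case, and normalize_text is a hypothetical helper name.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold compatibility characters and drop control characters."""
    folded = unicodedata.normalize("NFKC", text)
    # Keep common whitespace; drop other control characters (category Cc).
    return "".join(
        ch for ch in folded
        if ch in "\n\r\t" or unicodedata.category(ch) != "Cc"
    )


print(normalize_text("\u2167"))              # Roman numeral eight
print(normalize_text("\uff30\uff45\uff50"))  # fullwidth letters
```

NFKC folds compatibility forms such as Roman numeral characters and fullwidth letters into plain equivalents. It does not resolve cross-script homoglyphs (for example, a Cyrillic letter that looks like a Latin one), so such input may still be rejected.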

For security purposes, the application rejects unsanitized data by default. It is recommended that this feature remains enabled. However, to override this feature, perform the following steps.

For Docker Compose deployments, navigate to the docker_compose directory.
Edit the docker-compose.yaml file.
Under the environment section of classification_service, append the security parameter as follows.

- SECURITY_SETTINGS={"ENABLE_ALL_SECURITY_CONTROLS":false}

Save the changes.
If the application is already running, stop the containers first:

docker compose down

Start the application with your configuration changes following the Docker Compose deployment guide:

docker compose up -d

For Helm deployments, navigate to the /eks/helm/classification_app directory.
Create a values-override.yaml file with the required custom configuration.

securitySettings:
    ENABLE_ALL_SECURITY_CONTROLS: false

Save the changes.
If the application is already deployed, uninstall using the following command.

helm uninstall data-discovery-classification --namespace default --wait

Run the following installation command.

helm install data-discovery-classification . \
    --namespace default \
    --create-namespace \
    --wait \
    --wait-for-jobs \
    --timeout 900s \
    -f values-override.yaml

5 -

Response Code	Description
200	Successful Response.
206	Partial Content. Only some providers classified data successfully.
400	Bad Request. Invalid input parameters or content.
413	Payload too large.
415	Unsupported media type.
422	Untrusted input. For more information, refer to Input Validation.
502	Bad Gateway. All upstream providers failed; no successful data aggregation possible.
598	Unexpected internal server error. Check server logs.
599	Internal server error. Check server logs.
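
A client might branch on these codes as in the following Python sketch. The interpret_status function and its category names are hypothetical and chosen only for illustration; how a client should react to each code depends on the application.

```python
def interpret_status(status_code: int) -> str:
    """Map a documented response code to a coarse client-side category."""
    if status_code == 200:
        return "ok"
    if status_code == 206:
        # Some providers failed; the results may be incomplete.
        return "partial"
    if status_code in (400, 413, 415, 422):
        # The request itself must change before retrying.
        return "fix_request"
    if status_code in (502, 598, 599):
        # Provider or server failure; check the server logs.
        return "server_error"
    return "unknown"


for code in (200, 206, 422, 502):
    print(code, interpret_status(code))
```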
