Data Discovery is currently in Private Preview and is not yet Generally Available (GA). Do not use it in production environments, because features and functionality may change before the GA release.

Harmonizing Provider Outputs

Aggregate provider responses under a common classification name.

Because their detection logic differs, the Pattern and Context classification providers might classify the same data under different labels. The classification service standardizes the provider outputs into a unified response.

Consider the following example: "You can visit our office located in New York City."

  • Context provider might categorize New York City as CITY.
  • Pattern provider might categorize New York City as LOCATION.

This can cause an inconsistency in the outputs generated across the providers.

Data Discovery ensures standardization of responses by aggregating similar outputs of the providers under a common classification name. In the example shown, the classification service will categorize New York City under the category LOCATION.

Harmonization Process

The following sections describe the harmonization process in detail.

Providers Mapping Entities

Each provider is responsible for mapping its identified entities to harmonized classification entities that are consistent with those used by other providers. This ensures that the classification service can accurately aggregate and interpret responses across multiple providers. When a provider’s classification is harmonized, the response must include the originally identified entity alongside the harmonized classification.

The following snippet shows how the Context classification provider initially classified the entity as CITY, which was then harmonized into the category LOCATION.

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9222000122070313,
        "location": {
          "start_index": 36,
          "end_index": 49
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "SpacyRecognizer",
            "score": 0.85,
            "original_entity": "LOCATION",
            "details": {}
          },
          {
            "provider_index": 1,
            "name": "context",
            "score": 0.9944000244140625,
            "original_entity": "CITY",
            "details": {}
          }
        ]
      }
    ]
  }
}
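The mapping step described above can be sketched as a simple lookup. This is a minimal illustration, not the service's implementation: the `HARMONIZED` dictionary holds only a few rows copied from the harmonization table in this document, and the function names `harmonize` and `to_classifier`, as well as the fallback of returning an unmapped label unchanged, are assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical subset of a provider's mapping; the pairs come from the
# harmonization table in this document.
HARMONIZED: Dict[str, str] = {
    "CITY": "LOCATION",
    "STATE": "LOCATION",
    "STREET": "LOCATION",
    "IT_IDENTITY_CARD": "NATIONAL_ID",
    "US_SSN": "SOCIAL_SECURITY_ID",
}


def harmonize(original_entity: str) -> str:
    """Return the harmonized name for a provider label.

    Assumption: a label with no mapping is returned unchanged.
    """
    return HARMONIZED.get(original_entity, original_entity)


def to_classifier(provider_index: int, name: str, score: float,
                  original_entity: str) -> Tuple[str, dict]:
    """Build (harmonized_name, classifier_entry), keeping the original label."""
    entry = {
        "provider_index": provider_index,
        "name": name,
        "score": score,
        "original_entity": original_entity,
        "details": {},
    }
    return harmonize(original_entity), entry


category, entry = to_classifier(1, "context", 0.9944000244140625, "CITY")
print(category, entry["original_entity"])  # LOCATION CITY
```

The entry keeps `original_entity` next to the harmonized category, which mirrors the requirement that the response include both.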

Grouping by Matching Indexes

Entities are grouped together only if the responses returned by the providers have the same start_index, the same end_index, and the same harmonized classification. If the start_index or end_index differs, the entities are not grouped together.

As shown in the following snippet, the Pattern and Context providers classify the data as IT_IDENTITY_CARD and ID_CARD respectively. The classification service then groups them under the NATIONAL_ID category.

{
  "providers": "...",
  "classifications": {
    "NATIONAL_ID": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 14,
          "end_index": 25
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_classification",
            "score": 0.85,
            "original_entity": "IT_IDENTITY_CARD"
          },
          {
            "provider_index": 1,
            "name": "context_classification",
            "score": 0.9972000122070312,
            "original_entity": "ID_CARD"
          }
        ]
      }
    ]
  }
}
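The grouping rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the classification service's code: the function name `group_detections` and the flat input records (with a hypothetical `harmonized` field) are invented for the example. The group score is computed as the mean of the classifier scores, which happens to match the grouped sample responses in this document but is not confirmed as the service's formula.

```python
from collections import defaultdict


def group_detections(detections):
    """Group detections that share harmonized name, start_index and end_index."""
    groups = defaultdict(list)
    for d in detections:
        key = (d["harmonized"], d["start_index"], d["end_index"])
        groups[key].append(d)

    classifications = defaultdict(list)
    for (name, start, end), members in groups.items():
        scores = [m["score"] for m in members]
        classifications[name].append({
            # Assumption: mean of the classifier scores (not documented).
            "score": sum(scores) / len(scores),
            "location": {"start_index": start, "end_index": end},
            "classifiers": [
                {
                    "provider_index": m["provider_index"],
                    "name": m["name"],
                    "score": m["score"],
                    "original_entity": m["original_entity"],
                }
                for m in members
            ],
        })
    return dict(classifications)


# Hypothetical flat records; values are taken from the sample response.
detections = [
    {"provider_index": 0, "name": "pattern_classification", "score": 0.85,
     "original_entity": "IT_IDENTITY_CARD", "harmonized": "NATIONAL_ID",
     "start_index": 14, "end_index": 25},
    {"provider_index": 1, "name": "context_classification",
     "score": 0.9972000122070312,
     "original_entity": "ID_CARD", "harmonized": "NATIONAL_ID",
     "start_index": 14, "end_index": 25},
]

result = group_detections(detections)
print(len(result["NATIONAL_ID"]))                    # 1
print(len(result["NATIONAL_ID"][0]["classifiers"]))  # 2
```

If either index of the second record differed, the two detections would produce two separate entries under NATIONAL_ID instead of one grouped entry.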

Non-Matching Indexes

If the start_index and end_index values returned by the providers differ, the entities are not grouped into a single entry. However, they still appear under a common classification name.

The following table shows how location labels from multiple providers map to a common classification name.

| Provider | Original Entity Labels | Common Classification Name |
|---|---|---|
| Pattern Provider | LOCATION | LOCATION |
| Context Provider | CITY, STATE, COUNTRY, COUNTY, ZIP_CODE, STREET, BUILDING, GEO_COORDINATE | LOCATION |

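The following snippet shows a sample response in which one Pattern provider match and three Context provider matches, each with different indexes, appear as separate entries under LOCATION.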

{
  "providers": "...",
  "classifications": {
    "LOCATION": [
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 35
        },
        "classifiers": [
          {
            "provider_index": 0,
            "name": "pattern_provider",
            "score": 0.85,
            "original_entity": "LOCATION"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 0,
          "end_index": 17
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "STREET"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 20,
          "end_index": 22
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "BUILDING"
          }
        ]
      },
      {
        "score": 0.9236000061035157,
        "location": {
          "start_index": 25,
          "end_index": 31
        },
        "classifiers": [
          {
            "provider_index": 1,
            "name": "context_provider",
            "score": 0.9972000122070312,
            "original_entity": "ZIP_CODE"
          }
        ]
      }
    ]
  }
}
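A consumer of such a response may want to see every span reported under a common classification name, whether or not the entries were grouped. The sketch below is only an illustration: the helper name `list_spans` is hypothetical, and the `response` dictionary is a trimmed copy of the sample above that keeps just the fields the helper reads.

```python
def list_spans(response):
    """Flatten a response into (name, start, end, original_entities) rows."""
    rows = []
    for name, entries in response["classifications"].items():
        for entry in entries:
            loc = entry["location"]
            originals = [c["original_entity"] for c in entry["classifiers"]]
            rows.append((name, loc["start_index"], loc["end_index"], originals))
    # Sort by position so overlapping spans are listed next to each other.
    return sorted(rows, key=lambda r: (r[1], r[2]))


# Trimmed copy of the sample response (scores and provider details omitted).
response = {
    "classifications": {
        "LOCATION": [
            {"location": {"start_index": 0, "end_index": 35},
             "classifiers": [{"original_entity": "LOCATION"}]},
            {"location": {"start_index": 0, "end_index": 17},
             "classifiers": [{"original_entity": "STREET"}]},
            {"location": {"start_index": 20, "end_index": 22},
             "classifiers": [{"original_entity": "BUILDING"}]},
            {"location": {"start_index": 25, "end_index": 31},
             "classifiers": [{"original_entity": "ZIP_CODE"}]},
        ]
    }
}

spans = list_spans(response)
for span in spans:
    print(span)
```

Each row carries the common classification name together with the original provider labels, so the narrower Context provider spans (STREET, BUILDING, ZIP_CODE) remain distinguishable from the wider Pattern provider span.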

Harmonization Fields

The following table lists the original provider entities and their corresponding harmonized classifications.

| Original Provider Entity | Harmonized/Common Classification |
|---|---|
| US_BANK_NUMBER | BANK_ACCOUNT |
| IBAN_CODE | BANK_ACCOUNT |
| IBAN | BANK_ACCOUNT |
| BIC | BANK_ACCOUNT |
| CRYPTO | CRYPTO_ADDRESS |
| BITCOINADDRESS | CRYPTO_ADDRESS |
| ETHEREUMADDRESS | CRYPTO_ADDRESS |
| LITECOINADDRESS | CRYPTO_ADDRESS |
| IT_DRIVER_LICENSE | DRIVER_LICENSE |
| US_DRIVER_LICENSE | DRIVER_LICENSE |
| DRIVERLICENSE | DRIVER_LICENSE |
| US_PASSPORT | PASSPORT |
| IN_PASSPORT | PASSPORT |
| IT_PASSPORT | PASSPORT |
| PASSPORT | PASSPORT |
| IT_IDENTITY_CARD | NATIONAL_ID |
| FI_PERSONAL_IDENTITY_CODE | NATIONAL_ID |
| IN_AADHAAR | NATIONAL_ID |
| ES_NIE | NATIONAL_ID |
| SG_NRIC_FIN | NATIONAL_ID |
| PL_PESEL | NATIONAL_ID |
| SG_UEN | NATIONAL_ID |
| AU_ACN | NATIONAL_ID |
| IDCARD | NATIONAL_ID |
| US_ITIN | TAX_ID |
| AU_TFN | TAX_ID |
| IN_PAN | TAX_ID |
| ES_NIF | TAX_ID |
| IT_FISCAL_CODE | TAX_ID |
| AU_ABN | TAX_ID |
| IT_VAT_CODE | TAX_ID |
| US_SSN | SOCIAL_SECURITY_ID |
| UK_NINO | SOCIAL_SECURITY_ID |
| SSN | SOCIAL_SECURITY_ID |
| MEDICAL_LICENSE | HEALTH_CARE_ID |
| AU_MEDICARE | HEALTH_CARE_ID |
| UK_NHS | HEALTH_CARE_ID |
| DATE_TIME | DATETIME |
| DATE | DATETIME |
| TIME | DATETIME |
| EMAIL | EMAIL_ADDRESS |
| IP | IP_ADDRESS |
| IPV4 | IP_ADDRESS |
| IPV6 | IP_ADDRESS |
| NAME | PERSON |
| PHONE | PHONE_NUMBER |
| PIN | PASSWORD |
| PASSWORD | PASSWORD |
| CREDITCARDCVV | PASSWORD |
| BUILDING | LOCATION |
| COUNTRY | LOCATION |
| CITY | LOCATION |
| COUNTY | LOCATION |
| GEOCOORD | LOCATION |
| SECADDRESS | LOCATION |
| SECONDARYADDRESS | LOCATION |
| STATE | LOCATION |
| STREET | LOCATION |
| ZIPCODE | LOCATION |
| CCN | CREDIT_CARD |
| COMPANYNAME | ORGANIZATION |
| MAC | MAC_ADDRESS |
| ACCOUNTNAME | ACCOUNT_NAME |
| ACCOUNTNUMBER | ACCOUNT_NUMBER |
| CURRENCYCODE | CURRENCY_CODE |
| CURRENCYNAME | CURRENCY_NAME |
| CURRENCYSYMBOL | CURRENCY_SYMBOL |

Last modified: September 03, 2025