Data Discovery is currently in Private Preview and is not yet Generally Available (GA). Do not use it in production environments, because features and functionality may change before the GA release.
Handling Overlapping Conflicts
While classifying data, the providers may label the same text as two different entities. This discrepancy arises from the different detection strategies that the classifiers adopt. Data Discovery resolves these conflicts by applying a set of rules to the conflicting entities.
The rules for handling the conflicting entities are as follows:
No overlap: If the two entities do not overlap, the results are retained in their original form.
For example,
Jake Filbert lives in Connecticut. If only Jake Filbert is identified, the result will be labeled as [NAME] lives in Connecticut.
Full overlap: If both entities overlap completely, the following logic is applied:
- Select the entity with the higher confidence score.
- If both entities have the same confidence score, select the first entity.
For example,
Jake Filbert lives in Connecticut. Here, the name is recognized as [USER] with a score of 0.7 and as [NAME] with a score of 0.9. As [NAME] has the higher score, the result will be labeled as [NAME] lives in Connecticut.
One entity contained in the other: If one entity is completely contained in the other, select the entity with the longer text.
For example,
jake@email.com. Here, the classifiers may recognize the text as [NAME] and [EMAIL]. As [EMAIL] is the longer text, the result will be labeled as [EMAIL].
Partial intersection: If the two entities overlap partially, the result will be a combination of both.
For example,
092-33445. Here, the classifiers may recognize the text as [PHONE_NUMBER] and [SSN]. The result will be labeled as [PHONE_NUMBER&SSN].
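The four rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Data Discovery implementation: the Entity class, the resolve and redact functions, the half-open character offsets, and the score given to a combined entity are all hypothetical and do not come from the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str    # entity name, e.g. "NAME" or "EMAIL"
    start: int    # inclusive character offset (assumed representation)
    end: int      # exclusive character offset (assumed representation)
    score: float  # classifier confidence score


def resolve(first: Entity, second: Entity) -> list[Entity]:
    """Apply the four conflict rules to a pair of detected entities."""
    # No overlap: keep both results in their original form.
    if first.end <= second.start or second.end <= first.start:
        return [first, second]
    # Full overlap: higher score wins; on a tie, the first entity wins.
    if (first.start, first.end) == (second.start, second.end):
        return [first if first.score >= second.score else second]
    # One entity contained in the other: keep the longer text.
    if first.start <= second.start and second.end <= first.end:
        return [first]
    if second.start <= first.start and first.end <= second.end:
        return [second]
    # Partial intersection: combine both labels over the merged span.
    # Using the higher score for the combined entity is an assumption.
    return [Entity(f"{first.label}&{second.label}",
                   min(first.start, second.start),
                   max(first.end, second.end),
                   max(first.score, second.score))]


def redact(text: str, entities: list[Entity]) -> str:
    """Replace each entity span with its [LABEL], working right to left."""
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:e.start] + f"[{e.label}]" + text[e.end:]
    return text


sentence = "Jake Filbert lives in Connecticut."
full = resolve(Entity("USER", 0, 12, 0.7), Entity("NAME", 0, 12, 0.9))
print(redact(sentence, full))  # [NAME] lives in Connecticut.

contained = resolve(Entity("NAME", 0, 4, 0.8), Entity("EMAIL", 0, 14, 0.8))
print(redact("jake@email.com", contained))  # [EMAIL]

partial = resolve(Entity("PHONE_NUMBER", 0, 7, 0.6), Entity("SSN", 4, 9, 0.6))
print(redact("092-33445", partial))  # [PHONE_NUMBER&SSN]
```

The offsets and scores used for the email and phone number examples are made up for the sketch; the source does not state which characters each classifier matched.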