Data Discovery is currently in Private Preview and is not available for General Availability (GA). It should not be used in production environments, as features and functionality may change before the final GA release.
Input Validation
The Classification service in Data Discovery offers an input validation security feature that rejects invalid input data. Data that is malformed, non-normalized, containing homoglyphs, hieroglyphs, mixed Unicode variants, or control characters is considered as unsanitized or invalid data. These are rejected and will not be classified.
The following are few examples of data that will be rejected:
- Ⅷ
- 𝓉𝑒𝓍𝓉
- Pep
Before invoking the Classification endpoint, ensure that the input text is normalized. Replace invalid characters by their corresponding normalized plaintext characters. If the input text contains any invalid character, a status code of 422 and a message Untrusted input is returned.
For security purposes, the application rejects unsanitized data by default. It is recommended that this feature remains enabled. However, to override this feature, perform the following steps.