Short Data Tokenization

Data is considered short when the number of tokenizable characters is below the tokenizer’s minimum input limit. Because short input generally produces weaker tokens, the behavior for short input data is configurable.

For tokenizers such as SLT_1_3, SLT_2_3, and SLT_X_1, the minimum input limit is three tokenizable characters or bytes. For tokenizers such as SLT_1_6 and SLT_2_6, the minimum input limit is six tokenizable characters or bytes.
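As a minimal sketch of the length rule above, the tokenizer names and limits below come from the text, while the lookup table and the `is_short` helper are illustrative assumptions, not the product API:

```python
# Minimum tokenizable input lengths per tokenizer, as stated in the text.
# The dictionary and helper are illustrative, not part of the product API.
MIN_INPUT_LENGTH = {
    "SLT_1_3": 3,
    "SLT_2_3": 3,
    "SLT_X_1": 3,
    "SLT_1_6": 6,
    "SLT_2_6": 6,
}

def is_short(tokenizer: str, data: str) -> bool:
    """Return True when the input is below the tokenizer's minimum length."""
    return len(data) < MIN_INPUT_LENGTH[tokenizer]
```

Input that this check flags as short is then handled according to the short-token flag configuration.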

The possible flag values for short data tokenization are described in the following table.

Table: Short tokens flag values

| Short Token Flag Value | Action |
| --- | --- |
| No, generate error | Do not tokenize the short input; instead, generate an error code and an audit log entry stating that the data is too short. |
| Yes | Tokenize the data even if the input is short. |
| No, return input as it is | Do not tokenize the short input; return the input unchanged. |
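The three flag behaviors can be sketched as follows. The names (`ShortTokenFlag`, `handle_input`, the `tokenize` callback) are illustrative assumptions, not the product API; in particular, the real product also writes an audit log entry in the error case:

```python
# Illustrative sketch of the three short-token flag behaviors; names are
# assumptions, not the product API.
from enum import Enum

class ShortTokenFlag(Enum):
    NO_GENERATE_ERROR = "no, generate error"
    YES = "yes"
    NO_RETURN_INPUT = "no, return input as it is"

def handle_input(data: str, minimum: int, flag: ShortTokenFlag, tokenize):
    """Apply the short-token flag policy to one input value."""
    if len(data) >= minimum:
        return tokenize(data)            # not short: always tokenize
    if flag is ShortTokenFlag.YES:
        return tokenize(data)            # short input yields a weaker token
    if flag is ShortTokenFlag.NO_RETURN_INPUT:
        return data                      # returned as-is, untokenized
    # NO_GENERATE_ERROR: the product would also write an audit log entry here
    raise ValueError("input data is too short to tokenize")
```

For example, with a minimum of three, a two-character input is tokenized under `YES`, passed through under `NO_RETURN_INPUT`, and rejected with an error under `NO_GENERATE_ERROR`.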

The following tokens support short data tokenization:

The following deprecated tokens support short data tokenization:

Important: Short input data tokenization carries risk: by tokenizing a small number of sample inputs, a user can more easily infer the lookup table and recover the original data. Consider carefully before enabling short data tokenization. If possible, avoid short input data altogether.

For more information about the maximum length setting for non-length-preserving token elements, refer to Minimum and Maximum Input Length by Token Types.


Last modified: December 18, 2025