Unicode

Details about the Unicode token type.

Deprecated

Starting from v10.0.x, the Unicode token type is deprecated.
It is recommended to use the Unicode Gen2 token type instead of the Unicode token type.

The Unicode token type can be used to tokenize multi-byte character strings. The input is treated as a byte stream, so there are no delimiters. No character conversion or code point validation is performed on the input. The token value is alpha-numeric.

The encoding and Unicode character set of the input data affect the protected data length. For instance, the following table lists the lengths, in bytes, of two sample values in UTF-8 and UTF-16.

Table: Lengths for UTF-8 and UTF-16

Input Values	UTF-8	UTF-16
導字社導字會	18 bytes	12 bytes
Protegrity	10 bytes	20 bytes
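
The lengths in the table can be reproduced with a short Python sketch. The helper name `encoded_lengths` is hypothetical and is not part of any Protegrity API. The sketch assumes UTF-16 without a byte order mark, which is why it uses the `utf-16-le` codec.

```python
def encoded_lengths(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        # "utf-16-le" is used because the plain "utf-16" codec prepends a 2-byte BOM.
        "UTF-16": len(text.encode("utf-16-le")),
    }


for value in ("導字社導字會", "Protegrity"):
    print(value, encoded_lengths(value))
```

CJK characters in this range take 3 bytes each in UTF-8 and 2 bytes each in UTF-16, while ASCII characters take 1 byte in UTF-8 and 2 bytes in UTF-16, which explains why the two rows move in opposite directions.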

Table: Unicode Tokenization Type properties

Tokenization Type Properties	Settings
Name	Unicode
Token type and Format	Application protectors support UTF-8, UTF-16LE, and UTF-16BE encoding. Hex character codes from 0x00 to 0xFF. For the list of supported characters, refer to ASCII Character Codes.
Tokenizer	Length Preservation	Allow Short Data	Minimum Length	Maximum Length^*2
SLT_1_3^*1, SLT_2_3^*1	No	Yes	1 byte	4096 bytes
		No, return input as it is	3 bytes
		No, generate error	3 bytes
Possibility to set minimum/maximum length	No
Left/Right settings	No
Internal IV	No
External IV	Yes
Return of Protected value	Yes
Token specific properties	Tokenization result is Alpha-Numeric.

^*1 - If the input and output types of the API are BYTE[], the customer application should convert the input to a byte array before calling the API and convert the output from a byte array after the call returns.

^*2 - The maximum input length that can be safely tokenized and detokenized is 4096 bytes, irrespective of the byte representation.
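
As an illustration of both footnotes, the following Python sketch converts a string to a byte array and checks it against the 4096-byte limit before it would be handed to a byte-array API. The names `MAX_INPUT_BYTES`, `to_api_bytes`, and `from_api_bytes` are hypothetical. The actual protect and unprotect calls are omitted because their signatures are not given in this section.

```python
MAX_INPUT_BYTES = 4096  # maximum safe input length stated in footnote 2


def to_api_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text for a byte-array API and enforce the byte limit."""
    data = text.encode(encoding)
    if len(data) > MAX_INPUT_BYTES:
        raise ValueError(
            f"input is {len(data)} bytes in {encoding}; limit is {MAX_INPUT_BYTES}"
        )
    return data


def from_api_bytes(data: bytes, encoding: str = "utf-8") -> str:
    """Decode a byte array returned by the API back into a string."""
    return data.decode(encoding)


sample = "安" * 2048
# 2048 CJK characters are exactly 4096 bytes in UTF-16LE ...
print(len(to_api_bytes(sample, "utf-16-le")))
# ... but 6144 bytes in UTF-8, which exceeds the limit.
try:
    to_api_bytes(sample, "utf-8")
except ValueError as err:
    print(err)
```

Because the limit is counted in bytes, the same string can be within the limit in one encoding and over it in another, so the check is best made after encoding.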

The following table shows examples of the way in which a value will be tokenized with the Unicode token.

Table: Examples of Tokenization for Unicode Values

Input Value	Tokenized Value	Comments
Протегріті	WurIeXLFZPApXQorkFCKl3hpRaGR28K	Input value contains Cyrillic characters. Tokenization result is Alpha-Numeric.
安全	xM2EcAQ0LVtQJ	Input value contains characters in Simplified Chinese. Tokenization result is Alpha-Numeric.
Protegrity	RsbQU8KdcQzHJ1	Algorithm is non-length preserving. Tokenized value is longer than the input value.
a	V2wU	Unicode, Allow Short Data=Yes. Algorithm is non-length preserving. Tokenized value is longer than the input value.
a9c	A0767Vo

Unicode Tokenization Properties for different protectors

Unicode tokenization is supported only by Application Protectors, Big Data Protector and Data Warehouse Protector.

Application Protector

The following table shows the supported input data types for Application protectors with the Unicode token.

Table: Supported input data types for Application protectors with Unicode token

Application Protectors^*2	AP Java^*1	AP Python
Supported input data types	BYTE[] CHAR[] STRING	BYTES STRING

^*1 - The API accepts and returns data in BYTE[] format. The customer application needs to convert the input into byte arrays before calling the API, and similarly, convert the output from byte arrays after receiving the response from the API.

^*2 - The Protegrity Application Protectors only support bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the Application Protector APIs that support byte as input and provide byte as output, then data corruption might occur.
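
The difference that footnote 2 describes can be sketched in Python. This is only an illustration of the two conversion paths. The variable names are hypothetical and no Protegrity API is called.

```python
account_number = 1234567890

# Supported path: number -> string -> bytes.
# The bytes are the encoded digits of the number.
safe_input = str(account_number).encode("utf-8")

# Path the footnote warns against: the raw binary representation of the number.
# These bytes are not encoded text, so passing them to a byte-array API
# might corrupt the data.
raw_input = account_number.to_bytes(8, byteorder="big")

print(safe_input)  # b'1234567890'
print(raw_input)   # 8 bytes of binary data, not text

# Only the string-derived bytes convert back through text.
restored = int(safe_input.decode("utf-8"))
print(restored == account_number)
```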

For more information about Application protectors, refer to Application Protector.

Big Data Protector

Protegrity supports MapReduce, Hive, Pig, HBase, Spark, and Impala, which utilize the Hadoop Distributed File System (HDFS) or Ozone as the data storage layer. The data is protected from internal and external threats, and users and business processes can continue to utilize the secured data. Protegrity protects data inside the files using tokenization and strong encryption protection methods.

The minimum and maximum lengths supported for the Big Data Protector are as described by the following points:

MapReduce: The maximum length that can be safely tokenized and detokenized is 4096 bytes. The user controls the encoding, as required.
Spark: The maximum length that can be safely tokenized and detokenized is 4096 bytes. The user controls the encoding, as required.
Hive: The ptyProtectUnicode and ptyUnprotectUnicode UDFs convert data to UTF-16LE encoding internally. This encoding requires a minimum of four bytes of data in UTF-16LE encoding. Additionally, it has a maximum limit of 4096 bytes in UTF-16LE encoding for safely tokenizing and detokenizing the data. The pty_ProtectStr and pty_UnprotectStr UDFs convert data to UTF-8 encoding internally. This encoding requires a minimum of three bytes of data in UTF-8 encoding. Additionally, it has a maximum limit of 4096 bytes for safely tokenizing and detokenizing the data.
Impala: The pty_UnicodeStringIns and pty_UnicodeStringSel UDFs convert data to UTF-16LE encoding internally. This encoding requires a minimum of four bytes of data in UTF-16LE encoding. Additionally, it has a maximum limit of 4096 bytes in UTF-16LE encoding for safely tokenizing and detokenizing the data. The pty_StringIns and pty_StringSel UDFs convert data to UTF-8 encoding internally. This encoding requires a minimum of three bytes of data in UTF-8 encoding. Additionally, it has a maximum limit of 4096 bytes for safely tokenizing and detokenizing the data.
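
The Hive and Impala limits above depend on the encoding that each UDF family uses internally. The following Python sketch checks a value against those limits before it is sent to a UDF. The `LIMITS` table and the `within_limits` helper are hypothetical and only restate the numbers given in the points above.

```python
# (minimum bytes, maximum bytes) per internal encoding, as stated above.
LIMITS = {
    "utf-16-le": (4, 4096),  # ptyProtectUnicode / pty_UnicodeStringIns family
    "utf-8": (3, 4096),      # pty_ProtectStr / pty_StringIns family
}


def within_limits(text: str, encoding: str) -> bool:
    """Return True if text, once encoded, fits the stated byte limits."""
    minimum, maximum = LIMITS[encoding]
    size = len(text.encode(encoding))
    return minimum <= size <= maximum


for value in ("a", "ab", "安全"):
    print(value, {enc: within_limits(value, enc) for enc in LIMITS})
```

The value `ab` shows why the encoding matters: it is 2 bytes in UTF-8, below the three-byte minimum, but 4 bytes in UTF-16LE, which meets the four-byte minimum.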

The following table shows the supported input data types for Big Data protectors with the Unicode token.

Table: Supported input data types for Big Data protectors with Unicode token

Big Data Protectors	MapReduce^*2	Hive	Pig	HBase^*2	Impala	Spark^*2	Spark SQL	Trino
Supported input data types^*1	BYTE[]	STRING	Not supported	BYTE[]	STRING	BYTE[] STRING	STRING	VARCHAR

^*1 – If the input and output types of the API are BYTE[], the customer application should convert the input to a byte array before calling the API and convert the output from a byte array after the call returns.

^*2 – The Protegrity MapReduce protector, HBase coprocessor, and Spark protector only support bytes converted from the string data type. Data corruption might occur in the following cases:

Any other data type is directly converted to bytes and passed as input to the MapReduce or Spark API that supports byte as input and provides byte as output.
Any other data type is directly converted to bytes and inserted in an HBase table that is configured with the Protegrity HBase coprocessor.

For more information about Big Data protectors, refer to Big Data Protector.

Data Warehouse Protector

The Protegrity Data Warehouse Protector is an advanced security solution designed to protect sensitive data at the column level. This enables you to secure your data while still permitting access to authorized users. Additionally, the Data Warehouse Protector integrates seamlessly with existing database systems using User-Defined Functions (UDFs) for enhanced security. Protegrity protects data inside the data warehouses using various tokenization and encryption methods.

If short data tokenization is not enabled, the minimum length for the Unicode tokenization type is 3 bytes. The Teradata Unicode UDF encodes the input value in UTF-16, so internally the data length is doubled. For example, a two-character ASCII value occupies 4 bytes internally. Hence, the Teradata Unicode UDF can tokenize input whose length is less than the minimum supported length of 3 bytes.

The External IV is not supported in Data Warehouse Protector.

The following table shows the supported input data types for the Teradata protector with the Unicode token.

Table: Supported input data types for Data Warehouse protectors with Unicode token

Data Warehouse Protectors	Teradata
Supported input data types	VARCHAR UNICODE

For more information about Data Warehouse protectors, refer to Data Warehouse Protector.

Database Protectors

Oracle Database Protector

The following table shows the supported input data types for the Oracle Database protector with the Unicode token.

Table: Supported input data types for Oracle Database protector with Unicode token

Protector	Oracle
Supported Input Data Types	VARCHAR2

For more information about the Oracle Database protector, refer to Oracle Database Protector.

MSSQL Database Protector

The following table shows the supported input data types for the MSSQL Database protector with the Unicode token.

Table: Supported input data types for MSSQL Database protector with Unicode token

Protector	MSSQL
Supported Input Data Types	NVARCHAR

Note:
For the MSSQL database protector, if the Unicode UDFs are given input data exceeding 4000 characters, SQL Server internally processes only the first 4000 characters and truncates the rest.
Cross-product data migration for the Unicode token type is compatible between products that use the same encoding. For example, cross-product data migration for the Unicode token type is compatible between the Teradata and MSSQL database protectors because both use UTF-16 encoding. However, it is not compatible with the Oracle database protector because that protector uses UTF-8 encoding.
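
The compatibility rule in the note can be sketched in Python. The `PROTECTOR_ENCODING` map and the `migration_compatible` helper are hypothetical and only restate the encodings named in the note. The last lines illustrate the likely reason for the rule, under the assumption that the token is derived from the raw input bytes: the same text yields different byte streams in UTF-8 and UTF-16.

```python
# Encodings as named in the note above (hypothetical lookup table).
PROTECTOR_ENCODING = {
    "Teradata": "UTF-16",
    "MSSQL": "UTF-16",
    "Oracle": "UTF-8",
}


def migration_compatible(source: str, target: str) -> bool:
    """Return True if both protectors use the same encoding."""
    return PROTECTOR_ENCODING[source] == PROTECTOR_ENCODING[target]


print(migration_compatible("Teradata", "MSSQL"))   # True
print(migration_compatible("Teradata", "Oracle"))  # False

# The same text is a different byte stream in each encoding,
# even when the byte lengths happen to match.
text = "Протегріті"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
print(len(utf8_bytes), len(utf16_bytes), utf8_bytes == utf16_bytes)
```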

For more information about the MSSQL Database protector, refer to MSSQL Database Protector.

Last modified : May 21, 2026