CDP-PVC-Base

Install the Big Data Protector using the CDP-PVC-Base Installer

Features of the Big Data Protector on CDP-PVC-Base

The Protegrity Big Data Protector (Big Data Protector) uses vaultless tokenization and central policy control for access management and secures sensitive data at rest in the following areas:

Data in HDFS and Ozone
Data used during processing with MapReduce, Hive, Pig, HBase, Impala, and Spark
Data traversing enterprise data systems

The data is protected from internal and external threats, and users and business processes can continue to utilize the secured data.

Data can be protected by encryption or tokenization. In tokenization, the data is converted to similar-looking inert data, known as tokens, in which the data format and type can be preserved. These tokens can be detokenized back to the original values whenever required.
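
To make the idea of a format-preserving, reversible token concrete, the following is a minimal sketch in Python. It is a toy per-position digit substitution driven by a seeded shuffle, not Protegrity's tokenization algorithm, and it is not cryptographically secure. The function names and the `secret` parameter are assumptions made for illustration.

```python
import random

DIGITS = "0123456789"


def _tables(secret: str, length: int) -> list[str]:
    """Derive one shuffled digit alphabet per position from the secret."""
    rng = random.Random(secret)  # deterministic for a given secret
    tables = []
    for _ in range(length):
        alphabet = list(DIGITS)
        rng.shuffle(alphabet)
        tables.append("".join(alphabet))
    return tables


def tokenize(value: str, secret: str) -> str:
    """Replace each digit with a substitute digit; keep other characters."""
    tables = _tables(secret, len(value))
    return "".join(
        tables[i][int(ch)] if ch in DIGITS else ch
        for i, ch in enumerate(value)
    )


def detokenize(token: str, secret: str) -> str:
    """Invert tokenize() by looking each digit up in the same tables."""
    tables = _tables(secret, len(token))
    return "".join(
        str(tables[i].index(ch)) if ch in DIGITS else ch
        for i, ch in enumerate(token)
    )


card = "4111-1111-1111-1111"
token = tokenize(card, "demo-secret")
restored = detokenize(token, "demo-secret")
```

The token has the same length and separators as the input and contains only digits where the input had digits, and `restored` equals `card`.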

Protegrity protects data inside the files using tokenization and strong encryption protection methods. The data is unprotected only when the user access rights and the policies set using Policy management in ESA permit it.

The Protegrity Hadoop Big Data Protector provides the following features:

Provides fine-grained field-level protection within the MapReduce, Hive, Pig, HBase, and Spark frameworks.
Provides the Protegrity Format Preserving Encryption (FPE) method for structured data. The following data types are supported:
- Numeric (0-9)
- Alpha (a-z, A-Z)
- Alpha-Numeric (0-9, a-z, A-Z)
- Credit Card (0-9)
- Unicode Basic Latin and Latin-1 Supplement Alpha
- Unicode Basic Latin and Latin-1 Supplement Alpha-Numeric
Retains distributed processing capability as field-level protection is applied to the data.
Protects data in the Hadoop cluster using role-based administration with a centralized security policy.
Simplifies installation, administration, and management of Big Data Protector using the following components:
- Parcels: In Cloudera Manager, the Big Data Protector Parcel is a single consolidated file. This file contains all the required files for installing and using Big Data Protector on a cluster. It also contains the metadata used by Cloudera Manager.
- Custom Service Descriptors (CSDs): In Cloudera Manager, a CSD contains all the configurations required to describe and manage the Big Data Protector services. The CSDs are provided as Jar files.
Enables easy monitoring of the Big Data Protector services, such as BDP, using the Cloudera Manager UI instead of the CLI.

Provides logging and viewing data access activities and real-time alerts with a centralized monitoring system.
Ensures minimal overhead for processing secured data, with minimal consumption of resources, threads and processes, and network bandwidth.
Provides transparent data protection with Protegrity HBase protectors.

Currently, Protegrity supports MapReduce, Hive, Pig, HBase, Spark, and Impala, which utilize HDFS or Ozone as the data storage layer. The following points can be used as general guidelines:

Beeline and Hue: Beeline and Hue are certified with the Hive protector.
Ranger: Ranger is certified to work with the Hive protector.
Sentry (CDH): Sentry is certified with the Hive and Impala protectors only.

Overview of Hadoop Application Protection

The various levels of protection provided by Hadoop Application Protection are explained below.

Protection in MapReduce Jobs

A MapReduce job in the Hadoop cluster can involve sensitive data. You can use Protegrity interfaces to protect data when it is saved to or retrieved from a protected source. The output data written by the job can be encrypted or tokenized. The protected data can subsequently be used by other jobs in the cluster in a secured manner. Field-level data can be secured and ingested into HDFS by independent Hadoop jobs or other ETL tools. For more information about secure ingestion of data in Hadoop, refer to section Ingesting Files Using Hive Staging. For more information on the list of available APIs, refer to section MapReduce APIs.

Protection in Hive Queries

Protection in Hive queries is done by Protegrity Hive UDFs. Hive translates a HiveQL query into a MapReduce, Tez, or Spark distributed job before sending it to the Hadoop cluster, and the UDFs run as part of that job. If Hive queries are created to operate on sensitive data, then you can use Protegrity Hive UDFs for securing data. While inserting data into Hive tables, or retrieving data from protected Hive table columns, you can call the Protegrity UDFs loaded into Hive during installation. The UDFs protect data based on the input parameters provided. Secure ingestion of data into HDFS for Hive queries can be achieved by independent Hadoop jobs or other ETL tools. For more information about securely ingesting data in Hadoop, refer to Ingesting Data Securely. For more information on the list of available UDFs, refer to Hive UDFs.
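
The following is a minimal sketch, in Python, of how a field-level protect function can fit into the map and reduce phases of a distributed job. The `protect` and `unprotect` functions are hypothetical stand-ins (a fixed digit substitution), not Protegrity APIs or UDFs, and the phases are simulated in a single process rather than on a cluster.

```python
from collections import defaultdict

# Hypothetical stand-in for a deterministic, reversible protect function.
_PROTECT = str.maketrans("0123456789", "3790581462")
_UNPROTECT = str.maketrans("3790581462", "0123456789")


def protect(value: str) -> str:
    return value.translate(_PROTECT)


def unprotect(value: str) -> str:
    return value.translate(_UNPROTECT)


def map_phase(records):
    """Emit (protected customer id, amount) pairs, one per input record."""
    for customer_id, amount in records:
        yield protect(customer_id), amount


def reduce_phase(pairs):
    """Sum amounts per protected key; clear customer ids are never needed."""
    totals = defaultdict(int)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)


records = [("1001", 25), ("2002", 40), ("1001", 15)]
totals = reduce_phase(map_phase(records))
```

Because the stand-in is deterministic, equal inputs map to equal protected values, so grouping and aggregation still work on the protected field. An authorized consumer could call `unprotect` on the keys afterwards.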

Protection in Pig Jobs

Protection in Pig jobs is done by Protegrity Pig UDFs, which are similar in function to the Protegrity UDFs in Hive. For more information on the list of available UDFs, refer to Pig UDFs.

Protection in HBase

HBase is a database that provides random, real-time read and write access to tables consisting of rows and columns. HBase is designed to run on commodity servers and to scale automatically as more servers are added, and it is fault tolerant because data is divided across servers in the cluster. HBase tables are partitioned into multiple regions. Each region stores a range of rows in the table. Regions contain a datastore in memory and a persistent datastore (HFile). The HBase Master assigns multiple regions to a region server. The Master manages the cluster, and the region servers store portions of the HBase tables and perform the work on the data.
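
As an illustration of how a row maps to the region that stores it, here is a minimal sketch in Python. The start keys and server names are made up, and string comparison is used as a simplification of the byte-wise row key ordering that HBase uses.

```python
import bisect

# Hypothetical layout: each region covers [start key, next start key).
REGION_START_KEYS = ["", "g", "n", "t"]         # sorted; "" is the table start
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]   # server hosting each region


def locate(row_key: str) -> tuple[int, str]:
    """Return (region index, region server) for the region holding row_key."""
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]


apple = locate("apple")
grape = locate("grape")
zebra = locate("zebra")
```

Note that one region server (`rs1` here) can host several regions, which matches the description above.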

The Protegrity HBase protector extends the functionality of the data storage framework. It provides transparent data protection and unprotection using coprocessors, which make it possible to run code directly on the region servers. The Protegrity coprocessor for HBase runs on the region servers and protects the data stored in the servers. All clients that work with HBase are supported. The data is transparently protected or unprotected, as required, utilizing the coprocessor framework.
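
The transparency described above can be pictured with a small Python analogy: hooks that run on the storage side for every write and read, so that clients call plain `put` and `get`. This is not the HBase coprocessor API (which is Java) and not Protegrity code. The `Table` class, the hook names, the digit substitution, and the `AUTHORIZED_USERS` set are all assumptions for illustration.

```python
# Hypothetical stand-in for protect/unprotect: a fixed digit substitution.
_PROTECT = str.maketrans("0123456789", "3790581462")
_UNPROTECT = str.maketrans("3790581462", "0123456789")
AUTHORIZED_USERS = {"alice"}  # hypothetical policy decision


def protect_on_write(value: str) -> str:
    return value.translate(_PROTECT)


def unprotect_on_read(stored: str, user: str) -> str:
    # Only authorized users get the clear value back.
    if user in AUTHORIZED_USERS:
        return stored.translate(_UNPROTECT)
    return stored


class Table:
    """Toy key-value table with optional write and read hooks."""

    def __init__(self, on_put=None, on_get=None):
        self._rows = {}
        self._on_put = on_put or (lambda value: value)
        self._on_get = on_get or (lambda value, user: value)

    def put(self, key, value):
        self._rows[key] = self._on_put(value)

    def get(self, key, user):
        return self._on_get(self._rows[key], user)

    def raw(self, key):
        """What is physically stored, bypassing the hooks."""
        return self._rows[key]


table = Table(on_put=protect_on_write, on_get=unprotect_on_read)
table.put("row1", "4111")
```

The client code never calls a protect function itself. What differs per user is only the result of `table.get`.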

Protection in Impala

Impala is a massively parallel processing (MPP) SQL query engine for querying the data stored in a cluster. It provides the flexibility of the SQL format and is capable of running queries on data stored in HDFS or HBase. The Protegrity Impala protector extends the functionality of the Impala query engine and provides UDFs that protect or unprotect the data as it is stored or retrieved. For more information about the Impala protector, refer to Impala UDFs.

Protection in Spark

Spark is an execution engine that carries out batch processing of jobs in memory and handles a wide range of computational workloads. In addition to processing a batch of stored data, Spark is capable of manipulating data in real time. You can also utilize Spark Streaming to process live data streams and store the processed data in Hadoop.

The Protegrity Spark Java protector extends the functionality of the Spark engine and provides Java APIs that protect, unprotect, or reprotect the data as it is stored or retrieved. The Protegrity Spark SQL protector provides native UDFs that can be utilized with Spark Scala to protect, unprotect, or reprotect the data as it is stored or retrieved. For more information about the Spark Java and SQL protectors, refer to section Spark.

You can create and submit Spark jobs using the methods listed in the following table.

Create and submit Spark jobs using	Reference Section
Spark Java APIs	Spark Java
Spark SQL UDFs	Spark SQL
PySpark Scala Wrapper UDFs	PySpark Scala Wrapper UDFs

Ingesting Data Securely

The methods by which data can be secured and ingested by various jobs in Hadoop at a field or file level are explained below.

Ingesting Files Using Hive Staging

Semi-structured data files can be loaded into a Hive staging table for ingestion into a Hive table with Hive queries and Protegrity UDFs. After loading data in the table, the data will be stored in protected form.
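
The staging flow can be sketched in Python as follows, with a small delimited text standing in for the input file and lists of dictionaries standing in for the staging and target tables. The `protect` function is a hypothetical stand-in (a fixed digit substitution), not a Protegrity UDF, and the column names are invented.

```python
import csv
import io

# Hypothetical input file content.
STAGING_DATA = "id,name,ssn\n1,Ann,123456789\n2,Bob,987654321\n"

# Hypothetical stand-in for a protect UDF.
_PROTECT = str.maketrans("0123456789", "3790581462")


def protect(value: str) -> str:
    return value.translate(_PROTECT)


def load_staging(text: str) -> list[dict]:
    """Load the raw file into a staging 'table' without changing values."""
    return list(csv.DictReader(io.StringIO(text)))


def ingest(staging_rows: list[dict], protected_columns: set) -> list[dict]:
    """Copy staging rows to the target, protecting the selected columns."""
    return [
        {
            column: protect(value) if column in protected_columns else value
            for column, value in row.items()
        }
        for row in staging_rows
    ]


staging = load_staging(STAGING_DATA)
target = ingest(staging, {"ssn"})
```

Only the `ssn` column changes between `staging` and `target`. In a real deployment the staging data would typically be removed once the protected table is populated, which is an operational choice outside this sketch.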

Data Security Policy and Protection Methods

A data security policy establishes processes to ensure the security and confidentiality of sensitive information. In addition, the data security policy establishes administrative and technical safeguards against unauthorized access or use of the sensitive information. Depending on the requirements, the data security policy typically performs the following functions:

Classifies the data that is sensitive for the organization.
Defines the methods to protect sensitive data, such as encryption and tokenization.
Defines the methods to present the sensitive data, such as masking the display of sensitive information.
Defines the access privileges of the users that would be able to access the data.
Defines the time frame for privileged users to access the sensitive data.
Enforces the security policies at the location where sensitive data is stored.
Provides a means of auditing authorized and unauthorized accesses to the sensitive data. In addition, it can also provide a means of auditing operations to protect and unprotect the sensitive data.

The data security policy contains a number of components, such as data elements, datastores, member sources, masks, and roles. The following list describes the functions of each of these entities:

Data elements define the data protection properties for protecting sensitive data, consisting of the data securing method, data element type, and its description. In addition, data elements describe the tokenization or encryption properties, which can be associated with roles.
Datastores consist of enterprise systems, which might contain the data that needs to be processed, where the policy is deployed and the data protection function is utilized.
Member sources are the external sources from which users (or members) and groups of users are accessed. Examples are a file, database, LDAP, and Active Directory.
Masks are patterns of symbols and characters that, when imposed on a data field, obscure its actual value from the user. Masks effectively aid in hiding sensitive data.
Roles define the levels of member access that are appropriate for various types of information. Combined with a data element, roles determine and define the unique data access privileges for each member.
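
How data elements, roles, and masks combine into an access decision can be sketched in Python as follows. The class names, the permission values `"unprotect"` and `"mask"`, and the `read` function are hypothetical and chosen only for illustration. They do not reflect how policies are actually modeled or enforced in ESA.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataElement:
    name: str
    method: str  # for example "tokenization" or "encryption"


@dataclass
class Role:
    name: str
    members: set
    # data element name -> "unprotect" or "mask"; absent means no access
    permissions: dict = field(default_factory=dict)


def apply_mask(value: str, visible: int = 4, symbol: str = "*") -> str:
    """Hide all but the last `visible` characters of the value."""
    hidden = max(len(value) - visible, 0)
    return symbol * hidden + value[hidden:]


def read(user, element, clear_value, protected_value, roles):
    """Return what `user` may see for `element` under the given roles."""
    for role in roles:
        if user in role.members:
            action = role.permissions.get(element.name)
            if action == "unprotect":
                return clear_value
            if action == "mask":
                return apply_mask(clear_value)
    return protected_value  # no matching permission: data stays protected


ssn = DataElement("ssn", "tokenization")
roles = [
    Role("analyst", {"alice"}, {"ssn": "unprotect"}),
    Role("support", {"bob"}, {"ssn": "mask"}),
]
seen_by_alice = read("alice", ssn, "123456789", "790581462", roles)
seen_by_bob = read("bob", ssn, "123456789", "790581462", roles)
seen_by_carol = read("carol", ssn, "123456789", "790581462", roles)
```

In this sketch the same stored value yields three different views: the clear value for the analyst role, a masked value for the support role, and the protected value for a user with no role.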

For more information about creating a policy, refer to Creating a Structured Policy.

Last modified : January 20, 2026
