The MaskingAlgorithm Java interface

Any Java class that should be recognized as a masking algorithm (whether standalone or configurable) must implement the MaskingAlgorithm interface. This interface is parameterized with the data type that the algorithm masks, which defines the input and output data type of the mask method. The full details of this interface are described in the masking plugin API Java document.

Core data types

The Delphix Continuous Compliance Engine is designed to support a wide and extensible set of data sources, which naturally encode data in a variety of different formats. To simplify algorithm development while maintaining the ability to mask data from many sources, a core set of data formats has been identified as likely to require different masking treatment, and the Extensible Algorithm framework ensures that all data is converted to and from these types (as needed). These types define the allowed parameterization of the MaskingAlgorithm Java interface.

Each masking algorithm class is defined to mask exactly one of the following data types:

Binary data - java.nio.ByteBuffer
String data - java.lang.String
Numeric data - java.math.BigDecimal
Date time data - java.time.LocalDateTime
Multi-column data - com.delphix.masking.api.plugin.utils.GenericDataRow (See Multi-Column Masking section)

Each algorithm is expected to input, process, and emit objects of one of the above Java types, but is free to use any intermediate types (as needed) to access library methods. Data of one type is frequently stored in databases or documents in a type other than its most natural native type (e.g., dates stored in VARCHAR fields, or numbers stored as text in a CSV file), so the masking framework that executes these algorithms is capable of performing a number of automatic type conversions, detailed in the next section. This allows algorithms written to process one data type to handle data of other types, with no additional work required of the algorithm author.

Supported automatic type conversions

Algorithm native type	Supported type	Notes
ByteBuffer	String	Algorithm receives the UTF-8 encoded value of the String and is expected to return a valid UTF-8 ByteBuffer.
LocalDateTime	String	The correct date format must be assigned to the field or column in the masking inventory.
LocalDateTime	Compatible numeric types	A compatible date format, such as yyyyMMdd, must be assigned to the column in inventory.
BigDecimal	All numeric types	Converted to BigDecimal. Values out of range (after masking) are truncated to fit the range of the underlying type.
BigDecimal	String	String value is converted to a number.
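
The first row of the table implies that a ByteBuffer algorithm may be handed the UTF-8 bytes of a String and should hand back bytes that decode as valid UTF-8. The following is a minimal sketch of that round trip using only JDK classes. The class name Utf8BufferSketch and the masking rule (replace ASCII digits with 'X') are illustrative assumptions and are not part of the SDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class Utf8BufferSketch {
    // Hypothetical masking rule: replace every ASCII digit with 'X'.
    static ByteBuffer maskDigits(ByteBuffer input) {
        if (input == null) {
            return null;
        }
        // duplicate() shares the bytes but has its own position, so decoding
        // does not consume the caller's buffer.
        String text = StandardCharsets.UTF_8.decode(input.duplicate()).toString();
        String masked = text.replaceAll("[0-9]", "X");
        // Re-encoding a String as UTF-8 always yields a valid UTF-8 byte sequence.
        return ByteBuffer.wrap(masked.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "\u00e9" is a two-byte character in UTF-8, so byte-wise edits would be unsafe.
        ByteBuffer in = ByteBuffer.wrap("caf\u00e9 42".getBytes(StandardCharsets.UTF_8));
        ByteBuffer out = maskDigits(in);
        System.out.println(StandardCharsets.UTF_8.decode(out));
    }
}
```

Decoding to a String before masking is one way to avoid cutting a multi-byte character in half; an algorithm that edits bytes directly would need to preserve UTF-8 validity itself.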

Special case values

In order to allow algorithms to implement special handling for null, empty, and special case values, these values are presented to the masking algorithm unmodified. Algorithms should be prepared to process the full range of input values possible for the input type. In practice, this means that most mask method implementations will begin with a null check on the input value prior to attempting to use the input – for example, calling input.length() or similar. It is perfectly acceptable and common to return null in the case where the mask input is null.
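
The null check described above might look like the following sketch. StringMasker is a simplified stand-in declared here only so that the example compiles without the SDK; a real algorithm would implement the MaskingAlgorithm interface parameterized with String instead. The class name and the masking rule are made up for illustration.

```java
// Stand-in for the SDK interface so this sketch compiles on its own.
// A real algorithm would implement MaskingAlgorithm<String> instead.
interface StringMasker {
    String mask(String input);
}

final class RedactAllButLast implements StringMasker {
    @Override
    public String mask(String input) {
        // Null reaches the algorithm unmodified; return it rather than dereference it.
        if (input == null) {
            return null;
        }
        // Empty strings also reach the algorithm; this rule leaves them as they are.
        if (input.isEmpty()) {
            return input;
        }
        // Hypothetical rule: hide every character except the last one.
        int last = input.length() - 1;
        return "*".repeat(last) + input.charAt(last);
    }

    public static void main(String[] args) {
        StringMasker masker = new RedactAllButLast();
        System.out.println(masker.mask("secret"));
        System.out.println(masker.mask(null));
    }
}
```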

Method overview

This section provides a high-level overview of the methods in the MaskingAlgorithm interface. For complete details, consult the masking plugin API Java document included in the Algorithm SDK archive.

getName and getDescription - These methods are used to determine the name and description of frameworks and algorithm instances included in the plugin. For user-created instances, these methods are never called.
getDefaultInstances and getAllowFurtherInstances - These methods control the set of instances of the algorithm framework that are defined by the plugin, and whether the user should be allowed to create additional instances.
validate - This method is called after configuration is applied to allow the algorithm class to check whether the injected configuration is valid.
setup and tearDown - These methods are called before the algorithm object is used for masking, and after, respectively. Typically, any resources, such as input files, are acquired during setup and released during tearDown.
mask - This is the method that does the actual data masking in the algorithm class. The input and output values are parameterized for type safety as described above.
maskBatch - This method is called to perform masking in situations when it is possible for the caller to build a collection of input values to mask in a single method call. A default implementation is provided that simply calls the mask method on each value in the batch.
listMultiColumnFields - This method needs to be implemented only for Multi-Column Algorithms. It returns a list of AlgorithmLogicalField objects that define the set of fields that the multi-column algorithm masks.

The following methods are available but deprecated:

listMaskedFields - This method applies only to Multi-Column Algorithms. It returns a map of field names (String) to Core Data Types. Implement listMultiColumnFields instead.
listReadOnlyFields - Similar to listMaskedFields but optional for Multi-Column Algorithms. Fields returned by this method are read-only and cannot be changed. Implement listMultiColumnFields instead.

The life cycles of algorithm objects

The Extensibility framework uses objects of classes implementing the MaskingAlgorithm interface for several distinct purposes. These object life cycles are as follows:

Plugin discovery

This occurs when the extensibility framework evaluates the capabilities present in a MaskingAlgorithm class.

Java object creation - an object of the algorithm class is created
getName - determines framework name
getDescription - determines framework description
getDefaultInstances - determines all plugin-provided algorithm instances. For each instance:
1. getName - determines instance name
2. getDescription - determines instance description
3. validate - ensure object passes validation
4. Serialize configurable fields - these are saved as a JSON document defining the instance's configuration
5. Disposal - the Java object is discarded
getAllowFurtherInstances - determines whether the framework is visible in the algorithm/framework API endpoint
Disposal - the Java object is discarded

User algorithm creation

This life cycle occurs whenever a user attempts to create a new instance of a plugin algorithm framework. The algorithm definition is saved only if each step succeeds.

Java object creation - an object of the algorithm class is created
Configuration injection - the values in the user-provided JSON document are injected into the object
validate - the object's validate method is called
Disposal - the Java object is discarded

The setup method is not executed when a user-defined instance is created.

Algorithm use

This is the life cycle of an algorithm object when used to mask data.

Java object creation - an object of the algorithm class is created
Configuration injection - the saved JSON document defining this instance is injected in the object
setup - the setup method is called once
mask - the mask method is called on each value to be masked
tearDown - the tearDown method is called once
Disposal - the Java object is discarded
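
The order of calls in this life cycle can be sketched as below. LifecycleSketchAlgorithm is a simplified stand-in, not the SDK interface: configuration injection is skipped, and the ComponentService object that is available during setup is omitted. LookupAlgorithm, its replacement values, and the LifecycleSketch driver are hypothetical names used only to make the sequence visible.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the SDK interface (assumption for this sketch).
interface LifecycleSketchAlgorithm {
    void setup();
    String mask(String input);
    void tearDown();
}

final class LookupAlgorithm implements LifecycleSketchAlgorithm {
    final List<String> calls = new ArrayList<>(); // records call order for illustration
    private List<String> replacements;            // state acquired during setup

    @Override
    public void setup() {
        calls.add("setup");
        // Stands in for acquiring a resource such as an input file.
        replacements = List.of("alpha", "beta", "gamma");
    }

    @Override
    public String mask(String input) {
        calls.add("mask");
        if (input == null) {
            return null;
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        return replacements.get(Math.floorMod(input.hashCode(), replacements.size()));
    }

    @Override
    public void tearDown() {
        calls.add("tearDown");
        replacements = null; // release what setup acquired
    }
}

final class LifecycleSketch {
    // Mimics the call order of the "Algorithm use" life cycle.
    static List<String> run(LookupAlgorithm algorithm, List<String> values) {
        List<String> masked = new ArrayList<>();
        algorithm.setup();                     // called once
        for (String value : values) {
            masked.add(algorithm.mask(value)); // called for each value
        }
        algorithm.tearDown();                  // called once
        return masked;
    }

    public static void main(String[] args) {
        LookupAlgorithm algorithm = new LookupAlgorithm();
        System.out.println(run(algorithm, List.of("a", "b")));
        System.out.println(algorithm.calls);
    }
}
```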

A distinct Java object is created for each application of a masking algorithm during Job execution. For algorithms that create or load a large amount of state, this can result in significant memory usage, because redundant data is stored for each instance. This can be avoided by using a class-level static cache to store the data; the instance name, which can be retrieved during setup from the ComponentService interface object, can be used as the access key for data cached in this way.
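
A class-level cache keyed by instance name could be sketched as follows, using only JDK classes. SharedLookupCache, lookupFor, and loadExpensiveData are hypothetical names; in a real algorithm the instance name would come from the ComponentService object during setup, whereas here it is passed in as a plain String.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedLookupCache {
    // One map per class, shared by every algorithm object and keyed by instance name.
    private static final Map<String, List<String>> CACHE = new ConcurrentHashMap<>();
    // Counts loads, only to show that loading happens once per instance name.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical helper that an algorithm's setup method could call.
    static List<String> lookupFor(String instanceName) {
        // computeIfAbsent runs the loader at most once per key, even under concurrency.
        return CACHE.computeIfAbsent(instanceName, name -> {
            LOADS.incrementAndGet();
            return loadExpensiveData(name);
        });
    }

    // Stands in for reading a large file or building a large table.
    private static List<String> loadExpensiveData(String name) {
        return List.of(name + "-1", name + "-2");
    }

    public static void main(String[] args) {
        List<String> first = lookupFor("first-names");
        List<String> second = lookupFor("first-names"); // served from the cache
        System.out.println(first == second);
        System.out.println(LOADS.get());
    }
}
```

Entries in this sketch are never evicted and the cached lists are immutable; an implementation whose configuration can change between jobs would need its own invalidation strategy.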

Multi-column masking

It is possible to write an algorithm whose masking of one column depends on the values of other columns. To account for the different possible data types, such algorithms use an object called a GenericDataRow.

Generic data

A GenericDataRow is a map of field names (String) to GenericData objects. Each GenericData object contains the value, along with methods to return the respective typed object. When accessing the value from a GenericData object, it is necessary to read it as a Core Data Type. To do so, use one of the following methods:

getStringValue()
getBigDecimalValue()
getLocalDateTimeValue()
getByteBufferValue()

Once the value has been masked, write it back by calling setValue, passing the masked value as a Core Data Type.
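
The read, mask, and write-back pattern might look like the sketch below. SketchData is a stand-in for GenericData, and a plain Map stands in for GenericDataRow; only the method names getStringValue and setValue come from the text, and everything else (constructors, the field names "country" and "phone", the masking rule) is assumed for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for GenericData: holds one value and exposes typed access.
final class SketchData {
    private Object value;

    SketchData(Object value) {
        this.value = value;
    }

    String getStringValue() {
        return value == null ? null : value.toString();
    }

    void setValue(Object newValue) {
        this.value = newValue;
    }
}

final class MultiColumnSketch {
    // Hypothetical rule: mask "phone" differently depending on "country".
    static void mask(Map<String, SketchData> row) {
        String country = row.get("country").getStringValue(); // read, never modified
        SketchData phone = row.get("phone");
        String original = phone.getStringValue();
        if (original == null) {
            return; // leave null values untouched
        }
        String prefix = "US".equals(country) ? "+1-" : "+00-";
        // Write the masked value back as a Core Data Type (String).
        phone.setValue(prefix + "X".repeat(original.length()));
    }

    public static void main(String[] args) {
        Map<String, SketchData> row = new HashMap<>();
        row.put("country", new SketchData("US"));
        row.put("phone", new SketchData("5551234"));
        mask(row);
        System.out.println(row.get("phone").getStringValue());
    }
}
```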

Batch masking

Batch masking is a feature that can improve algorithm performance significantly when high-latency operations are employed as part of the masking process. Accessing an external resource like a database or API introduces significant execution latency compared to executing Java code; batching incurs only a single round-trip latency while masking many values. Batching also allows the interchange of values between data rows during masking.

Batch masking support in jobs

Batching is currently supported for these job types:

All Database masking jobs
Delimited File jobs
Fixed-Width File jobs

Batch size is equal to the job's Row Limit divided by 5, or equal to 2000 when the Row Limit is disabled; this is the guaranteed lower bound for batch size, assuming at least that number of inputs are available and no conditional record types are present. The final batch when processing a table or file may be up to twice the normal batch size.
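
As a worked example of this rule, a Row Limit of 10,000 gives a minimum batch size of 10,000 / 5 = 2,000, and the final batch could then hold up to 4,000 values. The sketch below encodes the same arithmetic. BatchSizeSketch and its method names are hypothetical, an empty OptionalInt is used to represent a disabled Row Limit, and integer division is an assumption, since the text does not say how a remainder is handled.

```java
import java.util.OptionalInt;

final class BatchSizeSketch {
    static final int DEFAULT_BATCH_SIZE = 2000; // used when the Row Limit is disabled

    // rowLimit is empty when the job's Row Limit is disabled.
    static int minimumBatchSize(OptionalInt rowLimit) {
        // Integer division is assumed here.
        return rowLimit.isPresent() ? rowLimit.getAsInt() / 5 : DEFAULT_BATCH_SIZE;
    }

    // The final batch of a table or file may be up to twice the normal size.
    static int maximumFinalBatchSize(OptionalInt rowLimit) {
        return 2 * minimumBatchSize(rowLimit);
    }

    public static void main(String[] args) {
        System.out.println(minimumBatchSize(OptionalInt.of(10000)));
        System.out.println(minimumBatchSize(OptionalInt.empty()));
        System.out.println(maximumFinalBatchSize(OptionalInt.of(10000)));
    }
}
```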

For file jobs, the presence of conditional record types will cause batch sizes to be unpredictable, as the availability of records for batch execution will naturally vary based on how many records actually match the criteria for each record type. Algorithms that require a minimum batch size, such as Secure Shuffle, may fail in this case.

Using Batch Masking in an algorithm implementation

An algorithm implementation can customize how batches of values are masked by overriding the maskBatch method in the MaskingAlgorithm interface. There is no reason to implement this method unless there is a benefit to processing multiple values in a single operation. A common example of this is when the algorithm is accessing an external API to perform masking; in this case, masking multiple inputs per method call allows the access latency of the API to be incurred only once for the entire batch of inputs.

The maskBatch method is called with a MaskingBatch object parameterized by the same Java type used in the MaskingAlgorithm interface definition. The MaskingBatch object provides the following methods to facilitate masking:

size - returns the size of the batch of values
getValue - returns the value to be masked at a particular index in the batch
setValue - sets the mask result at a particular index in the batch
setError - indicates that an error occurred when masking the input value at a particular index in the batch

The default implementation of maskBatch in the MaskingAlgorithm interface provides a simple example of how to use these methods.

The masking engine does not always use the maskBatch method or create batches with size greater than 1. Batch masking is only supported for some job configurations, so it is critical that the mask method also be implemented for all algorithms. It is strongly recommended that the mask and maskBatch methods be implemented to produce the same mask results given the same inputs.
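
A per-value loop over the batch methods listed above, which keeps mask and maskBatch consistent by construction, might be sketched as follows. SketchBatch is a stand-in for MaskingBatch: the four method names come from the text, but their parameter lists are assumptions, as is the choice to record a thrown exception through setError. ListBatch, BatchSketchAlgorithm, and the masking rule are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for MaskingBatch<T>; parameter lists are assumed for this sketch.
interface SketchBatch<T> {
    int size();
    T getValue(int index);
    void setValue(int index, T masked);
    void setError(int index, Exception error);
}

// Simple list-backed batch, used here only to exercise the sketch.
final class ListBatch implements SketchBatch<String> {
    final List<String> values;
    final List<Exception> errors = new ArrayList<>();

    ListBatch(List<String> inputs) {
        values = new ArrayList<>(inputs);
        for (int i = 0; i < inputs.size(); i++) {
            errors.add(null); // no error recorded yet
        }
    }

    @Override public int size() { return values.size(); }
    @Override public String getValue(int index) { return values.get(index); }
    @Override public void setValue(int index, String masked) { values.set(index, masked); }
    @Override public void setError(int index, Exception error) { errors.set(index, error); }
}

final class BatchSketchAlgorithm {
    // Hypothetical masking rule: reverse the characters.
    String mask(String input) {
        if (input == null) {
            return null;
        }
        return new StringBuilder(input).reverse().toString();
    }

    // Delegating to mask guarantees both paths give the same result per input.
    void maskBatch(SketchBatch<String> batch) {
        for (int i = 0; i < batch.size(); i++) {
            try {
                batch.setValue(i, mask(batch.getValue(i)));
            } catch (RuntimeException e) {
                batch.setError(i, e); // assumed use of setError
            }
        }
    }

    public static void main(String[] args) {
        ListBatch batch = new ListBatch(Arrays.asList("abc", null, "xy"));
        new BatchSketchAlgorithm().maskBatch(batch);
        System.out.println(batch.values);
    }
}
```

An algorithm that calls an external service would replace the loop body with one request for the whole batch, while keeping mask available for jobs that do not batch.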