The MaskingAlgorithm Java Interface
Any Java class that should be recognized as a masking algorithm (whether stand-alone or configurable) must implement the MaskingAlgorithm interface. This interface is parameterized with the data type the algorithm masks, which defines the input and output data type of the mask method. The full details of this interface are described in the Masking Plugin API Javadoc.
Core Data Types
The Delphix Masking Engine is designed to support a wide and extensible set of data sources, which naturally encode data in a variety of different formats. In order to simplify algorithm development, while maintaining the ability to mask data from many sources, we've identified a core set of data formats which are likely to require different masking treatment and ensured that the Extensible Algorithm framework converts all data to/from these types as needed. These types define the allowed parameterization of the MaskingAlgorithm Java interface.
Each masking algorithm class is defined to mask exactly one of the following data types:
Binary data - java.nio.ByteBuffer
String data - java.lang.String
Numeric data - java.math.BigDecimal
Date time data - java.time.LocalDateTime
Multi-column data - com.delphix.masking.api.plugin.utils.GenericDataRow (See Multi-Column Masking section)
Each algorithm is expected to input, process, and emit objects of one of the above Java types, but is free to use any intermediate types as needed to access library methods. Data of one type is frequently stored in databases or documents in a type other than its most natural native type (e.g., dates stored in VARCHAR fields, or numbers stored as text in a CSV file). For this reason, the masking framework that executes these algorithms is capable of performing a number of automatic type conversions, detailed in the next section. This allows algorithms written to process one data type to handle data of other types, with no additional work required of the algorithm author.
Supported Automatic Type Conversions
Algorithm Native Type | Supported Type | Notes |
---|---|---|
ByteBuffer | String | Algorithm receives the UTF-8 encoded value of the String and is expected to return a valid UTF-8 ByteBuffer. |
LocalDateTime | String | The correct date format must be assigned to the field or column in the masking inventory. |
LocalDateTime | Compatible numeric types | A compatible date format, such as yyyyMMdd, must be assigned to the column in inventory. |
BigDecimal | All numeric types | Upconverted to BigDecimal. Out of range values after masking are truncated to fit the range of the underlying type. |
BigDecimal | String | String value is converted to a number. |
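The ByteBuffer/String row above can be illustrated with the standard library alone. The following is a minimal sketch, not the plugin API: `Utf8BufferExample` and `maskBytes` are hypothetical names standing in for the body of a mask method in a ByteBuffer algorithm, and the digit-replacement rule is only a placeholder for real masking logic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Masking Plugin API.
final class Utf8BufferExample {

    // Replaces every ASCII digit with 'X' and returns a valid UTF-8 buffer.
    static ByteBuffer maskBytes(ByteBuffer input) {
        if (input == null) {
            return null;
        }
        // Decode a duplicate so the caller's buffer position is left untouched.
        String text = StandardCharsets.UTF_8.decode(input.duplicate()).toString();
        String masked = text.replaceAll("[0-9]", "X");
        // Re-encoding a Java String always yields well-formed UTF-8.
        return StandardCharsets.UTF_8.encode(masked);
    }
}
```

Decoding to a String before masking and re-encoding afterwards is one simple way to guarantee that the returned buffer is valid UTF-8, at the cost of an extra copy.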
Special Case Values
In order to allow algorithms to implement special handling for null, empty, and special case values, these values are presented to the masking algorithm unmodified. Algorithms should be prepared to process the full range of input values possible for the input type. In practice, this means that most mask method implementations will begin with a null check on the input value, prior to attempting to use the input - for example, by calling input.length() or similar. It is perfectly acceptable and commonplace to return null in the case where the mask input is null.
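As an illustration of the null check described above, here is a minimal sketch. `SimpleMasker` is a hypothetical stand-in reduced to a single method, not the real MaskingAlgorithm interface, and the redaction rule is a placeholder.

```java
// Hypothetical stand-in for the real interface, reduced to the one method used here.
interface SimpleMasker<T> {
    T mask(T input);
}

final class NullSafeRedactor implements SimpleMasker<String> {

    @Override
    public String mask(String input) {
        // Null arrives unmodified, so guard before calling input.length().
        if (input == null) {
            return null;
        }
        // Empty strings also arrive unmodified; this sketch returns them as-is.
        if (input.isEmpty()) {
            return input;
        }
        // Placeholder masking rule: one '*' per input character.
        return "*".repeat(input.length());
    }
}
```

Whether empty strings should be passed through or masked is a per-algorithm decision; the sketch only shows where that decision is made.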
Method Overview
This section provides a high-level overview of the methods in the MaskingAlgorithm interface. For complete details, consult the Masking Plugin API Javadoc included in the Algorithm SDK archive.
getName and getDescription - These methods are used to determine the name and description of frameworks and algorithm instances included in the plugin. For user-created instances, these methods are never called.
getDefaultInstances and getAllowFurtherInstances - These methods control the set of instances of the algorithm framework that are defined by the plugin, and whether the user should be allowed to create additional instances.
validate - This method is called after configuration is applied to allow the algorithm class to check whether the injected configuration is valid.
setup and tearDown - These methods are called before the algorithm object is used for masking, and after, respectively. Typically, any resources, such as input files, are acquired during setup and released during tearDown.
mask - This is the method that does the actual data masking in the algorithm class. The input and output values are parameterized for type safety as described above.
maskBatch - This method is called to perform masking in situations when it is possible for the caller to build a collection of input values to mask in a single method call. A default implementation is provided that simply calls the mask method on each value in the batch.
listMultiColumnFields - This method needs to be implemented only for Multi-Column Algorithms. It returns a list of AlgorithmLogicalField objects that define the set of fields that the multi-column algorithm masks.
The following methods are available but deprecated:
listMaskedFields - This method needs to be implemented only for Multi-Column Algorithms. It returns a map of field names (String) to the Core Data Type. Implement listMultiColumnFields instead.
listReadOnlyFields - Similar to listMaskedFields, but optional for Multi-Column Algorithms. Fields returned by this method are read-only and cannot be changed. Implement listMultiColumnFields instead.
The Life Cycles of Algorithm Objects
The Extensibility framework uses objects of classes implementing the MaskingAlgorithm interface for several distinct purposes. These object life cycles are as follows:
Plugin Discovery
This occurs when the extensibility framework evaluates the capabilities present in a MaskingAlgorithm class.
Java object creation - an object of the algorithm class is created
getName - determines framework name
getDescription - determines framework description
getDefaultInstances - determines all plugin-provided algorithm instances. For each instance:
getName - determines instance name
getDescription - determines instance description
validate - ensure object passes validation
Serialize configurable fields - these are saved as a JSON document defining the instance's configuration
Disposal - the Java object is discarded
getAllowFurtherInstances - determines whether the framework is visible in the algorithm/framework API endpoint
Disposal - the Java object is discarded
User Algorithm Creation
This life cycle occurs whenever a user attempts to create a new instance of a plugin algorithm framework. The algorithm definition is saved only if each step succeeds.
Java object creation - an object of the algorithm class is created
Configuration injection - the values in the user-provided JSON document are injected into the object
validate - the object's validate method is called
Disposal - the Java object is discarded
The setup method is not executed when a user-defined instance is created.
Algorithm Use
This is the life cycle of an algorithm object when used to mask data.
Java object creation - an object of the algorithm class is created
Configuration injection - the saved JSON document defining this instance is injected in the object
setup - the setup method is called once
mask - the mask method is called on each value to be masked
tearDown - the tearDown method is called once
Disposal - the Java object is discarded
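The call order in the list above can be sketched as a small driver. Everything here is illustrative: `LifecycleAlgorithm` is a hypothetical, simplified stand-in (the real setup method receives a service object, and configuration injection is omitted), and `LifecycleDriver` only mimics what the framework does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-in for the real interface.
interface LifecycleAlgorithm<T> {
    void setup();
    T mask(T input);
    void tearDown();
}

// Example algorithm that records each call so the order can be inspected.
final class UpperCaseAlgorithm implements LifecycleAlgorithm<String> {
    final List<String> calls = new ArrayList<>();

    @Override
    public void setup() {
        calls.add("setup");
    }

    @Override
    public String mask(String input) {
        calls.add("mask");
        return input == null ? null : input.toUpperCase(Locale.ROOT);
    }

    @Override
    public void tearDown() {
        calls.add("tearDown");
    }
}

// Mimics the framework: setup once, mask each value, tearDown once.
final class LifecycleDriver {
    static <T> List<T> run(LifecycleAlgorithm<T> algorithm, List<T> values) {
        List<T> results = new ArrayList<>();
        algorithm.setup();
        for (T value : values) {
            results.add(algorithm.mask(value));
        }
        algorithm.tearDown();
        return results;
    }
}
```

Running the driver over two values leaves `calls` as setup, mask, mask, tearDown, which matches the order listed above.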
It should be noted that a distinct Java object is created for each application of a masking algorithm during Job execution. For algorithms that create or load a large amount of state, this can result in significant memory usage storing redundant data for each instance. This can be avoided using a class level static cache to store data; the instance name, which can be retrieved during setup from the ComponentService interface object, can be used as an access key for data cached in this way.
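A class-level static cache of this kind might look like the following sketch. `LookupCache` and `getOrLoad` are hypothetical names, and the instance name is passed in as a plain String; how it is obtained from the ComponentService object is left to the Javadoc.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper: shares loaded data between objects of the same algorithm instance.
final class LookupCache {

    // Static, so all algorithm objects created in this JVM see the same map.
    private static final Map<String, List<String>> CACHE = new ConcurrentHashMap<>();

    // Runs the loader only the first time a given instance name is requested.
    static List<String> getOrLoad(String instanceName, Supplier<List<String>> loader) {
        return CACHE.computeIfAbsent(instanceName, name -> loader.get());
    }
}
```

An algorithm would call `getOrLoad` from its setup method with a loader that reads its lookup file. Note that this sketch never evicts entries, so a changed instance configuration would still see stale data until the cache is cleared.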
Multi-Column Masking
It is possible to write an algorithm that masks data based on the values of other columns. To account for the different possible data types, we use an object called a GenericDataRow.
Generic Data
A GenericDataRow is a map of field names (String) to GenericData objects. Each GenericData object contains the value, along with methods to return the respective typed object. When accessing the value from a GenericData object, it will be necessary to read it into a Core Data Type. To do so, use one of the following methods:
getStringValue()
getBigDecimalValue()
getLocalDateTimeValue()
getByteBufferValue()
Once the value has been masked it should be re-set by calling setValue and passing as an argument the value as a Core Data Type.
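The read, mask, and re-set pattern can be sketched as follows. This does not use the real classes: `FieldValue` is a hypothetical stand-in for GenericData, a plain Map stands in for GenericDataRow, and the field names "country" and "zip" are invented for the example.

```java
import java.util.Map;

// Hypothetical stand-in for GenericData: holds one value and exposes a typed getter.
final class FieldValue {
    private Object value;

    FieldValue(Object value) {
        this.value = value;
    }

    String getStringValue() {
        return (String) value;
    }

    void setValue(Object newValue) {
        this.value = newValue;
    }
}

final class MultiColumnSketch {

    // Masks "zip" only when "country" is "US"; other rows are left untouched.
    static Map<String, FieldValue> mask(Map<String, FieldValue> row) {
        // Read both fields into a Core Data Type (String here).
        String country = row.get("country").getStringValue();
        FieldValue zip = row.get("zip");
        String zipText = zip.getStringValue();
        if ("US".equals(country) && zipText != null) {
            // Re-set the masked value on the same field object.
            zip.setValue("00000");
        }
        return row;
    }
}
```

In this sketch "country" is only read and never changed, which corresponds to a read-only field in a multi-column algorithm.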
Batch Masking
By overriding the maskBatch method in the MaskingAlgorithm interface, an algorithm implementation may increase performance or efficiency in cases where the underlying masking operation may be performed more optimally on multiple values per method call. A common example of this is when the algorithm is accessing an external API to perform masking; in this case, masking multiple inputs per method call allows the access latency of the API to be incurred only once for the entire batch of inputs.
The maskBatch method is called with a MaskingBatch object parameterized by the same Java type used in the MaskingAlgorithm interface definition. The MaskingBatch object provides the following methods to facilitate masking:
size - returns the size of the batch of values
getValue - returns the value to be masked at a particular index in the batch
setValue - sets the mask result at a particular index in the batch
setError - indicates that an error occurred when masking the input value at a particular index in the batch
The default implementation of maskBatch in the MaskingAlgorithm interface provides a simple example of how to use these methods.
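For orientation, a loop over the four batch methods could look roughly like this. It is a sketch, not the actual default implementation: `Batch` is a hypothetical stand-in for MaskingBatch, the parameter types (in particular the Exception argument of setError) are assumptions, and routing per-value exceptions to setError is one possible policy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for MaskingBatch; the real signatures are in the Javadoc.
interface Batch<T> {
    int size();
    T getValue(int index);
    void setValue(int index, T masked);
    void setError(int index, Exception error);
}

// Simple in-memory batch so the sketch can be run on its own.
final class ListBatch<T> implements Batch<T> {
    private final List<T> values;
    private final Map<Integer, Exception> errors = new HashMap<>();

    ListBatch(List<T> values) {
        this.values = new ArrayList<>(values);
    }

    @Override public int size() { return values.size(); }
    @Override public T getValue(int index) { return values.get(index); }
    @Override public void setValue(int index, T masked) { values.set(index, masked); }
    @Override public void setError(int index, Exception error) { errors.put(index, error); }

    List<T> values() { return values; }
    Map<Integer, Exception> errors() { return errors; }
}

final class BatchSketch {

    // Applies the single-value mask to each entry; failures are recorded per index.
    static <T> void maskBatch(Batch<T> batch, UnaryOperator<T> mask) {
        for (int i = 0; i < batch.size(); i++) {
            try {
                batch.setValue(i, mask.apply(batch.getValue(i)));
            } catch (RuntimeException e) {
                batch.setError(i, e);
            }
        }
    }
}
```

An algorithm that calls an external service would replace the per-index loop with a single request covering all values, then write the results back with setValue.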
The masking engine does not always use the maskBatch method, and does not always create batches with a size greater than 1. Batch masking is only supported for some job configurations, so it is critical that the mask method also be implemented for all algorithms. It is strongly recommended that the mask and maskBatch methods be implemented to produce the same mask results given the same inputs.
Batching is currently supported for these job types:
All Database masking jobs
Delimited File masking jobs when no more than one body record type is defined
Fixed-Width File masking jobs when no more than one body record type is defined
Batch size is equal to the job's Row Limit divided by 5, or equal to 2000 when the Row Limit is disabled; this is the guaranteed lower bound for batch size, assuming at least that number of inputs are available. Typically, the size of the final batch in a job will be larger.
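The batch size rule can be written out as a short calculation. This is a sketch under assumptions: `BatchSizeSketch` is a hypothetical name, an empty OptionalInt is used to represent a disabled Row Limit, and integer division is assumed for the rounding, which the text above does not specify.

```java
import java.util.OptionalInt;

// Hypothetical helper illustrating the documented rule; not engine code.
final class BatchSizeSketch {

    // Batch size used when the Row Limit is disabled.
    static final int DEFAULT_BATCH_SIZE = 2000;

    // Empty rowLimit means "Row Limit disabled" (an assumption of this sketch).
    static int minimumBatchSize(OptionalInt rowLimit) {
        if (rowLimit.isEmpty()) {
            return DEFAULT_BATCH_SIZE;
        }
        // Integer division is assumed; the source does not state the rounding rule.
        return rowLimit.getAsInt() / 5;
    }
}
```

For example, a Row Limit of 20000 gives a minimum batch size of 4000, while a disabled Row Limit gives 2000.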