Managing classifiers

Classifier instances define the logic that the ASDD profiler uses to identify sensitive information. For an overview of classifiers and related concepts, refer to Discovering Your Sensitive Data. Each classifier instance is based on a classifier framework that implements the recognition logic.

The classifier concept section provides an overview of the available frameworks. The API Calls for Managing Classifiers article describes how to use the API Client to retrieve a detailed description of all classifier frameworks on the system, along with their configuration schemas.

To view a list of all available classifier instances, navigate to Settings > Classifiers.

The classifiers on the screen can be filtered or sorted by the various informational fields by clicking on the respective fields. More information on grid filtering and sorting can be found here.

Sortable fields are Name, Domain, and Description.
Filterable fields are Name, Domain, Owner, Type, and Description.

Adding a new Classifier

Click the + Classifier button from the top-right corner above the Classifiers grid.
Select a Domain from the Domain dropdown.
- Domains are used by profiling jobs to determine the masking algorithm to apply to the sensitive data. When a classifier is matched, the profiling job will associate the specified domain with the sensitive data.
- The Masking Engine comes out of the box with over 30 pre-defined domains. For more information on domains, refer to the Managing Domains article.
Enter the following information for the classifier:
- Classifier Name to create.
- Description for the classifier (optional).
Select the Classifier Framework to create the classifier with (i.e. Data Type, Regex, Path, or List). The bottom part of the form will change based on the classifier framework selected.
To add a Regex classifier:
- Fill out the Profiling Regex field (mandatory), then click Add. Any fields left unfilled or unselected are added to the table with their default values.
To add a Data Type classifier:
- Choose a Data Type from the dropdown (mandatory), then click Add. Any fields left unfilled or unselected are added to the table with their default values.
To add a List classifier:
- Upload a File (mandatory), then click Add. Any fields left unfilled or unselected are added to the table with their default values.
To add a Path classifier:
- Add a Field Value (mandatory), then click Add. Any fields left unfilled or unselected are added to the table with their default values.
Configurations that have been added to the table can be edited or deleted in place.
- To edit or delete, click the Actions button (…) next to the configuration.
- Clicking Edit copies all the information from the corresponding row into the form on the right, where it can be modified.
Click Save to create the classifier. Users can add multiple configurations of the same framework type to a classifier.

Modifying Classifiers

Users can perform three types of action on this screen by clicking the (…) button in the Actions column of the corresponding row.

View Classifier

When the View action is selected, every field in the dialog is disabled.

Edit Classifier

Clicking the Edit action opens a dialog for editing the classifier. The classifier name and framework type cannot be changed after creation, so these fields are disabled.

Delete Classifier

Clicking the Delete action prompts for confirmation. Click Confirm to delete the classifier.

Configuration considerations for classifiers

Classifier design is more complex than search or type expressions because classifiers offer more flexibility in matching logic and configuration around match strength. Classifiers contain more configuration, typically encompassing all logic of the framework's type for a particular domain. For example, a legacy profile set might have three different column-level search expressions, but these would all be consolidated into a single PATH type classifier. Classifiers also add the notion of rejection strength, which allows the profiling logic to eliminate domains from consideration earlier in the profiling process.

Strength values

Classifiers use a scale of -1.0 to 1.0 to represent match and reject strength, which correspond to the -100 to 100 confidence values used in the UI. The product currently does not display confidence values for non-matches, so only values between 0 and 100 are typically visible.

A value of -1.0 indicates absolute rejection, and the domain in question is immediately eliminated from the set of possible matches. Similarly, a value of 1.0 indicates absolute confirmation, and the domain is assigned as a match without checking any other classifiers for that domain.

When multiple classifiers produce match or reject strength values, those results may be combined to get a final confidence. If the results conflict (that is, they have opposite signs), the result with the highest absolute value takes precedence. If the results have the same sign, they reinforce each other, and the final result for that domain is stronger than either result alone. The exact values and formula applied are under development and may change in the future. Currently, only the strongest column-level result and the strongest data-level result are combined in this fashion.

Examples

A column named "ssn_present" matches a PATH classifier for the SSN domain with a match strength of 0.67. However, the column is boolean type and does not match the TYPE classifier for the SSN domain, which returns a -1.0 result. The verdict is -1.0 and the SSN domain is not assigned.
A column named "passport_no" contains 9-digit numeric values, which match the REGEX classifiers for both the SSN and PASSPORT_NO domains. Both REGEX classifiers return a result of 0.5 for this match. The PATH classifier for the PASSPORT_NO domain also matches and returns a match strength of 0.67, whereas the PATH classifier for the SSN domain does not match and returns 0. The final confidence values are 0.84 for PASSPORT_NO and 0.5 for SSN, so PASSPORT_NO is the best match, and that domain and its associated masking algorithm are assigned to the column.
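
The two examples above can be reproduced with a short sketch. Because the exact combination formula is under development, the same-sign rule used here is an assumption: a "noisy-OR" combination, chosen only because it gives roughly 0.84 for the inputs 0.5 and 0.67. The function name combine and the tie-breaking behavior are hypothetical and are not part of the product.

```python
def combine(column_result, data_result):
    """Combine the strongest column-level and data-level results for one domain.

    Illustrative only: the noisy-OR rule for same-sign results is an
    assumption, not the product's documented formula.
    """
    a, b = column_result, data_result
    if a * b < 0:
        # Conflicting signs: the result with the larger absolute value wins.
        # Tie-breaking is not specified in the text; max() keeps the first argument.
        return max(a, b, key=abs)
    # Same sign, or one result is 0: the results reinforce each other.
    magnitude = 1 - (1 - abs(a)) * (1 - abs(b))
    sign = -1.0 if (a < 0 or b < 0) else 1.0
    return sign * magnitude


# "ssn_present": PATH matches at 0.67, TYPE rejects at -1.0.
ssn_present = combine(0.67, -1.0)      # -1.0, SSN is not assigned

# "passport_no": PATH 0.67 and REGEX 0.5 for PASSPORT_NO; PATH 0 and REGEX 0.5 for SSN.
passport = combine(0.67, 0.5)          # approximately 0.835, shown as 0.84 in the text
ssn = combine(0.0, 0.5)                # 0.5
```

Under this assumed rule, a result of 0 leaves the other result unchanged, which agrees with the statement that a value of 0 provides no information.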

Default assignment threshold

The ASDD profiler has a default minimum confidence value of 1, which means that any positive match, no matter how weak, will trigger an assignment. This is significantly different from the legacy profiler, which by default requires an 80% match for data level expressions. The application setting ASDD/DefaultAssignmentThreshold controls this value. For more details, refer to this section.

Choosing values for match strength

The match strength value (typically called matchStrength in the classifier configuration) reflects how confident the classifier is that a particular data element exclusively matches the associated domain. A match strength of 0.01 indicates that the data element may belong to the domain, but might also belong to any number of other domains or not be sensitive at all, while a value of 1.0 reflects absolute certainty that this data matches this domain and no other domain. A value of 0 provides no information.

Not all classifiers have a match strength greater than 0. One example of this is TYPE classifiers, which typically have a high reject strength, but 0 match strength (since it is impossible to match any of the built-in domains based on the data type of a column alone).

PATH classifiers built into the product typically have a match strength of 0.67, so for a REGEX or LIST classifier to override a PATH result, that classifier's match strength or reject strength must be higher than this value. This can help eliminate false positive results from the PATH classifiers, but consider the next recommendation before setting match strength to a high value.

When choosing match strength for REGEX classifiers, consider whether the pattern is unique to the type of sensitive data being detected. If it is not, it is safer to assign a relatively low match strength, in the range of 0.1 to 0.5, so that PATH-level results can contribute information. Consider REGEX detection of US Social Security numbers. These might be stored as a string with a distinctive pattern such as "001-23-4567", or simply as a 9-digit number such as "001234567". A 9-digit number could be any of several other numeric identifiers (an account number, a passport number, an identifier for rows in another table, and so on), so the match strength for the [0-9]{9} regex should be quite low. The distinctive pattern with dashes warrants a much higher match strength, since it is unlikely to be any other kind of information.
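
The Social Security number example can be sketched as two patterns with different match strengths. This is illustrative only: the dictionary layout, the regex key, and the values 0.8 and 0.2 are assumptions, and only the name matchStrength comes from the classifier configuration. Python's re module stands in for the product's Java regex engine, and rejection is left out of the sketch.

```python
import re

# Hypothetical configuration entries for an SSN REGEX classifier.
SSN_PATTERNS = [
    # Distinctive dashed form: unlikely to be anything other than an SSN.
    {"regex": r"[0-9]{3}-[0-9]{2}-[0-9]{4}", "matchStrength": 0.8},
    # Bare 9-digit number: could be many other kinds of identifier.
    {"regex": r"[0-9]{9}", "matchStrength": 0.2},
]


def ssn_match_strength(value):
    """Return the strongest matchStrength among patterns matching the whole value, or 0.0."""
    strengths = [
        entry["matchStrength"]
        for entry in SSN_PATTERNS
        if re.fullmatch(entry["regex"], value)
    ]
    return max(strengths, default=0.0)


dashed = ssn_match_strength("001-23-4567")   # 0.8
bare = ssn_match_strength("001234567")       # 0.2
```

With the bare 9-digit pattern kept at 0.2, a PATH result of 0.67 for a different domain can still outweigh it.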

For LIST classifiers that utilize tokenization of inputs and reuse the same lists as other classifiers, consider lowering the match strength for each of these files used for tokenization. For example, list classifiers for First Name and Full Name may reference the same list of values, but during discovery with tokenization, the same values that match the First Name classifier will also match the Full Name classifier. Ideally, this configuration should allow the first name column to have a higher match to the First Name classifier.

Here are some additional tips for choosing match strength:

Use a higher match strength for patterns that are more likely to be unique to the type of sensitive data being detected.
Use a lower match strength for patterns that are less likely to be unique to the type of sensitive data being detected.
Use a match strength of 0 for patterns that are not unique to the type of sensitive data being detected.

Choosing values for reject strength

The reject strength value (typically called rejectStrength in the classifier configuration) reflects how confident the classifier is that a value does not belong to the classifier's domain when the classifier does not match. If you are certain that your classifier configurations will match every possible value for the domain, set the reject strength to 1.0; however, this degree of certainty is rare. As with match strength, not all classifiers provide rejection capability. PATH classifiers are one example, since an unknown database schema cannot be relied on to use predictable or human-readable column names.

The reject strength for classifiers applies any time there is no match. For example, if a REGEX classifier contains 4 regular expressions, each expression would be tested against the column data value, and if none match, the reject strength defines the result. For this reason, it can be useful to add a pattern that matches quite broadly, even if it's not particularly selective for the domain in question, with a low match strength. This prevents a full rejection for values that might match this classifier's domain as well as one or more other domains.
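
The effect of a broad fallback pattern can be sketched as follows. The function evaluate, the pattern lists, and all strength values are hypothetical. Two behaviors are assumed rather than documented: the strongest matching pattern defines the result, and a non-match yields the negated reject strength. Python's re module stands in for the product's Java regex engine.

```python
import re


def evaluate(value, patterns, reject_strength):
    """Evaluate one data value against a list of (regex, match_strength) pairs.

    Returns the strongest match strength among matching patterns, or the
    negated reject strength when no pattern matches (assumed behavior).
    """
    matched = [strength for regex, strength in patterns if re.fullmatch(regex, value)]
    if matched:
        return max(matched)
    return -reject_strength


# Only the distinctive dashed form.
strict = [(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", 0.8)]
# The same pattern plus a broad, weakly selective fallback.
with_fallback = strict + [(r"[0-9 -]{9,11}", 0.1)]

rejected = evaluate("001 23 4567", strict, 0.9)          # -0.9, the domain is rejected
kept = evaluate("001 23 4567", with_fallback, 0.9)       # 0.1, the domain stays in play
```

The fallback pattern gives a weak positive result for the space-separated value, so other classifiers for the same domain can still contribute.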

For LIST and REGEX classifiers where the set of patterns or list values is known to cover only a subset of possible values for the domain, reject strength should be below 0.5 to allow column-level matches to take precedence, even if none of the data values match. For example, the built-in value lists for the first name and last name LIST classifiers contain only English values, but names might be in other languages. These classifiers have reject strength set relatively low, to prevent the LIST classifiers from overriding the PATH classifier match if, for example, the column contains only Japanese names.

Here are some additional tips for choosing reject strength:

Use a higher reject strength for classifiers that are more likely to match all values in the domain.
Use a lower reject strength for classifiers that are less likely to match all values in the domain.
Use a reject strength of 0 for classifiers that are not designed to match all values in the domain.

Regex configuration

The PATH and REGEX classifier types consume regular expressions using Java 8 regex syntax and matching logic. These classifiers have additional configuration options to control whether these patterns should match the entire input, and whether they are case-sensitive. For this reason, avoid using regex constructs such as "^(pattern)$" for these purposes.
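
The sketch below illustrates why anchors are unnecessary when whole-input matching and case sensitivity are set outside the pattern. Python's re module is used as a stand-in for Java regex, and the parameter names match_entire_input and case_sensitive are hypothetical, not the product's actual option names.

```python
import re


def matches(pattern, text, match_entire_input, case_sensitive):
    """Apply a plain pattern, with matching behavior set by options instead of anchors or flags."""
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    if match_entire_input:
        # Equivalent to wrapping the pattern in ^(...)$, without editing the pattern.
        return compiled.fullmatch(text) is not None
    return compiled.search(text) is not None


partial = matches("ssn", "Customer_SSN", match_entire_input=False, case_sensitive=False)  # True
whole = matches("ssn", "Customer_SSN", match_entire_input=True, case_sensitive=False)     # False
```

The pattern itself stays the same in both calls; only the options change.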

Type classifiers

The TYPE classifier framework uses the same four types as Type Expressions, as described in the Managing expressions section. However, the type-matching system is more versatile and provides better type identification across all database variants. This framework also supports the identification of auto-incrementing columns. The property matchAutoIncrement defines whether auto-incrementing columns should be considered for profiling (true) or completely skipped (false) for the domain. The default value is false.

Tokenization in list classifiers

List classifiers that identify complex data can be created by combining multiple sets of list files and enabling tokenizeInput. In such list classifiers, the input is split into individual tokens at the delimiter characters specified by the tokenizationDelimiter field. The default value for this field is a single space " ", but it can be a string of individual characters such as -_.\n, in which each character is a delimiter. The confidence value for each input value is calculated as the sum of the match strengths of all matching tokens, divided by the total number of tokens in the input.

For example, to identify a FULL_NAME which is a combination of FIRST_NAME & LAST_NAME, it's unrealistic to create a list with all possible combinations of first and last names, but by providing separate lists and turning on tokenization, discovery can match each token of the input to the value lists.
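
The tokenized confidence calculation can be sketched as follows. The function names, the sample name lists, and the match strength of 0.5 are hypothetical. Three behaviors are assumed rather than documented: empty tokens are dropped, comparison is case-insensitive, and a token found in several lists counts once with its strongest match strength.

```python
def tokenize(value, delimiters=" "):
    """Split value at any single character in delimiters, dropping empty tokens."""
    tokens, current = [], []
    for ch in value:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def confidence(value, lists, delimiters=" "):
    """lists: sequence of (set_of_lowercase_values, match_strength) pairs.

    Returns the sum of match strengths of matching tokens divided by the
    total number of tokens.
    """
    tokens = tokenize(value, delimiters)
    if not tokens:
        return 0.0
    total = 0.0
    for token in tokens:
        strengths = [s for values, s in lists if token.lower() in values]
        if strengths:
            total += max(strengths)
    return total / len(tokens)


FIRST_NAMES = {"maria", "john"}
LAST_NAMES = {"garcia", "smith"}
FULL_NAME_LISTS = [(FIRST_NAMES, 0.5), (LAST_NAMES, 0.5)]

both = confidence("Maria Garcia", FULL_NAME_LISTS)        # (0.5 + 0.5) / 2 = 0.5
one = confidence("Maria Q7", FULL_NAME_LISTS)             # 0.5 / 2 = 0.25
dashed = confidence("john-smith", FULL_NAME_LISTS, "-_.") # 0.5
```

Neither list contains "Maria Garcia" as a whole value, yet both tokens match, which is the case that tokenization is meant to handle.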