Multi Column Address (Algorithm frameworks)

Overview

The Multi Column Address framework allows multi-column masking of address information for a physical location. The purpose of the framework is to take data across many columns, with inconsistent formatting or completion of individual columns make sure that input referring to a single physical address is masked to an output that corresponds to a single real or contrived physical address.

This means that a full address in a single column would mask to the same output as the same address spread across several columns of address components like street number and city.

The data in these two rows can mask to the same thing using this framework, because the location they refer to is the same even though the information is stored differently:

CITY	STREET	ZIP	HOUSE_NO	COUNTRY	REGION	ADDRESS
Lorem City	Ipsum Ave	99999	1111	Dolor	Sit Amet	N/A
N/A	N/A	N/A	N/A	N/A	N/A	1111 Ipsum Ave, Lorem City, Sit Amet, Dolor, 99999

The framework offers 17 logical columns that can be assigned to columns in a database. All columns are optional. The algorithm concatenates the input column values to a hash and this hash is used to look up a row from user provided CSV data using a SecureLookup approach.

There are several advanced features:

Grouping: The user can assign a single column that determines groups or “buckets“ to be masked to within the lookup file. If assigned to the zip code column, this could ensure that data with a certain zip code was only masked to values in the lookup file with that same zip code.
US Zip Matching: The United States can use either 5 or 9 digit zip codes to refer to the same place. This feature will trim the 9 digit codes to the 5 digit format and use that value to calculate the hash used for the lookup file, and to determine the masked output zip code.
Filter City: If a city value is followed by the state code (ex: "San Diego CA") and the state code ("CA") is one of the values in the filterCityValueList the value is removed for hashing and after masking the state code can be re-applied. This feature can mask to the same address result for city values with and without a state code.
Option Columns: These 3 logical columns can be used to apply an action to data columns in the source that fall outside of the standard columns provided. The behaviors available are:
- LOOKUP: Use the column value from the CSV file.
- NULL: Always return null.
- BLANK: Always return a blank space/empty string.
- ALGO_IN: Chain the input value to a configured algorithm.
- ALGO_OUT: Chain the output from the CSV lookup value to a configured algorithm.

The descriptions for all individual configuration keys are found at the end of this documentation.

Column assignment

The following 17 logical columns are available in the algorithm for assignment:

CITY
STREET
ZIP
HOUSE_NO
STREET_NO
REGION
COUNTRY
COUNTRY_CODE
LONGITUDE
LATITUDE
OFFICE
ADDRESS
ADDRESS_LINE1
ADDRESS_LINE2
OPTION1
OPTION2
OPTION3

Logical columns are assigned to columns in a masking ruleset. Multi-column algorithms are assigned to a ruleset column, and then a logical column from the algorithm is assigned:

Creating a Lookup File

The lookup file used for masking must be created by the user. Each row must contain 17 columns for the 17 different logical columns in the algorithm in the same order. These do not all need to contain data, but they should exist at minimum as a column defined by the configured delimiter, defaulting to “|“. There should not be a row defining column names.

Row format and required sequence:

CODE

CITY|STREET|ZIP|HOUSE_NO|STREET_NO|REGION|COUNTRY|COUNTRY_CODE|LONGITUDE|LATITUDE|OFFICE|ADDRESS|ADDRESS_LINE1|ADDRESS_LINE2|OPTION1|OPTION2|OPTION3

The following example:

CODE

Lorem City|Ipsum Ave|99999|99|1111|Sit Amet|Dolor|DOL|99.0|99.0|99|1111 Ipsum Ave, Lorem City, Sit Amet, Dolor, 99999|1111 Ipsum Ave|Lorem City, Set Amet, Dolor, 99999|NULL|NULL|NULL

And this example:

CODE

Lorem City|Ipsum Ave|99999|99|1111|Sit Amet|Dolor||||||||||

Are both valid rows for the lookup file.

Remember that all columns do not need to be filled out. Just the ones that create the masked output format desired.

Basic algorithm configuration

The most basic configuration of the Multi Column Address algorithm takes values from the data source using the assigned logical columns, concatenates the values for each row, hashes them and uses the hash to retrieve a row from the user provided lookup file. The column values for this row are inserted in the output row columns as the masked values.

An example config could look like this:

CODE

{
    "option1": "LOOKUP",
    "option2": "LOOKUP",
    "option3": "LOOKUP",
    "filterCity": false,
    "lookupFile": {
        "uri": "delphix-file://upload/f_9999988bde64469eb08f514c74a3244c/YOUR_LOOKUP.csv"
    },
    "regexFields": null,
    "csvDelimiter": "\\|",
    "addressGroups": [],
    "fallbackGroup": null,
    "usZipMatching": false,
    "enforceNotNull": true,
    "matchingColumn": "",
    "option1Algorithm": null,
    "option2Algorithm": null,
    "option3Algorithm": null,
    "filterCityReapply": false,
    "inputRemoveSpaces": true,
    "inputCaseSensitive": false,
    "externalSecurityKey": 0,
    "filterCityValueList": "",
    "preserveEmptyFields": true,
    "columnsUsedToBuildHash": "",
    "enforceNotNullReplacement": "_"
}

Notice that regexFields and addressGroups, two of the more advanced features, are not used for simple configurations.

Advanced algorithm configuration

Because of the large number of options in this algorithm it is impossible to provide an example of all possible permutations. The following takes advantage of the advanced features not used in the basic configuration.

CODE

 {
    "option1": "LOOKUP",
    "option2": "LOOKUP",
    "option3": "LOOKUP",
    "filterCity": false,
    "lookupFile": {
      "uri": "delphix-file://upload/f_9999988bde64469eb08f514c74a3244c/YOUR_LOOKUP.csv"
    },
    "regexFields": {
      "regexZIP": [],
      "regexCITY": [],
      "regexHASH": [],
      "regexOFFICE": [],
      "regexREGION": [],
      "regexSTREET": [],
      "regexADDRESS": [],
      "regexCOUNTRY": [],
      "regexOPTION1": [],
      "regexOPTION2": [],
      "regexOPTION3": [],
      "regexHOUSE_NO": [],
      "regexLATITUDE": [],
      "regexLONGITUDE": [],
      "regexCOUNTRY_CODE": [],
      "regexADDRESS_LINE1": [],
      "regexADDRESS_LINE2": [],
      "regexMatchingColumn": [
        "G[0-9]",
        "G[0-9]([0-9])"
      ],
      "regexSTREET_HOUSE_NO": []
    },
    "csvDelimiter": "\\|",
    "addressGroups": [
      {
        "group": "A",
        "matchList": [
          "G00",
          "G01"
        ]
      },
      {
        "group": "B",
        "matchList": [
          "G02",
          "G03"
        ]
      },
      {
        "group": "X",
        "matchList": [
          "G04"
        ]
      }
    ],
    "fallbackGroup": "X",
    "usZipMatching": false,
    "enforceNotNull": true,
    "matchingColumn": "ZIP",
    "option1algorithm": null,
    "option2algorithm": null,
    "option3algorithm": null,
    "filterCityReapply": false,
    "inputRemoveSpaces": true,
    "inputCaseSensitive": false,
    "externalSecurityKey": 0,
    "filterCityValueList": "",
    "preserveEmptyFields": true,
    "columnsUsedToBuildHash": "ZIP,CITY,STREET",
    "enforceNotNullReplacement": "_"
  }

Note that:

The matchingColumn is set to “ZIP“. To determine a group for an input row the ZIP column is used.
The regexMatchingColumn has two regex filters. The input from the ZIP column will pass through these filters and the matching values will be used to determine group.
There are 3 groups defined in addressGroups. The regex filtered matching column value from ZIP will be assigned to the group where it is contained in the matchList. This means that value G001-AAA would be regex filtered to G00 and assigned to group A. The input row for that value could only be masked to a row in the lookup file matching group A.
The columnsUsedToBuildHash mentions a subset of the available logical columns. Only these columns will be considered when building a hash to retrieve a masking row from the lookup file.
There is a regex list for each logical column, seen like this: regexStreet. Any value from the logical column with this name will be filtered through these regex before being used to build the lookup hash.

Algorithm configuration options

matchingColumn: [String]

This mandatory setting contains the name of the column to be used as a matching value for grouping. When loading the CSV each data row is mapped into a specific group based on the matchingColumn value.

Example: The matchingColumn ZIP has input value 34564 but because the regexMatchingColumn is enabled with [0-9]{1} the ZIP column input value is filtered to the first numeric character '3'. The value 3 belongs to group A because 3 is an entry in the matchList of group A.

CODE

"matchingColumn": "ZIP"

This key is only used when groups are defined. It is ignored in simple configurations.

addressGroups: [List of JSON Arrays]

If using a matchingColumn this setting must contain at least one group. Each group has a matchList containing possible input from the matchingColumn. Each matchList value should exist only once in the entire definition (otherwise first found is applied). An input row matched into a group can be assigned a deterministic value from and row in the lookup matching values in its matchList.

CODE

"addressGroups" : [
    {"group": "A","matchList": ["1","2","3"]},
    {"group": "B","matchList": ["37","38","39"]}
]

columnsUsedToBuildHash: [String]

Optional property containing a comma separated list of columns used for matching. The column values are concatenated for hashing. If not defined all columns are used to compute the hash.

CODE

"columnsUsedToBuildHash": “ZIP,CITY,STREET_HOUSE_NO”

fallbackGroup: [String]

This setting is mandatory for configurations using addressGroups. If a record from a matchingColumn does not match a value in the match list for any group, this group is used for masking the row’s values.

CODE

"fallbackGroup": "X"

This key is only used when groups are defined. It is ignored in simple configurations.

lookupFile: [FileReference]

Mandatory URI to a file reference for the CSV data file containing masking output address material. The URI can be obtained through the masking engine API for any file uploaded.

CODE

"lookupFile": {
      "uri": "delphix-file://upload/f_9999988bde64469eb08f514c74a3244c/YOUR_LOOKUP.csv"
}

inputCaseSensitive: [boolean]

Determines if the input ignores case. Important for determining lookup hashes. Defaults to false and this will satisfy most use cases.

inputRemoveSpaces: [boolean]

When true any space in the input string is filtered out. Important for determining lookup hashes. This is helpful to achieve the same result for differently structured inputs like column ‘STREET' and column 'HOUSE_NO’ versus a single column containing street and house number. The single column would contain a space between the values where two concatenated columns may not. Defaults to true and this will satisfy most use cases.

filterCity: [boolean]

If filterCity is enabled the city value will be checked to see if it’s end contains a value from the filterCityValueList. If it does, the matching ending is removed for hashing. Uncommonly used. Defaults to false and this will satisfy most use cases.

filterCityValueList: [String]

Comma separated list of String values that are to be removed from the end of the city column input value if filterCity is true. Uncommonly used.

filterCityReapply: [boolean]

If enabled and the filterCity feature has detected a match and filterCityReapply is enabled the masked output COUNTRY_CODE value will be added to the city column. Example:

San Diego CA > CA is filtered > masked value Boston > we detect that filterCity occurred and MA is appended > Final result: Boston MA.

Uncommonly used.

preserveEmptyFields: [boolean]

If set to true any null or empty string from input will be preserved as an empty string (““).

csvDelimiter: [String - Single Character]

Required. Character to be used as the delimiter in lookup CSV file. Defaults to “|“.

enforceNotNull: [boolean]

Required. This setting controls whether a masked column output can be null or an empty string. The value is configured under property enforceNotNullValue.

enforceNotNullReplacement: [String]

Required. This setting controls what value should be used as replacement for null or empty strings in masked output. Useful to avoid breaking database table not null constraints. Defaults to an underscore “_“.

This value will be seen in masked output for configurations not using the group feature. If groups are utilized, the replacement value will be masked as a member of the fallbackGroup.

The replacement can also be set as a value that matches a different group- effectively creating a different fallback for null values.

option1/option2/option3: [String]

Used to assign behavior to the wildcard option columns. Options are:

LOOKUP: This option will use a value from CSV lookup.
NULL: This option will always return NULL as the masked value.
EMPTY: This option will always return an empty string (“”) as the masked value.
BLANK: This option will always return a blank space (“ “) as the masked value.
ALGO_IN: This option will send the input value to a configured algorithm and return its result.
ALGO_OUT: This option will send the value returned from the CSV lookup to a configured algorithm and return its result.
option1Algorithm/option2Algorithm/option3Algorithm: [String]

Assigns a chained algorithm for option fields configured with ALGO_IN or ALGO_OUT.

CODE

"option1Algorithm": {"name": "dlpx-core:CM Alpha-Numeric"}

regexFields [JSON]

The input value from any logical field can be filtered through regex before before being concatenated into the value to be hashed for lookup. The keys available for this object correspond to each logical field. Each logical field can have multiple regex filters. The output of each filter match is concatenated into a end result. This can allow multiple filters to match various parts of a large input and achieve a normalized value to use in hashing.

When multiple regex filters are used on a single field, the matching values for each are concatenated in order. The filters list is an AND implementation, not an OR.

CODE

"regexADDRESS": [
        "[a-zA-Z]{1,3}",
        "[0-9]{1,2}"
]

Example input:

CODE

Samplestreet 123

Filtered and concatenated output:

CODE

1Sam