
Schema Data Class Search
The Schema Data Class Search wizard in IRI Workbench (WB) can be used to match data in entire schemas to configured data classes. The search process compares the matchers in the data classes with the data to determine the best match, if any. The matchers can be either patterns or file lookups. If only selected tables or structured files need to be searched, use the Data Class Library editor for faster results.
The main output is a Data Class Library or additions to an existing data class library. The library contains links to the data sources, the data classes, and the mappings between the two.
This example searches a single MySQL schema using eight data classes matching personally identifiable information or indirect identifiers.
Data Class Settings in Preferences
Data classes are set up in WB preferences so that they are available to all projects in the workspace. WB comes pre-loaded with some data classes. This example uses the default list plus a credit card pattern from IRI’s list of common patterns.
Additionally, a matching threshold can be set in preferences. This threshold allows the wizard to end scanning of the current column if it reaches that threshold after scanning 4096 rows of data. For this example, the default threshold of 90% is used. Therefore, if 90% of the first group of data matches, the process moves to the next column. If it doesn’t, groups of 4096 rows are retrieved and scanned until reaching the threshold, if ever.
Using the Schema Data Class Search Wizard
To begin using the wizard, select to include matching on column names and whether to scan tables that were not scanned previously. The option to match the data class name to the column name will skip scanning the actual data for a match. For instance, if there is a column and a data class both named “CREDIT_CARD”, it is a match. The “previous scan” option is helpful if a scan did not finish or if new tables have been added since a previous scan. Click Add to add the tableSearchResults from a previous scan (or multiple scans) to exclude them from this scan. Click to override existing data class mappings with the results of this scan.
Next, select the depth of matching. The first choice is full scan using the threshold set in preferences. If the scan should only match on the column names, select not to scan the data. In this example, both matching on column names and matching against data will be performed.
On the next page, select the schema or schemas to search and click Next.
On the next page, select the data types of the columns that contain the data in question. If the suspect data is only in text-type fields, selecting only those types of columns will reduce the amount of data that the underlying search engine has to process.
Data that matches particular patterns can be excluded to reduce the scanning volume. On this page, enter regular expression patterns to exclude items. The patterns should follow this format: <Schema>.<Table> or <Schema>.<Table>.<Column>. There are no exclusions in this example.
The next page displays the available data classes from the list in preferences. Data classes can also be added and edited from this page. Those changes will be propagated to the data classes in preferences. Select the data classes that are to be matched, eight in this case.
Clicking Finish will start the search/map process. Depending on the volume of data being scanned, the wizard may run for a long time. Because of this, a file called tableSearchResults is created and records every table that has been fully processed. In case of a failure during the search, this file will show the last table that was successfully searched.
If the option to match on names is selected, the column name is compared to the name of the data classes for a match. If there is no match and the data is being scanned, the job moves on to scanning the actual data. If a data class match was found, a file named columnSearchResults is appended with the name of the column.
In this example, cosort.PERSONS.SSN and cosort.PERSONS.CREDIT_CARD were matched on names. The other three were matched by the data.
Working with the Data Classes
The data class library is opened when the wizard finishes. The two tables that contained matches were added, as were the data classes that matched. Each of the columns listed above have a map to a data class as well.
In the screenshot below, the four matched columns in the table PERSONS is displayed with the data class and matching percentage or type. Even though the wizard searched for last names, a data class was not matched to the LASTNAME field because the matched percentage was below the threshold.
If changes need to be made, this form editor can change the data class or rescan specific columns in their entirety. The % matching will return the percentage even if under the threshold. Field rules can also be assigned to the columns here.
Classifying data is important when keeping up with changes in regulations. These data classes can now be used in WB wizards with field rules to mask personally identifiable information or make modifications before sending to analytics. See this article on masking data in multiple RDB columns across a schema using the FieldShield Database Data Class Masking wizard.
It is also possible to search and mask data class-matching PII in multiple structured files in FieldShield, or in semi-strutured and unstructured sources in IRI DarkShield. Data classes and masking functions are shared between both products, which are also inside the IRI Voracity data management platform.
1 COMMENT
[…] a data class library (in this example, using the Schema Data Class Search wizard). Make sure that data classes are mapped to […]