
Directory Data Class Search
The Directory Data Class Search wizard in IRI Workbench (WB) matches data in structured files within one or more directories to configured data classes. The search compares the matchers in each data class with the data in those files to determine the best match, if any. A matcher can be either a pattern or a set file lookup. If only a few selected structured files need to be searched, use the Data Class Library editor for faster results.
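To illustrate the two kinds of matchers, here is a minimal Python sketch. It is not Workbench code: the function names, the full-match rule for patterns, and the case-insensitive comparison for set entries are all assumptions made for illustration.

```python
import re

def pattern_matcher(regex):
    """Build a matcher that accepts values fully matching a regular expression."""
    compiled = re.compile(regex)
    def matches(value):
        # Assumption: the whole (trimmed) value must match the pattern.
        return compiled.fullmatch(value.strip()) is not None
    return matches

def set_matcher(entries):
    """Build a matcher that accepts values found in a list of known entries."""
    # Assumption: lookups ignore case and surrounding whitespace.
    lookup = {e.strip().upper() for e in entries if e.strip()}
    def matches(value):
        return value.strip().upper() in lookup
    return matches

# Hypothetical matchers for illustration only.
is_ssn = pattern_matcher(r"\d{3}-\d{2}-\d{4}")
is_first_name = set_matcher(["Alice", "Bob", "Carol"])
```

A pattern suits values with a fixed shape, such as identification numbers, while a set lookup suits values that can only be recognized by membership in a list, such as first names.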
The main output is a data class library, or additions to an existing one. The library contains links to the data sources, the data classes, and the mappings between the two.
This example searches a directory containing five CSV files using eight data classes that match personally identifiable information or indirect identifiers. Data classes are set up in preferences so that they are available to all projects in the workspace. WB comes pre-loaded with some data classes. This example uses the default list.
Additionally, a matching threshold can be set in preferences. The wizard scans each field in groups of 4096 records and stops scanning that field once the percentage of matching values reaches the threshold. For this example, the default threshold of 90% is used. Therefore, if 90% of the first group of 4096 records matches, the process moves on to the next field. If not, further groups of 4096 records are retrieved and scanned until the threshold is reached or the data runs out.
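The early-exit behavior of the threshold can be sketched in Python as follows. This is only an approximation of the idea, not the wizard's implementation: the function name is hypothetical, and it assumes the percentage is computed cumulatively over all records read so far.

```python
from itertools import islice

BATCH_SIZE = 4096   # records per group, as described above
THRESHOLD = 0.90    # default threshold used in this example

def scan_field(values, matches, batch_size=BATCH_SIZE, threshold=THRESHOLD):
    """Scan one field in batches; return (matched_ratio, reached_threshold).

    Assumption: the ratio is cumulative over every record read so far.
    """
    it = iter(values)
    seen = hits = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break  # no more data for this field
        seen += len(batch)
        hits += sum(1 for v in batch if matches(v))
        if hits / seen >= threshold:
            # Threshold reached: stop reading and move on to the next field.
            return hits / seen, True
    ratio = hits / seen if seen else 0.0
    return ratio, False
```

With a field whose values all match, only the first 4096 records are read. A field that never reaches the threshold is read to the end, which is why low-quality or mixed fields take the longest to scan.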
To launch the wizard, click the Discovery menu and select Directory Data Class Search. On the setup page, choose whether to include matching on field names, and whether to scan only sources that were not scanned previously. The field name option compares the name of the field with the name of each data class and, when they match, skips scanning the actual data. For instance, if there is a field and a data class both named “CREDIT_CARD”, it is a match.
The “previous scan” option is helpful if a scan did not finish or if new sources have been added since a previous scan. Click Add to add the sourceSearchResults.log from a previous scan (or multiple scans) so that the sources it lists are excluded from this scan. Additionally, if there is an existing data class library, select whether to override existing mappings.
Next, select the depth of matching. The first choice is a full scan using the threshold set in preferences. If the scan should only match on the field names, select Do not scan data. In this example, both matching on field names and matching against data will be performed.
On the next page, select the directories to search. Selecting Include LAN drives will list network drives as well. Click Select, check the desired directories and click OK. Then click Next.
The files and details of the selected directory are below.
On the next page, enter regular expression patterns to exclude items. The patterns should follow this format: <Absolute file name> or <Absolute file name>.<Field name>. There are no exclusions in this example.
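As a rough illustration of how such exclusion patterns behave, the Python sketch below tests a file, or a file and field pair, against a list of regular expressions. The function name, the example paths, and the choice to require a full match are assumptions; the wizard's actual matching rules may differ.

```python
import re

def is_excluded(file_path, field_name, patterns):
    """Return True if the file, or the file.field pair, matches any pattern.

    Assumption: a pattern must match the entire candidate string.
    """
    candidates = (file_path, f"{file_path}.{field_name}")
    return any(re.fullmatch(p, c) for p in patterns for c in candidates)

# Hypothetical patterns: skip an archive folder and one free-text field.
patterns = [
    r"/data/hr/archive/.*",        # <Absolute file name>
    r".*/persons\.csv\.NOTES",     # <Absolute file name>.<Field name>
]
```

Note that literal dots in file names should be escaped (`\.`), since an unescaped dot matches any character in a regular expression.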
The last page displays the available data classes from preferences. Data classes can also be added and edited from this page, and those changes are propagated to the data classes in preferences. Select the data classes that are to be assigned, eight in this case. The order of the data class list can be changed with the Move Up and Move Down buttons. This allows prioritization if multiple classes can match the same data.
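Why the order matters can be seen in a small Python sketch. It assumes, for illustration only, that the first data class in the list to reach the threshold is the one assigned; the class names and patterns are hypothetical.

```python
import re

def first_matching_class(values, ordered_classes, threshold=0.90):
    """Return the first class, in list order, whose pattern meets the threshold."""
    if not values:
        return None
    for name, regex in ordered_classes:
        hits = sum(1 for v in values if re.fullmatch(regex, v))
        if hits / len(values) >= threshold:
            return name
    return None

values = ["5551234567", "5559876543", "5550001111"]

# The same two hypothetical classes in different priority orders.
broad_first = [("DIGITS", r"\d+"), ("PHONE_10", r"\d{10}")]
specific_first = [("PHONE_10", r"\d{10}"), ("DIGITS", r"\d+")]
```

Here every value satisfies both patterns, so `first_matching_class(values, broad_first)` returns the broad class while `first_matching_class(values, specific_first)` returns the specific one. Placing more specific data classes higher in the list avoids this kind of overly general assignment.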
Clicking Finish will start the search/map process. Depending on the volume of data being scanned, the wizard may run for a significant amount of time. Because of this, a file called sourceSearchResults is created and records every source that has been fully processed. In case of a failure during the search, this file will show the last source that was successfully searched.
If the option to match on names is selected, the field name is compared to the name of the data classes for a match. If there is no match and the data is being scanned, the job moves on to scanning the actual data. If a data class match was found, a file named fieldSearchResults is appended with the name of the field.
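The name-first, data-second order described above can be sketched in Python. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, the name comparison is assumed to be an exact, case-insensitive equality check, and the data scan is passed in as a callable rather than implemented here.

```python
def classify_field(field_name, values, class_names, match_names=True, scan_data=None):
    """Return (data_class_name, how_matched), or (None, None) if nothing matched.

    scan_data: optional callable taking the field's values and returning a
    data class name or None (hypothetical stand-in for the data scan).
    """
    if match_names:
        for name in class_names:
            # Assumption: names must be equal, ignoring case.
            if name.upper() == field_name.upper():
                return name, "name"   # matched on name; the data is not scanned
    if scan_data is not None:
        name = scan_data(values)
        if name is not None:
            return name, "data"
    return None, None
```

Under this model, a field named CREDIT_CARD is matched immediately by name, while a field named LASTNAME does not equal a class named LAST_NAME and therefore falls through to the data scan.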
In this example, all matching fields in employees.csv, names.csv.LAST_NAME, names.csv.FIRST_NAME, and persons.csv.CREDIT_CARD were matched on name. The others were matched by the data.
The data class library opens when the wizard finishes. The three files that contained matches were added, as were the data classes that matched. Each of the fields listed above has a map to a data class as well. In the screenshot below, the four matched fields in persons.csv are displayed with the data class and matching percentage or type. Even though the wizard searched for last names, a data class was not matched to the LASTNAME field because the matched percentage was below the threshold.
If changes need to be made, this form editor can change the data class or re-scan specific fields in their entirety. When a field is classified through this editor, the matching percentage is reported even if it is below the threshold. Field rules can also be assigned to the fields or data classes here.
Classifying data is important when keeping up with changes in regulations. These data class mappings can be used in IRI Workbench wizards and paired with field rules to redact personally identifiable information or modify data before it is analyzed.