IRI Voracity and IRI FieldShield users can now classify their data and apply transformation and protection rules to data classes in multiple sources at once. New data discovery-related facilities in IRI Workbench help you assign classes or class groups to your data based on your business rules and/or domain ontologies. You can then use your data class library in reusable field rules. These features provide convenience, consistency, and compliance capabilities to data architects and governance teams.
Create Data Classes
The classification starts by setting up data classes in the Workbench’s Preferences screen, which allows you to use classes globally, across multiple projects in your workspace. Workbench has some classes pre-loaded, including the FIRST_NAME, LAST_NAME, and PIN_US classes used in this example. Note that class and class group names are converted to uppercase to prevent problems with case-sensitive operating systems.
The data classes work by matching (1) the name of the class to the name of the field, (2) a pattern to the data in the field, or (3) the contents of a set file to the data in the field. The first kind of matching is done for you automatically during classification, if that option is chosen. You can add as many pattern and set file matchers as you need for each class to return your intended results. Each class also has an optional Rank. It is not used in the classification process, but you can use it to prioritize your data classes.
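The three matching strategies can be pictured with a small Python sketch. This is not how Workbench implements them; the function names, the sample SSN pattern, and the in-memory set standing in for a set file are all assumptions made for illustration.

```python
import re

def name_matches(class_name: str, field_name: str) -> bool:
    """Strategy 1: compare the class name to the field name, ignoring case."""
    return class_name.upper() == field_name.upper()

def pattern_matches(pattern: str, value: str) -> bool:
    """Strategy 2: test a field value against a regular expression."""
    return re.fullmatch(pattern, value) is not None

def set_matches(set_values: set[str], value: str) -> bool:
    """Strategy 3: look the field value up in the contents of a set file."""
    return value in set_values

# Hypothetical matchers, not taken from the pre-loaded Workbench classes.
SSN_PATTERN = r"\d{3}-\d{2}-\d{4}"
LAST_NAMES = {"Smith", "Jones", "Garcia"}  # stands in for a set file

print(name_matches("FIRSTNAME", "FirstName"))        # name comparison
print(pattern_matches(SSN_PATTERN, "123-45-6789"))   # pattern comparison
print(set_matches(LAST_NAMES, "Nguyen"))             # set file lookup
```

A class with several pattern or set file matchers would simply try each one in turn against the same value.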
You can also make your data classes and groups inactive. If you have many classes but only need some of them in a particular project, you can deactivate the rest. This retains a copy of them without cluttering the drop-down list that uses these classes.
Data Class Groups
You can also have data class groups. For instance, the included group “NAMES” contains the data classes FIRST_NAME, LAST_NAME, and FULL_NAME. If you want to apply a rule to multiple classes, you can use a group instead of selecting data classes individually.
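One way to think about a group is as a name that expands to its member classes when a rule is applied. The sketch below assumes a simple dictionary representation; the `CLASS_GROUPS` structure and the `expand_targets` helper are hypothetical and only mirror the NAMES group described above.

```python
# Hypothetical in-memory view of class groups; the NAMES group mirrors the text.
CLASS_GROUPS = {"NAMES": ["FIRST_NAME", "LAST_NAME", "FULL_NAME"]}

def expand_targets(targets: list[str]) -> list[str]:
    """Replace each group name with its member classes.

    Names are uppercased first, order is preserved, and duplicates are dropped.
    """
    expanded: list[str] = []
    for target in targets:
        key = target.upper()
        for name in CLASS_GROUPS.get(key, [key]):
            if name not in expanded:
                expanded.append(name)
    return expanded

print(expand_targets(["names", "PIN_US"]))
# → ['FIRST_NAME', 'LAST_NAME', 'FULL_NAME', 'PIN_US']
```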
For the walkthrough in this article, I removed the underscore from the FIRST_NAME data class, making it FIRSTNAME, to demonstrate the name matching option of classification.
Data Classification Source Wizard
Once the matchers have been added to the needed classes, you can run the Data Classification Source Wizard. The wizard accepts the following data formats: CSV, Delimited, LDIF, ODBC, and XML. It lets you select the sources to include in your data class library so that you can classify them later.
On the setup page, begin by selecting the location of your new “iriLibrary.dataclass” file, which is the output of this wizard. The file name is read-only because there can only be one of these file types in each project. You can also select the checkbox if all of your sources are tables in a connection profile.
Selecting this box opens an input page where you can choose the tables to be included.
If the checkbox is not selected, you can add files or ODBC sources in the same input screen. On this type of input page, you will also need to add the metadata for each source. In this example, I have included a CSV file and two Oracle tables. If you need to search and classify data across one or more full database schemas at once, use the Schema Pattern Search and Schema Pattern Search to Data Class Association wizards.
Clicking Finish will create a data class library with the selected sources included. The data class form editor that opens will allow you to classify the data in those sources.
Classifying the Data In Your Selected Sources
You begin the classification process by clicking one of the data sources to display the details about that source. The upper part of the screen has an expandable section that shows the file or table details. The classification section starts with a checkbox that enables matching field names against data class names. For example, I have a data class called FIRSTNAME and a field called FIRSTNAME (the matching is case-insensitive). In this case, the classification process will select that data class for that field without reading the data content.
The next section displays a table containing field names with checkboxes, a column for the data class, and a column for the matching results. The lower table is a preview of the data in the source. The needed data classes should have been created before using this form editor, but you can add or edit them here.
Select the checkbox next to each field you want to classify. You can manually select the class by clicking the drop-down box in the data class column of the field you want to classify. Click Auto Classify to start the automatic classification process, which can take a long time depending on the amount of data in your source.
The process can run in the background if you select that option in the standard Eclipse dialog that displays. Additionally, you can view the process status in the Progress View.
In this example, the classification process found an 87% match on the SSN field, an 11% match on LASTNAME, and a name match on FIRSTNAME. The percentages indicate how much of the data in your source matched the matchers for that data class. If “name” displays in the matching column, the data class was matched based on the field name. If you manually selected a data class, “user” is displayed in the matching column. When you are satisfied with the results (again, you can manually change the class if needed), click Save Set. This creates the data class and data class map in the library for the selected fields.
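The three kinds of entries in the matching column can be illustrated with a short sketch. It assumes the percentage is the share of field values that satisfy a single pattern matcher, rounded to a whole number; the actual Workbench calculation is not documented here, and the `matching_result` function, its parameters, and the sample data are hypothetical.

```python
import re

def matching_result(field_name, values, class_name, pattern,
                    match_by_name=True, user_choice=False):
    """Return the text for the matching column under the assumptions above."""
    if user_choice:
        return "user"  # the class was picked manually
    if match_by_name and field_name.upper() == class_name.upper():
        return "name"  # matched on the name alone; the data is not read
    if not values:
        return "0%"
    hits = sum(1 for value in values if re.fullmatch(pattern, value))
    return f"{round(100 * hits / len(values))}%"  # assumed rounding

# Hypothetical sample column and pattern.
SSN_PATTERN = r"\d{3}-\d{2}-\d{4}"
SAMPLE = ["123-45-6789", "987-65-4321", "n/a", "111-22-3333"]

print(matching_result("SSN", SAMPLE, "PIN_US", SSN_PATTERN))
# → 75%
print(matching_result("FIRSTNAME", ["Ann"], "FIRSTNAME", r"[A-Z][a-z]+"))
# → name
```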
The final library contents are displayed below. Just as you can see the details of the sources, you can also click the data classes and maps to display their details.
The data class maps use references to the data classes and fields, which is the reason the library stores the sources and data classes, in addition to the map itself. Deleting a source or data class will also remove any associated data class map that references that deleted item. When clicking Remove, a warning is displayed to remind you of this. The process can be repeated on the other included sources, and additional sources can be added at any time.
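The cascading removal described above can be modeled as follows. This is only a sketch of the referencing idea, not the format of the iriLibrary.dataclass file; the `Library` and `ClassMap` types and the source names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ClassMap:
    """Links one field in one source to one data class, by reference."""
    source: str
    field_name: str
    data_class: str

@dataclass
class Library:
    sources: set[str] = field(default_factory=set)
    data_classes: set[str] = field(default_factory=set)
    maps: list[ClassMap] = field(default_factory=list)

    def remove_source(self, source: str) -> None:
        """Remove a source and every map that references it."""
        self.sources.discard(source)
        self.maps = [m for m in self.maps if m.source != source]

    def remove_data_class(self, name: str) -> None:
        """Remove a data class and every map that references it."""
        self.data_classes.discard(name)
        self.maps = [m for m in self.maps if m.data_class != name]

lib = Library(
    sources={"people.csv", "HR.EMPLOYEES"},       # hypothetical sources
    data_classes={"FIRSTNAME", "PIN_US"},
    maps=[
        ClassMap("people.csv", "SSN", "PIN_US"),
        ClassMap("HR.EMPLOYEES", "FIRSTNAME", "FIRSTNAME"),
    ],
)
lib.remove_source("people.csv")
print([m.field_name for m in lib.maps])
# → ['FIRSTNAME']
```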
The classification results of this library can now be used to apply field rules to those data sources. The process is explained in my next article on Applying Field Rules Using Classification.