Editor’s Note: This article is part of a series of articles on using DarkShield to search and mask sensitive data in relational databases. This new support for RDB sources in DarkShield can augment or replace IRI FieldShield support for the same data classes, locational searches and masking functions for certain use cases.
The IRI DarkShield data masking product now includes fit-for-purpose wizards in the IRI Workbench IDE to search (classify) and mask (remediate) sensitive ”dark data” (as defined by Gartner) in many semi, unstructured, and structured file, NoSQL, and relational database (RDB) sources. The focus of this article is on RDBs; DarkShield can support any RDB with a JDBC driver connection.
The diagram below summarizes DarkShield architecture; the wizard this article explains is inside Workbench:
Although not discussed in this article, DarkShield also includes wizards for other sources, like the New File Search/Masking Job … wizard for sensitive data in files, and the New NoSQL DB Search/Masking Job… wizard for sensitive data in a NoSQL database.
What the DarkShield Relational DB Wizard Does
The “New Relational DB Search/Masking Job …” wizard in the IRI Workbench GUI for DarkShield helps you create a DarkShield Job (.dsc file) with the Search and Mask Contexts applicable to your job.
Search Contexts contain instructions on how to access an RDB instance and search for PII. Mask Contexts contain instructions on masking the PII found during the search operations, and how to access the target RDB where the masked version of data will be sent.
The scanning and remediation of data stored in RDBs is based on the Data Classes you defined and stored in an IRI Data Class and Rule Library. Each Data Class contains one or more search methods called Search Matchers used to identify PII.
The previous iterations of the wizard only supported scanning and extracting sensitive values that matched Java RegEx patterns and Set File lookups. Today’s wizard supports more search methods and of course simultaneous or separate masking operations, for DarkShield or Voracity users.
Before launching the DarkShield RDB wizard ensure these preliminary steps are completed:
First, verify that the DarkShield API distribution directory has been specified in IRI Workbench Preferences > IRI > DarkShield. From here you can configure DarkShield GUI and API preferences including the host, port, and directory where the DarkShield API resides.
If the DarkShield API distribution (Plankton) has not been specified in Preferences or your DarkShield Job will be using a different DarkShield API distribution, you will need to manually place JDBC driver(s) associated with the database silo(s) inside the lib folder of the DarkShield API distribution that will be used.
Second, all DarkShield Wizards require a project possessing an IRI Data Class and Rule Library. The IRI Library in turn should contain at least one Data Class and a (masking) Rule that can be assigned to that Data Class. To learn more about IRI Data Class and Rule Library and creating Data Classes and Rules read this article.
IRI Project containing IRI Library.
IRI Data Class and Rules Library Form Editor contains some Data Classes and Rules.
Fourth, verify that the Plankton (DarkShield API) server is running. This can be done by opening the DarkShield API Status view. This view displays information about the DarkShield API, including whether it is currently running.
DarkShield API Status view panel
Finally, the Data Source Explorer in IRI Workbench should contain data connection profiles for both the RDB source and target silos. The destination silo must also contain an existing schema.
Tables in the target schema will attempt to be populated automatically, but the attempted derivation of a create table statement to use is not guaranteed to be correct. If a table cannot successfully be automatically created when a DarkShield job is run, then it needs to be created beforehand.
Note that if a data source connection profile does not exist, it must be created in the later steps of the DarkShield RDB wizard. To see an example of how to create a data source connection profile in Workbench, see the JDBC section in this article.
For this demonstration, I will be searching a table called DARKSHIELD inside a schema called IRI. The DARKSHIELD table contains a combination of columnar values, free-floating text, and binary.
I will write the masked data to an empty table called DARKSHIELD inside the schema MASKED. The DARKSHIELD table contains identical columns from the same-named table in the IRI schema.
Using the Wizard
In this article, I will demonstrate the use of the New Relational DB Search/Masking Job… wizard to create a DarkShield RDB Job.
To open the wizard, select the DarkShield menu dropdown and select the New Relational DB Search/Masking Job… wizard. This brings up the first page where you can name the new job:
Here you will also specify the folder and file names for the output of the wizard.
Click Next when finished to move on to the specifics of the data you are trying to find — and how it should be masked.
Here is where you specify an IRI Data Class and Rule Library containing your desired Data Classes, Data Class Groups, and Rules. It is possible to filter the Data Classes and Groups from the library by selecting or deselecting Data Classes in the Active column. In this example, I am using the default Data Classes provided when creating an IRI Project.
In the Masking Rules tab, we can see that two functions are available: a Format Preserving Encryption Rule and a Blur Date Rule. These rules dictate how PII found using Data Classes will be masked. It is also possible to add or remove Masking Rules from this tab.
Click Next when finished to move onto the page that will allow you to assign these Masking Rules to specific Data Classes.
On the Assign Masking Rules to Data Classes wizard page, each Data Class and Data Class Group must be assigned a Masking Rule to indicate how PII will be masked. If you do not wish to mask a particular PII data type, click Back and deselect the applicable Active checkbox in the Data Class or Data Class Group tab. Then return to this page and finish assigning rules to your data classes.
Once finished, click Next to begin specifying the RDB silo that will be searched and masked.
On this page, choose the Connection profile associated with the database that you intend to scan from the dropdown menu. You can also create a new connection profile by clicking New Profile and following the steps.
When a connection profile has been chosen, select the name of the schema that will be accessed in the dropdown menu inside the form field Schema name. Unlike FieldShield, DarkShield RDB jobs will handle only one schema at a time.
The [Include] pattern option involves specifying a Java regular expression that defines what tables to include in the DarkShield search or masking job in the schema selected. Any table names in the source schema that do not match the Include pattern will be ignored at runtime.
The [Exclude] pattern option is for specifying a Java regular expression that defines what tables to exclude in a search or masking job on the schema selected. Any table names in the source schema that match the exclude pattern will not be included in the job.
The [Fetch Size] option defines the number of rows of a table to be processed in a single batch. The default value is 1024 rows. Generally, a smaller fetch size will reduce performance, but use less memory. After a certain point (approximately 1024 rows), there is usually minimal improvement in performance (i.e. time to run a job) compared to the amount of memory used.
The [Row Limit] option limits the maximum number of rows processed (searched and/or masked) in each table.
Once satisfied with the source schema selection click Next to optionally filter data to search or mask by their JDBC-defined data type. A list of JDBC data types can be selected from the JDBC Data Type Selection page to filter the scope of searching to only columns of the data types selected.
By default, all JDBC types are set to true. DarkShield will not search or mask any types you uncheck. Narrowing your job scope in this way can improve DarkShield’s overall performance.
For example, you may have binary columns containing files that may or may not contain sensitive data but your data governance team has decided those files are not required and will be purged from the database. Being able to filter out the LONGVARBINARY type will improve DarkShield’s speed.
Once you are finished filtering your input, click Next to move on to the Target Schema page where you will specify the destination for the masked content. Note that if your source and destination are the same, your original data will be overwritten!
Any driver properties in the connection profiles used for source and target databases are passed through to the DarkShield API.
For the target of the operation, select a database from the list of connection profiles, then select an available schema in that database. Optionally, specify whether foreign keys will attempt to be automatically disabled in the target schema when running a DarkShield job. Any foreign keys that were disabled when running a DarkShield job will be automatically reenabled before the completion of the DarkShield job. This option will only have any effect when the target schema is in a SQL Server, MySQL, Oracle, PostgreSQL, or DB2 database.
Tables will attempt to be generated automatically based on the DDL of the source; however, this may fail if the target database is different from the source database due to differences in DDL between different databases.
If there are existing tables in the database with the same name as tables being processed in the source, any data in those tables will be truncated at the start of a DarkShield job before data is inserted. If the target is the same as the source, then data will be updated as the job runs.
Click Finish to generate your DarkShield Job file with the extension .dsc.
Project containing RDB .dsc file
DarkShield Job Editor
Every .dsc file can be viewed from a DarkShield Job editor. This editor allows you to modify your DarkShield job parameters after you completed the steps of the DarkShield RDB Wizard; e.g.,
You can add, edit, or remove a source and/or target from your .dsc file as desired by clicking the Edit… button alongside the source or target.
From the editor, you can also modify your Data Class Rule Mappings by clicking the Modify button. It is also possible to choose a different IRI Library and/or rearrange the Masking Rules assigned to your Data Classes.
Finally, the editor also provides a preview option that allows you to test your Data Class search matchers and Masking Rules using text input. By clicking Preview, you can see what PII was found and how it was transformed using the current Data Classes and Masking Rules.
Running Your Search and Masking Jobs
You can use your DarkShield job configuration in three different ways; i.e., in a:
- Search Job to simply identify PII and log the results to file. Be aware of search results logged to file (.*annotations.json) from a search job, which may contain PII found in your data source(s). DarkShield (Base) and File API will save JSON files in your workspace but DarkShield NoSQL and RDB API will store search results in the directory specified in DarkShield API configuration file.;
- Masking Job that will use the search log to mask the discovered PII (not supported with RDBs); or,
- Search and Masking Job to search and mask PII in one job.
In this demonstration, we will be running a DarkShield Search and Mask Job. To run a DarkShield Search and Mask Job right, click the .dsc file and select IRI > Run Search and Masking Job.
After running a Search and Masking Job, the PII data found will be masked and placed in the target silo you previously specified.
Following is an example of my source and target tables, showing their mix of structured and unstructured data before and after a DarkShield search and masking operation:
Table DARKSHIELD from IRI schema containing various PII and binary inside the PDF column
Binary in PDF column converted to PDF with unprotected PII
Table DARKSHIELD from MASKED schema containing various PII and binary inside the PDF column
Binary in PDF column converted to PDF with PII-protected