
Searching and Masking Unstructured File Data with DarkShield
This article is part of a series explaining how to use DarkShield and its many functions.
The IRI DarkShield data masking product uses fit-for-purpose facilities in the IRI Workbench IDE to search (classify) and mask (remediate) sensitive “dark data” (as defined by Gartner) in many semi-structured and unstructured data sources, including JSON, XML, free-form text files, PDFs, MS Office documents, and images.
DarkShield can access files on local or networked drives as well as S3 buckets through the “New Dark Data Search/Masking Job …” wizard, formerly known as the Dark Data Discovery wizard and the IRI Data Restructuring wizard. The previous iterations of the wizard only supported scanning and extracting sensitive values that matched Java RegEx patterns and Set File lookups.
Today’s wizard supports more search methods, and of course simultaneous or separate masking operations, for DarkShield or Voracity users. The diagram below summarizes DarkShield’s architecture as part of the overarching Voracity platform, where the wizard this article explains is inside Workbench:
Although not discussed in this article, DarkShield also includes wizards for other sources: the New NoSQL Search/Masking Job … wizard for MongoDB, Cassandra, and Elasticsearch, and the New Dark Data Schema Search/Masking Job … wizard for sensitive data that is fixed or floating in RDB columns. DarkShield users can also perform facial detection, recognition, and obfuscation in a separate module, available on request.
What this Wizard Does
The scanning and remediation of your dark data files is based on data you classify centrally in Workbench (see this article) and associate with one or more search methods. Those methods match data to: Java regular expressions (Regex); lookup (set file) values; Named Entity Recognition (NER) models; path filters for semi-structured files; and area bounding boxes for images.
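To make the first two search methods concrete, here is a minimal Python sketch of pattern matching and set-file lookup. It is only an illustration of the ideas, not DarkShield's implementation: DarkShield uses Java regular expressions, the patterns below are simplified assumptions rather than the predefined data classes, and the function names and the `names` set (standing in for the contents of a set file) are hypothetical.

```python
import re

# Simplified, hypothetical patterns for illustration only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def find_pattern_matches(text, pattern):
    """Return (start, end, value) for every regex match in the text."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def find_lookup_matches(text, set_values):
    """Return (start, end, value) for each alphabetic token found in the lookup set."""
    matches = []
    for m in re.finditer(r"[A-Za-z]+", text):
        # Case-insensitive comparison against the lookup values.
        if m.group().lower() in set_values:
            matches.append((m.start(), m.end(), m.group()))
    return matches


sample = "Contact Alice Smith at a.smith@example.com, SSN 123-45-6789."
names = {"alice", "bob"}  # stands in for the lines of a set file

ssn_hits = find_pattern_matches(sample, SSN_PATTERN)
email_hits = find_pattern_matches(sample, EMAIL_PATTERN)
name_hits = find_lookup_matches(sample, names)
```

Each hit carries its offsets as well as its value, which is the kind of information a later masking step needs in order to replace the value in place.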
The file formats containing strings that this wizard can search, extract, and mask include:
- Free-form text (.txt)
- Microsoft Word documents (.doc and .docx)
- Adobe Portable Document Format (.pdf)
- Extensible Markup Language (.xml)
- E-mail messages (.eml)
- Microsoft Excel spreadsheets (.xls and .xlsx)
- Microsoft PowerPoint presentations (.ppt and .pptx)
- Microsoft Exchange and Outlook (.osd and .pst)
- Rich Text Format (.rtf)
- Hypertext Markup Language files (.html)
- JavaScript Object Notation files (.json)
- Various image formats (.tiff, .jpeg, .png, .gif, .jp2, .jpx, .bmp)
Using the Wizard
In this article, I will demonstrate the use of the New Dark Data Search/Masking Job… wizard on some of the file types listed above. I will be searching for Names, Emails, and Social Security Numbers using RegEx and lookup value matchers.
Among all the documents I will be searching and masking are this sample text and Excel file:
To open the wizard, open the DarkShield menu dropdown and select the New Dark Data Search/Masking Job… wizard. This brings up the first page, where you can name the new job:
Here you will also specify the folder and file names for the metadata output of the wizard.
You can also select whether the wizard will generate a Data Definition Format (DDF) file, which is a metadata repository defining the layout of the flat file containing your search results.
The DDF syntax is recognized by, and used directly in, SortCL data transformation and reporting jobs supported in IRI Workbench. The field names in the DDF file will correspond to the keywords and patterns you searched, as well as the forensic attributes that you selected to be part of that output file.
Click Next to move into the data source specification (files to be masked) page of the wizard.
Click Add to add your data source location.
Note that this example will only address files in my local (PC’s) file system, but DarkShield supports more sources natively through this interface, and the DarkShield RPC API can support data in any source or silo through custom code.
Select any combination of sources, which again are file system (local) directories, SMB shares and S3 buckets, along with the list of file types which should be searched.
The [Include] and [Exclude] fields accept regular expressions that include or exclude files based on their names. The recursive checkbox lets the search descend into subfolders of the source folder.
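The effect of these options can be sketched in a few lines of Python. This is an assumption-laden illustration, not the wizard's code: the function `select_files` and its parameters are hypothetical, and whether the wizard tests a pattern against part of the name or the whole name is not stated here, so the sketch simply uses a partial match.

```python
import re
from pathlib import Path


def select_files(root, include=None, exclude=None, recursive=True, extensions=None):
    """Yield files under root whose names pass the include/exclude regex filters."""
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    # rglob descends into subfolders; glob stays in the top-level folder.
    candidates = Path(root).rglob("*") if recursive else Path(root).glob("*")
    for path in sorted(candidates):
        if not path.is_file():
            continue
        if extensions and path.suffix.lower() not in extensions:
            continue
        if include_re and not include_re.search(path.name):
            continue
        if exclude_re and exclude_re.search(path.name):
            continue
        yield path
```

For example, `select_files("docs", include=r"report", exclude=r"^draft", extensions={".txt"})` would yield the `.txt` files anywhere under a hypothetical `docs` folder whose names contain "report" but do not start with "draft".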
Once a location and file types are selected, click OK to specify the location(s) of the files I will search and mask:
This reveals my root directory from which the searches will occur. It is also possible to add more sources for the search here, and assign a priority to them in the search (and mask) operation.
When finished, click Next to go on to the Data Target page.
Click Add to create or select a target location. Selecting a target is optional if you only want to run search operations, but a target must be specified before a masking operation can be performed.
If needed, the source location can be designated as that target; i.e., the original documents will be overwritten with the masked version. Be careful with that capability, however, as reversal of the masking operation may not be possible.
Once the location is selected, click OK.
The Add … option allows additional target specifications.
Indeed, multiple disparate sources and targets can be specified in the wizard. DarkShield will search and mask all files found in the source URIs and replicate the masked files across all of the target URIs. The sources and targets can be any combination of local file directories, networked drives or S3 buckets.
Click Next to go to the Metadata Selection page.
This page will let you select the metadata that will be displayed as additional columns in the flat text file containing the values (and specified metadata) from the search operation. The default delimiter is a pipe (“|”) but it can be changed to another character.
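Because the results file is plain delimited text, it is easy to post-process outside of Workbench as well. The Python sketch below counts findings per data class, assuming the first field of each row is the data class name and the second is the matched value (as the DDF described later in this article indicates). The sample rows, the single metadata column, and the absence of a header row are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def count_by_first_column(lines, delimiter="|"):
    """Count rows per value of the first field in a delimited results file."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(row[0] for row in reader if row)


# Hypothetical rows: data class name, matched value, then one metadata column.
sample = io.StringIO(
    "EMAIL|a.smith@example.com|notes.txt\n"
    "SSN|123-45-6789|notes.txt\n"
    "EMAIL|b.jones@example.com|staff.xlsx\n"
)
counts = count_by_first_column(sample)
```

To read a real results file, pass an open file object instead of the `io.StringIO` sample, and change `delimiter` if you selected a character other than the pipe.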
Recall that we earlier specified the creation of the optional DDF to facilitate processing this file using CoSort or other compatible SortCL applications (like FieldShield and NextForm) in Voracity. The results of the search and the DDF are shown at the end of this article.
Click Next when finished to move on to the specifics of the data you are trying to find, and how (or whether) it should be masked, by assigning those attributes to your data classes as “Search Matchers.”
Here is where the type of data you want to find, and how it should be masked, can be specified through search methods and masking rules. This part of the wizard will allow the creation and selection of data classes or data class groups, although those are usually pre-defined in IRI Workbench under Windows > Preferences > IRI > Data Classes and Groups.
Click Add to start adding Search Matchers.
If you already have a Data Class or Group created, select Browse and select the Data Class in mind. If you do not have one, select Create.
Here you can make your own data class or group, or you can select one of several predefined classes like credit cards, names, and social security numbers. Read more about the creation and modification of data classes and their associated search methods here.
For this article, we will only be using the predefined Data Classes that use set file lookups and Java Regular Expression (Regex) pattern matches. Data Classes that use image bounding boxes or Named Entity Recognition (NER) matchers will be described in a separate article.
After the Data Class or Group is selected, a Data (masking) Rule must also be added. Data Rules tell DarkShield the type of masking function to apply to each class (type) of data you specified. If you have a Data Rule created already, select Browse, otherwise click Create.
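Conceptually, pairing a data class with a data rule means "for every value this matcher finds, apply this function." The Python sketch below illustrates that pairing with two generic techniques, partial redaction and hashing. These are stand-ins chosen for the example, not DarkShield's masking functions, and every name in the code (`redact_keep_last4`, `hash_value`, `RULES`, `mask_text`) is hypothetical.

```python
import hashlib
import re


def redact_keep_last4(value):
    """Replace every digit except the last four with '*'."""
    total_digits = sum(ch.isdigit() for ch in value)
    digits_seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            digits_seen += 1
            out.append(ch if digits_seen > total_digits - 4 else "*")
        else:
            out.append(ch)
    return "".join(out)


def hash_value(value):
    """Replace the value with a short, non-reversible SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Each data class pairs a search pattern with the rule applied to its matches.
RULES = {
    "SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), redact_keep_last4),
    "EMAIL": (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), hash_value),
}


def mask_text(text, rules=RULES):
    """Apply each class's rule to every match of that class's pattern."""
    for pattern, mask in rules.values():
        text = pattern.sub(lambda m, mask=mask: mask(m.group()), text)
    return text
```

With these rules, `mask_text("SSN 123-45-6789")` returns `"SSN ***-**-6789"`, while an email address is replaced by a 12-character digest.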
Read more about available DarkShield masking functions here, as exposed in this dialog:
After you select and configure each Data Rule from this dialog, click Finish.
For structured and semi-structured files, DarkShield also allows you to specify filters inside of a Search Matcher to search for PII in certain parts of the file by clicking Add next to the Filters table. For example, JSON and XML paths can be specified for searching through particular keys and tags within JSON and XML files respectively, as demonstrated in this article.
This method allows you to bypass searching through all that raw data to find the matches, though you could also combine it with raw data searches. You may also use table header or index filters for searching specific rows/columns in tabular files like CSV or Excel files.
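The idea behind a path filter can be sketched in Python as follows: walk the parsed JSON, and only run the pattern against values whose key path is on an allow list. The dotted-path notation here is a simplification invented for this example and is not the filter syntax used in the wizard; the function names and the sample document are likewise hypothetical.

```python
import json
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def iter_leaves(node, path=""):
    """Yield (dotted_path, value) for each scalar leaf.

    List items share the path of their parent key.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for child in node:
            yield from iter_leaves(child, path)
    else:
        yield path, node


def search_filtered(document, allowed_paths, pattern):
    """Search only the string values located at one of the allowed paths."""
    hits = []
    for path, value in iter_leaves(document):
        if path in allowed_paths and isinstance(value, str):
            hits.extend((path, m.group()) for m in pattern.finditer(value))
    return hits


doc = json.loads(
    '{"customer": {"ssn": "123-45-6789", "notes": "alt 987-65-4321"},'
    ' "orders": [{"id": "1"}]}'
)
hits = search_filtered(doc, {"customer.ssn"}, SSN_PATTERN)
```

Here only `customer.ssn` is inspected, so the second number sitting in `customer.notes` is not reported. Adding `"customer.notes"` to the allowed paths would include it.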
Click OK when finished to add your new Search Matcher(s) to the list.
Once you have entered the required information in the wizard, click Finish to generate a .search file. That file contains the configuration parameters that you have selected throughout this process.
Finally, if you specified its generation on the first page of the wizard, a SortCL-compatible Data Definition Format (DDF) metadata file is also created. It describes the layout of the flat file that a search job, or a search and masking job, produces. That DDF file allows you to manipulate and otherwise produce additional reports on those results using licensed data transformation and formatting functions.
Here is an example of the DDF file that my job produced:
Note that the first two fields in that script refer to the data class name (the type of data searched for), and its actual value (the result). The remaining fields are the metadata attributes which I specified earlier.
To execute just a search operation, right click on the .search file and select IRI > Run Search Job. This will generate the flat file containing the delimited results and metadata information:
A .darkdata file is also created that includes the names of the files with the values of the data classes, and attendant metadata for each source that the search operation found. DarkShield requires this .darkdata file to perform masking operations; see below. After masking is completed, however, you may wish to remove it since it contains the list of sensitive values.
Data Masking
The masking (data) rules were determined earlier in the search matcher step. To perform the masking job, right click the .search file and select IRI > Run Search and Masking Job. If you already have a .darkdata file, you can right click that and select IRI > Run Masking Job.
The process writes copies of the searched files to the target location that are identical to the originals, except that the values found in the search are masked.
Masked file:
Masked Excel File:
If you would like help using this wizard to scan and/or mask data in your files, please contact your IRI representative or email darkshield@iri.com.