Update June 2015: This wizard was renamed from the Data Restructuring Wizard to the Dark Data Discovery Wizard, and is provided free in the IRI Workbench, and for NextForm Lite users.
Update October 2018: This wizard is used with both the IRI CellShield Enterprise Edition (EE) and IRI DarkShield products for searching, extracting, and masking PII in multiple LAN-connected sources at once, and is being enhanced with value lookup, machine-learned NLP models for NER, and fuzzy search criteria. Additional blog content on DarkShield uses will follow.
This is the second of a three-part blog series introducing IRI’s unstructured data discovery technology. The first article introduced the unstructured “dark data” sources that IRI’s data restructuring wizard supports. This article shows how the wizard works. The third article shows how the restructured data can be used by all IRI software products.
The idea of dark data in unstructured sources and formats was introduced in Finding Dark Data in Unstructured Sources, which also introduced the IRI Data Restructuring Wizard. Recall that corporations and government agencies may have a lot of useful information trapped in these unstructured formats that can be mashed up with other (usually structured) repositories and mined for the benefit of operations, promotions, analytics, law enforcement, etc. However, some of these sources are difficult to parse, and the data they contain need structure to be useful in data integration and reporting contexts. This is where IRI’s Data Restructuring Wizard is useful; it unlocks and organizes dark data so it can start delivering real value to the business.
The Data Restructuring Wizard is bundled with the Unstructured Data edition of the IRI NextForm data and database migration product. The wizard is available through the IRI Workbench GUI for NextForm, built on Eclipse™. It can also be made available to IRI CoSort and IRI FieldShield users for data integration and data masking purposes, respectively. All three IRI products can replicate and report on the data the wizard produces as well.
The general idea is that, after parsing through the data in unstructured sources, you can output what you’re looking for into a structured text (flat) file, with its layouts automatically defined in a data definition file (.DDF). The file and its metadata repository are easily used and re-used by IRI software to integrate, transform, migrate, mask, and report on that data, and/or feed it to other applications.
Note also that CoSort can query and join flat files directly, or facilitate the creation and population of tables with DBA-defined primary and foreign keys. In this way, dark data extracts can acquire form and relationships (structure) that make them a lot more useful.
Using the Wizard
IRI’s Data Restructuring wizard will search every supported unstructured document type in every directory below the root network drive you specify. The search for your dark data is based on parse patterns (regular expressions) and keywords. As a reminder, here is a list of unstructured sources that the wizard can analyze and structure:
- Free-form text (.txt)
- Microsoft Word documents (.doc and .docx)
- Adobe Portable Document Format (.pdf)
- Extensible Markup Language (.xml)
- E-mail messages (.eml)
- Microsoft Excel spreadsheets (.xls and .xlsx)
- Microsoft PowerPoint presentations (.ppt and .pptx)
- Microsoft Exchange and Outlook (.ost and .pst)
- Rich Text Format (.rtf)
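To make the search behavior concrete, here is a minimal Python sketch of a recursive, pattern-based scan. It is not the wizard's implementation: the function name `find_matches` and the `TEXT_EXTENSIONS` set are hypothetical, and the sketch only reads plain .txt files, since the other formats in the list above would each need a format-specific text extractor.

```python
import os
import re

# Hypothetical subset of extensions; only plain text is parsed in this sketch.
TEXT_EXTENSIONS = {".txt"}


def find_matches(root_dir, pattern, extensions=TEXT_EXTENSIONS):
    """Walk root_dir recursively and yield (path, matched_text) pairs."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            # Skip any file whose extension was not selected for searching.
            if os.path.splitext(name)[1].lower() not in extensions:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for match in regex.finditer(handle.read()):
                    yield path, match.group(0)
```

Calling `find_matches` with a parent directory and a pattern such as `r"\b\d{3}-\d{2}-\d{4}\b"` would yield each file path together with every value in that file matching the pattern.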
All of the information the wizard needs from you is solicited on the first page:
Use the Parent Search Directory field to specify the uppermost path to search. Indicate the types of documents to be searched by checking the relevant file extensions.
You can also profile several different forensic aspects of the dark data you’re discovering. The wizard can identify and display the creation, modification, and access dates of the data source, as well as its full path, owner, linkage, and hidden attributes.
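For readers curious about where such attributes come from, the following Python sketch collects comparable file-system metadata with the standard library. It is an illustration under stated assumptions, not the wizard's code: `file_attributes` and its dictionary keys are hypothetical names, the hidden check uses the POSIX leading-dot convention only, and the meaning of `st_ctime` differs by platform (creation time on Windows, last metadata change on Unix).

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def file_attributes(path):
    """Return a dict of basic file-system attributes for one file."""
    p = Path(path)
    info = p.lstat()  # lstat describes a symbolic link itself, not its target

    def iso(timestamp):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()

    try:
        owner = p.owner()  # not supported on every platform
    except (NotImplementedError, KeyError, OSError):
        owner = None

    return {
        "full_path": os.path.abspath(path),
        "modified": iso(info.st_mtime),
        "accessed": iso(info.st_atime),
        # Creation time on Windows; last metadata change on Unix.
        "created_or_changed": iso(info.st_ctime),
        "owner": owner,
        "is_link": p.is_symlink(),
        # POSIX naming convention only; Windows stores a separate hidden flag.
        "hidden": p.name.startswith("."),
    }
```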
There are a few ways to define the values to find:
- Enter a specific value to find.
- Use regular expressions to search for specific patterns. If you are not familiar with regular expressions, a lot of assistance is available on the internet, including on Wikipedia. IRI also provides examples in the wizard’s easy-to-use context help.
- Provide an IRI Set file for a dictionary search. A dictionary search is similar to searching for a specific value, except that instead of searching for one value, you search for any of the many values listed in a file.
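The difference between a single-value search and a dictionary search can be sketched in Python as follows. This is only an illustration: the helper names `load_set_file` and `dictionary_search` are hypothetical, and the sketch assumes a lookup file with one value per line, which may not cover every set file layout.

```python
import re


def load_set_file(path):
    """Read one lookup value per line, ignoring blank lines (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def dictionary_search(text, values):
    """Return every whole-word occurrence in text of any value in values."""
    if not values:
        return []
    # Try longer entries first so they win over shorter entries they contain.
    ordered = sorted(values, key=len, reverse=True)
    alternation = "|".join(re.escape(value) for value in ordered)
    regex = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return [match.group(0) for match in regex.finditer(text)]
```

Building one combined pattern from the whole list means each document is scanned once, rather than once per dictionary entry. The word-boundary anchors assume that entries start and end with letters or digits.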
Choose the delimiter character to offset the fields in the flat results file, such as a comma, or “|” as shown.
Finally, specify the folder and file names for the structured output file and the data definition file (DDF) metadata for that file. The field names in the DDF will correspond to the keywords and patterns you searched, as well as the forensic attributes that you selected to be part of the output file.
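As a rough picture of what a delimited results file and its field layout look like, here is a short Python sketch. It does not reproduce the DDF syntax or the wizard's output logic: `write_results` and `describe_layout` are hypothetical helpers, and the layout is returned as plain (name, position) pairs purely for illustration.

```python
import csv


def write_results(rows, out_path, delimiter="|"):
    """Write each row of search results as one delimited line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
        writer.writerows(rows)


def describe_layout(field_names):
    """Pair each field name with its 1-based position in the flat file."""
    return [(name, position) for position, name in enumerate(field_names, start=1)]
```

For example, one row holding a file path and a matched value, written with the default delimiter, produces a line with the two fields separated by "|", while `describe_layout` records which field name belongs to which position.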
Once you have entered the required information in the wizard, click Next to start the search and create the new files. A preview screen shows the restructured data that is returned in the search. You can display up to 50 lines here:
Click Finish to output the .txt file and the data definition file (DDF) describing the layout of the resulting flat file:
So, your now-structured data is stored in a file you can use (repeatedly) for any purpose. And within the same Eclipse IDE, the IRI Workbench, you now have access to this data and its DDF for:
- Data Integration and Transformation
- Data Migration and Replication
- Data Masking (Encryption, De-ID, etc.)
- DB Load and Query Optimization
- Reporting or Hand-offs to BI Tools
- Population of CRM, DB, ETL, and External Apps
See how to use the newly structured output file and its DDF in the next article, Using CoSort on Restructured Data in the IRI Workbench.