Finding Dark Data in Unstructured Sources

by Sharon Hewitt

Update April 2019: Added more unstructured file formats.

This is the first of a three-part blog series introducing IRI’s new data structuring technology. This article defines “dark data” and lists the unstructured sources IRI now supports. The second article shows how the Data Restructuring wizard works, and the third shows how the restructured data can be used by all IRI software products.

According to Gartner Analyst Douglas Laney, “enterprise dark data” is “unutilized or underutilized information, collected generally for a single purpose — then forgotten or archived.”¹ Much of the dark data that corporations have is in unstructured data repositories.

What is unstructured (vs. structured) data? According to Wikipedia, unstructured data is “information that either does not have a pre-defined data model or is not organized in a pre-defined manner.” It’s data that is not organized or classified in a way that makes it easy to group by subject. It’s mostly textual, but it can also be images, audio, and video.

And let’s not forget social media. Facebook, Twitter, LinkedIn, and Pinterest, to name just a few, all contain unstructured and semi-structured data. That data can be very valuable to businesses large and small, but it needs to be structured before it becomes useful.

Structured data is of course the opposite of unstructured data. Webopedia defines structured data as “data that resides in a fixed field within a record or file.” It’s organized, and relies on a model that determines how the data is stored, processed, and accessed. Structured Query Language (SQL) is often used for managing structured data in database tables, just as SortCL data definition files (DDF) in IRI CoSort define the layouts of external, flat files.
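To make the “fixed field within a record” idea concrete, here is a minimal Python sketch that slices a fixed-width record into named fields. The layout, field names, and sample record are hypothetical and chosen only for illustration; this is not DDF syntax and not how SortCL itself is implemented.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    start: int   # zero-based offset of the field within the record
    length: int  # field width in characters


# Hypothetical layout for illustration only (not an actual IRI DDF).
LAYOUT = [
    Field("id", 0, 5),
    Field("name", 5, 10),
    Field("amount", 15, 8),
]


def parse_record(line: str, layout=LAYOUT) -> dict:
    """Slice one fixed-width record into named, whitespace-trimmed fields."""
    return {f.name: line[f.start:f.start + f.length].strip() for f in layout}


# Build a sample record whose fields sit at the offsets the layout expects.
record = "00042" + "Alice".ljust(10) + "00123.45"
parsed = parse_record(record)
print(parsed)
```

Because every field lives at a known position, the same layout can be applied to every record in the file, which is what makes structured data easy to sort, join, and report on.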

Semi-structured data is a cross between both structured and unstructured data. It has structured data but doesn’t fit into the formal models of relational databases or other sequential sources. Legacy (mainframe index) files are a good example of this hybrid, because they consist of structured elements and proprietary layouts. Many XML files may fall into this category, too, although there are also tons of flat (structured) and unstructured (free-form) XML documents.
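As a rough illustration of why semi-structured data needs an extra step, the Python sketch below flattens nested JSON documents whose fields vary from record to record into rows that share one column list. The sample documents and the dotted column naming are assumptions made for this example, not a description of any IRI feature.

```python
import json


def flatten(doc: dict, prefix: str = "") -> dict:
    """Turn nested objects into a single level of dotted keys."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical documents: each has structure, but not the same structure.
docs = [
    json.loads('{"name": "Ann", "contact": {"email": "ann@example.com"}}'),
    json.loads('{"name": "Bob", "contact": {"phone": "555-0100"}, "vip": true}'),
]

rows = [flatten(d) for d in docs]
# The union of all keys becomes the fixed column list of the output.
columns = sorted({key for row in rows for key in row})
# Missing values are filled with an empty string so every row has every column.
table = [[row.get(col, "") for col in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```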

IRI software traditionally handled big data only in structured sources; that is, all kinds of flat-file formats, plus relational database tables that are extracted or reached via ODBC. But now it can also extract, structure, and process data in several semi-structured and unstructured data sources, including:

Unstructured Files (using the Data Restructuring wizard in the IRI Workbench GUI, built on Eclipse™)

Free-form text (.txt)
Microsoft Word documents (.doc and .docx)
Adobe Portable Document Format (.pdf)
Extensible Markup Language (.xml)
E-mail messages (.eml)
Microsoft Excel spreadsheets (.xls and .xlsx)
Microsoft PowerPoint presentations (.ppt and .pptx)
Microsoft Exchange and Outlook (.ost and .pst)
Rich Text Format (.rtf)
Hypertext Markup Language files (.html)
JavaScript Object Notation files (.json)
Various image formats (.tiff, .jpeg, .png, .gif, .jp2, .jpx, .bmp)

Semi-structured Files

ASN.1 call detail record (CDR) files (via a CoSort / SortCL input procedure)
C-ISAM, IMS, QSAM, VSAM and other mainframe files (using partner ODBC drivers)
MF-ISAM and Vision index files (using embedded Micro Focus libraries)
MongoDB (JSON) and XML (using JDBC drivers in IRI Workbench)

All of this structuring is done by the Data Restructuring wizard, which is bundled with the Unstructured Data edition of the IRI NextForm data and database migration product. The general idea is that, after parsing the data in unstructured sources, you can output what you’re looking for into a structured text (flat) file, with its layout automatically defined in a data definition file (.DDF). The file and its metadata can then be used and re-used by IRI software, and/or fed to other applications, all within the same Eclipse IDE, the IRI Workbench, for:

Data Integration and Transformation
Data Migration and Replication
Data Masking (Encryption, De-ID, etc.)
DB Load and Query Optimization
Reporting or Hand-offs to BI Tools
Population of CRM, DB, ETL, and External Apps
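The parse-then-output idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions, the four-column layout, and the sample text are hypothetical, and the sketch does not reflect how the Data Restructuring wizard is implemented or what a generated DDF looks like.

```python
import csv
import io
import re

# Hypothetical search patterns; a real job would define its own.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_rows(source_name: str, text: str) -> list:
    """Scan free-form text and return (source, line, type, value) tuples."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in (("email", EMAIL), ("date", DATE)):
            for match in pattern.finditer(line):
                rows.append((source_name, line_no, kind, match.group()))
    return rows


def to_flat_file(rows: list) -> str:
    """Render the rows as a comma-delimited flat file with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "line", "type", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


sample = (
    "Contact jane.doe@example.com before 2019-04-30.\n"
    "No data here.\n"
    "Escalate to ops@example.org."
)
rows = extract_rows("memo.txt", sample)
flat = to_flat_file(rows)
print(flat)
```

Once the values of interest sit in a delimited file with a known column layout, any downstream tool that reads flat files can sort, mask, or load them.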

To see how to use the Data Restructuring wizard, read the next article, “Using the Data Restructuring Wizard to Unlock Unstructured Data.” To see how to use the newly structured output file and its DDF in all IRI software, read “Using CoSort on Restructured Data in the IRI Workbench.”

¹ Gartner, “Answering Big Data’s 10 Biggest Vision and Strategy Questions,” Douglas Laney et al., August 12, 2014, p. 5.
