This is the final article in a three-part blog series introducing IRI’s new data structuring technology. The first article introduced “dark data” and the unstructured sources the wizard can analyze. The second offered a deeper overview of how the Data Restructuring wizard works. This article shows how that newly structured data can be leveraged in the same GUI by various IRI software products.
The preceding article discussed the use of the Data Restructuring wizard in the IRI Workbench GUI to parse data from unstructured data sources. Recall that we extracted the data based on search patterns into a structured text file, and automatically defined the new layout in a data definition file (.DDF) available for use in any IRI CoSort or other IRI software application.
This article picks up where that one left off, and shows how a structured data file and its DDF can be used to process the extracted data in the same GUI for CoSort. The data can be transformed, reformatted, protected at the field level, and reported upon in the same job script and I/O pass in CoSort SortCL programs. SortCL outputs can be sent to multiple file and database targets at once.
In our first application, we will sort our text file by phone number and remove the duplicate record. We will also mask the first 12 digits in the credit card account numbers (green line in image below), and encrypt the social security numbers (purple line in image). A report header is also added to clarify the output (red line).
Note that the RegInfo.ddf metadata repository created by the Data Restructuring wizard is referenced in the input section to define the fields. In this example, the output is sent to standard output (stdout) so that it displays in the console.
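To make the logic of that job concrete outside of SortCL, here is a minimal Python sketch of the same steps: sort by phone number, drop the duplicate, mask the first 12 card digits, protect the SSN, and print a header. It is not SortCL syntax and not IRI's implementation. The sample records, field names, and function names are hypothetical, a duplicate is assumed to mean a record with a repeated phone number, and a keyed HMAC-SHA256 token stands in for encryption (unlike real encryption, it cannot be reversed).

```python
import hashlib
import hmac

# Hypothetical records standing in for the structured text file; the field
# names and values are illustrative, not taken from RegInfo.ddf.
SAMPLE_ROWS = [
    {"name": "C. Smith", "phone": "555-0199", "card": "4111-1111-1111-1111", "ssn": "123-45-6789"},
    {"name": "A. Jones", "phone": "555-0101", "card": "5500-0000-0000-0004", "ssn": "987-65-4321"},
    {"name": "B. Brown", "phone": "555-0150", "card": "3400-0000-0000-0009", "ssn": "111-22-3333"},
    {"name": "A. Jones", "phone": "555-0101", "card": "5500-0000-0000-0004", "ssn": "987-65-4321"},
]


def mask_leading_digits(value, count=12, fill="*"):
    """Replace the first `count` digits with `fill`, leaving separators intact."""
    out, masked = [], 0
    for ch in value:
        if ch.isdigit() and masked < count:
            out.append(fill)
            masked += 1
        else:
            out.append(ch)
    return "".join(out)


def protect_ssn(ssn, key=b"demo-key-not-for-production"):
    """Stand-in for encryption: a keyed HMAC-SHA256 token (one-way, not reversible)."""
    return hmac.new(key, ssn.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def build_report(rows):
    """Sort by phone, keep the first record per phone number, protect fields."""
    lines = ["PHONE     NAME        CARD                 SSN TOKEN"]  # report header
    seen = set()
    for row in sorted(rows, key=lambda r: r["phone"]):
        if row["phone"] in seen:  # assumption: a repeated phone number is a duplicate
            continue
        seen.add(row["phone"])
        lines.append(
            f'{row["phone"]:<10}{row["name"]:<12}'
            f'{mask_leading_digits(row["card"]):<21}{protect_ssn(row["ssn"])}'
        )
    return lines


if __name__ == "__main__":
    # Write the report to stdout, as the article's job does
    print("\n".join(build_report(SAMPLE_ROWS)))
```

In the sketch, masking, protection, and de-duplication all happen in the same loop over the sorted rows, which loosely mirrors the single-pass design described above.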
A change to the SortCL job script outputs the duplicate records instead of removing them.
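The inverse operation can be sketched in Python as well. This is only an illustration of the idea, not SortCL syntax. It assumes that "duplicate records" means the second and later records sharing a phone number (the ones a de-duplicating job would discard), and the sample rows and function name are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records; only the fields needed for the illustration
SAMPLE_ROWS = [
    {"name": "C. Smith", "phone": "555-0199"},
    {"name": "A. Jones", "phone": "555-0101"},
    {"name": "B. Brown", "phone": "555-0150"},
    {"name": "A. Jones", "phone": "555-0101"},
]


def duplicates_only(rows, key_field="phone"):
    """Return the records a de-duplicating sort on `key_field` would discard."""
    ordered = sorted(rows, key=itemgetter(key_field))
    dupes = []
    # groupby needs sorted input: equal keys must be adjacent
    for _, group in groupby(ordered, key=itemgetter(key_field)):
        members = list(group)
        dupes.extend(members[1:])  # assumption: the first record per key is the "original"
    return dupes


if __name__ == "__main__":
    for row in duplicates_only(SAMPLE_ROWS):
        print(row["phone"], row["name"])
```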
In conclusion, the unstructured data we first structured in the IRI Workbench has now been restructured in the same Eclipse GUI, and can be refit for other uses. SortCL job scripts can be varied in many ways and for many purposes, as can jobs in other IRI Workbench products that use .DDF files, including:
- IRI NextForm – for data migration, replication, federation and reporting
- IRI FieldShield – for data masking and encryption
- IRI RowGen – for test data generation
Let IRI know if you would like to see more packaging and provisioning manifestations of such big data processing, including ad hoc reports in BIRT, hand-offs to other analytic tools, pre-sorted bulk database loads, etc.
To see what unstructured data sources and dark data can be analyzed by the Data Restructuring wizard, see our article “Finding Dark Data in Unstructured Sources.” To see how to use the wizard to discover unstructured data, see our article “Using the Data Restructuring Wizard to Unlock Unstructured Data.”