Big Data Transformations with CoSort (Structured Data)

by David Friedland

In 1992, Digital Equipment Corporation (DEC, long since acquired) asked IRI to develop a 4GL interface to CoSort in the syntax of the VAX VMS sort/merge utility. The result was the now widely adopted Sort Control Language (SortCL) program, which is used to define data layouts and manipulations that go well beyond sort/merge.

SortCL now handles everything from data transformation and reporting to data migration and protection, and is the core of multiple spin-off products and a metadata infrastructure modeled and managed in the IRI Workbench GUI, built on Eclipse™.

In 1999, Database Trends Magazine studied the data transformation functions then in SortCL and labeled CoSort “The ETL Engine” in an edition dedicated to data warehousing. Indeed, since the mid-1990s, hundreds of DW architects and thousands of EDW, ODS and database users around the world have deployed SortCL scripts directly, or within applications they use, to transform massive amounts of sequential data with built-in functions they can run alone or in combination, such as:

Sort/Merge	Match/Join
Select/Filter	Aggregate
Find/Replace	PCRE
Lookup	Pivot
Rank	Scrub/Cleanse
Remap/Reformat	Substring
Convert	Validate

In addition to the price-performance advantages made possible with the underlying CoSort engine and its

linearly scaling, multi-threaded, co-routine sorting algorithm
sophisticated memory management and good neighbor I/O
same-script/same-pass marriage of sorting to joins and aggregations
thread-safe APIs, and custom input, compare, output, and field functions,

SortCL also delivers on the promise of open systems by being:

cross-platform, by running on every flavor of Unix and Windows with the same scripts
self-documenting, via a language familiar to both mainframe and SQL users
easily invoked, and widely interconnected to third-party applications
interchangeable, through scripts you can easily convert to and from.
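
To make the idea of pairing a sort with an aggregation in one job more concrete, here is a minimal Python sketch. It is not SortCL syntax and does not reflect how the CoSort engine is implemented; the sample records and the `sort_and_aggregate` function name are hypothetical, chosen only to show detail rows and per-key totals being produced from a single pass over sorted data.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records: (region, amount)
rows = [
    ("west", 10.0),
    ("east", 5.0),
    ("west", 2.5),
    ("east", 1.5),
    ("north", 4.0),
]

def sort_and_aggregate(records):
    """Sort by the first field, then build detail rows and
    per-key totals while walking the sorted data once."""
    # sorted() is stable, so rows with equal keys keep their input order
    ordered = sorted(records, key=itemgetter(0))
    detail, summary = [], []
    for key, group in groupby(ordered, key=itemgetter(0)):
        total = 0.0
        for rec in group:
            detail.append(rec)   # detail output
            total += rec[1]      # running aggregate for this key
        summary.append((key, total))
    return detail, summary

detail, summary = sort_and_aggregate(rows)
```

In this sketch, the sorted data is read only once after the sort: the detail list and the summary list are both filled during the same loop, which is the general shape of a combined sort-and-aggregate job.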

IRI’s sweet spot in the market remains the integration and staging of huge flat files, which include bulk database extracts (e.g., from IRI FACT operations), mainframe datasets, web and IoT device logs, spreadsheet and application exports, PoS server and telco switch (CDR) feeds, COBOL and shell programs, and so on. With CoSort (SortCL) running in IRI Voracity workflows that include FACT (E) and table creation and bulk load (L) steps, end-to-end ETL jobs are built and run quickly in Eclipse or on the command line.

In Voracity, most SortCL jobs can run either in the default CoSort engine, or seamlessly in Hadoop MapReduce, Spark, Spark Stream, Storm, or Tez. Either option offers a high-speed, simple, and low-cost approach, and moving between them requires no code changes.

More advanced users can write custom detail and summary reports and protect data at the field level in the same SortCL job script and I/O pass as their transforms. Data in HDFS, in unstructured sources, or in otherwise non-sequential/non-relational formats can pass through drivers, or through memory via custom input procedures that structure that data and feed it to CoSort (or Hadoop) for fast transformations and hand-offs to DB loads, data marts, visualization tools, etc.
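
As a rough illustration of what a custom input procedure does conceptually, the Python sketch below structures semi-structured log lines into fixed-layout records and feeds them to a sort. It does not use the CoSort API; the `structured_records` function, the JSON field names (`ts`, `device`, `value`), and the sample lines are all assumptions made for the example.

```python
import json

def structured_records(lines):
    """Hypothetical input procedure: turn JSON log lines into
    (timestamp, device, value) tuples, skipping lines that do not fit."""
    for line in lines:
        try:
            obj = json.loads(line)
            yield (obj["ts"], obj["device"], float(obj["value"]))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, a missing field, or an unusable value
            continue

# Hypothetical raw input, including two lines that cannot be structured
raw = [
    '{"ts": "2020-01-02T00:00:00", "device": "b", "value": "3"}',
    'not json',
    '{"ts": "2020-01-01T00:00:00", "device": "a", "value": 1.5}',
    '{"device": "c"}',
]

# The generator feeds records to the sort as they are produced
sorted_records = sorted(structured_records(raw))
```

The generator keeps the structuring step separate from the transformation step, so the sort only ever sees records with a known layout.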

