IRI Blog Articles

Diving Deeper into Data Management

 

 

Post image for Optimizing Transforms in DataStage with CoSort

Optimizing Transforms in DataStage with CoSort

by Shyam Padamati

Comparing Filter/Sort/Join/Aggregate Performance

ITKeySource, an ETL consultancy in Jacksonville, FL, recently benchmarked relative performance gains running IRI CoSort — and its SortCL program in particular — alongside IBM DataStage.

CoSort has been used for years in this capacity in high volume data transformation environments, like this one on Solaris. But in this case, we tested them with this: 

Hardware:

    • Dell PowerEdge R230 Server
    • Intel® Xeon® E31230 v5 3.4GHz, 8M cache, 4C/8T
    • 16GB 2133MT/s UDIMMs
    • 1TB 7.2K RPM SATA 6Gbps 3.5in

and 

Software:

    • IBM InfoSphere DataStage® and QualityStage® 8.7
    • IRI Voracity® 1.0 & CoSort® 9.5.3
    • IRI Workbench, a free GUI option, built on Eclipse™:

IRIwbAbout

In this POC, I transformed flat files with combined row counts of 2MM, 5MM, 20MM, and 50MM in the CoSort SortCL program, IBM DataStage, and then together from within DataStage. Below are the times it took to sort, join, aggregate, and filter two input files and create two output files:

Rows

Left File

Right File

CoSort Alone

DataStage (DS) Alone

DS w/CoSort

2MM

76.8 MB

13.8 MB

11 seconds

22 seconds

12 seconds

5MM

192 MB

344.7 MB

29 seconds

60 seconds

32 seconds

20MM

76.8 MB

1378.8 MB

122 seconds

780 seconds

140 seconds

50MM

1920 MB

3447 MB

406 seconds

2940 seconds

310 seconds

In addition to the speed differences, which I try to account for in this article, I also explain the methodology by which I ran the same transform jobs … first alone, and then together. Before I do that, it should be noted that the raw data used for this job was randomly generated with the same SortCL executable using embedded ‘RowGen’ functionality.1

CoSort Process

The IRI CoSort product — and its SortCL data definition and manipulation 4GL program doing all the transforms — is shown above in the free IRI Workbench Eclipse GUI’s syntax-aware editor.

SortCL is fast primarily because it combines the transforms (and other functionality) in the same job script and I/O pass. It also uses the CoSort engine in the file system, which takes advantage of multi-threading, modern memory management and I/O optimization techniques. And, it is not a compiled program.

SortCL jobs can be hand-written, generated via a job wizard, or ETL workflow palette (with an IRI Voracity data management platform subscription2). The scripts are easy-to-read text files you can run (ad hoc or scheduled) in the GUI, or from the command line. That is how I called it in DataStage below.

The CoSort ‘SortCL’ script is shown above, with an outline, in the IRI Workbench (Eclipse) GUI. IRI Voracity ETL workflow and transform mapping diagrams of the same job are below those. All the metadata is: 1) re-entrant regardless of work-style, 2) manageable in eGit and other repositories, and 3) compatible with AnalytiX DS Mapping Manager CATfx templates for spreadsheet-style mapping definition, stratification, and conversion to/from other ETL tools.3

DataStage Process

Below is the DataStage job design required to sort and join the same two sequential files, then aggregate, and finally, filter them before writing out to the same two separate sequential files. In addition to taking longer to build and run those jobs in DataStage, they must be compiled:

Transforming with CoSort in DataStage

To combine the processes for the benefit of my DataStage operations, I wrote a simple to call my SortCL script to do the transformation ‘magic’ externally. Rather than using a traditional sequential file-stage call, it was more seamless to use a built-in, Before-job subroutine. This allows me to execute CoSort SortCL on the command line (sortcl /spec=join.scl) from my Windows (DOS) or Unix (Korn) shell, even before any DataStage processes begin:

The benefit of running CoSort in this environment is immediately apparent:

cosort v datastage

Given the same source data and targets, it is almost 10X faster at the 50-million-row level to in-source transformations to CoSort through a DataStage subroutine.

To process sources with more than 50 million rows, many DW architects would upgrade to either Parallel Extender or run DataStage on a Unix/Linux server license where the hardware provides more power. CoSort runs in those same environments with even more speed, however, and is much less expensive than either alternative.

If you would like more information or have additional suggestions, please provide feedback in the comment form below.

 


 

IRI RowGen is also a standalone spin-off product using SortCL syntax, a constituent product of the IRI Data Protector Suite, and a supported test data management facility in the IRI Voracity ‘total data management’ (and ETL) platform.

2 The IRI Voracity ETL package uses CoSort SortCL as its default data transformation program, but can also run most of the same jobs without changes in Hadoop MapReduce 2, Spark, Storm, or Tez.

3 AnalytiX DS conversion templates can automate the re-platforming of DataStage, Informatica, and several other legacy ETL packages, to Voracity’s considerably less expensive workflows.

Print Friendly

{ 1 comment… read it below or add one }

Mario Gómez March 31, 2016 at 3:00 pm

Thanks for this valuable information. This post is very important because we have a training program starting next week for DW/BI architects who need to know how to improve ETL performance.

Reply

Leave a Comment

Previous post:

Next post: