Big data integration activities can happen outside the database in an extract, transform, load (ETL) environment, or inside the database in ELT:
One example of an ELT operation would be Informatica’s Pushdown Optimization option, in which users transform data in a relational database like Oracle, or in Teradata. According to Informatica, this approach “allows the IT systems and their operators to respond to changing demands and peak processing needs.”
IRI and its customers believe that the work of big data transformation does not belong in databases designed for storage and retrieval. Transformations in the database layer can burden the database and the whole system, drawing resources from data storage maintenance, and slowing down queries.
Speed in any circumstance, whether an ETL or ELT operation, is always critical to the equation. CoSort supports the ETL paradigm, optimizing big data integration performance by combining transformations and running them externally:
File system operations are more efficient, affordable, and remove the burdens of data integration and staging from BI, database and existing ETL tools, freeing them to do the jobs they do best instead. Teradata, Hadoop, in-memory and columnar databases, and ELT appliances are also much more expensive and complicated ‘first resort’ options than CoSort!
How Does this Work?
IRI is frequently approached by customers using platforms like Informatica PowerCenter who need to accelerate production jobs, where jobs like sorting, joining and aggregation are staged separately:
With CoSort, all data transformation tasks run a single job script and I/O pass. Those jobs can also include filtering and conversion, an extensive array of field-level protections, custom reformatting and reporting with math functions, special layouts, multiple targets, and so on.
The difference CoSort’s task consolidation and file system processing makes can be seen even on a laptop. An independent consultant using PowerCenter 8.6 under Windows XP SP3 on an Intel® Core™ CPU M370 @2.4GHz with 3 GB of DDR3 memory set up sort, join, and aggregate transforms against 100MB and 53MB flat files, which ran in 94 seconds. That same job ‘pushed out’ to a single CoSort Sort Control Language (SortCL) program in the file system produced the same results “in” Informatica in 16 seconds.
Push “out” optimization to CoSort from Informatica is rather straightforward:
1. Write a simple SortCL language program like the one below in a text editor or with a wizard in the IRI Workbench GUI, built on Eclipse. The job script runs on the command line (sortcl /spec=script.scl) and can be called into Informatica through the same PowerCenter “Command Task” that runs any .exe file.
2. To create such a command task in PowerCenter, select “Workflow” from the upper menu and then select “create”:
3. Name the workflow. For example, wf_sorting_joining_aggr_cmd
4. To display the command in the workflow designer, select command task from the upper tool bar and create the actual command that will run:
5. This task will call the SortCL job in the command task. The task itself can be edited to call any job; in this case the SortCL job script “sortjoinaggr.scl.”
6. The consultant’s last image shows the CoSort/SortCL job completed from Informatica in 16 seconds; while the original job in Informatica took 94:
Contact firstname.lastname@example.org if you are interested in optimizing your ETL tools’ transformations through a simple call to CoSort.