IRI Blog Articles

Diving Deeper into Data Management



Computer Progammer Writing Informatica Pushdown Optimization

Informatica Pushdown Optimization with CoSort

by Jason Koivu

Big data integration activities can happen outside the database in an extract, transform, load (ETL) environment, or inside the database in ELT:

One example of an ELT operation would be Informatica’s Pushdown Optimization option, in which users transform data in a relational database like Oracle, or in Teradata. According to Informatica, this approach “allows the IT systems and their operators to respond to changing demands and peak processing needs.”

IRI and its customers believe that the work of big data transformation does not belong in databases designed for storage and retrieval. Transformations in the database layer can burden the database and the whole system, drawing resources from data storage maintenance, and slowing down queries.

Speed in any circumstance, whether an ETL or ELT operation, is always critical to the equation. The IRI CoSort product, and the modern IRI Voracity platform which can use it, both support support the ETL paradigm, optimizing big data integration performance by combining transformations and running them externally:

File system operations are more efficient, affordable, and remove the burdens of data integration and staging from BI, database and existing ETL tools, freeing them to do the jobs they do best instead. Teradata, Hadoop, in-memory and columnar databases, and ELT appliances are also much more expensive and complicated ‘first resort’ options than CoSort!

How Does this Work?

IRI is frequently approached by customers using platforms like Informatica PowerCenter who need to accelerate production jobs, where jobs like sorting, joining and aggregation are staged separately:

Sort, join, aggregation workflow

With CoSort, all data transformation tasks run a single job script and I/O pass. Those jobs can also include filtering and conversion, an extensive array of field-level protections, custom reformatting and reporting with math functions, special layouts, multiple targets, and so on.

The difference CoSort’s task consolidation and file system processing makes can be seen even on a laptop. An independent consultant using PowerCenter 8.6 under Windows XP SP3 on an Intel® Core™ CPU M370 @2.4GHz with 3 GB of DDR3 memory set up sort, join, and aggregate transforms against 100MB and 53MB flat files, which ran in 94 seconds. That same job ‘pushed out’ to a  single CoSort Sort Control Language (SortCL) program in the file system produced the same results “in” Informatica in 16 seconds.

Push “out” optimization to CoSort from Informatica is rather straightforward:

1. Write a simple SortCL language program like the one below in a text editor or with a wizard in the IRI Workbench GUI, built on Eclipse. The job script runs on the command line (sortcl /spec=script.scl) and can be called into Informatica through the same PowerCenter “Command Task” that runs any .exe file.

A SocrtCL script that can be run from numerous places for big data processing

2. To create such a command task in PowerCenter, select “Workflow” from the upper menu and then select “create”:

Creating a command task in PowerCenter for CoSort to transform big data

3. Name the workflow.  For example, wf_sorting_joining_aggr_cmd

Creating a big data workflow in Informatica PowerCenter

4. To display the command in the workflow designer, select command task from the upper tool bar and create the actual command that will run:

Workflow designer showing CoSort for big data transformation

5. This task will call the SortCL job in the command task. The task itself can be edited to call any job; in this case the SortCL job script “sortjoinaggr.scl.”

PowerCenter calling a SortCL script to speed big data in Informatica

6. The consultant’s last image shows the CoSort/SortCL job completed from Informatica in 16 seconds; while the original job in Informatica took 94:

Results show that CoSort transformed big data in 16 seconds.

Contact if you are interested in optimizing your ETL tools’ transformations through a simple call to CoSort. See this article if you’re interested in re-platforming Informatica ETL jobs to the same jobs in IRI Voracity jobs via AnalytiX DS technology to save time and money.

Voracity is IRI’s ‘total data management platform’ which uses CoSort or Hadoop engines (same metadata either way). It combines data discovery, integration, migration, governance and analytics in a single Eclipse GUI. Voracity is compatible with AnalytiX DS Mapping Manager and CATfx templates for managing ETL metadata stratification, creation, and conversion.

Print Friendly

{ 0 comments… add one now }

Leave a Comment

Previous post:

Next post: