Big Data Problem
Big data volumes are growing exponentially. This phenomenon has been happening for years, but its pace has accelerated dramatically since 2012. Check out this blog entitled Big Data Just Beginning to Explode from CSC for a similar viewpoint on the emergence of big data and the challenges it brings.
IRI has been aware of this trend since the company’s founding in the late 1970s. Its flagship CoSort package is designed to handle growing data volumes through efficiencies in software algorithms and design, “portable” hardware exploitation techniques, and task consolidation (e.g. sort-join-aggregate-encrypt-report). The question this article poses is one of approach, given the “rise of the machines.”
Hardware’s Limitation in Solving It
Certainly computer performance has been accelerating in nearly every respect for decades. And for many, throwing hardware at the big data problem is merely second nature. However, the problem may be bigger than that. Consider Moore's Law, under which CPU power only doubles every 18 months at best, along with the inherent obsolescence, maintenance issues, and sheer costs of a hardware-centric strategy.
Something new to consider, too, is the expectation that this performance paradigm for big data may be coming to an end. Gery Menegaz's premise is that the end of Moore's Law is near. Time Magazine ran a similar story in May 2012 entitled The Collapse of Moore's Law: Physicist Says It's Already Happening. According to the Time article,
Given that, Kaku says that when Moore’s Law finally collapses by the end of the next decade, we’ll “simply tweak [it] a bit with chip-like computers in three dimensions.” Beyond that, he says “we may have to go to molecular computers and perhaps late in the 21st century quantum computers.”
For most users, however, hardware is bought to handle, and to some extent scale to meet, the big data processing challenges they face or foresee. But the less efficiently the software running on it performs, the more hardware resources must be spent overcoming that inefficiency. An example in our world might be buying an IBM p595 to run /bin/sort when a machine one third that size and cost, running CoSort instead, would produce the same result.
Meanwhile, DB and ELT appliances like Exadata and Netezza, built around hardware, already require six- to seven-figure investments. And while they can scale to take on larger loads, there is usually a limit to how far they can scale (certainly not exponentially), how much money can be spent attempting to keep scaling, and how willing people are to rely on a single vendor for every mission-critical aspect of their workloads. And is it a good idea to impose the overhead of big data transformation on databases that were designed for storage and retrieval (query) optimization instead?
Even if all those questions had easy answers, how do computational problems (with even just linearly scaling big data growth) that require superlinearly larger resource consumption (like sorting) get solved? Somehow the answer would not seem to lie in merely waiting for affordable quantum computing …
The Role of Software
As Hadoop and data warehouse architects know, Sorting — and the Join, Aggregate, and Loading operations in ETL that rely on Sorting — is at the heart of the big data processing challenge, and a superlinear consumer of computing resources. As big data doubles, the resources required to sort it can triple. The algorithms, hardware exploitation techniques, and processing schemes involved in multi-platform, multi-core sorting software are therefore the keys to managing this problem in scalable, affordable, and efficient ways.
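To see why sort costs grow faster than the data itself, consider the standard comparison-sort cost model. The sketch below is an illustrative back-of-envelope calculation, not a CoSort benchmark; real external (disk-based) sorts add extra I/O merge passes on top of this, which is how doubling the data can approach tripling the work.

```python
import math

def sort_cost(n):
    """Comparison-sort cost model: work proportional to n * log2(n)."""
    return n * math.log2(n)

# Doubling the data more than doubles the in-memory comparison work;
# external sorting adds further I/O passes once data exceeds memory.
n = 1_000_000
ratio = sort_cost(2 * n) / sort_cost(n)
print(f"cost ratio when data doubles: {ratio:.2f}x")
```

In other words, even before memory and I/O limits enter the picture, sort workload growth outpaces data growth.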
CoSort performance scales linearly with data volume, and its parallel speedup behaves more along the lines of Amdahl's Law. While CoSort can transform hundreds of gigabytes of big data in minutes with a few dozen cores, other tools can take more than twice as long, scale far less well, and/or consume more memory and I/O in the process. Perhaps more importantly, CoSort integrates sorting directly into related applications, and does all its heavy lifting outside the DB and BI layers, where staging data would be less efficient.
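Amdahl's Law puts a ceiling on what adding cores alone can buy, which is why software efficiency matters as much as core count. This is a generic illustration of the law, not a measurement of CoSort; the 95% parallel fraction is an assumed value for the example.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: overall speedup is limited by the serial fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Assume 95% of a sort job parallelizes: a few dozen cores help a lot,
# but returns diminish as the serial 5% starts to dominate.
for cores in (4, 12, 24, 48):
    print(f"{cores:3d} cores -> {amdahl_speedup(0.95, cores):.1f}x speedup")
```

With a 5% serial fraction, speedup can never exceed 20x no matter how many cores are added — another reason "throwing hardware at it" eventually stops paying off.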
CoSort’s co-routine architecture moves records between the sorter and programs like SortCL (CoSort’s data transformation, filtering, lookup, reporting, migration, and protection utility) interactively, through memory. As soon as the next sorted record is available, it can move into the application, load, and so on. To the application, it appears to be reading an input file, but in truth the back end of the source has not yet been produced. And no, the application will not get ahead of the sorter.
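The general streaming idea can be sketched with Python generators. This is not CoSort's actual implementation (which isn't public) — it is a minimal model of the final merge phase of an external sort handing each record to a downstream consumer as soon as it is produced, with no intermediate file in between.

```python
import heapq

def merge_runs(*runs):
    """Final merge phase of an external sort: yields records one at a
    time, in order, without materializing the merged output."""
    yield from heapq.merge(*runs)

def consumer(records):
    """Downstream step (e.g. report or load) pulls each record as soon
    as the merge produces it -- it cannot get ahead of the sorter."""
    return [r.upper() for r in records]

# Two pre-sorted runs, as an external sort's spill phase would leave them.
runs = (["apple", "cherry"], ["banana", "date"])
print(consumer(merge_runs(*runs)))  # ['APPLE', 'BANANA', 'CHERRY', 'DATE']
```

The consumer blocks whenever the merge has not yet yielded the next record, which is the co-routine behavior the paragraph above describes.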
Physical computing resources alone cannot be counted on to scale up to the problem of processing big data. The CoSort software environment is one in which required big data transformation and related jobs run not as standalone processes, but in parallel during the same I/O pass.
So, if you need a fast sort for more than just the sort time itself, you should be thinking about what happens downstream of the sort, and the best ways to link those processes together. And once you have determined the best runtime paradigm, can you then combine high-performance hardware with such software to optimize performance? Could you stage DW data with CoSort on the database server side of Exadata, for example? Or would it make more sense to keep your IBM p595 and add CoSort to triple throughput? Or, if you are intent on using Hadoop, consider using the same simple 4GL of CoSort or the intuitive ETL mappings of Voracity to drive MapReduce 2, Spark, Storm, or Tez jobs.
Let your budget, and your imagination, be your guides to tackling data growth.