CoSort Transforms Big Marketing Data 10x Faster than SQL
Mark Hipp, President of DataBase Technologies, Inc.
DataBase Technologies (DBT) builds and maintains very large marketing databases for a few select clients. We started in 1992 with a desktop CRM system that put the entire universe of B2B customers and prospects at a marketer’s fingertips. Our specialty has become the rapid integration of data flows from systems across multiple platforms and companies.
DBT has recently taken on an assignment to build a platform for the creation and distribution of new data products made from the aggregation and combination of various data flows coming out of the big data arena. Our basic initial workflow is to take 350 million records per day of transaction data, join it to a few files, the largest of which is 100 million records, and accumulate the data over time for analysis.
DBT is a Windows shop. We interface with and do remote work on a variety of systems, but internal development is all on Microsoft operating systems.
Most of our large dataset work in the last few years has been done on high-end SQL engines. The first dataset we got for the new project, at 350GB, took over two days to load as a heap and then build a clustered index. We clearly needed to get the data sorted before we tried to load it. We did some research and decided to try CoSort. It worked well.
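To illustrate what sorting a file that is larger than memory involves, here is a minimal external merge sort sketch in Python. It is not CoSort's algorithm or interface, only the general technique: sort memory-sized chunks into temporary "run" files, then merge the runs. The function name `external_sort`, the `chunk_lines` parameter, and the choice to sort whole lines as text are all assumptions made for the example.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(in_path, out_path, chunk_lines=100_000):
    """Sort a newline-delimited text file using sorted runs plus a k-way merge.

    Hypothetical helper for illustration; whole lines are compared as text.
    """
    run_paths = []
    with open(in_path, "r", encoding="utf-8", newline="") as src:
        while True:
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            # Make sure the last line of the file also ends with a newline,
            # so lines never fuse together when the runs are merged.
            chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w", encoding="utf-8", newline="") as run:
                run.writelines(chunk)
            run_paths.append(path)

    # Merge all sorted runs lazily; only one line per run is held in memory.
    runs = [open(p, "r", encoding="utf-8", newline="") for p in run_paths]
    try:
        with open(out_path, "w", encoding="utf-8", newline="") as dst:
            dst.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

A sketch like this opens every run file at once during the merge, so a very large input with a small `chunk_lines` could hit the operating system's open-file limit. A production tool would merge in several passes and use multiple processors.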
CoSort SortCL programs run fast, transforming big data quickly. It is fun to watch the system performance monitor and see all those processors working in the high 90 percent range and the disks running at the high data rates you pay for. I was doing simple sorts within a day or two of starting, and it only took a couple of weeks to figure out most of the functionality. This dog hunts! CoSort does much more than sort. Its core data transformation interface, “SortCL,” also allows some basic, but very powerful, yes/no decisions on including records. It also has merge and join capabilities. I benchmark everything, and the join operations in CoSort are roughly 10 times faster than what I can get on the same data from the SQL package. Nine minutes and thirty-seven seconds is better than an hour and thirty-eight minutes!
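The "roughly 10 times" figure follows directly from the two timings quoted above. A quick check in Python, where the variable names are just labels for this example:

```python
from datetime import timedelta

# The two join timings quoted in the text.
cosort_join = timedelta(minutes=9, seconds=37)
sql_join = timedelta(hours=1, minutes=38)

# Dividing one timedelta by another yields a plain float ratio.
speedup = sql_join / cosort_join

print(f"CoSort join: {cosort_join.total_seconds():.0f} s")  # 577 s
print(f"SQL join:    {sql_join.total_seconds():.0f} s")     # 5880 s
print(f"Speedup:     {speedup:.1f}x")                       # 10.2x
```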
It can be difficult to tell if the data that comes out is correct when there is so much of it. I have spent more than a few hours with a hex editor, only to find that occasional embedded blanks on input records were causing odd results. This is a data issue, not a program issue, but there were no tools or tips to help find the problem.
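A small pre-scan can take the place of some of that hex editor time. The sketch below is not a CoSort feature. It assumes comma-delimited records, assumes that "embedded blanks" means space, tab, or NUL bytes inside a key field, and uses a hypothetical function name, `find_suspect_fields`. It reads the file as raw bytes so nothing is hidden by text decoding.

```python
def find_suspect_fields(path, key_fields, delimiter=b",",
                        suspects=(b" ", b"\t", b"\x00")):
    """Yield (line_number, field_index, raw_field) for key fields that
    contain any of the suspect bytes.

    Hypothetical helper; the delimiter and suspect bytes are assumptions.
    """
    with open(path, "rb") as src:
        for line_no, raw in enumerate(src, start=1):
            fields = raw.rstrip(b"\r\n").split(delimiter)
            for idx in key_fields:
                # Skip short records rather than failing on a missing field.
                if idx < len(fields) and any(s in fields[idx] for s in suspects):
                    yield line_no, idx, fields[idx]
```

For example, `for hit in find_suspect_fields("input.csv", key_fields=(0,)): print(hit)` would list each offending record with its field shown as a bytes literal, so a NUL byte appears as `\x00` instead of being invisible. The file name here is a placeholder.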
We needed something that would sort large data sets quickly. Back in the mainframe days we used SyncSort, and in my search for a Windows sort/merge/report utility with similar functionality I found CoSort. CoSort was actually closer to what I needed in terms of command line and batch execution options. The price of CoSort was higher than I was hoping for, but the product does more than I expected.
So far the support has been fast, friendly and helpful. My issues have been resolved while I was on the phone with them. I should probably use them more for programming questions, but trial and error also works for learning the system.
The manual is large, but it is very clear and well indexed (for those of us who still print this kind of stuff), and the PDF has very good hyperlinks.