Abstract: In previous articles in the business intelligence (BI) section of this blog, IRI has described how wrangling data with the SortCL engine in the IRI CoSort data manipulation product and Voracity data management platform speeds time to data visualization, and thus actionable insight, in BI tools. This article highlights the data blending advantages that Voracity provides for analytics in the Cubeware platform.
While Voracity is very fast at wrangling data and features a wide array of data manipulation and data protection capabilities, it lacks an onboard visualization and dashboarding tool. This is where Cubeware comes in. Conversely, for users of Cubeware software, Voracity adds value to their Importer process through fast, consolidated data transformation, cleansing, and masking.
Introduction to Cubeware
Founded in 1997, Cubeware is an innovative BI software manufacturer, located in Rosenheim, Germany. Cubeware provides solutions for data management, visualization, analysis, and planning.
One of Cubeware’s products is Cockpit, the dashboard designer provided in CSP C8. Cockpit can produce reports for Windows, web, mobile, and Snack (Cubeware’s agile instant reporting tool). It can also integrate data from relational and multidimensional databases.
Cubeware also provides Importer, a graphical ETL tool for consolidation of data from various source systems with comprehensive transformation processes that has many feature-function in common with Voracity.
Then Why Would I Need Voracity?
For more speed in “big data” preparation jobs, additional data quality features, and data-centric security (PII classification, discovery, and masking) capabilities for compliance with GDPR, etc.
To begin, Voracity uses the SortCL engine to integrate and wrangle data faster than Cubeware Importer and most every ETL tool on the market. SortCL is a high-speed, task-consolidating data processing program for structured and semi-structured sources that optimizes memory, threads, and I/O. It runs on Windows and Linux/Unix command lines on-premise or in the cloud.
SortCL jobs are specified in simple 4GL job scripts, which can also be designed and managed in a free graphical IDE called IRI Workbench, which is built on Eclipse™. In addition to being easy to understand, SortCL programs support a vast range of data manipulation, mapping, munging, masking, and mining features for hundreds of data processing, protection, prototyping, and presentation use cases.
For example, SortCL allows Voracity to support a wide array of data quality features, including:
Its data validation capabilities can improve the quality of data; e.g., by creating conditions to test for empty values, upper or lower case values, alphabetic and numeric values, printable values, and more. It can also filter records meeting specified conditions, ranges, or quantity thresholds.
GDPR compliance can be achieved through SortCL-supported data ‘shields’ that mask PII like deletion, encryption, redaction, and pseudonymization, etc. In fact, Voracity specifically includes:
However, and most importantly, all the above activities involved in ETL, data quality, and masking, can be performed simultaneously, in the same job script and I/O pass, significantly saving job development and execution time.
SortCL can also produce custom 2D detail and summary reports to offer a quick overview of the data and make sure everything seems correct before feeding Cockpit with the handoffs it needs for fast display.
So how much can Voracity actually improve the speed of data manipulation and movement? Cubeware provided a benchmark for Importer of 5 minutes and 32 seconds for processing a 4.25 million-row SQL Server table with a size of 1046MB, performing these transformations:
and outputting to a text file.
Here are the specs of the computer used by Cubeware:
I created a similar SortCL task-consolidated data wrangling job in IRI Workbench that reads the same data from a SQL Server table, performs similar equivalent SortCL transformations, and writes to a text file. The specs of the computer used to perform this benchmark were:
The benchmark using SortCL in Voracity was an average of 131.0 seconds, or roughly 2.53 times quicker than the benchmark provided using Cubeware Importer at this volume level.
Note: This benchmark was run on a computer with both SortCL and the database installed locally. Here is a view of the source table in IRI Workbench:
These like transformations were specified in the SortCL job1:
- Filter weekday to Saturday ONLY
- Omit dates from a certain range from the PLSSTARTDAT field.
- Validate the PTD_ID field to make sure it consists of numbers.
- Output additional summary record to stdout for quick display of key aspects of the data.
- New field with absolute difference between two numeric values
- New field with math operations performed
- New field concatenating two existing fields with comma
- New field replacing Saturday with German word
- Encrypt the ID field with Format-Preserving AES256 Encryption.
- Mask Mandant field
- Change date format of datum field.
- Encrypt with AES 256 Encryption the JAHR field with a passphrase of “Typhoon”
- Pseudonymize the VISIONGRUPPESET field replacing with area codes selected from a set file.
- Mask the TEILPLSTV_ID field from positions 3 to 7 with the “*” character.
- Output to a text file
- Output summary record of total, average, max and min values of the TTSNR field to stdout for a quick overview of the data values of this field.
Here is the resulting text file:
The biggest advantage or Voracity is fast, robust flat-file processing. If able to filter out a large percentage of records from a very large file, its SortCL program can provide rapid file-to-database data transfer. Data transformation, validation, cleansing, masking, and reformatting are all supported features that can be combined in one SortCL script and I/O pass.
Generally, file-to-database SortCL scripts would result in slower results than Cubeware when writing a large number of records to the database. However, SortCL can process the input of flat files faster, and then filter them down to much smaller subsets of records bound for a database.
When reading from a database and writing to a flat file, the SortCL engine can provide some speed advantages. A benchmark provided by Cubeware yielded an average of 5m:32s to perform similar filtering and transformations that SortCL performed in 2m:11s. The input for this benchmark was Microsoft SQL Server, and the output was a semicolon-separated text file.
Bottom line: If you have IRI Voracity and Cubeware Cockpit, you have an affordable, best-of-breed technical solution stack for data integration, governance, and visual analytics.