Loading Big Data Faster

by David Friedland

Loading Big Data? For more speed, Pre-Sort and Bulk Load

Finding more speed when loading big data is a challenge in ETL, reorg, and very large database (VLDB) index populate operations. One way to load big data faster is by pre-sorting it, so the database does not have to sort. IBM and other mainframe database providers have given that advice for decades, and it is still true in relational databases in use on Unix and other “open systems” today, including Oracle, DB2, Sybase, and SQL Server.

Benchmarks in this area show improvements over unsorted loads depending on volume, but sort vendors like IRI claim the load performance has improved between two and ten times. In the TUSC Consulting report “Benchmarking Index Impact on OLTP Load Rates and Online Database Block Size Rebuild in Oracle,” a 100,000 row single-index insert test alone showed that pre-sorted data loaded 58% faster, and required 49% less space:

Loading in sorted order had a 42% lower sustained rows/second load rate
Unsorted inserts into indexes force more internal database work (block management and space reorganization) to be done
In load-sorted indexes, the clustering factor will be close to the number of leaf blocks
The order of loaded data is critical to load performance.

Many years later, in chapter 13 of his “Expert Oracle Database 11g Administration” guide, Sam R. Alapati (Miro Consulting) recommended pre-sorting in conjunction with direct path loads as the fastest way to bulk load Oracle (versus inserts):

“The direct-path loading option doesn’t use the SQL INSERT statement to put data into tables; rather, it formats Oracle data blocks and writes them directly to the database files. This direct-write process eliminates much of the overhead involved in executing SQL statements to load tables. Since the direct-path loading method doesn’t contend for database resources, it will load data much faster than a conventional data load. For larger data loads, the direct-path loading method words best, and it may be the only viable method of loading data into tables for the simple reason that a conventional load may require more time than is available.”

For administrators of VLDBs today, this is where CoSort comes in, since:

“Besides the obvious advantages of a shorter load time, direct loading also helps you rebuild indexes and presort table data.”

CoSort is traditionally used in the external pre-sort of a flat file that will be the import to a load specifying “direct=true” and this option:

“SORTED INDEXES: The SORTED_INDEXES parameter signals SQL*Loader that data is sorted on a specified index, which improves load performance.”

Similarly, Microsoft SQL Server documentation specifies file pre-sort as one of the “Methods for Optimizing Bulk Import”:

By default, a bulk-import operation assumes that a data file is unordered. If the table has a clustered index, the bcp utility, BULK INSERT statement, and OPENROWSET(BULK…) function (Transact-SQL) enable you to specify how data in the data file is sorted during a bulk-import operation. It is optional for data in the data file to be sorted in the same order as the table. However, you can improve performance of the bulk-import operation if you specify the same ordering for the data file as the table.

The /KEY field in a CoSort SortCL script would typically be the longest index (primary) key in the table, but it does not have to be. According to TUSC, for like columns:

Fewer longer indexes are preferable to more shorter indexes
Leading column drives the index load cost

Note also that:

Per Vertica and other RDBMS primers, maintaining columns in sorted orders optimizes query performance. Even the old advice in DEC’s Rdb/VMS Guide to Database Maintenance and Performance is still true:

“Pre-sort the records you plan to store in a table by primary key value before loading them into the database. When the records are loaded, they will be physically adjacent to one another, or clustered, until additional records are stored in the database. Maintaining this arrangement benefits queries that select rows based on a range of values or that join many rows of one table with rows in the same table.”
Pre-sorting data in tables can save time in views as well. According to “Oracle Database 10g: The Complete Reference” by Kevin Loney:

“Having the data sorted in the view may simplify your application development. For example, if your code steps through a set of records, having those records presorted may make your processing and error checking simpler. In your application development, you will know that the data will always be returned to you in an ordered fashion.”

Mr. Alapati warns DBAs about a limitation of direct path loads:

“Note: In a direct load, you can’t use any SQL functions. If you need to perform a large data load and also transform the data during the load, you have a problem. The conventional data load will let you use SQL functions to transform data, but the method is very slow compared to the direct load. Thus, for large data loads, you may want to consider using one of the newer load/transform techniques, such as external tables or table functions.”

However, CoSort’s SortCL program can transform the loading data during pre-sorting; i.e. by combining the same type of SQL functions in the same job script and I/O pass, including: joins, aggregations, cross-calculations, lookups, select/filter, substring and instring functions, and plenty of reformatting and custom-report targets — in that same pre-sort operation.

The new offline reorg utility in the IRI Workbench (Eclipse GUI) uses IRI FACT (Fast Extract) to unload table data quickly via OCI, uses CoSort to pre-sort on the primary key, and writes and runs SQL*Loader direct path loads to optimize and combine each of these steps.

Generating Test NID Data: Netherlands BSN

Data Masking: Masking Social Security and other National IDs

Loading Big Data Faster

Loading Big Data? For more speed, Pre-Sort and Bulk Load

Related articles

Search the Blog