IRI Blog Articles

Diving Deeper into Data Management

 

 

Speeding and Building on Unix bin/sort

by Jason Koivu

The sort utility included with every Unix-based operating system is a standard command-line program that prints the lines of its input, or of the named input files, in a specified sort order.

The bin/sort interface provided with IRI CoSort is a drop-in replacement for the sort utility usually found in the /bin directory of most Unix/Linux operating systems, but uses IRI’s proprietary sort and resource-optimization algorithms to perform system sort jobs much faster.

To have it function as a drop-in replacement, we recommend that your system administrator:

1) Rename /bin/sort to /bin/sort.orig (for safekeeping). Note that the Windows replacement is named unixsort.exe upon installation of CoSort for Windows.

2) Create a link from the default system /bin/sort to the CoSort sort.

Users will then have the sort speed of CoSort without changing any of their own references (within batch scripts and programs).
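The two administrator steps above can be sketched as a small shell function. This is only an illustration: the function name install_dropin and both of its path arguments are hypothetical, and the location of the CoSort sort binary depends on your installation, so it is passed in rather than assumed.

```shell
#!/bin/sh
# Sketch only: swap a system sort binary for a replacement, keeping a backup.
# install_dropin and its arguments are illustrative names, not part of CoSort.
install_dropin() {
    system_sort=$1      # e.g. /bin/sort
    replacement=$2      # path to the CoSort sort binary (site-specific)

    # Refuse to continue if a backup already exists, so it is never overwritten.
    if [ -e "$system_sort.orig" ]; then
        echo "backup $system_sort.orig already exists; aborting" >&2
        return 1
    fi

    # Step 1: keep the original for safekeeping.
    mv "$system_sort" "$system_sort.orig" || return 1

    # Step 2: link the system path to the replacement.
    # If the link cannot be created, put the original back.
    ln -s "$replacement" "$system_sort" || {
        mv "$system_sort.orig" "$system_sort"
        return 1
    }
}

# Example call (run as root; the second path is an assumption):
# install_dropin /bin/sort /usr/local/cosort/bin/sort
```

To undo the change, remove the link and rename /bin/sort.orig back to /bin/sort.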

Given this 10-record input file:

Adams, John|adams@gmail.com|646-834-9956|Melbourne|Florida
Monroe, James|Monroe@James.com|433-758-2783|Rapid City|South Dakota
Jackson, Andrew|Jackson@Andrew.com|145-894-4328|Long Island|New York
Van, Martin|Van@Martin.com|654-763-7612|Tulsa|Oklahoma
Harrison, Henry|Harrison@Henry.com|765-978-2457|Aberdeen|South Dakota
Tyler, John|Tyler@John.com|554-674-1289|Juneau|Alaska
Polk, James|Polk@James.com|553-563-2399|Baton Rouge|Louisiana
Pierce, Franklin|Pierce@Franklin.com|344-891-3289|Salem|Oregon
Cleveland, Henry|Cleveland@henry.com|345-548-3282|Kalahari|Delaware
Chalse, Logan|Chalse@Logan.com|321-889-4633|Melbourne|Florida

When we run the following from the command line:

sort -s -u -t '|' +4r patient_info.in -o patient_info.out

the output file contains:

Monroe, James|Monroe@James.com|433-758-2783|Rapid City|South Dakota
Pierce, Franklin|Pierce@Franklin.com|344-891-3289|Salem|Oregon
Van, Martin|Van@Martin.com|654-763-7612|Tulsa|Oklahoma
Jackson, Andrew|Jackson@Andrew.com|145-894-4328|Long Island|New York
Polk, James|Polk@James.com|553-563-2399|Baton Rouge|Louisiana
Adams, John|adams@gmail.com|646-834-9956|Melbourne|Florida
Cleveland, Henry|Cleveland@henry.com|345-548-3282|Kalahari|Delaware
Tyler, John|Tyler@John.com|554-674-1289|Juneau|Alaska

The records were sorted on the final field (state), as indicated by +4 with a “|” separator, and:

  • The sort order was reversed / descending
  • Records with duplicate key values (South Dakota and Florida in this case) were removed, so only one record for each of those states was kept
  • The ‘stable’ option ensured that the South Dakota record that was kept was the one for “Monroe, James” and that the Florida record was the one for “Adams, John”

The reason these particular records were kept is that they appeared earlier than the other Florida and South Dakota records in the original input file.
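The +4r key form used above is the historical, zero-based way of naming a sort key; POSIX specifies the one-based -k option instead, under which +4r is written -k5r. As a sketch, the script below recreates the sample file in a temporary directory (the workdir variable is just an illustrative name) and runs the same job with -k. The locale is pinned to C so that the ordering is byte-wise and repeatable.

```shell
#!/bin/sh
# Sketch: the article's sort job written with the POSIX -k key syntax.
workdir=$(mktemp -d)    # illustrative scratch location

cat > "$workdir/patient_info.in" <<'EOF'
Adams, John|adams@gmail.com|646-834-9956|Melbourne|Florida
Monroe, James|Monroe@James.com|433-758-2783|Rapid City|South Dakota
Jackson, Andrew|Jackson@Andrew.com|145-894-4328|Long Island|New York
Van, Martin|Van@Martin.com|654-763-7612|Tulsa|Oklahoma
Harrison, Henry|Harrison@Henry.com|765-978-2457|Aberdeen|South Dakota
Tyler, John|Tyler@John.com|554-674-1289|Juneau|Alaska
Polk, James|Polk@James.com|553-563-2399|Baton Rouge|Louisiana
Pierce, Franklin|Pierce@Franklin.com|344-891-3289|Salem|Oregon
Cleveland, Henry|Cleveland@henry.com|345-548-3282|Kalahari|Delaware
Chalse, Logan|Chalse@Logan.com|321-889-4633|Melbourne|Florida
EOF

# -t '|'  use the pipe character as the field separator
# -k5r    key starts at field 5 (one-based) and is compared in reverse;
#         this names the same field as the zero-based +4r
# -s      stable: records with equal keys keep their input order
# -u      keep only the first record for each key value
LC_ALL=C sort -s -u -t '|' -k5r \
    -o "$workdir/patient_info.out" "$workdir/patient_info.in"

cat "$workdir/patient_info.out"
```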


Using SortCL Instead of bin/sort

We recommend that CoSort users eventually convert their bin/sort commands to equivalent syntax in “SortCL” job scripts. SortCL is the primary user interface in the IRI CoSort package and the default data transformation, cleansing, masking, and reporting engine in the IRI Voracity platform. SortCL uses a simple 4GL, familiar to JCL and SQL users, to define data layouts and manipulations.

Because SortCL uses the same underlying sort engine as CoSort’s bin/sort replacement, sort performance in SortCL has the same advantage over the system sort. SortCL also provides the added advantages of:

  • storing the metadata for the input source (if specified in the SortCL job script) centrally for re-use in every SortCL-compatible IRI software job script using that same source (layout)
  • a far richer set of data manipulation functionality being available through the job script, including SQL-compatible data transformations, report formatting, field-level masking, data cleansing, etc.
  • free graphical support for job (script) design, management, and execution in Eclipse via IRI Workbench
  • seamless Hadoop sort (and other SortCL) job execution in MapReduce2, Spark, Spark Stream, Storm or Tez without re-coding

The bin/sort example above, as performed on the data source patient_info.in:

sort -s -u -t '|' +4r patient_info.in -o patient_info.out

would be expressed as follows in SortCL (in its simplest form):

/INFILE=patient_info.in
#/SPEC=C:\DDF_repository\patient_info.ddf  # invokes the metadata for the source
# (commented out here)
/KEY=(POS=5,SEP="|",DESC)     # direct key specification, without requiring metadata
/STABLE
/NODUPLICATES
/OUTFILE=patient_info.out

Running this script produces the same output as the bin/sort example above.
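One way to run such a job is to save it to a file and hand that file to the sortcl executable. The sketch below assumes a script file named patient_sort.scl (a hypothetical name) and a sortcl /SPEC=file invocation; treat that command form as an assumption and check your CoSort documentation for the exact form on your release. The run is skipped if sortcl is not on the PATH.

```shell
#!/bin/sh
# Sketch: save the SortCL job to a file and run it if sortcl is installed.
job=patient_sort.scl    # hypothetical script name

cat > "$job" <<'EOF'
/INFILE=patient_info.in
/KEY=(POS=5,SEP="|",DESC)     # direct key specification, without requiring metadata
/STABLE
/NODUPLICATES
/OUTFILE=patient_info.out
EOF

if command -v sortcl >/dev/null 2>&1; then
    sortcl /SPEC="$job"     # assumed invocation form; not verified here
else
    echo "sortcl not found on PATH; wrote $job only" >&2
fi
```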

Note the following functional equivalents between the above sortcl script and bin/sort:

  • The /INFILE command is how you specify the input source (stdin is the default source)
  • The /KEY statement allows you to directly specify the field position for the sort key and the character separator on which that position is based.

In bin/sort this was done with -t '|' +4. The numbers differ because bin/sort counts fields from 0 in this syntax, while SortCL counts from 1, so +4 and POS=5 both refer to the fifth field.

** Specifying the key directly with position and separator options (and optionally SIZE) is an alternative to naming a field, which requires that metadata be provided in the input section (via the /SPEC command, commented out above). Note that IRI Workbench features a Metadata Discovery wizard to automate the process of ascertaining metadata where possible.

  • The DESC option indicates a reverse order sort (the r option in bin/sort)
  • The /STABLE command to preserve input order is the equivalent of -s in bin/sort
  • The /NODUPLICATES command equates to unique (-u) in bin/sort
  • The /OUTFILE command is how you specify the output target (stdout is the default target)
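Because the two tools count fields differently, it can help to confirm which field a key position refers to before converting a command. A minimal sketch using the standard cut utility, which counts fields from 1 as SortCL's POS does; the record variable is just an illustrative name holding the first sample record:

```shell
#!/bin/sh
# Sketch: check which field a one-based position refers to.
record='Adams, John|adams@gmail.com|646-834-9956|Melbourne|Florida'

# cut numbers fields from 1, so -f 5 selects the same field as SortCL POS=5
# (and as the zero-based +4 in the older bin/sort key syntax).
key_field=$(printf '%s\n' "$record" | cut -d '|' -f 5)

echo "sort key: $key_field"
```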

 

Add Functions to Your Sort Jobs via SortCL

Once your bin/sort job is converted to SortCL, as in the above example, you can enjoy the full benefits of that interface, described at:

http://www.iri.com/products/cosort/sortcl/function-matrix

To apply format-preserving encryption to the email field, for example, you could simply modify and expand the above job script as follows:

/INFILE=patient_info.in
   /SPEC=C:\DDF_repository\patient_info.ddf  # invokes the metadata for the source
/KEY=(state,desc)  # descending sort on the state field
   /STABLE
   /NODUPLICATES
/OUTFILE=encrypted_email.out
   /PROCESS = RECORD
   /FIELD=(NAME, POS=1, SEP="|")
   /FIELD=(ENC_FP_EMAIL=enc_fp_aes256_alphanum(EMAIL, "12345"), POS=2, SEP="|")
   # applies the encryption routine to the email field with the passphrase "12345"
   /FIELD=(PHONE, POS=3, SEP="|")
   /FIELD=(CITY, POS=4, SEP="|")
   /FIELD=(STATE, POS=5, SEP="|")

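Run against a larger version of the input file (it includes records not shown in the 10-record sample above), this would produce the following results: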

Taylor, Zachary|Hwswie@Gqfimdu.dbs|423-560-3289|Charleston|West Virginia
Monroe, James|Tsafdh@Todhu.wqm|433-758-2783|Rapid City|South Dakota
Pierce, Franklin|Ltzglh@Mbzogtyv.szo|344-891-3289|Salem|Oregon
Van, Martin|Leb@Iggeqh.vru|654-763-7612|Tulsa|Oklahoma
Grant, Ulysses|Poqcn@Aaoqeat.thg|348-855-3478|Bismarck|North Dakota
Jackson, Andrew|Mexdkjh@Iuaddn.xwj|145-894-4328|Long Island|New York
Buchanan, James|Frnulwam@Fzriv.ulf|432-453-3472|Trenton|New Jersey
Adams Quincy|Hbcbe@Rklkhge.pnk|983-245-2954|Lyon|Nevada
Fillmore, Millard|Guyqyvll@Qcusvio.tfx|205-773-2347|Lincoln|Nebraska
Madison, James|Tuqzrpq@Tonzp.nvw|563-435-7821|Biloxi|Mississippi
Johnson, Andrew|Mwvptjr@Izjnmd.slp|984-587-2855|Saint Paul|Minnesota
Polk, James|Hvbv@Dsedx.eqo|553-563-2399|Baton Rouge|Louisiana
Jefferson, Thomas|Hwyjqjpgn@Frwqqe.pvo|321-890-8293|Goodland|Kansas
Adams, John|cqlai@zdnla.yxk|646-834-9956|Melbourne|Florida
Cleveland, Henry|Dmcykglas@wvzzl.wlf|345-548-3282|Kalahari|Delaware
Hayes, Rutherford|Hmjmb@Vovhfrzzio.vdv|646-344-1234|Sacramento|California
Tyler, John|Ftutv@Aypw.tss|554-674-1289|Juneau|Alaska

We still see the same sort that was performed above, but the email field has also been encrypted with AES 256-bit format-preserving encryption applied to its alphanumeric characters (the @ and . were not encrypted, and the field width was preserved).

Multiple field protection (data masking) options are offered in CoSort, Voracity, and IRI FieldShield, as they all support the same SortCL metadata and job scripting syntax.
