
Excel Data Processing Samples
This article is a continuation of the previous article, which introduced IRI support for XLS and XLSX file formats in the IRI SortCL 4GL and runtime program used for rapidly manipulating spreadsheet data in the IRI Voracity data management (ETL, data quality, etc.) platform, and its included fit-for-purpose SortCL-powered tools:
- IRI CoSort, for high speed data transformation, cleansing and reporting
- IRI NextForm, for data type conversion, and flat-file and database conversion
- FieldShield, for flat-file and database PII masking
- RowGen, for synthesizing test data in structured, and semi-structured files and databases
These job samples help demonstrate the breadth of what these metadata-compatible applications can now do with data directly to/from Excel sheets in either XLS or XLSX formats.
Sample Job Scripts
Note: To view XLS and XLSX files from within IRI Workbench, simply double click on the file in Project Explorer.
Example #1: Generate Test Data in Excel
This script demonstrates the use of an IRI RowGen job script for synthesizing a small amount of random test data into an XLSX file, and an XLS file with a header.
/INFILE=simple.in /INCOLLECT=5 /PROCESS=RANDOM /FIELD=(code,POSITION=1,SEPARATOR="\t",SIZE=5,TYPE=ALPHA_DIGIT) /FIELD=(value,POSITION=2,SEPARATOR="\t",SIZE=8,TYPE=WHOLE_NUMBER) /INCLUDE WHERE value >10 /REPORT /OUTFILE="'Sheet1'!A1:B5;randomnums.xlsx" /PROCESS=XLSX /FIELD=(code,POSITION=1,SEPARATOR="\t",TYPE=ALPHA_DIGIT,SDEF="=A") #format value field with a prepended string /FIELD=(valuef=format_strings("H2N3%s",value),POSITION=2,SIZE=12,SEPARATOR="\t",TYPE=ASCII,SDEF="=B") /OUTFILE="'Sheet1'!BB1:BC6,HEADER;randomnums.xls" /PROCESS=XLS /FIELD=(code,POSITION=1, SEPARATOR=",",SDEF="Code=BB") /FIELD=(value,POSITION=2,SEPARATOR=",",SDEF="Value=BC")
XLSX Output:
XLS Output:
Example #2: Synthesizing and Encrypting Data in Excel
This hybrid RowGen-FieldSheld job demonstrates random selection of names and email domains from set files and the subsequent masking of those names with format-preserving AES256 encryption with an encrypted passphrase. An email address is generated using the first letter of the first name plus the last name. The output is to a specific range of an XLS sheet.
/INFILES=random_file_placeholder /PROCESS=RANDOM /INCOLLECT=100 /FIELD=(FIRSTNAME, TYPE=ALPHA_DIGIT, POSITION=1, SEPARATOR="\t", SET="C:/IRI/cosort100/sets/names/names_first.set" SELECT=ANY) /FIELD=(LASTNAME, TYPE=ALPHA_DIGIT, POSITION=2, SEPARATOR="\t", SET="C:/IRI/cosort100/sets/names/names_last.set" SELECT=ANY) /FIELD=(EMAIL, TYPE=ALPHA_DIGIT, POSITION=3, SEPARATOR="\t", SET="C:/IRI/cosort100/sets/free_email_domains.set" SELECT=ANY) /INREC /FIELD=(FIRSTNAME, TYPE=ALPHA_DIGIT, POSITION=1, SEPARATOR="\t") /FIELD=(LASTNAME, TYPE=ALPHA_DIGIT, POSITION=2, SEPARATOR="\t") /FIELD=(EMAIL, TYPE=ALPHA_DIGIT, POSITION=3, SEPARATOR="\t") /FIELD=(FIRSTINITIAL=sub_string(FIRSTNAME,1,1),TYPE=ALPHA_DIGIT, POSITION=4, SEPARATOR="\t") /REPORT /OUTFILE="'Sheet1'!B1:D100;PERINFO.xls" /PROCESS=XLS /FIELD=(FIRSTNAMEENC=enc_fp_aes256_alphanum(FIRSTNAME, "epass:mfvndoTjj8PnGCVCB9pU0Q=="), TYPE=ALPHA_DIGIT, POSITION=1, SEPARATOR="\t", SDEF="=B") /FIELD=(LASTNAMEENC=enc_fp_aes256_alphanum(LASTNAME, "epass:mfvndoTjj8PnGCVCB9pU0Q=="), TYPE=ALPHA_DIGIT, POSITION=2, SEPARATOR="\t", SDEF="=C") /FIELD=(EMAIL_FORMATTED= format_strings("%s%s@%s",FIRSTINITIAL,LASTNAME,EMAIL), TYPE=ALPHA_DIGIT, POSITION=3, SEPARATOR="\t", SDEF="=D")
Set files:
Output:
Example #3: Convert Flat File Data to Excel, and Insert It
This NextForm data migration script reads a flat file with fixed position field data and maps up to 44 rows to an XLS sheet between the BA and BD column range. The header row of column names is derived from the first half of the SDEF attribute in the /OUTFILE field statement. It also demonstrates that the SDEF argument is not case sensitive.
# Three character columns are not supported by XLS- the maximum column value is IV # Header row value specified as the first part of the SDEF. After the equals sign is the column letter. /INFILE=chiefs /PROCESS=RECORD /FIELD=(name,POSITION=1,SIZE=27) /FIELD=(year,POSITION=28,SIZE=12) /FIELD=(party,POSITION=40,SIZE=5) /FIELD=(state,POSITION=45,SIZE=2) /REPORT /OUTFILE=”'Sheet1'!BA1:BD44,HEADER;chiefsAllXLSeasy.xls” /PROCESS=XLS /FIELD=(name,POSITION=1,SEPARATOR=”\t”,SDEF="NAME=BA") /FIELD=(year,POSITION=2, SEPARATOR=”\t”,SDef="YEAR=BB") /FIELD=(party,POSITION=3,SEPARATOR=”\t”,sdef="PARTY=BC") /FIELD=(state,POSITION=4,SEPARATOR=”\t”,sDEF="STATE=BD")
Input File:
Output Sheet:
Example #4: Select and Sort Data from a Range (of an XLS Sheet)
# CoSort SortCL Example Job # Extracts & Sorts Selected Sheet Columns, Outputs to Pipe /INFILE=”'Sheet1'!B1:E43;chiefsALLXLS.xls” /PROCESS=XLS # select only two input columns from a dataset that spans more than those columns. /FIELD=(year,POSITION=1,SEPARATOR=",",SDEF="=C") /FIELD=(party,POSITION=2,SEPARATOR=",",SDEF="=D") /SORT /KEY=(year) # sort by year field, but only output party field, with a maximum size of 5 characters. /OUTFILE=stdout /PROCESS=RECORD /FIELD=(party,POSITION=1,SIZE=5)
Result:
Example #5: Generate and Populate Sorted Test Data into Excel, Inverted
/INFILES=random_file_placeholder /PROCESS=RANDOM /INCOLLECT=10 /FIELD=(FIELD1, TYPE=ALPHA_DIGIT, POSITION=1, SIZE=6, SEPARATOR=",") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, SIZE=7, SEPARATOR=",") /FIELD=(FIELD3, TYPE=UPPERCASE, POSITION=3, SIZE=5, SEPARATOR=",") /SORT /KEY=(FIELD1) /OUTFILE=”'Sheet1'!A1:K3,HEADER,V;rgen.xls” /PROCESS=XLS /FIELD=(FIELD1, TYPE=ALPHA_DIGIT, POSITION=1, SIZE=6, SEPARATOR=",",SDEF="FIELD1=1") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, SIZE=7, SEPARATOR=",",SDEF="FIELD2=2") /FIELD=(FIELD3, TYPE=UPPERCASE, POSITION=3, SIZE=5, SEPARATOR=",",SDEF="FIELD3=3")
Result:
Example #6: Extract and Mask Flat-File PII into Excel
This example takes PII from a text tab separated file, writes an encrypted/pseudonymized version to one .xlsx file, and an unmodified version to another .xlsx file. The epass, or encrypted password, obscures the clear text encryption key (passphrase) in the script.
/INFILE=personal_info /FIELD=(credit_card,POSITION=1,SEPARATOR='\t') /FIELD=(driv_lic,POSITION=2,SEPARATOR='\t') /FIELD=(name,POSITION=3,SEPARATOR='\t') /REPORT /OUTFILE=”'Sheet1'!MMM1:MMO10;personal_info_encrypted.xlsx” /PROCESS=XLSX /FIELD=(credit_card1=enc_fp_aes256_alphanum(credit_card,"epass:Gg87jCTR15Jro4AGE44ENw=="),POSITION=1,SEPARATOR='\t',SDEF="=MMM") /FIELD=(driv_lic1=enc_fp_aes256_alphanum(driv_lic,"epass:Gg87jCTR15Jro4AGE44ENw=="),POSITION=2,SEPARATOR='\t',SDEF="=MMN") /FIELD=(name,POSITION=3,SEPARATOR='\t',SDEF="=MMO",SET=C:\IRI\cosort100\sets\names\names_first_last.set) /OUTFILE=”'Sheet1'!M1:O10;personal_info_unencrypted.xlsx” /PROCESS=XLSX /FIELD=(credit_card,POSITION=1,SEPARATOR='\t',SDEF="=M") /FIELD=(driv_lic,POSITION=2,SEPARATOR='\t',SDEF="=N") /FIELD=(name,POSITION=3,SEPARATOR='\t',SDEF="=O")
Input data (tab-separated text file):
9654-4338-8732-8128 W389-324-33-473-Q Jessica Steffani 2312-7218-4829-0111 H583-832-87-178-P Cody Blagg 8940-8391-9147-8291 E372-273-92-893-G Jacob Blagg 6438-8932-2284-6262 L556-731-91-842-J Just Rushlo 8291-7381-8291-7489 G803-389-53-934-J Maria Sheldon 7828-8391-7737-0822 K991-892-02-578-O Keenan Ross 7834-5445-7823-7843 F894-895-10-215-N Francesca Leonie 8383-9745-1230-4820 M352-811-49-765-N Nadia Elyse 3129-3648-3589-0848 S891-915-48-653-E Gordon Cade 0583-7290-7492-8375 Z538-482-61-543-M Hanna Fay
Output:
Encrypted/Pseudonymized:
Unencrypted:
See this article to compare this FieldShield-specific approach with other IRI tools for masking data in Excel.
Example #7: Populate XLS and XLSX Sheets with Random Test Values
This IRI RowGen example demonstrates the synthesis of 5,000 rows of test data with a header, targeting both XLS and XLSX files based on random value lookups into static inline sets and set files.
/INFILES=random_file_placeholder /PROCESS=RANDOM /INCOLLECT=5000 /FIELD=(FIELD1, TYPE=ASCII, POSITION=1, SEPARATOR="\t",SET={Dolphin,Fish,Sloth}) /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, SEPARATOR="\t",SET=”moreAnimals.txt”) /FIELD=(FIELD3, TYPE=ASCII, POSITION=3, SEPARATOR="\t",SET={Red,Orange,Yellow,Blue}) /REPORT /OUTFILE=”'Sheet1'!A1:C5001,H;rgen.xlsx” /PROCESS=XLSX /FIELD=(FIELD1, TYPE=ASCII, POSITION=1, FRAME='"',SEPARATOR="\t",SDEF="Animal=A") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, FRAME='"',SEPARATOR="\t",SDEF="Another Animal=B") /FIELD=(FIELD3, TYPE=ASCII, POSITION=3, FRAME='"',SEPARATOR="\t",SDEF="Color=C") /OUTFILE=”'Sheet1'!A1:C5001,H;rgen.xls” /PROCESS=XLS /FIELD=(FIELD1, TYPE=ASCII, POSITION=1, FRAME='"',SEPARATOR="\t",SDEF="Animal=A") /FIELD=(FIELD2, TYPE=ASCII, POSITION=2, FRAME='"',SEPARATOR="\t",SDEF="Another Animal=B") /FIELD=(FIELD3, TYPE=ASCII, POSITION=3, FRAME='"',SEPARATOR="\t",SDEF="Color=C")
Example #8: Sort and Insert Unicode Flat-File Data into Excel
Unicode Example with no file line arguments and no SDEFs
This IRI CoSort job script demonstrates sorting Unicode data in a flat file and outputting it into an XLSX sheet with no arguments in the file line and no SDEFs:
/INFILE=chiefs10utf8.txt /PROCESS=DELIMITED /FIELD=(president,POSITION=1,SEPARATOR="|",type=UTF8) /FIELD=(term,POSITION=2,SEPARATOR="|",type=UTF8) /FIELD=(party,POSITION=3,SEPARATOR="|",type=UTF8) /FIELD=(state,POSITION=4,SEPARATOR="|",type=UTF8) /SORT /KEY=state /OUTFILE=chiefs10XLSeasyunicode.xlsx /PROCESS=XLSX /FIELD=(president,POSITION=1,SEPARATOR="|",type=UTF8) /FIELD=(term,POSITION=2,SEPARATOR="|",type=UTF8) /FIELD=(party,POSITION=3,SEPARATOR="|",type=UTF8) /FIELD=(state,POSITION=4,SEPARATOR="|",type=UTF8)
Input:
Output:
Example #9: Sorts US Presidents in a UTF8 Arabic File into Excel
This script sorts the Arabic translation of U.S. president names and outputs to a specific column in an XLSX spreadsheet, with an English header of “NAME”.
/INFILE=chiefs_arabic /PROCESS=RECORD /FIELD=(name,POSITION=1,SIZE=100,TYPE=UTF8) /SORT /KEY=name /OUTFILE=”'Sheet1'!BA1:BD45,HEADER;chiefsAllXLSeasy.xlsx” /PROCESS=XLSX /FIELD=(name,POSITION=1,SIZE=100,SEPARATOR="\t",SDEF="NAME=BA",TYPE=UTF8)
Input data:
Output sample:
Example #10: Shifts Data Between Sheets and Ranges
This job shifts data between different sheet names and ranges in XLSX and XLS, and filters on one field.
/INFILE="'data'!WZX1092:XAD2092;dfdfdf.xlsx" /PROCESS=XLSX /FIELD=(SSN, POSITION=1, SEPARATOR='\t',SDEF="=WZX") /FIELD=(FIRST_NAME, POSITION=2, SEPARATOR='\t',SDEF="=WZY") /FIELD=(LAST_NAME, POSITION=3, SEPARATOR='\t',SDEF="=WZZ") /FIELD=(EMAIL, POSITION=4, SEPARATOR='\t',SDEF="=XAA") /FIELD=(GENDER, POSITION=5, SEPARATOR='\t',SDEF="=XAB") /FIELD=(IP_ADDRESS, POSITION=6, SEPARATOR='\t',SDEF="=XAC") /FIELD=(URL, POSITION=7, SEPARATOR='\t',SDEF="=XAD") /INCLUDE WHERE GENDER eq "Male" /REPORT /OUTFILE="'Verylongsheetname'!EX1092:FD2092;outr.xls" /PROCESS=XLS /FIELD=(SSN, POSITION=1, SEPARATOR='\t',SDEF="=EX") # /FIELD=(FIRST_NAME, POSITION=2, SEPARATOR='\t',SDEF="=EY") # don’t include commented field in output /FIELD=(LAST_NAME, POSITION=3, SEPARATOR='\t',SDEF="=EZ") /FIELD=(EMAIL, POSITION=4, SEPARATOR='\t',SDEF="=FA") /FIELD=(GENDER, POSITION=5, SEPARATOR='\t',SDEF="=FB") /FIELD=(IP_ADDRESS, POSITION=6, SEPARATOR='\t',SDEF="=FC") /FIELD=(URL, POSITION=7, SEPARATOR='\t',SDEF="=FD")
Resulting Output:
Example # 11: Create Header into a Sheet from Other Sheet Data
This script demonstrates writing additional data to the same XLSX file as the data was read from, with an added header in the appended report. NOTE: This functionality is only available with the XLSX format and not the XLS format.
# write a report to the same file as data was read from /INFILE=test.xlsx /PROCESS=XLSX /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="=A") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="=B") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="=C") /FIELD=(FIELD4, POSITION=4, SEPARATOR='\t',SDEF="=D") /FIELD=(FIELD5, POSITION=5, SEPARATOR='\t',SDEF="=E") /INCLUDE WHERE FIELD3 eq "December" /REPORT /OUTFILE=”G1:K351,HEADER;test.xlsx” /APPEND /PROCESS=XLSX /FIELD=(FIELD1, POSITION=1, SEPARATOR='\t',SDEF="Day=G") /FIELD=(FIELD2, POSITION=2, SEPARATOR='\t',SDEF="SSN=H") /FIELD=(FIELD3, POSITION=3, SEPARATOR='\t',SDEF="Month (December Only)=I") /FIELD=(FIELD4, POSITION=4, SEPARATOR='\t',SDEF="Integer=J") /FIELD=(FIELD5, POSITION=5, SEPARATOR='\t',SDEF="Decimal=K")
Input:
Output:
These were examples of the Excel data processing tools available through the SortCL-powered Voracity platform and its component products. They showed how to convert a flat file to Excel without an import stp, mask PII in Excel files, generate test data in Excel, and more.
They are just an inkling of what is now possible against XLS and XLSX sources and targets using these “IRI ETL tools for Excel.” In the next article, we will examine the xls2ddf utility, an executable that automatically builds SortCL data definition file (DDF) metadata files (field layouts) from existing spreadsheet headers.