Generating Test Client Data

by Susan Gegner

IRI RowGen software creates test data you can customize to meet specific needs. It supports the formats and techniques that make your test sets as realistic as you want them to be. In this article, I am using RowGen to generate realistic master, or dimensional, data in a flat file that can be used in other set files or database tables.

The base test set I’m creating contains client records that could represent a patient, borrower, or any account with a primary and secondary name. These test names would be inter-dependent, and would not, in reality, change often; e.g., where the second name is the co-signer on a loan the first name opened. I’m also generating dynamic numbers to represent realistic transaction values within the file.

How RowGen Works

RowGen synthesizes data according to the values and structures specified in 4GL job scripts. These scripts are either handwritten or auto generated in the GUI for RowGen (or for Voracity) called IRI Workbench, built on Eclipse™.

For any given column of data, RowGen can:

randomly generate data according to the data type and width you specify
randomly select data from a set file of new values and formats you define on the fly
randomly select values from a pre-existing set of real values.

The best way to produce the most realistic test data is by using set files in the job scripts. RowGen’s forte is helping you create data sets that can be used to create bigger and better data sets. In fact, in the RowGen GUI, the New Test DB Job creation wizard uses this core capability to build entire RDB and EDW prototypes with referential integrity by creating and loading interdependent sets automatically. RowGen even allows you to choose from among several random selection methods and customize the distribution of values.

Set files consist of one or more columns of data separated by tabs. IRI provides set files for things such as names, cities, states, etc. You can also create set files from existing files or columns in a database. When only a few values are needed, such as yes and no, inline set files can be used. The inline set values are contained within the field statement of the job script.

Planning the Customer Set

Suppose you want to create test data for a customer-related application that will support referential integrity in an application or test database. The customer information is the data that remains relatively constant, and will be linked to more transient information such as dated transactions. To accomplish this, we simultaneously generate, in one record, all the information about the client that needs to remain consistent throughout any processing that might occur elsewhere. This master test set can be put in one table or file, and it can be used to build other tables or files.

The first step is to decide which pieces of information are necessary and what set files are to be used in the job script. Then create any set files that do not already exist. Next, where applicable, decide on the minimum and maximum size of each numeric field or column. For these test records, I will create:

Account Number: The value will consist of 2 fields. The first, called Acct_Code has an A, B, C, D, or E, will be obtained from an inline set. The second field, called Acct_Digits, consists of 15 digits.
Transaction Limit: This number represents how many thousands of dollars can be in one transaction; the maximum amount is 50. We will use a numeric set file called limit.set that contains:

[50,20]
[19,10]
[19,10]
[9,7]
6
5
4
3
2
1
1

The lines with brackets indicate a range. A square bracket indicates that the number next to the bracket should be included in the range. A round bracket would exclude a number from the range. When RowGen runs, it will populate a value randomly picked from any given line in the set file. If the line is a bracketed range, then a value from within that range is randomly selected/produced. Any lines repeated in the set file will be included in the random line selection as often as the line appears. This is a (one) way of defining frequency distribution in RowGen.
Primary First Name: This will use the IRI-provided names_first.set file
Primary Last Name: This will use the IRI-provided names_last.set file
Primary Birth Date: This will use the ISO date format with a lower limit of Jan 1, 1910, and an upper limit of Dec 31, 1997. We will use a date set file called P_Bday.set that contains:

[1910-01-01,1949-12-31]
[1950-01-01,1980-12-31]
[1950-01-01,1980-12-31]
[1981-01-01,1990-12-31]
[1981-01-01,1990-12-31]
[1991-01-01,1997-12-31]

The rules are the same for date set files and numeric set files.
Secondary First Name: This will use the IRI-provided names_first.set file
Secondary Last Name: Very often the secondary Last Name is the same as the primary Last Name. Therefore, we will make that true 75% of the time by creating a field called code containing an inline set file of [0,0,0,1]. Whenever the code field has a 0 value, the Secondary Last Name will be the same as the Primary Last Name. If the code is 1, then the name will be selected from the names_last.set file.
Secondary Birth Date: This will use the ISO date format with a lower limit of Jan 1, 1910, and an upper limit of Dec 31, 2015. We will use a date set file called S_Bday.set that is as follows:
```
[1910-01-01,1949-12-31]
[1950-01-01,1980-12-31]
[1950-01-01,1980-12-31] 
[1981-01-01,1990-12-31]
[1981-01-01,1990-12-31] 
[1991-01-01,1997-12-31] 
[1998-01-01,2015-12-31]
```

Street Address, consisting of three fields: a street number with up to 5 digits, the actual street name, and a street type, such as Road or Avenue. The STREET_NUM field values will be selected from an inline NUMERIC set in the full range defined as [1,99999]. The STREET_NAME will be randomly selected from the IRI-provided street_names.set file. STREET_TYPE will be randomly selected from the IRI–provided street_types.set file.
City: The city should be one that actually exists in the indicated state. It will be selected from the second column of the ‘joined pairs’ set file state_2char_city.set.
State: The state will have two characters selected from the first column of the same set file.

The two composite fields will be concatenated with the use of a feature called TEMPLATES. A TEMPLATE is defined with a /TEMPLATE statement, and then it is referenced in the /FIELD statement with the STRUCTURE parameter.

Note that we could add more items to the client information records, but this will suffice to demonstrate several techniques that can be used in the generation process.

Next, we need to decide how many client records must be generated. For this example, we will only generate 15 records, but we could just as easily generate 500, 1,000,000, or billions of rows. We want unique records with pipe-delimited fields for easier reading and database loading. We could also save them in tab-delimited format, with one or more fields for use in other RowGen test sets, like this one. Or we could do both at once. In fact, there is no limit to the number or format of RowGen target files or tables defined in the same generation script and job.

Specifying the Job

Every RowGen job script has at least three sections:

Input, where the initial fields are generated. Either specify a data type for random value generation, or a set file (and lookup method) for random value selection. The number of rows to generate is also defined here.
Action, where keys are defined if the test data is being sorted. We will sort on the account number, and specify /NODUPLICATES so that every client has a unique account number. As a result, however, it is possible the final number of records will be less than the number we chose to generate.
Output, which consists one or more target files, tables, and/or custom reports.

Below is the RowGen job script for creating the test customer data. /FIELD statements are in parentheses. There will be a field name, data TYPE, field number designated by POS, and delimiter designated by SEP. Other parameters are defined after that, such as set files. PRECISION indicates the number of decimal places for a NUMERIC TYPE.

/INFILE=random_file_placeholder
/PROCESS=RANDOM
/INCOLLECT=15 # generate 15 fields
/FIELD=(Acct_CODE, TYPE=ALPHA, POSITION=1, SEPARATOR="|", SET={ A,B,C,D,E })
/FIELD=(Acct_Digits, TYPE=DIGIT, POS=2, SEP="|", SIZE=15)
/FIELD=(Trans_Limit, TYPE=NUMERIC, POS=3, SEP="|", PRECISION=0, SET=”limit.set” SELECT=ANY)
/FIELD=(LNAME1, TYPE=ASCII, POS=4, SEP="|", SET="names_last.set")
/FIELD=(FNAME1, TYPE=ASCII, POS=5, SEP="|", SET="names_first.set")
/FIELD=(PBDAY, TYPE=ISO_DATE, POS=6, SEP="|", SET=” P_Bday.set” SELECT=ANY)
/FIELD=(NAME_CODE, TYPE=DIGIT, POS=7, SEP="|",SET={ 0,0,0,1 })
/FIELD=(FNAME2, TYPE=ALPHA, POS=8, SEP="|", SET="names_first.set")
/FIELD=(SBDAY, TYPE=ISO_DATE, POS=9, SEP="|", SET= “S_Bday.set” SELECT=ANY)
/FIELD=(STREET_NUM, TYPE=NUMERIC, POS=10, SEP="|", PRECISION=0, SET={ [1,99999] })
/FIELD=(STREET_NAME, TYPE=ASCII, POS=11, SEP="|", SET="street_names.set" SELECT=ANY)
/FIELD=(STREET_TYPE, TYPE=ASCII, POS=12, SEP="|", SET=".street_types.set" SELECT=ANY)
/FIELD=(CITY, TYPE=ASCII, POS=13, SEP="|", SET="state_2char_city.set" SELECT=ROW[2])
/FIELD=(STATE, TYPE=ALPHA_DIGIT, POS=14, SEP="|", SET="state_2char_city.set" SELECT=ROW[1])

/NODUPLICATES # Make sure each record is unique with respect to the Account Number
/KEY= Acct_CODE
/KEY= Acct_Digits

/OUTFILE=client.dat
/PROCESS=RECORD
/TEMPLATE=(AcctNum="%s%s") # Concatenates 2 string fields
/TEMPLATE=(Address="%s %s %s") # Concatenates 3 string fields with a space between each
/FIELD=(AcctNumber, TYPE=ASCII, POS=1, SEP="|", STRUCTURE=AcctNum(Acct_CODE,Acct_Digits))
/FIELD=(Trans_Limit, TYPE=NUMERIC, POS=2, SEP="|", PRECISION=0)
/FIELD=(LNAME1, TYPE=ASCII, POS=3, SEP="|")
/FIELD=(FNAME1, TYPE=ASCII, POS=4, SEP="|")
/FIELD=(PBDAY, TYPE=ISO_DATE, POS=5, SEP="|")
/FIELD=(LNAME2, TYPE=ASCII, POS=6, SEP="|", IF NAME_CODE EQ 0 THEN LNAME1 ELSE SET="names_last.set")
/FIELD=(FNAME2, TYPE=ALPHA, POS=7, SEP="|")
/FIELD=(SBDAY, TYPE=ISO_DATE, POS=8, SEP="|")
/FIELD=(StreetAddress, TYPE=ASCII, POS=9, SEP="|", STRUCTURE=Address(STREET_NUM,STREET_NAME,STREET_TYPE))
/FIELD=(CITY, TYPE=ASCII, POS=10, SEP="|")
/FIELD=(STATE, TYPE=ALPHA_DIGIT, POS=11, SEP="|")

You can run this script in the IRI Workbench or on the command line with:

rowgen /specification=genclient.rcl

Using the Results

This will generate output similar to the following:

A041228567690798|15|Huddleston|Jarrod|1921-04-04|Beltz|Rudolf|1982-08-06|23484 Hilda Heights|Owensville|MO
A284391085036017|19|Rubinstein|Romona|1995-03-27|Rubinstein|Phyliss|1984-12-16|91115 Red Bird Trace|Lakeport|CA
A554707733795739|1|Wagnon|Von|1944-05-26|Wagnon|Rona|1997-03-18|79903 Blue Ridge Street|San Jose|CA
A940444121953751|13|Borum|Elvira|1993-01-11|Burney|Shelby|1940-01-09|53835 Adams Artery|Fredonia|NY
B498416663279606|13|Galvin|Sanjuanita|1980-08-03|Galvin|Alvin|1989-02-07|37182 Outlaw Boulevard|Ponca City|OK
B665089089214940|10|Brinker|Lucie|1962-04-27|Brinker|Kelley|1985-10-28|41512 Rosemary Trace|Bellmawr|NJ
C092836176730933|16|Guillotte|Von|1984-03-14|Guillotte|Peggie|1988-12-17|50676 Whitley Avenue|Marlborough|MA
C346991978386191|13|Manjarrez|Harriet|1996-06-08|Manjarrez|Karrie|1999-03-28|88221 North Chase Corridor|Plainview|TX
C744423272297425|2|Olsson|Claudia|1975-01-07|Olsson|Lavern|1994-10-21|47048 Michael Trace|Salmon|ID
C847549307203805|4|Spradley|Concepcion|1917-10-21|Spradley|Ahmad|1972-11-09|7045 Sanderson Loop|Wallace|NC
C865940707690079|1|Jonson|Alonso|1990-04-03|Jonson|Jared|1965-07-16|62832 Baker Chapel Corridor|Lake Stevens|WA
D552053977641891|18|Murdock|Wanda|1960-10-22|Murdock|Ann|1915-04-04|48010 Gracie Grove|Moultrie|GA
D889743001706813|5|Shuey|Jeff|1978-06-28|Shuey|Monty|1953-10-08|6358 Elkin Avenue|Streamwood|IL
E229826294718256|1|Shuford|Freida|1955-04-06|Shuford|Von|1920-10-02|96925 Leslie Village|Nashville|AR
E864261095027054|2|Hocking|Nicola|1976-05-12|Henrickson|Josh|1967-06-20|34785 Squire Course|Palm Desert|CA

From this initial set file, I now have the basis of a transaction system or database. I can load the data in this set file into a table where the RowGen GUI can be used to build more sets from columns. With values the same in the other tables, referential integrity is established. So if you need to test a database performing joins, IRI CoSort (SortCL) scripts to capture changes in files, or something else, you can use RowGen to prototype a relational application.

For example, my system might use the account numbers in this file in a billing system. For test purposes, we can generate a master product file in a similar manner to the client information file above. A third transaction file (or table) may contain a product SKU (or medical billing code or bank fee) from the second dimensional file, along with the account number from the first file. When a purchase is made by a person with an account number in my base file, the billing application can invoice the address associated with that person.

Optimizing Transforms in DataStage with CoSort

Indexing Splunk with Voracity Add-On