Using Data Templates to Find Data Format Errors
In Working towards Data Quality, we defined data quality (DQ) as a state in which data can be used for operations. What makes the quality of data high is the paucity of errors. We also outlined the responsibility of data architects, governance, and MDM teams to define, identify, and correct bad data.
In this article, we introduce the concept of data templates in IRI software, and show how they can be used to spot errors of form and format in data.
Why Automation Matters
Such errors would be obvious to anyone reading the data carefully. Of course, this is not a practical solution at the terabyte level.
What the DQ group needs are programs that look at data, in an application-independent manner, and identify records whose data are wrong on their face. As an example from the previous article, we mentioned the month of “Yraurbef,” which is always wrong; this article was not written on June 31; and my salary does not contain “abc”.
The program that does the checking can run at the same time output is being produced for regular production purposes; i.e., data integration / transformation, migration / replication, reporting / provisioning, and data masking. Given large amounts of data, and the likelihood of not knowing everything to predefine or verify, the prudent approach is to run the jobs iteratively until all the errors are identified and corrected, or otherwise noted.
Templates & Composites
A template is a structure used by SortCL programs (in the IRI CoSort package) to describe a particular format of a source and target data field, like dates, as well as any literal characters that always occur within it. The syntax and semantics of this structure are well known to C programmers. Readers who are not familiar with C can quickly learn the options. Templates are a way to define a new, custom data type, or master data format.
Using these canned structures, or library routines, also allows the I/O system to create and recognize data in a precise format for quality assessment. On input, values are extracted according to the template you define, and are transferred into SortCL internal values. On output, the internal values are written in the proper form specified by the output template. Whether reading or writing, if the data are not in the specified form, an error message is given.
Templates support simple and composite values. A composite value is one or more simple or composite values grouped together under a single name. The idea is to fix individual items relative to each other. The composite can also contain static values; i.e., character strings, special characters, and punctuation. Such static values allow the composite forms to conform to internal standards, and to be more familiar and understandable to end-users.
Templates can be constructed in the IRI Workbench GUI (built on Eclipse™) or scripted directly in SortCL programs with a text editor. Importantly, existing templates can be centralized and re-used to maintain data conformity between applications. For those taking an enterprise view, templates are a corporate resource to be shared – named, centrally maintained, and imported into applications as needed.
Template Use Examples – Finding Format Errors
Example 1. Finding Errors in Text and Dates
/TEMPLATE=(abc_dates = "abc %b/%e/%y") /INFILE=dates /FIELD=(f1,STRUCTURE = abc_dates(month,day,year)) . . . .
The abc_dates template describes records that look precisely like this:
4 characters |abc | 3 character month name (Jan Feb ..) followed by a slash “/” 2 digit day of the month followed by a slash “/” 2 digit year
abc Jan/29/13 aBc Dec/ 2/01 abc Jun/31/69 abc Ocp/12/71
Error Message Reason @ dates + 2:02 'B'!= b' abc_dates" 'B' not equal to 'b' at 2nd char @ dates + 3 date [1969 6 31] abc_dates" no such date @ dates + 4:05 month |Ocp/12/71| abc_dates" no such month
Note: Names of months (e.g., Jan, January …) and names of days (e.g., Tue, Tuesday …) are recognized and generated according to the LOCALE setting.
Example 2. Finding Errors in Numeric Representation
/TEMPLATE = (Bill_fmt = "%3i %9s %7.2f") /INFILE = Purchases /FIELD = (Bill_fmt, STRUCTURE = Bill_fmt(Amt, Item, Price)) /REPORT /OUTFILE= Results /FIELD = (Amt, POSITION = 5, SIZE = 5.0, TYPE = NUMERIC) /FIELD = (Item, POSITION = 12, SIZE = 9, TYPE = ASCII) /FIELD = (Price, POSITION = 21, SIZE = 10, TYPE = NUMERIC)
Purchases (input) Results (output) 4 Carrots 3.29 4 Carrots 3.29 .2 Onions 1.98 0 Onions 1.98 10 Potatoes 3.26 10 Potatoes 3.26 1O Chickens 13.42 1 Chickens 13.42 10 Chickens 13,42 10 Chickens 13.00
Without looking at the next paragraph, the reader might want to try to explain why the output has bad results. The output looks right, and would be reported as such, or become part of any calculations. Fortunately, messages are being generated because the input data does not conform to the specified requirements of its template. These messages are displayed:
@ Purchases + 2:01 value | .2 | "Bill_fmt" @ Purchases + 4:01 value | 1O | "Bill_fmt" @ Purchases + 5:17 value |13,42| "Bill_fmt"
On Record 2, Amt may only be an integer of length 3; on record 4, the “0” in “10” is really the capital letter “O”; and on record 5, the decimal point is really a comma “,”. This is easy to see when the messages are given; but without them, the errors would not be found if they were nestled in a million records.
Next Article – Finding Context Errors
In our next discussion, we will talk about tools for discovering data that is application dependent –- data that looks possible but is wrong in context.
Topics include, finding:
- membership or non-membership in a set of values
- low_value ≤ actual value ≤ high value
- data that is a function of other values (e.g., range ≠ distance * time)