Synthesizing Realistic HL7 and X12 Test Data
In this article we will discuss how HL7 and X12 messages are structured, and demonstrate how to use the IRI RowGen test data generation software package to generate realistic messages in those (primarily healthcare-related) EDI file formats.
The challenge of obtaining test data which complies with business rules and data privacy regulations is a familiar one for DBAs, developers and QA teams. Using production data for testing the functionality or capacity of an application may represent reality for testers, but what happens when that data contains PII? And what level of security is associated with the test environment?
A common way to address this challenge is to use obfuscated or anonymized production data for testing. That requires you to find and mask all the sensitive data, something IRI DarkShield can do with HL7 and X12 files. But such jobs can be time-consuming to design and run in a HIPAA-compliant way, and there are security and tuning issues associated with reaching and masking large volumes of production data.
For these reasons, it may be more efficient to synthesize HL7 or X12 messages in test files that can be used in developing and stress-testing an application (API, website, mobile app, etc.).
IRI RowGen is test data generation and management software that can create realistic, but synthetic, datasets suitable for testing in many targets, which now also include HL7 or X12 flat (Electronic Data Interchange, or EDI) file formats. It is also possible to create FHIR files using JSON constructs; see below.
Structure of HL7 and X12 Messages
HL7 and X12 share several common features, as both have highly standardized formats. HL7 and X12 messages are made up of rows that are referred to as segments. Within each segment there are fields, and each field is separated from the next by a field delimiter.
Two field delimiters with no content between them indicate an empty field. An empty field must still be present, however, because the position of a field indicates its purpose according to the published standards.
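To illustrate why empty fields have to stay in place, here is a minimal Python sketch that splits one segment of each kind into positional fields. The two sample segments are made up for illustration, and the delimiters ("|" for HL7, "*" and "~" for X12) are the commonly used defaults rather than values taken from any particular file.

```python
def split_segment(segment: str, delimiter: str) -> list[str]:
    """Split one segment into fields, keeping empty fields in position."""
    return segment.rstrip("\r\n").split(delimiter)


# Hypothetical HL7-style PID segment; "|" is the usual HL7 field delimiter.
hl7_segment = "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F"

# Hypothetical X12-style NM1 segment; "*" separates elements, "~" ends the segment.
x12_segment = "NM1*IL*1*DOE*JANE****MI*123456789~"

hl7_fields = split_segment(hl7_segment, "|")
x12_fields = split_segment(x12_segment.rstrip("~"), "*")

# Adjacent delimiters produce empty strings, so later fields keep their index.
for position, value in enumerate(hl7_fields):
    print(position, repr(value))
```

Because the empty strings are preserved, the birth date in the HL7 sample stays at index 7 even though several earlier fields carry no value.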
Sample X12 message
Analysis of an X12 segment
Sample HL7 message
Analysis of an HL7 segment
Message Envelopes for X12 and Message Headers for HL7
Message envelopes (X12) and message headers (HL7) are special segments that accompany messages to indicate the purpose of their content. In HL7, the message header segment contains the information required by the receiving organization and must always precede the actual content of the message. In X12, a message envelope, true to its name, envelops the content of the message with preceding and trailing segments.
Below is an image that breaks down message envelopes and message headers:
Comparison of X12 envelopes and HL7 message headers.
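The difference between a leading header and a wrapping envelope can be sketched as two simple checks in Python. This is only an illustration: the sample messages are heavily abbreviated and are not valid interchanges, the function names are hypothetical, and the check looks only at segment identifiers (MSH for HL7; ISA/GS/ST and SE/GE/IEA for X12).

```python
def segment_ids(message: str, terminator: str, delimiter: str) -> list[str]:
    """Return the leading identifier of every non-empty segment."""
    segments = (part.strip() for part in message.split(terminator))
    return [seg.split(delimiter)[0] for seg in segments if seg]


def has_hl7_header(message: str) -> bool:
    """An HL7 message must begin with the MSH header segment."""
    ids = segment_ids(message, "\r", "|")
    return bool(ids) and ids[0] == "MSH"


def has_x12_envelope(message: str) -> bool:
    """An X12 message is wrapped by leading and trailing envelope segments."""
    ids = segment_ids(message, "~", "*")
    return ids[:3] == ["ISA", "GS", "ST"] and ids[-3:] == ["SE", "GE", "IEA"]


# Abbreviated, made-up samples (real ISA and GS segments carry many more elements).
hl7_message = (
    "MSH|^~\\&|SENDER|FACILITY|RECEIVER|FACILITY|20230101120000||ADT^A01|MSG0001|P|2.5\r"
    "EVN|A01|20230101115500\r"
    "PID|1||12345^^^HOSP^MR||DOE^JANE\r"
)
x12_message = "ISA*00~GS*HC~ST*837*0001~NM1*IL*1*DOE*JANE~SE*3*0001~GE*1*1~IEA*1*000000001~"

print(has_hl7_header(hl7_message), has_x12_envelope(x12_message))
```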
About RowGen
IRI RowGen was created to satisfy the need for test data that is realistic and, where required, referentially correct. When production data is not a viable source of test data, RowGen can fill the gap by providing realistic test data without the need for real data.
Using RowGen we can populate COBOL, CSV/TSV, JSON, XML, Excel, other flat files, and database tables with referentially correct synthetic data. And now, RowGen can also generate semi-structured EDI files like ASN.1 (for CDRs), or HL7 and X12 (for healthcare transactions).
Using RowGen to Create HL7 & X12 EDI Files
As mentioned previously, RowGen can generate realistic synthetic data in HL7 and X12 formats. There are some limitations to be aware of in these contexts, however.
First, you will need to know ahead of time how the EDI message must be structured. For instance, an ADT (HL7) message requires certain segments to be present while others are optional. In other words, you need to know which segments make up the message, as well as the content and order of the fields within each segment. See this link to learn more about HL7 formatting standards.
The second caveat is the need for external, pre-built set files that hold the candidate values for the segment fields. Set files can contain one or more tab-delimited columns of data, and RowGen randomly selects from them at generation (test target population) time. Populating these set files with realistic data requires a bit of legwork on the part of the user.
A set file generated from a list of fake medical providers from an online name generator.
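As a rough illustration of what a set file holds, the Python sketch below writes a small tab-delimited file and then picks one row at random, which mimics the kind of selection described above. It does not use RowGen itself; the file name, the column layout, and the provider names are all made up.

```python
import csv
import random
import tempfile
from pathlib import Path


def write_set_file(path: Path, rows: list[list[str]]) -> None:
    """Write rows as a tab-delimited text file, one set entry per line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def pick_row(path: Path, rng: random.Random) -> list[str]:
    """Read the set file back and select one entry at random."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    return rng.choice(rows)


# Fictional providers: ID, family name, given name (hypothetical column layout).
providers = [
    ["1001", "STONE", "AVERY"],
    ["1002", "MARSH", "JORDAN"],
    ["1003", "QUILL", "MORGAN"],
]

with tempfile.TemporaryDirectory() as folder:
    set_path = Path(folder) / "providers.set"  # hypothetical file name
    write_set_file(set_path, providers)
    chosen = pick_row(set_path, random.Random(7))  # seeded for repeatable output

# Join the columns with "^", the usual HL7 component separator.
print("^".join(chosen))
```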
Lastly, it is important to understand that there is currently no readily available method that will allow you to apply business logic to the data placed in fields. In HL7 and X12 file contexts, business logic refers to values in one or more fields relating to values in other fields located in the same or different rows.
For example, providing business logic for dates and times is difficult. The MSH (message header) segment of an HL7 message contains a field with a timestamp for when the message was sent. Further down, the EVN (event) segment contains a field that indicates the date and time an event occurred. The event (e.g., a patient request or doctor’s orders) was the trigger for sending the message. Obviously, it would not make sense for the timestamp in MSH to precede the one in EVN.
Currently, the only way to tackle this problem is through environment variable manipulation or a custom external script (batch file, Python, Java, etc.). With environment variables you can assign values to fields ahead of time, but this method is less flexible. Conversely, custom scripts in a language like Python allow for “smarter” data generation. For instance, a script could decide the range of possible dates for one field based on the date already placed in another field.
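A custom script for the date example might look like the following Python sketch. It is an assumption about how such a helper could be written, not part of RowGen: the function name, the date range, and the maximum delay are hypothetical. It draws the event time first and then derives the message time from it, so the MSH value can never precede the EVN value.

```python
import random
from datetime import datetime, timedelta

HL7_TIMESTAMP = "%Y%m%d%H%M%S"  # YYYYMMDDHHMMSS, a common HL7 timestamp layout


def event_and_message_times(
    rng: random.Random,
    earliest: datetime,
    latest: datetime,
    max_delay: timedelta,
) -> tuple[str, str]:
    """Return (EVN time, MSH time) with the message sent at or after the event."""
    span_seconds = int((latest - earliest).total_seconds())
    event_time = earliest + timedelta(seconds=rng.randint(0, span_seconds))
    # The message time is derived from the event time, never drawn independently.
    delay = timedelta(seconds=rng.randint(0, int(max_delay.total_seconds())))
    message_time = event_time + delay
    return event_time.strftime(HL7_TIMESTAMP), message_time.strftime(HL7_TIMESTAMP)


rng = random.Random(42)  # seeded so repeated runs give the same values
evn_time, msh_time = event_and_message_times(
    rng,
    earliest=datetime(2023, 1, 1),
    latest=datetime(2023, 12, 31),
    max_delay=timedelta(minutes=10),
)
print("EVN event time:  ", evn_time)
print("MSH message time:", msh_time)
```

The two values could then be written into set files or environment variables for the corresponding segment jobs to pick up.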
Realistic Data from Set Files
For RowGen to generate realistic data in fields, set files are essential. Set files are text files that store data in one or more columns separated by tabs. They can be generated using a variety of techniques for various types of data, and the IRI Workbench IDE (GUI for RowGen) comes pre-installed with several set file creation wizards to help get you started.
In addition, built-in data generation functions that run in RowGen and are configured in IRI Workbench can synthesize email addresses, phone numbers, credit card numbers, dates and social security numbers out of the box.
Another option would be to extract values from tables, spreadsheets, csv, tsv, etc. and mask them using IRI FieldShield or IRI DarkShield. Source-specific data masking examples are listed here.
Finally, there is web scraping (web data harvesting), which can be performed programmatically or manually. Web scraping allows you to gather information from the internet by reading text located in a web page’s HTML file. The RowGen-integrating Test Data Hub from IRI partner Value Labs supports this, as do other tools like ParseHub, Scrapy, and Webhose.io.
About this Project
To demonstrate how to generate HL7 and X12 messages, I will walk through a RowGen project that generates HL7 messages. Note that regardless of whether you are generating HL7, X12, etc., the implementation will be similar.
This project makes use of Data Definition Format (.ddf) field layout metadata files used with RowGen (.rcl) and other SortCL-compatible jobs. IRI Workbench metadata discovery wizards like this one can generate DDF files automatically.
How the Project is Structured
How the structure of the project flows
Viewed from top to bottom, we have a main HL7-generating .rcl (RowGen job) script that extracts entire segments from set files to write a complete HL7 message. The extracted segments are generated beforehand and placed in a folder called rows, using .rcl scripts that correspond to each segment. These .rcl scripts are kept in a folder called scripts.
Note that it is the segment-generating scripts that ultimately determine whether you are building an HL7 or X12 message. You either want to build X12 segments for X12 messages or HL7 segments for HL7 messages.
Each segment-generating .rcl script in the scripts folder uses a corresponding DDF file that is stored in a metadata folder. The DDF files in the metadata folder draw from set files stored in the sets folder to populate the fields defined by the DDFs.
To sum things up, a batch file will be used to execute all the segment-generating .rcl files and populate the rows folder with set files containing the segments. After that, the batch file makes a final call to the main EDI-generating .rcl script that will draw from the set files containing the individual segments to create a completed message.
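The order of operations in that batch file can be sketched in Python as follows. This is a hedged outline, not the project's actual batch file: the script names are hypothetical, and the way a job is launched is deliberately left to a caller-supplied function because the exact command line is not shown here.

```python
import tempfile
from pathlib import Path
from typing import Callable


def plan_jobs(scripts_dir: Path, main_script: Path) -> list[Path]:
    """Segment-generating jobs first (sorted by name), main message job last."""
    return sorted(scripts_dir.glob("*.rcl")) + [main_script]


def run_all(jobs: list[Path], run_job: Callable[[Path], None]) -> None:
    """Run each job in order; run_job decides how a job is actually launched."""
    for job in jobs:
        run_job(job)


executed: list[str] = []

with tempfile.TemporaryDirectory() as folder:
    project = Path(folder)
    scripts = project / "scripts"
    scripts.mkdir()
    # Hypothetical, empty stand-ins for the segment-generating job scripts.
    for name in ("MSH.rcl", "EVN.rcl", "PID.rcl"):
        (scripts / name).touch()
    main_job = project / "hl7_message.rcl"  # hypothetical main job name
    main_job.touch()

    # Here the "runner" only records the job name instead of launching anything.
    run_all(plan_jobs(scripts, main_job), lambda job: executed.append(job.name))

print(executed)
```

In a real project the recording function would be replaced by one that launches the RowGen job for each script.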
Infile DDF file for a HL7 segment
A RowGen script to generate a unique HL7 segment and place it in a corresponding set file
Below is the main RowGen (.rcl) job script that creates the entire HL7 message file from the generated segments stored in the rows folder:
It is possible to configure what segments are generated in an HL7 message from a file called config. This configuration file allows you to determine what HL7 segment types will appear, and in what order they will appear, in the HL7 message.
Snapshot of the main RowGen job script that will synthesize the complete HL7 message
Contents of config file
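The role of the config file can be mimicked in a few lines of Python. This sketch makes several assumptions that are not confirmed by the project: that config lists one segment type per line, that each segment type has a file named after it in the rows folder, and that one pre-generated segment is picked at random per type. It is meant only to show how the segment order in config drives the shape of the final message.

```python
import random
import tempfile
from pathlib import Path


def read_config(config_path: Path) -> list[str]:
    """Return segment types in the order listed, ignoring blank lines."""
    lines = config_path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_message(config_path: Path, rows_dir: Path, rng: random.Random) -> str:
    """Pick one pre-generated segment per configured type and join them."""
    segments = []
    for segment_type in read_config(config_path):
        rows_file = rows_dir / f"{segment_type}.set"  # hypothetical naming scheme
        candidates = [row for row in rows_file.read_text(encoding="utf-8").splitlines() if row]
        segments.append(rng.choice(candidates))
    # HL7 segments are terminated by a carriage return.
    return "\r".join(segments) + "\r"


with tempfile.TemporaryDirectory() as folder:
    project = Path(folder)
    rows = project / "rows"
    rows.mkdir()
    # Made-up segment rows standing in for the output of the segment jobs.
    (rows / "MSH.set").write_text("MSH|^~\\&|APP|FAC|APP2|FAC2|20230101120000||ADT^A01|1|P|2.5\n", encoding="utf-8")
    (rows / "EVN.set").write_text("EVN|A01|20230101115500\nEVN|A01|20230101115900\n", encoding="utf-8")
    (rows / "PID.set").write_text("PID|1||12345^^^HOSP^MR||DOE^JANE\nPID|1||67890^^^HOSP^MR||ROE^SAM\n", encoding="utf-8")
    config = project / "config"
    config.write_text("MSH\nEVN\nPID\n", encoding="utf-8")

    message = build_message(config, rows, random.Random(3))

print(message.replace("\r", "\n"))
```

Reordering or removing lines in the hypothetical config changes which segments appear in the output, and in what order.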
With the execution of the main HL7-generating .rcl file, a complete HL7 message is produced:
In Closing
In this article we discussed how HL7 and X12 EDI-formatted files are structured, and demonstrated how to use IRI RowGen to create test message files in those formats. Generating realistic but safe EDI messages enables application testing that adheres to HIPAA standards.