Test Data Generation for Databases

 

Next Steps
Overview Benchmarking Compliance DB Test Data DB Subsetting DevOps Test Files/Reports Virtual Test Data TDaaS

Challenges

DBAs and data architects trying to populate test databases often rely on unsafe or inadequate production data. There are a number of problems with this approach, including:

Data Insufficiency

Testing with production databases is not a good strategy for several reasons:

  • Access to the database itself may be restricted because of the PII or other information within
  • The database may not exist yet
  • Current values in the database do not reflect the future range or volume of data that must be tested
  • Even masked, the wrong data masking functions can render the test data unrealistic or still non-compliant
Data Realism

DBAs and application developers require test data that has the characteristic content, format, and relationships of future production data. This is the only way to verify that each application step and query will still work.

Concurrency & Security

Developers require data that are secure (i.e., do not violate data privacy laws) and do not depend on other developers or phases of development to produce the data they need for their phase. Development may also need to be outsourced, and to occur in different locations concurrently, where consistent / reliable golden copies are test data are needed.

 

Beyond test data realism and anonymity is the issue of availability. That is, another key challenge for database testing solutions is how the test data is provisioned for use. It must be accessible in the formats and silos (or DevOps pipelines) required, and remain consistent or restorable in the event of change.

 

Solutions

There are several database test data generation tools now on the market. The most flexible, rapid, and robust test data solutions for databases are built with IRI RowGen software. Why?

RowGen allows you to approach database test data management from whatever angle best first you use case. Its primary test data generation method is data synthesis; i.e., RowGen creates synthetic test data for databses that is highly realistic.

Alternatively, IRI RowGen, FieldShield, or Voracity platform users can use their license for database subsetting (with or without data masking). Regardless of your strategy, the Eclipse UI, job metadata, and engine behind all this is the same.

The software works across all databases using the same easily defined, and reuseable test data generation rules. Your test data generation (or masking) rules typically apply consistently acros data classes (like columns), and leverage either random value generation or random value selection from other datasets. Once you approve the rules, you can automate their application in a fit-for-purpose wizard that builds your database test data job.

Before you run the wizard, think about some of the larger issues to consider as you scope out your test data requirements. See the series of articles on database test data management and test data provisioning in the IRI blog, starting here. Then, learn why the RowGen test data product -- or the IRI Voracity data management platform that includes it -- has been the best test data generation tool for achieving these goals since 2004.

No Production Data Needed

IRI RowGen software produces accurate, safe test data that reflects production database table formats, sizes, value ranges, and constraints - without requiring production data. RowGen uses your DDL information to automatically build and load huge, structurally and referentially correct test tables fast. For details, see:

Blog > Test Data > RowGen Automates Database Test Data Generation

Blog > Test Data > How to Generate DB Test Data

Video > Populating a Database with Realistic Test Data for Software Testing from Scratch

Test Data Realism

Valid test tables must contain the full range of values, data types, row layouts, and primary-foreign key relationships that DB application rely on. For information on how RowGen preserves data realism, see:

Blog > Test Data > Making Test Data Realistic

Concurrent Development

By generating test data for each phase of development, you can create phases simultaneously, and independently from other phases. For example:

  • Step 1 - Read a personnel table and join health insurance claim data to make a status file
  • Step 2 - Read the status file and generate a list of doctors
  • Step 3 - Load a test table and create a web-ready billing summary report sorted by patient

RowGen can generate the input table for step 1, the status file for step 2, and the table and web report targets in step 3 -- all without needing real data or data from the other steps.

Secure Development

RowGen gives you the ability to synthesize any type, size, layout, and amount of safe test data so you can create and provide for a more secure development environment.

Alternatively, IRI FieldShield allows you to use real data that is anonymnized with field-level data masking functions (like encryption) on a need-to-see basis. For details, see Solutions > Data Masking.

Both tools co-exist in the IRI Workbench GUI which also features automatic DB subsetting with masking. Either way, with safe data, there is no need to certify or bond your outsourced application developers.

Seamless Transition to Production

RowGen shows you what the real data can look like, and what the transformation and reporting application can look like. That's because both RowGen and the SortCL program in IRI CoSort and Voracity use the same metadata to define data manipulations and table layouts. They also use the same runtime infrastructure for API, CLI, and automated operations.

This means that RowGen can also perform the same data transformations and reporting functions you normally would in the same I/O pass and job script doing the test data generation. It also means that the same layouts (and even transformations) created for data synthesis jobs is immediately ready for data integration, masking, migration, reporting, and other data processing jobs if and when real data becomes available.

If you only use RowGen as your automated test data generation solution, you can easily upgrade to CoSort or Voracity to transform and report on the real data (when it becomes available). Using the same job script RowGen used to define and transform test data, you can transform real data in the same format.

Virtualization and Provisioning Options

IRI test data builders can furnish testers with test data in multiple ways, including:

  • Ad hoc tables (or files in multiple formats) in persistent or federated (view) schema
  • Golden copies of test data in existing lower environments, plus the ability to create new tables
  • During the cloning of databases in tools like Windocks, Actifio (Google) and Commvault
  • From within DevOps pipelines like Amazon CodePipeline, Azure DevOps, Jenkins, GitLab, etc.
  • Directly through test data management (TDM) hubs from Cigniti or ValueLabs

See the DevOps and Virtual Test Data pages in this section to learn more and link to more information on these options. Compare these methods and costs to more expensive and proprietary solutions that require special drivers or standing up ephemeral databases.

Frequently Asked Questions (FAQs)

1. What is database test data generation?
Database test data generation is the process of creating realistic, structurally correct, and referentially accurate data for use in database testing environments without exposing actual production data. It helps developers and testers validate applications safely and effectively.
2. How can I create referentially correct test data for my database?
You can use deterministic data masking functions like pseudonymization or encryption in IRI FieldShield or DarkShield, or you can use IRI RowGen to generate synthetic test data that adheres to the structure and relationships defined in your database schema (DDL). Any of these methods preserves primary-foreign key relationships and ensures each test table reflects production logic.
3. What are the risks of using production data for testing?
Using production data can lead to compliance violations, exposure of sensitive PII/PHI, and inaccurate testing if the data lacks future-state scenarios. Masking alone may still yield non-representative or risky data if not applied properly.
4. How does RowGen use DDL to generate test data?
RowGen reads your database's DDL (Data Definition Language) to understand table structures, constraints, and relationships. It then synthesizes data that fits those definitions—automatically and at scale—without requiring access to actual production data.
5. Can IRI RowGen generate test data for every development phase?
Yes. RowGen enables phased test data creation by simulating inputs, intermediate files, and final outputs independently. This supports parallel development without dependency on other phases or real data.
6. What kinds of databases does RowGen support?
RowGen supports all major relational databases, including Oracle, SQL Server, DB2, Teradata, PostgreSQL, and MySQL. It works across platforms by using common DDL inputs and producing data in standard formats or via direct DB load.
7. Can I generate large volumes of test data for performance testing?
Yes. RowGen can quickly generate millions or billions of rows that reflect the desired formats, value ranges, and data distributions needed for stress testing and benchmarking database performance.
8. How is test data provisioned for DevOps pipelines?
IRI RowGen jobs (or Voracity subsetting, or FieldShield and DarkShield masking jobs) can be integrated into CI/CD tools like Jenkins, GitLab, Amazon CodePipeline, and Azure DevOps. It also supports provisioning through cloning platforms like Windocks, Commvault, and TDM hubs like Cigniti or ValueLabs.
9. What if I need to anonymize real database data instead?
IRI FieldShield or DarkShield can be used to mask production data at the field level using techniques like encryption, pseudonymization, or redaction. This ensures compliance while preserving utility for lower environments.
10. Can I automate test data creation with RowGen?
Yes. RowGen includes automation wizards and supports CLI, API, and job scheduling to enable repeatable, scalable test data generation across projects and pipelines.
11. What makes RowGen different from other test data tools?
RowGen creates both the test data and the layout metadata, which can be reused for downstream data integration, masking, reporting, and transformation jobs. In fact, RowGen can synthesize, transform, and report on the test data in the same job it creates data! It’s also a unified approach not dependent on real data or proprietary environments, which means it can generate structurally and referentially correct test data for any database, as well as test data for NoSQL databases, structured and semi-structured files, and even generate synthetic values DarkShield can use to generate test documents and images in unstructured formats! Finally, RowGen embeds the multi-threaded CoSort engine to pre-sort test data in PK or query order to immediately or subsequently populate (load the test data it creates) into an entire schema.
12. How do I deliver test data to different teams or regions?
IRI test data jobs can output to ad hoc tables, federated schemas, or files in multiple formats. You can create golden copies of data for reuse or generate new tables dynamically across distributed teams.
Share this page

Request More Information

Live Chat

* indicates a required field.
IRI does NOT share your information.