Database Subsetting

 

Next Steps
Overview Benchmarking Compliance DB Test Data DB Subsetting DevOps Test Files/Reports Virtual Test Data TDaaS

Challenges


Database application developers often rely on data in production tables for testing. But there are several drawbacks to that approach, including, the:

  1. confidentiality of data in those tables
  2. cost of migrating, masking, refreshing, and/or storing replicated databases for testing
  3. redundancy of production data, which means wasted space and insufficient testing coverage
  4. need for only small slices of data for specific test cases

Sometimes smaller, masked subsets of massive files are also needed for rapidly testing applications with anonymized data. Most data masking tools cannot support the volume and variety of flat files involved.

Solutions


In addition to the powerful database parsing, generation, and population capabilities that IRI RowGen provides for synthesizing structurally and referentially correct test data, you can now also produce (and mask) referential intact database subsets from standard relational -- as well as complex and/or very large database (VLDB) -- sources.

A proven, powerful Database Subsetting wizard for relational databases is inside IRI Workbench, the Eclipse IDE for the IRI Voracity test data management platform or its component IRI Data Protector suite test data security tools: IRI FieldShield and IRI DarkShield for data masking and/or RowGen for test data generation.

This ergonomic data subsetting tool allows you to rapidly create custom-sized subsets of manageable, referentially correct data determined by your master (parent) table, and apply consistent data masking and/or mapping rules to all the subset (child) tables at the same time

It's also possible to selectively subset and mask data from individual tables on an ad hoc basis for testing using the same IRI metadata framework. To do that, simply write, or use a wizard to automatically create, a single source FieldShield data masking job script with the table details, and add either SQL SELECT syntax directly into the input section, or create and use custom /INCOLLECT row-count filtering and/or qualitative /INCLUDE or /OMIT statements to define the size and content of each subset, respectively.

You can provision the DB subsets in multiple ways for test data users working on-premise or in the cloud. These database subset management options include: new persistent or virtual (federated) test schema, flat file targets, and DevOps pipelines per this example

In addition to database subsets,  you can also create test file subsets. Use RowGen to create synthetic, highly realistic test files in any format and size. Use FieldShield to extract (and mask) test data subsets from structured (flat) files in fixed position or delimited format using its built-in selection and filtering functionality. Built on the same big data engine for Voracity (the IRI CoSort SortCLprogram), FieldShieldwill search and mask structured files at any volume, too.

Subsetting strategies like these not only minimize the risk of PII exposure and privacy law violations, but dramatically lower the costs of database and application testing infrastructures ... some say as much as $50,000 per database. And by running the subsetting jobs you save, you can use this tool for automated test data generation.

Learn how to automatically set-up, and create test data subsetting jobs in Workbench here.

Frequently Asked Questions (FAQs)

1. What is database subsetting?
Database subsetting is the process of extracting a smaller, representative slice of data from a larger production database. These subsets maintain referential integrity and are used for testing, development, and training without exposing full production datasets.
2. How can database subsetting help reduce testing costs?
By working with smaller, targeted subsets instead of full databases, you reduce the storage, migration, masking, and refresh costs. Subsets also accelerate test cycles and infrastructure provisioning.
3. What makes IRI’s subsetting solution different?
IRI’s solution uses an Eclipse-based wizard to automate referentially intact subsets based on master-child table relationships. It applies masking and mapping rules across all related tables and scales to large or complex database environments.
4. How does IRI ensure referential integrity in subsets?
IRI’s subsetting tool identifies the parent table and preserves all related foreign key relationships across child tables, ensuring that the extracted subset behaves like the original schema during testing.
5. Can I apply data masking during the subsetting process?
Yes. IRI allows you to apply masking rules in the same job that performs subsetting. You can reuse FieldShield’s encryption, redaction, or pseudonymization functions across all extracted data to stay compliant with privacy laws.
6. What databases are supported for subsetting?
IRI supports all major relational databases, including Oracle, SQL Server, PostgreSQL, MySQL, DB2, and Teradata. It also works with very large databases (VLDBs) without needing proprietary connectors or additional infrastructure.
7. How do I subset a specific portion of a table?
You can filter data using SQL SELECT statements, or apply IRI’s /INCOLLECT, /INCLUDE, or /OMIT syntax in FieldShield job scripts to define subset size, row limits, or specific criteria.
8. Can I subset and mask flat files?
Yes. IRI FieldShield can extract and mask structured flat files, including fixed-width and delimited formats. RowGen can also generate synthetic test file subsets in any structure or volume.
9. What formats can I provision the subsets into?
Subset data can be delivered as persistent tables, federated (virtual) schemas, or flat files. You can also push them into DevOps pipelines or clone environments using supported tools.
10. Can I reuse subsetting jobs for automated test data generation?
Yes. Once created, your subsetting and masking jobs can be saved and reused in scheduled, scripted, or pipeline-driven workflows for automated test data provisioning.
11. How does subsetting support compliance with data privacy laws?
Subsetting minimizes exposure by limiting access to sensitive fields and enabling targeted masking. This helps organizations stay compliant with laws like HIPAA, GDPR, and PCI DSS.
12. Can I perform subsetting in the cloud?
Yes. IRI supports on-premise, cloud, and hybrid environments. You can provision subsets to cloud-hosted databases or storage and integrate with cloud DevOps tools like Jenkins, GitLab, and Azure DevOps.
Share this page

Request More Information

Live Chat

* indicates a required field.
IRI does NOT share your information.