Once a database exceeds a certain size, it becomes expensive — and risky from a security perspective — to provide full-size copies for development, testing, and training. Most teams need smaller copies of the larger database, and often have the PII within masked.
Database subsetting is that process of creating a smaller, referentially-correct copy of a larger database from real table extracts. Subsets can be used with, or in lieu of, masking data or synthesizing test data to reduce the costs and risks associated with full sets. The process of creating meaningful subsets manually is complex and laborious, considering you would have to populate smaller databases with random samples from each production table, and make sure that any relational structure between the tables was still correct in the subset.
An end-to-end database subsetting job wizard in the IRI Workbench makes this process faster and easier. The wizard is available to licensed users of the IRI Voracity data management platform, IRI RowGen for test data generation, and IRI FieldShield for data masking. This article introduces the wizard and shows how it creates referentially-correct subsets, with options for column masking and field-preserving encryption.
The wizard for subset creation has the user select the source of the subset, the size and sorting of it, the names for the subset targets, and any encryption or masking the subsetted data should get. The wizard then creates a series of job scripts that create either subset tables or flat files.
This page defines the output type. If the loader is left empty, a flat file with a tab delimiter will be created. For database options, either ODBC or the specific database loader can be selected.
This page defines the table that will “drive” the created subset. Think of the driving table as the main table from where you want the subset to originate. For example, if you want to subset a table of sales and all tables connected to it, you would select the table with sales info here.
This page is where you specify the sort order of the subset. If you want the rows selected randomly, however, leave the Key Fields list blank.
The Sort page has 3 notable sections:
- The Input Fields list
- The Key Fields list
- Key Options
- If you want a subset of sorted data, select the column to sort on and click “Add Key.”
- A few options are available in the Key Options box:
- Select the Stable check box to duplicate records to load in the order they are in the production table.
- If you want no duplicates or only duplicates in your subset, select the Duplicates check box and the appropriate radio button.
This page is where you specify the size of the subset. For example, to get a subset of the 100 highest quantities sold, you would sort on the quantity sold field (as in this example) and enter 100 in the number of rows. A filter can also be added on the driver table.
This page is displayed if the output type is a database. It has two functions. The first allows you to view the relations of your driver table. The second is to name the output targets for the subset.
The Target Naming page has 5 notable sections:
- The target profile and schema
- The output mode is always create as this job will truncate any existing tables of the same name
- SQL options for the job
- The naming options
- The references list
- Select the connection profile.
- Select the schema.
- Specify the SQL options.
- Select the naming option to name the subset targets. If a different profile or schema are used, this step is optional:
- Prefix all adds a prefix to every table name and sets it as that table’s target.
- Postfix all adds a postfix to every table name and sets it as that table’s target.
- Name individually allows each subset target to be named independently of the other targets.
- Review the Referenced by and References lists to verify that the table’s relations have been found correctly:
- In the Imported by list, all of the tables that your chosen table is imported by are listed.
- In the Imports list, all of the tables that your chosen table has a relationship with are listed.
On this page, rules can be added to any of the targets. In this example, a rule matcher has been added to all fields ending in the word “NAME”. The rule will be applied to the matching target table. This rule will mask these two fields with asterisks. Care must be taken not to include rules on foreign key fields.
This page summarizes the job. It clearly displays the columns that matched a rule, and the target tables matching the source tables if they do not exist. Note that DDL will be created for any new tables.
Once all of these pages have been completed, you can select Finish. The wizard will then create job scripts and an executable batch file for Windows or Unix that you can launch from Workbench or the command line to extract subsets from the driver table and its related tables.
Here is a flow diagram showing those eight task created for the job in Voracity:
Contact firstname.lastname@example.org if you have any questions.