Once a database exceeds a certain size, it becomes expensive — and risky from a security perspective — to provide full-size copies for development, testing, and training. Most teams need smaller copies of the larger database, and often have the PII within masked.
Database subsetting is that process of creating a smaller, referentially-correct copy of a larger database from real table extracts. Subsets can be used with, or in lieu of, masking data or synthesizing test data to reduce the costs and risks associated with full sets. The process of creating meaningful subsets manually is complex and laborious, considering you would have to populate smaller databases with random samples from each production table, and make sure that any relational structure between the tables was still correct in the subset.
A new end-to-end database subsetting job wizard in the IRI Workbench GUI now makes this process faster and easier. The wizard — and IRI support for it — is available to licensed users of the IRI Voracity data management platform, IRI RowGen for test data generation, and IRI FieldShield for data masking. This article introduces the wizard and shows how it creates referentially-correct subsets, with options for column masking and field-preserving encryption.
The wizard for subset creation has the user select the source of the subset, the size and sorting of it, the names for the subset targets, and any encryption or masking the subsetted data should get. The wizard then creates a series of job scripts at runtime that create either subset tables or flat files.
This page defines the table that will “drive” the created subset. Think of the driving table as the main table from where you want the subset to originate. So for example, if you want to subset a table of customers and all tables connected to it, you would select the table with customer info here.
The Table Selection page has two notable sections:
- The Database selection combo box
- The Table selection combo box
Sizing and Sorting
This page is where you specify the size and sort order of the subset. For example, to get a subset of the 100 highest purchase levels, you would sort on the purchase level field (as in this example). If you want the rows selected randomly, however, leave the Key Fields list blank.
The Sort page has 4 notable sections:
- Amount of Records in Subset
- The Input Fields list
- The Key Fields list
- Key Options
- In the Number of Rows box, enter the size of the subset you want to create.
- If you want a subset of sorted data, select the column to sort on and click “Add Key.” Otherwise, leave the Key Fields list blank.
- A few options are available in the Key Options box:
- Select the Stable check box to duplicate records to load in the order they are in the production table.
- If you want no duplicates or only duplicates in your subset, select the Duplicates check box and the appropriate radio button.
This page has two functions. The first allows you to view the relations of your driver table. The second is to name the output targets for the subset.
The Target Naming page has 6 notable sections:
- The Output options radio buttons
- The Naming options radio buttons
- The Name entry text box
- The referenced by list
- The Table selection combo box
- The references list
1. Select the table from the Table name menu.
2. Review the Imported by and Imports lists to verify that the tables relations have been found correctly:
- In the Imported by list, all of the tables that your chosen table is imported by are listed.
- In the Imports list, all of the tables that your chosen table has a relationship with are listed.
3. Specify the target using the Output Option radio buttons: Output to flat file with the following separator – enter a separator if different than a tab (\t).
- Output to table
- Select the check box to generate an SQL script to create the new subset tables.
- In the Schema menu, select a different schema.
- Select the naming option to name the subset targets:
- Different schema with same table names (if you selected a different schema).
- Prefix all adds a prefix to every table name and sets it as that tables subset target.
- Postfix all adds a postfix to every table name and sets it as that tables subset target.
- Name individually allows each subset target to be named independently of the other subset targets.
- Enter appropriate text in the Name entry text box:
- If Different schema is select, enter a name for the subset targets.
- If Prefix all is selected, enter the prefix to add to all subset target names.
- If Postfix all is selected, enter the postfix to add to all subset target names.
- If Name individually is selected, enter the subset target name for the currently selected table.
- Click the set button.
Subset Target Review
This page allows you to review the output where the subset will be targeted, and to apply data masking and filtering functions.
The Subset Targets page has 4 notable sections:
- The Output Target list
- The “Target Field Layout…” button
- The “Named Condition…” button
- The “Record Filter…” button
- To apply a mask, select the Target from the list and click the “Target Column Layout…” button
- To create a condition, select the Target from the list and click the “Named Condition…” button
- To apply a filter, select the Target from the list and click the “Record Filter…” button
Once all of these pages have been completed, you can select Finish. The wizard will then create job scripts to extract subsets from the driver and its related tables.