Real-Time, Incremental Data Masking
Abstract: The previous article in this four-part series demonstrated real-time data replication with IRI Ripcurrent, an IRI-developed command-line Java application that uses the Debezium embedded engine and the streaming feature of the IRI (CoSort) SortCL program. Together they react in real time to database change events.
The capability demonstrated in the previous article can be augmented with the classification and masking of sensitive data in transit to file or database targets. When a row in the source table is inserted or updated, it is masked and moved in real time to the designated target table. When rows in the source table are deleted, the same rows are deleted in the (different) target table.
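The insert, update, and delete behavior described above can be sketched in a few lines. This is only an illustration, not Ripcurrent's implementation: the single-letter operation codes follow Debezium's change event convention, while the in-memory `target` dict, the `EMPNO` key column, and the `mask_email` helper are hypothetical names chosen for the example.

```python
def apply_change(target, event, mask_row, key="EMPNO"):
    """Apply one change event to `target`, a dict of rows keyed by primary key.

    `key` is a hypothetical primary-key column that is assumed not to be masked.
    """
    op = event["op"]
    if op in ("c", "r", "u"):           # create, snapshot read, update
        row = mask_row(event["after"])  # mask before anything reaches the target
        target[row[key]] = row
    elif op == "d":                     # delete: remove the matching target row
        target.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported operation code: {op!r}")


def mask_email(row):
    """Placeholder masking step: overwrite the EMAIL column, keep the key."""
    masked = dict(row)
    masked["EMAIL"] = "*" * len(masked["EMAIL"])
    return masked


target = {}
apply_change(target, {"op": "c", "after": {"EMPNO": 1, "EMAIL": "a@b.com"}}, mask_email)
apply_change(target, {"op": "u", "after": {"EMPNO": 1, "EMAIL": "ann@b.com"}}, mask_email)
rows_after_update = dict(target)  # one masked row for EMPNO 1
apply_change(target, {"op": "d", "before": {"EMPNO": 1, "EMAIL": "ann@b.com"}}, mask_email)
```

After the last call `target` is empty again, mirroring how a delete in the source removes the corresponding row in the target.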
Assuming the prerequisites detailed in the first article have been met, this use case also requires an IRI data class library and an IRI rules library. Both libraries can be generated by running the Schema Data Class Search wizard in IRI Workbench.
Ripcurrent requires a default rule to be assigned to each data class. Ripcurrent uses the default rule to apply the chosen masking function to any data or column name that matches the search parameters specified for that data class.
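One way to picture the relationship between data classes and default rules is a lookup table from class name to masking function. The sketch below is an assumption-laden illustration rather than the format of the IRI libraries: `check_default_rules`, `mask_row`, and the placeholder rules are hypothetical, and the rules shown are trivial stand-ins for real masking functions.

```python
def check_default_rules(data_classes, default_rules):
    """Fail early if any data class lacks a default rule."""
    missing = sorted(set(data_classes) - set(default_rules))
    if missing:
        raise ValueError("no default rule for data class(es): " + ", ".join(missing))


def mask_row(row, column_classes, default_rules):
    """Return a copy of `row` with each classified column passed through its rule."""
    masked = dict(row)
    for column, data_class in column_classes.items():
        if masked.get(column) is not None:
            masked[column] = default_rules[data_class](str(masked[column]))
    return masked


# Placeholder rules (assumptions): real rules would encrypt or redact properly.
default_rules = {
    "EMAIL": lambda value: "<masked-email>",
    "PHONE_US": lambda value: "*" * len(value),
}
column_classes = {"EMAIL": "EMAIL", "PHONE": "PHONE_US"}  # column name -> data class

check_default_rules(column_classes.values(), default_rules)
masked = mask_row(
    {"NAME": "Ann", "EMAIL": "ann@example.com", "PHONE": "555-867-5309"},
    column_classes,
    default_rules,
)
```

Columns without a data class, such as `NAME` here, pass through unchanged, and a class with no default rule is reported before any data is processed.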
Schema Data Class Search
To create a data class library and a rules library, I will run the Schema Data Class Search wizard. It can be opened by right-clicking the name of the schema in the Data Source Explorer and selecting IRI > Schema Data Class Search…
This screenshot shows the first page of the wizard.
The name of the folder for the generated artifacts, along with several search preferences, can be specified on this page. I kept the default options.
On the second page of the wizard, select schemas to be scanned. I selected a single schema named SCOTT in my Oracle database.
After clicking Next >, the Data Type Selection page appears. I kept the default selections for filtering data types in the search:
After clicking Next >, the Exclude Items page appears. I did not specify any columns, schemas, or tables to exclude from the scan:
After clicking Next > from this page, a page to select data classes appears. I selected EMAIL and PHONE_US as the data classes to search for from my list of classes:
I assigned a FieldEncryptionRule for AES256 format-preserving encryption as the default rule for the data class EMAIL.
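A property that matters later in this article is that the rule is deterministic: the same input always yields the same masked output, whichever target receives it. The sketch below illustrates only that property. It is not format-preserving encryption and not what FieldEncryptionRule does; it swaps in keyed hashing (HMAC-SHA256) as a stand-in, is not reversible, and the function name and key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_email(email, key):
    """Deterministically replace the local part of an email, keeping the domain.

    Stand-in for illustration only: keyed hashing, not format-preserving encryption.
    """
    local, _, domain = email.partition("@")
    digest = hmac.new(key, local.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep the replacement the same length as the original local part
    # (up to the 64 hex characters a SHA-256 digest provides).
    return f"{digest[:len(local)]}@{domain}"


key = b"demo-key-not-for-production"  # hypothetical key for the example
for_excel = pseudonymize_email("alice@example.com", key)
for_mysql = pseudonymize_email("alice@example.com", key)
```

Because the output depends only on the input and the key, `for_excel` and `for_mysql` are identical, which is the consistency the rest of the walkthrough relies on.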
I also created a FieldRedactionRule to fully redact/mask PHONE_US values with asterisks:
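Full redaction is simple enough to sketch directly. The function below is an illustration, not the FieldRedactionRule itself; it assumes that the redacted value keeps the length of the original, which the article does not state.

```python
def redact(value, mask_char="*"):
    """Replace every character of `value` with `mask_char` (length preserved)."""
    return mask_char * len(value)


redacted_phone = redact("555-867-5309")  # twelve asterisks
```

Unlike encryption, redaction of this kind is one-way: nothing about the original digits can be recovered from the masked value.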
Generating a Ripcurrent Properties File
Before running Ripcurrent, a properties file should be created to define the configuration options that Ripcurrent requires in order to run.
After installing the Ripcurrent feature in IRI Workbench, the Ripcurrent Properties Generator wizard is available from the Voracity menu to assist in generating a properties file for Ripcurrent.
On the first page of the wizard, options can be specified pertaining to the database to be monitored. In this image, an Oracle connection was selected and a filter was specified to narrow down the tables to be monitored to only those in the SCOTT schema.
The DB name, which is required when configuring the monitoring of an Oracle database, was also entered; it is XE in this case.
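The property names that the wizard writes are not shown in this article, so the following is only a sketch of how a schema-level table filter can work. It assumes a comma-separated list of regular expressions matched against whole `SCHEMA.TABLE` identifiers, which is how Debezium's table include lists behave; `is_monitored` is a hypothetical helper.

```python
import re


def is_monitored(table_id, include_list):
    """Return True if `table_id` fully matches any pattern in `include_list`.

    Splitting on commas is a simplification: it would break patterns that
    themselves contain commas, such as quantifiers like {1,3}.
    """
    patterns = [p.strip() for p in include_list.split(",") if p.strip()]
    return any(re.fullmatch(p, table_id) for p in patterns)


scott_only = r"SCOTT\..*"  # every table in the SCOTT schema
monitored = is_monitored("SCOTT.DM_EMPLOYEES", scott_only)
ignored = is_monitored("HR.EMPLOYEES", scott_only)
```

Matching the whole identifier, rather than searching for a substring, keeps a schema such as `SCOTT2` from being picked up by a filter meant for `SCOTT`.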
On the second page of the wizard, there are options to replicate to database tables, files, or both files and database tables. The image shows the choice of replicating data changes from the source Oracle database to a MySQL database with existing tables of the same name and structure as the source database, as well as to Excel files.
On the third page, locations of an IRI data class library and rule library can be specified. The image shows the selection of the library files created earlier in the article:
On the final page, the image shows the selection of a log location for structural change event information from the source database. Names for the connector and server connection used by Debezium are also specified.
After the properties file was generated for my configuration, I started Ripcurrent by running the ripcurrent.bat script. On its first run, Ripcurrent takes an initial snapshot of the source database; once that completes, it monitors the source database for changes.
Adding Test Data to Source Tables with IRI RowGen
Next, to simulate new data coming into the source tables, I will run a batch file generated by the New Database Test Data wizard, which is accessible from the IRI RowGen menu. The wizard synthesizes data consistently by classifying database columns and pairing them with generation rules.
The image below shows an entity-relationship mapping diagram of the test data batch operation, produced by IRI Workbench.
Replication of Data from Source Tables to Multiple Targets, with Consistent Masking Rules Applied to Classified Data
After I ran the batch file to generate data for 5 tables in the source database, Ripcurrent detected the newly inserted data and triggered a replication and masking operation to the targets.
This is a sample of the data that was generated in one of the source Oracle database tables (named DM_EMPLOYEES):
Ripcurrent was silently triggered by this insertion event, and automatically masked and moved the new data into its defined targets, per this event log:
A series of Excel files was also created in the target directory:
In the Excel file associated with the DM_EMPLOYEES table in the source database, notice that the email values differ from the original values.
Additionally, in the target MySQL database, data was replicated to a table with the same name and structure as the source table, with the email address column encrypted to the same values as in the Excel file.
This column was classified as containing email addresses based on the regular expression pattern to match emails defined in the data class library. The original email values were encrypted with format-preserving encryption, defined by the default rule that was mapped to the EMAIL data class.
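Pattern-based classification of a column can be sketched as follows. The regular expression is a simplified stand-in, not the pattern stored in the IRI data class library, and the 80% match threshold and the `classify_column` helper are assumptions made for the example.

```python
import re

# Simplified email pattern (assumption): real library patterns may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def classify_column(values, patterns=PATTERNS, threshold=0.8):
    """Return the first data class whose pattern matches enough sampled values."""
    sample = [str(v) for v in values if v is not None]
    if not sample:
        return None
    for name, pattern in patterns.items():
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        if hits / len(sample) >= threshold:
            return name
    return None


email_class = classify_column(["ann@example.com", "bob@example.org", None])
name_class = classify_column(["Ann", "Bob", "Cy"])
```

Here `email_class` is `"EMAIL"` and `name_class` is `None`, so only the first column would be handed to the default rule mapped to the EMAIL data class.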
This article demonstrated another Ripcurrent use case that builds on the basic data replication shown in the previous article: consistent classification, masking, and replication of data from source tables to targets.
This approach ensures that data can be dynamically replicated to targets, with data classified as sensitive appropriately and consistently masked according to data classes and rules that can be flexibly defined.
The next and final article in this series deals with real-time notifications of structural changes to the database that can affect the layout of the source or target tables. Such changes would indicate the need to re-run the data class search on your tables.