Real-time Database Data Replication
Abstract: The previous article in this series of 4 detailed how to install, set up, and configure IRI Ripcurrent, an IRI-developed command-line Java application that utilizes the Debezium embedded engine and the streaming feature of the IRI (CoSort) SortCL program to react in real-time to database change events by replicating data to downstream target(s), optionally with transformation (e.g., PII masking) rules consistently applied based on classification of the data.
One capability of Ripcurrent is real-time data replication. Ripcurrent can monitor and react to the insertion, deletion, or update of any rows in the source database by replicating that data to target silo(s). Real-time replication of data through Ripcurrent will be demonstrated in this article through example.
For this example, I am using a MySQL database as the source database being monitored; Debezium supports the monitoring of several other different types of databases, including PostgreSQL, Oracle, SQL Server, and DB2. The MySQL database must first be set up properly for Debezium to be able to monitor it.
See Debezium documentation for detailed instructions on how to set up a database for monitoring of changes. The process varies depending on the database vendor.
In the case of MySQL, this involves enabling binary logging for MySQL replication. The binary logs record transaction updates that allow replication tools to propagate changes.
I have already installed the Ripcurrent feature for IRI Workbench and set up a configuration file as specified in the previous article. I have also set up tables in the target schema with the same structure as the tables I am monitoring in the source database.
Ripcurrent is suited for many different use cases where data is being added and updated to a database frequently (e.g. orders/transactions, sensor data from IoT devices like weather conditions, location data from transportation vehicles, patient health data).
In this case, I am demonstrating an example of data replication with financial transactions. The source table is being updated with new transactions as they occur. Ripcurrent can replicate these new transactions to file and table targets as they happen in near real time.
Of course, there is always some latency in copying data from source to targets, but this is usually measured in sub-seconds unless a large amount of data has been modified or added to the table at once.
The image below shows a sample of data in this table.
Generating a Properties File
To begin with, a properties file should be created to define configurations required for Ripcurrent. After installing Ripcurrent in IRI Workbench, the Ripcurrent Properties Generator wizard is available in the IRI Voracity (orca icon) menu to assist in this task.
On the first page of the wizard, I selected a MySQL database as the source to monitor:
On the second page, I selected the option to replicate changes to both files in a directory and database tables in a specified schema. I chose a SQL Server database as my target and delimited flat files as a secondary target. I entered \t as the target separator to make the flat file output tab-separated.
In the case of direct data replication without any transformations, there is no need to select anything on the third page, which pertains to selecting IRI libraries to use with Ripcurrent that defines how to classify and transform data.
On the final page, I entered a name for the connector and server, and a path for database structure change events to be logged to. For the most part, the connector name and server name can be anything as long as they are unique. They may show in Ripcurrent logs, and also are used for identification.
There are a few restrictions on the format of the server name; the Ripcurrent documentation discusses Ripcurrent configuration properties in detail. The server ID is not required; I did not specify a value in this case.
After successful completion of the wizard, a properties file will be generated and opened in the editor:
Demonstrating Real-Time Data Replication with Ripcurrent
Ripcurrent can be started by running the ripcurrent.bat script in the bin folder of the Ripcurrent distribution, since I am using a Windows operating system. This script can be triggered through a terminal or through an IRI Workbench external tools run configuration (see below).
Ripcurrent runs in the background, monitoring events from Debezium until explicitly terminated.
Snapshot of running Ripcurrent from the External Tools Configuration in IRI Workbench.
Logging output has been configured to display on the console.
Each event received from Debezium when a database table has data added or modified includes information about the new data values, the data types, and the names for each column.
The data received is used as the input to a dynamically generated SortCL job based on the information from the Debezium event. New jobs are created when there is an event from a table not seen before, or there is an event from a table seen before but there is no existing job that has the same layout of columns from the event and the same type of event (e.g. update, insert, delete).
With Ripcurrent running, I edited values in the source table through the Data Source Explorer in IRI Workbench to trigger change events.
The new data in the source table is mirrored to a corresponding table in the target SQL Server schema, as well as a file in a specified directory. Ripcurrent will replicate data to additional tables or files for each event that is received from a unique table in the source database.
First, I inserted one row (shown highlighted) into the source transactions table in MySQL.
This was the first event received from the table, so a new replication job was started.
As a result of the replication job, the data added to the source table was duplicated to a tab-delimited file in the target directory.
Additionally, the data was replicated to the target table.
Next, I modified the account number of the record previously inserted into the source table. Note that updates and deletes of data from the source database can only be reflected in database targets, not file targets.
This change is also captured by Ripcurrent and reflected in the target table.
Finally, I added several new transaction records to the source table (3051-3053):
Ripcurrent captures these additional changes and duplicates them to the target table.
This article demonstrated one use case for Ripcurrent – the replication of changed database data directly into target files and/or tables. The next (third) article in this series will augment this demonstration with the use of data classification and masking rules to categorize and mask data on a consistent basis before it is replicated to your targets.