Secure, Then Splunk – A Format-Preserving Encryption and Pseudonymization…
Introduction: This example demonstrates an older method of using IRI FieldShield to protect sensitive data prior to indexing the data in Splunk. As you will read, FieldShield would process the data outside of Splunk and create a CSV file for Splunk’s ingestion. However, you can also use the Voracity (= FieldShield) add-on for seamless data preparation, indexing, and visualization in Splunk (details here), or Splunk Universal Forwarder to send FieldShield targets and logs into Splunk automatically (details here)!
One concern Splunk users have is that the data they index is stored by Splunk and is thus, to some extent, under its control. Before they upload, however, IRI FieldShield users can protect their data so that Splunk (or anyone hacking into Splunk servers) cannot compromise it. The user’s original data, FieldShield executables, job scripts, and encryption keys all remain inside the user’s firewall, local to their machines and not necessarily even connected to the internet, where they are much harder to find, let alone compromise.
Splunk is a cloud-based analytic platform that indexes and displays user data. Data uploaded to Splunk can be searched and visualized in an internet browser. Because the data you analyze in Splunk is hosted on its servers, you may have concerns about its security.
Splunk does offer some encryption and certificate authentication when sending data back and forth, but not all types of communication are secured by default. Enabling additional protection requires proper configuration by the admin. The table below defines the types of communication security and what is enabled by default.
The default Splunk root certificate uses a private key that is the same for every Splunk user, and can be easily accessed. Possession of a certificate authority’s private key allows attackers to generate certificates signed by the trusted authority, which would defeat attempts to control authentication via public key infrastructure (PKI).
When sensitive data needs to be stored by any third party, you may want to secure your data before it reaches their network. IRI FieldShield, and its parent IRI CoSort data manipulation and management package, apply multiple data masking functions prior to Splunk indexing. Field-level protections like format-preserving encryption (FPE), redaction, and pseudonymization are among 12 categories of data-centric security functions supported by FieldShield in its Eclipse GUI dialogs.
FieldShield functions can be applied to entire rows, or only those columns/fields that need it. FPE maintains the data in the same size and format, complete with separator characters, where numbers and letters are in the same places within the string. This allows the data to be used in Splunk’s charting and analytic capabilities, while keeping it secure. FPE is also reversible, so that after data is used or displayed by Splunk, applicable fields can be decrypted by only those authorized to recover it.
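To make the format-preserving property concrete, here is a minimal Python sketch. It is a toy illustration only: it is not FieldShield’s AES-256 FPE function and not a vetted FPE standard, and it should not be used to protect real data. The function names (toy_fpe_encrypt, toy_fpe_decrypt) and the sample key are hypothetical. It shows the behavior described above: digits stay digits, letters stay letters, separators are untouched, and the same key reverses the result.

```python
import hashlib
import hmac
import string


def _keystream(key: bytes, n: int) -> list:
    """Derive n key-dependent byte values using HMAC-SHA256 in counter mode."""
    out = []
    counter = 0
    while len(out) < n:
        block = hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        out.extend(block)
        counter += 1
    return out[:n]


def _shift(text: str, key: bytes, sign: int) -> str:
    """Shift each digit or letter within its own character class."""
    result = []
    for ch, k in zip(text, _keystream(key, len(text))):
        if ch in string.digits:
            result.append(str((int(ch) + sign * k) % 10))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - 65 + sign * k) % 26 + 65))
        elif ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - 97 + sign * k) % 26 + 97))
        else:
            result.append(ch)  # separators such as "-" pass through unchanged
    return "".join(result)


def toy_fpe_encrypt(text: str, key: bytes) -> str:
    """Hypothetical helper: keyed, reversible, format-preserving transform."""
    return _shift(text, key, +1)


def toy_fpe_decrypt(text: str, key: bytes) -> str:
    """Hypothetical helper: inverse of toy_fpe_encrypt for the same key."""
    return _shift(text, key, -1)
```

Because every output character stays in the same class and position as its input, a protected value still fits the same column width and still parses as the same kind of field, which is why charts and searches keep working on the protected values.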
The example below shows how selected fields can be protected, indexed, and stored securely in Splunk, then extracted and decrypted. Using FieldShield FPE, personal ID and Credit Card data are encrypted in the same format as the original data. People’s names are replaced with realistic, but not real, name data via reversible pseudonymization.
Original Data
Format-Preserved Encrypted (ID and CC_NUMBER) & Pseudonymized (NAME) Data
ID and credit card values have the same length after encryption, each character was retained as either a number or letter, and separator characters are unchanged. Since the name data must adhere to the traditional naming conventions, pseudonymization was used. An external list of names substitutes for the original names. Both sets get combined in a lookup file for pseudonymization, and an opposite restore set for later reversal outside Splunk.
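The lookup and restore sets described above can be pictured as a pair of inverse mappings. The following Python sketch is an illustration under assumptions, not FieldShield’s implementation or set-file format: the function name build_pseudonym_sets, the seed parameter, and the in-memory dictionaries are all hypothetical.

```python
import random


def build_pseudonym_sets(originals, substitutes, seed=0):
    """Pair each distinct original name with a distinct substitute name.

    Returns (pseudo, restore): pseudo maps original -> substitute, and
    restore is the converse mapping used to reverse the substitution.
    """
    unique_originals = sorted(set(originals))
    # Drop duplicate substitutes so the mapping stays one-to-one (reversible).
    pool = list(dict.fromkeys(substitutes))
    if len(pool) < len(unique_originals):
        raise ValueError("need at least as many substitutes as original names")
    random.Random(seed).shuffle(pool)
    pseudo = dict(zip(unique_originals, pool))
    restore = {fake: real for real, fake in pseudo.items()}
    return pseudo, restore
```

Keeping the mapping one-to-one is what makes the pseudonymization reversible: each substitute name points back to exactly one original, so the restore set can undo the replacement after the data leaves Splunk.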
To prepare and secure your data for Splunk indexing, use FieldShield specification files or job scripts to apply the functions you need to the fields in your data. You can write the source and target layouts and protections easily by hand or have the GUI do it automatically through job wizards and point-and-click dialogs.
FieldShield inputs can be database tables, files, or a combination of both. See this list of IRI-supported data sources. Source formats are specified in the input section of a script, while protections and new formats are specified in the output section. Any number of targets and formats are supported.
An AES-256 FPE function applied to an “ID” field might look like this:
Though the output field name is changed to reflect its newly encrypted state, the name of the input field “ID” is required, along with the name of the protection function and formatting attributes. Note that the encryption key, or “passphrase” in this case, is maintained in a separate, securable file (called pass_id here) and is not exposed in the script, though an explicit key or an environment variable are also options.
Pseudonymization in FieldShield uses ‘set’ files populated with values that fit the criteria associated with the original data field. In this case, names_first_last.set, contains first and last names. FieldShield creates a two-column lookup file, name_psuedo.set, that combines the original and substitute names for pre-Splunk pseudonymization, and the converse name_restore.set file for post-Splunk restoration:
These functions are both specified in our FieldShield job’s output file, transactions_safe.csv, in a Splunk-ready layout:
Once the partially encrypted and pseudonymized file is created, you can then index it into Splunk. Splunk will recognize the CSV format and automatically append a timestamp to each entry:
Once in Splunk, you can search through the data and customize analytic functions and report displays. Meanwhile, the data you protected in FieldShield will stay safe without hindering the layout or appearance of the data. Other users may not even know that the personalizing data was encrypted or pseudonymized.
After filtering the data, use the pipe “|” character to start a function. Splunk’s Chart function allows you to create and customize graphics to your liking. You can then add those graphics and statistics to a dashboard for fast and easy analysis.
To recover the original data values, use Splunk to search for your files or indexed fields, and export them:
A form will appear where you can designate the filename and format:
The exported file will be sent to your Downloads folder with the fields ordered alphabetically. Use a FieldShield script like this to restore the field order, decrypt the ID and CC values using the corresponding algorithm and original key (passphrase/file/EV), and reverse the pseudonyms via the restore set:
Running this job from the IRI Workbench GUI or command line will restore your data and its field layout to its original state:
Note that multiple target files or tables could have been specified in this same job as well, with recovery scripts and keys provided only to the authorized recipients of particular field values. You can also modify the sort order with CoSort’s SortCL program, if you use it instead, since SortCL is the parent engine that runs FieldShield jobs while extending their capabilities.
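For readers who want to inspect a Splunk export outside FieldShield, the field-reordering part of the recovery step can be sketched in Python. This is only an illustration of putting alphabetically ordered columns back into a chosen order; it does not decrypt or reverse pseudonyms, and the function name reorder_columns and the column order shown in the usage note are assumptions.

```python
import csv


def reorder_columns(src, dst, field_order):
    """Copy CSV rows from src to dst, writing columns in field_order.

    Columns present in the export but absent from field_order
    (for example, fields added at indexing time) are dropped.
    """
    reader = csv.DictReader(src)
    available = reader.fieldnames or []
    missing = [name for name in field_order if name not in available]
    if missing:
        raise KeyError("export is missing expected fields: %s" % missing)
    writer = csv.DictWriter(dst, fieldnames=field_order, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
```

A call might look like reorder_columns(open("export.csv", newline=""), open("reordered.csv", "w", newline=""), ["ID", "NAME", "CC_NUMBER"]), where both file names and the field order are placeholders for your own layout.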
For more information on the interaction between IRI Workbench software like FieldShield and Splunk, or other applications, email fieldshield@iri.com, or submit your comment below.
3 COMMENTS
[…] Because IRI’s data protection functions operate at the field level, they are more secure; if (unlike with Splunk) an encryption key is revealed, other fields protected with different keys or algorithms remain safe. See more details in our blog here. […]