How to Find and Mask PII in Elasticsearch
Editor’s Note: A newer version of this IRI DarkShield wizard (as of 2024) is described in this article.
Elasticsearch is a Java-based search engine that has an HTTP interface and stores its data in schema-free JSON documents. Unfortunately, a spate of costly and painful breaches of Personally Identifiable Information (PII) continues to plague online Elasticsearch databases.
If all the PII or other sensitive information in these DBs were masked, however, successful hacks and exposed development copies would be far less damaging. The purpose of IRI DarkShield is to lock down that information in production or test environments using privacy-law-compliant anonymization functions.
The Elasticsearch search and mask wizard in the IRI Workbench graphical IDE for IRI DarkShield uses the same tooling as the MongoDB and Cassandra connectors described in this article. This wizard can be used to classify, locate, and de-identify or delete PII and other sensitive information held on Elasticsearch collections, and to produce search and audit results.
Set Up
If you do not have an Elasticsearch cluster to connect to, you can easily create a local cluster by downloading Elasticsearch from here and following the instruction guide.
For my demonstration of this wizard, I am using a single index called customers on a locally hosted cluster. This index stores the basic customer information normally found in an account, which makes it a rich target for malfeasance: email, name, and phone number.
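To make the shape of such an index concrete, here is a minimal sketch that builds a request body for Elasticsearch's `_bulk` endpoint. The two sample customers and the helper name `build_bulk_body` are invented for illustration; only the index name and the three field names come from the demonstration above.

```python
import json

# Hypothetical sample records: the field names follow the article
# (name, email, phone), but the values are invented for illustration.
SAMPLE_CUSTOMERS = [
    {"name": "Jane Doe", "email": "jane.doe@example.com", "phone": "555-123-4567"},
    {"name": "John Roe", "email": "john.roe@example.com", "phone": "555-987-6543"},
]


def build_bulk_body(index, docs):
    """Return a newline-delimited JSON body for the _bulk endpoint.

    Each document is preceded by an action line naming the target index
    and document ID, and the body must end with a newline.
    """
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_body("customers", SAMPLE_CUSTOMERS), end="")
```

The resulting text can be sent with a POST to http://localhost:9200/_bulk using the `Content-Type: application/x-ndjson` header, assuming an unsecured local cluster like the one used here.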
Search
As with the other data sources DarkShield supports, you must create a .dsc job specification file to define your scanning and masking criteria. As you would with MongoDB or Cassandra, select New NoSQL Search/Masking Job … from the DarkShield menu on the IRI Workbench toolbar. Select a project folder and enter a name for the job.
On the next page, create a source URI:
Here is where you enter the parameters for your Elasticsearch cluster. If the host and port fields are left blank, they default to localhost and 9200.
If the cluster to which you are connecting requires a username and password, enter them in the authentication section of this page. For this example, I am using host: localhost, port: 9200, and cluster: Elasticsearch.
To keep this demonstration simple, the local cluster has not been configured with security. Any real cluster should have login and permissions enabled.
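The same connection details can be expressed outside the wizard. The sketch below shows one way to apply the localhost and 9200 defaults and to build an HTTP Basic authentication header for a secured cluster. The function names are hypothetical and are not part of DarkShield or any Elasticsearch client.

```python
import base64

DEFAULT_HOST = "localhost"
DEFAULT_PORT = 9200


def cluster_url(host="", port=None, scheme="http"):
    """Build a base URL, falling back to localhost:9200 for blank values."""
    host = host.strip() or DEFAULT_HOST
    port = DEFAULT_PORT if port in (None, "") else int(port)
    return f"{scheme}://{host}:{port}"


def basic_auth_header(username, password):
    """Return an HTTP Basic Authorization header for a secured cluster."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    print(cluster_url())                    # blank fields use the defaults
    print(cluster_url("es.internal", 9201))  # hypothetical remote host
```

A secured cluster would normally also be reached over https, which is why the scheme is a parameter rather than a constant.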
Click OK to finish and you will be returned to the previous page. Type in the index you would like to search. In this example, I am using the index named customers.
Next, you will need to set a target URI for the masked results. Keep in mind that masked Elasticsearch results can only be sent to Elasticsearch targets. In this case, I will be using the same Customers URI created before, but with a different index. This creates a new index to hold the masked results produced later in this demonstration.
Next, you will be asked to create a Search Matcher, which is responsible for associating a Data Class with a corresponding Data (Masking) Rule. This is a necessary step as no masking can be applied without it.
As explained in the Data Classification article, Data Classes centrally catalog and define global criteria for finding and masking PII in structured, semi-structured or unstructured sources for both FieldShield and DarkShield. IRI Workbench ships with several predefined data classes (e.g., names, email and IP addresses, credit card numbers), found in Window > Preferences > IRI > Data Classes and Groups. You can edit those and add your own.
Click Browse or Create on the Data Class line. Browse will let you select your own data classes, or one of several predefined classes or groups, including email, phone number, and names. In this case the NAMES data class group includes a first names data class.
Here I selected the EMAIL data class that will look for emails within my Elasticsearch index:
Now a masking rule must be applied to the data class that has been chosen. Click the Create button to make a new data rule, or Browse to use any you may have defined already.
For emails, I chose a redaction function:
More than one data class can be masked simultaneously, of course. I added two more classes, assigning a format-preserving encryption function to phone numbers and a random pseudonym (set file lookup) to people’s names.
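Conceptually, each search matcher pairs a way of finding values with a way of masking them. The sketch below imitates that pairing in plain Python and is not DarkShield's implementation. The regular expressions, name list, and pseudonym list are invented, the pseudonym choice is hash-based rather than random so the output is repeatable, and the phone rule is a keyed digit substitution that only keeps the layout. It is not true format-preserving encryption, is not reversible, and is not secure.

```python
import hashlib
import re

# Illustrative patterns only; real data classes are more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")

# Stand-ins for a set file of first names and a pseudonym list (invented).
FIRST_NAMES = ["Jane", "John"]
PSEUDONYMS = ["Avery", "Blake", "Casey"]
NAME_RE = re.compile(r"\b(" + "|".join(map(re.escape, FIRST_NAMES)) + r")\b")


def redact(match):
    """Replace every character of the match with an asterisk."""
    return "*" * len(match.group(0))


def scramble_digits(match, key="demo-key"):
    """Swap each digit for a hash-derived digit, keeping punctuation.

    Stand-in for format-preserving encryption: same layout, but not
    reversible and not cryptographically sound.
    """
    text = match.group(0)
    hex_chars = iter(hashlib.sha256((key + text).encode("utf-8")).hexdigest())
    return "".join(
        str(int(next(hex_chars), 16) % 10) if ch.isdigit() else ch for ch in text
    )


def pseudonymize(match):
    """Pick a replacement name from the list, chosen by hash of the original."""
    digest = hashlib.sha256(match.group(0).encode("utf-8")).digest()
    return PSEUDONYMS[digest[0] % len(PSEUDONYMS)]


# Each entry pairs a "data class" (pattern) with a "rule" (function).
MATCHERS = [(EMAIL_RE, redact), (PHONE_RE, scramble_digits), (NAME_RE, pseudonymize)]


def mask_text(text):
    """Apply every matcher in order and return the masked text."""
    for pattern, rule in MATCHERS:
        text = pattern.sub(rule, text)
    return text


if __name__ == "__main__":
    print(mask_text("Jane, jane.doe@example.com, 555-123-4567"))
```

Emails are handled first so that a name embedded in an address is redacted with the rest of the address rather than pseudonymized separately.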
If any search filters are needed, they can be added on the prior page. Filters can be used to find particular results, or to isolate specific fields in CSV, XML, JSON or RDBs to be masked, precluding the need for scanning row contents. I did not specify any in this case, however.
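The difference between scanning contents and targeting fields can be illustrated with a short sketch. It masks values purely by field name, without inspecting what they contain. The function `mask_fields` and its rule dictionary are hypothetical and do not reflect DarkShield's actual filter syntax.

```python
def mask_fields(doc, field_rules):
    """Return a copy of doc with the named top-level string fields masked.

    field_rules maps a field name to a function that takes the original
    string and returns its masked form. Other fields pass through as-is.
    """
    masked = dict(doc)
    for field, rule in field_rules.items():
        if isinstance(masked.get(field), str):
            masked[field] = rule(masked[field])
    return masked


if __name__ == "__main__":
    customer = {"name": "Jane Doe", "email": "jane.doe@example.com", "tier": "gold"}
    rules = {"email": lambda value: "*" * len(value)}
    print(mask_fields(customer, rules))
```

Because only the listed fields are touched, this approach is fast, but it will miss PII that appears in unexpected places such as free-text notes.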
Click Finish when done. This completes the wizard and creates the .dsc file, which holds the DarkShield configuration details for executing the search and/or masking job(s).
Note: If you are using the default locally hosted cluster like the one in this example, make sure that the cluster is on, because any search or masking jobs will fail otherwise. You can check if the server is running by opening a web browser and typing “http://localhost:9200/” into the address bar.
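The browser check can also be scripted. The sketch below requests the cluster's root URL and treats a JSON reply containing a `version` entry as a sign that Elasticsearch is answering. The function name and timeout are assumptions, and a cluster with security enabled would reject this unauthenticated request, so it is reported as down here.

```python
import http.client
import json
import urllib.error
import urllib.request


def cluster_is_up(base_url="http://localhost:9200/", timeout=2.0):
    """Return True if the root URL answers with Elasticsearch-style JSON."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON reply.
        return False
    return isinstance(info, dict) and "version" in info


if __name__ == "__main__":
    print("cluster reachable:", cluster_is_up())
```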
Searching and Masking
DarkShield supports searching and masking as separate or simultaneous operations. In this case, I want to search first and see what I’ve found before I mask it. That’s because (larger) masking jobs can take time, and I may want to hone my search methods and re-verify them.
To do this, right-click the DarkShield job configuration (.dsc) file and run it as a search job. This creates an annotations.json file that you can run later as a masking job to remediate the collection with the masking function(s) specified. Alternatively, you can search and mask at the same time.
The previously searched results will be masked in the target location. To verify this, you can perform a search again and see that the data has now been “DarkShield’ed” as specified; i.e., emails redacted, first names pseudonymized, and phone numbers masked with format-preserving encryption:
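A quick independent check is to query the target index and confirm that nothing resembling an email address remains. The sketch below works on the parsed JSON returned by Elasticsearch's `_search` endpoint, where documents appear under `hits.hits[]._source`. The email pattern, the function name, and the sample response are invented for illustration.

```python
import re

# Illustrative pattern; a production check would use a stricter definition.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def ids_with_unmasked_email(search_response, field="email"):
    """Return the IDs of hits whose field still looks like an email address."""
    leftovers = []
    for hit in search_response.get("hits", {}).get("hits", []):
        value = str(hit.get("_source", {}).get(field, ""))
        if EMAIL_RE.search(value):
            leftovers.append(hit["_id"])
    return leftovers


if __name__ == "__main__":
    # Hypothetical response from the masked target index.
    response = {
        "hits": {
            "hits": [
                {"_id": "1", "_source": {"email": "********************"}},
                {"_id": "2", "_source": {"email": "john.roe@example.com"}},
            ]
        }
    }
    print(ids_with_unmasked_email(response))
```

An empty list means the sampled documents show no email-like values in that field. It does not prove that other fields or unsampled documents are clean.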
If you need help protecting your Elasticsearch collections by masking their data at rest via this DarkShield interface in IRI Workbench or its CLI, or any semi-/unstructured data in flight via DarkShield’s REST API, please email darkshield@iri.com.