
Introduction to IRI DarkShield
What is DarkShield?
IRI DarkShield is a data masking tool for finding and de-identifying Personally Identifiable Information (PII) and other sensitive data in semi-structured and unstructured files and databases. As one of the three data masking products in the IRI Data Protector Suite, DarkShield was created to bridge the gap between structured and unstructured data masking, allowing users to secure data in a consistent manner across disparate silos and formats.
How Does DarkShield Work?
At its core, DarkShield treats all data as documents regardless of their physical location. With some minimal configuration for accessing different data silos, DarkShield can then discover the content types of those documents using a variety of heuristics, and select the appropriate data parsers to use.
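To make the idea of content-type discovery concrete, here is a minimal Python sketch of one common heuristic: inspecting the leading "magic" bytes of a document and then choosing a parser. This is only an illustration of the general technique, not DarkShield's actual implementation; the signature table, the parser names, and the functions `detect_content_type` and `select_parser` are all hypothetical.

```python
# Hypothetical magic-byte signatures; a real detector would combine
# many more heuristics (file extensions, metadata, deeper inspection).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # also the container for .docx/.xlsx
]

# Hypothetical parser names keyed by detected content type.
PARSERS = {
    "application/pdf": "pdf_parser",
    "image/png": "ocr_parser",
    "image/jpeg": "ocr_parser",
    "application/zip": "archive_parser",
    "application/json": "json_parser",
    "application/xml": "xml_parser",
    "text/plain": "text_parser",
}


def detect_content_type(data: bytes) -> str:
    """Guess a content type from the first bytes of a document."""
    for magic, content_type in SIGNATURES:
        if data.startswith(magic):
            return content_type
    # Fall back to simple text heuristics when no binary signature matches.
    stripped = data.lstrip()
    if stripped.startswith((b"{", b"[")):
        return "application/json"
    if stripped.startswith(b"<"):
        return "application/xml"
    return "text/plain"


def select_parser(data: bytes) -> str:
    """Return the name of the parser to use for this document."""
    return PARSERS[detect_content_type(data)]
```

Because detection works on the bytes themselves, the same logic applies whether the document came from a file system, a database column, or a cloud bucket.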
This flexibility allows DarkShield to deal with a wide range of file types with little to no configuration. It also allows for nested handling of documents, such as masking PDFs embedded in a BLOB column of a database, or using Optical Character Recognition (OCR) to find and redact PII in images embedded within Word documents.
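Nested handling can be pictured as a recursive walk over a tree of documents, where each embedded document is visited in the context of its parent. The Python sketch below is a conceptual model only; the `Document` class, its fields, and the `walk` function are hypothetical and do not reflect DarkShield internals.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Document:
    """Hypothetical model: a document that may embed other documents."""
    name: str
    content_type: str
    embedded: List["Document"] = field(default_factory=list)


def walk(doc: Document, path: str = "") -> Iterator[Tuple[str, Document]]:
    """Yield every document in the tree together with its nesting path."""
    here = f"{path}/{doc.name}" if path else doc.name
    yield here, doc
    for child in doc.embedded:
        yield from walk(child, here)


# Example: an image inside a PDF stored in a BLOB column of a database row.
scan = Document("scan.png", "image/png")
pdf = Document("contract.pdf", "application/pdf", [scan])
row = Document("customers.row42", "db/row", [pdf])

for nested_path, document in walk(row):
    print(nested_path, document.content_type)
```

Each visited document could then be handed to whichever parser suits its own content type, independent of where it was found.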
In many cases, DarkShield does not spell out every way its features can be combined to access data. This leaves room for a more flexible and source-agnostic approach to finding and masking sensitive data. For more information regarding supported file formats and data silos, please see the DarkShield technical details web page.
DarkShield tries to share as much metadata as possible for its search and masking operations with its sister product, IRI FieldShield, to allow for a holistic view of data across your entire domain of structured, semi-structured and unstructured sources. This includes data classes and masking rules, which are discussed in detail later in this article.
The rest of this article covers these shared features along with the core concepts behind DarkShield, without going into implementation details. This knowledge applies to all subsequent articles, which cover specific use cases and some of the more advanced DarkShield features and their implementation details.
Installation
DarkShield can be licensed as a standalone product, or packaged along with IRI FieldShield and IRI CellShield in the IRI Voracity data management platform. Licensing and pricing options are on this page.
The core DarkShield product uses the IRI Workbench IDE, built on Eclipse, as its graphical front-end for data classification and search/mask job design.
DarkShield also requires a licensed version of the IRI CoSort engine, SortCL, to perform masking operations. To install and activate these components, refer to the instructions provided by email and the IRI installation guide.
DarkShield is packaged as a separate feature of Workbench to reduce the size of the initial installation for FieldShield-only users. To install DarkShield, follow the Install New Software section in the Workbench Update Site guide and select the IRI DarkShield feature from the IRI Tools update site. The same guide can be used to keep your software up to date as important bug fixes or new features become available.
For developers who wish to call DarkShield as a standalone program, an RPC API can be installed to run DarkShield separately from IRI Workbench. A licensed executable is still required. Please contact darkshield@iri.com for additional information regarding this option.
Data Classification
DarkShield uses the same Data Classification framework as FieldShield to define and catalog one or more items of PII or other sensitive information. Data Classes are configured globally through the IRI Data Classes and Groups preferences page.
Workbench comes pre-configured with some common Data Classes, and enables the modification and addition of Data Classes to address specific use cases. The saved Data Classes, as well as masking rules, job configuration files, and other project artifacts, can also be shared among team members through source control systems supported in Eclipse, such as Git.
Data Classes consist of any number and combination of Data Class matchers, including:
- Strings conforming to IRI-supplied or custom-defined Java Regular Expression (Regex) patterns, which are ideal for Personal Identification Numbers (PINs), email addresses and phone numbers. These Regex searches can also be computationally validated using JavaScript to avoid false positives (e.g., calculating a checksum for credit cards).
- Exact matches to string values in a lookup set file (e.g., countries), and fuzzy matches (RPC API only)
- Named-Entity Recognition (NER), based on machine-trained Natural Language Processing (NLP) Models (e.g., names and addresses)
- Bounding boxes to define specific, repeated regions within images to mask
- Facial detection (available on request)
Note that a Data Class can be declared without any Data Class matchers. In FieldShield, such Data Classes are used to classify fields based on column names only. There is currently no functional use in DarkShield for specifying a Data Class without any matchers.
NER and face models, and bounding boxes, are currently exclusive to DarkShield, but both Regular Expressions and exact matches can be utilized in FieldShield as well. Bounding Box matchers are only available for image data.
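As an illustration of two of the matcher types listed above, the following Python sketch pairs a regular expression with a computational check (the Luhn checksum for card numbers) and performs exact matching against a lookup set. DarkShield itself uses Java Regex patterns and JavaScript validation, so this is an analogy in a different language rather than DarkShield syntax; the pattern, the lookup values, and the function names are assumptions made for the example.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

# Hypothetical lookup set; in practice this would be loaded from a set file.
COUNTRIES = {"Canada", "France", "Japan"}


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def find_cards(text: str) -> list:
    """Regex match first, then discard candidates that fail validation."""
    return [m.group() for m in CARD_PATTERN.finditer(text)
            if luhn_valid(m.group())]


def find_lookup_matches(text: str, lookup: set) -> list:
    """Return words in text that exactly match an entry in the lookup set."""
    return [word for word in re.findall(r"[A-Za-z]+", text) if word in lookup]
```

The validation step is what removes false positives: a 16-digit reference number matches the pattern but is rejected when its checksum fails.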
Masking Functions
DarkShield applies masking functions through Data Rules. The rules can be created and stored for future use and modification in an IRI Rule library stored in a Workbench project folder.
Data Rules can be matched to Data Classes or pattern matchers when defining Data Rule Matchers in the Dark Data Discovery Wizard. The Data Rule Matchers are then used to consistently mask the discovered PII via:
- Multiple NSA Suite B and FIPS-compliant encryption and decryption algorithms, including format-preserving encryption
- SHA-1 and SHA-2 hashing
- ASCII de-ID (bit scrambling)
- Binary encoding
- Deletion/removal
- Full or partial string redaction
- Lookup value pseudonymization
- Byte shifting and string functions
DarkShield can search and mask any unstructured text and document files with all of the methods listed above. However, image formats and some PDF files support only black-box redaction, irrespective of the specified Data Rule.
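For readers unfamiliar with these categories, the Python sketch below shows what three of them do to a value: partial string redaction, SHA-2 hashing, and lookup value pseudonymization. These are generic illustrations built on the standard library, not the DarkShield or SortCL masking functions themselves; the function names and parameters are hypothetical.

```python
import hashlib


def redact(value: str, keep_last: int = 0, mask_char: str = "*") -> str:
    """Full redaction by default; keep the last N characters if requested."""
    if keep_last <= 0:
        return mask_char * len(value)
    masked_length = max(0, len(value) - keep_last)
    return mask_char * masked_length + value[-keep_last:]


def sha256_hash(value: str) -> str:
    """One-way SHA-2 (SHA-256) hash of the value, as a hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def pseudonymize(value: str, replacements: list) -> str:
    """Pick a replacement deterministically, so that the same input
    always maps to the same pseudonym (an assumed design choice)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replacements)
    return replacements[index]


print(redact("123-45-6789", keep_last=4))  # *******6789
print(redact("secret"))                    # ******
```

Deterministic output matters when the same value appears in several sources: consistent masking preserves the ability to join or compare masked records.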
Search Matchers
To allow for simultaneous searching and masking of sensitive data, DarkShield users create Search Matchers, which map a Data Class or Data Class Group to a Data Rule. Any sensitive information matched by a Data Class Matcher from that Data Class or Data Class Group is masked using the Data Rule associated with the Search Matcher.
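The mapping from Data Class to Data Rule can be modeled in a few lines of Python. The sketch below is a conceptual model under simplifying assumptions (each Data Class has a single regex matcher, and each Data Rule is a plain function); the class names mirror the terms used in this article but are not DarkShield's configuration format or API.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Pattern


@dataclass
class DataClass:
    """Hypothetical Data Class with a single regex matcher."""
    name: str
    pattern: Pattern[str]


@dataclass
class SearchMatcher:
    """Pairs a Data Class with the rule used to mask its matches."""
    data_class: DataClass
    rule: Callable[[str], str]


def search_and_mask(text: str, matchers: List[SearchMatcher]) -> str:
    """Find every match for each Data Class and mask it with its rule."""
    for matcher in matchers:
        text = matcher.data_class.pattern.sub(
            lambda m: matcher.rule(m.group()), text
        )
    return text


EMAIL = DataClass("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"))
SSN = DataClass("US_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"))

matchers = [
    SearchMatcher(EMAIL, lambda value: "<EMAIL>"),
    SearchMatcher(SSN, lambda value: "***-**-" + value[-4:]),
]

print(search_and_mask("Contact jane@example.com, SSN 123-45-6789.", matchers))
```

Keeping the class definition separate from the rule means the same Data Class can be paired with different rules in different jobs.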
Search Matchers also allow the user to specify document-specific filters. Since this is a fairly complex topic, we have a separate article devoted to filters in our Advanced Topics section.
Sources and Targets
DarkShield supports a variety of data sources and targets. Each data source has a corresponding wizard which is used to specify the necessary configuration options to access the data. Regardless of the source, the Data Classes, Data Rules, and Search Matchers will be configured in a similar manner, so there will be some overlap between the different articles.
The Sources and Targets articles in this series will cover:
- Text files, MS documents, etc.
- PDFs and image files
- NoSQL DBs (MongoDB, Cassandra, and Elasticsearch)
- Relational Databases
- Files in AWS S3
Advanced Topics
The advanced topics will be referenced from the Sources and Targets articles, but can also be read as standalone installments. The topics covered (or to be covered) in this series include:
- Video: PDF configuration options
- Video: Named Entity Recognition (NER)
- Path filters (XML & JSON file example)
- RPC API calls
- Facial detection and recognition