IRI DarkShield Version 4 features a Remote Procedure Call (RPC) Application Programming Interface (API) for searching and masking unstructured files. The API allows DarkShield to be easily embedded as middleware in a pipeline outside of IRI Workbench.
Currently supported formats include:
- Plain text
- PDFs (with embedded images)
- Images (.png, .jpg/x/2, .tif/f, .gif, .bmp)
Support for Microsoft documents (Word, Excel, and PowerPoint) will also be released in upcoming minor updates to the API.
The API is built as a plugin on top of the IRI Web Services Platform (codenamed Plankton), allowing the user to pick which services they will require while utilizing the same hosting, configuration, and logging capabilities provided through the platform.
Before continuing, please familiarize yourself with the operations of the base DarkShield API, which are covered in a separate article. That article outlines the declaration and use of Search and Mask Contexts to search and mask free-form text sent in JSON payloads.
The Files API utilizes the same Search and Mask Contexts to perform search and masking operations on text parsed from the different file formats. The Files API includes additional matchers, filters, and configuration options for specific file formats.
The rest of the article consists of an explanation of the various configuration options available through the File Search Contexts and File Mask Contexts, followed by a general demo setup guideline. After reading these three sections, you can choose to navigate to the sections on the different file types that you are most interested in searching and masking.
All demos associated with this article can be downloaded from our Git repository here. To run the demos, you will need a working copy of the DarkShield-Files API hosted locally on your computer. Contact your IRI representative for a trial copy of the software.
File Search Contexts
The Files API introduces an extension to the base Search Context, a File Search Context, for defining search criteria for files. The following snippet of the OpenAPI definition shows the structure of its schema:
The File Search Context is structured very much like the base Search Context. The name attribute is used to uniquely identify the context when performing search operations. The matchers array contains a list of matchers used to annotate the files.
Following are the supported matchers:
- JSON Path Matcher (jsonPath) – A matcher which matches all values found under a given JSON path. This matcher only works in JSON documents.
- Search Context Matcher (searchContext) – A matcher which delegates search tasks to a Search Context hosted in the base API (see the associated article for details of all supported search matchers). It can define additional format-specific filters for narrowing down the search to specific locations (a given JSON path in JSON documents, for example).
- Column Matcher (column) – A matcher used in tabular files to match an entire column based on a regular expression match of the column name and/or an index range. This matcher currently works only on delimited (CSV/TSV) files; support for Excel spreadsheets is planned.
- XML Path Matcher (xmlPath) – A matcher for XML documents only, which matches all values (within tags and attributes) found under a given XML path.
The name of each File Search Matcher will be saved with the annotations that it generates. For annotations generated by the Search Context Matcher, the names of the Search Matchers in the base API will be used instead.
The last attribute within a File Search Context is the configs. This object contains format-specific configuration options for how files should be parsed.
DarkShield sets reasonable defaults for how each file type should be parsed, so it is unlikely that you will need to tweak these options except for very specific use-cases. We will go over some of these options in the file-specific sections below.
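As a rough illustration of the structure described in this section, a File Search Context might be assembled as a Python dictionary before being sent to the API. Only the name, matchers, and configs attributes and the matcher identifiers (searchContext, column) come from the text above; the "type" key, the "pattern" field name, and all context and matcher names are assumptions made for this sketch, so check them against the OpenAPI definition shipped with your copy of the API.

```python
import json

# Hypothetical File Search Context payload. The top-level attributes
# ("name", "matchers", "configs") follow the article; the "type" and
# "pattern" keys and every name below are illustrative assumptions.
file_search_context = {
    "name": "FileSearchContext",
    "matchers": [
        # Delegates to a Search Context already hosted in the base API;
        # the name is assumed to be that base context's name.
        {"name": "SearchContext", "type": "searchContext"},
        # Matches whole columns whose header matches a regular expression.
        {"name": "NamesMatcher", "type": "column", "pattern": ".*name"},
    ],
    # Left empty so that DarkShield's format-specific defaults apply.
    "configs": {},
}

payload = json.dumps(file_search_context, indent=2)
print(payload)
```

Keeping the context as plain data makes it easy to version alongside the script that creates it.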
File Mask Contexts
Just like the File Search Context, the Files API also defines an extension for the Mask Context. The following snippet of the OpenAPI definition shows the structure of its schema:
The File Mask Context is structured in a very similar manner to the base Mask Context.
The name attribute is used to uniquely identify the context when performing masking operations. The rules array contains a list of rules which will be used to mask the annotations that were found in the files.
Currently only one type of rule is supported, the Mask Context Rule, which delegates masking operations to a Mask Context hosted in the base API. Just like the Search Context Matchers in the File Search Context, the name attribute in a Mask Context Rule indicates the name of the Mask Context in the base API. The rule name applied to the mask results will be the same rule name that matched in the base Mask Context.
Also like the File Search Context, the File Mask Context contains a configs object. That object holds format-specific configurations for how the original files should be parsed (if a mask operation is being performed with an annotations.json file), or some additional masking-specific configuration options.
Some of the configuration options may be the same as their File Search Context counterparts, although in many cases DarkShield will utilize some information that is stored in the annotations.json file to parse the file in the same way as the File Search Context.
We will go over some of these options in the file-specific sections below.
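A File Mask Context can be sketched in the same way. The name, rules, and configs attributes follow the description above, and the rule's name is taken to be the name of a Mask Context in the base API. The "type" key and its "maskContext" value, as well as the context names, are assumptions for illustration only.

```python
import json

# Hypothetical File Mask Context payload. "name", "rules", and "configs"
# come from the article; the "type" key, its value, and the names used
# here are illustrative assumptions.
file_mask_context = {
    "name": "FileMaskContext",
    "rules": [
        # The only supported rule type delegates to a base-API Mask Context,
        # identified by the rule's name.
        {"name": "MaskContext", "type": "maskContext"},
    ],
    "configs": {},
}


def delegated_mask_contexts(context: dict) -> list:
    """Return the names of the base Mask Contexts that the rules delegate to."""
    return [rule["name"] for rule in context["rules"]]


print(json.dumps(file_mask_context, indent=2))
print(delegated_mask_contexts(file_mask_context))
```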
The demos below were written in Python to send requests and process responses from a locally hosted Files API. The same interactions can be simulated using any other programming language or tool which can interface with an RPC API.
The purpose of the demos is to illustrate how the API searches and masks different files, as well as how to correctly structure the requests. The sources and targets of the files vary with each use-case, and will not be the focus of this article.
The Files API utilizes multipart/form-data requests and responses to send and receive files, delegating the task of retrieving files and storing the masked files to the client (the calling program, which is a Python script in these demos). This approach maximizes the flexibility of the API while reducing its complexity, since scripting these interactions is almost always easier than trying to mould the use-case to match the supported functionality of the API.
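To show what a multipart/form-data request body looks like on the wire, here is a minimal sketch that builds one with only the Python standard library. The part names ("context" and "file") and the keys inside the context JSON (fileSearchContextName, fileMaskContextName) are assumptions for this sketch and are not taken from this article; in practice an HTTP client library would normally do this encoding for you.

```python
import json
import uuid


def encode_multipart(context_json: str, filename: str, file_bytes: bytes):
    """Build a multipart/form-data body with one JSON part and one file part."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    file_header = (
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'
    )
    lines = [
        delimiter,
        b'Content-Disposition: form-data; name="context"',  # assumed part name
        b"Content-Type: application/json",
        b"",
        context_json.encode("utf-8"),
        delimiter,
        file_header.encode("utf-8"),  # assumed part name "file"
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        delimiter + b"--",  # closing boundary
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


# Hypothetical context part naming the two contexts to use.
context = json.dumps({
    "fileSearchContextName": "FileSearchContext",
    "fileMaskContextName": "FileMaskContext",
})
body, content_type = encode_multipart(context, "example.txt", b"hello world")
print(content_type.split(";")[0])
print(len(body), "bytes")
```

The response is multipart as well, so the client is responsible for splitting it into the masked file and the results.json part.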
Every demo project listed below contains a setup.py file with setup and teardown functions, which respectively create and destroy the contexts needed for the demo. To execute a demo, run python main.py. A README.md is also provided to describe the expected results of the execution.
Each demo produces the masked file along with a results.json file, which contains the locations of the annotations and masked results. The results can be used as an audit trail, or discarded if PII retention in this form cannot be secured.
Plain Text Files (.txt)
While regular text files are deceptively simple, in some cases it is important to understand how DarkShield handles them internally.
By default, DarkShield will load, search, and mask the entire file in memory. While fast, this approach may clash with memory-constrained servers or very large files which cannot fit into memory. DarkShield therefore provides an option to process the file in text blocks.
Here is a snippet of the configuration options for text files inside of the File Search Context, as seen from the OpenAPI document:
The bufferLimit parameter is used to specify the size of the text block in characters that will be read in when performing a search. DarkShield will search for the last delimiter within the buffer limit to create a text block.
By default, the delimiter is a new line (‘\n’), although this can be set to another string. If no delimiter is found within the buffer limit, a parsing error will be returned to the client.
You can modify the bufferLimit and delimiter parameters to see how they affect the resulting results.json file.
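The block-splitting behavior described above can be approximated with a short sketch. This is an illustration of the idea and not DarkShield's actual implementation; in particular, accepting a final block that has no trailing delimiter is an assumption of this sketch.

```python
def split_into_blocks(text: str, buffer_limit: int, delimiter: str = "\n"):
    """Split text into blocks of at most buffer_limit characters, each ending
    at the last delimiter found within the limit."""
    blocks = []
    start = 0
    while start < len(text):
        window = text[start:start + buffer_limit]
        if start + buffer_limit >= len(text):
            # The remaining text fits in one buffer (assumed to be accepted).
            blocks.append(window)
            break
        cut = window.rfind(delimiter)
        if cut == -1:
            # Mirrors the parsing error returned when no delimiter is found.
            raise ValueError("no delimiter found within the buffer limit")
        end = cut + len(delimiter)
        blocks.append(window[:end])
        start += end
    return blocks


sample = "alpha\nbeta\ngamma\n"
for block in split_into_blocks(sample, buffer_limit=12):
    print(repr(block))
```

A smaller buffer limit produces more, shorter blocks, which is why annotation offsets are reported relative to the block they were found in.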
Note that the File Mask Context does not have an equivalent text configuration for specifying these parameters. Instead, the remediator will automatically resize its buffers to fit the largest text block in the annotations.json file. In search and mask operations, the remediator will operate on the text blocks that the annotator has already read.
This demo can be found under the text-files folder. The example searches for and masks the emails and social security numbers (SSNs) found inside the text file using regular expression pattern matches. Each email is hashed, and each SSN has its first 5 digits redacted.
Here is a snippet of the original file along with the masked result:
In addition to the masked text file, the API will also return a results.json file containing the annotations that were found and how they were masked, along with any failed masking results if something went wrong:
The structure of the results.json file varies for different file types. In text files, as the example above shows, the offsets are relative to the text block that the annotations were found in. We will not be showing the results.json for all subsequent examples, but one will always be generated for every file that is searched and masked.
Make sure this file is either securely stored or deleted, since it contains PII or other sensitive information.
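For readers who want a feel for the two masking outcomes used in this demo (hashed emails, and SSNs with the first five digits redacted), here is a stand-alone approximation in plain Python. The regular expressions, the choice of SHA-256, and the asterisk redaction character are assumptions of this sketch; they are not DarkShield's patterns or masking functions.

```python
import hashlib
import re

# Simplified patterns chosen for this sketch only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_text(text: str) -> str:
    """Hash every email and redact the first five digits of every SSN."""
    text = EMAIL.sub(
        lambda m: hashlib.sha256(m.group().encode("utf-8")).hexdigest(), text
    )
    return SSN.sub(lambda m: "***-**-" + m.group(1), text)


print(mask_text("Contact jane.doe@example.com, SSN 123-45-6789."))
```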
Tabular Files (.csv, .tsv)
Tabular data often contains embedded free-form text, making it suitable for use with DarkShield. In many cases, you may also wish to combine DarkShield’s ability to mask portions of the free-form text with FieldShield’s ability to mask an entire column.
We may also want to filter on certain columns when deciding which text will be searched by a Search Context. This can reduce the number of false positives, and speed overall processing.
To match and mask an entire column in a tabular file, DarkShield supports the creation of Column Matchers:
A pattern can be specified to match on particular column headers. For tabular files that do not have a header, a range of indexes can be specified. It is also possible to combine both pattern and range (for example, match on the firstName column but only if it appears in the first 3 columns).
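The way a header pattern and an index range combine can be sketched as follows. This is an illustrative model of the matching logic only: zero-based indexes and the use of a Python range are assumptions, and the real Column Matcher is configured declaratively in the File Search Context.

```python
import re


def match_columns(headers, pattern=None, index_range=None):
    """Return the indexes of columns that satisfy every criterion given.

    headers: list of column names (or placeholders for headerless files)
    pattern: optional regular expression applied to each header
    index_range: optional range of zero-based column indexes (assumed)
    """
    matched = []
    for index, header in enumerate(headers):
        if pattern is not None and not re.search(pattern, header):
            continue
        if index_range is not None and index not in index_range:
            continue
        matched.append(index)
    return matched


headers = ["id", "firstName", "email", "firstName"]
# Match firstName, but only within the first three columns.
print(match_columns(headers, pattern=r"^firstName$", index_range=range(0, 3)))
# Headerless file: select columns by position alone.
print(match_columns(["", "", "", ""], index_range=range(1, 3)))
```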
Similarly, Column Filters can be attached to a Search Context Matcher. The Search Context Matcher will only search free form text in the columns that matched that Column Filter:
Lastly, additional CSV configuration options may be specified to improve CSV file parsing. While most CSV files follow a fairly common standard that can be automatically detected by DarkShield, in some cases it might be necessary to tweak certain options:
All standard CSV parsing options, like the delimiter and quote characters, can be configured. For very large CSV files, it may also be necessary to set the maximum size of any given column entry (maxCharsPerColumn), or the maximum number of columns (maxColumns), so that DarkShield can adjust its internal buffers accordingly.
The File Mask Context can also be configured with the same options, since it may need to read the CSV file on a subsequent masking run (if search and masking were not done in one step). However, by default, the File Mask Context will always attempt to use the CSV configuration options saved in the metadata object inside the annotations.json file, if available.
The demo can be found under the csv-tsv folder. The example will search and mask the following:
- Emails (EmailsMatcher): regular expression pattern matcher with hashing rule.
- Names (NamesMatcher): Using a column matcher on any column ending with ‘name’, and a Named Entity Recognition (NER) matcher to search through the comment field using a column filter (all names are masked using format-preserving encryption).
Note that the CSV file appears more structured, while the TSV file shows off some of the more non-standard cases that DarkShield can handle. Both contain the same information, and will be masked in the same manner.
Here is a snippet of the original CSV file along with the masked result:
And here is a snippet of the original TSV file along with the masked result:
Semi-structured Files (.xml, .json)
Semi-structured files are a popular format for storing unstructured data due to their flexibility compared to traditional tabular or relational structures. The Files API is capable of handling semi-structured files with arbitrary nesting and can search and mask free-form text found within JSON values or XML tags and attributes.
In some cases, you may also wish to mask a value based on the name of the key in JSON files or the tag and attribute in XML files, similar to FieldShield’s approach to masking. Finally, you can also filter on which values will be searched by a Search Context using JSON/XML paths.
To achieve the first use case, DarkShield supports the creation of JSON Path Matchers and XML Path Matchers to match on an entire value for keys that match a JSON path in JSON documents and for tags and attributes that match an XML path in XML documents:
If the JSON/XML path resolves to a sub-tree, all values under the sub-tree will be masked.
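The sub-tree behavior can be illustrated with a small recursive sketch that masks every value under a given key, wherever that key appears (comparable in spirit to a path such as $..name). It is a conceptual model only: it works on an already-parsed document in memory rather than a stream, and the masking function is a placeholder.

```python
def mask_values(node, mask):
    """Mask every leaf value under node, descending through sub-trees."""
    if isinstance(node, dict):
        return {key: mask_values(value, mask) for key, value in node.items()}
    if isinstance(node, list):
        return [mask_values(value, mask) for value in node]
    return mask(node)


def mask_key_anywhere(node, key, mask):
    """Mask all values found under `key` at any nesting depth."""
    if isinstance(node, dict):
        return {
            k: mask_values(v, mask) if k == key else mask_key_anywhere(v, key, mask)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [mask_key_anywhere(v, key, mask) for v in node]
    return node


document = {
    "name": "Ann",
    "contact": {"name": {"first": "Bob", "last": "Lee"}, "phone": "555"},
    "items": [{"name": "Cy"}],
}
# Placeholder mask: replace each character with "X".
print(mask_key_anywhere(document, "name", lambda v: "X" * len(str(v))))
```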
Similarly, JSON Path Filters and XML Path Filters can be attached to a Search Context Matcher. Instead of matching on the entire value, the Search Context will be used to search and mask portions of the free-form text found within:
Note that as of this writing, using XML Path Matchers or Filters will force DarkShield to load the entire XML file into memory in order to extract the correct tags and attributes. The JSON Path Matchers and Filters process JSON in a streaming fashion, but currently do not support array filters (for example, $..name[2:]).
There are currently no parsing configuration options for either JSON or XML in the File Search Context. However, in the File Mask Context you can specify whether the JSON output should be pretty-printed (properly indented) using the prettyPrint parameter:
By default, masked JSON documents are written out in a compact format by removing any extraneous whitespace. XML is written out in the same format as the input.
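The difference between compact and pretty-printed JSON output can be seen with Python's own json module. This only demonstrates the two output styles; the pretty_print argument below is a local name for this sketch, and the indentation width DarkShield uses is not specified here.

```python
import json


def write_json(document, pretty_print=False) -> str:
    """Serialize a document either compactly or with indentation."""
    if pretty_print:
        return json.dumps(document, indent=2)
    # Compact form: no whitespace after separators.
    return json.dumps(document, separators=(",", ":"))


document = {"name": "x", "tags": [1, 2]}
print(write_json(document))
print(write_json(document, pretty_print=True))
```

Both forms parse to the same document; only the whitespace differs.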
The demo can be found under the json-xml folder. The example will search and mask the following:
- Emails (EmailMatcher): Found using a regular expression and masked using a hashing function.
- Phone Numbers (PhoneMatcher): Found using a regular expression and masked using format-preserving encryption.
- Names (NameMatcher): Found using a Named Entity Recognition (NER) model and using format-specific JSON/XML paths (all names can be found in the ‘name’ key/tag, regardless of nesting).
Here is a snippet of the original JSON file along with the masked result:
And here is a snippet of the original XML file along with the masked result:
PDF and Image Files
DarkShield can also process unstructured data inside of PDF and image files. DarkShield will use Optical Character Recognition (OCR) to extract the text from the file to perform the search. DarkShield can also handle images inside PDF documents.
The PDF configuration options inside the File Search Context deal mostly with memory efficiency vs. performance tradeoffs:
For PDFs that contain a large number of unique images, the disableImageCaching option can reduce DarkShield’s memory usage. In addition, you may choose not to search through images if you know that they do not contain any PII, by setting the disableImageProcessing option.
Finally, you can also set the maxMainMemoryBytes option, which limits how much of the document DarkShield will keep in memory at any given time. By default, DarkShield will load the entire document in memory for higher performance at the cost of more memory usage. All 3 options can also be set in the File Mask Context.
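As a sketch of how the three PDF options might be set together, the snippet below places them inside the configs object of a File Search Context. The option names come from the text above; nesting them under a "pdf" key, treating the two disable options as booleans, and the context and matcher names are assumptions to verify against the OpenAPI definition.

```python
import json

# The three option names are from the article; the "pdf" key and the
# boolean/integer value types are assumptions of this sketch.
pdf_config = {
    "disableImageCaching": True,      # trade speed for lower memory usage
    "disableImageProcessing": False,  # keep searching embedded images
    "maxMainMemoryBytes": 256 * 1024 * 1024,  # cap in-memory document data
}

file_search_context = {
    "name": "FileSearchContext",  # hypothetical context name
    "matchers": [{"name": "SearchContext", "type": "searchContext"}],
    "configs": {"pdf": pdf_config},
}

print(json.dumps(file_search_context, indent=2))
```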
The image configuration options inside the File Search Context are related to tweaking its OCR engine:
DarkShield utilizes Tesseract models to perform OCR. By default, DarkShield will download the necessary models for you if they do not already exist and place them in the tessdata folder inside the API’s installation directory, unless a different path is specified in the tessDataPath option.
You can also specify the language that the OCR engine will attempt to read from the image. Multiple languages can be specified using the plus (‘+’) character to separate them (for example, “eng+hin+ita”).
If you are familiar with the Tesseract engine, you can also specify specific configuration parameters inside of the tessConfigVariables dictionary.
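The image options could be assembled along these lines. tessDataPath and tessConfigVariables are named in the text above, and the plus-separated language string follows the example given; the "language" key name and the example path are assumptions of this sketch. preserve_interword_spaces is an existing Tesseract configuration variable, used here purely as an example value.

```python
def tesseract_languages(codes):
    """Join Tesseract language codes with '+', as the OCR engine expects."""
    return "+".join(codes)


image_config = {
    "tessDataPath": "/opt/darkshield/tessdata",  # hypothetical path
    "language": tesseract_languages(["eng", "hin", "ita"]),  # key name assumed
    "tessConfigVariables": {"preserve_interword_spaces": "1"},
}

print(image_config["language"])
```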
Image masking is limited since only black box redaction is used regardless of the masking rule that was applied. DarkShield can insert masked text into a PDF, provided that two conditions hold true:
- The length of the replacement text is less than or equal to the length of the original. Note that the length is calculated in the number of characters, and not based on the widths of the glyphs as they are drawn out on the PDF.
- The replacement text can be encoded using a standard font available to DarkShield, or a font that is stored in the PDF.
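The two conditions above can be modeled as a simple pre-check. This is a conceptual sketch only: the set of encodable characters stands in for the glyph coverage of a real font, and the returned labels are descriptive strings chosen for this example.

```python
import string


def check_replacement(original: str, replacement: str, encodable: set) -> list:
    """Return the conditions that fail; an empty list means the text fits."""
    problems = []
    # Condition 1: compare character counts, not rendered glyph widths.
    if len(replacement) > len(original):
        problems.append("text overflow")
    # Condition 2: every character must be encodable in an available font.
    if any(ch not in encodable for ch in replacement):
        problems.append("encoding error")
    return problems


# Stand-in for a font that only covers ASCII letters and the space character.
ascii_font = set(string.ascii_letters + " ")

print(check_replacement("John Smith", "Qwrt Lmnop", ascii_font))
print(check_replacement("Ann", "Robert", ascii_font))
print(check_replacement("Ann", "Añn", ascii_font))
```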
By default, if either condition fails, DarkShield will generate a failed masking result and leave the original annotation unredacted. The File Mask Context also provides options for how to handle these issues by specifying the onEncodingError and/or onTextOverflow options:
The demo can be found under the pdf-image folder. Both the PDF and the image contain a form with a name and two social security numbers. In this example, DarkShield will redact the first five digits of the social security numbers and apply format-preserving encryption to the name. Note that for the image, a black box redaction is applied instead.
Here is a snippet of the original PDF file along with the masked result:
And here is a snippet of the original image file along with the masked result:
If you have any questions, please feel free to email firstname.lastname@example.org.