{"id":14244,"date":"2021-01-12T16:02:37","date_gmt":"2021-01-12T21:02:37","guid":{"rendered":"http:\/\/www.iri.com\/blog\/?p=14244"},"modified":"2025-09-15T08:38:14","modified_gmt":"2025-09-15T12:38:14","slug":"darkshield-files-rpc-api","status":"publish","type":"post","link":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/","title":{"rendered":"IRI DarkShield-Files RPC API"},"content":{"rendered":"<p>The <a href=\"https:\/\/www.iri.com\/products\/darkshield\">IRI DarkShield<\/a> data masking tool features a self-hosted (on-premise) Remote Procedure Call (RPC) Application Programming Interface (API) for PII data masking in structured, semi-structured and unstructured files. This data masking API allows DarkShield to be easily embedded as middleware in a pipeline outside of <i>IRI Workbench<\/i>.<\/p>\n<p>Note this API is also leveraged in the IRI Workbench <a href=\"https:\/\/www.iri.com\/products\/workbench\/darkshield-gui\">GUI for DarkShield<\/a> through the File Search\/Mask wizard (which now also supports the detection and redaction of signatures among other PII), as described in <a href=\"https:\/\/www.iri.com\/blog\/data-protection\/finding-and-masking-pii-in-files-with-the-darkshield-files-wizard\/\">this article<\/a>. 
<b>Also note<\/b> that <em>DarkShield Version 6 also includes a single-endpoint <a href=\"https:\/\/www.iri.com\/blog\/data-protection\/iri-darkshield-rest-api\/\">REST API<\/a> that covers RDB, NoSQL DB, files, and streaming text sources.<\/em><\/p>\n<p>Currently supported file formats include:<\/p>\n<ul>\n<li aria-level=\"1\">CSV\/TSV<\/li>\n<li>Fixed Width<\/li>\n<li aria-level=\"1\">FHIR\/HL7\/X12<\/li>\n<li aria-level=\"1\">JSON<\/li>\n<li aria-level=\"1\">MS Excel (.xls\/.xlsx)<\/li>\n<li aria-level=\"1\">MS Word (.doc\/.docx)<\/li>\n<li aria-level=\"1\">MS PowerPoint (.ppt\/.pptx)<\/li>\n<li aria-level=\"1\">Parquet<\/li>\n<li aria-level=\"1\">Plain \/ Raw Text<\/li>\n<li aria-level=\"1\">XML<\/li>\n<li aria-level=\"1\">PDFs (with embedded images)<\/li>\n<li aria-level=\"1\">Images (png, .jpg\/x\/2, .tif\/f, .gif, .bmp)<\/li>\n<li aria-level=\"1\">DICOM (medical studies, including metadata and burned in pixels)<\/li>\n<\/ul>\n<p>The API is built as a plugin on top of the IRI Web Services Platform (codenamed <i>Plankton<\/i>), allowing the user to pick which services they will require while utilizing the same hosting, configuration, and logging capabilities provided through the platform. 
It is typically used for masking unstructured data at rest, and can also be used for dynamic data masking.<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/iri-web-services-architecture-cropped.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-14216 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/iri-web-services-architecture-cropped.png\" alt=\"\" width=\"438\" height=\"449\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/iri-web-services-architecture-cropped.png 438w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/iri-web-services-architecture-cropped-293x300.png 293w\" sizes=\"(max-width: 438px) 100vw, 438px\" \/><\/a><\/p>\n<p>Before continuing with this article, please familiarize yourself with the operations of the base <i>DarkShield <\/i>API in <a href=\"http:\/\/www.iri.com\/blog\/data-protection\/darkshield-rpc-api\/\">this article<\/a>. It outlines the declaration and use of Search and Mask Contexts to search and mask free-form text sent in JSON payloads.<\/p>\n<p>The Files API uses the same Search and Mask Contexts to perform search and masking operations on text parsed from the different file formats. 
The Files API includes additional matchers, filters, and configuration options for specific file formats.<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14221 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png\" alt=\"\" width=\"649\" height=\"439\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png 861w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram-300x203.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram-768x519.png 768w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>The rest of the article explains the configuration options available through the <i>File Search Contexts <\/i>and <i>File Mask Contexts<\/i>, followed by a general demo setup guideline. After reading these three sections, you can skip to the sections on the file types that you are most interested in searching and masking.<\/p>\n<p>All demos associated with this article can be downloaded from our Git repository <a href=\"https:\/\/github.com\/TeamIRI\/darkshield-api-demos\">here<\/a>. To run the demos, you will need a working copy of the DarkShield-Files API hosted locally on your computer. Read the instructions in the Readme.md file for information on how to set up an environment to run the demos. Contact your <a href=\"https:\/\/www.iri.com\/partners\/resellers\">IRI representative<\/a> for a trial copy of the software.<\/p>\n<h5><b>File Search Contexts<\/b><\/h5>\n<p>The Files API introduces an extension to the base Search Context, a <i>File Search Context<\/i>, for defining search criteria for files. 
The following snippet of the OpenAPI definition shows the structure of its schema:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14256 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-1-1024x399.png\" alt=\"\" width=\"649\" height=\"253\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-1-1024x399.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-1-300x117.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-1-768x300.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-1.png 1600w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>The File Search Context is structured very much like the base Search Context. The <i>name <\/i>attribute is used to uniquely identify the context when performing search operations. The <i>matchers <\/i>array contains a list of matchers used to annotate the files.<\/p>\n<p>Following are the supported matchers:<\/p>\n<ol>\n<li>Column Matcher (column): A matcher used in tabular files to match an entire column based on a regular expression match of the column name and\/or an index range. This matcher works on delimited (CSV\/TSV) files.<\/li>\n<li aria-level=\"1\">Excel Cell Matcher (excelCell): A matcher for matching on cells inside of Excel sheets. There are numerous options available for filtering on the sheet name and\/or index, as well as matching on specific table headers or cell addresses. Unlike standard CSV\/TSV tabular data, it is also possible to match by row instead of column.<\/li>\n<li>JSON Path Matcher (jsonPath) &#8211; A matcher which matches all values found under a given <a href=\"https:\/\/goessner.net\/articles\/JsonPath\/\">JSON path<\/a>. 
This matcher only works with JSON documents.<\/li>\n<li aria-level=\"1\">Search Context Matcher (searchContext) &#8211; A matcher which delegates search tasks to a Search Context hosted in the base API (see the associated <a href=\"http:\/\/www.iri.com\/blog\/data-protection\/darkshield-rpc-api\/\">article<\/a> for details of all supported search matchers). Can define additional format-specific filters for narrowing down the search to specific locations (a given json path in json documents, for example).<\/li>\n<li aria-level=\"1\">XML Path Matcher (xmlPath) &#8211; A matcher only for XML documents that matches all values (within tags and attributes) found under a given <a href=\"https:\/\/en.wikipedia.org\/wiki\/XPath\">XML path<\/a>.<\/li>\n<\/ol>\n<p>The name of each File Search Matcher will be saved with the annotations that they generate. For annotations generated by the Search Context Matcher, the names of the Search Matchers in the base API will be used instead.<\/p>\n<p>Each of the format-specific matchers can also be used as <i>filters <\/i>for the matchers inside of the Search Context Matcher (#4):<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/search-context-matcher-filters.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14654 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/search-context-matcher-filters.png\" alt=\"Search Context Matcher Filters\" width=\"649\" height=\"256\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/search-context-matcher-filters.png 2230w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/search-context-matcher-filters-300x119.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/search-context-matcher-filters-768x303.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/search-context-matcher-filters-1024x405.png 1024w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>If a filter is present in the 
SearchContextMatcher for a specific format, the search contexts will only search content that first matches the filter for that file format. This is useful in a variety of situations, for example running named entity recognition only on a comments column in a CSV file, or matching only customers\u2019 emails inside a deeply nested JSON file.<\/p>\n<p>The last attribute within a File Search Context is <i>configs<\/i>. This object contains format-specific configuration options for how files should be parsed.<\/p>\n<p>DarkShield sets reasonable defaults for how each file type should be parsed, so it is unlikely that you will need to tweak these options except for very specific use cases. We will go over some of these options in the file-specific sections below. File Search Contexts can be created by sending a request to the endpoint \/api\/darkshield\/files\/fileSearchContext.create.<\/p>\n<h5><b>File Mask Contexts<\/b><\/h5>\n<p>Just as it does for the Search Context, the Files API also defines an extension for the Mask Context. 
The following snippet of the OpenAPI definition shows the structure of its schema:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-2.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14257 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-2-1024x397.png\" alt=\"\" width=\"650\" height=\"252\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-2-1024x397.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-2-300x116.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-2-768x298.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-2.png 1600w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>The File Mask Context is structured in a very similar manner to the base Mask Context.<\/p>\n<p>The <i>name <\/i>attribute is used to uniquely identify the context when performing masking operations. The <i>rules <\/i>array contains a list of rules which will be used to mask the annotations that were found in the files.<\/p>\n<p>Currently only one type of rule is supported, the <i>Mask Context Rule<\/i>, which delegates masking operations to a Mask Context hosted in the base API. Just like the Search Context Matchers in the File Search Context, the <i>name <\/i>attribute in a Mask Context Rule indicates the name of the Mask Context in the base API. The rule name applied to the mask results will be the same rule name that matched in the base Mask Context.<\/p>\n<p>Also like the File Search Context, the File Mask Context contains a <i>configs <\/i>object. 
That object holds format-specific configurations for how the original files should be parsed (if a <i>mask <\/i>operation is being performed with an <i>annotations.json <\/i>file), or some additional masking-specific configuration options.<\/p>\n<p>Some of the configuration options may be the same as their File Search Context counterparts, although in many cases DarkShield will use information stored in the <i>annotations.json <\/i>file to parse the file in the same way as the File Search Context. File Mask Contexts can be created by sending a request to the endpoint \/api\/darkshield\/files\/fileMaskContext.create.<\/p>\n<p>We will go over some of these options in the file-specific sections below.<\/p>\n<h5><b>Demo Setup<\/b><\/h5>\n<p>The demos below were written in Python to send requests to, and process responses from, a locally hosted Files API. The same interactions can be reproduced using any other programming language or tool that can interface with an RPC API.<\/p>\n<p>The purpose of the demos is to illustrate how the API searches and masks different files, as well as how to correctly structure the requests. The sources and targets of the files vary with each use case, and will not be the focus of this article.<\/p>\n<p>The Files API uses multipart\/form-data requests and responses to send and receive files, delegating the task of retrieving files and storing the masked files to the client (the calling program, or Python script here). This approach maximizes the flexibility of the API while reducing its complexity, since scripting these interactions is almost always easier than trying to mould the use case to match the supported functionality of the API.<\/p>\n<p>Every demo project listed below contains a <i>setup.py <\/i>file with <i>setup <\/i>and <i>teardown <\/i>functions, which respectively create and destroy the contexts that the demo needs. To execute the demo, simply run <i>python main.py<\/i>. 
An additional <i>README.md <\/i>is provided to describe the expected results of the execution.<\/p>\n<p>Each demo produces the masked file along with its resulting <i>results.json <\/i>file, which contains the locations of the annotations and masked results. The results can be used as an audit trail, or discarded if PII retention in this form cannot be secured.<\/p>\n<h5><b>Plain Text Files\u00a0 (.txt)<\/b><\/h5>\n<p>While regular text files are deceptively simple, in some cases it is important to understand how DarkShield handles them internally.<\/p>\n<p>By default, DarkShield will load, search, and mask the entire file in memory. While fast, this approach may clash with memory-constrained servers or very large files which cannot fit into memory. DarkShield therefore provides an option to process the file in text blocks.<\/p>\n<p>Here is a snippet of the configuration options for text files inside of the File Search Context, as seen from the OpenAPI document:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-3.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14258 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-3-1024x341.png\" alt=\"\" width=\"649\" height=\"216\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-3-1024x341.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-3-300x100.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-3-768x256.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-3.png 1600w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>The <i>bufferLimit<\/i> parameter is used to specify the size of the text block in characters that will be read in when performing a search. 
DarkShield will search for the last delimiter within the buffer limit to create a text block.<\/p>\n<p>By default, the delimiter is a newline (\u2018\\n\u2019), although this can be set to another string. If no delimiter is found within the buffer limit, a parsing error will be returned to the client.<\/p>\n<p>You can modify the <i>bufferLimit<\/i> and <i>delimiter<\/i> parameters to see how they affect the resulting <i>results.json <\/i>file.<\/p>\n<p>Note that the File Mask Context does not have an equivalent text configuration for specifying these parameters. Instead, the remediator will automatically resize its buffers to fit the largest text block in the <i>annotations.json <\/i>file. In search and mask operations, the remediator will operate on the text blocks that the annotator has already read.<\/p>\n<p>This demo can be found under the <i>text-files<\/i> folder. The example will search for and mask each email address and Social Security number (SSN) found inside the text file using regular expression pattern matches. 
The email will be hashed, and the SSN will have the first 5 digits redacted.<\/p>\n<p>Here is a snippet of the original file along with the masked result:<\/p>\n<p>Original:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-4.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14259 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-4-1024x65.png\" alt=\"\" width=\"803\" height=\"51\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-4-1024x65.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-4-300x19.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-4-768x48.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-4.png 1079w\" sizes=\"(max-width: 803px) 100vw, 803px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-5.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14260 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-5-1024x61.png\" alt=\"\" width=\"806\" height=\"48\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-5-1024x61.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-5-300x18.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-5-768x46.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-5.png 1163w\" sizes=\"(max-width: 806px) 100vw, 806px\" \/><\/a><\/p>\n<p>In addition to the masked text file, the API will also return a <i>results.json <\/i>file containing the annotations that were found and how they were masked, along with any failed masking results if something went 
wrong:<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-6.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14261 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-6.png\" alt=\"\" width=\"651\" height=\"763\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-6.png 724w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-6-256x300.png 256w\" sizes=\"(max-width: 651px) 100vw, 651px\" \/><\/a><\/p>\n<p>The structure of the results.json file varies for different file types. In text files, as the example above shows, the offsets are relative to the text block that the annotations were found in. We will not be showing the results.json for all subsequent examples, but one will always be generated for every file that is searched and masked.<\/p>\n<p>Make sure this file is either securely stored or deleted, since it contains PII or other sensitive information.<\/p>\n<h5><b>Tabular Files (.csv, .tsv)<\/b><\/h5>\n<p>Tabular data often contains embedded free-form text, making it suitable for use with DarkShield. In many cases, you may also wish to combine DarkShield\u2019s ability to mask portions of the free-form text with FieldShield\u2019s ability to mask an entire column.<\/p>\n<p>We may also want to filter on certain columns when deciding which text will be searched by a Search Context. 
This can reduce the number of false positives, and speed overall processing.<\/p>\n<p>To match and mask an entire column in a tabular file, DarkShield supports the creation of <i>Column Matchers<\/i>:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-7.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14262 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-7-1024x435.png\" alt=\"\" width=\"650\" height=\"276\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-7-1024x435.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-7-300x128.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-7-768x326.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-7.png 1110w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>A <i>pattern <\/i>can be specified to match on particular column headers. For tabular files that do not have a header, a range of indexes can be specified. It is also possible to combine both pattern and range (for example, match on the <i>firstName <\/i>column but only if it appears in the first 3 columns).<\/p>\n<p>Similarly, <i>Column Filters <\/i>can be attached to a Search Context Matcher. 
The Search Context Matcher will only search free form text in the columns that matched that Column Filter:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-8.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14263 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-8-1024x544.png\" alt=\"\" width=\"649\" height=\"345\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-8-1024x544.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-8-300x159.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-8-768x408.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-8.png 1600w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>Lastly, additional CSV configuration options may be specified to improve CSV file parsing. While most CSV files follow a fairly common standard that can be automatically detected by DarkShield, in some cases it might be necessary to tweak certain options:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-9.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14264 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-9-1024x584.png\" alt=\"\" width=\"649\" height=\"370\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-9-1024x584.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-9-300x171.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-9-768x438.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-9.png 1600w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>All standard CSV parsing options, 
like the delimiter and quote characters, can be configured. For very large CSV files, it may also be necessary to set the maximum size of any given column entry (<i>maxCharsPerColumn<\/i>), or the maximum number of columns (<i>maxColumns<\/i>), so that DarkShield can adjust its internal buffers accordingly.<\/p>\n<p>The File Mask Context can also be configured with the same options, since it may need to read the CSV file on a subsequent masking run (if search and masking was not done in one step). However, by default, the File Mask Context will always attempt to use the CSV configuration options saved in the <i>metadata <\/i>object inside the <i>annotations.json <\/i>file, if available.<\/p>\n<p>The demo can be found under the <i>csv-tsv<\/i> folder. The example will search and mask the following:<\/p>\n<ol>\n<li aria-level=\"1\">Emails (EmailsMatcher): regular expression pattern matcher with hashing rule.<\/li>\n<li aria-level=\"1\">Names (NamesMatcher): Using a column matcher on any column ending with &#8216;name&#8217;, and a <i>Named Entity Recognition (NER)<\/i> matcher to search through the <i>comment<\/i> field using a column filter (all names are masked using format-preserving encryption).<\/li>\n<\/ol>\n<p>Note that this CSV file appears more structured, while the TSV file shows off some of the more non-standard cases that DarkShield can handle. 
Both contain the same information, and will be masked in the same manner.<\/p>\n<p>Here is a snippet of the original CSV file along with the masked result:<\/p>\n<p>Original:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-10.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14265 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-10-1024x75.png\" alt=\"\" width=\"806\" height=\"59\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-10-1024x75.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-10-300x22.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-10-768x56.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-10.png 1245w\" sizes=\"(max-width: 806px) 100vw, 806px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-11.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14266 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-11-1024x70.png\" alt=\"\" width=\"805\" height=\"55\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-11-1024x70.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-11-300x20.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-11-768x52.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-11.png 1110w\" sizes=\"(max-width: 805px) 100vw, 805px\" \/><\/a><\/p>\n<p>And here is the snippet of the original tsv file along with the masked result:<\/p>\n<p>Original:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-12.png\"><img 
loading=\"lazy\" decoding=\"async\" class=\" wp-image-14267 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-12-1024x143.png\" alt=\"\" width=\"652\" height=\"91\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-12-1024x143.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-12-300x42.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-12-768x107.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-12.png 1287w\" sizes=\"(max-width: 652px) 100vw, 652px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-13.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14268 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-13-1024x127.png\" alt=\"\" width=\"653\" height=\"81\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-13-1024x127.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-13-300x37.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-13-768x95.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-13.png 1464w\" sizes=\"(max-width: 653px) 100vw, 653px\" \/><\/a><\/p>\n<h5><strong>Fixed-Width Files<\/strong><\/h5>\n<p><span style=\"font-weight: 400;\">Fixed-width text files are text files that are formatted with columns and each column has an absolute length. Each row in a fixed-width text file must adhere to the format for column widths. Column widths themselves can vary amongst each other in width. 
Unlike TSV or CSV files, fixed-width files do not use delimiter characters to separate columns, and every row has the same length.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To specify the column widths of a fixed-width document, use the fixed-width configuration option columnWidths, which takes an array of values.<\/span><\/p>\n<figure id=\"attachment_15002\" class=\"thumbnail wp-caption aligncenter\" style=\"width: 725px\"><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/FW_config.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15002 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/FW_config.png\" alt=\"columnWidths Configuration\" width=\"715\" height=\"237\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/FW_config.png 715w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/FW_config-300x99.png 300w\" sizes=\"(max-width: 715px) 100vw, 715px\" \/><\/a><figcaption class=\"caption wp-caption-text\">Config: columnWidths<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">The columnWidths array requires at least one item, and each column width must be at least one character.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The demo can be found under the <\/span><i><span style=\"font-weight: 400;\">fixed-width<\/span><\/i><span style=\"font-weight: 400;\"> folder. 
The example will search and mask the following:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Emails (EmailsMatcher): regular expression pattern matcher with a format-preserving encryption rule.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Social Security Numbers (SsnMatcher): regular expression pattern matcher with a redaction rule.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Below is a view of the original fixed-width text file followed by the masked result:<\/span><\/p>\n<figure id=\"attachment_14999\" class=\"thumbnail wp-caption aligncenter\" style=\"width: 503px\"><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_original.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14999 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_original.png\" alt=\"Original\" width=\"493\" height=\"64\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_original.png 493w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_original-300x39.png 300w\" sizes=\"(max-width: 493px) 100vw, 493px\" \/><\/a><figcaption class=\"caption wp-caption-text\">Original<\/figcaption><\/figure>\n<figure id=\"attachment_14998\" class=\"thumbnail wp-caption aligncenter\" style=\"width: 502px\"><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_masked.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14998 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_masked.png\" alt=\"Masked Result\" width=\"492\" height=\"65\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_masked.png 492w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/fw_masked-300x40.png 300w\" sizes=\"(max-width: 492px) 100vw, 492px\" \/><\/a><figcaption class=\"caption wp-caption-text\">Masked 
Result<\/figcaption><\/figure>\n<h5><strong>HL7 Files (v2)<\/strong><\/h5>\n<p>The Health Level Seven (HL7) standard was conceived to create a standard format for exchanging electronic health information. HL7 first appeared in the 1980s, and several versions have been released since then. The DarkShield-Files API now supports HL7 v2 in addition to HL7 v3, which was\u00a0<a href=\"https:\/\/github.com\/TeamIRI\/darkshield-api-demos\/tree\/master\/json-xml\">already supported<\/a>\u00a0by virtue of its XML format.<\/p>\n<p>An HL7 v2 message is strictly structured and is broken down into segments, composites (fields) within segments, and sub-composites (sub-fields). An HL7 message must always start with a message header (MSH) segment. The MSH segment contains information about the message, including the delimiters used in the message (e.g. \u2018|^~\\&amp;\u2019). Each line holds one segment and begins with a segment identifier (e.g. PID, MSA, NK1). Segments are split into composites (fields) by a composite delimiter, which in most cases is a pipe (|). Composites are in turn split into sub-composites by sub-composite delimiters.<\/p>\n<p>Usually the DarkShield<i>\u00a0Files API<\/i>\u00a0will search and mask an entire file based on what is found by the searchMatchers. That said, there are instances when a user may wish for masking to be more specific. For example, some requirements dictate that the patient&#8217;s name and the name of the next of kin be masked, but not the name of the primary care provider. 
In these situations, the DarkShield-Files API lets you target specific columns (fields) within specific rows (segments) for masking, instead of masking the entire document.<\/p>\n<p>To mask specific fields within specific rows, specify a Column Matcher in the file_search_context.<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextHl7.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15122 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextHl7.png\" alt=\"\" width=\"335\" height=\"650\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextHl7.png 335w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextHl7-155x300.png 155w\" sizes=\"(max-width: 335px) 100vw, 335px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><em>File Search Context of HL7<\/em><\/p>\n<p>When targeting a specific column in a segment, note that the column matcher syntax uses a pipe (|) to separate the segment identifier from the target column. 
This syntax applies regardless of whether the document itself uses a pipe as its field delimiter; it always identifies the column within the targeted segment that the API will mask.<\/p>\n<p>Below is a snapshot of an original HL7 message followed by the masked result:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15124 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7.png\" alt=\"\" width=\"886\" height=\"111\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7.png 886w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7-300x38.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7-768x96.png 768w\" sizes=\"(max-width: 886px) 100vw, 886px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><em>Original HL7 Document<\/em><\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7Mask.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15126 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7Mask.png\" alt=\"\" width=\"889\" height=\"111\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7Mask.png 889w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7Mask-300x37.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/hl7Mask-768x96.png 768w\" sizes=\"(max-width: 889px) 100vw, 889px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><em>Masked HL7 Document<\/em><\/p>\n<p>Below are links to an article and a demo covering HL7 v2 support in the DarkShield-Files API:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/masking-phi-in-hl7-and-x12-files\/\">article\u00a0<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/TeamIRI\/darkshield-api-demos\/tree\/master\/hl7\">demo<\/a><\/li>\n<\/ul>\n<p><strong>X12 
Files<\/strong><\/p>\n<p>X12 is a common format for electronic business documents, and is also used in the healthcare industry.\u00a0Every X12 document has a three-digit identifier that tells the receiver what information it contains (e.g. X12 835\u00a0Health Care Claim and Remittance Advice).<\/p>\n<p>An X12 message consists of segments, elements (fields) within segments, and composite elements (sub-fields) within elements. Each segment has a segment identifier and ends with a segment delimiter. Within the segment, element delimiters split the segment into elements. Elements can then be divided further into composite elements by a composite element delimiter.<\/p>\n<p>As a rule, an X12 message must always start with an Interchange Control Header (ISA segment). The ISA segment contains information about the message and, at its end, a list of the delimiters that will be present in the message. Directly following the ISA segment is the Functional Group Header segment (GS), which is also called the inner envelope.<\/p>\n<p>Usually the DarkShield<i>\u00a0Files API<\/i>\u00a0will search and mask an entire file based on what is found by the searchMatchers. That said, there are instances when a user may wish for masking to be more specific. 
In these situations, the DarkShield-Files API lets you target specific columns (elements) within specific rows (segments) for masking, instead of masking the entire document.<\/p>\n<p>To mask specific fields within specific rows, specify a Column Matcher in the file_search_context.<\/p>\n<p style=\"text-align: left;\"><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextX12.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15123 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextX12.png\" alt=\"\" width=\"308\" height=\"559\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextX12.png 308w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/11\/fileSearchContextX12-165x300.png 165w\" sizes=\"(max-width: 308px) 100vw, 308px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><em>File Search Context of X12<\/em><\/p>\n<p>When targeting a specific column in a segment, note that the column matcher syntax uses an asterisk (*) to separate the segment identifier from the target column. 
This syntax applies regardless of whether the document itself uses an asterisk as its field delimiter; it always identifies the column within the targeted segment that the API will mask.<\/p>\n<p>Below is a snapshot of an unaltered X12 835 message followed by the masked result:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-data.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15096 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-data.png\" alt=\"\" width=\"866\" height=\"539\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-data.png 866w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-data-300x187.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-data-768x478.png 768w\" sizes=\"(max-width: 866px) 100vw, 866px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><em>Original X12 835 Document<\/em><\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-masked.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15097 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-masked.png\" alt=\"\" width=\"863\" height=\"544\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-masked.png 863w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-masked-300x189.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/x12-835-masked-768x484.png 768w\" sizes=\"(max-width: 863px) 100vw, 863px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><em>Masked X12 835 Document<\/em><\/p>\n<p>Below are links to an article and a demo covering X12 support in the DarkShield-Files API:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/masking-phi-in-hl7-and-x12-files\/\">article\u00a0<\/a><\/li>\n<li><a 
href=\"https:\/\/github.com\/TeamIRI\/darkshield-api-demos\/tree\/master\/x12\">demo<\/a><\/li>\n<\/ul>\n<h5><b>Semi-structured Files (.xml, .json)<\/b><\/h5>\n<p>Semi-structured files are a popular format for storing unstructured data because they are more flexible than traditional tabular or relational structures. The Files API can handle semi-structured files with arbitrary nesting, and can search and mask free-form text found within JSON values or XML tags and attributes.<\/p>\n<p>In some cases, you may also wish to mask a value based on the name of its key in JSON files, or its tag or attribute in XML files, similar to FieldShield\u2019s approach to masking. You can also use JSON\/XML paths to filter which values a Search Context will search.<\/p>\n<p>To achieve the first use case, DarkShield supports the creation of <i>JSON Path Matchers<\/i> and <i>XML Path Matchers <\/i>that match the entire value of keys that match a JSON path in JSON documents, and of tags and attributes that match an XML path in XML documents:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-14.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14269 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-14-1024x252.png\" alt=\"\" width=\"650\" height=\"160\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-14-1024x252.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-14-300x74.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-14-768x189.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-14.png 1600w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-15.png\"><img loading=\"lazy\" 
decoding=\"async\" class=\" wp-image-14270 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-15-1024x252.png\" alt=\"\" width=\"650\" height=\"160\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-15-1024x252.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-15-300x74.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-15-768x189.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-15.png 1600w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>If the JSON\/XML path resolves to a sub-tree, all values under the sub-tree will be masked.<\/p>\n<p>Similarly, <i>JSON Path Filters <\/i>and <i>XML Path Filters <\/i>can be attached to a Search Context Matcher. Instead of matching on the entire value, the Search Context will be used to search and mask portions of the free-form text found within:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-16.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14271 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-16-1024x424.png\" alt=\"\" width=\"650\" height=\"269\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-16-1024x424.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-16-300x124.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-16-768x318.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-16.png 1600w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>Note that as of this writing, using XML Path Matchers or Filters will force DarkShield to load the entire XML file into memory in order to extract the correct tags and 
attributes. The JSON Path Matchers and Filters process JSON in a streaming fashion, but currently do not support array filters (for example, <i>$..name[2:]<\/i>).<\/p>\n<p>There are currently no parsing configuration options for either JSON or XML in the File Search Context. However, in the File Mask Context you can use the <i>prettyPrint <\/i>parameter to specify whether the JSON output should be pretty-printed (properly indented):<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-17.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14272 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-17-1024x170.png\" alt=\"\" width=\"651\" height=\"108\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-17-1024x170.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-17-300x50.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-17-768x127.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-17.png 1600w\" sizes=\"(max-width: 651px) 100vw, 651px\" \/><\/a><\/p>\n<p>By default, masked JSON documents are written out in a compact format, with any extraneous whitespace removed. XML is written out in the same format as the input.<\/p>\n<p>The demo can be found under the <i>json-xml<\/i> folder. 
The example will search and mask the following:<\/p>\n<ol>\n<li aria-level=\"1\">Emails (EmailMatcher): Found using a regular expression and masked using a hashing function.<\/li>\n<li aria-level=\"1\">Phone Numbers (PhoneMatcher): Found using a regular expression and masked using format-preserving encryption.<\/li>\n<li aria-level=\"1\">Names (NameMatcher): Found using a Named Entity Recognition (NER) model and using format-specific JSON\/XML paths (all names can be found in the &#8216;name\u2019 key\/tag, regardless of nesting).<\/li>\n<\/ol>\n<p>Here is a snippet of the original JSON file along with the masked result:<\/p>\n<p>Original:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-18.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14273 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-18-1024x518.png\" alt=\"\" width=\"650\" height=\"329\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-18-1024x518.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-18-300x152.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-18-768x388.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-18.png 1456w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-19.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14274 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-19-1024x344.png\" alt=\"\" width=\"649\" height=\"218\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-19-1024x344.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-19-300x101.png 
300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-19-768x258.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-19.png 1516w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>And here is a snippet of the original XML file along with the masked result:<\/p>\n<p>Original:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-20.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14247 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-20-1024x413.png\" alt=\"\" width=\"650\" height=\"262\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-20-1024x413.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-20-300x121.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-20-768x310.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-20.png 1315w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-21.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14248 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-21-1024x319.png\" alt=\"\" width=\"652\" height=\"203\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-21-1024x319.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-21-300x94.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-21-768x240.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-21.png 1110w\" sizes=\"(max-width: 652px) 100vw, 652px\" 
\/><\/a><\/p>\n<h5><b>Microsoft Office Documents (Excel, Word, PowerPoint)<\/b><\/h5>\n<p>DarkShield supports searching and masking PII found inside Word documents, Excel spreadsheets, and PowerPoint presentations. This includes the older binary (<i>.doc, <\/i><i>.xls, .ppt<\/i>) as well as the newer OOXML (<i>.docx, <\/i><i>.xlsx, and .pptx<\/i>) formats. In addition to text found inside the pages, sheets, or slides, DarkShield can also search and mask certain embedded objects, such as charts and images.<\/p>\n<p>Due to certain limitations of the older binary formats and the lack of official documentation, DarkShield does not support the full range of capabilities for <i>.doc, <\/i><i>.xls, and .ppt <\/i>that are present in their OOXML counterparts. In particular, DarkShield cannot search and mask embedded charts within <i>.doc <\/i>and <i>.xls <\/i>files, embedded images in <i>.doc <\/i>files can be searched but not masked, and notes in <em>.ppt\u00a0<\/em>files cannot be masked.<\/p>\n<p>Due to their internal structures, the older binary formats also have to be loaded fully into memory in order to be read and modified. This is in contrast to the newer formats, which are streamed, meaning that DarkShield can handle them more memory-efficiently.<\/p>\n<p>This should generally not be a problem, since the older formats have hard limits on their file size, but it is important to keep in mind when processing multiple files at the same time to avoid running out of memory.<\/p>\n<p>DarkShield provides additional support for finding PII in spreadsheets through its <i>ExcelCellMatcher <\/i>and <i>ExcelCellFilter<\/i>. With these, DarkShield can match or filter on rows or columns of data with a pattern-matched header stored as a value inside a cell.<\/p>\n<p>A regular expression can also be specified to match on specific ranges of cell addresses. 
These matchers and filters can also be restricted to specific sheets in the file by using a regular expression pattern on the sheet name or a range of sheet indexes:<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/excelCellMatcher.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14653 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/excelCellMatcher.png\" alt=\"excel cell matcher spec\" width=\"649\" height=\"322\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/excelCellMatcher.png 1829w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/excelCellMatcher-300x149.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/excelCellMatcher-768x381.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/excelCellMatcher-1024x508.png 1024w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>To see some of these capabilities in action, refer to the <i>microsoft-excel-and-word <\/i>demo project, which contains a collection of Office documents.<\/p>\n<p>Both Word documents contain a contrived bank statement with information about the account holder. 
The <i>.docx <\/i>version also contains a table and a graph detailing the account holder\u2019s balance over the span of three months:<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-report-docx.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-14656\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-report-docx.png\" alt=\"Bank Report docx\" width=\"649\" height=\"312\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-docx.png 1620w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-docx-300x144.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-docx-768x369.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-docx-1024x492.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-docx-730x350.png 730w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p><i>buyers.xls <\/i>contains a standard tabular sheet with a mix of structured data and free-form text (note the <i>mailto<\/i> hyperlinks that are stored alongside the emails):<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14657 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls.png\" alt=\"Buyers xls\" width=\"649\" height=\"203\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls.png 2307w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-300x94.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-768x240.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-1024x320.png 1024w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>Finally, <i>Bank Report.xlsx <\/i>contains the same tabular data and embedded graph that are present in <i>Bank Statement.xlsx<\/i>:<\/p>\n<p><a 
href=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14658 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx.png\" alt=\"Bank Report xlsx\" width=\"649\" height=\"211\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx.png 1954w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-300x97.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-768x249.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-1024x332.png 1024w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>In addition to the standard regular expression pattern matching and named entity recognition used in other examples, we will also exploit the structure of the <i>buyers.xls <\/i>spreadsheet to match on the account number using a cell address pattern matching on column B, as well as a cell value pattern match on the <i>Name <\/i>column.<\/p>\n<p>Below are the masked versions of these documents:<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14659 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked.png\" alt=\"Bank Statement docx Masked\" width=\"649\" height=\"311\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked.png 1628w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked-300x144.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked-768x368.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked-1024x491.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked-1110x530.png 1110w, 
https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-statement-docx-masked-730x350.png 730w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p><i>buyers.xls:<\/i><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-masked.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14660 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-masked.png\" alt=\"\" width=\"649\" height=\"183\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-masked.png 2306w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-masked-300x85.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-masked-768x217.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/buyers-xls-masked-1024x289.png 1024w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p><i>Bank Report.xlsx:<\/i><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-masked.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14661 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-masked.png\" alt=\"\" width=\"649\" height=\"183\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-masked.png 1953w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-masked-300x85.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-masked-768x217.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/bank-report-xlsx-masked-1024x289.png 1024w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>There are several things to note from the masked examples:<\/p>\n<ol>\n<li aria-level=\"1\">A random offset was added to the balances in both documents, meaning that the numbers will not necessarily be consistent between different documents or separate demo runs. 
There are ways to mask these values more consistently (for example, format-preserving encryption).<\/li>\n<li aria-level=\"1\">The masked balances are stored as string data rather than as floating-point numbers with a currency format (the internal representation within Excel). This means that the chart will no longer display information, since the balances are no longer numbers. Future additions to DarkShield may allow masked string data to be converted to a formatted floating-point value under certain circumstances.<\/li>\n<li aria-level=\"1\">Note how column B in <i>Bank Report.xlsx <\/i>is not matched on and masked as an account number. This is due to the sheet name filter that was also added to the account number Excel cell matcher, which limited its matches to cells found in the <i>buyers <\/i>sheet.<\/li>\n<\/ol>\n<p>There is also a separate demo for\u00a0<em>powerpoint<\/em>.<\/p>\n<p>The PowerPoint demo looks for names, social security numbers, email addresses, and credit card numbers in the text, notes, tables, footers, and images of the slideshow.<\/p>\n<p>Compare the content of some of the original slides with the masked slides:<\/p>\n<figure id=\"\" class=\"thumbnail wp-caption alignnone style=\"width: 1034px\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/03\/Slide1-1024x576.png\" alt=\"Unmasked text\" width=\"1024\" height=\"576\" \/><figcaption class=\"caption wp-caption-text\">Unmasked text<\/figcaption><\/figure>\n<figure id=\"\" class=\"thumbnail wp-caption alignnone style=\"width: 1034px\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/03\/Slide1-1-1024x576.png\" alt=\"Masked text\" width=\"1024\" height=\"576\" \/><figcaption class=\"caption wp-caption-text\">Masked 
text<\/figcaption><\/figure>\n<figure id=\"\" class=\"thumbnail wp-caption alignnone style=\"width: 1034px\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/03\/Slide6-1024x576.png\" alt=\"Unmasked Embedded Image\" width=\"1024\" height=\"576\" \/><figcaption class=\"caption wp-caption-text\">Unmasked Embedded Image<\/figcaption><\/figure>\n<figure id=\"\" class=\"thumbnail wp-caption alignnone style=\"width: 1034px\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/03\/Slide6-1-1024x576.png\" alt=\"Masked Embedded Image\" width=\"1024\" height=\"576\" \/><figcaption class=\"caption wp-caption-text\">Masked Embedded Image<\/figcaption><\/figure>\n<figure id=\"\" class=\"thumbnail wp-caption alignnone style=\"width: 782px\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/03\/hashedEmails.gif\" alt=\"Unmasked and masked emails in a table\" width=\"772\" height=\"404\" \/><figcaption class=\"caption wp-caption-text\">Animation that compares unmasked emails within a table with masked emails.<\/figcaption><\/figure>\n<figure id=\"\" class=\"thumbnail wp-caption alignnone style=\"width: 1034px\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/03\/unmasked-notes-1024x463.png\" alt=\"Unmasked Notes\" width=\"1024\" height=\"463\" \/><figcaption class=\"caption wp-caption-text\">Unmasked Notes<\/figcaption><\/figure>\n<figure id=\"\" class=\"thumbnail wp-caption alignnone style=\"width: 1034px\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/03\/masked-notes-1024x485.png\" alt=\"Masked Notes\" width=\"1024\" height=\"485\" \/><figcaption class=\"caption wp-caption-text\">Masked Notes<\/figcaption><\/figure>\n<h5 id=\"Parquet\"><strong>Parquet File Format<\/strong><\/h5>\n<p><span style=\"font-weight: 400;\">Apache Parquet is a columnar, compressed file format that is optimized for performance. Parquet files are often found in cloud storage because the format\u2019s optimizations reduce costs in cloud environments compared to CSV files.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While designed for rapid analytic querying and low disk space usage, Parquet is a complex binary format that is not easily readable, which may make it difficult to protect sensitive data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, the DarkShield Files API can search Parquet files for sensitive data and mask it. The Parquet file format allows for many data types and nested data structures; the DarkShield Files API is able to search and mask common primitive types such as strings, integers, bytes, etc., as well as multiple levels of nesting. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The implementation of Parquet file format support in the DarkShield Files API was designed with bulk usage in mind. 
Many Parquet files are quite large, but if the size of each row group is limited to a reasonable size (a typical recommended amount is no more than 128 MB), the maximum memory used will be closer to the size of the row group than the size of the entire file.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This demo can be found in the <\/span><i><span style=\"font-weight: 400;\">parquet <\/span><\/i><span style=\"font-weight: 400;\">demo folder <a href=\"https:\/\/github.com\/TeamIRI\/darkshield-api-demos\/tree\/master\/parquet\">here<\/a>.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The demo has two Parquet files, of which one is a \u2018flat\u2019 Parquet file which has a single field for each column, while the other file has nested fields in one column.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Viewing one of the Parquet files in a text editor reveals the format &#8211; some string text is visible, but there is a lot of binary encoding as well.<\/span><\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-153321.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-14665 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-153321-e1629813609177.png\" alt=\"Parquet View\" width=\"1121\" height=\"193\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-153321-e1629813609177.png 1121w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-153321-e1629813609177-300x52.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-153321-e1629813609177-768x132.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-153321-e1629813609177-1024x176.png 1024w\" sizes=\"(max-width: 1121px) 100vw, 1121px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Looking at the same file in a Parquet file schema 
viewer shows this schema structure:<\/span><\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154316.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14667 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154316.png\" alt=\"Parquet Schema\" width=\"402\" height=\"605\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154316.png 402w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154316-199x300.png 199w\" sizes=\"(max-width: 402px) 100vw, 402px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">This is the Parquet file with nested fields in one column.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">After using Python to run the main.py script inside the parquet demo folder, the resulting masked version of the previously shown file is shown here in a Parquet viewer:<\/span><\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154253.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-14666 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154253.png\" alt=\"Parquet Masked Data\" width=\"1340\" height=\"128\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154253.png 1340w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154253-300x29.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154253-768x73.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/Screenshot-2021-08-23-154253-1024x98.png 1024w\" sizes=\"(max-width: 1340px) 100vw, 1340px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Note that:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The 
names and the Social Security number have been encrypted with format-preserving encryption.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Certain other data that could be classified as sensitive (such as age, credit card number, and address) have been purposely left alone in this example.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Users can classify data however they choose, so that only the values that need masking are masked and the parts of the data deemed non-sensitive remain usable.<\/span><\/li>\n<\/ul>\n<h5><b>DICOM Files<\/b><\/h5>\n<p><span style=\"font-weight: 400;\">DICOM, or Digital Imaging and Communications in Medicine, is a standard for the communication and management of medical imaging information and related data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">DICOM is implemented in almost every radiology, cardiology imaging, and radiotherapy device (X-ray, CT, MRI, ultrasound, etc.), and increasingly in devices in other medical domains such as ophthalmology and dentistry.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">DICOM defines individual files (typically having a .dcm file extension) with a binary structure made up of a header followed by a data set, which is a list of attributes. The attributes include information about the scan, such as the patient name. The final attribute is the actual pixel data (imagery) of the scan.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">DarkShield can search and mask sensitive attributes in a DICOM file.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Pixel data is just one of many attributes that may be contained in a DICOM file, and is separate from other attributes that may contain key- or quasi-identifiers such as patient name, date of birth, and hospital name. 
DarkShield searches all attributes that are not part of the pixel data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Additionally, a series of black boxes may be specified in a File Mask Context to redact known regions of pixel data that may contain sensitive information in burned-in text. The height, width, and X and Y coordinates of each black box can be specified in the configuration.<\/span><\/p>\n<p>The <a href=\"https:\/\/github.com\/TeamIRI\/darkshield-api-demos\/tree\/master\/dicom\">example<\/a> below shows patient information in a DICOM file being de-identified by the DarkShield-Files API. The file contains burned-in text with the patient&#8217;s name and other sensitive information. DICOM files also have attributes that identify the scan; these are not part of the pixel data, but are overlaid onto the image when it is displayed in a DICOM viewer. A black box was defined in the DarkShield-Files API configuration to remove the burned-in text from view. The attributes that are overlaid onto the image, but are not actually part of it, were searched and masked based on their text contents. 
The results of a DICOM file being de-identified with DarkShield are shown below.<\/p>\n<p>Original:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_original_file.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-15078 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_original_file.png\" alt=\"\" width=\"1110\" height=\"579\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_original_file.png 1110w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_original_file-300x157.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_original_file-768x401.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_original_file-1024x534.png 1024w\" sizes=\"(max-width: 1110px) 100vw, 1110px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_black_box_burned_in_text_redacted.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-15077 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_black_box_burned_in_text_redacted.png\" alt=\"\" width=\"1110\" height=\"579\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_black_box_burned_in_text_redacted.png 1110w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_black_box_burned_in_text_redacted-300x157.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_black_box_burned_in_text_redacted-768x401.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/10\/dicom_black_box_burned_in_text_redacted-1024x534.png 1024w\" sizes=\"(max-width: 1110px) 100vw, 1110px\" \/><\/a><\/p>\n<h5><b>PDF and Image Files<\/b><\/h5>\n<p>DarkShield can also process unstructured data inside of PDF and image files. 
DarkShield will use <i>Optical Character Recognition (OCR) <\/i>to extract the text from the file to perform the search. DarkShield can also handle images inside PDF documents.<\/p>\n<p>The PDF configuration options inside the File Search Context deal mostly with memory efficiency vs. performance tradeoffs:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-22.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14249 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-22-1024x473.png\" alt=\"\" width=\"649\" height=\"300\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-22-1024x473.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-22-300x139.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-22-768x355.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-22.png 1110w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>For PDFs that contain a large number of unique images, the <i>disableImageCaching <\/i>option can reduce DarkShield\u2019s memory usage. In addition, you may choose not to search through images if you know that they do not contain any PII, by setting the <i>disableImageProcessing <\/i>option.<\/p>\n<p>Finally, you can also set the <i>maxMainMemoryBytes <\/i>option, which limits how much of the document DarkShield will keep in memory at any given time. By default, DarkShield will load the entire document in memory for higher performance at the cost of more memory usage. 
All 3 options can also be set in the File Mask Context.<\/p>\n<p>The image configuration options inside the File Search Context are related to tweaking its OCR engine:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-23.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14250 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-23-1024x467.png\" alt=\"\" width=\"649\" height=\"296\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-23-1024x467.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-23-300x137.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-23-768x350.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-23.png 1600w\" sizes=\"(max-width: 649px) 100vw, 649px\" \/><\/a><\/p>\n<p>DarkShield utilizes <i>Tesseract<\/i> models in order to perform OCR. By default, DarkShield will download the <a href=\"https:\/\/github.com\/tesseract-ocr\/tessdata\">necessary models<\/a> for you if they do not already exist and place them inside of the <i>tessdata <\/i>folder inside of the API\u2019s installation directory &#8212; unless a different path is specified in the <i>tessDataPath<\/i> option.<\/p>\n<p>You can also specify the <i>language <\/i>that the OCR engine will attempt to read from the image. 
Multiple languages can be specified by separating them with the plus (\u2018+\u2019) character (for example, \u201ceng+hin+ita\u201d).<\/p>\n<p>If you are familiar with the Tesseract engine, you can also set individual configuration parameters in the <i>tessConfigVariables<\/i> dictionary.<\/p>\n<p>DarkShield can insert masked text into a PDF, provided that two conditions hold:<\/p>\n<ol>\n<li aria-level=\"1\">The length of the replacement text is less than or equal to the length of the original. <span style=\"font-weight: 400;\">This condition rules out certain masking operations, such as hashing, non-format-preserving encryption, and random pseudonym replacement.<\/span> Note that the length is measured in characters, not by the widths of the glyphs as they are drawn in the PDF.<\/li>\n<li aria-level=\"1\">The replacement text can be encoded using a standard font available to DarkShield, or a font that is stored in the PDF.<\/li>\n<\/ol>\n<p>By default, if either condition fails, DarkShield will generate a failed masking result and leave the original annotation unredacted. 
The File Mask Context also provides options for how to handle these issues by specifying the <i>onEncodingError <\/i>and\/or <i>onTextOverflow <\/i>options:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-24.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14251 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-24-1024x550.png\" alt=\"\" width=\"650\" height=\"349\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-24-1024x550.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-24-300x161.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-24-768x412.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-24.png 1415w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">An alternative for random pseudonym replacement for PDF and images is a length preserve pseudonym replacement rule. 
This rule is specific to DarkShield and allows for random replacement of values while preserving the length of the text.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The length-preserving pseudonym replacement operation requires either a set file or a directory containing set files sorted by the length of their text values.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Syntax for providing a directory:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-16370 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/directoryPath-1.png\" alt=\"\" width=\"717\" height=\"271\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/directoryPath-1.png 717w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/directoryPath-1-300x113.png 300w\" sizes=\"(max-width: 717px) 100vw, 717px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>Syntax for providing a single set file:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-16371 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/setPath-1.png\" alt=\"\" width=\"543\" height=\"271\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/setPath-1.png 543w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/setPath-1-300x150.png 300w\" sizes=\"(max-width: 543px) 100vw, 543px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">When using this length-preserving pseudonym replacement operation, it is recommended to enable the \u201cprettyTextReplacement\u201d option in <\/span><i><span style=\"font-weight: 400;\">FileMaskContext<\/span><\/i><span style=\"font-weight: 400;\">. DarkShield will then attempt to shift the words that follow the replacement text to the right, according to the difference in total character width between the original and replacement text. 
This matters because two words with the same number of letters can have different rendered widths, depending on their individual characters, so a replacement word may spill over and overlap adjacent words.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Example:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Text 1: Hello -&gt; 5 letters.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Text 2: MMMMM -&gt; 5 letters.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-16367 size-full aligncenter\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/prettyTextReplacement.png\" alt=\"\" width=\"687\" height=\"201\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/prettyTextReplacement.png 687w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/prettyTextReplacement-300x88.png 300w\" sizes=\"(max-width: 687px) 100vw, 687px\" \/><\/p>\n<p style=\"text-align: center;\"><em>FileMaskContext &#8211; prettyTextReplacement<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-16373 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/12\/masking_rule.png\" alt=\"\" width=\"805\" height=\"729\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/12\/masking_rule.png 805w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/12\/masking_rule-300x272.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2022\/12\/masking_rule-768x695.png 768w\" sizes=\"(max-width: 805px) 100vw, 805px\" \/><\/p>\n<p style=\"text-align: center;\"><em>Extra documentation on masking rule parameters.<\/em><\/p>\n<p>A demo for PDFs and images can be found under the <i>pdf-image<\/i> folder. Both the PDF and the image show a form containing a name and two Social Security numbers. In this example, DarkShield redacts the first five digits of the Social Security numbers and applies format-preserving encryption to the name. 
Note that for the image masking, a black box redaction is applied instead.<\/p>\n<p>Here is a snippet of the original PDF file along with the masked result:<\/p>\n<p>Original:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-25.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14252 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-25-791x1024.png\" alt=\"\" width=\"650\" height=\"841\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-25-791x1024.png 791w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-25-232x300.png 232w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-25-768x994.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-25.png 816w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-26.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14253 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-26-793x1024.png\" alt=\"\" width=\"650\" height=\"839\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-26-793x1024.png 793w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-26-232x300.png 232w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-26-768x991.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-26.png 856w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>And here is a snippet of the original image file along with the masked result:<\/p>\n<p>Original:<\/p>\n<p><a 
href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-27.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-14254 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-27-791x1024.png\" alt=\"\" width=\"650\" height=\"841\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-27-791x1024.png 791w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-27-232x300.png 232w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-27-768x994.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-27.png 816w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>Masked:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-28.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-14255 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-28-791x1024.png\" alt=\"\" width=\"650\" height=\"841\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-28-791x1024.png 791w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-28-232x300.png 232w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-28-768x994.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/darkshield-files-rpc-api-28.png 816w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>If you have any questions, please feel free to email <a href=\"mailto:darkshield@iri.com\">darkshield@iri.com<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The IRI DarkShield data masking tool features a self-hosted (on-premise) Remote Procedure Call (RPC) Application Programming Interface (API) for PII data masking in structured, semi-structured and unstructured files. 
This data masking API allows DarkShield to be easily embedded as middleware in a pipeline outside of IRI Workbench. Note this API is also leveraged in the<\/p>\n<div><a class=\"btn-filled btn\" href=\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\" title=\"IRI DarkShield-Files RPC API\">Read More<\/a><\/div>\n","protected":false},"author":112,"featured_media":14221,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[108,8,34],"tags":[1494,1496,1488,1388,1459,1490,1460],"class_list":["post-14244","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data-2","category-data-protection","category-business","tag-darkshield-api","tag-darkshield-rpc-api","tag-flat-file-masking","tag-iri-darkshield","tag-json-data-masking","tag-search-matcher","tag-xml-data-masking"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v23.4 (Yoast SEO v23.4) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>IRI DarkShield-Files RPC API - IRI<\/title>\n<meta name=\"description\" content=\"IRI DarkShield features a Remote Procedure Call (RPC) Application Programming Interface (API) for searching and masking unstructured files.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"IRI DarkShield-Files RPC API\" \/>\n<meta property=\"og:description\" content=\"IRI DarkShield features a Remote Procedure Call (RPC) Application 
Programming Interface (API) for searching and masking unstructured files.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\" \/>\n<meta property=\"og:site_name\" content=\"IRI\" \/>\n<meta property=\"article:published_time\" content=\"2021-01-12T21:02:37+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-09-15T12:38:14+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png\" \/>\n\t<meta property=\"og:image:width\" content=\"861\" \/>\n\t<meta property=\"og:image:height\" content=\"582\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Dmitry Kulakov, Devon Kozenieski and Adam Lewis\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Dmitry Kulakov, Devon Kozenieski and Adam Lewis\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"39 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\"},\"author\":{\"name\":\"Dmitry Kulakov, Devon Kozenieski and Adam Lewis\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb\"},\"headline\":\"IRI DarkShield-Files RPC API\",\"datePublished\":\"2021-01-12T21:02:37+00:00\",\"dateModified\":\"2025-09-15T12:38:14+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\"},\"wordCount\":5911,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png\",\"keywords\":[\"Darkshield API\",\"DarkShield RPC API\",\"flat file masking\",\"IRI DarkShield\",\"JSON data masking\",\"search matcher\",\"XML data masking\"],\"articleSection\":[\"Big Data\",\"Data Masking\/Protection\",\"IRI Business\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\",\"url\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\",\"name\":\"IRI DarkShield-Files RPC API - 
IRI\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png\",\"datePublished\":\"2021-01-12T21:02:37+00:00\",\"dateModified\":\"2025-09-15T12:38:14+00:00\",\"description\":\"IRI DarkShield features a Remote Procedure Call (RPC) Application Programming Interface (API) for searching and masking unstructured files.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png\",\"width\":861,\"height\":582},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.iri.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"IRI DarkShield-Files RPC API\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.iri.com\/blog\/#website\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"name\":\"IRI\",\"description\":\"Total Data Management 
Blog\",\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.iri.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\",\"name\":\"IRI\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"width\":750,\"height\":206,\"caption\":\"IRI\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\"}},[{\"@type\":[\"Person\"],\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb\",\"name\":\"Dmitry Kulakov\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/\",\"inLanguage\":\"en_US\",\"url\":\"\",\"caption\":\"Dmitry Kulakov\"}},{\"@type\":[\"Person\"],\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb\",\"name\":\"Devon Kozenieski\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/\",\"inLanguage\":\"en_US\",\"url\":\"\",\"caption\":\"Devon Kozenieski\"}},{\"@type\":[\"Person\"],\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb\",\"name\":\"Adam Lewis\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/\",\"inLanguage\":\"en_US\",\"url\":\"\",\"caption\":\"Adam Lewis\"}}]]}<\/script>\n<!-- 
\/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"IRI DarkShield-Files RPC API - IRI","description":"IRI DarkShield features a Remote Procedure Call (RPC) Application Programming Interface (API) for searching and masking unstructured files.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/","og_locale":"en_US","og_type":"article","og_title":"IRI DarkShield-Files RPC API","og_description":"IRI DarkShield features a Remote Procedure Call (RPC) Application Programming Interface (API) for searching and masking unstructured files.","og_url":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/","og_site_name":"IRI","article_published_time":"2021-01-12T21:02:37+00:00","article_modified_time":"2025-09-15T12:38:14+00:00","og_image":[{"width":861,"height":582,"url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png","type":"image\/png"}],"author":"Dmitry Kulakov, Devon Kozenieski and Adam Lewis","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Dmitry Kulakov, Devon Kozenieski and Adam Lewis","Est. 
reading time":"39 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#article","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/"},"author":{"name":"Dmitry Kulakov, Devon Kozenieski and Adam Lewis","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb"},"headline":"IRI DarkShield-Files RPC API","datePublished":"2021-01-12T21:02:37+00:00","dateModified":"2025-09-15T12:38:14+00:00","mainEntityOfPage":{"@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/"},"wordCount":5911,"commentCount":0,"publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png","keywords":["Darkshield API","DarkShield RPC API","flat file masking","IRI DarkShield","JSON data masking","search matcher","XML data masking"],"articleSection":["Big Data","Data Masking\/Protection","IRI Business"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/","url":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/","name":"IRI DarkShield-Files RPC API - 
IRI","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage"},"image":{"@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png","datePublished":"2021-01-12T21:02:37+00:00","dateModified":"2025-09-15T12:38:14+00:00","description":"IRI DarkShield features a Remote Procedure Call (RPC) Application Programming Interface (API) for searching and masking unstructured files.","breadcrumb":{"@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#primaryimage","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png","width":861,"height":582},{"@type":"BreadcrumbList","@id":"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-files-rpc-api\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.iri.com\/blog\/"},{"@type":"ListItem","position":2,"name":"IRI DarkShield-Files RPC API"}]},{"@type":"WebSite","@id":"https:\/\/www.iri.com\/blog\/#website","url":"https:\/\/www.iri.com\/blog\/","name":"IRI","description":"Total Data Management 
Blog","publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.iri.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.iri.com\/blog\/#organization","name":"IRI","url":"https:\/\/www.iri.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","width":750,"height":206,"caption":"IRI"},"image":{"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/"}},[{"@type":["Person"],"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb","name":"Dmitry Kulakov","image":{"@type":"ImageObject","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/","inLanguage":"en_US","url":"","caption":"Dmitry Kulakov"}},{"@type":["Person"],"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb","name":"Devon Kozenieski","image":{"@type":"ImageObject","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/","inLanguage":"en_US","url":"","caption":"Devon Kozenieski"}},{"@type":["Person"],"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6434d748d01ce766d6a2ff576d747cfb","name":"Adam Lewis","image":{"@type":"ImageObject","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/","inLanguage":"en_US","url":"","caption":"Adam 
Lewis"}}]]}},"jetpack_featured_media_url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2021\/01\/DarkShield-files-api-diagram.png","_links":{"self":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/14244"}],"collection":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/users\/112"}],"replies":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/comments?post=14244"}],"version-history":[{"count":38,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/14244\/revisions"}],"predecessor-version":[{"id":18593,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/14244\/revisions\/18593"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media\/14221"}],"wp:attachment":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media?parent=14244"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/categories?post=14244"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/tags?post=14244"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}