Managing and Masking SharePoint Data with Voracity
This article explains how to connect with, and use data from, SharePoint sites — using the file system via OneDrive — for operations in IRI Workbench-supported data management software. That software includes the IRI Voracity platform and its component products: CoSort, NextForm, RowGen, FieldShield, DarkShield, and CellShield EE.
IRI data management and protection software products work with a wide range of data sources and formats. However, accessing data in SharePoint directories has heretofore been impossible, as those files are hosted online.
More specifically, there was no way to access them without downloading them to the file system unless the file data were streamed. That precluded manipulating data in file formats like .XLSX, the now prevalent Microsoft Excel format, since non-textual file formats are difficult to modify directly as streams.
However, there is another way to access SharePoint files directly from the file system while still syncing immediately with the online SharePoint site files. By logging into SharePoint and selecting the Sync button on the Documents page, you can set up a OneDrive hook between SharePoint and the local file system.
What this means is that IRI products can access all SharePoint site files chosen to be synced, and process that data virtually in real time [1]. Supported activities include data discovery, integration (ETL), migration, governance (clean, mask, test), and analytics.
The details of this connection method are described below.
Steps
1. To begin, sign in to the SharePoint site through your Microsoft account. You will likely receive a request for two-factor authentication (most easily handled with the Microsoft Authenticator app).
I signed in through the Microsoft 365 developer portal, where I have a development SharePoint instance I can access; i.e.,
https://developer.microsoft.com/en-us/microsoft-365/dev-program
2. Once signed in to the account, select the SharePoint site to access. In the image below, I am accessing the SharePoint site “Test”:
3. Once the SharePoint site has been accessed, a view similar to this should appear. Next, click on “Documents” from the menu on the left side of the screen.
4. This allows you to browse the folders and files of the SharePoint site. To set up a sync between the local file system and SharePoint, select “Sync” from the middle menu, which will bring up a prompt such as this.
5. A few prompts will walk you through the setup of OneDrive sync with the SharePoint site. You can choose where the root folder of the site is placed into your file system. Also, folders can be selectively synced; in other words, not all the files on the site need to take up space in the file system if they are not selected to be synced.
Once the setup is complete, you should be able to view the selected files from SharePoint in File Explorer, and they will be synced whenever changes are made — either on the site or locally.
The Sync button needs to be clicked once for each site. After the first site has been set up, syncing any additional site is as simple as clicking the Sync button on that SharePoint site, with no other configuration needed.
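Because the synced site now looks like any other local folder, ordinary file system tools can see its contents. The following is a minimal Python sketch, not part of any IRI product, that lists the XLS and XLSX files under a synced folder. The folder path shown is a hypothetical example; the actual location depends on the choices made during OneDrive setup.

```python
from pathlib import Path

EXCEL_SUFFIXES = {".xls", ".xlsx"}


def find_workbooks(sync_root):
    """Return the sorted paths of XLS/XLSX files under sync_root, recursively."""
    root = Path(sync_root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() in EXCEL_SUFFIXES
        and not p.name.startswith("~$")  # skip Excel's temporary lock files
    )


if __name__ == "__main__":
    # Hypothetical sync location; substitute the folder chosen during setup.
    sync_root = Path.home() / "Example Org" / "Test - Documents"
    for workbook in find_workbooks(sync_root):
        print(workbook)
```

A listing like this is only a quick check that the sync is in place before pointing a job wizard at the same folder.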
Now, the files can be accessed by any IRI product. Let’s start with CellShield Enterprise Edition (EE), a product which finds and masks sensitive data in XLS and XLSX files. CellShield EE uses the Dark Data Search/Masking job wizard in IRI Workbench to find the sensitive data to mask.
The OneDrive folder that syncs with SharePoint is now selectable from the file browser. In the Dark Data Search/Masking job wizard, choose to search XLS and XLSX files as needed.
If this menu is not visible, install the IRI DarkShield feature from “Install New Software” in IRI Workbench. Its toolbar menu icon is a charcoal-colored shield. Once it is installed, run the wizard and select the data source:
Next, set up data classes to match data based on search matchers like regular expressions, lookup sets (which list values to match against), or NER models. See this article on data classification for more information about this process.
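To illustrate the idea behind search matchers, here is a minimal Python sketch of two regular-expression matchers and one lookup-set matcher applied to individual values. It is only a conceptual illustration and not how DarkShield or CellShield EE are implemented; the class names, patterns, and lookup values are all hypothetical.

```python
import re

# Hypothetical data classes; patterns and lookup values are illustrative only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
LAST_NAMES = {"smith", "jones", "garcia"}  # stands in for a lookup set file


def classify(value):
    """Return the names of the data classes whose matcher fires on value."""
    text = str(value)
    matches = []
    if SSN_PATTERN.search(text):
        matches.append("US_SSN")
    if EMAIL_PATTERN.search(text):
        matches.append("EMAIL")
    # A lookup set matches when the whole value equals one of its entries.
    if text.strip().lower() in LAST_NAMES:
        matches.append("LAST_NAME")
    return matches


if __name__ == "__main__":
    for cell in ["123-45-6789", "jane.doe@example.com", "Smith", "hello"]:
        print(cell, classify(cell))
```

In practice, the patterns and lookup sets are defined once as data classes in IRI Workbench and reused across jobs rather than coded by hand.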
For DarkShield, rules can be set to mask data in a certain way based on the matcher used. With CellShield EE, the masking method is selected later when importing the “EIF” file, and can be based on either the data class matcher or a manual selection of EIF entries.
Run the resulting .search file from the Dark Data Search/Masking job wizard as an IRI search job to generate the .EIF file. Import it in CellShield EE to map search results to cells to mask.
Here is an example of the EIF file:
As you can see, the wizard was able to successfully identify sensitive data from these files in my SharePoint-synced OneDrive folder, which has the base name of “iricosort”.
Here are the dark data search results shown in IRI Workbench:
Both DarkShield and CellShield EE can mask data in XLS and XLSX files based on the search results. DarkShield supports many other formats, including Word documents, PowerPoint presentations, PDF and image files, plus unstructured text files.
DarkShield and CellShield are not the only IRI products that benefit from syncing SharePoint files locally. The sync also gives access to the SortCL engine and 4GL behind IRI CoSort, RowGen, NextForm, and FieldShield, and to Voracity operations (e.g., ETL) that include or combine them to process data in any structured format in SharePoint. SortCL also supports the XLS and XLSX file formats, as do the most recent versions of these products.
With this synced setup, many useful scenarios are possible that would not be as easy if files had to be manually downloaded and uploaded. For example, batch files can be scheduled to run at certain times to automatically update XLSX files with new live data, which will get synced directly into SharePoint to be shared with other users.
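As a rough illustration of such a scheduled job, the Python sketch below runs a command with the synced folder as its working directory and reports the exit code. Everything here is an assumption for illustration: the folder path is hypothetical, and the command is a harmless placeholder standing in for the actual job invocation, whose command line should be taken from the IRI product documentation.

```python
import subprocess
import sys
from pathlib import Path


def run_job(command, work_dir):
    """Run command inside work_dir and return (exit_code, standard_output)."""
    work_dir = Path(work_dir)
    if not work_dir.is_dir():
        raise FileNotFoundError(f"Synced folder not found: {work_dir}")
    result = subprocess.run(command, cwd=work_dir, capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Hypothetical sync location; substitute the folder chosen during setup.
    sync_folder = Path.home() / "Example Org" / "Test - Documents"
    # Placeholder command standing in for the real job invocation.
    placeholder = [sys.executable, "-c", "print('job placeholder ran')"]
    if sync_folder.is_dir():
        code, output = run_job(placeholder, sync_folder)
        print(f"exit code {code}: {output.strip()}")
    else:
        print(f"Synced folder not found: {sync_folder}")
```

A script like this could be registered with Windows Task Scheduler. Any files the job writes inside the synced folder are then picked up by OneDrive without a separate upload step.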
Setting up SharePoint site files to a file system directory via OneDrive is a simple yet robust way to access SharePoint site files with IRI software, and will work along with other sources as part of heterogeneous IRI data integration, cleansing, masking, reporting, and wrangling operations.
Contact support@iri.com if you need help processing files in SharePoint using this access method.
[1] The only delays are due to network speed. For example, a newly uploaded folder of large files may take a few seconds to a few minutes to sync, depending on network speed and file sizes. If a new file or folder of files uploaded to the SharePoint site has not been synced locally yet, OneDrive will be called on to sync the files when the SortCL-compatible job runs. The job will pause until the file has been synced to the local OneDrive folder. Once the job completes, any files it wrote within the OneDrive folder get synced immediately with the SharePoint site, again limited only by the upload speed of the network. Note that limits can also be intentionally placed on OneDrive upload and download speeds so as not to overload network bandwidth.