{"id":6011,"date":"2024-07-10T17:10:56","date_gmt":"2024-07-10T21:10:56","guid":{"rendered":"http:\/\/www.iri.com\/blog\/?p=6011"},"modified":"2024-09-19T16:41:16","modified_gmt":"2024-09-19T20:41:16","slug":"textual-etl","status":"publish","type":"post","link":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/","title":{"rendered":"Textual ETL: Unlocking Unstructured Data"},"content":{"rendered":"<h4><b>Synopsis<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Corporations and government agencies store a lot of useful information in non-transactional semi-structured and unstructured data sources. Finding that data \u2013 in documents, logs, and images \u2013 is important not only for data masking, but also for textual ETL.\u00a0\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Textual ETL lends structure and meaning to data that\u2019s hidden in text and prepares it for use in standard structured repositories like flat files, relational databases and Excel. When combined or joined to matching values in other structured sources, more information can be discovered and mined for the benefit of operational monitoring, marketing, trend analytics, law enforcement, etc.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Beyond enhancing data integration, these values can also be used in some test data scenarios. 
This article explains how to produce structure from values discovered during LAN or cloud <span id='easy-footnote-1-6011' class='easy-footnote-margin-adjust'><\/span><span class='easy-footnote'><a href='https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#easy-footnote-bottom-1-6011' title='DarkShield can also search (and mask) files in Amazon S3, Azure Blob, GCP, and SharePoint Online.'><sup>1<\/sup><\/a><\/span> <\/span><span style=\"font-weight: 400;\">file searches using the <\/span><a href=\"https:\/\/www.iri.com\/products\/darkshield\"><span style=\"font-weight: 400;\">IRI DarkShield<\/span><\/a><span style=\"font-weight: 400;\"> data masking tool. If you license the broader Voracity data management <\/span><a href=\"https:\/\/www.iri.com\/products\/voracity-platform\"><span style=\"font-weight: 400;\">platform<\/span><\/a><span style=\"font-weight: 400;\">, which includes DarkShield, you have the ETL and reporting facilities built in.<\/span><\/p>\n<h4><b>How this Works<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Most data in unstructured sources is difficult to parse, and the values need context and structure before they can be leveraged in data integration and reporting. 
Fortunately, the <\/span><a href=\"https:\/\/www.iri.com\/products\/workbench\/darkshield-gui\/file-masking\"><i><span style=\"font-weight: 400;\">New File Search\/Mask Job<\/span><\/i><span style=\"font-weight: 400;\"> wizard<\/span><\/a><span style=\"font-weight: 400;\"> in the <\/span><a href=\"https:\/\/www.iri.com\/products\/workbench\"><span style=\"font-weight: 400;\">IRI Workbench<\/span><\/a><span style=\"font-weight: 400;\"> GUI will extract \u2018dark data\u2019 values (and log their metadata) into a flat file, which is what will ultimately enable textual ETL, data fabrics, and other forms of analytics in the <\/span><a href=\"https:\/\/www.iri.com\/products\/voracity\"><span style=\"font-weight: 400;\">IRI Voracity<\/span><\/a><span style=\"font-weight: 400;\"> ecosystem, powered by <\/span><a href=\"https:\/\/www.iri.com\/products\/cosort\/sortcl\"><span style=\"font-weight: 400;\">SortCL<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To add value to the delimited log file, DarkShield also generates its field layouts automatically into a separate metadata file in IRI-standard <\/span><a href=\"https:\/\/www.iri.com\/products\/cosort\/sortcl-metadata\"><span style=\"font-weight: 400;\">data definition format (.DDF)<\/span><\/a><span style=\"font-weight: 400;\">. The results file and its metadata repository are easily used and re-used by IRI software to integrate, transform, migrate, mask, and report on that data, and feed it in other formats or to external applications as needed.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Note again that in Voracity, the CoSort <\/span><span style=\"font-weight: 400;\">SortCL<\/span><span style=\"font-weight: 400;\"> ETL engine can query and join over flat files directly, or facilitate the creation and population of tables with DBA-defined primary-foreign keys. 
In this way, dark data extracts can acquire form and relationships (structure) that make them far more useful.<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<h4><b>Creating the Search Job<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">A DarkShield Files Search Job will search every supported file type in every directory below the root network drive you specify. The scan through your dark data employs search methods you can align to your sensitive data types (e.g., names, email, IP and street addresses, credit card and NID numbers, medical conditions, etc.), or <\/span><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/iri-data-classification\/\"><i><span style=\"font-weight: 400;\">Data Classes<\/span><\/i><\/a><span style=\"font-weight: 400;\">.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can specify search matchers based on the <\/span><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/data-matchers\/\"><span style=\"font-weight: 400;\">content<\/span><\/a><span style=\"font-weight: 400;\"> of the data, fixed metadata (<\/span><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/location-matchers\/\"><span style=\"font-weight: 400;\">location<\/span><\/a><span style=\"font-weight: 400;\">), or both, depending on the situation. When looking across all file types, you might want to specify a combination of regular expression patterns, lookup set files, and Named Entity Recognition (NER) models in addition to (metadata) location matchers for data in structured files.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here is a list of file sources containing strings that the wizard can search, extract, and structure:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Free-form text (.txt)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Delimited Flat Files (.csv and .tsv)<\/span><\/li>\n<li style=\"font-weight: 
400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Fixed Width Files<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Microsoft Word documents (.doc and .docx)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Adobe Portable Document Format (.pdf)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Extensible Markup Language (.xml)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Microsoft Excel spreadsheets (.xls and .xlsx)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Microsoft PowerPoint presentations (.ppt and .pptx)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">JavaScript Object Notation files (.json)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Parquet (.parquet)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">DICOM image (.dicom)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Various image formats (.tiff, .jpeg, .png, .gif, .jp2, .jpx, .bmp)<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">You can create your DarkShield Files Search Job from the <\/span><i><span style=\"font-weight: 400;\">New File Search\/Masking Job<\/span><\/i><span style=\"font-weight: 400;\"> wizard:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-17693\" src=\"\/blog\/wp-content\/uploads\/2014\/06\/Create-DarkShield-Files-Search-Job-300x68.png\" alt=\"\" width=\"751\" height=\"170\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Create-DarkShield-Files-Search-Job-300x68.png 300w, 
https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Create-DarkShield-Files-Search-Job-768x173.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Create-DarkShield-Files-Search-Job.png 924w\" sizes=\"(max-width: 751px) 100vw, 751px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">To launch the wizard, select the DarkShield menu dropdown and select the <\/span><i><span style=\"font-weight: 400;\">New Files Search\/Masking Job<\/span><\/i><span style=\"font-weight: 400;\">\u2026 wizard. This brings up the first page where you can name the new job:<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-17694\" src=\"\/blog\/wp-content\/uploads\/2014\/06\/Specify-job-name-location-type-300x275.png\" alt=\"\" width=\"494\" height=\"453\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Specify-job-name-location-type-300x275.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Specify-job-name-location-type.png 530w\" sizes=\"(max-width: 494px) 100vw, 494px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Here you specify the folder and file names for the data discovery artifacts, since we only want to perform a <\/span><b>Search <\/b><span style=\"font-weight: 400;\">(vs. 
mask only or search and mask) job to extract data from files.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Click <\/span><i><span style=\"font-weight: 400;\">Next<\/span><\/i><span style=\"font-weight: 400;\"> to move into this data source specification (files to be masked) page of the wizard:<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-17698 aligncenter\" src=\"\/blog\/wp-content\/uploads\/2014\/06\/search-report-options-1-300x275.png\" alt=\"\" width=\"530\" height=\"485\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/search-report-options-1-300x275.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/search-report-options-1.png 534w\" sizes=\"(max-width: 530px) 100vw, 530px\" \/><\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here you can customize the content of the flat-file search log that would contain actual data. More specifically, you can also generate several metadata attributes of the files where PII was discovered.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These attributes will be displayed as columns in a flat text log file containing the values (and specified metadata) from the search operation. 
The default delimiter is a pipe (\u201c|\u201d) but you can change that.\u00a0<\/span><\/p>\n<p><i><span style=\"font-weight: 400;\">Note that the RESULT attribute contains the actual PII values found, so if you do not wish to persist (or make use of) the PII in the search report, do not select RESULT.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">If you want to leverage the textual ETL functionality described below, check the option to create a <\/span><i><span style=\"font-weight: 400;\">Data Definition Format (<\/span><\/i><a href=\"https:\/\/www.iri.com\/products\/cosort\/sortcl-metadata\"><i><span style=\"font-weight: 400;\">DDF<\/span><\/i><\/a><i><span style=\"font-weight: 400;\">) <\/span><\/i><span style=\"font-weight: 400;\">file for the report. The DDF metadata defines the field layouts of the flat report, which you can use in the <\/span><a href=\"https:\/\/www.iri.com\/products\/cosort\/sortcl\"><span style=\"font-weight: 400;\">SortCL<\/span><\/a><span style=\"font-weight: 400;\"> textual ETL process below.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">More specifically, the \/FIELD names in the DDF file will correspond to the keywords and patterns you searched, as well as the forensic attributes that you selected in this dialog to be part of that output log\/report.\u00a0<\/span><\/p>\n<p><i><span style=\"font-weight: 400;\">Note that DarkShield search jobs also produce another log in JSON format named annotations.json <\/span><\/i><span style=\"font-weight: 400;\">which DarkShield renders in HTML charts <\/span><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-pii-discovery-masking-charts\/\"><span style=\"font-weight: 400;\">like these<\/span><\/a><span style=\"font-weight: 400;\">. 
Machine-readable DarkShield logs can also be exported to third-party log analytics and action tools like <\/span><a href=\"https:\/\/www.iri.com\/blog\/business-intelligence\/datadog-security-analytics-darkshield\/\"><span style=\"font-weight: 400;\">Datadog<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/splunk-phantom-playbook-masking\/\"><span style=\"font-weight: 400;\">Splunk Phantom Playbooks<\/span><\/a><i><span style=\"font-weight: 400;\">.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">Continue through the wizard by repeatedly clicking <\/span><i><span style=\"font-weight: 400;\">Next<\/span><\/i><span style=\"font-weight: 400;\"> until you reach the page that prompts for the data source or sources on which data discovery will be performed. Then click Finish, and a DarkShield Search Job (.dsc) configuration file will be created.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For an in-depth look at the DarkShield Files Wizard and the different configuration options it supports (particularly for <\/span><a href=\"https:\/\/www.iri.com\/solutions\/data-masking\"><span style=\"font-weight: 400;\">data masking<\/span><\/a><span style=\"font-weight: 400;\">), see this <\/span><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/finding-and-masking-pii-in-files-with-the-darkshield-files-wizard\/\"><span style=\"font-weight: 400;\">article<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h4><b>Running the Search Job<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">To run a DarkShield Search Job, right-click on the .dsc file and select <\/span><i><span style=\"font-weight: 400;\">IRI &gt; Run Search Job<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-17697\" src=\"\/blog\/wp-content\/uploads\/2014\/06\/run-search-job-300x252.png\" alt=\"\" 
width=\"537\" height=\"451\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/run-search-job-300x252.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/run-search-job.png 699w\" sizes=\"(max-width: 537px) 100vw, 537px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">After running a Search Job, the search results and metadata are recorded inside files. Below is an extract of a sample comma-delimited DarkShield file-search log\/report from a scan of Word, Excel, PDF, and image files stored in a local folder:<\/span><\/p>\n<pre><span style=\"font-weight: 400;\">DATA_CLASS_NAME,<\/span><b>PII_RESULT<\/b><span style=\"font-weight: 400;\">,SPAN,OWNER,READ_ONLY,HIDDEN,DATE_CREATED,DATE_MODIFIED,DATE_ACCESSED,FILE_PATH,<\/span><b>FILE_TYPE<\/b>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"Holder\",\"8:14\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.855Z\",\"2022-10-19T16:04:24.822Z\",\"2022-10-21T21:32:30.039Z\",\"FRIDAY_DEMO\/input\/Bank%20Report.xlsx\",\"<\/span><b>openxmlformats-officedocume<\/b><span style=\"font-weight: 400;\">\"<\/span>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"Jane\",\"0:4\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.855Z\",\"2022-10-19T16:04:24.822Z\",\"2022-10-21T21:32:30.039Z\",\"FRIDAY_DEMO\/input\/Bank%20Report.xlsx\",\"openxmlformats-officedocume\"<\/span>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"Johnson\",\"5:12\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.855Z\",\"2022-10-19T16:04:24.822Z\",\"2022-10-21T21:32:30.039Z\",\"FRIDAY_DEMO\/input\/Bank%20Report.xlsx\",\"openxmlformats-officedocume\"<\/span>\r\n<span style=\"font-weight: 
400;\">\"DATE_US_MMDDYYYY\",\"04\/07\/2021\",\"0:10\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.855Z\",\"2022-10-19T16:04:24.822Z\",\"2022-10-21T21:32:30.039Z\",\"FRIDAY_DEMO\/input\/Bank%20Report.xlsx\",\"openxmlformats-officedocume\"<\/span>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"John\",\"0:4\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.859Z\",\"2022-10-19T16:04:24.798Z\",\"2022-10-21T21:32:30.046Z\",\"FRIDAY_DEMO\/input\/Bank%20Statement.docx\",\"openxmlformats-officedocume\"<\/span>\r\n<span style=\"font-weight: 400;\">\"CREDIT_CARD_DS\",\"5235-7345-1354-7345\",\"0:19\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.885Z\",\"2022-10-19T16:04:24.819Z\",\"2022-10-21T21:32:30.058Z\",\"FRIDAY_DEMO\/input\/buyers.xls\",\"<\/span><b>ms-excel<\/b><span style=\"font-weight: 400;\">\"<\/span>\r\n<span style=\"font-weight: 400;\">\"PIN_US\",\"421-55-7346\",\"0:11\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.885Z\",\"2022-10-19T16:04:24.819Z\",\"2022-10-21T21:32:30.058Z\",\"FRIDAY_DEMO\/input\/buyers.xls\",\"ms-excel\"<\/span>\r\n<span style=\"font-weight: 400;\">\"PHONE_DS\",\"(310)55-7445\",\"1:14\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.885Z\",\"2022-10-19T16:04:24.819Z\",\"2022-10-21T21:32:30.058Z\",\"FRIDAY_DEMO\/input\/buyers.xls\",\"ms-excel\"<\/span>\r\n<span style=\"font-weight: 400;\">\"EMAIL\",\"jhelly@gmail.com\",\"0:16\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.885Z\",\"2022-10-19T16:04:24.819Z\",\"2022-10-21T21:32:30.058Z\",\"FRIDAY_DEMO\/input\/buyers.xls\",\"ms-excel\"<\/span>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"Johnson\",\"0:7\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.885Z\",\"2022-10-19T16:04:24.819Z\",\"2022-10-21T21:32:30.058Z\",\"FRIDAY_DEMO\/input\/buyers.xls\",\"ms-excel\"<\/span>\r\n<span style=\"font-weight: 
400;\">\"FIRST_NAME\",\"Doe\",\"35:38:00\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.879Z\",\"2022-10-19T16:04:24.800Z\",\"2022-10-21T21:32:30.048Z\",\"FRIDAY_DEMO\/input\/bank_info.jpg\",\"<\/span><b>image\/jpeg<\/b><span style=\"font-weight: 400;\">\"<\/span>\r\n<span style=\"font-weight: 400;\">\"CREDIT_CARD_DS\",\"6011-4256-2332-1212\",\"60:18:00\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T14:27:25.879Z\",\"2022-10-19T16:04:24.800Z\",\"2022-10-21T21:32:30.048Z\",\"FRIDAY_DEMO\/input\/bank_info.jpg\",\"image\/jpeg\"<\/span>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"Bay\",\"947:35:00\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T19:21:47.202Z\",\"2022-10-20T19:06:32.105Z\",\"2022-10-21T21:32:30.076Z\",\"FRIDAY_DEMO\/input\/Sample%20Policy.pdf\",\"<\/span><b>application\/pdf<\/b><span style=\"font-weight: 400;\">\"<\/span>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"Ridge\",\"951:41:00\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T19:21:47.202Z\",\"2022-10-20T19:06:32.105Z\",\"2022-10-21T21:32:30.076Z\",\"FRIDAY_DEMO\/input\/Sample%20Policy.pdf\",\"application\/pdf\"<\/span>\r\n<span style=\"font-weight: 400;\">\"FIRST_NAME\",\"Jack\",\"1109:15:00\",\"DESKTOP-8NLA23I\\adaml\",\"FALSE\",\"FALSE\",\"2022-10-20T19:21:47.202Z\",\"2022-10-20T19:06:32.105Z\",\"2022-10-21T21:32:30.076Z\",\"FRIDAY_DEMO\/input\/Sample%20Policy.pdf\",<\/span><b>\"application\/pdf\"<\/b><\/pre>\n<h4><b>Using the Results<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Because we chose to create the DDF metadata for this log file, the data above can be leveraged in SortCL operations to report on (e.g, aggregate), extract, transform, or restructure the data within the search log as we demonstrate below. 
First, here is the DDF file layout for the log data above:<\/span><\/p>\n<h6><b>DDF File for Use in Script 1 below<\/b><\/h6>\n<pre># Generated with IRI Workbench - Discover Metadata\r\n# Author: chaitalim\r\n# Created: 2024-03-14 14:27:11\r\n\/FILE=InputFiles\/DelmitedDarkShieldSearchLogSample.csv\r\n# \/PROCESS=CSV\r\n\/FIELD=(DATA_CLASS_NAME, TYPE=ASCII, POSITION=1, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"DATA_CLASS_NAME\")\r\n\/FIELD=(PII_RESULT, TYPE=ASCII, POSITION=2, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"PII_RESULT\")\r\n\/FIELD=(SPAN, TYPE=ASCII, POSITION=3, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"SPAN\")\r\n\/FIELD=(OWNER, TYPE=ASCII, POSITION=4, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"OWNER\")\r\n\/FIELD=(READ_ONLY, TYPE=ASCII, POSITION=5, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"READ_ONLY\")\r\n\/FIELD=(HIDDEN, TYPE=ASCII, POSITION=6, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"HIDDEN\")\r\n\/FIELD=(DATE_CREATED, TYPE=ASCII, POSITION=7, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"DATE_CREATED\")\r\n\/FIELD=(DATE_MODIFIED, TYPE=ASCII, POS=8, SEP=\",\", FRAME=\"\\\"\", CDEF=\"DATE_MODIFIED\")\r\n\/FIELD=(DATE_ACCESSED, TYPE=ASCII, POS=9, SEP=\",\", FRAME=\"\\\"\", CDEF=\"DATE_ACCESSED\")\r\n\/FIELD=(FILE_PATH, TYPE=ASCII, POSITION=10, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"FILE_PATH\")\r\n\/FIELD=(FILE_TYPE, TYPE=ASCII, POSITION=11, SEPARATOR=\",\", FRAME=\"\\\"\", CDEF=\"FILE_TYPE\")<\/pre>\n<p><span style=\"font-weight: 400;\">You can leverage this metadata in the two-script (or single batch) SortCL job shown below to extract, transpose, and group the PII result data by data class in a flattened format for multiple uses:<\/span><\/p>\n<h6><b>SortCL Job Script 1 (Extract):\u00a0<\/b><\/h6>\n<pre><span style=\"font-weight: 400;\"># This job reads the file containing Data Class names in one column and a related value<\/span>\r\n<span style=\"font-weight: 400;\"># in another column (the PII) to output row and column data for input to 
transpose script.<\/span>\r\n\r\n<span style=\"font-weight: 400;\">\/INFILE=InputFiles\/<\/span><b>DelmitedDarkShieldSearchLogSample.csv<\/b>\r\n<span style=\"font-weight: 400;\">       \/PROCESS=CSV<\/span>\r\n<span style=\"font-weight: 400;\">       \/SPECIFICATION=metadata\/DarkShieldSearchLog.ddf # <\/span><span style=\"font-weight: 400;\">metadata shown above<\/span>\r\n\r\n<span style=\"font-weight: 400;\">\/SORT # to create a unique list of column names and row values<\/span>\r\n<span style=\"font-weight: 400;\">       \/KEY=DATA_CLASS_NAME<\/span>\r\n<span style=\"font-weight: 400;\">       \/KEY=PII_RESULT_OPTIONAL<\/span>\r\n\r\n<span style=\"font-weight: 400;\">#<\/span>      <span style=\"font-weight: 400;\">By default duplicate values for a Data Class will be excluded from the output.<\/span>\r\n<span style=\"font-weight: 400;\"># \u00a0 \u00a0 \u00a0If you need all values retained, then comment the line below, i.e # \/NODUPLICATES<\/span>\r\n<span style=\"font-weight: 400;\">\/NODUPLICATES<\/span>\r\n\r\n<span style=\"font-weight: 400;\"># This target will be the input to the transpose script<\/span>\r\n<span style=\"font-weight: 400;\">\/OUTFILE=WorkFiles\/<\/span><b>FileRowColumnData.csv<\/b>\r\n<span style=\"font-weight: 400;\">       \/PROCESS=RECORD<\/span>\r\n<span style=\"font-weight: 400;\">       \/FIELD=(ROW_NUM,POSITION=1, SEPARATOR=\",\" )<\/span>\r\n<span style=\"font-weight: 400;\">       \/FIELD=(DATA_CLASS_NAME, TYPE=ASCII, POSITION=2, SEPARATOR=\",\")<\/span>\r\n<span style=\"font-weight: 400;\">       \/COUNT ROW_NUM RUNNING BREAK DATA_CLASS_NAME<\/span>\r\n<span style=\"font-weight: 400;\">       \/FIELD=(PII_RESULT, TYPE=ASCII, POSITION=3, SEPARATOR=\",\", FRAME=\"\\\"\")\r\n<\/span><\/pre>\n<p>This produces a work file (which you should subsequently delete) needed in the next job, like this:<\/p>\n<pre><span style=\"font-weight: 400;\">1,ACC_NUMBER,\"1001001234\"<\/span>\r\n<span style=\"font-weight: 
400;\">2,ACC_NUMBER,\"123456\"<\/span>\r\n<span style=\"font-weight: 400;\">3,ACC_NUMBER,\"123456718735\"<\/span>\r\n<span style=\"font-weight: 400;\">1,CREDIT_CARD_DS,\"040392-5967562\"<\/span>\r\n<span style=\"font-weight: 400;\">2,CREDIT_CARD_DS,\"1134-6845-9545-3453\"<\/span>\r\n<span style=\"font-weight: 400;\">3,CREDIT_CARD_DS,\"120486-7863214\"<\/span>\r\n<span style=\"font-weight: 400;\">4,CREDIT_CARD_DS,\"1624-7457-4567-4545\"<\/span>\r\n<span style=\"font-weight: 400;\">. . .<\/span><\/pre>\n<h6><b>SortCL Job Script 2 (Transform &amp; Load):\u00a0<\/b><\/h6>\n<pre><span style=\"font-weight: 400;\"># This job reads the data file from the prior extract script and outputs a transposed file<\/span>\r\n<span style=\"font-weight: 400;\"># with columns containing each data class (PII type) and rows for each data value<\/span>\r\n\r\n<span style=\"font-weight: 400;\">\/INFILE=WorkFiles\/<\/span><b>FileRowColumnData.csv<\/b>\r\n<span style=\"font-weight: 400;\">        \/PROCESS=DELIMITED<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(ROW_NUM, TYPE=ASCII, POSITION=1, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(COL_NAME, TYPE=ASCII, POSITION=2, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(COL_VALUE, TYPE=ASCII, POSITION=3, SEPARATOR=',', FRAME='\"')<\/span>\r\n\r\n<span style=\"font-weight: 400;\">\/INREC<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(ROW_NUM, POSITION=1, SEPARATOR=',', NUMERIC)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_1, POSITION=2, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"ACC_NUMBER\" THEN COL_VALUE)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_2, POSITION=3, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"CREDIT_CARD_DS\" THEN COL_VALUE)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_3, POSITION=4, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"DATE_US_MMDDYYYY\" 
THEN COL_VALUE)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_4, POSITION=5, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"EMAIL\" THEN COL_VALUE)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_5, POSITION=6, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"FIRST_NAME\" THEN COL_VALUE)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_6, POSITION=7, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"LAST_NAME\" THEN COL_VALUE)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_7, POSITION=8, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"PHONE_DS\" THEN COL_VALUE)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\/FIELD=(VALUE_8, POSITION=9, TYPE=ASCII, SEPARATOR=',', IF COL_NAME EQ \"PIN_US\" THEN COL_VALUE)<\/span>\r\n\r\n<span style=\"font-weight: 400;\">\/SORT<\/span>\r\n<span style=\"font-weight: 400;\">         \/KEY=(ROW_NUM)<\/span>\r\n\r\n<span style=\"font-weight: 400;\">\/OUTFILE=OutputFiles\/<\/span><b>DarkShieldSearchLogTransposed.csv<\/b>\r\n<span style=\"font-weight: 400;\">        \/PROCESS=CSV<\/span>\r\n<span style=\"font-weight: 400;\">        \/HEADREC=\"ACC_NUMBER,CREDIT_CARD_DS,DATE_US_MMDDYYYY,EMAIL,FIRST_NAME,LAST_NAME,PHONE_DS,PIN_US\\n\"<\/span><span style=\"font-weight: 400;\">\r\n        \/FIELD=(VALUE_MAX_1, POSITION=1, TYPE=ASCII, SEPARATOR=',', FRAME='\"')\r\n        \/<\/span><span style=\"font-weight: 400;\">MAX VALUE_MAX_1 FROM VALUE_1 BREAK ROW_NUM<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(VALUE_MAX_2, POSITION=2, TYPE=ASCII, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/MAX VALUE_MAX_2 FROM VALUE_2 BREAK ROW_NUM<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(VALUE_MAX_3, POSITION=3, TYPE=ASCII, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/MAX VALUE_MAX_3 FROM VALUE_3 BREAK ROW_NUM<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(VALUE_MAX_4, 
POSITION=4, TYPE=ASCII, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/MAX VALUE_MAX_4 FROM VALUE_4 BREAK ROW_NUM<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(VALUE_MAX_5, POSITION=5, TYPE=ASCII, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/MAX VALUE_MAX_5 FROM VALUE_5 BREAK ROW_NUM<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(VALUE_MAX_6, POSITION=6, TYPE=ASCII, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/MAX VALUE_MAX_6 FROM VALUE_6 BREAK ROW_NUM<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(VALUE_MAX_7, POSITION=7, TYPE=ASCII, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/MAX VALUE_MAX_7 FROM VALUE_7 BREAK ROW_NUM<\/span>\r\n<span style=\"font-weight: 400;\">        \/FIELD=(VALUE_MAX_8, POSITION=8, TYPE=ASCII, SEPARATOR=',', FRAME='\"')<\/span>\r\n<span style=\"font-weight: 400;\">        \/MAX VALUE_MAX_8 FROM VALUE_8 BREAK ROW_NUM<\/span><\/pre>\n<p><span style=\"font-weight: 400;\">Note that the output in this case was defined as a single CSV file, but SortCL also supports one or more outputs of this data into differently formatted files (e.g., fixed position, JSON, XML, and XLS\/X) as well as any ODBC-connected database table. 
Here are the results of this job in spreadsheet form:<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-17704\" src=\"\/blog\/wp-content\/uploads\/2014\/06\/odbc-database-table-300x288.png\" alt=\"\" width=\"772\" height=\"742\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/odbc-database-table-300x288.png 300w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/odbc-database-table-1024x983.png 1024w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/odbc-database-table-768x737.png 768w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/odbc-database-table.png 1099w\" sizes=\"(max-width: 772px) 100vw, 772px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">This SortCL job produces unpivoted, though unrelated, structured data; i.e., no logical association (relationship) between these data elements is likely to exist. More specifically, the results of the DarkShield search through files are now available for several uses, including:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><i><span style=\"font-weight: 400;\">PII <a href=\"https:\/\/www.iri.com\/solutions\/data-masking\/verifying-compliance\">Audits <\/a><\/span><\/i><span style=\"font-weight: 400;\">&#8211; review of all values in each data class to show what was found \/ vulnerable, which from Excel can also be aggregated and graphed as necessary; see also <\/span><a href=\"https:\/\/www.iri.com\/blog\/data-protection\/darkshield-pii-discovery-masking-charts\/\"><span style=\"font-weight: 400;\">this<\/span><\/a><span style=\"font-weight: 400;\"> solution built into DarkShield for graphical PII discovery results<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/www.iri.com\/solutions\/data-integration\/etl\"><i><span style=\"font-weight: 400;\">ETL<\/span><\/i><\/a><span style=\"font-weight: 400;\"> &#8211; run join \/ 
lookup <a href=\"https:\/\/www.iri.com\/solutions\/data-transformation\">transformations<\/a> that match like values in structured sources to learn more from the associated values in those rows; see <a href=\"https:\/\/www.iri.com\/blog\/etl\/etl-part-2\/\">this follow-on article<\/a> for a continuation of this job that performs the join on email address.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><i><span style=\"font-weight: 400;\">Testing <\/span><\/i><span style=\"font-weight: 400;\">&#8211; these scrambled data values can be used for certain logicless but realistic <a href=\"https:\/\/www.iri.com\/blog\/test-data\/all-about-iri-set-files-a-primer\/\">test sets\u00a0<\/a><\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><i><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.iri.com\/solutions\/business-intelligence\/embedded-bi\/customer-data-integration-segmentation\">Segmented<\/a> delivery<\/span><\/i><span style=\"font-weight: 400;\"> &#8211; selected columns can be sent to individual files in the second script using the \/NEWFILE statement for those with a need to know specific PII types (data classes)<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">If you have any questions about PII search\/mask in DarkShield or textual ETL in Voracity, please email us at <\/span><a href=\"mailto:voracity@iri.com\"><span style=\"font-weight: 400;\">voracity@iri.com<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Synopsis Corporations and government agencies store a lot of useful information in non-transactional semi-structured and unstructured data sources. 
Finding that data \u2013 in documents, logs, and images \u2013 is important not only for data masking, but also for textual ETL.\u00a0\u00a0 Textual ETL lends structure and meaning to data that\u2019s hidden in text and prepares it<\/p>\n<div><a class=\"btn-filled btn\" href=\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\" title=\"Textual ETL: Unlocking Unstructured Data\">Read More<\/a><\/div>\n","protected":false},"author":152,"featured_media":17707,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[108,31,776,34,91,90],"tags":[688,44,610,55,1826,280,494,107,14,77,1825,417,5,689,71,690,692,1827,686,1402,1388,553,850,1822,571,693,615,1824,691,685,694,1823,68,1820,583,143,1821,687,550],"class_list":["post-6011","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data-2","category-data-migration","category-etl","category-business","category-iri-workbench","category-migration","tag-adobe","tag-cosort","tag-dark-data","tag-data-analytics","tag-data-class-identification","tag-data-discovery","tag-data-extraction","tag-data-integration","tag-data-masking","tag-data-migration-2","tag-data-reporting","tag-data-restructuring","tag-data-transformation","tag-e-mail-messages","tag-eclipse","tag-excel-spreadsheets","tag-exchange","tag-file-search-job","tag-free-form-text","tag-images","tag-iri-darkshield","tag-iri-nextform","tag-iri-workbench","tag-metadata-logging","tag-microsoft","tag-outlook","tag-pdf","tag-pii-audit","tag-powerpoint","tag-restructured","tag-rich-text-format","tag-sensitive-data-search","tag-sortcl","tag-textual-etl","tag-unstructured","tag-unstructured-data","tag-voracity-data-management","tag-word","tag-xml"],"acf":[],"yoast_head":"<!-- This 
site is optimized with the Yoast SEO Premium plugin v23.4 (Yoast SEO v23.4) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Textual ETL: Unlocking Unstructured Data - IRI<\/title>\n<meta name=\"description\" content=\"Learn how to extract and use the valuable data hidden in semi-structured and unstructured sources, starting with producing a structured set.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Textual ETL: Unlocking Unstructured Data\" \/>\n<meta property=\"og:description\" content=\"Learn how to extract and use the valuable data hidden in semi-structured and unstructured sources, starting with producing a structured set.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\" \/>\n<meta property=\"og:site_name\" content=\"IRI\" \/>\n<meta property=\"article:published_time\" content=\"2024-07-10T21:10:56+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-19T20:41:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1110\" \/>\n\t<meta property=\"og:image:height\" content=\"532\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Adam Lewis and Wade Donahue\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Adam Lewis and Wade Donahue\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\"},\"author\":{\"name\":\"Adam Lewis and Wade Donahue\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e\"},\"headline\":\"Textual ETL: Unlocking Unstructured Data\",\"datePublished\":\"2024-07-10T21:10:56+00:00\",\"dateModified\":\"2024-09-19T20:41:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\"},\"wordCount\":1402,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png\",\"keywords\":[\"Adobe\",\"CoSort\",\"dark data\",\"data analytics\",\"Data Class Identification\",\"data discovery\",\"data extraction\",\"data integration\",\"data masking\",\"data migration\",\"Data Reporting\",\"data restructuring\",\"data transformation\",\"E-mail messages\",\"Eclipse\",\"Excel spreadsheets\",\"Exchange\",\"File Search Job\",\"free-form text\",\"images\",\"IRI DarkShield\",\"IRI NextForm\",\"IRI Workbench\",\"Metadata Logging\",\"Microsoft\",\"Outlook\",\"pdf\",\"PII Audit\",\"PowerPoint\",\"restructured\",\"Rich Text Format\",\"Sensitive Data Search\",\"SortCL\",\"Textual ETL\",\"unstructured\",\"unstructured data\",\"Voracity Data Management\",\"Word\",\"xml\"],\"articleSection\":[\"Big Data\",\"Data Migration\",\"ETL\",\"IRI Business\",\"IRI 
Workbench\",\"Migration\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\",\"url\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\",\"name\":\"Textual ETL: Unlocking Unstructured Data - IRI\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png\",\"datePublished\":\"2024-07-10T21:10:56+00:00\",\"dateModified\":\"2024-09-19T20:41:16+00:00\",\"description\":\"Learn how to extract and use the valuable data hidden in semi-structured and unstructured sources, starting with producing a structured 
set.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png\",\"width\":1110,\"height\":532},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.iri.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Textual ETL: Unlocking Unstructured Data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.iri.com\/blog\/#website\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"name\":\"IRI\",\"description\":\"Total Data Management 
Blog\",\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.iri.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\",\"name\":\"IRI\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"width\":750,\"height\":206,\"caption\":\"IRI\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\"}},[{\"@type\":[\"Person\"],\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e\",\"name\":\"Adam Lewis\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/\",\"inLanguage\":\"en_US\",\"url\":\"\",\"caption\":\"Adam Lewis\"}},{\"@type\":[\"Person\"],\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e\",\"name\":\"Wade Donahue\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/\",\"inLanguage\":\"en_US\",\"url\":\"\",\"caption\":\"Wade Donahue\"}}]]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Textual ETL: Unlocking Unstructured Data - IRI","description":"Learn how to extract and use the valuable data hidden in semi-structured and unstructured sources, starting with producing a structured set.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/","og_locale":"en_US","og_type":"article","og_title":"Textual ETL: Unlocking Unstructured Data","og_description":"Learn how to extract and use the valuable data hidden in semi-structured and unstructured sources, starting with producing a structured set.","og_url":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/","og_site_name":"IRI","article_published_time":"2024-07-10T21:10:56+00:00","article_modified_time":"2024-09-19T20:41:16+00:00","og_image":[{"width":1110,"height":532,"url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png","type":"image\/png"}],"author":"Adam Lewis and Wade Donahue","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Adam Lewis and Wade Donahue","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#article","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/"},"author":{"name":"Adam Lewis and Wade Donahue","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e"},"headline":"Textual ETL: Unlocking Unstructured Data","datePublished":"2024-07-10T21:10:56+00:00","dateModified":"2024-09-19T20:41:16+00:00","mainEntityOfPage":{"@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/"},"wordCount":1402,"commentCount":1,"publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png","keywords":["Adobe","CoSort","dark data","data analytics","Data Class Identification","data discovery","data extraction","data integration","data masking","data migration","Data Reporting","data restructuring","data transformation","E-mail messages","Eclipse","Excel spreadsheets","Exchange","File Search Job","free-form text","images","IRI DarkShield","IRI NextForm","IRI Workbench","Metadata Logging","Microsoft","Outlook","pdf","PII Audit","PowerPoint","restructured","Rich Text Format","Sensitive Data Search","SortCL","Textual ETL","unstructured","unstructured data","Voracity Data Management","Word","xml"],"articleSection":["Big Data","Data Migration","ETL","IRI Business","IRI 
Workbench","Migration"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/","url":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/","name":"Textual ETL: Unlocking Unstructured Data - IRI","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage"},"image":{"@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png","datePublished":"2024-07-10T21:10:56+00:00","dateModified":"2024-09-19T20:41:16+00:00","description":"Learn how to extract and use the valuable data hidden in semi-structured and unstructured sources, starting with producing a structured set.","breadcrumb":{"@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#primaryimage","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png","width":1110,"height":532},{"@type":"BreadcrumbList","@id":"https:\/\/www.iri.com\/blog\/migration\/data-migration\/textual-etl\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.iri.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Textual ETL: Unlocking Unstructured 
Data"}]},{"@type":"WebSite","@id":"https:\/\/www.iri.com\/blog\/#website","url":"https:\/\/www.iri.com\/blog\/","name":"IRI","description":"Total Data Management Blog","publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.iri.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.iri.com\/blog\/#organization","name":"IRI","url":"https:\/\/www.iri.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","width":750,"height":206,"caption":"IRI"},"image":{"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/"}},[{"@type":["Person"],"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e","name":"Adam Lewis","image":{"@type":"ImageObject","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/","inLanguage":"en_US","url":"","caption":"Adam Lewis"}},{"@type":["Person"],"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e","name":"Wade Donahue","image":{"@type":"ImageObject","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/","inLanguage":"en_US","url":"","caption":"Wade 
Donahue"}}]]}},"jetpack_featured_media_url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/06\/Textual-ETL-featured-image.png","_links":{"self":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/6011"}],"collection":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/users\/152"}],"replies":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/comments?post=6011"}],"version-history":[{"count":70,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/6011\/revisions"}],"predecessor-version":[{"id":17995,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/6011\/revisions\/17995"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media\/17707"}],"wp:attachment":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media?parent=6011"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/categories?post=6011"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/tags?post=6011"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}