{"id":7557,"date":"2015-09-24T10:54:18","date_gmt":"2015-09-24T14:54:18","guid":{"rendered":"http:\/\/www.iri.com\/blog\/?p=7557"},"modified":"2017-11-06T12:23:16","modified_gmt":"2017-11-06T17:23:16","slug":"data-quality-and-fuzzy-searching","status":"publish","type":"post","link":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/","title":{"rendered":"Data Quality and Fuzzy Searching"},"content":{"rendered":"<p><em>IRI is now also delivering fuzzy search functions, both in its free database and flat-file profiling tools, and as available field-function libraries in\u00a0<a href=\"http:\/\/www.iri.com\/products\" target=\"_blank\" rel=\"noopener\">IRI CoSort, FieldShield, and Voracity<\/a> to augment data quality, security, and MDM capabilities. \u00a0This is the first in a series of articles on IRI fuzzy search solutions covering\u00a0their application to data quality improvement.<\/em><\/p>\n<p><strong>Introduction<\/strong><\/p>\n<p>The veracity, or reliability, of data is one of the big &#8216;V&#8217; words (along with volume, variety, velocity, and value) that IRI et al. talk\u00a0about in\u00a0the context of data and enterprise information management. Generally, IRI defines data in doubt as having one or more of these attributes:<\/p>\n<ol>\n<li>Low quality, because it is inconsistent, inaccurate, or incomplete<\/li>\n<li>Ambiguous (think MDM), imprecise (unstructured), or deceptive (social media)<\/li>\n<li>Biased (survey question), noisy (superfluous or contaminated), or abnormal (outliers)<\/li>\n<li>Invalid for any other reason (is the data correct and accurate for its intended use?)<\/li>\n<li>Unsafe \u2013 does it contain PII or secrets, and is that properly masked, reversible, etc.?<\/li>\n<\/ol>\n<p>This article focuses only on new fuzzy search solutions to the first problem, data quality. 
Other articles in this blog discuss how IRI software addresses the other four veracity problems;\u00a0ask for help if you can&#8217;t find them.<\/p>\n<p><strong>About Fuzzy Searching<\/strong><\/p>\n<p>Fuzzy searches find\u00a0words or phrases (values) that are\u00a0similar, but not necessarily identical,\u00a0to other words or phrases (values). This type of search has many uses, such as finding sequence errors, spelling errors, transposed characters, and others we&#8217;ll cover later.<\/p>\n<p>Performing a fuzzy search for approximate words or phrases\u00a0can help find data that may duplicate previously stored data, even when user input or autocorrection has altered it enough to make the records seem independent.<\/p>\n<p>The rest of the article covers\u00a0four fuzzy search functions that IRI now supports, and\u00a0how to use them to scour your\u00a0data and return the records approximating the search value.<\/p>\n<p><strong>1. Levenshtein<\/strong><\/p>\n<p>The Levenshtein algorithm works by taking two words or phrases and counting how many edit steps it takes to turn one\u00a0into the other. The fewer steps it takes, the more likely the two are a match. 
The steps the Levenshtein function can take are:<\/p>\n<ol>\n<li>Insertion\u00a0of\u00a0a character into the\u00a0word or phrase<\/li>\n<li>Deletion\u00a0of\u00a0a character from the\u00a0word or phrase<\/li>\n<li>Replacement\u00a0of\u00a0one\u00a0character in a word or phrase with\u00a0another<\/li>\n<\/ol>\n<p>The following is a CoSort <a href=\"http:\/\/www.iri.com\/products\/cosort\/sortcl\" target=\"_blank\" rel=\"noopener\">SortCL<\/a> program (job script) demonstrating how to use\u00a0the Levenshtein fuzzy search function:<\/p>\n<pre>\/INFILE=LevenshteinSample.dat\r\n \/PROCESS=RECORD\r\n \/FIELD=(ID, TYPE=ASCII, POSITION=1, SEPARATOR=\"\\t\") \r\n \/FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR=\"\\t\")\r\n\/REPORT\r\n\/OUTFILE=LevenshteinOutput.csv\r\n \/PROCESS=CSV \r\n \/FIELD=(ID, TYPE=ASCII, POSITION=1, SEPARATOR=\",\") \r\n \/FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR=\",\") \r\n \/FIELD=(FS_RESULT=fs_levenshtein(NAME, \"Barney Oakley\"), POSITION=3, SEPARATOR=\",\")\r\n \/INCLUDE WHERE FS_RESULT GT 50<\/pre>\n<p>Two\u00a0parts of this script produce\u00a0the desired output.<\/p>\n<pre>FS_RESULT=fs_levenshtein(NAME, \"Barney Oakley\")<\/pre>\n<p>This line calls the function fs_levenshtein and stores the result in the field FS_RESULT. The function takes two input parameters:<\/p>\n<ul>\n<li>The field to run the fuzzy search on (NAME in our example)<\/li>\n<li>The string that the input field will be compared to (&#8220;Barney Oakley&#8221; in our example)<\/li>\n<\/ul>\n<pre>\/INCLUDE WHERE FS_RESULT GT 50<\/pre>\n<p>This line checks whether the FS_RESULT field is greater than 50, so that only records with an FS_RESULT of more than 50 are output. 
The following shows the output from our example.<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/LevenshteinOutput1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7559 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/LevenshteinOutput1.png\" alt=\"Fuzzy Search DQ Levenshtein Output\" width=\"438\" height=\"430\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/LevenshteinOutput1.png 438w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/LevenshteinOutput1-300x295.png 300w\" sizes=\"(max-width: 438px) 100vw, 438px\" \/><\/a><\/p>\n<p>As the output shows, this type of search is useful for finding:<\/p>\n<ol>\n<li>Concatenated names<\/li>\n<li>Noise<\/li>\n<li>Spelling errors<\/li>\n<li>Transposed characters<\/li>\n<li>Transcription mistakes<\/li>\n<li>Typing errors<\/li>\n<\/ol>\n<p>The\u00a0Levenshtein function is thus\u00a0useful for\u00a0identifying\u00a0common data entry errors. However, it takes the longest to run of the four\u00a0algorithms, as it compares every character in one string to every character in the other.<\/p>\n<p><strong>2. Dice Coefficient<\/strong><\/p>\n<p>The Dice coefficient, or Dice algorithm, breaks up words or phrases into character pairs, compares those pairs, and counts the matches. 
The more pairs the two values share, the more likely\u00a0they are a match.<\/p>\n<p>The following SortCL script demonstrates the Dice coefficient\u00a0fuzzy search function.<\/p>\n<pre>\/INFILE=DiceSample.dat\r\n \/PROCESS=RECORD\r\n \/FIELD=(ID, TYPE=ASCII, POSITION=1, SEPARATOR=\"\\t\") \r\n \/FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR=\"\\t\")\r\n\/REPORT\r\n\/OUTFILE=DiceOutput.csv\r\n \/PROCESS=CSV \r\n \/FIELD=(ID, TYPE=ASCII, POSITION=1, SEPARATOR=\",\") \r\n \/FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR=\",\") \r\n \/FIELD=(FS_RESULT=fs_dice(NAME, \"Robert Thomas Smith\"), POSITION=3, SEPARATOR=\",\")\r\n \/INCLUDE WHERE FS_RESULT GT 50<\/pre>\n<p>Two\u00a0parts of this script produce the desired output.<\/p>\n<pre>FS_RESULT=fs_dice(NAME, \"Robert Thomas Smith\")<\/pre>\n<p>This line calls the function fs_dice and stores the result in the field FS_RESULT. The function takes two input parameters:<\/p>\n<ul>\n<li>The field to run the fuzzy search on (NAME in our example)<\/li>\n<li>The string that the input field will be compared to (&#8220;Robert Thomas Smith&#8221; in our example)<\/li>\n<\/ul>\n<pre>\/INCLUDE WHERE FS_RESULT GT 50<\/pre>\n<p>This line checks whether the FS_RESULT field is greater than 50, so that only records with an FS_RESULT of more than 50 are output. 
The following shows the output from our example.<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/DiceOutput1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7571 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/DiceOutput1.png\" alt=\"Fuzzy Search DQ Dice Output\" width=\"471\" height=\"453\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/DiceOutput1.png 471w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/DiceOutput1-300x289.png 300w\" sizes=\"(max-width: 471px) 100vw, 471px\" \/><\/a><\/p>\n<p>As the output shows,\u00a0the Dice coefficient algorithm is useful for finding inconsistent data such as:<\/p>\n<ol>\n<li>Sequence errors<\/li>\n<li>Involuntary corrections<\/li>\n<li>Nicknames<\/li>\n<li>Initials and nicknames<\/li>\n<li>Unpredictable use of initials<\/li>\n<li>Localization<\/li>\n<\/ol>\n<p>The Dice algorithm is faster than\u00a0Levenshtein, but can become less accurate when there are many\u00a0simple errors such as typos.<\/p>\n<p><strong>3. Metaphone and 4. Soundex<\/strong><\/p>\n<p>The Metaphone and Soundex\u00a0algorithms compare words or phrases based on how they sound.\u00a0Soundex\u00a0does this by reading through the word or phrase and looking at individual characters, while Metaphone looks at both individual characters and\u00a0character groups. 
Both\u00a0then assign each word a code based on its spelling and pronunciation.<\/p>\n<p>The following SortCL script demonstrates the Soundex and Metaphone search functions:<\/p>\n<pre>\/INFILE=SoundexSample.dat\r\n \/PROCESS=RECORD\r\n \/FIELD=(ID, TYPE=ASCII, POSITION=1, SEPARATOR=\"\\t\") \r\n \/FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR=\"\\t\")\r\n\/REPORT\r\n\/OUTFILE=SoundexOutput.csv\r\n \/PROCESS=CSV \r\n \/FIELD=(ID, TYPE=ASCII, POSITION=1, SEPARATOR=\",\") \r\n \/FIELD=(NAME, TYPE=ASCII, POSITION=2, SEPARATOR=\",\") \r\n \/FIELD=(SE_RESULT=fs_soundex(NAME, \"John\"), POSITION=3, SEPARATOR=\",\") \r\n \/FIELD=(MP_RESULT=fs_metaphone(NAME, \"John\"), POSITION=4, SEPARATOR=\",\") \r\n \/INCLUDE WHERE (SE_RESULT GT 0) OR (MP_RESULT GT 0)<\/pre>\n<p>Three parts of this script produce the desired output: the two function calls and the \/INCLUDE filter.<\/p>\n<pre>SE_RESULT=fs_soundex(NAME, \"John\")\r\nMP_RESULT=fs_metaphone(NAME, \"John\")<\/pre>\n<p>Each line calls its function and stores the result in its own field, SE_RESULT or MP_RESULT. Both functions take\u00a0two input parameters:<\/p>\n<ul>\n<li>The field to run the fuzzy search on (NAME in our example)<\/li>\n<li>The string that the input field will be compared to (&#8220;John&#8221; in our example)<\/li>\n<\/ul>\n<pre>\/INCLUDE WHERE (SE_RESULT GT 0) OR (MP_RESULT GT 0)<\/pre>\n<p>This line checks the SE_RESULT and MP_RESULT fields, and outputs the record if either\u00a0is greater than 0.<\/p>\n<p>Soundex returns either 100 for a match, or 0 if it is not a match. 
Metaphone grades its results more finely, returning 100 for a strong match, 66 for a normal match, and 33 for a minor match.<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/SoundexOutput2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7570 size-full\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/SoundexOutput2.png\" alt=\"Fuzzy Search DQ Soundex Output\" width=\"533\" height=\"532\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/SoundexOutput2.png 533w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/SoundexOutput2-150x150.png 150w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/07\/SoundexOutput2-300x300.png 300w\" sizes=\"(max-width: 533px) 100vw, 533px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><em><strong>Column C<\/strong> shows the Soundex results. <strong>Column D<\/strong> shows the Metaphone results.<\/em><\/p>\n<p>As the output shows, this type of search is useful for finding:<\/p>\n<ul>\n<li>Phonetic errors<\/li>\n<\/ul>\n<p>Please submit feedback on this article below, and if you are interested in using\u00a0these functions, please contact your IRI representative. See <a href=\"http:\/\/www.iri.com\/blog\/master-data-metadata-management\/master-data-management\/data-consolidation-wizard-for-data-quality\/\">our next article<\/a> on using these algorithms in the IRI Workbench data consolidation (quality) wizard.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>IRI is now also delivering fuzzy search functions, both in its free database and flat-file profiling tools, and as available field-function libraries in\u00a0IRI CoSort, FieldShield, and Voracity to augment data quality, security, and MDM capabilities. 
\u00a0This is the first in a series of articles on IRI fuzzy search solutions covering\u00a0their application to data quality improvement.<\/p>\n<div><a class=\"btn-filled btn\" href=\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\" title=\"Data Quality and Fuzzy Searching\">Read More<\/a><\/div>\n","protected":false},"author":61,"featured_media":11714,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[8,363,776,91,232,3],"tags":[14,77,366,754,867,755,868,100,865,546,866,852,851,869,68,870],"class_list":["post-7557","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-protection","category-data-quality","category-etl","category-iri-workbench","category-master-data-management","category-vldb-operations","tag-data-masking","tag-data-migration-2","tag-data-quality-2","tag-database-profiling","tag-dice-coefficient","tag-dq","tag-enterprise-information-management","tag-etl","tag-fuzzy-search","tag-iri-cosort","tag-levenshtein-algorithm","tag-master-data-management","tag-master-data-metadata-management","tag-metaphone","tag-sortcl","tag-soundex"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v23.4 (Yoast SEO v23.4) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Data Quality and Fuzzy Searching - IRI<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Quality and 
Fuzzy Searching\" \/>\n<meta property=\"og:description\" content=\"IRI is now also delivering fuzzy search functions, both in its free database and flat-file profiling tools, and as available field-function libraries in\u00a0IRI CoSort, FieldShield, and Voracity to augment data quality, security, and MDM capabilities. \u00a0This is the first in a series of articles on IRI fuzzy search solutions covering\u00a0their application to data quality improvement.Read More\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\" \/>\n<meta property=\"og:site_name\" content=\"IRI\" \/>\n<meta property=\"article:published_time\" content=\"2015-09-24T14:54:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2017-11-06T17:23:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"498\" \/>\n\t<meta property=\"og:image:height\" content=\"279\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Nathan Dymora\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nathan Dymora\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\"},\"author\":{\"name\":\"Nathan Dymora\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6c3bde00b144e9786b3d024c1e45defa\"},\"headline\":\"Data Quality and Fuzzy Searching\",\"datePublished\":\"2015-09-24T14:54:18+00:00\",\"dateModified\":\"2017-11-06T17:23:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\"},\"wordCount\":1026,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg\",\"keywords\":[\"data masking\",\"data migration\",\"data quality\",\"database profiling\",\"Dice Coefficient\",\"DQ\",\"enterprise information management\",\"ETL\",\"fuzzy search\",\"IRI CoSort\",\"Levenshtein algorithm\",\"Master Data Management\",\"MDM\",\"Metaphone\",\"SortCL\",\"Soundex\"],\"articleSection\":[\"Data Masking\/Protection\",\"Data Quality (DQ&#041;\",\"ETL\",\"IRI Workbench\",\"Master Data 
Management\",\"VLDB\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\",\"url\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\",\"name\":\"Data Quality and Fuzzy Searching - IRI\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg\",\"datePublished\":\"2015-09-24T14:54:18+00:00\",\"dateModified\":\"2017-11-06T17:23:16+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg\",\"width\":498,\"height\":279},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.iri.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Data Quality and Fuzzy 
Searching\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.iri.com\/blog\/#website\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"name\":\"IRI\",\"description\":\"Total Data Management Blog\",\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.iri.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\",\"name\":\"IRI\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"width\":750,\"height\":206,\"caption\":\"IRI\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6c3bde00b144e9786b3d024c1e45defa\",\"name\":\"Nathan Dymora\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/fe3589b371c7912ed817bd9e5e443745?s=96&d=blank&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/fe3589b371c7912ed817bd9e5e443745?s=96&d=blank&r=g\",\"caption\":\"Nathan Dymora\"},\"url\":\"https:\/\/www.iri.com\/blog\/author\/nathand\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Data Quality and Fuzzy Searching - IRI","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/","og_locale":"en_US","og_type":"article","og_title":"Data Quality and Fuzzy Searching","og_description":"IRI is now also delivering fuzzy search functions, both in its free database and flat-file profiling tools, and as available field-function libraries in\u00a0IRI CoSort, FieldShield, and Voracity to augment data quality, security, and MDM capabilities. \u00a0This is the first in a series of articles on IRI fuzzy search solutions covering\u00a0their application to data quality improvement.Read More","og_url":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/","og_site_name":"IRI","article_published_time":"2015-09-24T14:54:18+00:00","article_modified_time":"2017-11-06T17:23:16+00:00","og_image":[{"width":498,"height":279,"url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg","type":"image\/jpeg"}],"author":"Nathan Dymora","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Nathan Dymora","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#article","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/"},"author":{"name":"Nathan Dymora","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6c3bde00b144e9786b3d024c1e45defa"},"headline":"Data Quality and Fuzzy Searching","datePublished":"2015-09-24T14:54:18+00:00","dateModified":"2017-11-06T17:23:16+00:00","mainEntityOfPage":{"@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/"},"wordCount":1026,"commentCount":1,"publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg","keywords":["data masking","data migration","data quality","database profiling","Dice Coefficient","DQ","enterprise information management","ETL","fuzzy search","IRI CoSort","Levenshtein algorithm","Master Data Management","MDM","Metaphone","SortCL","Soundex"],"articleSection":["Data Masking\/Protection","Data Quality (DQ&#041;","ETL","IRI Workbench","Master Data Management","VLDB"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/","url":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/","name":"Data Quality and Fuzzy Searching - 
IRI","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage"},"image":{"@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg","datePublished":"2015-09-24T14:54:18+00:00","dateModified":"2017-11-06T17:23:16+00:00","breadcrumb":{"@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#primaryimage","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg","width":498,"height":279},{"@type":"BreadcrumbList","@id":"https:\/\/www.iri.com\/blog\/vldb-operations\/data-quality-and-fuzzy-searching\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.iri.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Data Quality and Fuzzy Searching"}]},{"@type":"WebSite","@id":"https:\/\/www.iri.com\/blog\/#website","url":"https:\/\/www.iri.com\/blog\/","name":"IRI","description":"Total Data Management 
Blog","publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.iri.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.iri.com\/blog\/#organization","name":"IRI","url":"https:\/\/www.iri.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","width":750,"height":206,"caption":"IRI"},"image":{"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/6c3bde00b144e9786b3d024c1e45defa","name":"Nathan Dymora","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/fe3589b371c7912ed817bd9e5e443745?s=96&d=blank&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/fe3589b371c7912ed817bd9e5e443745?s=96&d=blank&r=g","caption":"Nathan 
Dymora"},"url":"https:\/\/www.iri.com\/blog\/author\/nathand\/"}]}},"jetpack_featured_media_url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2015\/09\/fuzzy-logic-1.jpg","_links":{"self":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/7557"}],"collection":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/users\/61"}],"replies":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/comments?post=7557"}],"version-history":[{"count":17,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/7557\/revisions"}],"predecessor-version":[{"id":11293,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/7557\/revisions\/11293"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media\/11714"}],"wp:attachment":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media?parent=7557"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/categories?post=7557"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/tags?post=7557"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}