{"id":6386,"date":"2014-11-26T14:57:37","date_gmt":"2014-11-26T19:57:37","guid":{"rendered":"http:\/\/www.iri.com\/blog\/?p=6386"},"modified":"2024-12-13T08:52:13","modified_gmt":"2024-12-13T13:52:13","slug":"easier-big-data-prep-for-r","status":"publish","type":"post","link":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/","title":{"rendered":"Easier Big Data Prep for R"},"content":{"rendered":"<p>Among analytic tools for statistical computation and graphics, R has grown in popularity among data miners, and development of its open source language has grown with it. However, from a performance standpoint, R holds all of its objects in virtual memory, which becomes an issue when working with very large data sets.<\/p>\n<p>There are many ways to handle big data while still using R for the statistical analysis. Hadoop distributions come to mind, but they do not come cheaply or easily. Another obvious solution might be to add more memory to the\u00a0PC where R runs (mine\u00a0had\u00a06GB). I tested the limits of how many rows R could handle at one time.<\/p>\n<p><strong>My R Limit<\/strong><\/p>\n<p>On\u00a0my computer, about 3 million rows, or approximately\u00a0112MB, was all R could process at once. That is nowhere near the scale of \u201cbig data,\u201d which is closer to the 100-million-row range. Scaling up proportionally from my 6GB machine, that means I&#8217;d\u00a0need about 200GB of RAM. That is still not\u00a0feasible for PC users at this time.<\/p>\n<p>Another option is to break the data down into smaller chunks that R can handle, process these chunks individually, and then summarize everything at the end. Let\u2019s say you want to prepare\u00a030 million rows\u2019 worth of source data. 
With the aforementioned limitations of\u00a0my PC,\u00a0I had\u00a0to break the data\u00a0down into 10 chunks\u00a0of 3 million rows each before R could\u00a0process it.<\/p>\n<p>With one simple line of code in R that\u00a0handles \u201cgarbage collection,\u201d you can remove the processed information from memory to make room for the next set. But to process these chunks before\u00a0garbage collection, the R code needs to be in\u00a0multiple files, all processed by one final summary code file.\u00a0So if 30 million rows need\u00a011 R scripts, then 300 million rows would require an impractical 101 scripts.<\/p>\n<p>A\u00a0practical and time-saving solution is\u00a0to use\u00a0third-party software designed for pre-processing\u00a0big data, so as to\u00a0give\u00a0R\u00a0more manageable chunks\u00a0to\u00a0analyze. <a href=\"http:\/\/www.iri.com\/products\/cosort\" target=\"_blank\" rel=\"noopener\">IRI CoSort<\/a>\u00a0is a data manipulation and management package that rapidly prepares, or franchises, raw <a href=\"http:\/\/www.iri.com\/products\/workbench\/data-sources\" target=\"_blank\" rel=\"noopener\">data sources<\/a> for BI and analytics using the <a href=\"http:\/\/www.iri.com\/solutions\/big-data\" target=\"_blank\" rel=\"noopener\">existing Windows or Unix file system<\/a>.<\/p>\n<p><strong>Comparing Options<\/strong><\/p>\n<p>Let&#8217;s use the same example as above, with 30 million total rows of data.\u00a0Say you have one file with a list of store numbers, managers\u2019 names, and state abbreviations. Then you have 30 million rows of data in one or more other files containing transaction information: account number, store number, item number, price, etc. We need to analyze sales revenue totals using the price of each item\u00a0sold, grouped by the state (in Brazil) in which the item was sold.<\/p>\n<p>For R to process this information, the transaction data would have to be in 10 files of 3 million rows each to avoid\u00a0memory overload. 
Each of these files requires its own R script to join with the store info file, then sort and sum the price based on state, with the totals saved to a results file.\u00a0You would then need an 11<sup>th<\/sup> R script to run the first 10 transform jobs, read the outputs of\u00a0those, and create a final total based on state.\u00a0Let\u2019s assume this was your only option, and you performed all these steps. How long would it take?<\/p>\n<p>Not counting the time it takes to write 11 R scripts, the run time\u00a0for\u00a030 million rows of data was 510.38 seconds, or 8 minutes and 16.51 seconds. The\u00a0R scripts,\u00a0shown via WalWare&#8217;s StatET for R (since updated to <a href=\"https:\/\/eclipse.dev\/statet\/\">Eclipse StatET<\/a>)\u00a0in the center editing window of the <a href=\"http:\/\/www.iri.com\/products\/workbench\" target=\"_blank\" rel=\"noopener\">IRI Workbench<\/a> GUI below, look like this:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Commands-e1417623383255.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6420\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Commands-e1417623383255.png\" alt=\"R_Commands\" width=\"600\" height=\"359\" \/><\/a><\/p>\n<p>Here are the 10 data chunks I had to maintain:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Commands_Output.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6421\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Commands_Output.png\" alt=\"R_Commands_Output\" width=\"608\" height=\"299\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Commands_Output.png 608w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Commands_Output-300x147.png 300w\" sizes=\"(max-width: 608px) 100vw, 608px\" \/><\/a><\/p>\n<p>before reaching the final summary result:<\/p>\n<p><a 
href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Final_Output.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6422\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Final_Output.png\" alt=\"R_Final_Output\" width=\"676\" height=\"329\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Final_Output.png 676w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/R_Final_Output-300x146.png 300w\" sizes=\"(max-width: 676px) 100vw, 676px\" \/><\/a><br \/>\nTo obtain\u00a0the same sort-and-sum results with IRI\u00a0CoSort, however, I needed just one CoSort &#8216;Sort Control Language&#8217; (<a href=\"http:\/\/www.iri.com\/products\/cosort\/sortcl\" target=\"_blank\" rel=\"noopener\">SortCL<\/a>)\u00a0program\u00a0to\u00a0sort and join\u00a0two\u00a015M-row files over their common key and sum them by\u00a0state. SortCL\u00a0supports any number (and size)\u00a0of <a href=\"http:\/\/www.iri.com\/products\/workbench\/data-sources\" target=\"_blank\" rel=\"noopener\">data sources<\/a>\u00a0and\u00a0formats, and produces\u00a0any number and type of targets\u00a0simultaneously. 
See <a href=\"http:\/\/www.iri.com\/img\/iri-integration_3500x4500.jpg\" target=\"_blank\" rel=\"noopener\">this<\/a> functional summary diagram.<\/p>\n<p>The\u00a0SortCL job producing the same summary output as R (and in the same GUI) is shown here:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/SortCL_Script_Run-e1417623470915.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6419\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/SortCL_Script_Run-e1417623470915.png\" alt=\"SortCL_Script_Run\" width=\"600\" height=\"347\" \/><\/a><\/p>\n<p>This approach took only\u00a0272.95 seconds, or 4 minutes 32.95 seconds, which was 45% faster than R:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/performance-chart1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6492\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/performance-chart1.png\" alt=\"performance chart\" width=\"547\" height=\"323\" srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/performance-chart1.png 547w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/performance-chart1-300x177.png 300w\" sizes=\"(max-width: 547px) 100vw, 547px\" \/><\/a><\/p>\n<p>Plus, having to write and manage only one SortCL script, as opposed to 11 R scripts, saved even more time.<\/p>\n<p><strong>Conclusion<\/strong><\/p>\n<p>Either way,\u00a0the raw data was distilled into the same subset that R could quickly analyze and feed to\u00a0a visualization function like ggplot2&#8217;s ggplot or qplot:<\/p>\n<p><a href=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/sales31.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6490\" src=\"http:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/sales31.png\" alt=\"sales3\" width=\"610\" height=\"359\" 
srcset=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/sales31.png 610w, https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/sales31-300x176.png 300w\" sizes=\"(max-width: 610px) 100vw, 610px\" \/><\/a><\/p>\n<p>However, beyond the time-to-visualization\u00a0advantage CoSort afforded me\u00a0through simpler and speedier data preparation for R, I had centralized data that I could re-use, mask, and quality control in CoSort SortCL and compatible\u00a0IRI software jobs.<\/p>\n<p>This approach also avoids data being out-of-sync between R sessions that call for the same data at different times. Moreover, the data access\u00a0and <a href=\"https:\/\/www.iri.com\/solutions\/metadata-mdm\/metadata-management\" target=\"_blank\" rel=\"noopener\">metadata management<\/a> features\u00a0in\u00a0the IRI Workbench let me take and share control of the data\u00a0life cycle, especially the\u00a0<a href=\"http:\/\/www.iri.com\/solutions\/data-transformation\" target=\"_blank\" rel=\"noopener\">remappings<\/a>\u00a0I wanted to do\u00a0for R&#8217;s sake.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Among analytic tools for statistical computation and graphics, R has shown an increase in popularity among data miners, and in the development of its open source language. However, from a performance standpoint, R holds all of its objects in virtual memory, which becomes an issue when attempting to work with very large data sets. 
There<\/p>\n<div><a class=\"btn-filled btn\" href=\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\" title=\"Easier Big Data Prep for R\">Read More<\/a><\/div>\n","protected":false},"author":32,"featured_media":11641,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[108,32,91],"tags":[576,581,81,546,281,575,578,577,80,580,579,161],"class_list":["post-6386","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data-2","category-business-intelligence","category-iri-workbench","tag-big-data-prep","tag-external-sorting","tag-hadoop","tag-iri-cosort","tag-metadata-management-2","tag-r-project","tag-statistical-analysis","tag-statistical-computing","tag-unix","tag-very-large-data-sets","tag-virtual-memory","tag-windows"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v23.4 (Yoast SEO v23.4) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Easier Big Data Prep for R - IRI<\/title>\n<meta name=\"description\" content=\"Learn how to handle big data in R through efficient data wrangling techniques that combine IRI CoSort (in Voracity) with StatET (in Eclipse).\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Easier Big Data Prep for R\" \/>\n<meta property=\"og:description\" content=\"Learn how to handle big data in R through efficient data wrangling techniques that combine IRI 
CoSort (in Voracity) with StatET (in Eclipse).\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\" \/>\n<meta property=\"og:site_name\" content=\"IRI\" \/>\n<meta property=\"article:published_time\" content=\"2014-11-26T19:57:37+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-12-13T13:52:13+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png\" \/>\n\t<meta property=\"og:image:width\" content=\"550\" \/>\n\t<meta property=\"og:image:height\" content=\"300\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Jackie Sabbagh\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jackie Sabbagh\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\"},\"author\":{\"name\":\"Jackie Sabbagh\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/a77b509c111f25f8888fe7bd0c73fa9e\"},\"headline\":\"Easier Big Data Prep for 
R\",\"datePublished\":\"2014-11-26T19:57:37+00:00\",\"dateModified\":\"2024-12-13T13:52:13+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\"},\"wordCount\":849,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png\",\"keywords\":[\"big data prep\",\"external sorting\",\"hadoop\",\"IRI CoSort\",\"metadata management\",\"R project\",\"statistical analysis\",\"statistical computing\",\"Unix\",\"very large data sets\",\"virtual memory\",\"Windows\"],\"articleSection\":[\"Big Data\",\"Business Intelligence (BI&#041;\",\"IRI Workbench\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\",\"url\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\",\"name\":\"Easier Big Data Prep for R - IRI\",\"isPartOf\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png\",\"datePublished\":\"2014-11-26T19:57:37+00:00\",\"dateModified\":\"2024-12-13T13:52:13+00:00\",\"description\":\"Learn how to handle big data in R through efficient data wrangling techniques that combine IRI CoSort (in Voracity) with StatET (in 
Eclipse).\",\"breadcrumb\":{\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png\",\"width\":550,\"height\":300},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.iri.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Easier Big Data Prep for R\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.iri.com\/blog\/#website\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"name\":\"IRI\",\"description\":\"Total Data Management 
Blog\",\"publisher\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.iri.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.iri.com\/blog\/#organization\",\"name\":\"IRI\",\"url\":\"https:\/\/www.iri.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"contentUrl\":\"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"width\":750,\"height\":206,\"caption\":\"IRI\"},\"image\":{\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/a77b509c111f25f8888fe7bd0c73fa9e\",\"name\":\"Jackie Sabbagh\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/e8797c735e92e5e8a11c74ddeb2c919b?s=96&d=blank&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/e8797c735e92e5e8a11c74ddeb2c919b?s=96&d=blank&r=g\",\"caption\":\"Jackie Sabbagh\"},\"url\":\"https:\/\/www.iri.com\/blog\/author\/jackies\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Easier Big Data Prep for R - IRI","description":"Learn how to handle big data in R through efficient data wrangling techniques that combine IRI CoSort (in Voracity) with StatET (in Eclipse).","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/","og_locale":"en_US","og_type":"article","og_title":"Easier Big Data Prep for R","og_description":"Learn how to handle big data in R through efficient data wrangling techniques that combine IRI CoSort (in Voracity) with StatET (in Eclipse).","og_url":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/","og_site_name":"IRI","article_published_time":"2014-11-26T19:57:37+00:00","article_modified_time":"2024-12-13T13:52:13+00:00","og_image":[{"width":550,"height":300,"url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png","type":"image\/png"}],"author":"Jackie Sabbagh","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Jackie Sabbagh","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#article","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/"},"author":{"name":"Jackie Sabbagh","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/a77b509c111f25f8888fe7bd0c73fa9e"},"headline":"Easier Big Data Prep for R","datePublished":"2014-11-26T19:57:37+00:00","dateModified":"2024-12-13T13:52:13+00:00","mainEntityOfPage":{"@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/"},"wordCount":849,"commentCount":1,"publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png","keywords":["big data prep","external sorting","hadoop","IRI CoSort","metadata management","R project","statistical analysis","statistical computing","Unix","very large data sets","virtual memory","Windows"],"articleSection":["Big Data","Business Intelligence (BI&#041;","IRI Workbench"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/","url":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/","name":"Easier Big Data Prep for R - 
IRI","isPartOf":{"@id":"https:\/\/www.iri.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage"},"image":{"@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage"},"thumbnailUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png","datePublished":"2014-11-26T19:57:37+00:00","dateModified":"2024-12-13T13:52:13+00:00","description":"Learn how to handle big data in R through efficient data wrangling techniques that combine IRI CoSort (in Voracity) with StatET (in Eclipse).","breadcrumb":{"@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#primaryimage","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png","width":550,"height":300},{"@type":"BreadcrumbList","@id":"https:\/\/www.iri.com\/blog\/business-intelligence\/easier-big-data-prep-for-r\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.iri.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Easier Big Data Prep for R"}]},{"@type":"WebSite","@id":"https:\/\/www.iri.com\/blog\/#website","url":"https:\/\/www.iri.com\/blog\/","name":"IRI","description":"Total Data Management 
Blog","publisher":{"@id":"https:\/\/www.iri.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.iri.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.iri.com\/blog\/#organization","name":"IRI","url":"https:\/\/www.iri.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","contentUrl":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","width":750,"height":206,"caption":"IRI"},"image":{"@id":"https:\/\/www.iri.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/a77b509c111f25f8888fe7bd0c73fa9e","name":"Jackie Sabbagh","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.iri.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/e8797c735e92e5e8a11c74ddeb2c919b?s=96&d=blank&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e8797c735e92e5e8a11c74ddeb2c919b?s=96&d=blank&r=g","caption":"Jackie 
Sabbagh"},"url":"https:\/\/www.iri.com\/blog\/author\/jackies\/"}]}},"jetpack_featured_media_url":"https:\/\/www.iri.com\/blog\/wp-content\/uploads\/2014\/11\/iri-r-logo-2.png","_links":{"self":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/6386"}],"collection":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/users\/32"}],"replies":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/comments?post=6386"}],"version-history":[{"count":33,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/6386\/revisions"}],"predecessor-version":[{"id":18157,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/posts\/6386\/revisions\/18157"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media\/11641"}],"wp:attachment":[{"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/media?parent=6386"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/categories?post=6386"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.iri.com\/blog\/wp-json\/wp\/v2\/tags?post=6386"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}