“That’s all you need in life, a little place for your stuff. That’s all your house is, a place to keep your stuff. If you didn’t have so much stuff, you wouldn’t need a house. You could just walk around all the time. A house is just a pile of stuff with a cover on it.” — George Carlin
In the business world, as in our personal lives, we run into the same quandary of what to do with our continually growing amounts of “stuff.” But it’s not a matter to be taken lightly. When that stuff happens to be digital information, upon which we must base our reasoning and business decisions, we quickly see it’s not just its volume or format that matters. There are many other important considerations as well, especially when pondering the best solutions for storing, integrating, and using big data.
Gartner defines big data as “high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Even data produced on a daily basis can exceed the capacity and capabilities of many on-hand database management systems. This data can be structured or unstructured, static or streaming, and can undergo rapid change. It may require real-time or near-real-time transformation into business intelligence (BI). All these factors make it necessary for our data integration choices to reflect the actual storage and usage patterns of our data.
A core component of analytics and BI is a data warehouse (DW) or Enterprise Data Warehouse (EDW). This centrally accessible repository, and its simpler, single-subject form called a data mart, are the oldest and still most widely used solutions for managing data and creating reports. The EDW also supports additional data operations and integration through the more basic operational data store (ODS).
Originating in the 1980s, data warehouse architecture was engineered to facilitate the flow of data from operational systems into an environment built for analyzing massive accumulations of enterprise data. Data is extracted from heterogeneous sources (usually on-premise databases and files) into a staging area, transformed to meet decision-support requirements, and loaded into the warehouse. Business users and executives then base their decisions on that data, which is why the EDW remains a reliable data integration and storage paradigm for enabling analytic insight.
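The extract-transform-load (ETL) flow described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline; the CSV source, the `region`/`revenue` fields, and the `sales` target table are all hypothetical examples, with SQLite standing in for the warehouse.

```python
# Minimal sketch of the extract-transform-load (ETL) flow described above.
# The source file, field names, and target table are hypothetical examples.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a heterogeneous source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize values in the staging area to meet reporting needs."""
    for row in rows:
        row["region"] = row["region"].strip().upper()
        row["revenue"] = round(float(row["revenue"]), 2)
    return rows

def load(rows, conn):
    """Load: store the conformed rows in the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales (region, revenue) VALUES (:region, :revenue)", rows
    )
    conn.commit()
```

Real warehouse loads add surrogate keys, slowly changing dimensions, and error handling, but the staging-transform-store shape is the same.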
Most companies continue to use data warehouses today despite their relatively high hardware, software, and design/support costs. Some businesses have become dissatisfied with this paradigm for large-scale information management due to its overhead. And so the EDW architectural model has evolved in recent years from the physical consolidation method to a more logical and virtual one.
Changing requirements in business analytics have created a more modern data virtualization paradigm driven by the demand to easily access and federate data. To complement the traditional data warehouse, the logical data warehouse (LDW) adds an architectural layer to view enterprise data without having to relocate and transform it beforehand. LDWs are capable of retrieving and transforming data in real time (or near-real-time), and producing fresher data without the limitations imposed by the pre-built structures of traditional DWs.
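The virtual-layer idea is easier to see in code: instead of copying and pre-transforming data into one repository, a federating layer runs the query against each live source and merges the results on demand. The sketch below is an assumption-laden toy, with two in-memory SQLite databases standing in for independent remote systems and a hypothetical `query_federated` helper playing the role of the LDW layer.

```python
# Sketch of the logical-layer idea: federate two live sources at query time
# instead of relocating the data first. Sources and schemas are hypothetical.
import sqlite3

def query_federated(sources, sql):
    """Run the same query against each source and merge results on demand."""
    results = []
    for conn in sources:
        results.extend(conn.execute(sql).fetchall())
    return results

# Two independent systems of record stand in for remote databases.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (name TEXT, region TEXT)")
crm.execute("INSERT INTO customers VALUES ('Acme', 'EAST')")

erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE customers (name TEXT, region TEXT)")
erp.execute("INSERT INTO customers VALUES ('Globex', 'WEST')")

# The caller sees one unified view; nothing was copied or pre-transformed.
unified = query_federated([crm, erp], "SELECT name, region FROM customers")
```

Because each query reads the sources directly, results are as fresh as the sources themselves, which is the property the LDW trades against the performance of a pre-built warehouse.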
“The Logical Data Warehouse (LDW) is a new data management architecture for analytics which combines the strengths of traditional repository warehouses with alternative data management and access strategies. The LDW will form a new set of best practices by the end of 2015.” — Gartner Hype Cycle for Information Infrastructure
The popularity and use of LDWs continue to increase with cloud-based deployments of analytic environments. This means more disruptions for the data warehouse market as well as new expectations for the LDW. As the number of organizations looking to adopt cloud analytics grows, the potential exists to change the view of the entire industry, and leave some conventional companies behind.
One of the newest approaches to analyzing big data (which may also be stored in the cloud) is the enterprise data lake. Ironically, this data repository method is thought of as a more natural approach since vast amounts of raw data are stored in their native formats until needed. A data lake also uses a flat architecture for storage, unlike the data warehouse, which stores data in a database or hierarchical file system.
Here are some other key differences between a data lake and a data warehouse:
- Data: a warehouse holds processed, structured data; a lake retains any data, structured or not, in its raw native form.
- Processing: a warehouse is schema-on-write (structure is defined before loading); a lake is schema-on-read (structure is imposed at query time).
- Users: warehouses primarily serve business professionals and reporting; lakes also serve data scientists doing exploratory analytics.
- Agility: changing a warehouse schema is costly; a lake can absorb new data types and sources with little rework.
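The flat, raw-storage approach can also be sketched briefly. In the toy below, a hypothetical `land_raw` function writes each payload unchanged under a flat, date-stamped key, the way objects land in a data lake; the lake root, key scheme, and source names are all illustrative assumptions.

```python
# Sketch of flat, schema-on-read storage: raw records land in their native
# format under flat date-stamped keys; no schema is applied at write time.
import os

def land_raw(lake_root, source, event_date, payload_bytes, ext):
    """Write a raw payload unchanged; the flat key encodes source and date."""
    key = f"{source}__{event_date}__{abs(hash(payload_bytes))}.{ext}"
    os.makedirs(lake_root, exist_ok=True)
    path = os.path.join(lake_root, key)
    with open(path, "wb") as f:
        f.write(payload_bytes)  # stored as-is; interpreted only when read
    return path
```

Contrast this with a warehouse load, where the same bytes would be parsed, conformed to a schema, and rejected if they did not fit.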
Many companies are now launching data lake initiatives, often as their first Hadoop project. Since Hadoop is an open-source data management platform, it is easy to see why businesses are attracted to its (theoretically) lower cost and compelling ecosystem. There is no doubting its immediate value for providing a “Data as a Service” solution. However, as with other promising platforms, there are inherent problems with using Hadoop to build the enterprise data lake. Some of the crucial elements to making it work right, which are at the heart of the data lake debate, are to regularly partition and streamline it for performance, keep the data lake clean and well organized, and periodically eliminate irrelevant and unusable data.
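One of those hygiene tasks, periodically eliminating stale data, can be sketched as a simple retention sweep. This is an illustrative assumption, not a recommended tool: the retention window, the directory layout, and the `purge_stale_files` helper are all hypothetical, and a real lake would use the platform's own lifecycle policies.

```python
# Sketch of one data-lake hygiene task named above: periodically removing
# files older than a retention window. Layout and window are assumptions.
import os
import time

def purge_stale_files(root, max_age_days):
    """Delete files whose modification time is older than the retention window."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```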
Ideally, data lakes can achieve company goals by offering great value and flexibility to business teams. With an adequate amount of foresight and planning, as well as strong data governance practices, organizations can automate data lake initiatives enough to handle their size and scope, and thus maximize their business intelligence value.
“Data lakes that can scale at the pace of the cloud remove integration barriers and clear a path for more timely and informed business decisions.” — Brian Stein & Alan Morrison, PwC, Technology Forecast: Rethinking Integration, Issue 1, 2014
As new advancements in technology emerge, big data keeps getting bigger. Consider for example how the “Internet of Things” (IoT) is transmitting ever-increasing amounts of information, which in turn is creating more storage options, usually in the cloud. Storing all that data is only the beginning of course.
In the case of IoT, leveraging that data means being able to integrate it in the ways described above. But it’s also about things like aggregation at the edge, and security. The key is having agile strategies that allow you to effectively utilize as much of that stored data as possible … and to enable your business to achieve its information-driven goals within the frameworks of the future.
In life and in business, there will always be more data, and newer, better ways to store and leverage it through integration methodologies. Think about which one will help you use that data in the ways that matter most to you. Give IRI a call if you need some help.