Both data warehouses and data lakes offer robust options for ensuring that data is well-managed and prepped for today’s analytics requirements. However, the two environments have distinctly different roles, and data managers need to understand how to leverage the strengths of each to make the most of the data feeding into analytics systems.
Data warehouses are repositories of structured, transformed data configured for specific applications. They serve as central locations for integrated data from one or more disparate sources, said Ryan Wisnesky, co-founder of Conexus. “They store current and historical data and are used for creating trending reports such as annual and quarterly comparisons. A data warehouse is highly transformed and data is not loaded to the data warehouse until the use for it has been defined.”
Typically, data warehouses “support functions that are used to create reports, understand trends, and make more tactical decisions that address day-to-day and short- to medium-term business activity,” said Sri Raghavan, director of data science and advanced analytics product marketing at Teradata. Data lakes, on the other hand, typically see a lot of analytics activity and are used to investigate, discover insights, and address a more holistic set of business challenges. They usually require data and analytics functions that are not a part of the data warehouse environment, Raghavan noted.
Data lakes draw in data from all sources, whether for defined or unspecified purposes. They serve as repositories for raw, unprocessed data straight from data sources, and this data may reside in the lakes until needed at a future time. While a data warehouse may be more akin to a city water supply, a data lake “is more like a body of water in its natural state,” said Wisnesky. “Data flows from the streams (the source systems) to the lake. Users have access to the lake to examine, take samples, or dive in. Data lakes retain all data. All data is loaded from source systems. No data is turned away.”
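The contrast Wisnesky draws can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the record fields, the `sales` table, and the helper names `load_warehouse`, `load_lake`, and `read_lake` are hypothetical, and an in-memory SQLite database and a plain list stand in for a real warehouse and a real lake.

```python
import json
import sqlite3

# Hypothetical source records; the field names are illustrative only.
SOURCE_EVENTS = [
    {"order_id": 1, "amount": "19.99", "quarter": "2023-Q1", "note": "gift"},
    {"order_id": 2, "amount": "5.00", "quarter": "2023-Q2"},
    {"sensor": "door-7", "reading": 0.42},  # no use has been defined for this yet
]

WAREHOUSE_FIELDS = {"order_id", "amount", "quarter"}


def load_warehouse(events):
    """Warehouse style: transform first, load only rows that fit the defined schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL, quarter TEXT)"
    )
    for event in events:
        if WAREHOUSE_FIELDS.issubset(event.keys()):
            conn.execute(
                "INSERT INTO sales VALUES (?, ?, ?)",
                (event["order_id"], float(event["amount"]), event["quarter"]),
            )
    conn.commit()
    return conn


def load_lake(events):
    """Lake style: keep every record as raw JSON text; nothing is turned away."""
    return [json.dumps(event) for event in events]


def read_lake(lake, required_fields):
    """Apply a schema only at read time, once a use for the data is defined."""
    records = (json.loads(line) for line in lake)
    return [r for r in records if required_fields.issubset(r.keys())]


warehouse = load_warehouse(SOURCE_EVENTS)
lake = load_lake(SOURCE_EVENTS)
warehouse_rows = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
sensor_rows = read_lake(lake, {"sensor", "reading"})
```

In this sketch the warehouse ends up holding only the two sales rows, while the lake keeps all three records, including the sensor reading that had no defined purpose at load time and can still be retrieved later.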
The good news is that both environments can be supported at the same time. “In some cases, enterprises are operating an open data lake right alongside of the data warehouse,” said Dave Mariani, co-founder and chief strategy officer at AtScale. The choice, he noted, often depends on the business case at the end of the data funnel. “It’s not so much a question of which product you should use for your data; rather, it’s a matter of having purpose and intent around how you’re going to use your data and being able to do something with it that is the gold standard,” agreed Nima Negahban, CTO and co-founder at Kinetica.
DATA-DRIVEN FUNCTIONS
It’s important that enterprises understand which functions suit each type of environment. Data warehouses, for one, are traditionally seen as systems of record, implying that the data within these environments is well-organized, mapped, and supported, and has some level of quality, said Kim Kaluba, senior manager of data management solutions at SAS. Data warehouses best support CRM-, ERP-, EDW-, and MDM-type initiatives, which require stable and trusted data for decision-making functions, she said.
At the same time, Kaluba continued, data lakes “offer inexpensive options to traditional database systems.” They expedite processing and function as more of a sandbox or investigational environment for data. Since data lakes are rarely managed and supported to the degree of the data warehouse, Kaluba added, “the data functions or business needs they best support include exploratory analytical functions where raw, unrefined, and large data is used to test new algorithms, identify insights, and answer questions.”
The bottom line is that data warehouses “are best suited to providing high-performance, ad hoc analytics whereas data lakes are more suitable for use cases where raw data access is required,” said Mariani. “Data warehouses are ideal for analytics because the data is usually cleaned and normalized. In addition, data warehouse architectures are optimized for analytics. In contrast, a data lake is best utilized as a landing zone for raw data for use in downstream applications and data warehouses. Data lakes are optimal for data science workloads, where access to granular data is needed.”
The low-cost availability of storage enables enterprises to increasingly use data lakes, agreed Chris Bergh, CEO of DataKitchen. “A data lake utilizes simple storage to retain the organization’s critical data. Data analysts commonly understand data lakes as a repository for raw data. Processed data can also be deposited into a data lake, allowing it to be more easily combined with other data.” That’s because “in its native and isolated form, accessing data is difficult.” Imagine a new analytics project that needs to work with data from a series of databases containing CRM, ERP, syndicated, and sales-channel data, he noted. “Accessing data in each of these repositories is time-consuming and requires authorization and specific skills. Collocating data all in one place makes it much easier to work with. The data lake serves as the common repository for the various data sources, greatly simplifying the job of transformation.”
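Bergh's point about collocation can be sketched in a few lines of Python. This is a rough illustration, not a description of any vendor's product: the three extracts, their field names, and the `combine_by_customer` helper are all hypothetical, and a dictionary stands in for the lake's common storage.

```python
from collections import defaultdict

# Hypothetical extracts that would normally sit in separate systems,
# each with its own access rules and query interface.
CRM_EXTRACT = [
    {"customer_id": "c1", "name": "Acme"},
    {"customer_id": "c2", "name": "Globex"},
]
ERP_EXTRACT = [
    {"customer_id": "c1", "invoice_total": 1200.0},
    {"customer_id": "c1", "invoice_total": 300.0},
    {"customer_id": "c2", "invoice_total": 50.0},
]
SALES_CHANNEL_EXTRACT = [
    {"customer_id": "c2", "channel": "web"},
]

# The "lake": every extract lands in one common place, keyed by source.
lake = {
    "crm": CRM_EXTRACT,
    "erp": ERP_EXTRACT,
    "sales_channel": SALES_CHANNEL_EXTRACT,
}


def combine_by_customer(lake):
    """Join the collocated sources on customer_id in a single pass per source."""
    combined = defaultdict(
        lambda: {"name": None, "invoice_total": 0.0, "channels": []}
    )
    for row in lake["crm"]:
        combined[row["customer_id"]]["name"] = row["name"]
    for row in lake["erp"]:
        combined[row["customer_id"]]["invoice_total"] += row["invoice_total"]
    for row in lake["sales_channel"]:
        combined[row["customer_id"]]["channels"].append(row["channel"])
    return dict(combined)


profiles = combine_by_customer(lake)
```

Once the extracts share one location, the transformation reduces to ordinary in-process joins; the per-system authorization and access skills Bergh mentions are needed only when the data first lands in the lake.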
THE FUTURE OF DATA WAREHOUSES
What is the future of the data warehouse in the emerging real-time, data-driven enterprise? How is its role changing, and how does it fit into the picture? “The need for warehouses hasn’t changed much; however, now they are being accessed through the cloud in many instances,” said Wisnesky. The problem, he said, is that cloud platforms can create interoperability problems by becoming a new type of silo, “especially given that ELT technologies encourage deferring schema construction.”
Data warehouses, which once focused on historical data, are also taking on real-time duty. Machine learning and AI modeling allow data warehouses to operationalize those models so that the gap between an activity, such as a customer purchase, and the response, in the form of product recommendation, is a matter of seconds as opposed to days or weeks, Wisnesky said. Data warehouses can also handle much larger datasets as they speed through rapid analysis. “Computing power and memory have advanced to the point that data warehouses can process much larger and more complicated datasets,” said Mariani.
Both data lakes and data warehouses can be supported at the same time since it is not so much a question of which product you should use for your data but instead a matter of having purpose and intent regarding how you’re going to use it. A new vision of a hybrid environment, called the ‘lakehouse,’ provides a structured transactional layer to a data lake, allowing many of the use cases that would traditionally have required legacy data warehouses to be accomplished with a data lake alone.