REAL-TIME AND UNIFIED ANALYSIS
The data warehouse has adapted by moving from on-premises to the cloud, and it will continue to adapt, Mariani noted. However, when it comes to real-time processing, data lakes present a better choice. “For real-time workloads, data warehouses are not ideal because even this new generation of data warehouse requires that data be loaded, thereby introducing latency,” he said.
While the traditional data warehouse “focused on the first mile of ingesting and storing data for analysis, a modern data warehouse both ingests and stores data, and analyzes that data in real time as it is received,” said Negahban. “Modern data warehouses will deliver real-time analysis on incoming data streams, while incorporating all of an organization’s data and applying cutting-edge location intelligence and machine learning-powered predictive analytics.” Data warehouses of the future that process data in real time and unify analysis of the data in different formats—such as relational, geospatial, graph, and time series—at scale will benefit from increased accuracy and detail for customers across industries, Negahban noted.
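As a rough illustration of the difference, a streaming engine can aggregate events while they are still in flight, rather than loading them into storage first and querying later. The sketch below uses Apache Spark Structured Streaming with its built-in “rate” source standing in for a real event feed; the windowed count is illustrative only, not a reference to any particular vendor’s product.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

# Sketch of analyzing a stream while it is still arriving, using Spark
# Structured Streaming; the built-in "rate" source (which emits
# timestamp/value rows) stands in for a real event feed.
spark = SparkSession.builder.appName("realtime-sketch").getOrCreate()

events = spark.readStream.format("rate").load()

# Aggregate over one-minute windows as data flows in, rather than
# loading it into storage first and querying it later.
per_minute = events.groupBy(window("timestamp", "1 minute")).agg(
    count("value").alias("events")
)

query = (
    per_minute.writeStream.outputMode("complete").format("console").start()
)
query.awaitTermination()  # runs until interrupted
```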
Ultimately, the success of data warehouses going forward comes down to “semantics, semantics, semantics,” said Wisnesky. “In 2020 and beyond, the new challenge for data warehouses is how to best internalize the domain semantics in a way that provides the most value to users. For example, a data warehouse that automatically knows that two entities—say, Pete and Peter—are actually the same can internalize that fact so that anyone who queries the warehouse will be made aware of the fact that there are two references to the same real-world entity. Similarly, a warehouse that automatically knows what risk is because it has internalized an ontology such as the Financial Industry Business Ontology can provide semantic query capability to users. We see lightweight knowledge graphs—as opposed to decades-old semantic web technology—as being the harbinger of semantics in 2020.”
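A minimal sketch of the kind of entity resolution Wisnesky describes might look like the following, where an alias table maps “Pete” and “Peter” to one canonical entity so that every query surfaces both references. The names, identifiers, and records are hypothetical.

```python
# Minimal sketch of semantic entity resolution: queries against the
# warehouse resolve aliases ("Pete", "Peter") to one canonical entity.
# All names, IDs, and records here are hypothetical illustrations.

ALIASES = {
    "pete": "person:001",
    "peter": "person:001",  # same real-world entity
    "ann": "person:002",
}

RECORDS = [
    {"entity": "person:001", "source": "crm", "name": "Pete"},
    {"entity": "person:001", "source": "billing", "name": "Peter"},
    {"entity": "person:002", "source": "crm", "name": "Ann"},
]

def query(name: str) -> list[dict]:
    """Return every record for the canonical entity behind a name."""
    entity_id = ALIASES.get(name.lower())
    return [r for r in RECORDS if r["entity"] == entity_id]

# A query for "Pete" surfaces both references to the same entity,
# one from the CRM system and one from billing.
print(query("Pete"))
```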
An emerging generation of ETL visualization tools may also increase the value of data warehouses into the future. “The visualized ETL process that is essential to integrate data from multiple source systems, especially the legacy systems, is the technology having the most positive impact on enterprises’ ability to compete on data,” according to Alex Ough, senior CTO architect at Sungard Availability Services. “Machine learning models, along with the frameworks used to train the models, have improved significantly. These technologies have made it easier for less skilled individuals to train models with high accuracy. However, data engineering is still very complex and time-consuming, as many of the processes need to be done manually, especially when there are multiple sources of truth with duplicated data in legacy systems. In most cases, pre-processing data requires a deep knowledge of SQL or other programming languages to define relationships among the source data, remove duplicates, and clean mistyped data to improve data quality. Having top-notch ML algorithms and frameworks is useless if you cannot prepare quality data.”
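As a rough sketch of the manual pre-processing Ough describes, the following pandas snippet merges two hypothetical legacy sources, removes duplicated rows, and coerces mistyped values; the tables and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical extracts from two legacy source systems holding
# overlapping customer data; column names are illustrative only.
crm = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "spend": ["100.50", "200", "200", "oops"],  # mistyped value
})
billing = pd.DataFrame({
    "customer_id": [2, 3],
    "region": ["EU", "US"],
})

# Remove duplicated rows that exist because multiple systems each
# held a copy of the same "source of truth".
crm = crm.drop_duplicates(subset="customer_id")

# Clean mistyped numeric data: unparseable values become NaN rather
# than silently corrupting downstream aggregates.
crm["spend"] = pd.to_numeric(crm["spend"], errors="coerce")

# Define the relationship between the two sources with a join.
clean = crm.merge(billing, on="customer_id", how="left")
print(clean)
```

Even in this toy case, someone has to know that customer_id is the join key and that spend should be numeric, which is exactly the kind of knowledge Ough argues still has to be supplied by hand.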
TIME FOR A DATA ‘LAKEHOUSE’?
While data lakes may be more ideally suited for fast-paced, real-time requirements, they can be more cumbersome to manage than data warehouses. For example, it’s difficult to automate the way they are used. “If an enterprise relies on the cloud, data ingestion into a cloud data lake is usually a laborious process, given the immutable nature of such systems,” said Raghavan. Data workflows need to be built and managed with a view toward smooth orchestration across multiple environments, including multi-cloud and hybrid cloud scenarios, while dealing with environment-specific differences in governance, metadata management, and user experience, Raghavan added.
In addition, data governance and quality are a challenge with data lakes. “Appending and modifying data is hard, jobs fail without notification, and keeping historical versions is costly,” pointed out Joel Minnick, vice president of product marketing for Databricks. It’s also difficult to handle large metadata catalogs, and consistent performance “is elusive due to small file problems and complicated partitioning.” Finally, it’s a constant headache to maintain data quality, he added.
Mark Fernandes, managing partner at Sierra Ventures, said he has often seen companies build a data lake and quickly start ingesting large amounts of data. “Soon after, the lake turns into a swamp with a lack of visibility and compromised data quality,” he said. “End users don’t feel confident in the data, and analytics projects come to a halt. Data lake technology stacks based upon Hadoop can be complicated and challenging to manage, especially when you start migrating to the cloud and integrating the various tools needed for data ingestion, quality, data management, governance, preparation, etc. Lastly, many data lake projects fail because they weren’t built with a business use case in mind.”
Minnick said a new vision of a hybrid environment, called the “lakehouse,” is emerging, which “provides a structured transactional layer to a data lake to add data warehouse-like performance, reliability, quality, and scale. It allows many of the use cases that would traditionally have required legacy data warehouses to be accomplished with a data lake alone.” A lakehouse architecture also can support “unstructured data like video, audio, and text, as well as structured data that has traditionally been the domain of legacy systems.”
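Databricks’ open source Delta Lake project is one concrete implementation of such a transactional layer. The following PySpark sketch assumes the Delta Lake package is installed and configured on the Spark cluster; the path and sample rows are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a transactional table layer over data lake files,
# using the open source Delta Lake format as one concrete example.
# Assumes Delta Lake is installed and configured for this session;
# the path and data are hypothetical.
spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "click"), (2, "purchase")], ["user_id", "event"]
)

# Writes are ACID transactions; a failed job leaves no partial files
# behind, addressing the "jobs fail without notification" problem.
df.write.format("delta").mode("append").save("/lake/events")

# Readers always see a consistent snapshot of the table, and earlier
# versions remain queryable.
snapshot = spark.read.format("delta").load("/lake/events")
snapshot.show()
```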
SUPPORTING DATA-DRIVEN INITIATIVES
Establishing the best environments for supporting data-driven initiatives using AI, machine learning, and IoT is a learning process, industry observers noted. “Data lakes can be just as suitable as traditional data warehouse systems for analytical processes and data-driven initiatives if they are grounded in a comprehensive data strategy supported by data governance and data management processes,” said Kaluba. “This ensures that the data inside of the data lakes is reliable for organizational decisioning processes.”
It may be more efficient to keep storage and compute separate as well. “Decoupling of storage and compute reduces costs and improves scalability,” advised Fernandes. “Data can be stored in a cloud environment like AWS S3, and compute clusters can be spun up as needed to run workloads or queries. This type of elastic provisioning and pay-per-use are key requirements for modern data warehouses. Enterprises are also looking for integrated data governance and self-service data access to support various downstream applications, including artificial intelligence and machine learning use cases.”
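As a sketch of this decoupling, a serverless query service such as Amazon Athena supplies compute on demand against data resting in S3; the bucket, database, and table names below are hypothetical.

```python
import boto3

# Sketch of decoupled storage and compute: data rests in S3 while a
# serverless engine (Amazon Athena here) supplies compute on demand.
# Bucket, database, and table names are hypothetical.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT region, SUM(spend) FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={
        # Results land back in cheap object storage, not on the engine;
        # no cluster sits idle between queries.
        "OutputLocation": "s3://example-query-results/"
    },
)
print(response["QueryExecutionId"])  # poll this ID for completion
```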
In addition, it’s important not to rush into anything. Companies that move into AI before mastering the fundamentals—whether their data is in a data lake or a data warehouse—will end up paralyzed, Mariani cautioned. “Not only do you need to be good at data engineering and business analytics, you also need to embrace advanced automation. Lack of automation means people manually use the keyboard to process pipelines, do ETL, move data, and create downstream assets, which does not scale. All of these activities can and should be automated.”
Enterprises are constantly looking for ways to use data across the business to build smart analytical applications that drive competitive advantage, said Negahban. “Traditional data warehouses do not address the need to integrate data across all aspects of the business—custom applications, IoT, or analytics dashboards. A modern data warehouse should provide a full set of APIs to embed analytics in applications. By taking an API-first approach to building data-driven applications, a modern data warehouse is able to present data at any point of user interaction, giving the business the flexibility to use the tools, apps, and platforms it prefers across departments.”
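One way to read “API-first” is that the warehouse’s query layer is wrapped in endpoints any app, dashboard, or device can call. The Flask sketch below is a hypothetical illustration; run_warehouse_query is a stand-in for whatever client library a given warehouse actually provides.

```python
from flask import Flask, jsonify

# Sketch of an API-first pattern: warehouse analytics exposed through
# a small HTTP endpoint so any tool or app can embed them.
app = Flask(__name__)

def run_warehouse_query(sql: str) -> list[dict]:
    # Hypothetical placeholder; a real system would call the
    # warehouse's driver or REST API here.
    return [{"region": "EU", "active_users": 1204}]

@app.route("/api/metrics/active-users")
def active_users():
    rows = run_warehouse_query(
        "SELECT region, COUNT(*) AS active_users "
        "FROM sessions GROUP BY region"
    )
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=8080)
```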
EMERGING APPROACHES
An emerging discipline, DataOps, may also help bring greater order to data lake or data warehouse management. “Imagine a 50-person team managing numerous large integrated databases for a big insurance or financial services company,” said Bergh. “Their customers—colleagues in a business unit—have lots of questions that drive new and updated analytics, but the data team can’t keep up due to heavyweight processes, serialization of tasks, overhead, difficulty in coordination, and lack of automation. They need a way to increase collaboration and streamline the many inefficiencies of their current process without having to abandon their existing tools. DataOps automates the orchestration of data to production and the deployment of new features, both while maintaining impeccable quality. DataOps can be incredibly beneficial to both data lake and data warehouse agility in large data teams.”
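As a sketch of what that orchestration can look like in practice, the following uses Apache Airflow, one common open source orchestrator, to chain ingestion, validation, and publication into an automated daily pipeline; the task bodies are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Sketch of DataOps-style orchestration with Apache Airflow: each
# pipeline step becomes an automated, monitored task instead of a
# manual, serialized job. Task bodies are hypothetical placeholders.

def ingest():
    print("pull data from source systems")

def validate():
    print("run automated data-quality tests")

def publish():
    print("deploy the refreshed analytics tables")

with DAG(
    dag_id="dataops_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",  # runs without human intervention
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="validate", python_callable=validate)
    t3 = PythonOperator(task_id="publish", python_callable=publish)
    # Tasks run in order; failures surface in the scheduler rather
    # than silently stalling the team.
    t1 >> t2 >> t3
```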
When it comes to the best ways to manage data lakes to support data-driven initiatives, “first identify your key use cases and key business sponsors, and organize your data initiative to achieve use-case success,” said Fernandes. “Then create a solid foundation for your data lake by leveraging an agile and flexible DataOps approach to automate processes, standardize governance, and provide self-service access to the data. DataOps optimizes the full data cycle by controlling data sprawl and managing the entire supply chain of data from ingestion to consumption,” he noted. “Finally, use augmented data management approaches and a unified platform to enable and govern data lake/store functions, such as cleansing, deduplication, and data classification, and to gain visibility and insights into the data lake’s health and usage.”
Organizations should also consider looking into the hybrid lakehouse approach, Minnick advised. “It builds on the best qualities of data warehouses and data lakes to provide a single solution for all major data workloads and supports use cases from streaming analytics to BI, data science, and AI. Historically, companies have been forced to create data silos with legacy data warehouses and data lakes, and use them separately for BI and AI use cases. This results in information inequality, high costs, and slower operations. By combining all the data onto the same open, high-performance, low-cost platform, the entire organization is able to move faster and make better decisions.”
For those relying more on data lakes as core enterprise repositories, Wisnesky suggested that enterprise data managers build data models. “A data lake is a data storage device. The data stored in it still has underlying meaning, even if that meaning isn’t formalized as a data warehouse schema. Automation is driven by formalization. The best-managed data lakes are actually data warehouses of data warehouses.”
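A small example of such formalization, under the assumption that the lake holds JSON files, is to declare an explicit schema rather than letting the engine infer one; the path and fields below are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType,
)

# Sketch of formalizing the meaning of raw lake files: an explicit
# schema turns "just storage" into data with a declared model that
# downstream automation can rely on. Path and fields are hypothetical.
spark = SparkSession.builder.appName("lake-schema-sketch").getOrCreate()

orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# Reading with an enforced schema surfaces files that drift from the
# declared model instead of silently inferring something different.
orders = spark.read.schema(orders_schema).json(
    "s3a://example-lake/orders/"
)
orders.createOrReplaceTempView("orders")  # queryable like a warehouse table
```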