Structured data repositories such as data warehouses may be essential for businesses turning to data-intensive approaches such as AI and machine learning. “Many analytics jobs running in an environment like Spark create structured features from unstructured data,” said VanHook. “As features like this feed more machine learning models, as well as traditional reporting, the need for a structured data store to hold the composite features emerges. You can imagine traditional data warehouse capabilities evolving to hold feature tables and becoming a high-performance feature store that can drive both the training and inference activities of machine learning models. At the bottom line, warehouses become the querying point, while lakes are the analytics point.”
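As a rough illustration of the feature-store idea VanHook describes, the Python sketch below derives a few structured features from raw text and holds them in a table that serves both a bulk read for training and a single-row lookup for inference. The table name, the feature definitions, and the function names are all hypothetical, and SQLite stands in for a warehouse purely for convenience; nothing here reflects a specific vendor's product.

```python
import sqlite3


def extract_features(doc_id, text):
    """Turn one unstructured text document into a structured feature row."""
    words = text.split()
    word_count = len(words)
    avg_word_len = sum(len(w) for w in words) / word_count if word_count else 0.0
    exclamations = text.count("!")
    return (doc_id, word_count, avg_word_len, exclamations)


def build_feature_table(conn, docs):
    """Create the (hypothetical) feature table and load one row per document."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_features ("
        "doc_id TEXT PRIMARY KEY, word_count INTEGER, "
        "avg_word_len REAL, exclamations INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO doc_features VALUES (?, ?, ?, ?)",
        [extract_features(doc_id, text) for doc_id, text in docs.items()],
    )
    conn.commit()


def training_rows(conn):
    """Bulk read of every feature row, as a training job might do."""
    return conn.execute(
        "SELECT doc_id, word_count, avg_word_len, exclamations "
        "FROM doc_features ORDER BY doc_id"
    ).fetchall()


def inference_row(conn, doc_id):
    """Single-key lookup, as an inference service might do per request."""
    return conn.execute(
        "SELECT word_count, avg_word_len, exclamations "
        "FROM doc_features WHERE doc_id = ?",
        (doc_id,),
    ).fetchone()


# Assumed sample data: two short pieces of customer feedback.
conn = sqlite3.connect(":memory:")
docs = {"a": "Great product! Love it!", "b": "Late delivery"}
build_feature_table(conn, docs)
```

The point of the sketch is that one structured table answers both access patterns: `training_rows(conn)` returns the full set of feature rows, while `inference_row(conn, "b")` fetches the features for a single document by key.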
The data within data warehouses is generally trusted as the central version of truth because it’s highly curated and processed, said Anjan Kundavaram, chief product officer at Precisely. “For analytics, the structured format of data warehouses makes it easier for standardized access, queries, and reporting. The predetermined structure also offers ready-to-use, clean data that is ideal for organizations that need to conduct operational analysis or reporting.”
The platform “should have a data fabric to drive data flow orchestration and automation to deliver information and intelligence to users,” Winfield said. “The platform will also need shared management and security services and support for a range of clients to meet the application development requirements for different users—including data engineers, data scientists, business analysts, and business users.”
Data fabrics also offer a way to bring these environments closer together to deliver analytics as needed. “Today’s data warehouses are collecting immense amounts of data—more than may have been anticipated when these technologies were originally implemented,” Gnau said. “While data lakes have helped organize this raw data into central repositories, they still are not typically involved in operational and transactional data flows. This is where modern data architectures, such as data fabrics, come into play. Not only do data fabrics effectively organize the datasets into fields that help identify the most actionable and high-quality resources but each one [also] tends to meet a unique, IT-driven purpose. Without a well-orchestrated architecture, the data remains either inaccessible and wasted or not efficiently addressable, regardless of where it sits within the data lake or warehouse.”
TO THE CLOUD—IN MOST CASES
Are analytical platforms such as data warehouses, lakes, or lakehouses going to the cloud? Are there scenarios where on-premises approaches are still preferable? A recent survey of IT leaders found that a majority, 53%, see hybrid or multi-cloud data warehousing as one of this year’s most important data warehousing trends, a larger share than for any other trend. The question isn’t really about “why” to use cloud anymore, said Minnick. “Increasingly, we’re seeing customers now ponder ‘which’ clouds.” Minnick noted that the majority of Databricks’ enterprise customers work with at least two cloud providers today. “As a result, it’s become much more important that organizations adopt solutions that offer a consistent experience for their employees, regardless of where the data resides.”