Call them what you will: online analytical processing (OLAP) databases, enterprise data warehouses (EDWs), or massively parallel processing (MPP) databases. Databases designed for analytical workloads rather than for transactional purposes represent a huge segment of the overall database market.
Originally, OLAP workloads were run on the same platforms as online transaction processing (OLTP) workloads; you might see one Oracle RDBMS performing OLTP while another performed OLAP. However, as data volumes increased, databases specifically designed for OLAP workloads emerged. The enterprise data warehouse became key to an increasingly robust business intelligence (BI) market.
A little more than a decade ago, the EDW market was disrupted by Hadoop. Hadoop was able to query data stored in raw formats and was able to apply unprecedented parallelism to the task by deploying massive clusters of commodity servers. Hadoop provided a SQL query capability (Apache Hive) which allowed integration with existing BI systems.
Most significantly, while extract, transform, load (ETL) workflows were required to move data from source systems into the EDW, data in Hadoop could be left in its native format, removing the delay involved in translating the data from native format to the EDW schemas.
During the Hadoop era, enterprises were encouraged to abandon the EDW in favor of a “data lake.” The data lake was a vast repository of structured and unstructured data that could be exploited using Hadoop to competitive advantage.
However, as companies migrated their assets to the cloud, they typically found alternatives to both the Hadoop storage layer (HDFS) and the Hadoop processing engine. Furthermore, many data lakes became “data swamps” full of poorly defined and inconsistent datasets with no clear means of navigation.
But there’s still life in the data lake concept, as evidenced by the growing success of Dremio. Dremio describes itself as “the cloud data lake” platform. It provides a cloud-based engine that layers over cloud object storage such as Amazon S3 and Azure Data Lake Storage, or even over legacy Hadoop systems.
Why would Dremio succeed where Hadoop ultimately failed? First, Dremio is cloud-native. Hadoop failed to provide a compelling offering when businesses moved assets from on-premises systems into the cloud. Dremio is completely optimized for cloud use cases.
Second, while Hadoop was scalable, it was too slow for real-time BI. The Hadoop SQL engine was orders of magnitude slower than EDW alternatives. In contrast, Dremio provides a columnar in-memory cache, reflections (similar to materialized views), and sophisticated parallel query optimization, which together allow real-time execution of queries.
Third, data lakes became data swamps because of poor metadata management. It was not always possible to determine the definition and meaning of the data held in a Hadoop data lake. To mitigate this failing, Dremio supports a semantic layer that adds business meaning to the data within the lake.
Dremio has succeeded in generating serious adoption and recently completed a $135 million Series D round, demonstrating investor faith in the vision.
Dremio sees its data lake model as eventually supplanting the monolithic data warehouse. And while this rhetoric is reminiscent of the height of the Hadoop era, there are definitely increasing economic incentives to analyze mass data stored on cheap cloud storage rather than migrate it to relatively expensive database storage.
However, it’s definitely too early to posit the death—again!—of the data warehouse. The economic arguments for the data lake arguably fail to take into account all of the costs involved in BI processing. The cost of data storage at rest is only one consideration. When large amounts of data must be aggregated in real time, the cost may be higher if that data is stored on “cheap” and unoptimized cloud storage. Transforming data into an optimized EDW schema might result in faster and, therefore, cheaper (in terms of CPU) real-time queries. During that transformation, data is also reconciled and cleansed—leading to the “single view of the truth” that organizations crave.
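The trade-off described above can be made concrete with a rough back-of-the-envelope model. The sketch below is illustrative only: every price, data volume, and per-query CPU figure is a hypothetical placeholder (expressed in integer cents), not a measured or vendor-quoted number, and the function names are invented for this example. It assumes total monthly cost is simply storage at rest plus CPU time spent on queries.

```python
# Hypothetical cost model: all figures are made-up placeholders, in cents.

def monthly_cost(tb_stored, storage_cents_per_tb, queries,
                 cpu_seconds_per_query, cpu_cents_per_second):
    """Total monthly cost = storage at rest + CPU spent answering queries."""
    storage = tb_stored * storage_cents_per_tb
    compute = queries * cpu_seconds_per_query * cpu_cents_per_second
    return storage + compute


def break_even_queries(tb_stored, lake_storage, edw_storage,
                       lake_cpu_seconds, edw_cpu_seconds, cpu_cents_per_second):
    """Query count at which the two totals are equal.

    Assumes the lake has cheaper storage but costlier queries.
    """
    storage_gap = tb_stored * (edw_storage - lake_storage)
    per_query_gap = (lake_cpu_seconds - edw_cpu_seconds) * cpu_cents_per_second
    return storage_gap / per_query_gap


# Assumed inputs: 50 TB; lake storage 2,000 cents/TB, EDW storage 10,000 cents/TB;
# a query burns 60 CPU-seconds on raw lake files but 10 on an optimized schema.
TB = 50
LAKE_STORAGE, EDW_STORAGE = 2_000, 10_000
LAKE_CPU, EDW_CPU = 60, 10
CPU_PRICE = 1  # cents per CPU-second

for queries in (1_000, 20_000):
    lake = monthly_cost(TB, LAKE_STORAGE, queries, LAKE_CPU, CPU_PRICE)
    edw = monthly_cost(TB, EDW_STORAGE, queries, EDW_CPU, CPU_PRICE)
    print(queries, "queries -> lake:", lake, "EDW:", edw)

print("break-even:", break_even_queries(TB, LAKE_STORAGE, EDW_STORAGE,
                                        LAKE_CPU, EDW_CPU, CPU_PRICE))
```

Under these assumed inputs, the lake is cheaper at low query volumes and the warehouse is cheaper once query volume passes the break-even point. The model deliberately omits the cost of the transformation itself and any caching or acceleration, both of which would shift that point.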
Both the data lake and data warehousing models appear to be viable and vibrant segments. Dremio looks well placed to exploit the former, while data warehousing alternatives such as Snowflake exploit the latter.