In his Data Summit 2015 presentation, titled “Building a Real-World Data Warehouse,” Elliott Cordo of Caserta Concepts covered a range of topics, including best practices for building a data warehouse on Hadoop that also coexists with processing frameworks and non-Hadoop storage platforms.
But what is driving the adoption of new technologies and techniques, Cordo asked. The answer is that data has changed, with the emergence of semi-structured, unstructured, and sparse data and evolving schemas, and volumes have grown from megabyte to terabyte to petabyte workloads. With this change, cracks in the armor of traditional data warehousing approaches are forming.
The concept of the data warehouse is still sound. It enables the consolidation of data from disparate source systems; provides clean and conformed reference data as well as clean and integrated business facts; and enables adherence to data governance principles. However, he said, businesses can be more successful by acknowledging that the traditional enterprise data warehouse cannot solve all problems today.
The data lake provides a storage and processing layer for all data, where anything can be stored, including source data and semi-structured, unstructured, and structured data; data can be kept for as long as needed; and the layer supports a number of processing workloads. “Here is where Hadoop can help us,” he said. However, he noted, full data governance can only be applied to “structured” data. This can include materialized endpoints such as files or tables, or projections such as a Hive table. Governed structured data must have a known schema with metadata; a known and certified lineage; and a monitored, quality-tested, managed process for ingestion and transformation. Even in the case of unstructured data, structure must be extracted or applied in just about every case imaginable before analysis can be performed.
Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess, said Cordo. Instead, an optimal modern big data warehouse is fully governed; its data is structured, partitioned, and tuned for data access; its governance includes a guarantee of completeness and accuracy; and its consumers are not only data scientists but also ETL processes, applications, data analysts, and business users.
Greg Rahn of Snowflake Computing, a Platinum Sponsor of Data Summit 2015, followed with a discussion of Snowflake’s data-as-a-service platform, which he said makes cloud data warehousing available to everyone. The more data pipelines there are and the more tools that are used on data, the larger the gap between data creation and data consumption. “At Snowflake, our vision has been to reinvent the data warehouse,” he said, explaining that Snowflake provides data as a service designed for the cloud.
To download Cordo’s presentation slides from Data Summit, go to www.dbta.com/DataSummit/2015/presentations.aspx.