There's a new buzzword on the loose: the "data lake." At first glance, a data lake could easily be mistaken for a data warehouse.
The two big data concepts have a common focus on analytics and they may, in certain situations, produce roughly equivalent output. But that’s about where their similarities end. For starters, a data warehouse stores a massive amount of structured data, usually in a big and expensive server with a big and expensive storage array.
A data lake, on the other hand, accesses a massive amount of unstructured, semi-structured, and structured data, often from commodity storage subsystems across many scaled-out nodes. Another key difference is that a data warehouse sits on top of well-designed data models, often produced using the rigorous techniques of the Kimball framework and often housed in a proprietary multi-dimensional data store (e.g., SQL Server Analysis Services). In contrast, a data lake sits on top of un-modeled data stored in a Hadoop/Hive/HBase stack.
U-SQL
In keeping with Microsoft’s rapid-fire product development trend of the last few years, the data lake is now on offer via Azure Data Lake as a public preview (www.azure.com/datalake). Going one step further than the competition, Microsoft has also introduced U-SQL, a SQL variant that unifies the declarative power of SQL and the extensibility of C# to make it easy to write custom Big Data processing. It also unifies processing over all data, anything from fully structured to fully unstructured, and across both local and remote SQL data sources. U-SQL has several compelling advantages over other big data languages like Hive:
- Allows the seamless integration and powerful extensibility of SQL capabilities with your own C# code.
- Provides the ability to query and merge data from a variety of distinct data sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.
- Scales easily and efficiently to any size data without requiring all of the plumbing code for parallel executions, multi-node optimizations and scale-out topologies required by other big data languages like Hive.
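To make the first point concrete, here is a minimal sketch of what a U-SQL script can look like. The file paths, column names, and filter value are hypothetical and chosen only for illustration; the script assumes a tab-separated input file with exactly these four columns exists in the Data Lake store.

```usql
// Hypothetical input path and schema, for illustration only.
@searchlog =
    EXTRACT UserId int,
            Region string,
            Query string,
            Duration int
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// The shape of the statement is SQL, but the expressions are C#:
// Region.ToUpper() is a .NET string method and == is the C# equality operator.
@result =
    SELECT UserId,
           Region.ToUpper() AS Region,
           Duration
    FROM @searchlog
    WHERE Region == "en-gb";

// Hypothetical output path.
OUTPUT @result
    TO "/output/SearchLog-transformed.csv"
    USING Outputters.Csv();
```

Notice that the script contains no partitioning or scale-out logic; under the claims above, distributing the work across nodes is left to the service rather than to the script author.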
Let's Learn More!
I asked Michael Rys, Microsoft’s principal program manager for these technologies, about where the public should begin the learning process.
“The new Microsoft Azure Data Lake services for analytics in the cloud (bit.ly/1VcCkaH) includes a hyper-scale repository; a new analytics service built on YARN (bit.ly/1iS8xvP) that lets data developers and data scientists analyze all formats of data; and Azure HDInsight (bit.ly/1KFywqg), a fully managed Hadoop, Spark, Storm and HBase service,” he said. “Developers will want to check out U-SQL for free using Visual Studio Community Edition, with Azure SDK (via Web Platform installer) and ADL Tools for VS (http://aka.ms/adltoolsVS) or go online to get an Azure Data Lake Analytics account at http://www.azure.com/datalake.”
To get started, read the introductory blog post (https://blogs.msdn.microsoft.com/visualstudio/2015/09/28/introducing-u-sql-a-language-that-makes-big-data-processing-easy/) or the MSDN Article (https://msdn.microsoft.com/en-us/magazine/mt614251.aspx) about Azure Data Lakes.
Documentation and Samples
ADL and U-SQL have far more documentation ready at the preview stage than the product releases I’ve been used to in the past. Here’s a brief run-down:
Enjoy exploring this big new feature set and share what you learn! As always, I look forward to hearing about your experiences.