The data lake has attracted more than its fair share of criticism since its inception. Pundits claim it’s a source of chaos and risk. Analysts often slam the concept, calling it a “data swamp” or “data dump.” As a result of this scrutiny, the definition and understanding of the data lake remain rather murky.
To better understand the current dynamic around the data lake, Radiant Advisors and Unisphere Research partnered in 2015 to survey IT practitioners, stakeholders, and executives. Specifically, the research explored the perceived value and maturity of data lake adoption, along with the perceptions, challenges, and success factors surrounding it. The survey itself was designed based on earlier research that analyzed the successes and struggles of early data lake adopters. (The survey report is available for download at www.dbta.com/DBTA-Downloads/WhitePapers/Data-Lake-Adoption-and-Maturity-Survey-Findings-Report-5688.aspx.)
The new survey found that definitions are becoming more solidified but are not yet crystallized, with respondents defining the data lake as a data architecture for IT (59%) and as a data strategy (67%). Yet even while the definition remains unsettled, the value is real and measurable. Vendor marketing teams are trying to introduce new hope around the data lake concept and to move past the negative connotations. As a whole, the industry must move past the lack of a consistent definition and embrace the potential. The survey showed traction in this respect: adoption is moving forward, with more than 32% of respondents reporting approved data lake budgets for the 2016 fiscal year.
People are deploying data lakes because, from an architecture perspective, the data management principles simply make sense. Certain technologies are best suited to certain capabilities, and the data lake facilitates this “polyglot persistence” principle: using various technologies for their respective advantages, so that each workload plays to the strengths of the analytics being performed and of the underlying database engine.
These strengths include:
• Affordable scalability, which supports cost justification and makes the data lake well suited for discovery.
• Flexibility in loading different formats of data and unknown schemas, which lowers the barrier to data ingestion and increases agility.
• High flexibility for data usage, which supports working with several data engines (both programmatic and SQL-based), thus enabling advanced analytics.
What’s in a Name, Anyway?
In my many years of running architecture workshops, I have encouraged people to forget about labels. For example, when people talked about data warehouses, a single table of attendees would likely produce more than 10 definitions of the data warehouse. The confusion exists because of the label and the baggage everyone brings to their own definition. Talk instead about what the technology enables, its purpose, and its role in the architecture.
The “data lake” is our new label. The data warehouse was criticized for 20 years for its peculiarities and yet has become commonplace; because we have seen this pattern before, we can project that the same will happen with the term “data lake.” For now, forget about the name; focus on the role, purpose, and drivers. Concentrate on matching its strengths to the role’s requirements and on how it fulfills a need that other databases cannot. That role is centralizing all raw, detailed data for varied uses across the enterprise, thereby enabling reusability and governance in the process.
One early adoption use case is as a data warehouse staging area for all incoming data, prior to integration and cleansing. Typically this staging data is deleted after it has been processed (transient data), but data scientists want all of this raw operational data, too, for their algorithm development. To fulfill both requirements, the operational data can be ingested into the data lake, allowing the data warehouse to read from it as its staging area while data scientists work with the raw records to find patterns that may have been scrubbed out of the data warehouse.
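To make the dual use concrete, here is a minimal PySpark sketch, assuming a lake rooted at /datalake, a landing directory of JSON extracts, and illustrative dataset and column names; the cleansing rules are placeholders, not prescribed logic.

```python
# Minimal sketch: land raw operational data once, serve two consumers.
# All paths, dataset names, and cleansing rules are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-staging").getOrCreate()

# 1. Ingest operational extracts into the lake's raw zone, unmodified.
orders_raw = spark.read.json("/landing/orders/2016-01-15/")
orders_raw.write.mode("append").parquet("/datalake/raw/orders/")

# 2. The warehouse ETL reads that same raw zone as its staging area...
staged = spark.read.parquet("/datalake/raw/orders/")
cleansed = staged.dropDuplicates(["order_id"]).filter("order_total IS NOT NULL")
cleansed.write.mode("append").parquet("/datalake/derived/orders/")

# 3. ...while data scientists query the untouched raw records directly,
#    including rows the cleansing rules would have scrubbed out.
exploration = spark.read.parquet("/datalake/raw/orders/")
```

The point of the design is that ingestion happens once: the warehouse’s cleansing pass and the data scientists’ exploration both read the same raw zone, so nothing is lost where staging data would otherwise be deleted.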
Another purpose is holding historical data for analysis. If you start a new project and only then begin pulling data for it, the resulting trend data for the first month is not that exciting. But if you have been collecting raw data over time, you can initially load a predictive model with perhaps years of historical data, yielding richer datasets and more comprehensive, accurate models.
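As a brief illustration of that backfill, assume raw events have been accumulating, date-partitioned, under /datalake/raw/clickstream/; the path, partition scheme, and feature logic below are all assumptions.

```python
# Sketch: a brand-new model starts with years of history, not one month,
# because raw data has been accumulating in the lake all along.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("model-backfill").getOrCreate()

# Assumed layout: /datalake/raw/clickstream/dt=YYYY-MM-DD/...
history = (spark.read.parquet("/datalake/raw/clickstream/")
                .filter("dt >= '2013-01-01'"))  # years of raw events

# Placeholder feature preparation; a real model would do far more here.
features = history.groupBy("user_id").count().withColumnRenamed("count", "events")
```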
The simple fact is that no matter what we call it, we need an architecture component that serves this purpose and role in enterprise data architecture. As organizations embrace this and launch their data lake programs, three very important factors must be addressed at the outset.
Three Things You Should Do Before Data Lake Adoption
1. Decide how you’re going to organize data in the data lake.
You’re going to start bringing in 15 data sources, or 50, and you know the data needs to be centralized and accessible. But how do you accomplish that? You need to decide on the most appropriate organization for your data lake.
One option is to organize by data classification. For this approach, create three data tiers, as listed below (a minimal sketch of one possible layout follows the list). Be aware that the subject area doesn’t particularly matter in this type of structure; only the classification does.
- One tier is raw data: This contains original source data and metadata and should be organized by data source or context. Data in this tier is fully auditable.
- The next tier is derived data: Data in this tier is enhanced, cleansed, and enriched, primarily as the product of data integration and business rules.
- The third tier is aggregated data: This is a subset of the other two tiers in which precalculated aggregations or subsets of the data are built for consumption. Business and application perspective is applied here.
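Here is a minimal sketch of what such a classification-based layout might look like; the root path, tier names, and source names are assumptions for illustration only.

```python
# Sketch of a classification-first lake layout; paths are illustrative.
from pathlib import PurePosixPath

LAKE_ROOT = PurePosixPath("/datalake")
TIERS = ("raw", "derived", "aggregated")

def tier_path(tier: str, source: str, dataset: str) -> PurePosixPath:
    """Build a lake path organized by classification tier, then source."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return LAKE_ROOT / tier / source / dataset

# Raw keeps the auditable original; derived and aggregated build on it.
print(tier_path("raw", "erp", "orders"))      # /datalake/raw/erp/orders
print(tier_path("derived", "erp", "orders"))  # /datalake/derived/erp/orders
print(tier_path("aggregated", "sales", "monthly_totals"))
```

Organizing by tier first means governance policies (full auditability for raw, quality rules for derived) can be applied per directory subtree rather than per source.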
A second approach is to organize by usage or workload. Consider the ways the data in the lake will be used, and assign folders accordingly (see the sketch after this list).
- Folders for discovery: Data in these folders will be used for discovering new relationships, analytic models, and insights. Discovery folders are agile, user- or team-project-oriented, and often temporary. Users bring in data to experiment with and then discard it when the specific discovery is fulfilled.
- Folders for data science and advanced analytics: Use folders for data preparation and to create derived data for data science. If you’re building for operational analytics, your analytics can run on raw data in the cluster without the need to abstract or extract the data from the folder.
- Folders for operational big data: You are likely to have applications that live in Hadoop because they are big data applications. Systems such as HBase simply cannot run just anywhere; they must run inside the scalable cluster.
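And here is the corresponding sketch for a usage-based layout, again with assumed paths and names; the key design point is that discovery areas are disposable while the other areas persist.

```python
# Sketch of a usage-first lake layout; all paths and names are assumptions.
from datetime import date
from pathlib import PurePosixPath

LAKE_ROOT = PurePosixPath("/datalake")

def discovery_sandbox(team: str, project: str) -> PurePosixPath:
    """Temporary, project-scoped folder, discarded when the discovery ends."""
    return LAKE_ROOT / "discovery" / team / f"{project}-{date.today():%Y%m}"

def data_science_area(dataset: str) -> PurePosixPath:
    """Prepared and derived data kept for ongoing model development."""
    return LAKE_ROOT / "datascience" / dataset

def operational_area(app: str) -> PurePosixPath:
    """Home for big applications that must live inside the cluster."""
    return LAKE_ROOT / "operational" / app

print(discovery_sandbox("marketing", "churn"))  # /datalake/discovery/marketing/churn-...
```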
You’ll also need to make decisions around special considerations:
- Proximity: Assess your strategy for cloud versus on-premises deployment, or a hybrid approach. You may decide that most or all of your data will stay in the cloud; you don’t have to pull it down to work on it.
- Security/compliance: You should determine how to handle personally identifiable information and other sensitive data. You may not want to bring sensitive data in at all. Another option is to establish landing zones and landing pads for tokenized data that you can still run analytics on. Masking or vertical partitioning techniques allow you to hide parts of the record as needed (a minimal tokenization sketch follows this list).
- Hadoop as a service: Some companies offer Hadoop in the cloud. Choosing cloud-based Hadoop can simplify the ingestion of frequent, new data (such as log data), allow you to analyze it quickly, and let you then decide whether to bring it into the ecosystem.
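For the security consideration above, here is a minimal tokenization sketch in Python; the keyed-hash approach, field names, and key handling are illustrative assumptions, not a compliance recommendation.

```python
# Sketch: tokenize PII before it lands in the lake so analytics can still
# group and join on a stable token. Field names and key are assumptions.
import hashlib
import hmac

SECRET_KEY = b"replace-with-managed-secret"  # assume a real key service

def tokenize(value: str) -> str:
    """Deterministic keyed hash: same input yields same token, so joins work."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_record(record: dict) -> dict:
    """Tokenize identifiers; drop fields that should never enter the lake."""
    safe = dict(record)
    safe["ssn"] = tokenize(record["ssn"])
    safe["email"] = tokenize(record["email"])
    safe.pop("credit_card", None)  # vertical partitioning: withhold entirely
    return safe

print(mask_record({"ssn": "123-45-6789", "email": "a@b.com",
                   "credit_card": "4111-1111-1111-1111", "order_total": 42.0}))
```

Because the token is deterministic, analysts can still count and join on the tokenized column without ever seeing the underlying identifier.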
2. Determine how to unify workloads in the data lake.
A key advantage of the data lake strategy and architecture is that it facilitates unified workloads. As you adopt the strategy for the data lake, consider the types of workload operations that will occur there and what needs will exist. For instance, what activities around discovery, data science, and business intelligence will take place?