Triggered by the exponential growth in unstructured data sources, data lakes are regaining popularity among enterprises. While data lakes are extremely efficient at collecting and storing large quantities and many types of data, there is a common misconception that once data is in a data lake, it is readily available for analysis. Misuse is what earns a data lake its bad rap as a "data swamp," leaving businesses to ponder: Why does it take so long to find anything in the data lake, and why is the data I really need not available in the format I need? With the proper design, however, a data lake and a data warehouse are highly complementary.
Let’s begin with the impetus for more and more enterprises turning to data lakes.
Enterprises are inundated with structured and unstructured data from a myriad of sources and locations. There’s data from social media channels, event-oriented data, sensor data, machine learning data and Internet of Things (IoT) data, among other types. According to IBM, a staggering 2.5 quintillion bytes of data is created by businesses every single day. Furthermore, human and machine-generated data is experiencing an overall 10x faster growth rate than traditional business data, and machine data is increasing even more rapidly at 50x the growth rate.
Driven by the need for greater business agility and to cope with the sheer volume of data, companies find the concept of a data lake attractive: a repository that holds raw, disparate data in its native format, with no need to decide how or when the data will be used until it’s needed. And having a repository to mine all this data for insights can offer tremendous competitive advantage.
Perhaps the biggest driver for businesses embracing data lakes is the desire to "self-serve" rather than wait in a long queue for IT to deliver the information they need. But simply skipping the work IT would traditionally do to prepare the data for consumption is not the answer. Instead, self-service requires bringing some semblance of governance and management to the data sets, with the degree of control and consistency driven by usage needs.
Before diving into a data lake, which can be a costly endeavor and take many months to implement, it’s important to understand an organization’s needs, rationale and what’s really at stake, such as: How much data are we dealing with? What’s the ease of accessing data and extracting value? How agile is it to implement and change? Will this work across all our data stores and models today and in the future? How do we maintain and ensure the security, privacy and governance of our data lake?
To reap the full benefits of an enterprise data lake in reducing data silos and delivering greater data access and insights, consider the following practices for building and maintaining a valuable data lake:
Understand the problem that needs solving: Begin with a true understanding of business needs and goals, and ensure both business and IT leadership are aligned on why and when to use data. Generally, the data lake should not replace a data warehouse; the two can work effectively together if delivered with the right skill sets and design. The most complementary usage is to leverage the data lake as a staging area for examining a large unstructured data set and creating structured knowledge in a data warehouse, and to provide access to data that is not typically available in the data warehouse for a broader range of analytics.
For example, by sifting through unstructured customer data such as support calls, emails and social media streams, a data analyst can derive knowledge from the data lake to determine whether sentiment from customers who call the support desk is positive or negative. Flagging a customer as having negative sentiment makes that bit of structured knowledge available to everyone in the organization without the help of a data scientist, provided that the information is stored as a property of the customer on the structured side (e.g., the data warehouse). Then any BI tool or dashboard can leverage that learning derived from the data lake.
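As a rough illustration of that flow, the sketch below scores raw support text and writes a single boolean back onto a structured customer record. Everything in it is an assumption made for illustration: the word lists stand in for a real sentiment model, and the `CustomerRecord` class, the `flag_customers` function and the in-memory `warehouse` dictionary are hypothetical stand-ins for actual data lake and data warehouse tooling.

```python
from dataclasses import dataclass

# Hypothetical word lists; a real project would use a trained sentiment model.
NEGATIVE_WORDS = {"broken", "refund", "angry", "cancel", "terrible"}
POSITIVE_WORDS = {"thanks", "great", "love", "resolved", "helpful"}


@dataclass
class CustomerRecord:
    """Structured row as it might live in the data warehouse (illustrative)."""
    customer_id: str
    name: str
    negative_sentiment: bool = False


def score_text(text: str) -> int:
    """Return positive-word count minus negative-word count for one message."""
    tokens = [word.strip(".,!?") for word in text.lower().split()]
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return positives - negatives


def flag_customers(raw_interactions, warehouse):
    """Aggregate scores per customer and store the result as a customer property."""
    totals = {}
    for customer_id, text in raw_interactions:
        totals[customer_id] = totals.get(customer_id, 0) + score_text(text)
    for customer_id, total in totals.items():
        if customer_id in warehouse:
            warehouse[customer_id].negative_sentiment = total < 0
    return warehouse


# Unstructured text as it might sit in the data lake: (customer id, raw message).
raw_interactions = [
    ("C1", "My device is broken and I want a refund!"),
    ("C1", "Thanks for the call"),
    ("C2", "Great service, issue resolved"),
]

warehouse = {
    "C1": CustomerRecord("C1", "Acme Ltd"),
    "C2": CustomerRecord("C2", "Globex"),
}

flag_customers(raw_interactions, warehouse)
```

Once the flag is stored on the customer record, a dashboard only needs to read `negative_sentiment`; it never has to touch the raw text.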
Create the golden version of truth for everyone: In building a data lake, the ingestion process plays a critical role since data is sourced from disparate systems and apps, demanding multiple ingestion streams. In order to create one version of truth, the data lake should be organized by implementing data governance and tagging to accommodate proper search and querying. The best approach is to ‘clean the rivers not the lake’ by applying data quality as new data flows in. Since most enterprises are dealing with multiple operational systems, it’s important to create "golden" entities across those systems. The systems will have different identifiers, attributes, and hierarchies associated with them.
This is where master data management (MDM) comes in and why it’s critical to harmonize that data. A golden record for entities such as customers, products and suppliers improves the quality of data that’s fed into the data lake by providing a master record in a sea of unstructured data. In an open source Hadoop environment, data storage is cheap, though queries can take ten times as long since there are many identifiers from different systems. By leveraging MDM to build the mapping of golden truth in a data lake, the work of tying the identifiers together is simplified.
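To make the idea of tying identifiers together concrete, here is a minimal sketch of a cross-reference lookup that resolves each system's local identifier to one golden identifier. The system names, the identifiers, the `XREF` table and both functions are hypothetical; an actual MDM product would build and maintain this mapping through matching and survivorship rules rather than a hard-coded dictionary.

```python
# Hypothetical cross-reference: (source system, local id) -> golden id.
XREF = {
    ("crm", "10042"): "G-001",
    ("billing", "AC-77"): "G-001",
    ("support", "u_881"): "G-001",
    ("crm", "10043"): "G-002",
}


def to_golden_id(source, local_id):
    """Resolve a system-specific identifier; return None if it is unmapped."""
    return XREF.get((source, local_id))


def group_by_golden_id(events):
    """Group raw events from different systems under one golden identifier."""
    grouped = {}
    for event in events:
        golden = to_golden_id(event["source"], event["local_id"])
        # Unmapped records are kept visible instead of being silently dropped.
        key = golden if golden is not None else "UNMATCHED"
        grouped.setdefault(key, []).append(event)
    return grouped


# Events landing in the lake from three operational systems (illustrative data).
events = [
    {"source": "crm", "local_id": "10042", "payload": "address change"},
    {"source": "billing", "local_id": "AC-77", "payload": "invoice paid"},
    {"source": "support", "local_id": "u_881", "payload": "ticket opened"},
    {"source": "crm", "local_id": "10043", "payload": "new contact"},
    {"source": "web", "local_id": "anon-5", "payload": "page view"},
]

grouped = group_by_golden_id(events)
```

With the mapping in place, a query about one customer reads a single key instead of joining on three different identifiers. The `UNMATCHED` bucket is a design choice in this sketch: it surfaces records that still need a golden entity.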
Keep the focus: The best approach is to start with the most pressing business issue and fill the lake with data related to the problem at hand. For example, if the data lake serves to provide information about customer sentiment, identify the learning data and then store it in a data warehouse. Then, building on a successful solution to the initial business problem, grow the data lake to accommodate other pressing business issues. Since ingestion is an easy process, only bring in data as needed to solve a particular problem that the business identifies. Trying to bring in everything ("build it and they will come") risks creating needless complications in data management that could impair the ability to find and deliver business value sooner rather than later.
Have a strong connectivity partner: Accessing disparate types of data and bringing them together is a fundamental challenge that requires reliable data connectivity. Reading data from unstructured systems and making it readable by other systems and cloud sources requires the right tools, especially for managing complex job pipelines, since in a data lake the output of one query may be fed as input to another. Pre-built connectors that already know the protocols and structures for extracting the right data are invaluable here.
Use the right tool for the right job: Be sure to solve the problem in the right place. While data scientists with a data lake can solve certain problems, others are best served with a data warehouse. For example, a consumer products company seeking to launch a new product line will benefit from having its product and data science teams collaborate by tapping a data lake in structured and unstructured environments to understand consumer trends. With that, it can tailor and design offerings for targeted customers. By contrast, a retailer that wants to understand the discrepancy between written and delivered sales is focused on specific KPIs from structured content. This type of structured analysis is best served with a data warehouse, so the business user can quickly see all relevant data in a dashboard.
It takes a well-designed and managed data lake to avoid turning into a "data swamp."
Agility and MDM are the cornerstones of effective and valuable data lake development; agile businesses draw new learnings out of data lakes, then write and store that information in a data warehouse.
Data-driven companies are beginning to integrate insights and master data derived from their data lakes back into their operational and analytical environments, and are using MDM to push context and structure back into the data lake. In the end, this complementary approach of data lakes and data warehouses working together is how enterprises can derive the insights they need to grow and transform their business.