Video produced by Steve Nathans-Kelly
At Data Summit Connect Fall 2020, Engineering@Uber's Nishith Agarwal discussed how to build a unified system for data management and outlined its key benefits.
Building a data lake is just part of the problem, said Agarwal; you still need tooling for collecting data from myriad data sources. "Even though you stored it in some format when you ingested it, it may actually need to evolve based on usage patterns that you may have, either query patterns or access patterns, and then you want to monitor the performance of the data lakes. You want to monitor patterns of ingestion correctness so, essentially, you want to build some data observability modules so that you can actually see what's going on and make sure that what you're doing is right and that it's utilized to the best of its abilities."
Stream and Batch
Maintaining multiple technologies for streaming and batch carries significant operational overhead, so the question is: Is there a single system that can manage and perform all of these operations while also providing some unification across stream and batch? he explained.

Agarwal looked at the requirements of the data lake. Ingestion has many challenges and there are many sources. "How do you build a system that just plugs in all of these components? You want to auto-scale for spiky traffic, or for maybe organic explosion and growth. You want to perform schema management. One of the important things about having a data lake which you can use to derive insights is to have high-quality, trustworthy data, and that starts with doing schema management to know what data you're ingesting. And then you need custom transformations for different kinds of business rules, and you will want to update some tables based on some sort of logic. And then, at the end of the day, you want to provide observability and support different kinds of query engines out of the box, so that different query engines can access this data based on use cases you might have."
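The schema-management step Agarwal describes can be sketched in a few lines. This is a minimal, hypothetical illustration, not Uber's actual implementation: the `EXPECTED_SCHEMA` and record shapes are invented for the sketch. The idea is that incoming records are validated against a declared schema before they land in the lake, so bad data is caught, and made observable, at ingestion time.

```python
# Minimal sketch of schema validation at ingestion time.
# EXPECTED_SCHEMA and the sample records are hypothetical.
EXPECTED_SCHEMA = {"user_id": str, "event_time": int, "amount": float}

def validate(record: dict) -> bool:
    """Return True if the record matches the declared schema exactly."""
    if set(record) != set(EXPECTED_SCHEMA):
        return False
    return all(isinstance(record[field], ftype)
               for field, ftype in EXPECTED_SCHEMA.items())

def ingest(records):
    """Split a batch into accepted rows and rejects kept for observability."""
    accepted, rejected = [], []
    for record in records:
        (accepted if validate(record) else rejected).append(record)
    return accepted, rejected

good, bad = ingest([
    {"user_id": "u1", "event_time": 1000, "amount": 9.99},
    {"user_id": "u2", "event_time": "oops", "amount": 1.50},  # wrong type
])
```

A real pipeline would use a schema registry and a format such as Avro or Parquet rather than per-record Python checks, but the shape of the check is the same.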
Batch is great for slow-moving data of the order of tens of minutes and then stream-style processing is generally of the order of some sub-second latency, he explained. Generally, you solve the use cases that you have for low-latency data, but then you treat batch as your fallback and there are different kinds of processing engines that you would use for that, Agarwal noted. "And that's totally fair, but what about storage? Stream systems need some sort of specialized databases or maybe specialized systems [that are] in many cases row-based, although some of them are also columnar-based, and batch systems are generally columnar-based because they tend to be slow to write, but are really good and efficient for scanning. So, what about use cases between 1 and 5 minute latencies? At Uber, we realized there are lots of use cases that fall within this time bucket."
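The row-based versus columnar distinction Agarwal draws can be shown with a toy example, with plain Python lists standing in for real storage formats: a columnar layout lets a scan touch only the one field it needs, which is why batch analytics systems favor it, while a row layout keeps each record together, which suits write-heavy streaming stores.

```python
# Toy illustration: the same three records laid out row-wise and column-wise.
rows = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": 20.0},
    {"user_id": "u3", "amount": 30.0},
]

# Columnar layout: one contiguous list per field.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "amount":  [r["amount"] for r in rows],
}

# A row-based scan must visit every whole record to sum one field...
total_row = sum(r["amount"] for r in rows)
# ...while a columnar scan reads just the "amount" column.
total_col = sum(columns["amount"])
```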
Streaming can be overkill and batch can be too slow. "There's a suggestion in the community that the Kappa architecture is about unifying processing, but what about unifying storage, especially for use cases that lie in this kind of latency range?" And, finally, he said, do you just write once and read many? "In a typical data lake, that's what we do, but what happens is that you get stuck with the initial data layout, and it may not be best for your query pattern, because that kind of layout may have been best for ingestion latency and it may not be the most efficient for specific types of queries."
For example, Agarwal said, you may want to query on user ID, but the data may actually be split based on event time, so you want to adapt your data layout to the needs of your queries. "You may want to ingest fast, but then you want to reorganize your data based on query needs, and you want to do all of this without blocking readers and writers so the business continues uninterrupted, but you keep tuning the efficiency of your data lake. The question to think about is: Do the data layout and query patterns have a strong connection? And, in many cases, we realized they do, and this isn't a new concept."
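The reorganization Agarwal describes, ingesting partitioned by event time and later re-clustering by the query key, can be sketched as below. The partition names and records are hypothetical; a real lakehouse table format would perform this as an incremental background rewrite without blocking readers or writers, rather than an in-memory shuffle.

```python
from collections import defaultdict

# Hypothetical data as ingested: partitioned by event time (fast to write).
by_event_time = {
    "2020-11-01": [{"user_id": "u2", "amount": 5.0},
                   {"user_id": "u1", "amount": 1.0}],
    "2020-11-02": [{"user_id": "u1", "amount": 3.0}],
}

def recluster_by_user(partitions):
    """Rewrite the layout so each user's records are co-located,
    which serves point lookups on user_id far better."""
    by_user = defaultdict(list)
    for records in partitions.values():
        for record in records:
            by_user[record["user_id"]].append(record)
    return dict(by_user)

by_user = recluster_by_user(by_event_time)
# A query for one user now touches a single bucket instead of
# scanning every event-time partition.
u1_total = sum(r["amount"] for r in by_user["u1"])
```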
Videos of full presentations from Data Summit Connect Fall 2020, a series of data management and analytics webinars presented by DBTA and Big Data Quarterly, are also now available for on-demand viewing on the DBTA YouTube channel.