Data lakes are one of the fastest growing trends in managing big data across various industries. However, the rise of event streaming has created a new technology category for stream processing using frameworks like Apache Flink and Kafka Streams.
The two disparate technology stacks provide different trade-offs, tied to specific storage systems/processing frameworks.
Apache Spark, Presto, and Apache Hive queries on cloud storage constitute the data lake "batch" stack, while events stored in Apache Kafka, Pulsar, or AWS Kinesis, processed using Apache Flink, and served via Apache Druid, Elastic, and the like constitute a more real-time "streaming" stack.
Nishith Agarwal, engineering manager, Uber, and Sivabalan Narayanan, senior software engineer, Uber, discussed “Apache Hudi: The Streaming Data Lake Platform” during their Data Summit Connect 2021 presentation.
“There are a myriad of data processing methods,” Narayanan said.
These include batch, incremental on batch, and streaming. However, batch can be slow and inefficient. It can produce large amounts of data and consumes expensive resources such as compute.
Hudi was originally based on incremental batch processing. But streaming systems are a better option that can be integrated with incremental batch. Streaming systems can produce fresher data, are cost effective, and are very fast, allowing users to stream data continuously.
Apache Hudi is the original pioneer of the transactional data lake movement, Narayanan said. The acronym stands for Hadoop Upserts, Deletes, and Incrementals. It also allows users to pull only changed data, improving query efficiency.
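To make upserts, deletes, and incremental pulls concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Hudi's API or storage format; the `ToyTable` class, its method names, and the integer commit counter are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class ToyTable:
    """Toy in-memory table keyed by record key (hypothetical, not Hudi's format)."""

    rows: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    commit_seq: int = 0  # assumption: commits are numbered 1, 2, 3, ...

    def _next_commit(self) -> int:
        self.commit_seq += 1
        return self.commit_seq

    def upsert(self, records: Dict[str, Any]) -> int:
        """Insert new keys and overwrite existing ones in a single commit."""
        commit = self._next_commit()
        for key, value in records.items():
            self.rows[key] = {"value": value, "commit": commit, "deleted": False}
        return commit

    def delete(self, keys: Iterable[str]) -> int:
        """Mark existing keys as deleted, remembering which commit removed them."""
        commit = self._next_commit()
        for key in keys:
            if key in self.rows:
                self.rows[key] = {"value": None, "commit": commit, "deleted": True}
        return commit

    def snapshot(self) -> Dict[str, Any]:
        """Return the latest value of every live record."""
        return {k: r["value"] for k, r in self.rows.items() if not r["deleted"]}

    def incremental(self, since_commit: int) -> Dict[str, Dict[str, Any]]:
        """Return only the records changed after the given commit."""
        return {k: r for k, r in self.rows.items() if r["commit"] > since_commit}
```

A downstream job that remembers the last commit it processed can call `incremental` with that number and receive only the records that changed since then, including deletions, instead of rescanning the whole table.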
The design goals are based on a combination of streaming and incremental batch. The result is faster and adds minimal overhead. It also operates on a log-based design.
It can manage per-record metadata, record-level merges, and record-level indexes, Narayanan noted. Each file group is its own self-contained log, and merges are local to each file group.
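The idea of file groups that each hold their own log and merge locally can be sketched as follows. This is an illustrative toy under stated assumptions, not Hudi's implementation: the `FileGroup` and `ToyIndexedTable` names, the round-robin key assignment, and the in-memory lists are all hypothetical.

```python
from typing import Any, Dict, List, Tuple


class FileGroup:
    """One self-contained unit: merged base records plus an append-only log."""

    def __init__(self, group_id: int) -> None:
        self.group_id = group_id
        self.base: Dict[str, Any] = {}  # records already merged
        self.log: List[Tuple[int, str, Any]] = []  # (commit, key, value) entries

    def append(self, commit: int, key: str, value: Any) -> None:
        # Per-record metadata: each log entry carries the commit that wrote it.
        self.log.append((commit, key, value))

    def read(self) -> Dict[str, Any]:
        """Merge the log over the base at read time; later entries win."""
        merged = dict(self.base)
        for _commit, key, value in self.log:
            merged[key] = value
        return merged

    def compact(self) -> None:
        """Fold this group's log into its base without touching other groups."""
        self.base = self.read()
        self.log.clear()


class ToyIndexedTable:
    """Routes each record key to a file group through a record-level index."""

    def __init__(self, num_groups: int) -> None:
        self.groups = [FileGroup(i) for i in range(num_groups)]
        self.index: Dict[str, int] = {}  # record key -> file group id
        self.commit = 0

    def upsert(self, records: Dict[str, Any]) -> None:
        self.commit += 1
        for key, value in records.items():
            # Assumption: unseen keys are assigned round-robin; known keys
            # always return to the group that already holds them.
            gid = self.index.setdefault(key, len(self.index) % len(self.groups))
            self.groups[gid].append(self.commit, key, value)
```

Because an update for a known key always lands in the same group, compacting one group never requires reading or rewriting another, which is what keeps merges local in this sketch.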
With Hudi, they are proposing to run incremental batch processing on streaming engines, which can supercharge what teams are able to do, Narayanan said.
Agarwal looked at the advantages of this method. He said Hudi allows teams to build a data lake with unified storage. It can be used with any stream processing engine not only to act on data but also to write out data and tables, he explained.
The platform has been adopted by a mix of companies, proving it’s a battle-tested platform, he said.
“We are super excited to grow this community,” Agarwal said.
More information about Data Summit Connect 2021 is available here.
Replays of this and all other Data Summit Connect 2021 sessions will be available to registered attendees for a limited time, and many presenters are making their slide decks available as well.