Confronting the Benefits, Challenges, and Subtleties of Open Data Lakehouses


Considerations about data architecture—from its flexibility to its capacity for cost efficiency, data quality, security, and innovation—should be top of mind for technology and business leaders everywhere. After all, a robust data foundation is what powers even the most cutting-edge use cases now dominating the business landscape. Yet there is no silver-bullet solution, and enterprises must thoroughly examine their options to meet both today's and tomorrow's needs.

In DBTA’s recent webinar, The Future of Open: What's Next in Lakehouse Architecture, experts offered their perspectives on why open lakehouse architectures continue to be a popular vehicle for enabling modernization and innovation across the enterprise, examining the latest trends, tools, and emerging best practices.

Scott Teal, product marketing at Snowflake, opened by explaining that “many large organizations want to mix and match engines of choice. That’s what prompted the concept of lakehouse architecture.”

Remaining fully adaptable to the future, able to plug any tool or engine of choice into their proprietary data, is a major focus for enterprises. Naturally, accomplishing this while keeping costs low is another crucial—yet challenging—requirement.

Historically, lakehouse architectures have “come with less-obvious gotchas,” Teal added, such as complexity created by read/write limitations and incompatibilities, or the lack of a shared standard that leaves security duplicated and inconsistent.

But where, Teal asked, does Snowflake fit in?

“You probably think of Snowflake historically as this vertically integrated data warehouse, and the answer is it absolutely can be that, and it can be a great one. But over the past several years, Snowflake has expanded to be more than just a data warehouse,” Teal explained.

Snowflake addresses lakehouse use cases by working on top of Iceberg tables, which can support data engineering, advanced analytics, AI and machine learning, applications, and collaboration. Users can mix and match workloads, gaining the adaptability they need without being locked into an “all or nothing” scenario.

MinIO is uniquely suited to join the conversation surrounding lakehouse architectures, noted Brenna Buuck, developer evangelist at MinIO. This is because “object storage forms the foundation of any data lakehouse architecture, whether or not it is abstracted away by a managed service,” said Buuck.

MinIO’s sole focus is object storage, offering a high-performance, software-defined, cloud-native object store that is the most widely deployed object store on the planet, according to the company. Storage—which joins lakehouse architecture’s other two components, open table formats and compute—is arguably the least flexible piece of the lakehouse. This is why, according to Buuck, having an object store that can truly run anywhere, such as MinIO’s offering, is crucial for building flexibility into the stack from the get-go.
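
To make the “run anywhere” point concrete, the short sketch below points an S3-compatible client (boto3) at a MinIO endpoint instead of AWS S3. The endpoint URL and credentials are placeholder assumptions for a local development deployment, not values from the webinar.

```python
# Minimal sketch: the endpoint URL and credentials below are assumptions for a
# local MinIO deployment; any S3-compatible client works the same way.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # assumed local MinIO endpoint
    aws_access_key_id="minioadmin",          # assumed development credentials
    aws_secret_access_key="minioadmin",
)

# Create a bucket and land a raw object, just as you would against AWS S3.
s3.create_bucket(Bucket="lakehouse-raw")
s3.put_object(
    Bucket="lakehouse-raw",
    Key="events/2024/part-0001.json",
    Body=b'{"event_id": 1, "type": "click"}',
)

# List what landed; table formats and query engines build on objects like these.
for obj in s3.list_objects_v2(Bucket="lakehouse-raw").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

Because table formats and query engines only ever see the S3 API, the same objects can back the lakehouse wherever the store happens to be deployed.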

“When you’re designing your data lakehouse, you need to consider flexibility up-front because storage and data turns out to be really hard to move,” said Buuck. Performance is another top consideration, as slow queries “kill data initiatives.”

Emphasizing Teal and Buuck’s points, Kyle Weller, head of product at Onehouse, explained that there are three key necessities for building a lakehouse architecture:

  • Adopt open storage as a source of truth with data stored in open table formats, ensuring that the architecture is interoperable with any warehouse, lake engine, and AI and machine learning framework.
  • Stay free from vendor compute lock-in with open source, cloud-agnostic components—such as Spark, Flink, Ray, or K8s—and mix-and-match your engines based on your use case.
  • Keep the future in mind and ensure that your data platform can scale without exponential costs, acknowledging that as your needs evolve, more advanced workloads will demand different types of data and compute frameworks.

Zooming in on open table formats, Weller pointed out that Hudi, Delta, and Iceberg are not that different. Under the hood, each format is a metadata layer sitting on top of Parquet files, and all three work toward the same outcome.
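
As a minimal illustration of that metadata-over-Parquet idea, the sketch below (assuming the pyspark and delta-spark packages are installed; paths are illustrative) writes the same rows as a Delta table and as plain Parquet. The only difference on disk is the _delta_log/ directory of commit metadata; Iceberg and Hudi apply the same principle with their own metadata layouts.

```python
# Illustrative sketch; requires the pyspark and delta-spark packages.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("table-format-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Delta table: Parquet data files plus a _delta_log/ directory of commit metadata.
df.write.format("delta").mode("overwrite").save("/tmp/demo/delta_table")

# The same rows as plain Parquet, with no transactional metadata layer on top.
df.write.format("parquet").mode("overwrite").save("/tmp/demo/plain_parquet")
```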

Yet a popular question remains: Which format should I choose?

Instead of choosing, Weller suggested, what if you could work across all three? Mirroring the way users mix and match compute engines, Apache XTable is a cross-table converter for lakehouse table formats that enables interoperability across data processing systems and query engines. By helping eliminate the costly evaluation involved in selecting an open table format, Apache XTable makes data truly universal, letting users dynamically select the table format that best fits each project’s use case.
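
As a rough sketch of how such a conversion is typically driven, the example below writes an XTable dataset config and invokes the project’s bundled sync utility from Python. The config keys follow the pattern shown in Apache XTable’s documentation, but the jar name, table path, and table name here are placeholder assumptions.

```python
# Hypothetical paths: the config keys (sourceFormat, targetFormats, datasets,
# tableBasePath, tableName) follow Apache XTable's documented pattern, while the
# jar file name and table location below are illustrative assumptions.
import subprocess
import textwrap

config = textwrap.dedent("""\
    sourceFormat: DELTA
    targetFormats:
      - ICEBERG
      - HUDI
    datasets:
      - tableBasePath: s3://my-bucket/lakehouse/orders   # placeholder location
        tableName: orders
    """)

with open("xtable_config.yaml", "w") as f:
    f.write(config)

# Run the bundled XTable utility to generate Iceberg and Hudi metadata alongside
# the existing Delta table, without rewriting the underlying Parquet data files.
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar",
     "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```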

While “open” is often equated with “free,” going open can be daunting for many enterprises because of hidden complexities and challenges. It doesn’t have to be hard, however, Weller explained.

Onehouse’s fully managed, universal data lakehouse ingests all your data sources in minutes and supports all your query engines—at scale, for a fraction of the cost. Prioritizing speed, cost efficiency, and true openness, Onehouse is an innovative option for easily modernizing with a data lakehouse infrastructure, according to Weller.

For the full, in-depth webinar featuring thorough explanations, a Q&A, and more, you can view an archived version here.

