Architecting for speed and scale yields a range of benefits, from richer data analytics to tangible business value. Examining exactly how enterprises do this, and the technologies required, helps other organizations determine how they can adopt similar methods for success.
Zoe Steinkamp, developer advocate at InfluxData, led the Data Summit session, “Architecting for Speed and Scale,” highlighting how Apache Iceberg brings data harmony, along with flexibility, scalability, and high performance, to enterprises’ structured data.
The annual Data Summit conference returned to Boston, May 8-9, 2024, with pre-conference workshops on May 7.
Apache Iceberg, the open source, high-performance format for large-scale data sets, delivers a variety of features that help cultivate fast, scalable data architectures built for the modern enterprise. Apache Iceberg revolutionizes data management by addressing traditional catalog inefficiencies, improving query performance, and reducing storage costs. It acts as a table format specification accompanied by APIs and libraries for interacting with that specification.
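To make the specification-plus-libraries idea concrete, here is a minimal sketch, assuming a Spark environment with the Iceberg runtime on the classpath; the catalog name `local`, the warehouse path, and the `events` table are illustrative assumptions, not details from the session:

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session wired to a hypothetical Iceberg catalog
# named "local", backed by a local Hadoop-style warehouse path.
# Assumes the Iceberg Spark runtime JAR is already on the classpath.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table; the schema and partitioning are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id      BIGINT,
        ts      TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")
```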
Steinkamp made a point of addressing what Iceberg is not: it is not a storage engine, an execution engine, or a service. She also noted what it is not meant for; it is not a fit for small datasets or real-time data ingestion.
Iceberg is known for several features, including support for “time travel,” ACID transactions, schema evolution, partition evolution, and partition pruning.
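As a rough illustration of time travel, the queries below read the hypothetical `local.db.events` table from the earlier sketch as of an older snapshot and as of a timestamp; the snapshot ID and timestamp are placeholders:

```python
# Current state of the hypothetical table.
spark.sql("SELECT count(*) FROM local.db.events").show()

# Time travel by snapshot ID (the ID below is a placeholder; real IDs
# come from the table's snapshots metadata table).
spark.sql("""
    SELECT count(*) FROM local.db.events
    FOR VERSION AS OF 1234567890123456789
""").show()

# Time travel by timestamp.
spark.sql("""
    SELECT count(*) FROM local.db.events
    FOR TIMESTAMP AS OF '2024-05-01 00:00:00'
""").show()
```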
Iceberg’s format is organized into the Iceberg catalog, a metadata layer, and a data layer, with the following key characteristics (a sketch of inspecting that metadata follows the list):
- Metadata is stored as files in object storage (like data files)
- Snapshots describe a table at specific points in time (allowing queries to “time travel”)
- Read performance scales with low CPU cost
- Hierarchical data statistics allow execution engines to efficiently prune metadata and data files
- Statistics kept for data files allow execution planners to aggressively and accurately prune the data files to be read
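One way to see that metadata hierarchy in practice, still assuming the hypothetical table from the earlier sketch, is to query Iceberg’s built-in metadata tables from Spark SQL:

```python
# Snapshots: each row describes the table at a point in time and is the
# basis for time travel.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM local.db.events.snapshots"
).show()

# Manifests: intermediate metadata files that group and summarize data files.
spark.sql(
    "SELECT path, added_data_files_count FROM local.db.events.manifests"
).show()

# Files: per-file statistics (record counts, value bounds) that engines use
# to prune data files at planning time.
spark.sql(
    "SELECT file_path, record_count FROM local.db.events.files"
).show()
```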
Iceberg allows users to make efficient, small updates and provides snapshot isolation for transactions; all engines see committed changes immediately. Its schema evolution is deliberately not real time; slow, incremental changes to the schema ensure minimal disruption (see the sketch below). Iceberg is also built to handle high concurrency, offering the transactional support that is crucial for multi-user environments.
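A sketch of what a small transactional update and incremental schema changes can look like through Spark SQL, again against the hypothetical `local.db.events` table with illustrative column names:

```python
# A small, targeted change committed as a single atomic Iceberg transaction.
spark.sql("""
    MERGE INTO local.db.events t
    USING (
        SELECT 42 AS id, current_timestamp() AS ts, 'corrected' AS payload
    ) s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.payload = s.payload
    WHEN NOT MATCHED THEN INSERT (id, ts, payload) VALUES (s.id, s.ts, s.payload)
""")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE local.db.events ADD COLUMNS (event_source STRING)")

# Partition evolution: change how new data is partitioned going forward,
# without rewriting data already stored under the old layout.
spark.sql("ALTER TABLE local.db.events ADD PARTITION FIELD bucket(16, id)")
```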
Iceberg additionally offers the following benefits:
- Query execution optimization for columnar I/O operations
- Incremental data processing, ideal for reducing computational loads with high update frequencies (sketched after this list)
- Reliability features, including atomic commits, reliable reads, and file-level operations
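To give a flavor of incremental data processing, the sketch below reads only the rows appended between two snapshots of the hypothetical table rather than rescanning everything; the snapshot IDs are placeholders:

```python
# Incremental read: scan only data appended between two snapshots instead of
# the whole table. The snapshot IDs are placeholders taken from the
# snapshots metadata table.
changes = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "1111111111111111111")
    .option("end-snapshot-id", "2222222222222222222")
    .load("local.db.events")
)
changes.show()
```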
Many Data Summit 2024 presentations are available for review at https://www.dbta.com/DataSummit/2024/Presentations.aspx.