Apache Iceberg: A Table Format for Large Scale Data

Written by Matthias Vallaey | Dec 27, 2023 1:44:40 PM

A comparison with Apache Kudu and Delta Lake

Apache Iceberg is an open source table format for storing and querying large-scale data sets. It is designed to improve the performance, reliability and scalability of data lake analytics. Iceberg supports both batch and streaming data sources, and provides a rich set of features such as schema evolution, partitioning, time travel, snapshots, transactions and row-level deletes. Iceberg also integrates with popular query engines such as Apache Spark, Apache Flink, Apache Hive and Presto.

How does Iceberg partition data?

Iceberg partitions data by applying partition transforms to column values. Two of the most common transforms are identity and bucket; others include truncate and the time-based transforms year, month, day and hour. A table's partition spec can combine several transforms. Identity partitioning uses the unmodified value of a column as the partition value. For example, if a table is partitioned by date, each data file holds rows for a single date. Bucket partitioning hashes the column value and takes the result modulo a fixed number of buckets. For example, if a table is bucketed by user_id, all rows for a given user_id land in the same bucket, and each data file belongs to exactly one bucket. Because the hash spreads values evenly over a fixed number of buckets, bucket partitioning can help reduce data skew and improve join performance.
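The difference between the identity and bucket approaches described above can be sketched in a few lines of plain Python. This is an illustration only, not Iceberg's implementation: the function names, the sample rows and the bucket count of 4 are assumptions made for the example, and SHA-256 stands in for the 32-bit Murmur3 hash that the Iceberg specification uses, so the bucket numbers will not match what Iceberg would produce.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count chosen for this example


def identity_partition(row, column):
    """Identity: the partition value is the column value itself."""
    return row[column]


def bucket_partition(row, column, num_buckets=NUM_BUCKETS):
    """Bucket: hash the column value, then take it modulo the bucket count.

    SHA-256 is a stand-in so the sketch only needs the standard library;
    Iceberg itself specifies a 32-bit Murmur3 hash.
    """
    digest = hashlib.sha256(str(row[column]).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def group_rows(rows, partition_fn):
    """Group rows by partition value, as a writer would before creating data files."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_fn(row)].append(row)
    return dict(groups)


rows = [
    {"user_id": 1, "date": "2023-12-01"},
    {"user_id": 2, "date": "2023-12-01"},
    {"user_id": 1, "date": "2023-12-02"},
    {"user_id": 3, "date": "2023-12-02"},
]

# One group per distinct date value.
by_date = group_rows(rows, lambda r: identity_partition(r, "date"))

# At most NUM_BUCKETS groups, however many distinct user_id values exist.
by_bucket = group_rows(rows, lambda r: bucket_partition(r, "user_id"))

for partition, members in sorted(by_date.items()):
    print("date", partition, "->", len(members), "rows")
for partition, members in sorted(by_bucket.items()):
    print("bucket", partition, "->", len(members), "rows")
```

With identity partitioning the number of partitions grows with the number of distinct values, whereas bucketing caps it at the chosen bucket count, which is why bucketing tends to suit high-cardinality columns such as user_id.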

How does Iceberg support time travel and snapshots?

Iceberg supports time travel and snapshots by maintaining a history of table metadata. Each committed change to the table, such as adding, deleting or replacing data files, creates a new snapshot that describes the complete set of files in the table at that moment. Each snapshot has a unique ID and a timestamp, and can be referenced by queries. Iceberg also keeps track of the parent-child relationship between snapshots, forming a snapshot lineage. This allows users to query the table as of any point in time for which a snapshot is still retained, or to roll back the table to a previous state.
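To make the snapshot mechanism more concrete, here is a toy model in plain Python. It is a sketch under simplifying assumptions, not Iceberg's API or metadata layout: the ToyTable and Snapshot classes, their method names and the file names are all hypothetical, and each snapshot simply stores the full set of data file names rather than the manifest files that Iceberg actually uses.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]      # link to the previous snapshot (lineage)
    timestamp_ms: int
    data_files: FrozenSet[str]    # complete file set visible in this snapshot


class ToyTable:
    """Hypothetical in-memory table that only tracks snapshot metadata."""

    def __init__(self) -> None:
        self.snapshots: Dict[int, Snapshot] = {}
        self.current_id: Optional[int] = None
        self._next_id = 1

    def _get(self, snapshot_id: Optional[int]) -> Optional[Snapshot]:
        return None if snapshot_id is None else self.snapshots.get(snapshot_id)

    def commit(self, timestamp_ms: int, added: Iterable[str] = (),
               removed: Iterable[str] = ()) -> Snapshot:
        """Every change produces a new snapshot whose parent is the current one."""
        current = self._get(self.current_id)
        base = current.data_files if current else frozenset()
        files = (base | frozenset(added)) - frozenset(removed)
        snap = Snapshot(self._next_id, self.current_id, timestamp_ms, files)
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id
        self._next_id += 1
        return snap

    def as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        """Time travel: walk back through parents to the newest snapshot
        committed at or before the given timestamp."""
        snap = self._get(self.current_id)
        while snap is not None and snap.timestamp_ms > timestamp_ms:
            snap = self._get(snap.parent_id)
        return snap

    def rollback_to(self, snapshot_id: int) -> None:
        """Rollback: make an earlier snapshot current; nothing is rewritten."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id

    def lineage(self) -> List[int]:
        """Snapshot IDs from the current snapshot back to the first one."""
        ids = []
        snap = self._get(self.current_id)
        while snap is not None:
            ids.append(snap.snapshot_id)
            snap = self._get(snap.parent_id)
        return ids


table = ToyTable()
s1 = table.commit(1000, added=["a.parquet"])
s2 = table.commit(2000, added=["b.parquet"])
s3 = table.commit(3000, removed=["a.parquet"])

at_2500 = table.as_of(2500)          # resolves to the snapshot committed at 2000
lineage_before = table.lineage()     # newest first

table.rollback_to(s1.snapshot_id)    # table state is now the first snapshot
lineage_after = table.lineage()
current_files = table.snapshots[table.current_id].data_files
```

The point of the design is that both time travel and rollback are metadata operations: old snapshots remain readable because no data file is modified in place, and a rollback only changes which snapshot is considered current.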

How does Iceberg compare with Apache Kudu?

Apache Kudu is an open source columnar storage engine rather than a table format: it manages its own data on dedicated servers instead of describing files in a data lake. Kudu is optimized for fast analytics on fast data, such as real-time or near-real-time data. Kudu stores data in a columnar layout while still allowing efficient access to individual rows by primary key, and provides features such as schema changes, partitioning, compression, encryption and row-level inserts, updates and deletes. Kudu also integrates with popular query engines such as Apache Spark, Apache Impala and Presto.

Some of the differences between Iceberg and Kudu are:

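Iceberg supports both batch and streaming data sources, while Kudu is mainly focused on real-time and near-real-time data that is inserted and updated continuously.
Both support hash-based partitioning, which Iceberg calls bucket partitioning. Kudu additionally supports range partitioning, while Iceberg additionally supports identity partitioning.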
Iceberg supports time travel and snapshots, while Kudu does not.
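Iceberg commits every change to a table atomically and supports row-level deletes, while Kudu supports row-level inserts, updates and deletes by primary key but has historically not offered general multi-row transactions.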
Iceberg uses a file-based storage layer, such as HDFS or S3, while Kudu uses its own storage layer, which requires dedicated servers and disks.

How does Iceberg compare with Delta Lake?

Delta Lake is another open source table format for storing and querying large-scale data sets. Delta Lake was created by Databricks and is most closely integrated with Apache Spark. Delta Lake supports both batch and streaming data sources, and provides features such as schema evolution, partitioning, time travel, snapshots, transactions and row-level updates and deletes. Delta Lake also offers integrations with other query engines such as Apache Hive and Presto.

Some of the similarities and differences between Iceberg and Delta Lake are:

Both Iceberg and Delta Lake support batch and streaming data sources, schema evolution, partitioning, time travel, snapshots, transactions and row-level updates and deletes.
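Both Iceberg and Delta Lake use a file-based storage layer, such as HDFS or S3, and keep table metadata in files next to the data. Iceberg uses JSON metadata files together with Avro manifest files, while Delta Lake uses a JSON transaction log with periodic Parquet checkpoints.
Iceberg supports bucket partitioning, while Delta Lake partitions only by the values of the partition columns and has no built-in hash or range partitioning.
Both Iceberg and Delta Lake support partitioning by plain column values, which Iceberg calls identity partitioning.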
Iceberg supports multiple query engines, such as Spark, Flink, Hive and Presto, as equal targets, while Delta Lake offers connectors for other engines but is most mature on Spark.
Iceberg is designed around an engine-independent specification, while Delta Lake originated in, and remains most closely tied to, Spark SQL.

source image: Ryan Blue - Tabular
