Apache Iceberg is an open source table format for storing and querying large-scale data sets. It is designed to improve the performance, reliability and scalability of data lake analytics. Iceberg supports both batch and streaming data sources, and provides a rich set of features such as schema evolution, partitioning, time travel, snapshots, transactions and row-level deletes. Iceberg also integrates with popular query engines such as Apache Spark, Apache Flink, Apache Hive and Presto.
Iceberg partitions a table by applying a transform to one or more source columns. Two commonly used transforms are identity and bucket; the format also defines truncate and the time-based transforms year, month, day and hour. An identity transform uses the column value itself as the partition value. For example, if a table is partitioned by date, each data file belongs to exactly one date partition. A bucket transform hashes the column value and takes the result modulo a fixed number of buckets. For example, if a table is bucketed by user_id, each data file belongs to a single bucket, and all rows for a given user_id land in the same bucket. Bucketing can help reduce data skew on high-cardinality columns and improve join performance.
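The identity and bucket schemes described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Iceberg's implementation: the function names are hypothetical, the rows are made-up sample data, and CRC32 is used as a stand-in hash (Iceberg's bucket transform is specified in terms of a 32-bit Murmur3 hash, so real bucket numbers will differ).

```python
import zlib
from collections import defaultdict


def identity_partition(row, column):
    """Identity: the partition value is the column value itself."""
    return row[column]


def bucket_partition(row, column, num_buckets):
    """Bucket: hash the column value, then take it modulo the bucket count.

    Assumption: CRC32 stands in for the Murmur3 hash used by Iceberg.
    """
    digest = zlib.crc32(str(row[column]).encode("utf-8"))
    return (digest & 0x7FFFFFFF) % num_buckets


def group_rows(rows, partition_fn):
    """Group rows by partition value, as a writer would before creating files."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_fn(row)].append(row)
    return dict(groups)


# Hypothetical sample rows.
rows = [
    {"user_id": 1, "event_date": "2024-01-01"},
    {"user_id": 2, "event_date": "2024-01-01"},
    {"user_id": 1, "event_date": "2024-01-02"},
]

by_date = group_rows(rows, lambda r: identity_partition(r, "event_date"))
by_bucket = group_rows(rows, lambda r: bucket_partition(r, "user_id", 4))
```

Grouping by date yields one group per distinct date, while grouping by bucket yields at most four groups regardless of how many distinct user_id values exist. That bounded partition count is why bucketing suits high-cardinality columns.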
Iceberg supports time travel and snapshots by maintaining a history of table metadata. Each change to the table, such as appending, deleting or overwriting data, creates a new snapshot that records which data files make up the table at that point. Each snapshot has a unique ID and a timestamp, and can be referenced by queries. Iceberg also keeps track of the parent-child relationship between snapshots, forming a snapshot lineage. This allows users to query the table as of any snapshot that is still retained, by snapshot ID or by timestamp, or to roll back the table to a previous state.
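The snapshot mechanism described above can be modeled with a small in-memory sketch. The class and method names here (Snapshot, TableHistory, commit, as_of, rollback_to) are hypothetical and chosen for illustration; they are not Iceberg's API, and real snapshots reference manifest files rather than holding file names directly.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]      # link that forms the snapshot lineage
    timestamp_ms: int
    data_files: FrozenSet[str]    # full set of files visible in this snapshot


class TableHistory:
    """Toy model of a table whose every commit creates a new snapshot."""

    def __init__(self) -> None:
        self.snapshots: Dict[int, Snapshot] = {}
        self.current_id: Optional[int] = None
        self._next_id = 1

    def commit(self, timestamp_ms, added=(), removed=()) -> Snapshot:
        base = (self.snapshots[self.current_id].data_files
                if self.current_id is not None else frozenset())
        files = (base - frozenset(removed)) | frozenset(added)
        snap = Snapshot(self._next_id, self.current_id, timestamp_ms, files)
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id
        self._next_id += 1
        return snap

    def lineage(self) -> List[int]:
        """Snapshot IDs from the current snapshot back to the first ancestor."""
        chain, cursor = [], self.current_id
        while cursor is not None:
            chain.append(cursor)
            cursor = self.snapshots[cursor].parent_id
        return chain

    def as_of(self, timestamp_ms) -> Optional[Snapshot]:
        """Time travel: newest snapshot in the lineage at or before the timestamp."""
        for snapshot_id in self.lineage():
            snap = self.snapshots[snapshot_id]
            if snap.timestamp_ms <= timestamp_ms:
                return snap
        return None

    def rollback_to(self, snapshot_id: int) -> None:
        """Rollback: point the table back at an earlier snapshot."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = TableHistory()
table.commit(1000, added=["a.parquet"])
table.commit(2000, added=["b.parquet"])
table.commit(3000, added=["c.parquet"], removed=["a.parquet"])

lineage_before = table.lineage()
files_at_2500 = table.as_of(2500).data_files
table.rollback_to(1)
files_after_rollback = table.snapshots[table.current_id].data_files
```

Note that rolling back in this sketch does not delete the later snapshots; it only changes which snapshot is current. Old data is therefore still reachable until snapshots are explicitly expired.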
Apache Kudu is an open source columnar storage engine rather than a table format: it manages its own storage on dedicated servers instead of describing files in a data lake. Kudu is optimized for fast analytics on fast data, such as real-time or near-real-time data. Kudu stores data in a columnar layout while still supporting efficient row-level inserts, updates and deletes, and provides features such as schema evolution, partitioning, compression and encryption. Kudu also integrates with popular query engines such as Apache Spark, Apache Impala and Presto.
Some of the differences between Iceberg and Kudu are:
Delta Lake is another open source table format for storing and querying large-scale data sets. Delta Lake was originally developed by Databricks, is now a Linux Foundation project, and was built around Apache Spark. Delta Lake supports both batch and streaming data sources, and provides features such as schema evolution, partitioning, time travel, snapshots, transactions and row-level updates and deletes. Delta Lake also integrates with popular query engines such as Apache Spark, Apache Hive and Presto.
Some of the similarities and differences between Iceberg and Delta Lake are:
source image: Ryan Blue - Tabular