Building a Real-Time Data Lake with Amazon MSK and Iceberg
By Braincuber Team
Published on February 6, 2026
The modern data lake is no longer just a dumping ground for logs; it's a live, queryable engine for business intelligence. But bridging the gap between high-velocity streams (Kafka) and analytical tables (Iceberg) has often required complex, custom-coded consumers.
No more. With Amazon MSK Connect and the Apache Iceberg Kafka Connect plugin, you can build a zero-code pipeline that sinks streaming data directly into ACID-compliant Iceberg tables. In this guide, we'll build a telemetry pipeline for LogiFleet, a logistics company tracking 5,000 trucks in real time.
Why Iceberg + Kafka?
- ACID Transactions: Consumers see atomic updates, never partial writes.
- Schema Evolution: Add columns to your fleet data without breaking the pipeline.
- Time Travel: Query the fleet's location as it was at 10:00 AM yesterday.
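To make the time-travel point concrete, here is a minimal sketch of building an Athena SQL query that reads an Iceberg table as of a past timestamp. The table name and column names (`truck_id`, `lat`, `lon`) are illustrative, not taken from a real schema; Athena's `FOR TIMESTAMP AS OF` clause is the Iceberg time-travel syntax it supports.

```python
# Sketch: an Iceberg time-travel query for Amazon Athena.
# Column names are hypothetical examples for the LogiFleet scenario.
from datetime import datetime, timezone

def time_travel_query(table: str, as_of: datetime) -> str:
    """Build an Athena SQL statement reading an Iceberg table as of a timestamp."""
    ts = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT truck_id, lat, lon FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{ts} UTC'"
    )

sql = time_travel_query(
    "logistics.truck_positions",
    datetime(2026, 2, 5, 10, 0, tzinfo=timezone.utc),
)
print(sql)
```

Run the resulting string through Athena (console or API) to see the fleet exactly as it was at that moment.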
Step 1: Preparing the Plugin
MSK Connect requires a plugin archive (a ZIP containing the connector's JARs) to know how to talk to Iceberg. This isn't bundled by default, so we must build it from source.
# Clone the Apache Iceberg repository
git clone https://github.com/apache/iceberg.git
cd iceberg/
# Build the Kafka Connect runtime (Skip tests for speed)
./gradlew -x test -x integrationTest clean build
# The artifact will be in:
# ./kafka-connect/kafka-connect-runtime/build/distributions/iceberg-kafka-connect-runtime-*.zip
Once built, upload this ZIP file to an S3 bucket (e.g., s3://logifleet-plugins/iceberg-connect.zip). Then, creating the Custom Plugin in the MSK console is as simple as pointing to this S3 URI.
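If you prefer automation over the console, the same registration can be sketched with boto3's `kafkaconnect` client. The bucket ARN and file key below are illustrative, and the `create_custom_plugin` call is left commented out because it needs live AWS credentials and the uploaded artifact to succeed.

```python
# Sketch: registering the uploaded ZIP as an MSK Connect custom plugin.
# Bucket ARN and file key are illustrative values for the LogiFleet example.
plugin_request = {
    "name": "iceberg-kafka-connect",
    "contentType": "ZIP",
    "location": {
        "s3Location": {
            "bucketArn": "arn:aws:s3:::logifleet-plugins",
            "fileKey": "iceberg-connect.zip",
        }
    },
}

# Requires AWS credentials; uncomment to actually create the plugin:
# import boto3
# client = boto3.client("kafkaconnect", region_name="us-east-1")
# response = client.create_custom_plugin(**plugin_request)
# print(response["customPluginArn"])
```

The returned plugin ARN is what you reference when creating the connector in the next step.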
Step 2: Configuring the Connector
Now we tell MSK Connect how to route data. We want to read from the gps-telemetry topic and write to the glue_catalog.logistics.truck_positions Iceberg table.
connector.class=org.apache.iceberg.connect.IcebergSinkConnector
tasks.max=2
topics=gps-telemetry
# Iceberg Catalog Settings (AWS Glue)
iceberg.catalog.type=glue
iceberg.catalog.client.region=us-east-1
iceberg.catalog.warehouse=s3://logifleet-lake/warehouse/
# Schema Integration
iceberg.tables=logistics.truck_positions
iceberg.tables.auto-create-enabled=true
iceberg.control.commit.interval-ms=60000
# Optional: record field used for per-table routing (see Step 3)
iceberg.tables.route-field=truck_id
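When you create the connector through the MSK Connect API rather than the console, these properties are passed as a flat key/value map (`connectorConfiguration`). A small parser, sketched below on a copy of the Step 2 settings, turns the `.properties` text into that map; the values mirror the config above and are otherwise illustrative.

```python
# Sketch: parse .properties text into the key/value map MSK Connect's
# CreateConnector API expects as connectorConfiguration.
PROPERTIES = """
connector.class=org.apache.iceberg.connect.IcebergSinkConnector
tasks.max=2
topics=gps-telemetry
iceberg.catalog.type=glue
iceberg.catalog.client.region=us-east-1
iceberg.catalog.warehouse=s3://logifleet-lake/warehouse/
iceberg.tables=logistics.truck_positions
iceberg.tables.auto-create-enabled=true
iceberg.control.commit.interval-ms=60000
"""

def parse_properties(text: str) -> dict:
    """Convert key=value lines into a dict, skipping blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

connector_configuration = parse_properties(PROPERTIES)
# Pass connector_configuration to boto3's kafkaconnect create_connector call
# along with capacity, plugin ARN, and cluster details (requires AWS credentials).
```

Keeping the config in a `.properties` file and parsing it this way lets the console and API paths share one source of truth.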
Step 3: Multi-Table Routing
What if you have multiple topics like orders, shipments, and returns? You don't need a separate connector for each. Iceberg Kafka Connect supports Multi-Table Mode.
By setting iceberg.tables.dynamic-enabled=true, the connector routes each record to the table named by the value of the field configured in iceberg.tables.route-field. Pair this with a simple transform (SMT) that derives that field from the topic name, and data from a topic like dbserver1.inventory.products lands in the inventory.products table automatically.
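The dynamic-routing behavior can be mimicked in a few lines: each record carries the name of its destination table in the route field. The field name `iceberg_table` and the record shape below are hypothetical, chosen only to illustrate the mechanism.

```python
# Sketch of dynamic-mode routing: the route field's value names the target table.
ROUTE_FIELD = "iceberg_table"  # hypothetical route-field name

def route_table(record: dict) -> str:
    """Return the Iceberg table a record would land in under dynamic routing."""
    table = record.get(ROUTE_FIELD)
    if table is None:
        raise ValueError(f"record is missing route field '{ROUTE_FIELD}'")
    return table

record = {"iceberg_table": "inventory.products", "sku": "A-100", "qty": 3}
print(route_table(record))  # → inventory.products
```

One connector, many tables: the fan-out lives in the data, not in the deployment.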
Conclusion
LogiFleet now has a pipeline where GPS coordinates emitted by trucks appear in their data lake within minutes, ready for SQL analysis via Amazon Athena. No custom Lambda functions, no complex Spark streaming jobs—just configuration.
Modernize Your Data Lake
Ready to move from batch processing to real-time streaming ingestion? Our Big Data architects can help you deploy production-grade Iceberg pipelines.
