Building a Real-Time Data Lake with Amazon MSK and Iceberg
By Braincuber Team
Published on February 6, 2026
The modern data lake is no longer just a dumping ground for logs; it's a live, queryable engine for business intelligence. But bridging the gap between high-velocity streams (Kafka) and analytical tables (Iceberg) has often required complex, custom-coded consumers.
No more. With Amazon MSK Connect and the Apache Iceberg Kafka Connect plugin, you can build a zero-code pipeline that sinks streaming data directly into ACID-compliant Iceberg tables. In this guide, we'll build a telemetry pipeline for LogiFleet, a logistics company tracking 5,000 trucks in real time.
Why Iceberg + Kafka?
- ACID Transactions: Consumers see atomic updates, never partial writes.
- Schema Evolution: Add columns to your fleet data without breaking the pipeline.
- Time Travel: Query the fleet's location as it was at 10:00 AM yesterday.
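To make the time-travel point concrete, here is a minimal sketch of building an Athena SQL query that reads an Iceberg table as of a past timestamp. The table name and column names (`truck_id`, `lat`, `lon`) are illustrative, not taken from a real schema; Athena's `FOR TIMESTAMP AS OF` clause is the Iceberg time-travel syntax it supports.

```python
# Sketch: an Iceberg time-travel query for Amazon Athena.
# Column names are hypothetical examples for the LogiFleet scenario.
from datetime import datetime, timezone

def time_travel_query(table: str, as_of: datetime) -> str:
    """Build an Athena SQL statement reading an Iceberg table as of a timestamp."""
    ts = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT truck_id, lat, lon FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{ts} UTC'"
    )

sql = time_travel_query(
    "logistics.truck_positions",
    datetime(2026, 2, 5, 10, 0, tzinfo=timezone.utc),
)
print(sql)
```

Run the resulting string through Athena (console or API) to see the fleet exactly as it was at that moment.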
Step 1: Preparing the Plugin
MSK Connect requires a plugin archive (a ZIP containing the connector's JARs) to know how to talk to Iceberg. This isn't bundled by default, so we must build it from source.
# Clone the Apache Iceberg repository
git clone https://github.com/apache/iceberg.git
cd iceberg/
# Build the Kafka Connect runtime (Skip tests for speed)
./gradlew -x test -x integrationTest clean build
# The artifact will be in:
# ./kafka-connect/kafka-connect-runtime/build/distributions/iceberg-kafka-connect-runtime-*.zip
Once built, upload this ZIP file to an S3 bucket (e.g., s3://logifleet-plugins/iceberg-connect.zip). Then, creating the Custom Plugin in the MSK console is as simple as pointing to this S3 URI.
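If you prefer automation over the console, the same registration can be sketched with boto3's `kafkaconnect` client. The bucket ARN and file key below are illustrative, and the `create_custom_plugin` call is left commented out because it needs live AWS credentials and the uploaded artifact to succeed.

```python
# Sketch: registering the uploaded ZIP as an MSK Connect custom plugin.
# Bucket ARN and file key are illustrative values for the LogiFleet example.
plugin_request = {
    "name": "iceberg-kafka-connect",
    "contentType": "ZIP",
    "location": {
        "s3Location": {
            "bucketArn": "arn:aws:s3:::logifleet-plugins",
            "fileKey": "iceberg-connect.zip",
        }
    },
}

# Requires AWS credentials; uncomment to actually create the plugin:
# import boto3
# client = boto3.client("kafkaconnect", region_name="us-east-1")
# response = client.create_custom_plugin(**plugin_request)
# print(response["customPluginArn"])
```

The returned plugin ARN is what you reference when creating the connector in the next step.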
Step 2: Configuring the Connector
Now we tell MSK Connect how to route data. We want to read from the gps-telemetry topic and write to the glue_catalog.logistics.truck_positions Iceberg table.
connector.class=org.apache.iceberg.connect.IcebergSinkConnector
tasks.max=2
topics=gps-telemetry
# Iceberg Catalog Settings (AWS Glue)
iceberg.catalog.type=glue
iceberg.catalog.client.region=us-east-1
iceberg.catalog.warehouse=s3://logifleet-lake/warehouse/
# Schema Integration
iceberg.tables=logistics.truck_positions
iceberg.tables.auto-create-enabled=true
iceberg.control.commit.interval-ms=60000
# Optional: record field used for per-table routing (see Step 3)
iceberg.tables.route-field=truck_id
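When you create the connector through the MSK Connect API rather than the console, these properties are passed as a flat key/value map (`connectorConfiguration`). A small parser, sketched below on a copy of the Step 2 settings, turns the `.properties` text into that map; the values mirror the config above and are otherwise illustrative.

```python
# Sketch: parse .properties text into the key/value map MSK Connect's
# CreateConnector API expects as connectorConfiguration.
PROPERTIES = """
connector.class=org.apache.iceberg.connect.IcebergSinkConnector
tasks.max=2
topics=gps-telemetry
iceberg.catalog.type=glue
iceberg.catalog.client.region=us-east-1
iceberg.catalog.warehouse=s3://logifleet-lake/warehouse/
iceberg.tables=logistics.truck_positions
iceberg.tables.auto-create-enabled=true
iceberg.control.commit.interval-ms=60000
"""

def parse_properties(text: str) -> dict:
    """Convert key=value lines into a dict, skipping blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

connector_configuration = parse_properties(PROPERTIES)
# Pass connector_configuration to boto3's kafkaconnect create_connector call
# along with capacity, plugin ARN, and cluster details (requires AWS credentials).
```

Keeping the config in a `.properties` file and parsing it this way lets the console and API paths share one source of truth.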
Step 3: Multi-Table Routing
What if you have multiple topics like orders, shipments, and returns? You don't need a separate connector for each. Iceberg Kafka Connect supports Multi-Table Mode.
By setting iceberg.tables.dynamic-enabled=true, the connector routes each record to the table named by the value of the field configured in iceberg.tables.route-field. Pair this with a simple transform (SMT) that derives that field from the topic name, and data from a topic like dbserver1.inventory.products lands in the inventory.products table automatically.
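The dynamic-routing behavior can be mimicked in a few lines: each record carries the name of its destination table in the route field. The field name `iceberg_table` and the record shape below are hypothetical, chosen only to illustrate the mechanism.

```python
# Sketch of dynamic-mode routing: the route field's value names the target table.
ROUTE_FIELD = "iceberg_table"  # hypothetical route-field name

def route_table(record: dict) -> str:
    """Return the Iceberg table a record would land in under dynamic routing."""
    table = record.get(ROUTE_FIELD)
    if table is None:
        raise ValueError(f"record is missing route field '{ROUTE_FIELD}'")
    return table

record = {"iceberg_table": "inventory.products", "sku": "A-100", "qty": 3}
print(route_table(record))  # → inventory.products
```

One connector, many tables: the fan-out lives in the data, not in the deployment.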
Conclusion
LogiFleet now has a pipeline where GPS coordinates emitted by trucks appear in their data lake within minutes, ready for SQL analysis via Amazon Athena. No custom Lambda functions, no complex Spark streaming jobs—just configuration.
Modernize Your Data Lake
Ready to move from batch processing to real-time streaming ingestion? Our Big Data architects can help you deploy production-grade Iceberg pipelines.
