Data Engineering · 9 min read · May 2026

Why I Stopped Batch Processing Everything (And Built a Kafka Pipeline Instead)

Real-time vs batch — the honest tradeoffs

The Dashboard That Was Always Wrong

For most of my analytics career, every pipeline I built ran on a schedule. ETL every hour. Reporting tables refreshed at midnight. KPI dashboards updated at 7am so the team had numbers for their 9am standup.

That worked fine — until I was in a meeting where the head of operations was making a decision about whether to spin up additional capacity for a peak-hour surge. The dashboard said everything was normal. The actual system, as we later found out, had been in distress for 52 minutes. The report hadn't refreshed yet.

Fifty-two minutes of wrong data. In operations, that's not a data quality problem. That's a $40,000 problem.

That was the day I started taking real-time pipelines seriously. Not because batch is bad — batch is great for most things. But because I finally understood what "latency" actually costs.

The Honest Case For Kafka (And Against It)

Before I explain the pipeline, I want to be direct about something most Kafka tutorials skip: you probably don't need Kafka for your current job.

If your organisation refreshes dashboards daily, runs reports weekly, and the business questions your data answers don't change faster than your ETL runs — batch is the right tool. Simpler. Cheaper. Easier to debug. Fewer moving parts.

You need Kafka when:

Business decisions need to happen in seconds, not hours: fraud detection, operations monitoring, live inventory, real-time personalisation
You have multiple downstream consumers that all need the same event stream independently: Kafka decouples the producer from every consumer, so adding a new consumer doesn't touch the source system
Your event volume is too high to poll the source efficiently: instead of every consumer repeatedly asking the source system "is there new data?", producers append events to Kafka as they happen and consumers read them from the log with low latency (Kafka consumers still pull, but from a broker designed for it)
You need replay: Kafka retains event history for a configurable retention period, so a consumer that was down can resume from its last committed offset without losing events, provided it catches up before they expire

The moment I understood the decoupling story, the architecture clicked. A traditional ETL couples source to destination. Kafka puts a message bus in the middle. Producers don't know or care who's consuming. Consumers don't know or care who produced. That separation is what makes the system resilient.
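To make the decoupling and replay ideas concrete, here is a toy in-memory model of a single-partition log with one read offset per consumer group. It is only a sketch: `ToyLog`, `append`, and `read` are hypothetical names, and nothing here uses Kafka's API or reflects how a broker stores data.

```python
from collections import defaultdict


class ToyLog:
    """A toy, single-partition stand-in for a topic (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._events = []                  # append-only list of events
        self._offsets = defaultdict(int)   # next offset to read, per consumer group

    def append(self, event):
        """Producer side: append an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, group, max_events=10):
        """Consumer side: return unread events for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch


log = ToyLog()
for i in range(5):
    log.append({"n": i})

# Two groups read the same events independently; neither affects the other.
dashboard_first = log.read("dashboard", max_events=2)
fraud_all = log.read("fraud", max_events=10)

# The "dashboard" group has only read two events so far. One more event arrives,
# and the group resumes from its own stored offset, not from the end of the log.
log.append({"n": 5})
dashboard_rest = log.read("dashboard", max_events=10)
```

The producer loop never mentions either group, and adding a third group would not change it. That is the whole decoupling argument in miniature.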

How the Pipeline Works

Here's the full architecture:

┌──────────────────────────────────────────────────────────────────────┐
│                        Kafka Streaming Pipeline                       │
│                                                                      │
│  ┌───────────┐    ┌─────────────────────┐    ┌──────────────────┐   │
│  │ Producer  │───▶│   Kafka Broker      │───▶│ Stream Consumer  │   │
│  │ (Python)  │    │   (Docker)          │    │ (Python)         │   │
│  │           │    │                     │    │                  │   │
│  │ Generates │    │ Topic: events       │    │ Validates schema │   │
│  │ JSON      │    │ Partitions: 3       │    │ Enriches fields  │   │
│  │ events    │    │ Replication: 1      │    │ Filters noise    │   │
│  └───────────┘    └─────────────────────┘    └────────┬─────────┘   │
│                                                        │             │
│  ┌───────────────────────┐    ┌───────────────────────▼──────────┐  │
│  │ Streamlit Dashboard   │◀───│   Analytics Sink (SQLite)        │  │
│  │ - Live event feed     │    │   - Cleaned event records        │  │
│  │ - Throughput chart    │    │   - Aggregated metrics           │  │
│  │ - Error rate monitor  │    │   - Consumer lag tracking        │  │
│  └───────────────────────┘    └──────────────────────────────────┘  │
│                                                                      │
│  Orchestrated with Docker Compose  ·  Scalable with Kubernetes      │
└──────────────────────────────────────────────────────────────────────┘

Each component is a separate Python process, containerised with Docker, and wired together via Docker Compose locally or Kubernetes manifests for anything that needs to scale.

The Producer: Generating the Event Stream

The producer simulates a real-world event source — in this case, a retail transaction stream. Every event is a JSON payload published to a Kafka topic:

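from confluent_kafka import Producer
import json, os, time, random, uuid
from datetime import datetime, timezone

# Inside Docker Compose this resolves to kafka:29092; on the host it falls back to localhost:9092
producer = Producer({
    'bootstrap.servers': os.environ.get('KAFKA_BOOTSTRAP_SERVERS', 'localhost:9092'),
})

def on_delivery(err, msg):
    # Invoked from poll()/flush() once the broker acknowledges or rejects a message
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")

def generate_event():
    return {
        "event_id":   str(uuid.uuid4()),
        "event_type": random.choice(["purchase", "view", "cart_add", "checkout"]),
        "user_id":    random.randint(1000, 9999),
        "product_id": random.randint(1, 500),
        "amount":     round(random.uniform(5.0, 850.0), 2),
        "timestamp":  datetime.now(timezone.utc).isoformat(),  # utcnow() is deprecated
        "region":     random.choice(["AU-NSW", "AU-VIC", "AU-WA", "AU-QLD"]),
    }

try:
    while True:
        event = generate_event()
        producer.produce(
            topic="retail-events",
            key=str(event["user_id"]),
            value=json.dumps(event).encode("utf-8"),
            on_delivery=on_delivery,
        )
        producer.poll(0)   # serve delivery callbacks without blocking
        time.sleep(0.1)    # ~10 events per second
except KeyboardInterrupt:
    pass
finally:
    producer.flush()       # deliver anything still queued before exiting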

The key=user_id is important. Kafka routes all messages with the same key to the same partition (as long as the partition count doesn't change), and ordering is guaranteed only within a partition. Together, that means all of user 1234's events are consumed in the order they were produced, which is what you want for a transaction stream.
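The routing rule is easy to picture as "stable hash of the key, modulo the partition count". The sketch below uses CRC32 purely for illustration; `partition_for` is a hypothetical helper, and I'm not claiming it reproduces the exact hash your Kafka client uses.

```python
import zlib

NUM_PARTITIONS = 3  # matches the three partitions in the architecture diagram


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash()
    # for strings, which is randomised per process by default.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every event for the same user lands on the same partition.
assignments = {uid: partition_for(str(uid)) for uid in (1234, 5678, 9012)}
same_user_twice = (partition_for("1234"), partition_for("1234"))
```

The modulo is also why adding partitions later can move a key to a different partition: the hash stays the same but the divisor changes.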

The Consumer: Schema Validation + Enrichment

The consumer is where the analytical work happens. It's not just reading and writing — it's validating the schema, enriching the record with derived fields, filtering out noise, and landing clean data into the analytics sink.

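from confluent_kafka import Consumer
import json, os, sqlite3
from datetime import datetime, timezone

consumer = Consumer({
    'bootstrap.servers': os.environ.get('KAFKA_BOOTSTRAP_SERVERS', 'localhost:9092'),
    'group.id':          os.environ.get('CONSUMER_GROUP_ID', 'analytics-consumer-group'),
    'auto.offset.reset': 'earliest',  # replay from start if group is new
})
consumer.subscribe(['retail-events'])

REQUIRED_FIELDS = {"event_id", "event_type", "user_id", "amount", "timestamp"}

# Minimal stand-ins for the sink helpers; file name and table layout are illustrative assumptions
db = sqlite3.connect("analytics.db")
db.execute("CREATE TABLE IF NOT EXISTS events (event_id TEXT PRIMARY KEY, payload TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS dead_letters (payload TEXT, reason TEXT)")

def write_to_sink(event: dict) -> None:
    # INSERT OR REPLACE keeps the load idempotent if an event is redelivered
    db.execute("INSERT OR REPLACE INTO events VALUES (?, ?)",
               (event["event_id"], json.dumps(event)))
    db.commit()

def log_dead_letter(payload, reason: str) -> None:
    db.execute("INSERT INTO dead_letters VALUES (?, ?)", (repr(payload), reason))
    db.commit()

def enrich(event: dict) -> dict:
    event["processed_at"]  = datetime.now(timezone.utc).isoformat()
    event["amount_bucket"] = (
        "low" if event["amount"] < 50
        else "mid" if event["amount"] < 200
        else "high"
    )
    event["is_revenue_event"] = event["event_type"] in ("purchase", "checkout")
    return event

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue  # nothing arrived within the timeout
        if msg.error():
            print(f"Consumer error: {msg.error()}")  # surface broker errors instead of hiding them
            continue

        try:
            event = json.loads(msg.value().decode("utf-8"))

            # Schema validation
            if not REQUIRED_FIELDS.issubset(event.keys()):
                log_dead_letter(event, reason="missing_fields")
                continue

            # Enrichment + load
            clean = enrich(event)
            write_to_sink(clean)

        except Exception as e:
            log_dead_letter(msg.value(), reason=str(e))
except KeyboardInterrupt:
    pass
finally:
    consumer.close()  # commit final offsets and leave the group cleanly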

The dead letter queue pattern is critical. In a real pipeline, malformed events will arrive. You never want to crash your consumer — you want to route bad records somewhere you can inspect them later without blocking the good ones.
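Checking that fields are present is the minimum. A slightly stricter version also checks types and returns a reason instead of raising, so the consumer loop only has to decide where each record goes. This is a sketch under my own assumptions: `route`, `KNOWN_TYPES`, and the reason strings are hypothetical, not part of the project.

```python
import json

REQUIRED_FIELDS = {"event_id", "event_type", "user_id", "amount", "timestamp"}
KNOWN_TYPES = {"purchase", "view", "cart_add", "checkout"}


def route(raw: bytes):
    """Return ("clean", event) or ("dead_letter", reason) without raising."""
    try:
        event = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "dead_letter", "invalid_json"
    if not isinstance(event, dict):
        return "dead_letter", "not_an_object"
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return "dead_letter", "missing_fields:" + ",".join(sorted(missing))
    event_type = event["event_type"]
    if not isinstance(event_type, str) or event_type not in KNOWN_TYPES:
        return "dead_letter", "unknown_event_type"
    amount = event["amount"]
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return "dead_letter", "amount_not_numeric"
    return "clean", event


good = json.dumps({
    "event_id": "e1", "event_type": "purchase", "user_id": 1234,
    "amount": 19.99, "timestamp": "2026-05-01T09:00:00+00:00",
}).encode("utf-8")
status, payload = route(good)
```

Because every failure has a named reason, the dead letter table can be grouped by reason later to see which kind of bad record is most common.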

Docker: Making It Reproducible

The entire stack runs with a single docker compose up. The compose file defines four services: ZooKeeper (Kafka's coordination layer), the Kafka broker, the Python producer, and the Python consumer.

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
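      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      # PLAINTEXT_HOST is a custom listener name, so it has to be mapped to a protocol
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1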

  producer:
    build: ./producer
    depends_on: [kafka]
    environment:
      KAFKA_BOOTSTRAP_SERVERS: kafka:29092

  consumer:
    build: ./consumer
    depends_on: [kafka]
    environment:
      KAFKA_BOOTSTRAP_SERVERS: kafka:29092

The key detail: the producer and consumer talk to Kafka at kafka:29092 (inside Docker's network), while your local machine accesses it at localhost:9092. The KAFKA_ADVERTISED_LISTENERS config handles both.
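The reason two addresses are needed is that a client only uses the bootstrap address for its first connection; after that it connects to whatever address the broker advertises for that listener, so the advertised host has to be resolvable from wherever the client runs. A small sketch of how the setting breaks down (`parse_advertised_listeners` and `bootstrap_for` are hypothetical helpers written for this post, not part of any Kafka client):

```python
def parse_advertised_listeners(value: str) -> dict:
    """Split 'NAME://host:port,NAME2://host2:port2' into {NAME: 'host:port'}."""
    listeners = {}
    for entry in value.split(","):
        name, _, address = entry.strip().partition("://")
        listeners[name] = address
    return listeners


ADVERTISED = "PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092"
listeners = parse_advertised_listeners(ADVERTISED)


def bootstrap_for(inside_compose_network: bool) -> str:
    """Pick the address a client should use, depending on where it runs."""
    name = "PLAINTEXT" if inside_compose_network else "PLAINTEXT_HOST"
    return listeners[name]


container_address = bootstrap_for(inside_compose_network=True)
laptop_address = bootstrap_for(inside_compose_network=False)
```

If a script on your laptop connects but then hangs or times out, a mismatch here is a likely cause: the client is being told to reconnect to a hostname (kafka) that only exists inside the Compose network.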

Kubernetes: When You Need to Scale

Docker Compose is fine for development and single-node deployments. When you need the consumer to scale horizontally — multiple replicas processing partitions in parallel — you need Kubernetes.

The key concept: Kafka consumer groups + Kubernetes replicas work together naturally. If your topic has 3 partitions and you run 3 consumer replicas, each replica owns exactly one partition. Kafka distributes the load automatically. Scale to 6 replicas and Kafka rebalances — 3 active, 3 on standby. Scale down and it rebalances again. No custom logic required.
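The arithmetic behind "3 active, 3 on standby" can be sketched with a simple round-robin assignment. This is a simplification I wrote for illustration: `assign_partitions` is hypothetical, and Kafka's real assignors are configurable and handle multiple topics and rebalancing.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over consumers; surplus consumers get an empty list."""
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2]

three = assign_partitions(partitions, [f"replica-{i}" for i in range(3)])
six = assign_partitions(partitions, [f"replica-{i}" for i in range(6)])
two = assign_partitions(partitions, [f"replica-{i}" for i in range(2)])

# Replicas beyond the partition count own nothing and sit idle.
idle = [name for name, owned in six.items() if not owned]
```

The practical consequence is that the partition count is the ceiling on consumer parallelism within one group. To usefully run six replicas, the topic needs at least six partitions.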

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-consumer
spec:
  replicas: 3            # one per partition
  selector:
    matchLabels: {app: kafka-consumer}
  template:
    metadata:
      labels: {app: kafka-consumer}   # must match spec.selector.matchLabels
    spec:
      containers:
      - name: consumer
        image: umangknp/kafka-consumer:latest
        env:
        - name: KAFKA_BOOTSTRAP_SERVERS
          value: kafka-service:9092
        - name: CONSUMER_GROUP_ID
          value: analytics-consumer-group

The Monitoring Dashboard

A streaming pipeline without observability is a black box. The Streamlit dashboard pulls from the analytics sink every 2 seconds and shows:

Live event feed — last 50 events with type, region, amount
Throughput chart — events per second over the last 5 minutes
Error rate — dead letter queue volume vs total events
Consumer lag — how far behind the consumer is from the producer's latest offset
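Throughput and error rate both reduce to simple queries against the SQLite sink. The sketch below runs on an in-memory database with three hand-written rows. The table and column names (`events.processed_at`, `dead_letters.logged_at`) are assumptions made for the example, and "total events" is taken to mean clean plus dead-lettered records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, processed_at TEXT);
    CREATE TABLE dead_letters (reason TEXT, logged_at TEXT);
""")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("a", "2026-05-01T09:00:00.120000"),
    ("b", "2026-05-01T09:00:00.870000"),
    ("c", "2026-05-01T09:00:01.050000"),
])
conn.execute("INSERT INTO dead_letters VALUES (?, ?)",
             ("missing_fields", "2026-05-01T09:00:01.300000"))

# Events per second: the first 19 characters of an ISO timestamp are
# YYYY-MM-DDTHH:MM:SS, so grouping on them buckets rows by whole second.
throughput = conn.execute("""
    SELECT substr(processed_at, 1, 19) AS second, COUNT(*) AS events
    FROM events
    GROUP BY second
    ORDER BY second
""").fetchall()

# Error rate: dead letters as a share of everything the consumer handled.
clean = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
dead = conn.execute("SELECT COUNT(*) FROM dead_letters").fetchone()[0]
error_rate = dead / (clean + dead) if (clean + dead) else 0.0
```

A real dashboard query would also add a WHERE clause that restricts rows to the last five minutes. It is left out here to keep the example deterministic.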

Consumer lag is the most important metric in a streaming system. If it's growing, your consumer is falling behind the producer — you need to scale up or optimise the processing logic. If it's zero, you're keeping up in real time.
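Lag itself is just a subtraction per partition: the latest offset in the log minus the offset the group has committed. A minimal sketch with made-up numbers follows; `consumer_lag` is a hypothetical helper. With confluent-kafka the inputs would typically come from `Consumer.get_watermark_offsets` and `Consumer.committed`, but they are hard-coded here so the example runs without a broker.

```python
def consumer_lag(high_watermarks: dict, committed: dict) -> dict:
    """Per-partition lag: latest offset in the log minus the group's committed offset."""
    lag = {}
    for partition, high in high_watermarks.items():
        done = committed.get(partition, 0)
        if done < 0:
            # Clients report a negative sentinel when nothing has been committed yet.
            # Simplification: treat that as "nothing consumed", assuming the log starts at 0.
            done = 0
        lag[partition] = max(high - done, 0)
    return lag


# Made-up offsets for a three-partition topic.
high_watermarks = {0: 120, 1: 95, 2: 100}
committed = {0: 120, 1: 80, 2: 40}

lag = consumer_lag(high_watermarks, committed)
total_lag = sum(lag.values())
```

Looking at lag per partition rather than only the total is worth the extra chart: one hot key can leave a single partition far behind while the others sit at zero.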

What This Changes About How I Think About Data

Building this pipeline shifted something in how I approach data problems. Batch ETL is fundamentally a world where you ask: "what happened?" Streaming is a world where you can ask: "what is happening right now?"

For most analytics use cases — trend analysis, reporting, forecasting — batch is right. But once you've built a streaming system and watched events land in your dashboard less than 100ms after they were produced, you start seeing the operational decisions that were impossible before: detect a spike in checkout abandonment as it happens and trigger a real-time discount, not the next morning after the damage is done.

The full project — producer, consumer, Docker Compose, Kubernetes manifests, and Streamlit dashboard — is on GitHub. Clone it, run docker compose up, and you'll have a working Kafka pipeline in under 5 minutes.