Building Scalable Distributed Systems: Lessons from the Trenches
When I started working with distributed systems at MinIO, I quickly realized that the theoretical knowledge from coursework only scratches the surface. Textbooks teach you about consensus algorithms and CAP theorem, but they don't prepare you for the 2 AM debugging sessions where a message queue is silently dropping events because of a misconfigured retention policy. Here are some hard-won lessons from building production event-driven architectures.
Why Event-Driven?
Before diving into the technical details, it's worth understanding why we chose an event-driven architecture in the first place. Our system needed to coordinate three very different workloads:
- Go gRPC APIs handling catalog operations on Apache Iceberg data lakes
- Python workers running AI agent pipelines for business intelligence
- A React/TypeScript UI that needed real-time status updates
A traditional request-response architecture would have created tight coupling between these services. If the Python worker was slow (which ML workloads often are), the API would block, and the UI would hang. We needed something that could decouple producers from consumers while guaranteeing message delivery.
Enter NATS JetStream.
The Event-Driven Paradigm Shift
Traditional request-response architectures are straightforward: a client sends a request, a server processes it, and returns a response. But event-driven systems flip this model on its head. Instead of "ask and wait," you "publish and forget" — trusting that downstream consumers will pick up the event and process it asynchronously.
This sounds simple in theory, but the mental model shift is significant. You have to start thinking about:
- Eventual consistency — Your UI might show stale data for a few seconds, and that's okay
- Idempotency — Messages can be delivered more than once, so every handler must produce the same result regardless of how many times it runs (a handler sketch follows this list)
- Ordering guarantees — Do you need strict ordering? Per-subject ordering? Or is unordered fine?
- Dead letter queues — What happens when a message can't be processed after multiple retries?
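Idempotency in particular takes practice to internalize. Here is a minimal sketch of the shape an idempotent handler can take, assuming every event carries a stable operation ID. The in-memory store is purely illustrative; production code would make the dedup check durable and, ideally, transactional with the side effect:

```go
package handlers

import "sync"

// ProcessedStore remembers which operation IDs have already been applied.
// In production this would be durable (e.g. a unique-keyed database table,
// updated in the same transaction as the side effect); an in-memory map is
// enough to show the shape of the idea.
type ProcessedStore struct {
    mu   sync.Mutex
    done map[string]bool
}

func NewProcessedStore() *ProcessedStore {
    return &ProcessedStore{done: make(map[string]bool)}
}

// HandleOnce applies an operation at most once per operation ID, so
// redelivered messages become harmless no-ops.
func (s *ProcessedStore) HandleOnce(operationID string, apply func() error) error {
    s.mu.Lock()
    if s.done[operationID] {
        s.mu.Unlock()
        return nil // duplicate delivery: nothing to do
    }
    s.mu.Unlock()

    if err := apply(); err != nil {
        return err // not marked done, so a redelivery can retry
    }

    s.mu.Lock()
    s.done[operationID] = true
    s.mu.Unlock()
    return nil
}
```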
NATS JetStream: The Sweet Spot
We evaluated several messaging systems — Kafka, RabbitMQ, Redis Streams — before settling on NATS JetStream. Here's why:
- Operational simplicity: NATS is a single binary with zero external dependencies. No ZooKeeper, no cluster manager.
- Built-in persistence: JetStream adds durable streams on top of core NATS, with configurable retention policies.
- Lightweight: The entire NATS server binary is ~20MB. Compare that to a Kafka deployment.
- Native Go client: Since our primary API layer was in Go, the first-class NATS Go client was a huge advantage.
Here's a simplified version of our event publisher:
```go
package events

import (
    "fmt"

    "github.com/nats-io/nats.go"
    "google.golang.org/protobuf/proto"
)

type Publisher struct {
    js nats.JetStreamContext
}

func NewPublisher(js nats.JetStreamContext) *Publisher {
    return &Publisher{js: js}
}

func (p *Publisher) Publish(subject string, msg proto.Message) error {
    data, err := proto.Marshal(msg)
    if err != nil {
        return fmt.Errorf("failed to marshal protobuf: %w", err)
    }

    // generateDedupID (not shown) must derive a stable ID from the logical
    // event so that a retried publish carries the same Nats-Msg-Id.
    ack, err := p.js.Publish(subject, data,
        nats.MsgId(generateDedupID(subject, msg)),
    )
    if err != nil {
        return fmt.Errorf("failed to publish to %s: %w", subject, err)
    }
    _ = ack // Log sequence number in production
    return nil
}
```

Notice the `nats.MsgId` option — this enables server-side deduplication. If the publisher retries (say, after a network blip) within the stream's duplicate-tracking window, NATS will recognize the duplicate message ID and silently drop it. This is one of those small details that saves you from hours of debugging duplicate processing issues.
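That duplicate window is part of the stream definition itself. Here's roughly how a stream for these events can be declared with the Go client — the stream name, subjects, and limits below are illustrative values, not our exact production configuration:

```go
package events

import (
    "time"

    "github.com/nats-io/nats.go"
)

// EnsureStream creates (or verifies) the JetStream stream that backs the
// event subjects. The values here are illustrative defaults.
func EnsureStream(js nats.JetStreamContext) error {
    _, err := js.AddStream(&nats.StreamConfig{
        Name:       "CATALOG",
        Subjects:   []string{"tasks.>", "results.>"},
        Storage:    nats.FileStorage,  // durable, survives restarts
        Retention:  nats.LimitsPolicy, // keep messages until limits are hit
        MaxAge:     24 * time.Hour,    // retention window
        Duplicates: 2 * time.Minute,   // dedup window for Nats-Msg-Id
    })
    return err
}
```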
Protocol Buffers: Schema as a Contract
Early on, we made the decision to use Protocol Buffers for all inter-service communication. This turned out to be one of the best architectural decisions we made.
JSON is great for external APIs, but for internal service-to-service communication it has real downsides: no schema enforcement, verbose serialization, and no built-in support for schema evolution. With Protobuf, the contract is explicit and versioned:
```proto
syntax = "proto3";

package catalog.v1;

import "google/protobuf/timestamp.proto";

message CatalogOperation {
  string operation_id = 1;
  OperationType type = 2;
  string table_name = 3;
  string namespace = 4;
  google.protobuf.Timestamp created_at = 5;
  map<string, string> metadata = 6;

  // Added in v1.1 — old consumers safely ignore this field
  optional ValidationResult validation = 7;
}

enum OperationType {
  OPERATION_TYPE_UNSPECIFIED = 0;
  OPERATION_TYPE_CREATE = 1;
  OPERATION_TYPE_ALTER = 2;
  OPERATION_TYPE_DROP = 3;
}
```

The key insight is field numbering. When we added the validation field in v1.1, existing consumers that didn't know about it simply ignored it. No breaking changes, no coordinated deployments, no downtime. This is backwards-compatible schema evolution, and it's incredibly powerful in a microservices environment where you can't deploy all services simultaneously.
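On the Go side, the generated code makes this easy to consume defensively. A small sketch — the import path is a placeholder for wherever the generated code lives, not our real module path:

```go
package consumer

import (
    "log"

    catalogv1 "example.com/internal/gen/catalog/v1" // hypothetical import path for the generated code
)

// handleOperation shows a newer consumer reading the v1.1 field without
// assuming every producer has been upgraded. Consumers built against the
// v1.0 schema never reference Validation and simply ignore the unknown field.
func handleOperation(op *catalogv1.CatalogOperation) {
    if v := op.GetValidation(); v != nil {
        log.Printf("operation %s arrived pre-validated: %v", op.GetOperationId(), v)
        return
    }
    // v1.0 producer: fall back to validating the operation ourselves.
}
```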
Coordinating Go and Python: The Polyglot Challenge
One of the trickiest aspects of our architecture was coordinating Go and Python services. Go handled the high-throughput API layer and orchestration logic, while Python ran the AI/ML workloads (DSPy-based prompting, data validation, report generation).
The naive approach would be to have Go call Python over HTTP. But this creates synchronous coupling — exactly what we were trying to avoid. Instead, we used NATS as the bridge:
- Go API receives a request and publishes a `tasks.ai.validate` event
- Python worker subscribes to `tasks.ai.>` and picks up the event
- Python processes it (running the DSPy pipeline) and publishes a `results.ai.validate` event
- Go orchestrator picks up the result and updates the catalog (a sketch of this consumer side follows below)
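Here is a minimal sketch of the Go side of that bridge — a durable JetStream subscription on the results subjects. The subject matches the flow above; the consumer name and timeouts are illustrative:

```go
package orchestrator

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

// SubscribeResults attaches a durable consumer to results.ai.>. Manual acks
// plus AckWait/MaxDeliver give at-least-once delivery with automatic
// redelivery if a handler dies mid-processing.
func SubscribeResults(js nats.JetStreamContext, handle func(*nats.Msg) error) (*nats.Subscription, error) {
    return js.Subscribe("results.ai.>", func(m *nats.Msg) {
        if err := handle(m); err != nil {
            log.Printf("handling %s failed: %v (leaving unacked for redelivery)", m.Subject, err)
            return // no ack: JetStream redelivers after AckWait
        }
        if err := m.Ack(); err != nil {
            log.Printf("ack failed for %s: %v", m.Subject, err)
        }
    },
        nats.Durable("go-orchestrator"), // survives restarts, shared position
        nats.ManualAck(),                // ack only after the catalog update succeeds
        nats.AckWait(30*time.Second),    // redeliver if we don't ack in time
        nats.MaxDeliver(5),              // give up after repeated failures
    )
}
```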
This pattern gave us several advantages:
- Independent scaling: We could scale Python workers independently based on AI workload
- Language-appropriate tooling: Go for concurrency and API handling, Python for ML libraries
- Fault isolation: A crashing Python worker didn't take down the Go API
- Replay capability: If a Python worker crashed mid-processing, JetStream would redeliver the message to another worker
Kubernetes: Making It All Run
With multiple services in multiple languages, Kubernetes was the natural deployment target. But "just put it in Kubernetes" glosses over a lot of complexity.
Kustomize for Environment Management
We used Kustomize (not Helm) for managing our Kubernetes manifests. The reason was simplicity — Kustomize uses plain YAML with strategic merge patches, so there's no templating language to learn and no hidden rendering logic.
Our structure looked like:
```
k8s/
  base/
    deployment.yaml
    service.yaml
    kustomization.yaml
  overlays/
    dev/
      kustomization.yaml
      patches/
        replicas.yaml
    staging/
      kustomization.yaml
    production/
      kustomization.yaml
      patches/
        replicas.yaml
        resources.yaml
```
Each overlay could patch resource limits, replica counts, and environment variables without duplicating the base manifests. The CI pipeline simply ran `kustomize build k8s/overlays/production | kubectl apply -f -`.
Health Checks That Actually Work
One lesson we learned the hard way: your health check endpoints need to be meaningful. Our initial liveness probe just returned 200 OK if the HTTP server was running. But the service could be "alive" while completely unable to process events — for example, if the NATS connection had dropped.
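The fix starts with checks that interrogate the dependency rather than the process. A sketch of the NATS check, assuming the server struct holds the underlying `*nats.Conn` (the field name `s.nc` is an assumption for this sketch):

```go
// checkNATSConnection reports "ok" only when the client believes the
// connection is actually usable, not merely when the process is running.
func (s *Server) checkNATSConnection() string {
    if s.nc == nil || !s.nc.IsConnected() {
        return "disconnected"
    }
    return "ok"
}
```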
We refactored our health checks to verify actual system health:
```go
func (s *Server) healthHandler(w http.ResponseWriter, r *http.Request) {
    // Each entry is "ok" or a short description of what's wrong.
    health := map[string]string{
        "nats":    s.checkNATSConnection(),
        "iceberg": s.checkIcebergCatalog(),
    }
    for _, status := range health {
        if status != "ok" {
            w.WriteHeader(http.StatusServiceUnavailable)
            json.NewEncoder(w).Encode(health)
            return
        }
    }
    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(health)
}
```

Testing Distributed Systems
Testing was where we invested the most effort — and where we saw the highest return. We achieved 85%+ code coverage across the Go services, but more importantly, we had confidence that our tests actually caught real bugs.
Unit Tests: Fast and Focused
Standard unit tests for business logic. Nothing fancy here — table-driven tests in Go, pytest in Python. The key was keeping them fast (under 5 seconds total) so developers would actually run them.
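For anyone unfamiliar with the pattern, a table-driven test in Go looks roughly like this. The function under test, subjectForOperation, is a made-up stand-in for this example, not code from our repository:

```go
package events

import "testing"

// subjectForOperation is a stand-in for this example: it maps an operation
// type to the NATS subject it should be published on.
func subjectForOperation(opType string) string {
    switch opType {
    case "create", "alter", "drop":
        return "tasks.catalog." + opType
    default:
        return "tasks.catalog.unknown"
    }
}

func TestSubjectForOperation(t *testing.T) {
    cases := []struct {
        name   string
        opType string
        want   string
    }{
        {"create maps to its own subject", "create", "tasks.catalog.create"},
        {"drop maps to its own subject", "drop", "tasks.catalog.drop"},
        {"anything else falls back", "compact", "tasks.catalog.unknown"},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            if got := subjectForOperation(tc.opType); got != tc.want {
                t.Errorf("subjectForOperation(%q) = %q, want %q", tc.opType, got, tc.want)
            }
        })
    }
}
```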
Integration Tests: The Real MVP
Integration tests were where we caught the most bugs. We used testcontainers to spin up real NATS and Iceberg instances in Docker during CI:
```go
func TestCatalogOperationFlow(t *testing.T) {
    ctx := context.Background()

    // Start a real NATS server in Docker
    natsC, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "nats:latest",
            ExposedPorts: []string{"4222/tcp"},
            Cmd:          []string{"-js"}, // Enable JetStream
        },
        Started: true,
    })
    require.NoError(t, err)
    defer natsC.Terminate(ctx)

    // Connect and run the actual flow
    // ...
}
```

This approach caught bugs that unit tests with mocked NATS clients never would have found — things like serialization mismatches, incorrect JetStream consumer configurations, and race conditions in message acknowledgment.
End-to-End Tests: Confidence Before Deploy
We had a small suite of E2E tests that ran the entire system (Go API + Python workers + NATS + Iceberg catalog) in a docker-compose environment. These ran in CI before any merge to main and took about 3 minutes. Worth every second.
Key Takeaways
After building and operating this system for several months, processing over 100 catalog operations and automating 60% of data validation tasks, here are the principles I'd carry forward to any distributed system:
- Design for failure from day one. Every network call can fail. Build retry logic with exponential backoff, circuit breakers for cascading failure prevention, and graceful degradation into your systems before you need them — not after the first outage (see the retry sketch after this list).
- Observability is not optional. You can't debug what you can't see. We instrumented everything with structured JSON logging, OpenTelemetry traces across service boundaries, and Prometheus metrics for queue depths, processing latencies, and error rates. When something broke at 2 AM, we could trace a request across all three services in under a minute.
- Schema evolution matters more than you think. Protocol Buffers gave us backwards-compatible schema evolution that let us deploy services independently. Plan for how your data contracts will change over time. If you're using JSON, at least document your schema with JSON Schema or OpenAPI and version your APIs.
- Test at every level, but invest most in integration tests. Unit tests are cheap and fast. E2E tests give confidence. But integration tests — testing real interactions between your service and its dependencies — catch the bugs that actually cause production incidents.
- Operational simplicity beats theoretical elegance. We chose NATS over Kafka despite Kafka's richer feature set because NATS was dramatically simpler to operate. In a small team, every hour spent on infrastructure operations is an hour not spent building product.
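As a concrete example of the first point, here is a minimal retry helper with exponential backoff and jitter — a sketch of the general technique, not our exact production code:

```go
package retry

import (
    "context"
    "fmt"
    "math/rand"
    "time"
)

// Do retries fn with exponential backoff plus jitter, giving up after
// maxAttempts or when the context is cancelled.
func Do(ctx context.Context, maxAttempts int, baseDelay time.Duration, fn func() error) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        // Exponential backoff: base * 2^attempt, plus up to 50% random jitter.
        backoff := baseDelay << attempt
        backoff += time.Duration(rand.Int63n(int64(backoff)/2 + 1))
        select {
        case <-time.After(backoff):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}
```

A caller wraps whatever flaky call it needs to protect, for example a publish: `retry.Do(ctx, 5, 100*time.Millisecond, func() error { return p.Publish(subject, msg) })`.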
What's Next
I'm continuing to explore how AI agents can be integrated into data lake workflows. The intersection of intelligent automation and distributed infrastructure is where I see the most exciting opportunities — systems that don't just process data but understand and correct themselves. The self-correcting agent system we built at MinIO is just the beginning.
I'm also interested in how emerging technologies like WebAssembly might change the polyglot service story. Imagine running Python ML models compiled to Wasm inside a Go service, eliminating the network hop entirely while keeping language-appropriate tooling. That's a future I'm excited about.
Stay tuned for more posts on this topic.