The Cache Consistency Problem I Didn't Know I Had (Until It Cost Us)

Three months into production, our ReplicaSet controller started behaving like a ghost—taking actions based on information that didn't actually exist anymore. A pod would be deleted, our controller would see the old state in its cache, and we'd recreate it seconds later. Users reported applications flickering in and out of existence. It took us two weeks to trace the issue back to our controller's cache being out of sync with reality.

I remember sitting with our DevOps lead, both of us staring at controller logs at 2 AM, asking: "How is this thing seeing data we deleted an hour ago?" The answer was both simple and infuriating—our controller was making decisions based on an outdated view of the cluster state. This is the kind of bug that doesn't show up in staging because it's timing-dependent, and it doesn't show up in your first month because you haven't accumulated enough concurrent operations to trigger the race condition. It's a production problem waiting to happen.

Kubernetes v1.36 just addressed exactly this category of issue, and I have thoughts about why this matters more than most feature releases.

The Staleness Problem Is Real, and We've All Experienced It

Let me be direct: if you're running controllers in Kubernetes at any real scale, you've probably hit staleness issues without realizing it. Controllers maintain local caches of cluster state for performance reasons—querying the API server on every reconciliation loop would destroy throughput. But that cache is only as fresh as the last watch update that reached it.

Staleness happens when your controller's view of the world diverges from actual cluster state. A pod gets deleted but your cache hasn't received the deletion event yet. The API server goes briefly unavailable and your informer falls behind until its watch reconnects. Your controller restarts and has to rebuild its cache from scratch. In any of these scenarios, your controller might take an action based on incomplete or outdated information.

I've seen this manifest as controllers not scaling deployments when they should, controllers creating duplicate resources because they didn't see that the first one already existed, and, in the worst case, controllers getting stuck in reconciliation loops because they can't see what they just wrote.
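To make the failure mode concrete, here is a toy sketch of a cache that lags behind the source of truth. Nothing in it is client-go: `event`, `cache`, and `desiredAction` are hypothetical names I made up for illustration, and a real informer is far more involved. It only shows how a reconcile that trusts its cache gives a different answer before and after a pending delete event arrives.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a watch notification.
type event struct {
	deleted bool
	name    string
}

// cache is a toy local view that only changes when an event is applied.
type cache struct {
	pods map[string]bool
}

// apply folds one event into the local view.
func (c *cache) apply(e event) {
	if e.deleted {
		delete(c.pods, e.name)
		return
	}
	c.pods[e.name] = true
}

// desiredAction mimics a naive reconcile: it trusts the cache completely.
func desiredAction(c *cache, name string) string {
	if c.pods[name] {
		return "nothing to do"
	}
	return "create " + name
}

func main() {
	server := map[string]bool{"web-1": true} // source of truth
	local := &cache{pods: map[string]bool{"web-1": true}}

	// The pod is deleted on the server; the event is still in flight.
	delete(server, "web-1")
	pending := []event{{deleted: true, name: "web-1"}}

	// The cache still lists web-1, so the reconcile decides wrongly.
	fmt.Println("before event:", desiredAction(local, "web-1"))

	for _, e := range pending {
		local.apply(e)
	}
	fmt.Println("after event: ", desiredAction(local, "web-1"))
}
```

Between the delete and the event delivery, the reconcile reports that there is nothing to do even though the pod is gone. That window is the staleness this post is about.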

How Kubernetes v1.36 Tackles This

The fix comes at two levels. First, in client-go, there's now atomic FIFO processing for batched events. When an informer first populates its cache, it receives a batch of objects. Previously, these were added to the queue one at a time in the order they were received. Now, they're processed atomically. This prevents the cache from entering an inconsistent intermediate state where you're looking at objects 1-5 and 10-15, but 6-9 haven't arrived yet.
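The difference between one-at-a-time and atomic loading is easier to see in a toy model. The sketch below is not client-go's implementation; `store`, `addOne`, and `addBatch` are hypothetical names, and the only point is that applying a whole batch under a single lock leaves no partial state for a reader to observe.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy thread-safe cache keyed by object name.
type store struct {
	mu    sync.RWMutex
	items map[string]string
}

func newStore() *store { return &store{items: map[string]string{}} }

// addOne applies a single object; readers can run between calls.
func (s *store) addOne(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = val
}

// addBatch applies every object under one lock acquisition, so a reader
// sees either none of the batch or all of it.
func (s *store) addBatch(batch map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range batch {
		s.items[k] = v
	}
}

// size returns how many objects a reader would see right now.
func (s *store) size() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}

func main() {
	batch := map[string]string{"pod-a": "v1", "pod-b": "v1", "pod-c": "v1"}

	// One at a time: a reader between calls can observe a partial view.
	incremental := newStore()
	seen := []int{}
	for k, v := range batch {
		incremental.addOne(k, v)
		seen = append(seen, incremental.size())
	}
	fmt.Println("sizes visible during incremental load:", seen)

	// Batched: the only sizes a reader can observe are 0 and len(batch).
	batched := newStore()
	before := batched.size()
	batched.addBatch(batch)
	fmt.Println("sizes visible around batched load:", before, batched.size())
}
```

In the incremental case a reconcile that runs mid-load sees one or two of the three pods and may act on that partial list. In the batched case there is no such window.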

More importantly, there's a new way to query your cache: LastStoreSyncResourceVersion(). This tells you the latest resource version your informer has actually seen. You can compare this against what you've written to the API server. If your cache is behind, you don't act.

The four core controllers that manage the most contended resources—DaemonSet, StatefulSet, ReplicaSet, and Job—now use this check before taking action. You can enable or disable this per-controller using feature gates.

My Take: This Should Have Been Obvious Five Years Ago

Here's what I think: this is a good fix for a problem that shouldn't have existed in this form for this long. The insight isn't complex—"check if your cache is fresh before acting on it"—but the execution is solid.

What impresses me is that the Kubernetes team didn't force a breaking change. They put staleness mitigation behind feature gates, then enabled it by default for the controllers where it matters most. That's the right approach for a distributed system where you can't afford to break everyone's controllers simultaneously.

The part I'm watching carefully is adoption by third-party controller authors. The example they provide shows how to use ConsistencyStore to track written resource versions, but my concern is that this adds cognitive load. Not every controller author thinks about staleness proactively. Some will see this feature and think "interesting," and keep writing controllers the old way.

Practical Application

If you're building a custom controller, you should now check cache staleness. Here's the pattern:

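// Before taking action, verify your cache is fresh.
// Assumption: both values are resource-version strings. A plain < on strings
// compares them lexically ("9" sorts after "10"), so compare them as numbers.
// cacheIsBehind is a hypothetical helper that parses both and does that.
lastSyncVersion := store.LastStoreSyncResourceVersion()
writtenVersion := controller.getLastWrittenResourceVersion(object)

if cacheIsBehind(lastSyncVersion, writtenVersion) {
    // Your cache is behind. Don't act yet.
    // Requeue the work item and try again.
    workqueue.AddAfter(key, 5*time.Second)
    return nil
}

// Safe to proceed with your reconciliation logic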

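This shifts your mental model from "act immediately" to "verify then act." The check itself costs microseconds. When the cache really is behind you also pay the requeue delay (5 seconds in this example), which is a fair price for avoiding silent correctness bugs.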

What I'm Wondering

Will this fix actually get adopted widely, or will it remain known only to operators running the biggest clusters? And more importantly—does this push us toward thinking about stronger consistency guarantees at the architecture level, or does it encourage everyone to build on top of fundamentally eventually-consistent systems?

I'm genuinely curious whether you've run into staleness issues in your own controllers, and whether something like this would have saved you time.

Source: This post was inspired by "Kubernetes v1.36: Staleness Mitigation and Observability for Controllers" by Kubernetes Blog.

Written by Adil Sher
