Stop Guessing Why Your Kubernetes Cluster Is Slow: PSI Metrics Finally Tell the Real Story
Admin User
Author
I spent three hours last month debugging why a client's Kubernetes cluster was dropping requests during peak load. CPU utilization was at 60%. Memory was at 45%. Everything looked fine on the dashboards. But their users were seeing 10-second p95 latencies. I was looking at the wrong metrics entirely, and I didn't even know it.
That's the problem PSI (Pressure Stall Information) solves. And with Kubernetes v1.36 graduating it to GA, there's no excuse left to ignore it.
For years, we've been operating on a lie: that utilization numbers tell us anything meaningful about actual performance. A node showing 60% CPU doesn't mean there's 40% of headroom left. Tasks could be stalled waiting for a core, or burning cycles on context switches. You won't see that in traditional metrics. You'll only see it in your latency spikes and in the confused on-call engineer at 2 AM.
What PSI Actually Is (And Why It Matters)
PSI tracks something utilization metrics never will: time lost. Specifically, it measures how much time tasks spend stalled waiting for CPU, memory, or I/O that isn't available. The kernel reports this as percentages over 10-second, 60-second, and 300-second moving averages, alongside a cumulative stall-time total, so you can tell a transient spike from real sustained contention. Each resource is reported in two flavors: "some" (at least one task is stalled) and "full" (every non-idle task is stalled at once).
Think of it this way: utilization is a snapshot. PSI is the story. "Your CPU is at 60%" tells you one thing. "Tasks are stalled 15% of the time waiting for CPU" tells you something completely different—and actionable.
In v1.36, Kubernetes exposes PSI data at the node, pod, and container levels. The kubelet Summary API carries the moving averages, and the /metrics/cadvisor endpoint exposes cumulative stall-time counters that Prometheus can scrape. That means your Prometheus setup can finally answer the question that kept me up at night: "Why is this slow?"
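The cAdvisor endpoint reports PSI as cumulative seconds of stall time, so getting to a "stalled X% of the time" number takes one more step. Here is a minimal sketch of Prometheus recording rules that do it. It assumes the counter names listed in the Kubernetes PSI documentation (container_pressure_*_waiting_seconds_total and container_pressure_*_stalled_seconds_total) and that the node-wide series is the root cgroup with id="/". The group and record names are my own invention, so check all of it against what your cluster's /metrics/cadvisor actually returns.

```yaml
# Hypothetical rules file; group and record names are illustrative only.
groups:
  - name: psi-recording-sketch
    rules:
      # rate() of a "seconds stalled" counter yields seconds per second,
      # i.e. a 0-1 fraction of wall-clock time. 0.15 means 15% of the time.
      # "waiting" corresponds to PSI "some": at least one task was held up.
      - record: container:psi_cpu_waiting:ratio_rate1m
        expr: rate(container_pressure_cpu_waiting_seconds_total{container!=""}[1m])

      # "stalled" corresponds to PSI "full": every non-idle task was held up.
      - record: container:psi_memory_stalled:ratio_rate1m
        expr: rate(container_pressure_memory_stalled_seconds_total{container!=""}[1m])

      # Node-wide I/O pressure, assuming the root cgroup is labeled id="/".
      - record: node:psi_io_waiting:ratio_rate1m
        expr: rate(container_pressure_io_waiting_seconds_total{id="/"}[1m])
```

A one-minute rate window needs at least two samples inside it, so widen the window if your scrape interval is longer than 30 seconds.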
The Performance Validation That Convinced Me
What won me over wasn't the concept—it was the benchmarking. The Kubernetes team ran v1.36 through actual load testing on high-density clusters (80+ pods per node) and measured the overhead of collecting these metrics.
Kernel-level tracking added between 0.9% and 3.1% CPU overhead under load. The kubelet's collection logic added almost nothing, staying within the noise of normal operations. These aren't theoretical numbers; they were measured at scale.
This matters because I've been burned in too many production incidents by features that looked good until they tanked performance. The fact that they measured kubelet overhead separately from kernel overhead tells me the team actually understood what they were measuring.
My Take: It's Table Stakes Now
Here's what I think: if you're running Kubernetes in production and you're not looking at PSI metrics, you're flying blind. Full stop.
The old objection, "won't collecting these metrics slow us down?", is dead. The testing settles it: you get real visibility into resource contention for, at worst, a low single-digit percentage of CPU. That's a trade I'll take every single time.
The one thing that annoyed me is what the v1.36 improvements around detecting OS-level PSI support had to fix. The fact that v1.34 and earlier could emit misleading zero-valued metrics if your kernel didn't support PSI? That's the kind of gotcha that wastes an afternoon. I'm glad they fixed it, but it should never have happened.
Requirements are straightforward: Linux kernel 4.20+, cgroup v2, and PSI enabled at boot. If you're on a modern Linux distribution (Ubuntu 22.04+, newer COS builds), you're probably already there, though some distributions compile PSI in but leave it off until you add psi=1 to the kernel command line. Windows clusters obviously don't get this, but honestly, if you're running production workloads on Windows containers in Kubernetes, you have bigger problems.
Where I'd Actually Use This
I'm immediately adding PSI metric scraping to the cluster monitoring setup for that client who had the latency problem. Here's roughly how I'd configure it in Prometheus:
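    # Scrape each node's cAdvisor endpoint through the API server proxy.
    # Note: the keep rule at the bottom drops every non-PSI metric from this
    # job, so run it alongside (not instead of) your regular cAdvisor job.
    - job_name: 'kubernetes-nodes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        # Copy Kubernetes node labels onto the scraped series.
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        # Send every scrape to the API server instead of the node directly.
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        # Build the per-node proxy path to the kubelet's cAdvisor endpoint.
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
        # Keep only the PSI counters: "waiting" (some) and "stalled" (full)
        # for CPU, memory, and I/O.
        - source_labels: [__name__]
          regex: 'container_pressure_(cpu|memory|io)_(waiting|stalled)_seconds_total'
          action: keep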
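Then I'd alert on stall time over a 60-second window. The cAdvisor metrics are cumulative counters, so in Prometheus that means taking rate() over one minute, which roughly plays the role of the kernel's 60-second average. If stall time stays above 5-10%, that's a real problem worth investigating, not a false alarm from normal utilization.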
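For a concrete starting point, here is a rough sketch of such an alert as a Prometheus rule. The 10% threshold comes from the range above. The five-minute "for" clause, the severity label, and the alert and group names are my own assumptions, and the metric name is the one given in the Kubernetes PSI documentation, so treat this as something to tune rather than a recommendation.

```yaml
# Hypothetical alerting rule; names, threshold, and durations are assumptions.
groups:
  - name: psi-alerts-sketch
    rules:
      - alert: ContainerCPUPressureSustained
        # Fraction of the last minute in which at least one task in the
        # container was waiting for CPU. 0.10 means 10% of the time.
        expr: rate(container_pressure_cpu_waiting_seconds_total{container!=""}[1m]) > 0.10
        # Require the condition to hold for 5 minutes to skip short bursts.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sustained CPU pressure in {{ $labels.namespace }}/{{ $labels.pod }}"
          description: "Container {{ $labels.container }} waited on CPU {{ $value | humanizePercentage }} of the time over the last minute."
```

The same shape works for memory and I/O by swapping in the matching counter. For those two I'd lean toward the "stalled" variants, since a full stall means no task in the container made any progress.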
What's Your Monitoring Actually Telling You?
Here's my question for you: if I looked at your current Kubernetes monitoring right now, could you tell me whether your latency issues are from saturation or something else entirely? Or are you, like I was, just staring at utilization percentages and guessing?
PSI changes that game. It's available, it's free, and it actually works. The only reason not to use it is if you're deliberately choosing to understand less about your systems.
Source: This post was inspired by "Kubernetes v1.36: PSI Metrics for Kubernetes Graduates to GA" by Kubernetes Blog.