Why Kubernetes v1.36 Finally Made Me Stop Dreading Batch Job Scheduling
Admin User
Author
I spent three hours last Tuesday debugging why our ML training pipeline kept deadlocking on Kubernetes. Four worker pods needed to run simultaneously, but the scheduler was placing them on different nodes at different times, and once the third one landed, there wasn't enough memory left for the fourth. Classic gang scheduling nightmare. I remember thinking: "This shouldn't be this hard in 2026." Then I read about Kubernetes v1.36, and for the first time in a while, I felt like the platform was actually listening to people building real workloads.
The problem isn't new. Anyone running batch jobs, ML training, or distributed workloads on Kubernetes knows the pain. Pod-by-pod scheduling works great for microservices, but it's fundamentally broken for workloads that need coordinated placement. You can't just schedule pods independently and hope they magically cooperate. Yet that's essentially what we've been doing, patching it with nodeSelectors, affinity rules, and crossed fingers.
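To make the failure mode concrete, here is a toy simulation I find useful. It is only a sketch under loud assumptions: the cluster is reduced to a count of free slots, the two jobs and the function name schedule_pod_by_pod are hypothetical, and nothing here resembles real kube-scheduler code. It shows how independent, first-come-first-served placement can leave two gangs each holding part of the cluster while neither can start.

```python
from collections import Counter

def schedule_pod_by_pod(pods, free_slots):
    """Place each pod independently, first come first served.

    `pods` is a list of (job_name, pod_name) tuples in arrival order and
    `free_slots` is how many pods the toy cluster can hold.
    Returns a Counter of how many pods each job got placed.
    """
    placed = Counter()
    for job, _pod in pods:
        if free_slots == 0:
            break  # cluster is full; the remaining pods stay Pending
        placed[job] += 1
        free_slots -= 1
    return placed

# Two hypothetical jobs, each needing all 4 workers before it can start.
job_a = [("job-a", f"worker-{i}") for i in range(4)]
job_b = [("job-b", f"worker-{i}") for i in range(4)]

# Pods from both jobs arrive interleaved: a0, b0, a1, b1, ...
arrivals = [pod for pair in zip(job_a, job_b) for pod in pair]

placed = schedule_pod_by_pod(arrivals, free_slots=6)
runnable = [job for job, count in placed.items() if count >= 4]

print(dict(placed))  # {'job-a': 3, 'job-b': 3}
print(runnable)      # []
```

Six slots would have been enough to run one job to completion and then the other, but placing pods one at a time splits them three and three, so both jobs sit there holding resources and making no progress.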
v1.36 is more than an incremental improvement: it changes the unit the scheduler reasons about, from the individual pod to the group of pods that has to run together.
The Architecture Changed, and It Actually Makes Sense
The separation between the Workload and PodGroup APIs initially confused me. Why split what used to be a single object? Once I thought it through, the design made sense.
Previously, everything (the template and the runtime state) lived in one object. That meant the scheduler had to watch and parse templates even though it only cares about runtime decisions. It's like asking your database to re-evaluate query plans every time a comment changes in the schema file.
Now the Workload is purely a static template. Your Job controller defines it once, stamps out actual PodGroup instances, and the scheduler only concerns itself with the PodGroup. The scheduler reads what matters: the actual scheduling policy and current pod states. Nothing more.
This also solves a scaling problem I didn't immediately appreciate. When you have hundreds of replicas, status updates can become a bottleneck. The new architecture allows per-replica sharding of status through the PodGroup, so status writes can be spread across PodGroup objects rather than piling up on one.
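The template/instance split is easier for me to reason about as plain data structures. The sketch below is my own mental model, not the real API types: the class names mirror the kinds in the release post, but stamp_pod_group, the scheduled_pods field, and the explicit name argument are hypothetical stand-ins for whatever the Job controller actually does.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PodGroupTemplate:
    name: str
    min_count: int  # stands in for schedulingPolicy.gang.minCount

@dataclass(frozen=True)
class Workload:
    """Static template: frozen, so nothing mutates it at runtime."""
    name: str
    pod_group_templates: tuple

@dataclass
class PodGroup:
    """Runtime instance: the only object this toy scheduler would watch."""
    name: str
    workload_name: str
    template_name: str
    min_count: int
    scheduled_pods: list = field(default_factory=list)  # hypothetical runtime state

def stamp_pod_group(workload, template_name, name):
    """Create a PodGroup from a named template (hypothetical helper)."""
    template = next(
        t for t in workload.pod_group_templates if t.name == template_name
    )
    return PodGroup(
        name=name,
        workload_name=workload.name,
        template_name=template.name,
        min_count=template.min_count,  # policy is copied into the instance
    )

workload = Workload("training-job-workload", (PodGroupTemplate("workers", 4),))
pg = stamp_pod_group(workload, "workers", name="training-job-workers-pg")

# Runtime changes land on the PodGroup; the Workload template is untouched.
pg.scheduled_pods.append("worker-0")
print(pg.min_count, pg.scheduled_pods)  # 4 ['worker-0']
```

The point of the frozen dataclass is the same as the point of the API split: runtime churn happens on the instance, and the template never needs to be re-read to make a scheduling decision.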
The PodGroup Scheduling Cycle: Atomic Operations, Finally
Here's where this gets genuinely useful. Instead of the scheduler evaluating your four worker pods separately, it now takes a snapshot of the cluster, evaluates them as a unit, and makes an atomic decision.
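Here is roughly how I picture "snapshot, evaluate as a unit, decide atomically". Treat it as a minimal sketch under stated assumptions: place_gang is a hypothetical function, memory is the only resource, and the first-fit loop is a simple heuristic I picked for illustration, not the scheduler's actual algorithm.

```python
def place_gang(pod_requests, node_free, min_count):
    """All-or-nothing placement against a snapshot of free memory.

    pod_requests: dict of pod name -> memory request in GiB
    node_free:    dict of node name -> free memory in GiB (never mutated)
    Returns {pod: node} if at least `min_count` pods fit, otherwise None.
    """
    snapshot = dict(node_free)  # work on a copy so a failed attempt leaves no trace
    assignment = {}
    # Try the largest requests first, taking the first node with room.
    for pod, request in sorted(pod_requests.items(), key=lambda kv: -kv[1]):
        for node, free in snapshot.items():
            if free >= request:
                snapshot[node] = free - request
                assignment[pod] = node
                break
    if len(assignment) < min_count:
        return None  # nothing is bound; the whole group keeps waiting
    return assignment

workers = {f"worker-{i}": 8 for i in range(4)}  # four pods, 8 GiB each
tight = {"node-a": 16, "node-b": 8}             # room for only three
roomy = {"node-a": 16, "node-b": 16}            # room for all four

print(place_gang(workers, tight, min_count=4))  # None
print(place_gang(workers, roomy, min_count=4))
```

In the tight cluster the function returns None and the free-memory map is left exactly as it was, which is the property that matters: a group that cannot fully fit does not hold on to anything.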
Let me show you what this looks like in practice:
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
spec:
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4
When the Job controller runs, it creates actual PodGroup instances from this template:
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
spec:
  podGroupTemplateRef:
    workload:
      workloadName: training-job-workload
      podGroupTemplateName: workers
  schedulingPolicy:
    gang:
      minCount: 4
The pods then reference the PodGroup via schedulingGroup instead of the older workloadRef. Now when the scheduler processes this group, it doesn't make individual decisions. With minCount: 4, it either finds nodes that can accommodate all four workers or binds none of them and keeps waiting. No partial scheduling, which means no half-placed group sitting on memory the rest of the job needs.
This is exactly what I've been needing.
What This Means for How I Build
I'm genuinely excited about the topology-aware scheduling and preemption pieces mentioned in the release post, though it doesn't go deep on either. What I'm most interested in is how this will evolve.
One question, though: what happens when you want different scheduling policies for different subgroups? Can I say "4 GPU pods must colocate, but 10 CPU pods can be scattered"? The release post doesn't clarify this, and I suspect real-world batch workloads will need this flexibility.
The Dynamic Resource Allocation support for workloads is also intriguing. I've been avoiding DRA because it felt immature, but if it's now integrated with proper gang scheduling, I might actually use it for our GPU-hungry pipelines.
Here's What I'm Doing Next
I'm spinning up a test cluster with v1.36 this month. I want to see how this handles our actual training job topology under production-like conditions. The architectural improvements sound solid, but the API is still alpha (v1alpha2), and I need to feel where the rough edges are myself.
If you're running batch workloads on Kubernetes, I'd genuinely love to hear if you're hitting the same deadlock problems I've been struggling with. Are you considering moving to the new Workload API?
Source: This post was inspired by "Kubernetes v1.36: Advancing Workload-Aware Scheduling" on the Kubernetes Blog.