Kubernetes v1.36 Unveils Major Scheduling Overhaul: New PodGroup API Separates Template from Runtime

Breaking: Kubernetes v1.36 Delivers Revolutionary Scheduling Architecture for AI/ML Workloads

The Kubernetes community released v1.36 today, introducing a fundamental rearchitecture of workload-aware scheduling that cleanly separates static template definitions from runtime state management. The update centers on the new PodGroup API, which replaces the integrated approach of v1.35 and unlocks atomic scheduling, topology-awareness, and preemption capabilities.

“This is a significant architectural evolution. The Workload API now serves purely as a static template, while the PodGroup API handles all runtime state, making the scheduler far more efficient,” said a Kubernetes scheduling team lead. The change streamlines the kube-scheduler by allowing it to read PodGroup objects directly, eliminating the need to parse Workload resources.

PodGroup API: The Core of the New Design

In v1.35, Pod groups and their dynamic states were embedded within the Workload resource. v1.36 decouples them: Workload objects act as immutable templates, and PodGroup objects carry the actual scheduling policy and runtime conditions. Because each PodGroup records its own status, updates are sharded per replica instead of accumulating on a single Workload object, which improves scalability.

Operators define Pod group templates inside a Workload spec, including gang scheduling parameters such as minCount. Controllers then stamp out individual PodGroup instances that reference the template and hold real-time scheduling conditions.
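The template-and-instance split can be pictured with a small model. This is only a sketch: the class names, field names, and the stamp_pod_groups helper are hypothetical stand-ins chosen for illustration and do not reproduce the actual scheduling.k8s.io/v1alpha2 schema or any real controller code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# All names below are hypothetical; they are NOT the real Kubernetes schema.
@dataclass(frozen=True)
class PodGroupTemplate:
    name: str
    min_count: int  # gang scheduling parameter (minCount in the article)


@dataclass(frozen=True)
class Workload:
    """Static template only: frozen, so it carries no runtime state."""
    name: str
    templates: Tuple[PodGroupTemplate, ...]


@dataclass
class PodGroup:
    """One runtime instance stamped from a template."""
    name: str
    workload_ref: str
    template_ref: str
    min_count: int
    conditions: List[str] = field(default_factory=list)  # runtime state lives here


def stamp_pod_groups(workload: Workload, replicas: int) -> List[PodGroup]:
    """Create one PodGroup per (template, replica) pair."""
    groups = []
    for template in workload.templates:
        for i in range(replicas):
            groups.append(
                PodGroup(
                    name=f"{workload.name}-{template.name}-{i}",
                    workload_ref=workload.name,
                    template_ref=template.name,
                    min_count=template.min_count,
                )
            )
    return groups


workload = Workload(name="training", templates=(PodGroupTemplate("workers", 4),))
groups = stamp_pod_groups(workload, replicas=2)

# A status update touches a single PodGroup; the Workload and the
# sibling PodGroup are left alone.
groups[0].conditions.append("Scheduled")
```

In this model a status change is written to exactly one PodGroup object, which is the intuition behind spreading status updates across replicas.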

Gang Scheduling and Atomic Workload Processing

The new PodGroup scheduling cycle in the kube-scheduler enables atomic processing of entire workload groups. “This paves the way for future enhancements like predictable pod placement and coordinated lifecycle management,” a contributor noted. Gang scheduling ensures that a batch job proceeds only if at least a minimum number of its pods (minCount) can run simultaneously; if that threshold cannot be met, none of the group's pods are started.
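The all-or-nothing behavior of gang scheduling can be sketched in a few lines. The gang_schedule function, its parameters, and the one-slot-per-pod capacity model are assumptions made for illustration; this is not the kube-scheduler's algorithm.

```python
from typing import Dict, List, Optional


def gang_schedule(
    min_count: int, pods: List[str], free_slots: Dict[str, int]
) -> Optional[Dict[str, str]]:
    """Return a pod -> node placement if at least min_count pods fit, else None.

    Hypothetical model: every pod needs exactly one slot on a node.
    The caller's free_slots mapping is not modified.
    """
    remaining = dict(free_slots)  # work on a copy so a failed attempt has no side effects
    placement: Dict[str, str] = {}
    for pod in pods:
        # First node that still has a free slot, or None if the cluster is full.
        node = next((n for n, slots in remaining.items() if slots > 0), None)
        if node is None:
            break
        remaining[node] -= 1
        placement[pod] = node
    if len(placement) < min_count:
        return None  # threshold not met: place nothing at all
    return placement


pods = ["w-0", "w-1", "w-2", "w-3"]
cluster = {"node-a": 2, "node-b": 1}

rejected = gang_schedule(4, pods, cluster)  # only 3 slots exist, so the gang is rejected
accepted = gang_schedule(3, pods, cluster)  # 3 pods fit, so the gang is admitted
```

The key point is that a partial fit returns nothing rather than a partial placement, so a job never holds resources while waiting for peers that cannot start.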

Topology-Aware Scheduling and Preemption

v1.36 also debuts the first iterations of topology-aware scheduling and workload-aware preemption. Topology-aware scheduling lets operators optimize pod placement based on node topology, while workload-aware preemption evicts lower-priority pods to make room for critical workloads.
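Both ideas can be illustrated with toy selection functions. The sketch below assumes a best-fit rule for choosing a topology domain and a lowest-priority-first rule for choosing preemption victims; these policies, the function names, and the data shapes are illustrative assumptions, not the behavior of the v1.36 scheduler.

```python
from typing import Dict, List, Optional, Tuple


def pick_domain(group_size: int, free_by_domain: Dict[str, int]) -> Optional[str]:
    """Pick the tightest-fitting topology domain (e.g. a rack) that can hold
    the whole group, or None if no single domain is large enough."""
    candidates = [
        (free, name) for name, free in free_by_domain.items() if free >= group_size
    ]
    if not candidates:
        return None
    return min(candidates)[1]  # smallest sufficient capacity wins


def pick_victims(
    needed: int, incoming_priority: int, running: List[Tuple[str, int]]
) -> Optional[List[str]]:
    """Choose `needed` running pods to evict, lowest priority first.

    Only pods with a priority below the incoming workload are eligible.
    Returns None if eviction cannot free enough slots (one slot per pod).
    """
    evictable = sorted(
        (pod for pod in running if pod[1] < incoming_priority),
        key=lambda pod: pod[1],
    )
    if len(evictable) < needed:
        return None
    return [name for name, _ in evictable[:needed]]


racks = {"rack-a": 8, "rack-b": 4, "rack-c": 2}
running = [("web", 50), ("batch", 10), ("db", 200), ("cron", 20)]

domain = pick_domain(4, racks)                            # "rack-b"
victims = pick_victims(2, incoming_priority=100, running=running)  # ["batch", "cron"]
```

As with gang scheduling, both helpers return None instead of a partial answer, which keeps the decision atomic for the whole group.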

ResourceClaim Support Unlocks Dynamic Resource Allocation

ResourceClaim support for workloads brings Dynamic Resource Allocation (DRA) to PodGroups for the first time. This allows batch and AI/ML jobs to request specialized hardware like GPUs or FPGAs without manual node selection.

Job Controller Integration Demonstrates Real-World Readiness

To prove the new APIs work at scale, v1.36 delivers the first phase of integration between the Job controller and the new Workload/PodGroup APIs. This ensures that existing batch jobs can leverage the enhanced scheduling without rewriting controllers.

Background: From v1.35 to v1.36

Kubernetes v1.35 introduced the foundational Workload API and basic gang scheduling built on a pod-based framework. However, the combined approach suffered from performance issues as the Workload resource grew with status updates. The v1.36 separation was driven by community feedback from AI/ML operators who needed more scalable and deterministic scheduling.

The new scheduling.k8s.io/v1alpha2 API version completely replaces the previous v1alpha1 version, and all existing workloads must migrate to the new format.
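Before migrating, operators may want an inventory of manifests still pinned to the old version. The following is a minimal sketch that works on manifests already loaded as dictionaries; the find_unmigrated helper and the sample manifests are hypothetical, and only the two apiVersion strings come from the release description above.

```python
from typing import Any, Dict, List

OLD_API = "scheduling.k8s.io/v1alpha1"
NEW_API = "scheduling.k8s.io/v1alpha2"


def find_unmigrated(manifests: List[Dict[str, Any]]) -> List[str]:
    """Return the names of manifests that still use the old apiVersion."""
    return [
        m.get("metadata", {}).get("name", "<unnamed>")
        for m in manifests
        if m.get("apiVersion") == OLD_API
    ]


# Sample manifests (hypothetical) as they might look after parsing YAML.
manifests = [
    {"apiVersion": OLD_API, "kind": "Workload", "metadata": {"name": "train-a"}},
    {"apiVersion": NEW_API, "kind": "Workload", "metadata": {"name": "train-b"}},
    {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": "etl"}},
    {"apiVersion": OLD_API, "kind": "Workload"},
]

pending = find_unmigrated(manifests)
```

The check only flags candidates; the actual field-level changes between the two versions have to be taken from the official API reference.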

What This Means for the Community

For AI/ML and batch workload operators, v1.36 resolves long-standing scheduling bottlenecks. The PodGroup API allows finer-grained control over pod group state, reducing overhead and enabling more sophisticated scheduling policies. “This is a game-changer for running large-scale training jobs,” commented a cloud-native AI lead. The addition of topology-aware scheduling and preemption further aligns Kubernetes with HPC-style workload requirements.

In the short term, early adopters can test the new APIs on non-production clusters. The Kubernetes project expects broader adoption within the next two minor releases as downstream tools, like batch schedulers and AI platforms, integrate the new primitives.