
Stop Node Hunting: How Kubernetes DRA Simplifies GPU Scheduling for AI Workloads

By Chimbu Chinnadurai · Apr 7, 2026 · 7 min read

AI is no longer a side project. Teams everywhere are running large language models (LLMs), training pipelines, and inference servers on Kubernetes. And when your workload needs a GPU, things get complicated fast.

GPUs and TPUs are expensive and hard to obtain, and for a long time Kubernetes used the Device Plugin framework to manage them. This framework was originally developed during a period when Kubernetes workloads were relatively simple, primarily demanding only "a GPU," which the system would then supply. However, modern AI workloads are significantly more complex. As soon as you introduce mixed GPU types, NVLink topologies, or specific VRAM requirements, Device Plugins fall apart, requiring extensive manual configuration.

Dynamic Resource Allocation (DRA) is Kubernetes' answer to these shortcomings. Rather than the old model, where nodes advertised fixed hardware counts and the scheduler blindly claimed them, DRA introduces a request-based allocation model. Workloads describe what they need, and DRA's control plane figures out how to satisfy that claim across the cluster. This shift moves hardware awareness out of individual node agents and into a centralized, expressive API.

At KubeCon Europe 2026, NVIDIA donated its DRA GPU driver to CNCF, and Google announced the open-source release of the DRA TPU driver. These weren't just community goodwill gestures; they show that the two leading AI hardware vendors have fully adopted DRA as the standard interface for managing hardware in Kubernetes. For platform teams, this means you no longer need to depend on vendor-specific workarounds or proprietary scheduling logic. The same DRA primitives now operate reliably whether you're using NVIDIA GPUs, Google TPUs, or both in the same cluster.

The Old Way: Device Plugins and Their Pain Points

Before DRA, if you wanted your pod to use a GPU, you'd add something like this to your pod spec:

resources:
  limits:
    nvidia.com/gpu: 1

This simple integer request created three massive pain points:

1. Lack of Attribute-Based Selection and Native Fractional Support

Device Plugins natively support only basic integer counting (e.g., "1 GPU"), with no fractional GPUs. While there are workarounds like NVIDIA's Time-Slicing or Multi-Instance GPU (MIG) to split hardware resources, these remain external hacks that the Kubernetes scheduler doesn't truly understand. The framework also lacks any way to request resources based on specific attributes, like "I need one with at least 40 GB of VRAM" or "I need one from a specific architecture." This often results in workloads being assigned to underpowered or incompatible hardware.
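To see why these workarounds are invisible to the scheduler: time-slicing, for example, is configured in the NVIDIA device plugin's own config file, not in any Kubernetes scheduling API. A sketch of that config (field names follow the NVIDIA k8s-device-plugin documentation; the ConfigMap wiring around it is elided):

```yaml
# NVIDIA device plugin config: advertise each physical GPU as 4
# schedulable replicas. The scheduler only sees a bigger integer;
# it has no idea the "GPUs" are time-sliced shares.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

The node then reports `nvidia.com/gpu: 4` per physical GPU, but nothing in the cluster API records that these are fractional shares with no memory isolation.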

2. Manual Orchestration Overhead

Device Plugins don't provide the Kubernetes scheduler with any useful information about hardware. Because the scheduler lacked granular hardware awareness, administrators were forced to manually map workloads to specific nodes using hard-coded labels. This approach is not scalable in large clusters, as it requires constant manual updates whenever hardware is added, removed, or decommissioned.
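In practice, that manual mapping meant a hand-applied node label plus a `nodeSelector` in every pod spec (the label name below is illustrative, not a convention):

```yaml
# Pre-DRA workaround: an admin labels the node by hand...
#   kubectl label node gpu-node-1 gpu-type=a100-80gb
# ...and every workload hard-codes that label:
spec:
  nodeSelector:
    gpu-type: a100-80gb   # silently stale when hardware changes
```

Every hardware refresh means re-labeling nodes and auditing every pod spec that references the old label.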

3. Static Provisioning Constraints

Device Plugins required hardware to be pre-configured and available before a task was initiated. There was no mechanism for dynamic, "just-in-time" resource allocation, nor for the system to search for and initialize hardware in response to a pending request.

Enter Dynamic Resource Allocation (DRA): The New Standard for AI Hardware

Dynamic Resource Allocation (DRA) is the new Kubernetes standard for managing specialized hardware. The primary objective of DRA is to decouple resource management from the core Kubernetes scheduler. Instead of having the user identify specific nodes or manually tag hardware, DRA allows the workload to define its requirements. The system then dynamically identifies, claims, and prepares the optimal hardware across the entire cluster.

DRA introduces three key concepts that make this work.

1. DeviceClass — Abstractions for Platform Teams

DeviceClass is a blueprint defined by platform or cluster admins. Instead of making developers know hardware specifics, admins can create named classes like high-memory-gpu or low-latency-fpga. Developers just request a class by name, and the scheduler handles the rest.

2. ResourceSlice — What's Available

Think of a ResourceSlice as a hardware inventory report that represents one or more devices in a pool. DRA drivers (like the NVIDIA GPU driver or Google's TPU driver) publish detailed information about the devices on each node. Not just "this node has 4 GPUs," but rich details like:

  • Total GPU memory (VRAM)
  • Architecture and hardware model
  • Which PCIe root complex or NUMA node the device sits on
  • Number of compute cores

This is the key shift: Hardware details that used to be hidden are now fully visible to the scheduler.
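To make this concrete, here is a sketch of what a driver-published ResourceSlice can look like (device and attribute names vary by driver; the values below are illustrative, not taken from a real node):

```yaml
# Published by the DRA driver on each node; users never create these.
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-node-1-gpu.nvidia.com
spec:
  driver: gpu.nvidia.com
  nodeName: gpu-node-1
  pool:
    name: gpu-node-1
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    attributes:
      productName:
        string: "NVIDIA A100 80GB"
      architecture:
        string: "Ampere"
    capacity:
      memory:
        value: 80Gi
```

You can see what drivers have published in your own cluster with `kubectl get resourceslices`.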

3. ResourceClaim — What You Need

A ResourceClaim is how you describe your workload's requirements. This is where DRA gets powerful. Instead of asking for "1 GPU," you can now say things like:

  • I need a GPU with at least 40 GB of VRAM
  • I need a GPU and a high-speed NIC that are on the same NUMA node
  • I need an accelerator that matches the high-memory-gpu class

The scheduler reads this claim, looks at all the ResourceSlices published across the cluster, and finds the best match automatically.
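For instance, the first requirement above ("at least 40 GB of VRAM") can be expressed directly in a claim with a CEL selector layered on top of a class. A sketch (assuming the NVIDIA driver, which publishes a base `gpu.nvidia.com` DeviceClass; adjust the class and domain for your driver):

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: big-vram-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com   # base class from the driver
        selectors:
        - cel:
            # capacity names in CEL are qualified by the driver's domain
            expression: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

Class-level selectors (from the DeviceClass) and claim-level selectors compose: a device must satisfy both to be allocated.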

Real-World Example: Running vLLM with DRA

Let's say you're running a large language model inference server using vLLM. You need a GPU with plenty of VRAM, and you want scheduling to be automatic, with no manual node pinning. With DRA, your setup might look something like this:


Step 1: The cluster admin creates a DeviceClass

This class filters for any GPU with more than 40GB of memory using a Common Expression Language (CEL) filter.

---
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
    - cel:
        # capacity names in CEL are qualified by the driver's domain
        expression: device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("40Gi"))

Step 2: You create a ResourceClaim for your pod spec

The user requests one device from that specific class.

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: vllm-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: high-memory-gpu
        count: 1

Step 3: Reference the claim in your pod

Finally, the Pod simply points to the claim. No nodeSelector or complex affinity rules required.

apiVersion: v1
kind: Pod
metadata:
  name: vllm-inference
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: vllm-gpu-claim
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      claims:
      - name: gpu-claim

That's it. You described what you need. Kubernetes — now armed with full visibility into every node's hardware via ResourceSlices — finds a suitable node and schedules your pod there. No nodeSelector. No hunting through kubectl get nodes.

The Strategic Value of DRA

DRA ensures you get full value from expensive accelerators by eliminating manual configuration errors and hardware bottlenecks.

  • For developers: Stop the manual "node hunting." Define the hardware requirements your code needs, and let Kubernetes handle the discovery and attachment.
  • For platform teams: You can create hardware "tiers" using DeviceClasses and expose them, without giving everyone raw access to node labels and hardware specs.
  • For operations: Resource utilization improves. The scheduler has better information, which means better placement decisions, less fragmentation, fewer idle GPUs, and better ROI on expensive hardware.
  • For AI at scale: DRA has already become the foundation of the Kubernetes AI Conformance program. It is no longer an optional feature; it has become the industry standard.

Wrapping Up

Kubernetes has evolved from simply hosting web servers and microservices to becoming the preferred platform for some of the most intensive AI workloads. However, this exciting progress has also brought new infrastructure challenges, particularly in managing specialized hardware such as GPUs, TPUs, and high-speed networking devices.

The Device Plugin framework served its purpose, but it was designed for a simpler time. DRA is built for our current environment: clusters with diverse, costly hardware, workloads with specific and complex needs, and teams that need to move quickly without becoming hardware topology experts.

Optimize Your AI Infrastructure with DoiT

If you are currently running AI workloads or planning to implement DRA, DoiT can accelerate your journey. Our team of over 100 cloud experts specializes in tailored solutions to optimize your infrastructure, ensure compliance, and maximize your hardware ROI.

Contact us today to transition your cluster to the new Kubernetes AI standard.
