The Uptime Engineer
👋 Hi, I am Yoshik
This week, you'll learn how kube-scheduler actually eliminates and ranks nodes through the Filter, Score, and Bind pipeline, and why the scheduling plugin framework makes every phase configurable at the architecture level.
🔥Tool Spotlight
kube-scheduler-simulator - a Kubernetes-native simulator that runs the real scheduler against synthetic clusters and shows you which plugin made which decision for each pod, without touching your production cluster.
📚 Worth Your Time
Kubernetes Scheduling Framework - official docs. The canonical reference on extension points, plugin lifecycle, and KubeSchedulerProfile configuration.
How the scheduler works
kube-scheduler does one thing: watch for unassigned pods, pick a node, write the assignment back.
That's the whole job. The interesting part is how it picks.
Every pod goes through a two-cycle pipeline. The scheduling cycle selects the node synchronously, one pod at a time. The binding cycle writes that decision to the API server asynchronously, while the next pod is already being scheduled.
Why split them? Network writes are slow. Node selection is fast. Coupling them would mean the scheduler waits on I/O between every decision.

Phase 1: Filter
Filter is not selection. It's elimination.
Every node runs through a set of filter plugins in parallel. Fail one plugin, that node is gone.
The five that matter most in production:
NodeResourcesFit - does the node have enough CPU and memory for the pod's requests? No requests set means the scheduler is guessing. Set them.
TaintToleration - does the pod tolerate the node's taints? A NoSchedule taint removes that node for every pod without a matching toleration. Most common Pending cause after resource exhaustion.
NodeAffinity - does the node match the pod's nodeSelector or nodeAffinity? A disktype: ssd selector on a cluster with no SSD-labeled nodes means zero feasible nodes. Permanently.
PodTopologySpread - would placing this pod push a zone past maxSkew? Resources can be fine, taints can be fine - this constraint alone will block placement. It's the sneaky one.
VolumeBinding - can the pod's PVCs be satisfied here? A PVC requiring a missing storage class eliminates every node.
On clusters above 1,000 nodes, the scheduler samples by default - 50% of nodes or at least 100, whichever is larger. Not every node gets evaluated every time.
After filtering, you either have feasible nodes or a Pending pod.
That 0/5 nodes are available message is the filter output. Each number tells you exactly which plugin rejected exactly which nodes.
Phase 2: Score
Once filtering produces a feasible set, scoring picks the winner.
Each surviving node gets a score from 0-100 per scoring plugin. Scores combine with configurable weights. Highest total wins.
Two scoring functions drive most real cluster behavior - and they pull in opposite directions.
LeastAllocated spreads pods across nodes with the most remaining capacity. Better blast radius isolation. One node going down hits fewer pods. This is the default.
MostAllocated packs pods onto the most utilized nodes. Fewer nodes running, lower bill. If your cluster sits at 30% average utilization because LeastAllocated keeps spreading everything thin, this is worth evaluating.
ImageLocality scores nodes higher when they already have your container image cached. For images above 1GB, pull time dominates cold-start latency. This scoring function has a real operational impact - most people ignore it.
After scoring, the winning node is reserved while binding completes. If binding fails (the node went down mid-cycle, for example), the pod re-enters the queue and starts over.
When the filter finds nothing
Zero feasible nodes doesn't mean the scheduler stops. It evaluates preemption.
It finds lower-priority pods whose eviction would free up enough resources, picks the smallest victim set, marks those pods for graceful termination, and waits for the next scheduling cycle.
For this to work, your pods need PriorityClass definitions.
Without them, every pod competes equally. Your batch job and your payment service have identical priority. Under pressure, the scheduler picks by submission order - not business importance.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false

Two additions to your workload specs. Most clusters don't have them. Most clusters find out why during their first resource-constrained incident.
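Wiring a class into a workload is one line in the pod spec. A minimal sketch, assuming the critical-production class exists in the cluster (the workload name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api        # hypothetical workload
spec:
  priorityClassName: critical-production
  containers:
    - name: app
      image: registry.example.com/checkout-api:1.0   # placeholder image
      resources:
        requests:           # set requests, or NodeResourcesFit is guessing
          cpu: 500m
          memory: 256Mi
```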
How to read what the scheduler is telling you
Three commands. Run them in order.
# exact rejection reason per node
kubectl get events \
--field-selector reason=FailedScheduling \
-n <namespace> \
--sort-by='.metadata.creationTimestamp'
# full event context on the pod
kubectl describe pod <pod-name> -n <namespace> \
| grep -A30 "Events:"
# node allocation state
kubectl describe nodes \
  | grep -A5 "Allocated resources"

The FailedScheduling event is explicit:
2 Insufficient cpu = NodeResourcesFit rejected 2 nodes.
3 node(s) had taint = TaintToleration rejected 3 nodes.
TLDR;
Think of it as a negotiation between what the pod needs and what the cluster has.
Filter is the pod saying: I won't run here unless these conditions are met.
Score is the scheduler saying: Here's the best option from what's left.
Bind is the contract: You're going to this node. Kubelet takes it from here.
When a pod is Pending, the negotiation broke down at Filter. The fix is always in the filter output - not in adding nodes, not in reapplying the manifest, not in restarting the scheduler.
Thank you for supporting this newsletter.
Y’all are the best.
Join 1,000+ engineers learning DevOps the hard way
Every week, I share:
How I'd approach problems differently (real projects, real mistakes)
Career moves that actually work (not LinkedIn motivational posts)
Technical deep-dives that change how you think about infrastructure
No fluff. No roadmaps. Just what works when you're building real systems.

👋 Find me on Twitter | Linkedin | Connect 1:1
