Designing Workloads for Spot and High-Churn Node Pools

Karpenter is often used with spot or otherwise preemptible instances to reduce cost. These node pools are intentionally high churn: nodes may be created and terminated frequently.

Understanding this behavior is critical. Many issues that appear to be "Karpenter problems" are actually workload design issues when running on spot capacity.

On-Demand vs Spot Capacity

  • On-demand capacity

    • Nodes are stable and long-lived
    • Terminations are rare and usually manual
    • Higher cost, higher predictability
  • Spot capacity

    • Nodes can be interrupted at any time
    • Nodes may disappear and be replaced frequently
    • Lower cost, lower predictability

If a NodePool permits spot capacity, Karpenter will use it aggressively whenever it is available. This is expected behavior, not a misconfiguration.

What "High Churn" Means in Practice

In a high-churn node pool, you should expect:

  • Nodes to be terminated and replaced regularly
  • Pods to be evicted and rescheduled
  • Temporary drops in capacity while new nodes are created

These events are normal and not indicators of failure.

Workload Characteristics That Work Well on Spot

These workloads generally behave well on spot or high-churn node pools:

  • Stateless services

    • No local state
    • Can be restarted anywhere
  • Horizontally scalable deployments

    • Multiple replicas
    • Load can shift between pods
  • Jobs or batch workloads

    • Short-lived tasks
    • Can be retried safely
  • Workloads with graceful shutdown handling

    • Respond correctly to termination signals
    • Cleanly stop accepting traffic before exiting
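As an illustrative sketch of the last point (the workload name and image are placeholders, not from this document), graceful shutdown is usually configured with a termination grace period and a preStop hook so the pod can drain before it exits:

```yaml
# Hypothetical Deployment fragment illustrating graceful shutdown on spot nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Time allowed for the container to finish in-flight work after eviction.
      terminationGracePeriodSeconds: 30
      containers:
        - name: web
          image: example.com/web:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                # A brief sleep lets endpoint updates propagate before the
                # process receives SIGTERM and stops accepting traffic.
                command: ["sleep", "5"]
```

The application itself must still handle SIGTERM by finishing in-flight requests and exiting; the preStop hook only buys time for traffic to shift away.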

Workload Characteristics That Do Not Work Well on Spot

These workloads often cause confusion when placed on spot capacity:

  • Single-replica services

    • A single pod restart causes downtime
  • Stateful workloads without replication

    • Local state is lost on node termination
  • Long-running jobs without checkpointing

    • Work is lost when the pod is interrupted
  • Workloads that cannot tolerate restarts

    • Assume long-lived nodes
    • Fail hard on eviction

If these workloads are required, they should usually run on on-demand capacity.

Making Workloads Spot-Tolerant

To run safely on spot capacity, workloads should use:

  • Multiple replicas spread across nodes
  • PodDisruptionBudgets to control eviction behavior
  • Readiness probes so traffic is only sent to healthy pods
  • Graceful termination handling so pods exit cleanly when interrupted

These patterns allow Karpenter to replace capacity without visible impact.
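A minimal sketch of one of these patterns (the names are placeholders): a PodDisruptionBudget that keeps a minimum number of replicas available while Karpenter drains and replaces nodes:

```yaml
# Hypothetical PodDisruptionBudget: with 3 replicas of "web", voluntary
# disruptions such as node consolidation can evict at most one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb          # placeholder name
spec:
  minAvailable: 2        # evictions pause if fewer than 2 pods are healthy
  selector:
    matchLabels:
      app: web           # must match the Deployment's pod labels
```

Combined with readiness probes, this lets evictions proceed gradually so the service never drops below its availability floor.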

Isolating Spot vs On-Demand Workloads

A common and recommended pattern is to separate capacity types:

  • Use separate NodePools for spot and on-demand nodes
  • Use node selectors or affinities to place critical workloads on on-demand capacity
  • Taint spot nodes and add tolerations so only spot-capable workloads can land on them
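One way to express this separation (a sketch; the NodePool name and taint key are illustrative, the karpenter.sh labels are real): a spot NodePool that is tainted so only workloads with a matching toleration schedule there:

```yaml
# Hypothetical spot NodePool. Workloads must tolerate the taint below to
# land here; everything else schedules onto the on-demand NodePool instead.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot                           # placeholder name
spec:
  template:
    spec:
      taints:
        - key: example.com/spot        # illustrative taint key
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]             # this pool provisions only spot nodes
```

A second NodePool without the taint, restricted to the "on-demand" capacity type, would then serve as the default for everything else.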

Prefer on-demand capacity for:

  • Critical system components
  • Single-replica services
  • Stateful workloads that cannot tolerate interruption

Prefer spot capacity for:

  • Scalable application workloads
  • Batch processing
  • Cost-sensitive, interruption-tolerant services

Forcing a Workload onto On-Demand Capacity

If a workload cannot tolerate interruption or frequent restarts, it should run on on-demand capacity.

A common approach is:

  • Define separate NodePools for spot and on-demand nodes
  • Label or taint on-demand nodes
  • Constrain critical workloads to those nodes

From a workload perspective, this usually involves:

  • Adding a node selector or affinity that targets on-demand nodes
    • Karpenter labels on-demand nodes with:
      karpenter.sh/capacity-type: on-demand
  • Removing any spot tolerations so the workload cannot land on tainted spot nodes
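Putting that together, a sketch of a pod template fragment pinned to on-demand capacity (the karpenter.sh label is the one shown above; everything else is a placeholder):

```yaml
# Hypothetical pod template fragment: the nodeSelector restricts scheduling
# to nodes Karpenter has provisioned with on-demand capacity.
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
```

A node affinity expression on the same label works equally well when softer preferences or multiple values are needed.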