Designing Workloads for Spot and High-Churn Node Pools

Karpenter is often used with spot or otherwise preemptible instances to reduce cost. These node pools are intentionally high churn: nodes may be created and terminated frequently.

Understanding this behavior is critical. Many issues that appear to be "Karpenter problems" are actually workload design issues when running on spot capacity.

On-Demand vs Spot Capacity

  • On-demand capacity

    • Nodes are stable and long-lived
    • Terminations are rare and usually manual
    • Higher cost, higher predictability
  • Spot capacity

    • Nodes can be interrupted at any time
    • Nodes may disappear and be replaced frequently
    • Lower cost, lower predictability

If a NodePool permits spot capacity, Karpenter will use it aggressively whenever it is available. This is expected behavior, not a misconfiguration.

What "High Churn" Means in Practice

In a high-churn node pool, you should expect:

  • Nodes to be terminated and replaced regularly
  • Pods to be evicted and rescheduled
  • Temporary drops in capacity while new nodes are created

These events are normal and not indicators of failure.

Workload Characteristics That Work Well on Spot

These workloads generally behave well on spot or high-churn node pools:

  • Stateless services

    • No local state
    • Can be restarted anywhere
  • Horizontally scalable deployments

    • Multiple replicas
    • Load can shift between pods
  • Jobs or batch workloads

    • Short-lived tasks
    • Can be retried safely
  • Workloads with graceful shutdown handling

    • Respond correctly to termination signals
    • Cleanly stop accepting traffic before exiting
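As an illustrative sketch of the last point (the workload name and image are placeholders, not from this document), graceful shutdown is usually configured with a termination grace period and a preStop hook so the pod can drain before it exits:

```yaml
# Hypothetical Deployment fragment illustrating graceful shutdown on spot nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Time allowed for the container to finish in-flight work after eviction.
      terminationGracePeriodSeconds: 30
      containers:
        - name: web
          image: example.com/web:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                # A brief sleep lets endpoint updates propagate before the
                # process receives SIGTERM and stops accepting traffic.
                command: ["sleep", "5"]
```

The application itself must still handle SIGTERM by finishing in-flight requests and exiting; the preStop hook only buys time for traffic to shift away.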

Workload Characteristics That Do Not Work Well on Spot

These workloads often cause confusion when placed on spot capacity:

  • Single-replica services

    • A single pod restart causes downtime
  • Stateful workloads without replication

    • Local state is lost on node termination
  • Long-running jobs without checkpointing

    • Work is lost when the pod is interrupted
  • Workloads that cannot tolerate restarts

    • Assume long-lived nodes
    • Fail hard on eviction

If these workloads are required, they should usually run on on-demand capacity.

Making Workloads Spot-Tolerant

To run safely on spot capacity, workloads should use:

  • Multiple replicas spread across nodes
  • PodDisruptionBudgets to control eviction behavior
  • Readiness probes so traffic is only sent to healthy pods
  • Graceful termination handling so pods exit cleanly when interrupted

These patterns allow Karpenter to replace capacity without visible impact.
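A minimal sketch of one of these patterns (the names are placeholders): a PodDisruptionBudget that keeps a minimum number of replicas available while Karpenter drains and replaces nodes:

```yaml
# Hypothetical PodDisruptionBudget: with 3 replicas of "web", voluntary
# disruptions such as node consolidation can evict at most one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb          # placeholder name
spec:
  minAvailable: 2        # evictions pause if fewer than 2 pods are healthy
  selector:
    matchLabels:
      app: web           # must match the Deployment's pod labels
```

Combined with readiness probes, this lets evictions proceed gradually so the service never drops below its availability floor.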

Isolating Spot vs On-Demand Workloads

A common and recommended pattern is to separate capacity types:

  • Use separate NodePools for spot and on-demand nodes
  • Use node selectors or affinities to place critical workloads on on-demand capacity
  • Taint spot nodes and add tolerations so only spot-capable workloads can land on them
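One way to express this separation (a sketch; the NodePool name and taint key are illustrative, the karpenter.sh labels are real): a spot NodePool that is tainted so only workloads with a matching toleration schedule there:

```yaml
# Hypothetical spot NodePool. Workloads must tolerate the taint below to
# land here; everything else schedules onto the on-demand NodePool instead.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot                           # placeholder name
spec:
  template:
    spec:
      taints:
        - key: example.com/spot        # illustrative taint key
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]             # this pool provisions only spot nodes
```

A second NodePool without the taint, restricted to the "on-demand" capacity type, would then serve as the default for everything else.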

Prefer on-demand capacity for:

  • Critical system components
  • Single-replica services
  • Stateful workloads that cannot tolerate interruption

Prefer spot capacity for:

  • Scalable application workloads
  • Batch processing
  • Cost-sensitive, interruption-tolerant services

Forcing a Workload onto On-Demand Capacity

If a workload cannot tolerate interruption or frequent restarts, it should run on on-demand capacity.

A common approach is:

  • Define separate NodePools for spot and on-demand nodes
  • Label or taint on-demand nodes
  • Constrain critical workloads to those nodes

From a workload perspective, this usually involves:

  • Adding a node selector or affinity that targets on-demand nodes
    • Karpenter labels on-demand nodes with:
      karpenter.sh/capacity-type: on-demand
  • Removing any spot tolerations so the workload cannot land on tainted spot nodes
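Putting that together, a sketch of a pod template fragment pinned to on-demand capacity (the karpenter.sh label is the one shown above; everything else is a placeholder):

```yaml
# Hypothetical pod template fragment: the nodeSelector restricts scheduling
# to nodes Karpenter has provisioned with on-demand capacity.
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
```

A node affinity expression on the same label works equally well when softer preferences or multiple values are needed.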