Karpenter: Corrective Actions and Escalation

Excessive Node Churn: When to Escalate and When Not To

Karpenter is designed to create and remove nodes frequently, especially when using spot or other interruptible capacity. Some node churn is expected and healthy.

When Node Churn Is Normal (Do Not Escalate)

Do not escalate if:

  • Nodes are being created and removed gradually
  • Pods are successfully rescheduled after node deletion
  • Workloads briefly restart but recover automatically
  • Churn aligns with traffic changes or workload scaling
  • Nodes are primarily spot or preemptible instances

In these cases, Karpenter is behaving as designed.

When to Escalate Excessive Node Churn

Escalate if you observe:

  • Nodes repeatedly created and deleted within minutes
  • Large numbers of nodes cycling continuously
  • Pods never stabilizing or constantly restarting
  • NodeClaims flapping without ever producing Ready nodes
  • Churn affecting on-demand or critical workloads

These patterns often indicate:

  • Misconfigured NodePools or constraints
  • Cloud capacity or quota issues
  • Node startup failures (AMI, bootstrap, CNI)
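The patterns above can usually be distinguished from healthy churn by inspecting NodeClaims and recent node events directly. A minimal sketch (assumes a Karpenter version that exposes the NodeClaim API; older releases used Machine or Provisioner-managed objects instead):

```shell
# List NodeClaims with readiness and age; flapping shows up as many
# short-lived entries that never reach Ready.
kubectl get nodeclaims -o wide

# Recent events sorted by time reveal the creation/deletion cadence.
kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'nodeclaim|node'
```

If most NodeClaims live only minutes and never become Ready, that points to node startup failures rather than normal consolidation.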

Basic Corrective Actions

These actions are safe and may resolve transient issues.

Restart the Karpenter Controller

kubectl rollout restart deployment/karpenter -n karpenter
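After the restart, it is worth confirming that the controller came back healthy before drawing any conclusions. A quick check (assumes Karpenter is deployed as `deployment/karpenter` in the `karpenter` namespace, as in the command above):

```shell
# Wait for the rollout to complete
kubectl rollout status deployment/karpenter -n karpenter

# Scan recent controller logs for recurring errors
kubectl logs -n karpenter deployment/karpenter --since=5m
```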

Re-trigger Scheduling

Scaling a workload can force re-evaluation of pending pods:

kubectl scale deployment <name> --replicas=<n>
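To confirm the re-evaluation worked, check whether the pending pods are being picked up. A sketch (pod and namespace names are placeholders):

```shell
# List pods still pending across the cluster
kubectl get pods -A --field-selector=status.phase=Pending

# Inspect the scheduling events for a specific pending pod
kubectl describe pod <pod-name> -n <namespace>
```

Events mentioning a nominated node or a new NodeClaim indicate Karpenter has reacted; repeated "no nodes available" errors suggest a constraint or capacity problem.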

Forcing Karpenter to Replace a Node (Cordon / Drain / Delete)

In some situations, a node may be stuck, misconfigured, or unhealthy, even though Karpenter itself is working correctly. In these cases, it can be reasonable to force Karpenter to replace the node.

This approach makes sense when:

  • A node remains NotReady for an extended period
  • Node-level DaemonSets (especially CNI / networking) have failed
  • The node is clearly unhealthy, but Karpenter is not removing it automatically
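Before forcing a replacement, confirm the node's actual state and conditions; this also preserves evidence if escalation is needed later. A sketch (`<node-name>` is a placeholder):

```shell
# Overall node status: look for NotReady and how long it has persisted (AGE)
kubectl get nodes

# Conditions and recent events explain why the node is NotReady
kubectl describe node <node-name>
```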

Step 1: Cordon the Node

Prevent new pods from being scheduled onto the node:

kubectl cordon <node-name>

Step 2: Drain the Node

Evict existing pods from the node:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

This evicts the pods so they are rescheduled elsewhere. Note that PodDisruptionBudgets may slow or temporarily block the drain.

Step 3: Delete the Node

Remove the node from the cluster:

kubectl delete node <node-name>

Once the node is deleted, Karpenter will observe the resulting unschedulable pods and provision replacement capacity if needed.
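To verify the replacement, watch for a new NodeClaim becoming Ready and confirm the evicted pods reschedule. A sketch (assumes the NodeClaim API is available):

```shell
# A new NodeClaim should appear and eventually become Ready
kubectl get nodeclaims -w

# The pending-pod count should drop to zero as replacement capacity joins
kubectl get pods -A --field-selector=status.phase=Pending
```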

Important Notes
  • Draining a node will disrupt running workloads
  • Do not do this if many nodes are NotReady at the same time — this usually indicates a broader cluster or networking issue
  • Always inspect what is running on the node first to ensure no single‑replica or non‑HA workloads will be disrupted:
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
  • If you are unsure whether workloads on the node can tolerate disruption, escalate instead of proceeding
  • Do not do this for multiple nodes at once unless explicitly instructed
  • This is a recovery step, not routine maintenance

When Not to Intervene

Do not restart Karpenter, modify its configuration, or manually remove nodes if:

  • Nodes are actively being created
  • Pods are pending on launching nodes
  • Errors indicate temporary cloud capacity shortages

In these cases, waiting is often the correct action.

Information to Collect Before Escalation

Before escalating, collect the following information:

  • Karpenter controller pod status and logs
  • Pod describe output showing scheduling errors
  • NodePool or Provisioner configuration
  • Cloud provider context (region, instance types, quotas)
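The items above can be gathered with a few commands. A sketch (the `karpenter` namespace and resource names are assumptions; adjust to your installation):

```shell
# Controller pod status and the last hour of logs
kubectl get pods -n karpenter
kubectl logs -n karpenter deployment/karpenter --since=1h > karpenter.log

# Scheduling errors for an affected pod
kubectl describe pod <pod-name> -n <namespace> > pod-describe.txt

# NodePool (or legacy Provisioner) configuration
kubectl get nodepools -o yaml > nodepools.yaml
```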

This information allows faster diagnosis.

Escalate Immediately If

Escalate without further troubleshooting if:

  • The Karpenter controller is not running
  • No nodes are ever created despite sustained pending pods
  • Errors clearly indicate cloud provider permission or quota failures