Azure Kubernetes Service 📅 2026-02-03

Azure Kubernetes Service Pod CrashLoopBackOff Error: How to Resolve

🚨 Symptoms & Diagnosis

When an Azure Kubernetes Service (AKS) pod enters a CrashLoopBackOff state, it signifies that a container within the pod is repeatedly starting and crashing. The Kubernetes Kubelet component attempts to restart the container, applying an exponential backoff delay to prevent overwhelming the system. This state is a critical indicator of application instability or misconfiguration.

You'll typically observe the following signatures when a pod is stuck in a CrashLoopBackOff loop:

    STATUS: CrashLoopBackOff

Checking pod details might reveal:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error

Events log will show repeated back-off warnings:

    Warning  BackOff  1m (x5 over 1m)  kubelet, aks-agentpool-12345678-vmss000001  Back-off restarting failed container

And detailed pod description will highlight the restart count and exit codes:

    Ready:          False
    Restart Count:  2
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Thu, 01 Jan 1970 00:00:01 +0000

Common Exit Code values include 1 (a general application error) and 137 (128 + signal 9, i.e. SIGKILL, most often OOMKilled - Out Of Memory).
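
Exit codes above 128 encode a fatal signal as 128 plus the signal number, so the signal behind an exit code can be recovered with simple arithmetic:

```shell
# Recover the fatal signal from a >128 exit code (128 + signal number)
exit_code=137
signal=$((exit_code - 128))
echo "signal $signal"   # 9 = SIGKILL, typically sent by the kernel OOM killer
```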

Root Cause: The CrashLoopBackOff state fundamentally indicates that your container's main process is exiting prematurely with a non-zero status, prompting the Kubelet to continually attempt restarts, often due to application errors, resource exhaustion, or configuration issues.


🛠️ Solutions

Resolving CrashLoopBackOff requires a systematic approach to identify and rectify the underlying cause. Begin with immediate debugging steps and then move towards more permanent fixes.

Immediate Debug & Restart

Immediate Mitigation: Parse Logs, Identify Exit Code, Force Restart

This initial set of commands helps you quickly gather information about the crashing pod and force a restart, which can sometimes resolve transient issues.

  1. Identify the Crashing Pod:

    kubectl get pods -n <your-namespace>
    # CrashLoopBackOff is not a pod phase (such pods usually report phase Running),
    # so filter on the STATUS column instead of using a field selector:
    kubectl get pods -n <your-namespace> | grep CrashLoopBackOff
    

  2. Describe the Pod for Events and Exit Codes: Examine the Events section and Last State for crucial clues like exit codes or probe failures.

    kubectl describe pod <pod-name> -n <your-namespace>
    

  3. Tail Logs from the Previous Container Instance: Logs from the previous failed container instance are vital for understanding why it crashed.

    kubectl logs <pod-name> -n <your-namespace> --previous --tail=50
    # If multiple containers, specify the container name
    kubectl logs <pod-name> -c <container-name> -n <your-namespace> --previous --tail=50
    

  4. Force Restart the Pod: Deleting the pod will cause the Deployment/StatefulSet controller to create a new one, clearing the CrashLoopBackOff state and often providing a fresh start.

    !!! warning "Data Loss Warning"
        Deleting a pod can lead to temporary service disruption and potential loss of ephemeral data within the pod if it's not backed by persistent storage. Proceed with caution in production environments.

    kubectl delete pod <pod-name> -n <your-namespace>
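
The status-column filtering from step 1 can be wrapped in a small helper for reuse, a sketch that assumes kubectl is configured and the default column layout of `kubectl get pods`:

```shell
# Print the names of pods whose STATUS column reads CrashLoopBackOff
# (namespace defaults to "default"; pass another namespace as the first argument)
crashloop_pods() {
  ns="${1:-default}"
  kubectl get pods -n "$ns" --no-headers | awk '$3 == "CrashLoopBackOff" { print $1 }'
}
```

Each name it prints can then be fed to `kubectl describe` and `kubectl logs --previous` from steps 2 and 3.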
    

Ephemeral Debug Pod

Immediate Mitigation: Live Inspection with a Debug Pod

If logs aren't sufficient, creating a temporary debug pod using the same image can help you inspect the environment, filesystems, and processes interactively. This is particularly useful for diagnosing kernel-level issues or container entrypoint failures.

  1. Create a Debug Pod: This command runs a new pod based on your application's image, but overrides the entrypoint to allow you to shell into it.

    # Note: the image must be repeated inside --overrides, because the override
    # replaces the generated container spec (including the --image value)
    kubectl run debug-pod --image=<crashing-image> --restart=Never -n default --overrides='{"spec":{"containers":[{"name":"debug","image":"<crashing-image>","command":["/bin/sh"],"stdin":true,"tty":true}]}}'
    

  2. Exec into the Debug Pod and Investigate: Once inside, you can check application processes, memory usage, file permissions, and system logs.

    kubectl exec -it debug-pod -n default -- /bin/sh
    
    Inside the debug pod:
    ps aux | grep <your-app-process-name> # Check if your app process is running
    cat /proc/<pid>/status | grep VmRSS   # Check Resident Set Size for memory usage
    dmesg | tail -20                      # Look for kernel messages, especially OOM
    kill -l                               # List Linux signals to understand exit codes (e.g., 9=SIGKILL, 15=SIGTERM)
    

  3. Clean Up the Debug Pod:

    exit
    kubectl delete pod debug-pod -n default
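
/proc/&lt;pid&gt;/status reports VmRSS in kB; converting it to MiB makes the reading directly comparable to a memory limit such as "512Mi" (the value below is a placeholder):

```shell
# VmRSS is reported in kB; divide by 1024 to compare against a limit like "512Mi"
rss_kb=420000   # placeholder value read from /proc/<pid>/status
echo "$((rss_kb / 1024)) MiB"
```

If this figure sits close to (or above) the container's memory limit, an OOM kill is the likely cause of exit code 137.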
    

Fix Resource Limits

Best Practice Fix: Adjust CPU/Memory Requests and Limits

A common cause of CrashLoopBackOff is the container exceeding its memory limit and being OOMKilled (exit code 137). Severe CPU throttling does not kill a container directly, but it can slow startup enough for liveness probes to fail. Adjusting resource requests and limits in your Deployment or StatefulSet addresses both.

  1. Edit Your Deployment or StatefulSet:

    kubectl edit deployment <deployment-name> -n <your-namespace>
    # Or for StatefulSet:
    kubectl edit statefulset <statefulset-name> -n <your-namespace>
    

  2. Add or Adjust resources in the Container Spec: Based on your debugging (e.g., VmRSS from the debug pod), set appropriate requests (guaranteed resources) and limits (maximum allowed).

    # Example snippet within a container specification:
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m" # 250 milli-cores = 0.25 CPU core
      limits:
        memory: "512Mi"
        cpu: "500m" # 500 milli-cores = 0.5 CPU core
    

  3. Apply Changes: Saving the file opened by kubectl edit applies the change immediately, and because the pod template changed, the Deployment automatically rolls out replacement pods. If you need to force pod replacement without a template change, use a rollout restart:

    kubectl rollout restart deployment <deployment-name> -n <your-namespace>
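
For scripted environments, `kubectl set resources` applies the same change non-interactively. A sketch wrapping it in a helper, with the values mirroring the example above and the deployment name as a placeholder argument:

```shell
# Non-interactive alternative to `kubectl edit` for adjusting requests/limits
set_app_resources() {
  deploy="$1"; ns="${2:-default}"
  kubectl set resources deployment "$deploy" -n "$ns" \
    --requests=cpu=250m,memory=256Mi \
    --limits=cpu=500m,memory=512Mi
}
```

Like an edit to the pod template, this triggers a rollout automatically.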
    

Permanent Probe & Init Fix

Best Practice Fix: Correct Liveness/Readiness Probes and Init Containers

Misconfigured liveness or readiness probes can cause pods to be prematurely killed or never become Ready. Similarly, a failing initContainer will prevent the main application container from ever starting.

  1. Check Pod Description for Probe Failures: The kubectl describe pod output will explicitly show if liveness or readiness probes are failing and why.

    kubectl describe pod <pod-name> -n <your-namespace>
    # Look for "Liveness probe failed:" or "Readiness probe failed:"
    

  2. Update YAML with Proper Probe Configuration: Adjust initialDelaySeconds (time before first probe), periodSeconds (how often to probe), timeoutSeconds (how long to wait for a response), and failureThreshold (number of consecutive failures before action).

    kubectl edit deployment <deployment-name> -n <your-namespace>
    
    # Example livenessProbe fix within a container specification:
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30 # Give the app enough time to start
      periodSeconds: 10       # Check every 10 seconds
      timeoutSeconds: 5       # Wait up to 5 seconds for a response
      failureThreshold: 3     # Allow 3 consecutive failures before restarting
    # Add or adjust readinessProbe similarly
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 1
    

  3. Review and Fix Failing Init Containers: If an initContainer is failing, ensure its command, dependencies, and permissions are correct. Remove it if it's no longer necessary, or troubleshoot its specific error using logs.
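
For reference, a minimal initContainer that waits for a dependency before the main container starts might look like this (the service name "my-db" and port 5432 are hypothetical):

```yaml
# Hypothetical initContainer: block until the "my-db" service accepts connections
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ["sh", "-c", "until nc -z my-db 5432; do echo 'waiting for db'; sleep 2; done"]
```

If an initContainer like this crash-loops, its logs are available via kubectl logs <pod-name> -c wait-for-db -n <your-namespace>.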

🧩 Technical Context (Visualized)

The Kubernetes Kubelet, running on each node in your Azure Kubernetes Service (AKS) cluster, is responsible for managing the lifecycle of pods and their containers. When a container's main process exits with a non-zero status code, the Kubelet detects this as a crash. It then applies an exponential backoff strategy, progressively increasing the delay (from 10 seconds up to 5 minutes) before attempting to restart the container, ultimately leading to the CrashLoopBackOff state.

graph TD
    A[Pod Scheduled on Node] --> B{"Kubelet Initializes Pod/Containers"}
    B --> C(Init Containers Run)
    C -- Success --> D[Application Container Starts]
    C -- "Failure (Non-Zero Exit)" --> E[CrashLoopBackOff for Init Container]
    E --> F{Exponential Backoff Timer}
    F -- Timer Expires --> C

    D -- Application Runs --> G(Pod Status: Running)
    D -- "Application Crashes (Non-Zero Exit)" --> H["Container Terminated: Error"]
    H --> I[Kubelet Detects Crash]
    I --> J["Pod Status: CrashLoopBackOff"]
    J --> K{"Exponential Backoff Timer (10s, doubling, capped at 5m)"}
    K -- Timer Expires --> D
    G -- "Runs Cleanly for 10m" --> M["Backoff Timer Resets"]
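
The backoff schedule can be sketched numerically: the delay starts at 10 seconds and doubles on each crash until it hits the 5-minute cap:

```shell
# Print the kubelet's restart delays: 10s initial, doubling, capped at 300s (5m)
print_backoff() {
  delay=10
  for attempt in 1 2 3 4 5 6 7; do
    echo "restart $attempt: ${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
  done
}
print_backoff
```

Once a container runs successfully for 10 minutes, the kubelet resets this timer back to its starting value.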

✅ Verification

After implementing any solution, verify that your pods are running stably and without errors.

  1. Watch Pod Status: Continuously monitor pod status until they transition to Running and Ready.

    kubectl get pods -n <your-namespace> -w
    

  2. Check Pod State and Restart Count: Ensure the State is Running and the Restart Count has stabilized (0 for freshly rolled-out pods; for existing pods, it should simply stop increasing).

    kubectl describe pod <pod-name> -n <your-namespace> | grep -E 'State|Restart'
    

  3. Review Recent Pod Logs: Confirm that the application logs show no new errors and are processing requests correctly.

    kubectl logs <pod-name> -n <your-namespace> --tail=10
    

  4. Inspect Recent Events: Look for any new Warning or Error events related to your pod.

    kubectl get events -n <your-namespace> --sort-by='.lastTimestamp' | grep <pod-name>
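
These checks can be automated with kubectl wait, which blocks until the pod reports Ready or fails after a timeout (a sketch; pod and namespace names are arguments):

```shell
# Block until the pod is Ready; exits non-zero if it stays unready for 2 minutes
wait_ready() {
  pod="$1"; ns="${2:-default}"
  kubectl wait pod "$pod" -n "$ns" --for=condition=Ready --timeout=120s
}
```

This makes the verification step usable in CI/CD pipelines, where a non-zero exit can fail the deployment.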
    

📦 Prerequisites

To effectively troubleshoot CrashLoopBackOff errors in AKS, ensure you have the following:

  • kubectl: Version 1.28 or later.
  • Azure CLI: Version 2.60 or later.
  • RBAC Access: Sufficient RBAC permissions on the AKS cluster (rights to view and delete pods and to edit Deployments; cluster-admin works but is broader than necessary).
  • Cluster Credentials: Obtain cluster credentials using az aks get-credentials --resource-group <rg> --name <aks-cluster>.
  • Linux Signals Knowledge: Familiarity with common Linux signals and container exit codes (e.g., SIGTERM=15 for graceful shutdown, SIGKILL=9 for forceful termination; exit codes above 128 equal 128 plus the fatal signal number, so 137 usually indicates an OOM kill).