# Kubernetes Exit Code 137: How to Resolve OOM Issues
## 🚨 Symptoms & Diagnosis
When a Kubernetes pod terminates with Exit Code 137, its container's main process was killed with `SIGKILL`. In almost all cases the sender is the Linux kernel's Out Of Memory (OOM) killer: the container exceeded its memory limit and was forcefully terminated to protect the host node.

Look for error signatures like the following in the node's kernel log or in your pod descriptions:
```bash
dmesg | grep -i 'killed process'
```

Sample output:

```text
Out of memory: Killed process 12345 (my-app) total-vm:1048576kB, anon-rss:524288kB
```
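If you cannot reach the node for `dmesg`, the same verdict is usually visible from the Kubernetes side in the pod's last container state; `my-app-pod` below is a placeholder name:

```bash
# Show the last terminated state recorded for the first container in the pod
kubectl describe pod my-app-pod | grep -A 5 'Last State'

# Or pull just the termination reason and exit code via jsonpath
kubectl get pod my-app-pod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
# Expected: OOMKilled 137
```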
Root Cause: Kubernetes containers exceeding their defined memory limits trigger the underlying Linux kernel's OOM killer. The kernel sends `SIGKILL` (signal 9) to the container process, and because processes terminated by a signal report an exit status of 128 + signal number, Kubernetes surfaces this as Exit Code 137 (128 + 9).
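The arithmetic is easy to reproduce locally, with no cluster involved, by killing a throwaway process with `SIGKILL` and reading back its exit status:

```bash
# Start a background process, SIGKILL it, and read its exit status
sleep 300 &
kill -9 $!
wait $!
echo $?   # prints 137 (128 + 9)
```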
## 🛠️ Solutions
Resolving Exit Code 137 involves both immediate mitigation strategies to restore service and long-term best practices to prevent recurrence.
### Increase Container Memory Limits
Immediate Mitigation: Scale and Adjust Memory
This quick fix involves identifying and restarting affected pods after increasing their memory limits. This provides immediate relief but should be followed by a proper analysis.
- **Identify affected pods:** Use `kubectl get pods` to find pods that are repeatedly restarting or showing an `OOMKilled` state.

- **Edit the deployment to increase memory limits:** Access the deployment specification and adjust the `resources.limits.memory` and `resources.requests.memory` for the affected container. Locate the `spec.template.spec.containers` section and modify the `resources` block (a command-line equivalent is sketched after this list).

- **Force pod restart:** Deleting the old OOMKilled pod will trigger the deployment controller to create a new one with the updated resource limits.

    Caution: Deleting a pod will momentarily interrupt service for that specific instance. Ensure you have sufficient replicas or a graceful shutdown mechanism if this is a production environment.

- **Monitor events:** Observe the new pod's status and events to confirm it starts successfully and doesn't get OOMKilled again.
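A minimal command sketch for these steps, assuming the deployment is named `my-app`, its container is named `app`, and the pod names are placeholders:

```bash
# 1. Find pods that are restarting or were OOMKilled
kubectl get pods -n default

# 2. Raise the container's memory request and limit on the deployment
kubectl set resources deployment/my-app -c app \
  --requests=memory=256Mi --limits=memory=512Mi

# 3. Delete the OOMKilled pod (changing the spec already triggers a rollout;
#    deleting simply clears a pod stuck in CrashLoopBackOff faster)
kubectl delete pod my-app-oldpod -n default

# 4. Watch the replacement pod and confirm it stays Running
kubectl get pods -n default -w
kubectl get events -n default --sort-by='.lastTimestamp' | grep -i oom
```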
### Implement Comprehensive Resource Management
Best Practice Fix: Resource Requests, Limits, and Automation
For long-term stability and efficient resource utilization, a robust resource management strategy is essential. This includes defining appropriate resource requests and limits, leveraging tools like the Vertical Pod Autoscaler (VPA), and setting namespace-level quotas.
- **Analyze historical usage:** Understand the actual memory consumption patterns of your applications. `kubectl top pods` gives a point-in-time snapshot; for more granular historical data, consider integrating Prometheus and Grafana (an install sketch follows this list).

- **Update all deployments with appropriate requests and limits:** Based on your analysis, set `requests` to the minimum required memory for the application to start and run effectively, and `limits` to the absolute maximum it should ever consume. A common best practice is to set `requests.memory` lower than `limits.memory` to allow for burst capacity, but ensure `limits.memory` is still enforced.

    ```yaml
    # Example deployment.yaml with proper resources
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      selector:
        matchLabels:
          app: my-app
      replicas: 3
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: app
            image: myapp:1.0
            resources:
              requests:
                memory: "256Mi"  # Guaranteed minimum
                cpu: "100m"
              limits:
                memory: "512Mi"  # Hard ceiling, prevents OOMKilled
                cpu: "500m"
            livenessProbe:
              exec:
                command: ['/bin/sh', '-c', 'ps aux | wc -l']
              initialDelaySeconds: 30
              timeoutSeconds: 5
    ```

    Apply this configuration with `kubectl apply -f deployment.yaml`.

- **Install Prometheus + kube-state-metrics for comprehensive monitoring:** These tools provide metrics on pod and node resource usage, helping you identify trends and potential bottlenecks (see the sketch after this list).

- **Deploy Vertical Pod Autoscaler (VPA):** VPA can automatically recommend or apply optimal resource requests and limits for your pods based on historical usage patterns, reducing manual overhead and preventing OOM events (an example manifest follows this list).

- **Set Namespace ResourceQuotas:** Enforce memory and CPU constraints at the namespace level to prevent any single team or application from consuming excessive cluster resources (an example quota follows this list).
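One common way to get Prometheus, Grafana, and kube-state-metrics running is the community `kube-prometheus-stack` Helm chart; the release name and namespace below are arbitrary choices, not requirements:

```bash
# Add the community chart repo and install the bundled monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```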
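A sketch of a VPA object for the example deployment above, assuming the VPA components (recommender, updater, admission controller) are already installed in the cluster; the min/max bounds are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"    # "Off" = recommend only, never evict/resize pods
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        memory: 128Mi     # illustrative bounds, tune to your workload
      maxAllowed:
        memory: 1Gi
```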
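And a minimal ResourceQuota sketch for a namespace; the numbers are placeholders to adapt to your own capacity planning:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-quota
  namespace: default
spec:
  hard:
    requests.cpu: "4"        # sum of all CPU requests in the namespace
    requests.memory: 8Gi     # sum of all memory requests
    limits.cpu: "8"          # sum of all CPU limits
    limits.memory: 16Gi      # sum of all memory limits
```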
## 🧩 Technical Context (Visualized)
Kubernetes orchestrates containers managed by a Container Runtime Interface (CRI). When a container within a pod attempts to consume memory beyond its limits defined in the pod specification, the underlying Linux kernel's Out Of Memory (OOM) killer is invoked. This OOM killer intervenes by sending a SIGKILL (signal 9) to the container's process, forcefully terminating it. Kubernetes then registers this termination as Exit Code 137, signifying an OOMKilled event.
```mermaid
graph TD
    A[Pod Container Running] --> B{"Memory Usage > Resource Limit?"};
    B -- Yes --> C[Linux Kernel OOM Killer Activates];
    C --> D["Sends SIGKILL (Signal 9)"];
    D --> E[Container Process Terminated];
    E --> F["Kubernetes Reports Exit Code 137 (OOMKilled)"];
    B -- No --> A;
```
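To see the limit the kernel actually enforces, you can read the memory ceiling straight from the container's cgroup; `my-app-pod` is a placeholder, and the path depends on whether the node runs cgroup v2 or v1:

```bash
# cgroup v2 nodes: memory.max holds the enforced limit in bytes ("max" = unlimited)
kubectl exec my-app-pod -- cat /sys/fs/cgroup/memory.max

# cgroup v1 nodes: the equivalent value is memory.limit_in_bytes
kubectl exec my-app-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
```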
## ✅ Verification
After implementing solutions, use these commands to verify that your pods are running stably and no longer encountering OOM issues:
```bash
# Check specific pod status for OOMKilled or Exit Code 137
kubectl describe pod my-app-newpod -n default | grep -E 'OOMKilled|Exit Code'

# Review cluster events for OOMKilled warnings
kubectl get events --sort-by='.lastTimestamp' | grep OOM

# Monitor current pod resource usage
kubectl top pods -n default

# Monitor current node resource usage
kubectl top nodes

# Continuously observe pod states for any non-Running status
watch 'kubectl get pods -n default | grep -v Running'
```
## 📦 Prerequisites
To effectively diagnose and resolve Kubernetes Exit Code 137, ensure you have:
- `kubectl` 1.29+: For interacting with your Kubernetes cluster.
- Cluster-admin rights: Or equivalent permissions to modify deployments and view events/logs.
- `metrics-server` enabled: Required for `kubectl top` commands to function.
- Linux nodes with `dmesg` access: For direct kernel OOM logs, typically via SSH.
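A quick way to confirm these prerequisites before starting; `kube-system` is where `metrics-server` usually lives, though your cluster may differ:

```bash
# Client version
kubectl version --client

# Is metrics-server deployed and ready?
kubectl get deployment metrics-server -n kube-system

# Do I have the rights to modify deployments?
kubectl auth can-i update deployments --all-namespaces
```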