Fixing Docker Exit Code 137 When OOMKilled Is False
As SREs and platform engineers, we recognize exit code 137 as a familiar signal of container termination. When it appears with `OOMKilled: false`, however, it points to a more nuanced resource management issue that requires digging beyond the kernel's OOM killer reports. This scenario demands precise diagnostic steps to identify the true cause and restore service stability.
🚨 Symptoms & Diagnosis¶
When a Docker container unexpectedly exits with code 137, but its OOMKilled flag remains false, you'll typically observe these signatures in your logs and container status:
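For example (a representative sketch; `<container_id>` is a placeholder and the exact output wording varies by Docker version):
# List exited containers and their recorded exit codes
docker ps -a --filter 'status=exited' --format 'table {{.Names}}\t{{.Status}}'
# Inspect the state recorded for the affected container
docker inspect <container_id> --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'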
Or from the Docker daemon:
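A hedged way to pull the daemon-side view (assuming a journald-based host; log wording differs across Docker and containerd versions):
# Daemon logs around the incident window
journalctl -u docker --since "1 hour ago" --no-pager | grep -i '137'
# Container lifecycle events, including 'die' events and their exit codes (streams until interrupted)
docker events --since 1h --filter 'event=die'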
Kernel logs might indicate the signal:
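A sketch of the kernel-side check; if these entries are absent around the exit time, that itself suggests the kill came from outside the kernel's OOM killer:
# Kernel ring buffer with human-readable timestamps
dmesg -T | grep -iE 'killed process|out of memory|memory cgroup'
# Persistent kernel log on Debian/Ubuntu-style hosts
grep -iE 'killed process|oom' /var/log/kern.log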
Root Cause: Docker exit code 137 with `OOMKilled: false` indicates that a `SIGKILL` (signal 9) terminated the container. This typically stems from memory exhaustion that the kernel's OOM killer did not explicitly flag, strict cgroup memory limits being enforced, or external orchestration actions such as failed health checks or Kubernetes pod evictions under node-level pressure.
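A quick triage sketch for telling these causes apart before the deeper steps below (hedged; `<container_id>`, `<pod_name>`, and `<namespace>` are placeholders):
# 1. Did the kernel OOM killer act at all?
dmesg -T | grep -i 'out of memory'
# 2. Did Kubernetes evict the pod or restart it after failed probes?
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod_name>
# 3. Is a hard memory limit set on the container? (0 means unlimited)
docker inspect <container_id> --format '{{.HostConfig.Memory}}'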
🛠️ Solutions¶
Immediate Diagnosis: Check Kernel Logs for SIGKILL¶
Identify whether the `SIGKILL` was triggered by the OOM killer or an external signal by examining kernel logs directly. This is the fastest way to understand the immediate context of the termination.
- SSH into the Docker host node.
- Check kernel logs for OOM killer activity and 'Killed process' entries with memory details.
- Cross-reference the `SIGKILL` event with the container's exit time.
# Check kernel buffer for recent 'killed process' entries
dmesg | grep -i 'killed process'
# Check Docker daemon logs for exit code 137 around the incident time
journalctl -u docker --no-pager | grep -i 'exit code 137'
# Check system logs for OOM-kill events
grep -i 'oom-kill' /var/log/syslog
grep -i 'killed process' /var/log/kern.log
!!! tip "Immediate Mitigation: Increase Memory Limits"¶
Immediately increase container memory allocation to prevent recurrence while a full root cause analysis is underway. This provides temporary relief and buys time.
- Stop the affected container.
- Update your `docker-compose.yml` or Kubernetes manifest with increased memory limits (an in-place stopgap is sketched after this list).
- Restart the container.
- Monitor memory usage for 24-48 hours to assess stability.
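If you cannot redeploy immediately, one stopgap (a sketch only; `2g` is an arbitrary example value, not a sizing recommendation) is to raise the limit on the running container with `docker update`:
# Temporarily raise the memory ceiling on an existing container
docker update --memory 2g --memory-swap 2g <container_id>
# Confirm the new limit and watch usage settle
docker inspect <container_id> --format '{{.HostConfig.Memory}}'
docker stats <container_id> --no-stream
Note that `docker update` changes only the live container; still update your `docker-compose.yml` or manifest so the new limit survives a recreate.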
Diagnostic: Inspect Container State and Events¶
Extract detailed container metadata and Kubernetes events to correlate exit code 137 with system conditions. This provides crucial context for further investigation.
- For Docker: run `docker inspect` on the exited container (before it's cleaned up).
- For Kubernetes: check `kubectl get events` and `kubectl describe pod`.
- Review the `OOMKilled` flag, `ExitCode`, and `FinishedAt` timestamp.
- Cross-reference with node resource metrics (`kubectl top nodes`); a combined command sketch follows below.
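A combined sketch of these checks (placeholders `<container_id>`, `<pod_name>`, and `<namespace>` assumed):
# Docker: full exit state of the stopped container
docker inspect <container_id> --format '{{json .State}}' | jq '{ExitCode, OOMKilled, StartedAt, FinishedAt}'
# Kubernetes: last state, recent events, and node pressure at the time of the exit
kubectl describe pod <pod_name> -n <namespace> | grep -A 8 'Last State'
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl top nodes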
Root Cause Analysis: Memory Leak Detection¶
Identify memory leaks in your application code that cause gradual memory exhaustion and eventual SIGKILL, even if the kernel's OOM killer doesn't explicitly flag it.
- Enable container memory metrics collection.
- Monitor memory growth over time using `docker stats` or Kubernetes metrics.
- Capture memory usage trends leading up to container exits.
- Analyze application logs for memory-related warnings.
- If suspecting a code-level leak, profile the application with memory debugging tools.
# Real-time memory monitoring (current usage)
docker stats <container_id> --no-stream
# Check application logs for memory warnings
docker logs <container_id> | grep -i 'memory\|heap\|allocation'
# Monitor memory growth over time (repeatedly)
watch -n 5 'docker stats --no-stream | grep <container_id>'
!!! success "Best Practice Fix: Set Appropriate Resource Limits and Requests"¶
Configure memory limits based on actual application requirements to prevent both explicit OOMs and unexpected SIGKILL events from cgroup enforcement or eviction policies.
- Baseline application memory usage under normal and peak load conditions (a sampling sketch follows after this list).
- Add 20-30% headroom for unexpected spikes.
- Set memory requests equal to your baselined normal usage.
- Set memory limits to the baseline plus the calculated headroom.
- Implement comprehensive memory monitoring and alerting.
- Thoroughly test under load before production deployment.
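One lightweight way to capture a baseline (a sketch; the 30-second interval, one-hour duration, and `mem-baseline.log` file are arbitrary choices):
# Sample the container's memory usage every 30 seconds for about an hour
for i in $(seq 1 120); do
  date >> mem-baseline.log
  docker stats <container_id> --no-stream --format '{{.MemUsage}}' >> mem-baseline.log
  sleep 30
done
Run the same loop during a load test to capture the peak; the gap between normal and peak usage drives the headroom you add to the limit.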
# Kubernetes - Production-grade configuration example
apiVersion: v1
kind: Pod
metadata:
  name: mongodb
spec:
  containers:
  - name: mongodb
    image: mongo:4.4
    resources:
      requests:
        memory: "512Mi"   # Baseline required memory
        cpu: "250m"
      limits:
        memory: "1Gi"     # Baseline + headroom (e.g., 512Mi + 500Mi)
        cpu: "500m"
    livenessProbe:        # Ensure application is responsive
      exec:
        command:
        - /bin/sh
        - -c
        - mongo --eval 'db.adminCommand("ping")'
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:       # Ensure application is ready to serve traffic
      exec:
        command:
        - /bin/sh
        - -c
        - mongo --eval 'db.adminCommand("ping")'
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 1
Advanced: Trace SIGKILL Signal Origin¶
When `OOMKilled: false` persists, use system tracing tools to pinpoint which process or kernel subsystem initiated the `SIGKILL`. This is a deep dive for persistent, elusive issues.
- Enable process accounting on the Docker host.
- Use `strace` to monitor signal delivery to the container's main process.
- Check cgroup memory event files for direct memory limit violations.
- Review the `systemd` journal for service-level kills.
- Analyze container runtime logs for deeper insights.
# Enable process accounting (if not already active)
sudo apt-get install acct
sudo systemctl start acct
# Trace signals to container's main PID (requires container_pid)
# Find PID: docker inspect -f '{{.State.Pid}}' <container_id>
sudo strace -p <container_pid> -e signal
# Check cgroup memory event counters for limit violations
# (path depends on cgroup version and driver; cgroup v2 with the systemd driver shown, full container ID required)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.events
# cgroup v1 equivalent (counts how often the limit was hit)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.failcnt
# Check systemd journal for SIGKILL events related to docker
journalctl -u docker --no-pager | grep -i 'sigkill\|signal 9'
# Monitor overall cgroup memory pressure
watch -n 1 'cat /proc/pressure/memory'
Monitoring & Alerting: Prevent Future Occurrences¶
Implement proactive monitoring and alerting to detect memory pressure before it escalates to an unexpected SIGKILL.
- Set up memory usage alerts at 70-80% of configured limits.
- Configure Kubernetes pod eviction thresholds to manage node resource pressure gracefully.
- Implement container restart policies (`unless-stopped`, `on-failure`).
- Integrate graceful shutdown handlers within your applications to save state.
- Set up centralized log aggregation for exit code 137 events.
services:
  mongodb:
    image: mongo:4.4
    restart: unless-stopped   # Ensure container restarts unless explicitly stopped
    mem_limit: 2g
    healthcheck:              # Ensure application is healthy within container
      test: ["CMD", "mongo", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s       # Give application time to start
# Example: Resource quotas for a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: your-namespace
spec:
  hard:
    requests.memory: "10Gi"
    limits.memory: "20Gi"
---
# Example: Default limits for containers in a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-limits
  namespace: your-namespace
spec:
  limits:
  - default:
      memory: "512Mi"
    defaultRequest:
      memory: "256Mi"
    max:
      memory: "2Gi"
    min:
      memory: "128Mi"
    type: Container
🧩 Technical Context (Visualized)¶
Exit code 137 is the direct result of a `SIGKILL` (signal 9) termination (128 + 9): the process was forcibly stopped and had no chance to handle the signal. When `OOMKilled: false` accompanies it, the termination wasn't flagged by the kernel's OOM killer; instead it came from another mechanism, such as strict cgroup memory limit enforcement, or from an external actor like the container runtime or Kubernetes reacting to node pressure or failed liveness probes.
graph TD
    A[Container Process Running] --> B{Application Memory Usage Increases};
    B --> C{Resource Limits Exceeded?};
    C -- Yes --> D{Is it the Kernel OOM Killer?};
    D -- Yes --> E["Container Terminated by OOM Killer (OOMKilled: true)"];
    D -- "No (e.g., cgroup limit, external signal)" --> F["SIGKILL (Signal 9) Sent to the Container Process"];
    F --> G{Container Exits with Code 137};
    G --> H["OOMKilled Flag: false"];
    style D fill:#f9f,stroke:#333,stroke-width:2px;
    style H fill:#ffcc00,stroke:#333,stroke-width:2px;
✅ Verification¶
After implementing solutions, verify the fix using these commands:
# Verify memory limits are applied correctly
docker inspect <container_id> | jq '.[0].HostConfig | {Memory, MemorySwap, MemoryReservation}'
# Confirm container is running without exit code 137
docker ps | grep <container_id>
# Check for recent SIGKILL events in kernel logs (should be clear)
dmesg | tail -20 | grep -i 'killed'
# For Kubernetes: Verify pod is running and healthy
kubectl get pod <pod_name> -n <namespace> -o wide
kubectl describe pod <pod_name> -n <namespace> | grep -A 5 'State:'
# Monitor memory usage over time (minimum 1 hour to detect leaks or spikes)
docker stats <container_id> --no-stream
kubectl top pod <pod_name> -n <namespace>
# Confirm no OOM events in last 24 hours (specific to Kubernetes)
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i 'oom\|memory'
📦 Prerequisites¶
To effectively troubleshoot and resolve this issue, you'll need:
- Docker Engine 18.09+ or Kubernetes 1.14+
- SSH access to the Docker host node
- Root or `sudo` privileges for kernel log inspection
- `kubectl` CLI configured for Kubernetes environments
- `jq` for JSON parsing (highly recommended for `docker inspect` and `kubectl` outputs)
- `dmesg` and `journalctl` utilities available on the Linux host
- Memory monitoring tools (e.g., `docker stats`, `kubectl top`, Prometheus/Grafana)