Fixing Docker Exit Code 137 When OOMKilled Is False
As SREs and platform engineers, we recognize exit code 137 as a familiar signal of container termination. When it appears with `OOMKilled: false`, however, it points to a more nuanced resource management issue that requires digging beyond the kernel's OOM killer reports. This scenario demands precise diagnostic steps to identify the true cause and restore service stability.
🚨 Symptoms & Diagnosis¶
When a Docker container unexpectedly exits with code 137, but its OOMKilled flag remains false, you'll typically observe these signatures in your logs and container status:
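For example (a representative sketch; `<container_id>` is a placeholder and the exact output wording varies by Docker version):
# List exited containers and their recorded exit codes
docker ps -a --filter 'status=exited' --format 'table {{.Names}}\t{{.Status}}'
# Inspect the state recorded for the affected container
docker inspect <container_id> --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'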
Or from the Docker daemon:
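A hedged way to pull the daemon-side view (assuming a journald-based host; log wording differs across Docker and containerd versions):
# Daemon logs around the incident window
journalctl -u docker --since "1 hour ago" --no-pager | grep -i '137'
# Container lifecycle events, including 'die' events and their exit codes (streams until interrupted)
docker events --since 1h --filter 'event=die'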
Kernel logs might indicate the signal:
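A sketch of the kernel-side check; if these entries are absent around the exit time, that itself suggests the kill came from outside the kernel's OOM killer:
# Kernel ring buffer with human-readable timestamps
dmesg -T | grep -iE 'killed process|out of memory|memory cgroup'
# Persistent kernel log on Debian/Ubuntu-style hosts
grep -iE 'killed process|oom' /var/log/kern.log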
Root Cause: Docker exit code 137 with `OOMKilled: false` indicates that a `SIGKILL` (signal 9) terminated the container. This typically stems from memory exhaustion that the kernel's OOM killer did not explicitly flag, strict cgroup memory limits being enforced, or external orchestration actions such as failed health checks or Kubernetes pod evictions under node-level pressure.
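A quick triage sketch for telling these causes apart before the deeper steps below (hedged; `<container_id>`, `<pod_name>`, and `<namespace>` are placeholders):
# 1. Did the kernel OOM killer act at all?
dmesg -T | grep -i 'out of memory'
# 2. Did Kubernetes evict the pod or restart it after failed probes?
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod_name>
# 3. Is a hard memory limit set on the container? (0 means unlimited)
docker inspect <container_id> --format '{{.HostConfig.Memory}}'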
🛠️ Solutions¶
Immediate Diagnosis: Check Kernel Logs for SIGKILL¶
Identify whether the `SIGKILL` was triggered by the OOM killer or an external signal by examining kernel logs directly. This is the fastest way to understand the immediate context of the termination.
- SSH into the Docker host node.
- Check kernel logs for OOM killer activity and 'Killed process' entries with memory details.
- Cross-reference the `SIGKILL` event with the container's exit time.
# Check kernel buffer for recent 'killed process' entries
dmesg | grep -i 'killed process'
# Check Docker daemon logs for exit code 137 around the incident time
journalctl -u docker --no-pager | grep -i 'exit code 137'
# Check system logs for OOM-kill events
grep -i 'oom-kill' /var/log/syslog
grep -i 'killed process' /var/log/kern.log
!!! tip "Immediate Mitigation: Increase Memory Limits"¶
Immediately increase container memory allocation to prevent recurrence while a full root cause analysis is underway. This provides temporary relief and buys time.
- Stop the affected container.
- Update your `docker-compose.yml` or Kubernetes manifest with increased memory limits (an in-place stopgap is sketched after this list).
- Restart the container.
- Monitor memory usage for 24-48 hours to assess stability.
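If you cannot redeploy immediately, one stopgap (a sketch only; `2g` is an arbitrary example value, not a sizing recommendation) is to raise the limit on the running container with `docker update`:
# Temporarily raise the memory ceiling on an existing container
docker update --memory 2g --memory-swap 2g <container_id>
# Confirm the new limit and watch usage settle
docker inspect <container_id> --format '{{.HostConfig.Memory}}'
docker stats <container_id> --no-stream
Note that `docker update` changes only the live container; still update your `docker-compose.yml` or manifest so the new limit survives a recreate.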
Diagnostic: Inspect Container State and Events¶
Extract detailed container metadata and Kubernetes events to correlate exit code 137 with system conditions. This provides crucial context for further investigation.
- For Docker: run `docker inspect` on the exited container (before it's cleaned up).
- For Kubernetes: check `kubectl get events` and `kubectl describe pod`.
- Review the `OOMKilled` flag, `ExitCode`, and `FinishedAt` timestamp.
- Cross-reference with node resource metrics (`kubectl top nodes`); a combined command sketch follows below.
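A combined sketch of these checks (placeholders `<container_id>`, `<pod_name>`, and `<namespace>` assumed):
# Docker: full exit state of the stopped container
docker inspect <container_id> --format '{{json .State}}' | jq '{ExitCode, OOMKilled, StartedAt, FinishedAt}'
# Kubernetes: last state, recent events, and node pressure at the time of the exit
kubectl describe pod <pod_name> -n <namespace> | grep -A 8 'Last State'
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl top nodes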
Root Cause Analysis: Memory Leak Detection¶
Identify memory leaks in your application code that cause gradual memory exhaustion and eventual SIGKILL, even if the kernel's OOM killer doesn't explicitly flag it.
- Enable container memory metrics collection.
- Monitor memory growth over time using `docker stats` or Kubernetes metrics.
- Capture memory usage trends leading up to container exits.
- Analyze application logs for memory-related warnings.
- If suspecting a code-level leak, profile the application with memory debugging tools.
# Real-time memory monitoring (current usage)
docker stats <container_id> --no-stream
# Check application logs for memory warnings
docker logs <container_id> | grep -i 'memory\|heap\|allocation'
# Monitor memory growth over time (repeatedly)
watch -n 5 'docker stats --no-stream | grep <container_id>'
!!! success "Best Practice Fix: Set Appropriate Resource Limits and Requests"¶
Configure memory limits based on actual application requirements to prevent both explicit OOMs and unexpected SIGKILL events from cgroup enforcement or eviction policies.
- Baseline application memory usage under normal and peak load conditions (a sampling sketch follows after this list).
- Add 20-30% headroom for unexpected spikes.
- Set memory requests equal to your baselined normal usage.
- Set memory limits to the baseline plus the calculated headroom.
- Implement comprehensive memory monitoring and alerting.
- Thoroughly test under load before production deployment.
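One lightweight way to capture a baseline (a sketch; the 30-second interval, one-hour duration, and `mem-baseline.log` file are arbitrary choices):
# Sample the container's memory usage every 30 seconds for about an hour
for i in $(seq 1 120); do
  date >> mem-baseline.log
  docker stats <container_id> --no-stream --format '{{.MemUsage}}' >> mem-baseline.log
  sleep 30
done
Run the same loop during a load test to capture the peak; the gap between normal and peak usage drives the headroom you add to the limit.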
# Kubernetes - Production-grade configuration example
apiVersion: v1
kind: Pod
metadata:
  name: mongodb
spec:
  containers:
  - name: mongodb
    image: mongo:4.4
    resources:
      requests:
        memory: "512Mi"   # Baseline required memory
        cpu: "250m"
      limits:
        memory: "1Gi"     # Baseline + headroom (e.g., 512Mi + 500Mi)
        cpu: "500m"
    livenessProbe:        # Ensure application is responsive
      exec:
        command:
        - /bin/sh
        - -c
        - mongo --eval 'db.adminCommand("ping")'
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:       # Ensure application is ready to serve traffic
      exec:
        command:
        - /bin/sh
        - -c
        - mongo --eval 'db.adminCommand("ping")'
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 1
Advanced: Trace SIGKILL Signal Origin¶
When `OOMKilled: false` persists, use system tracing tools to pinpoint which process or kernel subsystem initiated the `SIGKILL`. This is a deep dive for persistent, elusive issues.
- Enable process accounting on the Docker host.
- Use `strace` to monitor signal delivery to the container's main process.
- Check cgroup memory event files for direct memory limit violations.
- Review the `systemd` journal for service-level kills.
- Analyze container runtime logs for deeper insights.
# Enable process accounting (if not already active)
sudo apt-get install acct
sudo systemctl start acct
# Trace signals to container's main PID (requires container_pid)
# Find PID: docker inspect -f '{{.State.Pid}}' <container_id>
sudo strace -p <container_pid> -e signal
# Check cgroup memory event counters for limit violations
# (path depends on cgroup version and driver; cgroup v2 with the systemd driver shown, full container ID required)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.events
# cgroup v1 equivalent (counts how often the limit was hit)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.failcnt
# Check systemd journal for SIGKILL events related to docker
journalctl -u docker --no-pager | grep -i 'sigkill\|signal 9'
# Monitor overall cgroup memory pressure
watch -n 1 'cat /proc/pressure/memory'
Monitoring & Alerting: Prevent Future Occurrences¶
Implement proactive monitoring and alerting to detect memory pressure before it escalates to an unexpected SIGKILL.
- Set up memory usage alerts at 70-80% of configured limits.
- Configure Kubernetes pod eviction thresholds to manage node resource pressure gracefully.
- Implement container restart policies (`unless-stopped`, `on-failure`).
- Integrate graceful shutdown handlers within your applications to save state.
- Set up centralized log aggregation for exit code 137 events.
services:
  mongodb:
    image: mongo:4.4
    restart: unless-stopped   # Ensure container restarts unless explicitly stopped
    mem_limit: 2g
    healthcheck:              # Ensure application is healthy within container
      test: ["CMD", "mongo", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s       # Give application time to start
# Example: Resource quotas for a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: your-namespace
spec:
  hard:
    requests.memory: "10Gi"
    limits.memory: "20Gi"
---
# Example: Default limits for containers in a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-limits
  namespace: your-namespace
spec:
  limits:
  - default:
      memory: "512Mi"
    defaultRequest:
      memory: "256Mi"
    max:
      memory: "2Gi"
    min:
      memory: "128Mi"
    type: Container
🧩 Technical Context (Visualized)¶
Exit code 137 is the direct result of a `SIGKILL` (signal 9) termination (128 + 9): the process was forcibly stopped and had no chance to handle the signal. When `OOMKilled: false` accompanies it, the termination wasn't flagged by the kernel's OOM killer; instead it came from another mechanism, such as strict cgroup memory limit enforcement, or from an external actor like the container runtime or Kubernetes reacting to node pressure or failed liveness probes.
graph TD
    A[Container Process Running] --> B{Application Memory Usage Increases};
    B --> C{Resource Limits Exceeded?};
    C -- Yes --> D{Is it the Kernel OOM Killer?};
    D -- Yes --> E["Container Terminated by OOM Killer (OOMKilled: true)"];
    D -- "No (e.g., cgroup limit, external signal)" --> F["SIGKILL (Signal 9) Sent to the Container Process"];
    F --> G{Container Exits with Code 137};
    G --> H["OOMKilled Flag: false"];
    style D fill:#f9f,stroke:#333,stroke-width:2px;
    style H fill:#ffcc00,stroke:#333,stroke-width:2px;
✅ Verification¶
After implementing solutions, verify the fix using these commands:
# Verify memory limits are applied correctly
docker inspect <container_id> | jq '.[0].HostConfig | {Memory, MemorySwap, MemoryReservation}'
# Confirm container is running without exit code 137
docker ps | grep <container_id>
# Check for recent SIGKILL events in kernel logs (should be clear)
dmesg | tail -20 | grep -i 'killed'
# For Kubernetes: Verify pod is running and healthy
kubectl get pod <pod_name> -n <namespace> -o wide
kubectl describe pod <pod_name> -n <namespace> | grep -A 5 'State:'
# Monitor memory usage over time (minimum 1 hour to detect leaks or spikes)
docker stats <container_id> --no-stream
kubectl top pod <pod_name> -n <namespace>
# Confirm no OOM events in last 24 hours (specific to Kubernetes)
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i 'oom\|memory'
📦 Prerequisites¶
To effectively troubleshoot and resolve this issue, you'll need:
- Docker Engine 18.09+ or Kubernetes 1.14+
- SSH access to the Docker host node
- Root or `sudo` privileges for kernel log inspection
- `kubectl` CLI configured for Kubernetes environments
- `jq` for JSON parsing (highly recommended for `docker inspect` and `kubectl` outputs)
- `dmesg` and `journalctl` utilities available on the Linux host
- Memory monitoring tools (e.g., `docker stats`, `kubectl top`, Prometheus/Grafana)