Fixing Docker Container Exit Code 137 on Oracle Linux 9: Docker Compose Restart Loop
🚨 Symptoms & Diagnosis¶
Encountering Exit Code 137 when managing Docker containers via docker-compose on Oracle Linux 9 typically signals critical resource exhaustion, predominantly memory-related. This frequently manifests as containers entering a persistent restart loop, severely impacting application availability and overall system stability. Identifying the exact trigger is paramount for effective remediation.
Common error signatures include containers repeatedly showing `Exited (137)` in `docker ps -a` output and abrupt `Killed` messages in container logs. Direct inspection of container state confirms the exit code and, often, the underlying cause:
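The container state can be pulled with `docker inspect` (a sketch; `myapp` is a placeholder container name, substitute your own):

```shell
# Dump only the State object for the suspect container.
CONTAINER=myapp   # placeholder name
docker inspect "$CONTAINER" --format='{{json .State}}'
```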
Expected output showing the `OOMKilled` flag:

```json
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": true,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2023-10-27T10:00:05.123Z",
    "FinishedAt": "2023-10-27T10:00:10.987Z"
},
```
Additional diagnostic indicators may include:
- `docker inspect <container>`: `"ExitCode": 137`
- `docker logs <container>`: `Killed` (often without an explicit Out-of-Memory message in application logs)
- `journalctl -u docker`: `Out of memory: Kill process`
Root Cause: Exit code 137 is fundamentally triggered by the Linux kernel's Out-of-Memory (OOM) killer. It issues a SIGKILL (signal 9) to terminate processes, including Docker containers, when memory demand exceeds either the container's explicit memory limit or the Docker host's available memory. This often leads to persistent restart loops in `docker-compose` environments, indicative of critical resource contention.
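The mechanism is easy to reproduce in isolation: a container that exceeds its memory limit is killed with SIGKILL and reports exit code 137. A sketch (the `alpine` image, the `oomdemo` name, and the 64 MB limit are arbitrary choices for the demo):

```shell
# tail buffers all of stdin in memory, so it quickly blows past the 64 MB limit
# and gets terminated by the OOM killer.
docker run --name oomdemo --memory=64m --memory-swap=64m alpine \
  sh -c 'head -c 256m /dev/zero | tail'
docker inspect oomdemo --format='ExitCode: {{.State.ExitCode}}, OOMKilled: {{.State.OOMKilled}}'
docker rm oomdemo
```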
🛠️ Solutions¶
Immediate Diagnosis & Quick Fix¶
Immediate Mitigation: Temporarily Increase Memory Limits
To stabilize your environment and mitigate ongoing service disruptions, rapidly diagnose the OOMKilled status and temporarily adjust memory allocations. This provides operational breathing room while you implement a more robust, permanent solution.
- Verify `OOMKilled` Status and Exit Code: Confirm whether the container was explicitly killed by the OOM killer.
- Monitor Real-time Container Memory Usage: Assess the container's current and historical memory footprint.
- Check Docker Host System Memory Pressure: Determine whether the underlying host system is experiencing memory exhaustion.
- Review `docker-compose.yml` for Resource Allocation: Examine the `docker-compose.yml` file for existing `mem_limit` and `memswap_limit` settings under the affected service.
- Temporarily Increase Memory Limits: Adjust `mem_limit` and `memswap_limit` in your `docker-compose.yml` to provide additional memory. Exercise caution not to over-allocate, which could starve other services or the host system.
- Restart Docker Compose Services: Apply the updated resource configurations and restart your services.
- Monitor Post-Restart Stability: Observe container memory usage and status to confirm stability.
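The sequence above can be sketched as follows (assumes Compose v2 and a service/container named `app`; both are placeholders):

```shell
# 1-3: diagnose the container and the host.
docker inspect app --format='OOMKilled: {{.State.OOMKilled}}, ExitCode: {{.State.ExitCode}}'
docker stats --no-stream app                # point-in-time memory usage vs. limit
free -h                                     # host-level memory pressure
# 4: check currently configured limits.
grep -A1 'mem_limit' docker-compose.yml     # shows mem_limit/memswap_limit, if set
# 6-7: apply the raised limits and confirm the service stays up.
docker compose up -d --force-recreate app
docker compose ps
```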
Permanent Fix: Optimize Memory Allocation & Application¶
Best Practice Fix: Resource Optimization and Application Tuning
For sustained stability and optimal performance, a thorough analysis of application memory consumption, paired with precise resource allocation and resilient health checks, is critical. This approach targets the root cause of Exit Code 137.
- Profile Application Memory Usage: Utilize language-specific profiling tools (e.g., `memory_profiler` for Python) to understand your application's memory footprint under various load conditions.
- Identify and Resolve Memory Leaks: Memory leaks are a common source of gradual memory exhaustion. Debug your application code to pinpoint and eliminate these issues.
- Set Appropriate Memory Requests and Limits: Based on profiling data, configure `mem_limit` (hard ceiling) and `mem_reservation` (soft limit for scheduling) in `docker-compose.yml` to reflect actual application requirements.
- Configure Robust Health Checks: Implement `healthcheck` configurations with adequate `start_period`, `interval`, `timeout`, and `retries`. This prevents Docker from prematurely terminating a container that is still initializing or temporarily unresponsive.
- Implement Resource Quotas (for multi-service environments): In more complex deployments, consider resource quotas across services to prevent any single service from monopolizing host memory resources.
- Conduct Load Testing: Thoroughly test your application under simulated production load to validate memory allocations and identify potential bottlenecks before production deployment.
- Proactive Monitoring with Docker Stats and Kernel Logs: Continuously monitor container metrics and review kernel OOM events for any signs of memory pressure.
- Establish Alerting: Configure monitoring and alerting for memory usage exceeding predefined thresholds (e.g., 70-80% of `mem_limit`) and for host-level OOM killer activations.

Example `docker-compose.yml` with optimized resource configuration and health checks:

```yaml
version: '3.8'
services:
  app:
    image: myapp:latest
    mem_limit: 1g           # Hard memory limit (RAM)
    memswap_limit: 1g       # Total memory (RAM + swap) limit
    mem_reservation: 512m   # Soft reservation; Docker attempts to keep usage below this
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]  # Example health endpoint
      interval: 30s         # Check every 30 seconds
      timeout: 10s          # Allow 10 seconds for the check to complete
      retries: 3            # Three consecutive failures to be considered unhealthy
      start_period: 40s     # Initial delay before health checks begin
    restart: unless-stopped # Define proper restart behavior
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```

Verify Memory Limits Enforcement:

Test Restart Behavior Post-Configuration:
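A sketch of these two checks, again assuming a service/container named `app`:

```shell
# Confirm the limits Docker actually applied (values reported in bytes; 1g = 1073741824).
docker inspect app --format='Memory: {{.HostConfig.Memory}}, MemorySwap: {{.HostConfig.MemorySwap}}, Reservation: {{.HostConfig.MemoryReservation}}'
# Recreate the service, then watch restart/health behavior settle.
docker compose up -d --force-recreate app
docker compose ps app
```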
Advanced: Horizontal Scaling & Resource Quotas¶
Best Practice Fix: Distributed Resources and Scalability
For high-availability, high-traffic, or performance-sensitive applications, scaling services horizontally and implementing orchestration-level resource quotas offer robust protection against OOM conditions and single points of failure.
- Implement Docker Compose Scaling (via Docker Swarm) or Kubernetes Deployment: Distribute the application load across multiple container instances. While `docker-compose` itself primarily manages a single instance for `deploy`, Docker Swarm mode (`docker stack deploy`) or a dedicated Kubernetes cluster are the standard approaches for true scaling.
- Configure Load Balancer: Utilize an external load balancer (e.g., Nginx, HAProxy) or an ingress controller (in Kubernetes) to efficiently distribute incoming traffic among the scaled application replicas.
- Set Resource Quotas Per Service/Namespace: In orchestrated environments, define CPU and memory quotas at the service or namespace level to ensure fair resource sharing and prevent any single service from exhausting host resources.

Example `docker-compose.yml` (for Docker Swarm deployment with a `deploy` section):

```yaml
version: '3.8'
services:
  app:
    image: myapp:latest
    deploy:
      replicas: 3            # Scale to 3 instances
      resources:
        limits:
          cpus: '0.5'        # Each replica limited to 0.5 CPU cores
          memory: 512M       # Each replica limited to 512MB RAM
        reservations:
          cpus: '0.25'       # Each replica reserves 0.25 CPU cores
          memory: 256M       # Each replica reserves 256MB RAM
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
  nginx:                     # Example load balancer
    image: nginx:latest
    ports:
      - "80:80"
    depends_on:
      - app
```
- Enable Autoscaling: For dynamically varying workloads, configure autoscaling mechanisms (e.g., the Kubernetes Horizontal Pod Autoscaler) based on memory utilization metrics.
- Implement Centralized Logging: Aggregate logs from all container instances into a centralized system for comprehensive memory tracking, performance analysis, and anomaly detection.
- Configure Advanced Alerting Thresholds: Set up proactive alerts for memory usage exceeding 70-80% of allocated limits across all service replicas to enable timely intervention.
Monitor Memory Across All Containers:
Check HostConfig for Applied Resource Limits:
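The two checks above can be sketched as follows (`app` is a placeholder container name):

```shell
# Memory usage across all running containers at a glance.
docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
# Limits actually applied to one replica (Memory in bytes, CPUs in NanoCpus,
# where 0.5 CPU = 500000000).
docker inspect app --format='Memory: {{.HostConfig.Memory}}, NanoCpus: {{.HostConfig.NanoCpus}}'
```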
Debugging: Log Parsing & Signal Analysis¶
Best Practice Fix: Deep Dive into Diagnostics
If standard memory-related solutions do not resolve the issue, or if OOMKilled is false despite Exit Code 137, a detailed forensic analysis of Docker daemon logs, kernel events, and application logs is indispensable to identify non-OOM SIGKILL triggers.
- Capture Docker Daemon Logs: The Docker daemon logs provide critical insights into container lifecycle events, including termination reasons.
- Parse Kernel OOM Killer Events: The kernel logs (`dmesg` or `journalctl` with kernel filters) are the authoritative source for OOM killer activations.
- Check for Health Check Failures: A misconfigured or consistently failing health check can lead Docker or an orchestrator to send a `SIGKILL`, resulting in Exit Code 137 without an explicit OOM event.
- Analyze Application Logs for Internal Crashes: Application-level errors, unhandled exceptions, or segmentation faults (`SIGSEGV`) can also cause process termination, potentially triggering a `SIGKILL` by Docker's supervisor, which would manifest as Exit Code 137.
- Correlate Timestamps Across Log Sources: Match timestamps from `docker logs`, `journalctl -u docker`, and `dmesg` to reconstruct the precise sequence of events leading to the container's termination.
- Verify SIGKILL vs. OOMKilled Distinction: Use `docker inspect` to explicitly differentiate between an `OOMKilled` event and other `SIGKILL` signals. If `OOMKilled` is `false` but `ExitCode` is `137`, another process (e.g., an orchestrator, or another part of Docker's supervision) sent the `SIGKILL`.

```shell
docker inspect <container_name> --format='OOMKilled: {{.State.OOMKilled}}, ExitCode: {{.State.ExitCode}}, Error: {{.State.Error}}'
```

Real-time System Monitoring:
Capture Memory Pressure Events (if Pressure Stall Information (PSI) is available on kernel):
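These diagnostics can be sketched as follows (`app` is a placeholder name; `/proc/pressure/memory` exists only when PSI is enabled, which on some kernels requires the `psi=1` boot parameter):

```shell
# Docker daemon lifecycle and OOM-related messages from the last hour.
journalctl -u docker --since "1 hour ago" --no-pager | grep -iE 'oom|kill|137'
# Kernel OOM killer activations with human-readable timestamps.
dmesg -T | grep -iE 'out of memory|oom-kill'
# Health-check history for a container (populated only if a healthcheck is configured).
docker inspect app --format='{{json .State.Health}}'
# Memory pressure stall information (PSI), if available.
cat /proc/pressure/memory
```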
🧩 Technical Context (Visualized)¶
Exit Code 137 precisely indicates that a container process was terminated by a SIGKILL signal (signal 9) from the operating system. In the context of Docker, especially with resource constraints, this signal is most frequently issued by the Linux kernel's Out-of-Memory (OOM) killer. The OOM killer intervenes when system or container memory limits are surpassed, strategically terminating processes to reclaim memory and prevent critical system instability. This often precipitates a docker-compose restart loop, as the container repeatedly attempts to initiate, exhausts its memory, gets killed, and then attempts to restart again.
```mermaid
graph TD
    A[Docker Container Application] --> B{Memory Consumption Increases};
    B -- Exceeds Docker Mem_Limit --> C["Container Runtime (Docker)"];
    B -- Exceeds Host RAM / Swap --> D[Linux Kernel OOM Killer];
    C -- "No available memory or swap, or container limit hit" --> D;
    D -- "Sends SIGKILL (Signal 9)" --> E[Container Process Terminated];
    E -- Reports --> F[Exit Code 137];
    F --> G[Docker Daemon];
    G -- "restart: unless-stopped" Policy --> H{docker-compose Initiates Restart};
    H -- Leads to --> I[Persistent Restart Loop];
    D -- Logs Events To --> J["Kernel Log (dmesg, journalctl)"];
    G -- Logs Events To --> K["Docker Daemon Log (journalctl -u docker)"];
```
✅ Verification¶
After implementing any of the proposed solutions, systematically verify the container's operational status and resource consumption to confirm the fix:
- Check Container State and `OOMKilled` Flag: Expected output should be `0` for `ExitCode` and `false` for `OOMKilled`, indicating a clean exit or a continuously running container.
- Monitor Live Container Resource Statistics: Observe memory usage to ensure it remains stable and well within the configured `mem_limit`.
- Assess Docker Host Memory Availability: Confirm that the Docker host system has sufficient free memory, reducing the likelihood of host-level OOM events.
- Validate Docker Compose Service Status: Verify that all your services are in the `Up` state and not repeatedly exiting or restarting.
- Review Recent Container Logs: Look for any `Killed` messages, application-level errors, or unusual termination patterns.
- Inspect Docker Daemon and Kernel Logs for OOM Events: Confirm the absence of new OOM-related messages or Exit Code 137 entries.
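Collected into one verification pass (a sketch; `app` is a placeholder service/container name):

```shell
docker inspect app --format='ExitCode: {{.State.ExitCode}}, OOMKilled: {{.State.OOMKilled}}'  # want 0 / false
docker stats --no-stream app                 # memory should sit well under mem_limit
free -h                                      # host headroom
docker compose ps                            # all services Up, no restart churn
docker logs --tail 100 app                   # no "Killed" messages or crash traces
journalctl -u docker --since "30 min ago" --no-pager | grep -i oom || echo "no new OOM events"
```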
📦 Prerequisites¶
To effectively apply and troubleshoot these solutions, ensure your environment meets the following prerequisites:
- Docker Engine: Version 20.10 or newer.
- Docker Compose: Version 2.0 or newer.
- Operating System: Oracle Linux 9 (or any compatible RHEL 9 distribution).
- Access Privileges: `sudo` or `root` access is required for inspecting kernel logs (`dmesg`, `journalctl`).
- Utilities: `curl` or `wget` might be necessary for implementing health checks within containers; `jq` is highly recommended for efficient JSON parsing of Docker inspection output.
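A quick environment check for these prerequisites (a sketch; exact version strings will vary):

```shell
docker --version            # want 20.10 or newer
docker compose version      # want v2.x (Compose plugin)
cat /etc/oracle-release 2>/dev/null || cat /etc/redhat-release   # OS identification
command -v jq curl          # optional but recommended helpers
```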