Fix Google Cloud Run 'Container Failed with Exit Code 137'
Encountering an Exit Code 137 on Google Cloud Run indicates a critical container failure, almost universally pointing to an Out Of Memory (OOM) event. This typically means your application consumed more memory than allocated, triggering the Linux OOM killer to terminate the process. As SREs, our goal is to restore stability and optimize resource utilization for predictable deployments.
🚨 Symptoms & Diagnosis
When your Cloud Run service experiences an OOM condition, you'll observe specific error signatures in your deployment logs and service status. Identifying these is the first step to a rapid resolution.
You might also see kernel messages if you have access to system logs (though less common in Cloud Run directly, it's the underlying mechanism):
```
dmesg: Out of memory: Killed process 12345 (your-app) total-vm:123456kB, anon-rss:78901kB
/var/log/system: memory limit exceeded
```
Application-specific logs can also hint at the issue, for example a Java `java.lang.OutOfMemoryError` or a Node.js `FATAL ERROR: ... JavaScript heap out of memory` message shortly before the container exits.
Root Cause: Your container exceeded its configured memory limit, causing the Linux OOM killer to send a `SIGKILL` (signal 9) to the primary process; the resulting exit code is `137` (128 + 9). This can stem from memory leaks, traffic spikes, or an undersized memory limit.
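The 128 + 9 arithmetic can be reproduced in any POSIX shell; the `sleep` process below simply stands in for the container's main process:

```shell
# Start a long-running background process, then kill it with SIGKILL (signal 9),
# mimicking what the OOM killer does to the container's main process.
sleep 60 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit code: $?"   # prints "exit code: 137" (128 + 9)
```

The same convention applies to any fatal signal: a process killed by signal N reports exit code 128 + N, which is why SIGKILL yields 137 and SIGTERM (15) yields 143.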
🛠️ Solutions
Addressing Exit Code 137 requires both immediate mitigation to restore service and long-term optimization to prevent recurrence.
Quick Fix: Increase Memory Limit
Immediate Mitigation: Increase Memory Limit
This strategy provides immediate relief by giving your container more breathing room. It's a critical first step to stabilize a crashing service, but should be followed by deeper analysis.
- Check Current Service Configuration: Before making changes, review your service's current memory allocation.
- Redeploy with Increased Memory: Increase the memory allocation. A common starting point is to double a low limit (e.g., from 256Mi to 512Mi, or 512Mi to 1Gi). Note: Replace `SERVICE_NAME` and `REGION` with your actual service details.
- Verify No Restarts: Monitor your Cloud Run service for stability and the absence of further `Exit Code 137` events.
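The first two steps above can be sketched with `gcloud`; `SERVICE_NAME` and `REGION` are placeholders, and the `--format` path assumes the Knative-style spec that `gcloud run services describe` returns:

```shell
# Inspect the service's current memory limit
gcloud run services describe SERVICE_NAME \
  --region REGION \
  --format='value(spec.template.spec.containers[0].resources.limits.memory)'

# Redeploy with a doubled memory limit (e.g., 512Mi -> 1Gi)
gcloud run services update SERVICE_NAME \
  --region REGION \
  --memory 1Gi
```

`gcloud run services update` creates a new revision with the changed limit, so existing traffic shifts over without a full redeploy of the image.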
Permanent Fix: Optimize Resources + Monitoring
Best Practice Fix: Optimize Resources and Implement Monitoring
For sustained reliability, it's essential to understand your application's actual memory footprint, optimize its resource usage, and proactively monitor for potential issues.
- Optimize Local Filesystem Writes: Cloud Run's writable container filesystem is an in-memory filesystem, so files written to it (including under `/tmp`) count toward your container's memory limit. Write logs to stdout/stderr, where Cloud Logging collects them without consuming container memory, delete temporary files promptly, and ensure your application respects the `TMPDIR` environment variable so temporary data lands in a location you control and clean up.
- Set Resource Requests/Limits Explicitly: When deploying, specify both memory and CPU limits to ensure predictable behavior and prevent over- or under-provisioning.
This example also sets `TEMP_DIR` as an environment variable directly during deployment:

```shell
gcloud run deploy SERVICE_NAME \
  --image gcr.io/PROJECT/IMAGE \
  --memory 2Gi \
  --cpu 2 \
  --min-instances 1 \
  --max-instances 10 \
  --region REGION \
  --set-env-vars TEMP_DIR=/var/log/tmp
```

- Enable Cloud Monitoring Alerts: Proactively monitor memory utilization. Set up alerts in Google Cloud Monitoring to notify you when memory usage approaches the configured limit.
  - Navigate to Cloud Monitoring -> Alerting -> Create Policy.
  - Select metric: `Cloud Run Revision: Container Memory Utilization`.
  - Set a threshold (e.g., 80% or 90% of your allocated memory).
- Deploy and Test Under Load: After implementing optimizations, deploy the service and subject it to realistic load tests to ensure stability and validate resource allocation.
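The `TMPDIR` behavior from the first step can be sanity-checked locally; tools like `mktemp` honor the variable, though whether your application does depends on its language runtime:

```shell
# Point TMPDIR at a scratch directory and confirm mktemp honors it
export TMPDIR=$(mktemp -d)
tmpfile=$(mktemp)

case "$tmpfile" in
  "$TMPDIR"/*) echo "temp file created under TMPDIR: $tmpfile" ;;
  *)           echo "TMPDIR was ignored: $tmpfile" ;;
esac

# Clean up promptly -- on Cloud Run, leftover temp files keep consuming memory
rm -rf "$TMPDIR"
```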
Log Diagnosis Workflow
Actionable Diagnosis: Confirm OOM via Logs
Before applying any fixes, confirm the OOM event by parsing your logs. This prevents misdiagnosis and ensures you're addressing the correct problem.
- Fetch Recent Logs: Use `gcloud logging` to filter for recent occurrences of `exit code: 137`.
- Grep for OOM Signals: Examine the fetched logs for explicit "OOMKilled" or "Out of memory" messages, which definitively point to an OOM event.
- Check Metrics in Cloud Monitoring: Correlate log events with memory utilization graphs in Cloud Monitoring to observe spikes leading up to the crash.
(Adjust `PROJECT_ID`, `SERVICE_NAME`, `--start-time`, and `--end-time` as needed.)

```shell
gcloud monitoring query \
  'fetch cloud_run_revision::cloud_run.googleapis.com/container/memory/utilization' \
  --metric-type='cloud_run.googleapis.com/container/memory/utilization' \
  --project PROJECT_ID \
  --filter='resource.labels.service_name="SERVICE_NAME"' \
  --start-time='2023-10-26T00:00:00Z' \
  --end-time='2023-10-26T01:00:00Z'
```
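The log-fetching step might look like the following; the filter string is an assumption, so adjust `textPayload` matching to the exact wording that appears in your logs:

```shell
# List recent Cloud Run log entries mentioning exit code 137
gcloud logging read \
  'resource.type="cloud_run_revision" AND textPayload:"137"' \
  --project PROJECT_ID \
  --limit 20 \
  --freshness 1d
```

Restricting `resource.type` to `cloud_run_revision` keeps the results scoped to container logs rather than audit or request logs.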
🧩 Technical Context (Visualized)
The Exit Code 137 on Google Cloud Run is fundamentally a Linux kernel event. When a container's process attempts to allocate memory beyond its configured limit, the operating system's Out-Of-Memory (OOM) killer steps in. This subsystem is designed to prevent the entire host machine from crashing due to a single runaway process. It prioritizes killing processes that are consuming excessive resources, specifically by sending a SIGKILL (signal 9), which cannot be caught or ignored by the application, leading to an immediate termination. The 137 exit code is a standard representation of a process terminated by signal 9 (128 + 9).
```mermaid
graph TD
    A[Cloud Run Container Process] --> B{Memory Usage Increases};
    B --> C{Memory Exceeds Configured Limit?};
    C -- Yes --> D[Linux OOM Killer Activated];
    D --> E["Sends SIGKILL (Signal 9)"];
    E --> F{Container Process Terminated};
    F --> G["Cloud Run Reports: Exit Code 137"];
    C -- No --> H[Process Continues Running];
```
✅ Verification
After implementing any of the solutions, it's crucial to verify the stability of your Cloud Run service and confirm the absence of the Exit Code 137.
- Check Service Conditions: Ensure the service is in a healthy state with no unexpected conditions. A clean output (or only `True` conditions) indicates health.
- Monitor Recent Logs for Errors: Search your recent logs for any further `Exit Code 137` or OOM messages. An empty result over recent logs is a good sign.
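Both checks can be scripted with `gcloud`; `SERVICE_NAME`, `REGION`, and `PROJECT_ID` are placeholders, and the conditions `--format` path assumes the Knative-style status block:

```shell
# 1. Inspect service conditions -- healthy services report status: "True"
gcloud run services describe SERVICE_NAME \
  --region REGION \
  --format='yaml(status.conditions)'

# 2. Confirm no new exit-137 events in the last hour (empty output is good)
gcloud logging read \
  'resource.type="cloud_run_revision" AND textPayload:"137"' \
  --project PROJECT_ID \
  --freshness 1h \
  --limit 10
```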
📦 Prerequisites
To effectively diagnose and resolve Exit Code 137 issues on Google Cloud Run, ensure you have the following:
- `gcloud` CLI: Version 450.0.0 or newer, with the `run` component installed.
- Cloud Run Admin Role: Sufficient IAM permissions to deploy and manage Cloud Run services.
- Docker: Version 24+ (if building local images or inspecting Dockerfiles).
- Cloud Logging and Monitoring APIs: Enabled in your Google Cloud Project to access detailed logs and metrics.