Fix Google Cloud Run 'Container Failed with Exit Code 137'
Encountering an Exit Code 137 on Google Cloud Run indicates a critical container failure, almost universally pointing to an Out Of Memory (OOM) event. This typically means your application consumed more memory than allocated, triggering the Linux OOM killer to terminate the process. As SREs, our goal is to restore stability and optimize resource utilization for predictable deployments.
🚨 Symptoms & Diagnosis
When your Cloud Run service experiences an OOM condition, you'll observe specific error signatures in your deployment logs and service status. Identifying these is the first step to a rapid resolution.
You might also see kernel messages if you have access to system logs (though less common in Cloud Run directly, it's the underlying mechanism):
```
dmesg: Out of memory: Killed process 12345 (your-app) total-vm:123456kB, anon-rss:78901kB
/var/log/system: memory limit exceeded
```
Application-specific logs can also hint at the issue, for example a Java `java.lang.OutOfMemoryError` or a Node.js `FATAL ERROR: ... JavaScript heap out of memory` message shortly before the container exits.
Root Cause: Your container exceeded its configured memory limit, causing the Linux OOM killer to send a `SIGKILL` (signal 9) to the primary process; the resulting exit code is `137` (128 + 9). This can stem from memory leaks, traffic spikes, or an undersized memory limit.
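The 128 + 9 arithmetic can be reproduced in any POSIX shell; the `sleep` process below simply stands in for the container's main process:

```shell
# Start a long-running background process, then kill it with SIGKILL (signal 9),
# mimicking what the OOM killer does to the container's main process.
sleep 60 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit code: $?"   # prints "exit code: 137" (128 + 9)
```

The same convention applies to any fatal signal: a process killed by signal N reports exit code 128 + N, which is why SIGKILL yields 137 and SIGTERM (15) yields 143.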
🛠️ Solutions
Addressing Exit Code 137 requires both immediate mitigation to restore service and long-term optimization to prevent recurrence.
Quick Fix: Increase Memory Limit
Immediate Mitigation: Increase Memory Limit
This strategy provides immediate relief by giving your container more breathing room. It's a critical first step to stabilize a crashing service, but should be followed by deeper analysis.
- Check Current Service Configuration: Before making changes, review your service's current memory allocation.
- Redeploy with Increased Memory: Increase the memory allocation. A common starting point is to double a low limit (e.g., from 256Mi to 512Mi, or 512Mi to 1Gi). Note: Replace `SERVICE_NAME` and `REGION` with your actual service details.
- Verify No Restarts: Monitor your Cloud Run service for stability and the absence of further `Exit Code 137` events.
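The first two steps above can be sketched with `gcloud`; `SERVICE_NAME` and `REGION` are placeholders, and the `--format` path assumes the Knative-style spec that `gcloud run services describe` returns:

```shell
# Inspect the service's current memory limit
gcloud run services describe SERVICE_NAME \
  --region REGION \
  --format='value(spec.template.spec.containers[0].resources.limits.memory)'

# Redeploy with a doubled memory limit (e.g., 512Mi -> 1Gi)
gcloud run services update SERVICE_NAME \
  --region REGION \
  --memory 1Gi
```

`gcloud run services update` creates a new revision with the changed limit, so existing traffic shifts over without a full redeploy of the image.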
Permanent Fix: Optimize Resources + Monitoring
Best Practice Fix: Optimize Resources and Implement Monitoring
For sustained reliability, it's essential to understand your application's actual memory footprint, optimize its resource usage, and proactively monitor for potential issues.
- Optimize Local Filesystem Writes: Cloud Run's writable container filesystem is an in-memory filesystem, so files written to it (including under `/tmp`) count toward your container's memory limit. Write logs to stdout/stderr, where Cloud Logging collects them without consuming container memory, delete temporary files promptly, and ensure your application respects the `TMPDIR` environment variable so temporary data lands in a location you control and clean up.
- Set Resource Requests/Limits Explicitly: When deploying, specify both memory and CPU limits to ensure predictable behavior and prevent over- or under-provisioning.
This example also sets `TEMP_DIR` as an environment variable directly during deployment:

```shell
gcloud run deploy SERVICE_NAME \
  --image gcr.io/PROJECT/IMAGE \
  --memory 2Gi \
  --cpu 2 \
  --min-instances 1 \
  --max-instances 10 \
  --region REGION \
  --set-env-vars TEMP_DIR=/var/log/tmp
```

- Enable Cloud Monitoring Alerts: Proactively monitor memory utilization. Set up alerts in Google Cloud Monitoring to notify you when memory usage approaches the configured limit.
  - Navigate to Cloud Monitoring -> Alerting -> Create Policy.
  - Select metric: `Cloud Run Revision: Container Memory Utilization`.
  - Set a threshold (e.g., 80% or 90% of your allocated memory).
- Deploy and Test Under Load: After implementing optimizations, deploy the service and subject it to realistic load tests to ensure stability and validate resource allocation.
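The `TMPDIR` behavior from the first step can be sanity-checked locally; tools like `mktemp` honor the variable, though whether your application does depends on its language runtime:

```shell
# Point TMPDIR at a scratch directory and confirm mktemp honors it
export TMPDIR=$(mktemp -d)
tmpfile=$(mktemp)

case "$tmpfile" in
  "$TMPDIR"/*) echo "temp file created under TMPDIR: $tmpfile" ;;
  *)           echo "TMPDIR was ignored: $tmpfile" ;;
esac

# Clean up promptly -- on Cloud Run, leftover temp files keep consuming memory
rm -rf "$TMPDIR"
```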
Log Diagnosis Workflow
Actionable Diagnosis: Confirm OOM via Logs
Before applying any fixes, confirm the OOM event by parsing your logs. This prevents misdiagnosis and ensures you're addressing the correct problem.
- Fetch Recent Logs: Use `gcloud logging` to filter for recent occurrences of `exit code: 137`.
- Grep for OOM Signals: Examine the fetched logs for explicit "OOMKilled" or "Out of memory" messages, which definitively point to an OOM event.
- Check Metrics in Cloud Monitoring: Correlate log events with memory utilization graphs in Cloud Monitoring to observe spikes leading up to the crash.
(Adjust `PROJECT_ID`, `SERVICE_NAME`, `--start-time`, and `--end-time` as needed.)

```shell
gcloud monitoring query \
  'fetch cloud_run_revision::cloud_run.googleapis.com/container/memory/utilization' \
  --metric-type='cloud_run.googleapis.com/container/memory/utilization' \
  --project PROJECT_ID \
  --filter='resource.labels.service_name="SERVICE_NAME"' \
  --start-time='2023-10-26T00:00:00Z' \
  --end-time='2023-10-26T01:00:00Z'
```
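The log-fetching step might look like the following; the filter string is an assumption, so adjust `textPayload` matching to the exact wording that appears in your logs:

```shell
# List recent Cloud Run log entries mentioning exit code 137
gcloud logging read \
  'resource.type="cloud_run_revision" AND textPayload:"137"' \
  --project PROJECT_ID \
  --limit 20 \
  --freshness 1d
```

Restricting `resource.type` to `cloud_run_revision` keeps the results scoped to container logs rather than audit or request logs.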
🧩 Technical Context (Visualized)
The Exit Code 137 on Google Cloud Run is fundamentally a Linux kernel event. When a container's process attempts to allocate memory beyond its configured limit, the operating system's Out-Of-Memory (OOM) killer steps in. This subsystem is designed to prevent the entire host machine from crashing due to a single runaway process. It prioritizes killing processes that are consuming excessive resources, specifically by sending a SIGKILL (signal 9), which cannot be caught or ignored by the application, leading to an immediate termination. The 137 exit code is a standard representation of a process terminated by signal 9 (128 + 9).
```mermaid
graph TD
    A[Cloud Run Container Process] --> B{Memory Usage Increases};
    B --> C{Memory Exceeds Configured Limit?};
    C -- Yes --> D[Linux OOM Killer Activated];
    D --> E["Sends SIGKILL (Signal 9)"];
    E --> F{Container Process Terminated};
    F --> G["Cloud Run Reports: Exit Code 137"];
    C -- No --> H[Process Continues Running];
```
✅ Verification
After implementing any of the solutions, it's crucial to verify the stability of your Cloud Run service and confirm the absence of the Exit Code 137.
- Check Service Conditions: Ensure the service is in a healthy state with no unexpected conditions. A clean output (or only `True` conditions) indicates health.
- Monitor Recent Logs for Errors: Search your recent logs for any further `Exit Code 137` or OOM messages. An empty result over recent logs is a good sign.
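Both checks can be scripted with `gcloud`; `SERVICE_NAME`, `REGION`, and `PROJECT_ID` are placeholders, and the conditions `--format` path assumes the Knative-style status block:

```shell
# 1. Inspect service conditions -- healthy services report status: "True"
gcloud run services describe SERVICE_NAME \
  --region REGION \
  --format='yaml(status.conditions)'

# 2. Confirm no new exit-137 events in the last hour (empty output is good)
gcloud logging read \
  'resource.type="cloud_run_revision" AND textPayload:"137"' \
  --project PROJECT_ID \
  --freshness 1h \
  --limit 10
```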
📦 Prerequisites
To effectively diagnose and resolve Exit Code 137 issues on Google Cloud Run, ensure you have the following:
- `gcloud` CLI: Version 450.0.0 or newer, with the `run` component installed.
- Cloud Run Admin Role: Sufficient IAM permissions to deploy and manage Cloud Run services.
- Docker: Version 24+ (if building local images or inspecting Dockerfiles).
- Cloud Logging and Monitoring APIs: Enabled in your Google Cloud Project to access detailed logs and metrics.