Fix Ollama CUDA Out of Memory in 5 Minutes
A terminal-first, quick-fix path for the most common Ollama runtime failure.
A CUDA out-of-memory error is usually not a single problem. It is a budget mismatch between three consumers of VRAM: the model weights, the KV cache (which grows with the context window), and fixed runtime overhead.
Fast fix order
- Switch to a lower-bit quantization of the same model (e.g. a q4 tag instead of q8)
- Reduce the context size (`num_ctx`)
- Offload fewer layers to the GPU (`num_gpu`)
- Retry with a shorter output limit (`num_predict`)
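The steps above map to concrete Ollama knobs. A minimal sketch, assuming an 8B-class model; the tag and the parameter values are illustrative examples, not recommendations for your hardware:

```shell
# Pull a lower-bit quantization of the same model (example tag)
ollama pull llama3.1:8b-instruct-q4_K_M

# Bake memory-safe parameters into a variant via a Modelfile
cat > Modelfile <<'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 2048
PARAMETER num_gpu 24
PARAMETER num_predict 256
EOF

ollama create llama3.1-capped -f Modelfile
ollama run llama3.1-capped
```

Baking the parameters into a named variant (here the hypothetical `llama3.1-capped`) means every later `ollama run` starts from the known-safe budget instead of the defaults.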
Why this works
Each step reduces memory pressure on a different axis: lower-bit quantization shrinks the weights, a smaller context caps the KV cache, fewer offloaded layers move weight memory from VRAM to system RAM, and a shorter output limit bounds how far the KV cache grows during generation. Most users change only one variable and stop too early.
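The context axis in particular is easy to quantify. A rough KV-cache estimate, using a hypothetical 8B-class model shape (32 layers, 8 KV heads, head dimension 128, fp16 cache) — the formula is standard, but the shape numbers are illustrative assumptions:

```python
def kv_cache_bytes(layers, ctx, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: 2 (K and V) per layer, per token, per KV head."""
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem

# Hypothetical 8B-class shape, fp16 cache
full = kv_cache_bytes(32, 8192, 8, 128)   # 8k context
half = kv_cache_bytes(32, 4096, 8, 128)   # 4k context
print(full // 2**20, half // 2**20)       # sizes in MiB
```

The estimate is linear in context length, so halving `num_ctx` halves the KV cache — often several hundred MiB back in one move.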
Prevent repeated OOM
- Keep a per-model context cap
- Save known-good launch commands
- Use a fit calculator to estimate VRAM needs before pulling a new large model
The fastest stable workflow is: estimate -> verify -> lock known-safe parameters.
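The estimate step can be sketched as a small script. This is a rough back-of-the-envelope model (weights + KV cache + a fixed overhead allowance), not an exact accounting; all shape numbers and the 1 GiB overhead default are assumptions to adjust for your model:

```python
def fits(params_b, quant_bits, ctx, layers, kv_heads, head_dim,
         vram_gib, overhead_gib=1.0):
    """Return (fits?, estimated GiB needed) for a rough VRAM budget check."""
    weights = params_b * 1e9 * quant_bits / 8          # weight bytes at this quant
    kv = 2 * layers * ctx * kv_heads * head_dim * 2    # fp16 K/V cache bytes
    need = weights + kv + overhead_gib * 2**30         # plus fixed runtime overhead
    return need <= vram_gib * 2**30, need / 2**30

# Hypothetical 8B model at 4-bit, 8k context, on an 8 GiB card
ok, need_gib = fits(8, 4, 8192, 32, 8, 128, vram_gib=8)
print(ok, round(need_gib, 2))
```

Run the check before pulling, then verify with a real prompt and lock the passing parameters into a Modelfile so the known-good configuration is reproducible.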