Ollama CUDA Out of Memory

Model loading fails, or generation aborts mid-stream, with a CUDA out-of-memory (OOM) error.

Raw Stack Trace

CUDA error: out of memory
current device: 0, capacity: 24GB
ggml_cuda_mul_mat: out of memory
[ERROR] Failed to allocate 42.5 GB

Common causes

  • The chosen quantization is too large for the available VRAM.
  • The context window is set too high, so the KV cache balloons.
  • Too many layers are offloaded to the GPU for the available memory budget.
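The KV-cache cause is easy to quantify: its size grows linearly with context length. A rough sketch below estimates it; the layer count, KV-head count, and head dimension are illustrative values for a Llama-3-8B-class model, not figures read from Ollama.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # One K tensor and one V tensor per layer, fp16 (2 bytes) by default.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 8B-class shape: 32 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes(32, 8192, 8, 128) / 2**30
print(f"{gib:.2f} GiB")  # 1.00 GiB at 8192 context; halving num_ctx halves it
```

This is why lowering the context length is often the cheapest fix: the model weights stay the same size, but the cache shrinks proportionally.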

Copy-fix commands

  • Pull a smaller quantization: ollama pull <model>:q4_K_M
  • Lower the context length (inside an ollama run session): /set parameter num_ctx 4096
  • Offload fewer layers to the GPU (inside an ollama run session): /set parameter num_gpu 40
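To make the lowered limits permanent rather than per-session, the same parameters can be baked into a custom model via a Modelfile. A minimal sketch, assuming a base model tagged llama3; the exact values are starting points, not tuned numbers:

```
FROM llama3
# Cap the context window to shrink the KV cache
PARAMETER num_ctx 4096
# Offload only 40 layers to the GPU; the rest run on CPU
PARAMETER num_gpu 40
```

Build it with: ollama create llama3-lowmem -f Modelfile, then run ollama run llama3-lowmem as usual.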