Common causes
- The model, at its chosen quantization, is too large for the available VRAM.
- The context window is set too high, so the KV cache grows beyond what fits.
- Too many layers are offloaded to the GPU, exceeding the practical memory budget (a rough way to estimate these costs is sketched after this list).
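
The two dominant consumers are the quantized weights and the KV cache, and both can be estimated before loading a model. The sketch below is a back-of-the-envelope calculation, not an exact accounting (it ignores activation buffers and runtime overhead); the function names and architecture numbers are illustrative, assuming a llama-style model with grouped-query attention and fp16 cache entries.

```python
# Rough VRAM estimate for a quantized transformer (illustrative only).

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, ctx: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """The KV cache stores one key and one value vector per layer per token."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Example: a hypothetical 8B model at roughly 4.5 effective bits/weight
# (about q4_k_m), 32 layers, 8 KV heads of dim 128, 8192-token context.
w = weights_gb(8, 4.5)
kv = kv_cache_gb(32, 8192, 8, 128)
print(f"weights ~ {w:.1f} GB, KV cache ~ {kv:.1f} GB, total ~ {w + kv:.1f} GB")
```

If the total approaches the card's capacity (24 GB in the log below), apply one of the fixes at the end of this section.
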
Symptom
Model loading fails, or generation stops partway, with an out-of-memory (OOM) error such as:

```
CUDA error: out of memory
current device: 0, capacity: 24GB
ggml_cuda_mul_mat: out of memory
[ERROR] Failed to allocate 42.5 GB
```

Fixes
Pull a smaller quantization, reduce the context window, or offload fewer layers to the GPU:

```
ollama pull <model>:q4_k_m
ollama run <model> --ctx-size 4096
ollama run <model> --gpu-layers 40
```
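
If the CLI flags above are not available in your Ollama build, the same limits can be set per request through the HTTP API. The sketch below is a minimal example assuming a local server on the default port 11434 and a placeholder model name; `num_ctx` and `num_gpu` mirror the Modelfile parameters of the same names.

```python
import json
import urllib.request

payload = {
    "model": "llama3",  # placeholder; use your model's name
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {
        "num_ctx": 4096,  # smaller context -> smaller KV cache
        "num_gpu": 40,    # offload fewer layers to the GPU
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```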