Common causes
- The model, at its chosen quantization, is too large for the available VRAM.
- The context window is set too high, so the KV cache grows beyond what fits.
- Too many layers are offloaded to the GPU, exceeding the practical memory budget (a rough way to estimate these costs is sketched after this list).
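
The two dominant consumers are the quantized weights and the KV cache, and both can be estimated before loading a model. The sketch below is a back-of-the-envelope calculation, not an exact accounting (it ignores activation buffers and runtime overhead); the function names and architecture numbers are illustrative, assuming a llama-style model with grouped-query attention and fp16 cache entries.

```python
# Rough VRAM estimate for a quantized transformer (illustrative only).

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, ctx: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """The KV cache stores one key and one value vector per layer per token."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Example: a hypothetical 8B model at roughly 4.5 effective bits/weight
# (about q4_k_m), 32 layers, 8 KV heads of dim 128, 8192-token context.
w = weights_gb(8, 4.5)
kv = kv_cache_gb(32, 8192, 8, 128)
print(f"weights ~ {w:.1f} GB, KV cache ~ {kv:.1f} GB, total ~ {w + kv:.1f} GB")
```

If the total approaches the card's capacity (24 GB in the log below), apply one of the fixes at the end of this section.
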
Symptom
Model loading fails, or generation stops partway, with an out-of-memory (OOM) error such as:

```
CUDA error: out of memory
current device: 0, capacity: 24GB
ggml_cuda_mul_mat: out of memory
[ERROR] Failed to allocate 42.5 GB
```

Fixes
Pull a smaller quantization, reduce the context window, or offload fewer layers to the GPU:

```
ollama pull <model>:q4_k_m
ollama run <model> --ctx-size 4096
ollama run <model> --gpu-layers 40
```
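
If the CLI flags above are not available in your Ollama build, the same limits can be set per request through the HTTP API. The sketch below is a minimal example assuming a local server on the default port 11434 and a placeholder model name; `num_ctx` and `num_gpu` mirror the Modelfile parameters of the same names.

```python
import json
import urllib.request

payload = {
    "model": "llama3",  # placeholder; use your model's name
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {
        "num_ctx": 4096,  # smaller context -> smaller KV cache
        "num_gpu": 40,    # offload fewer layers to the GPU
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```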