Multimodal Vision Guide

Pick vision-capable models by your memory ceiling first: match a model's optimal VRAM figure to the hardware you actually have. This keeps deployment stable and avoids over-sizing before you have evidence that a heavier profile is required.
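As a rough rule of thumb, a quantized model's weight footprint is its parameter count times bits per weight, plus overhead for the vision encoder, KV cache, and activations. A minimal sketch of that arithmetic (the 1.2x overhead factor is an assumption for illustration, not a measured value; the tables below quote higher minimums because real runtimes also hold image tokens and context):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: weight bytes (params * bits / 8)
    scaled by an assumed overhead factor for KV cache, activations,
    and the vision encoder."""
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

# Example: a 7B model at Q4 (4-bit) quantization
print(estimate_vram_gb(7, 4))  # → 4.2 (weight-dominated lower bound)
```

Treat the result as a floor, not a budget: long contexts and high-resolution images grow memory use well past the weights alone.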

Starter tier (<=12GB optimal)

Model               VRAM (min / optimal)
LLaVA 7B Q4         8 GB / 10 GB
Gemma 3 270M Q4     2 GB / 10 GB
LLaVA 7B Q5         10 GB / 12 GB
Qwen3 VL 2B Q4      2 GB / 12 GB
Qwen3 VL 2B CLOUD   4 GB / 12 GB
Gemma 3n E2B Q4     2 GB / 12 GB

Balanced tier (13GB-24GB optimal)

Model                    VRAM (min / optimal)
Llama 3.2 Vision 11B Q4  12 GB / 14 GB
Gemma 3 4B Q4            4 GB / 14 GB
Qwen3 VL 4B Q4           4 GB / 14 GB
Qwen3 VL 4B CLOUD        6 GB / 14 GB
Gemma 3n E4B Q4          4 GB / 14 GB
Qwen2.5 VL 3B Q4         4 GB / 14 GB
Qwen3 VL 2B Q5           4 GB / 14 GB
Gemma 3n E2B Q5          4 GB / 14 GB

Heavy tier (>24GB optimal)

Model                    VRAM (min / optimal)  Cloud fallback
Llama 4 128X17B Q4       418 GB / 428 GB       H100/H200 class
Llama 4 128X17B Q5       420 GB / 430 GB       H100/H200 class
Llama 4 128X17B Q8       424 GB / 434 GB       H100/H200 class
Llama 4 128X17B FP16     430 GB / 442 GB       H100/H200 class
Qwen3.5 397B-A17B CLOUD  215 GB / 223 GB       H100/H200 class
Llama 4 16X17B Q4        213 GB / 223 GB       H100/H200 class
Llama 4 16X17B Q5        215 GB / 225 GB       H100/H200 class
Llama 4 16X17B Q8        219 GB / 229 GB       H100/H200 class
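The three tiers above reduce to a simple selection rule: given the VRAM you can dedicate to the model, pick the tier whose optimal range fits. A sketch using this guide's tier boundaries (the function name is illustrative, not from any library):

```python
def pick_tier(optimal_vram_gb: float) -> str:
    """Map an optimal-VRAM budget to this guide's tiers:
    starter (<=12 GB), balanced (13-24 GB), heavy (>24 GB)."""
    if optimal_vram_gb <= 12:
        return "starter"
    if optimal_vram_gb <= 24:
        return "balanced"
    return "heavy"

print(pick_tier(10))   # → starter  (e.g. LLaVA 7B Q4)
print(pick_tier(14))   # → balanced (e.g. Qwen3 VL 4B Q4)
print(pick_tier(428))  # → heavy    (e.g. Llama 4 128X17B Q4)
```

Selecting by the optimal figure rather than the minimum leaves headroom for image tokens and context growth.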