Best RAG Models for Local Deployments

Strong RAG results depend on pairing a good retriever with a capable generator. Use this page as a planning hub, then open each model's detail page for measured benchmarks and hardware paths.

Embedding models for retrieval quality

| Model | Size | Precision | VRAM (min / optimal) | Category |
|---|---|---|---|---|
| MXBAI Embed Large | 335M | FP16 | 2 GB / 10 GB | embedding |
| Snowflake Arctic Embed | 335M | FP16 | 2 GB / 10 GB | embedding |
| Nomic Embed Text | 137M | FP16 | 2 GB / 10 GB | embedding |
| Snowflake Arctic Embed | 137M | FP16 | 2 GB / 10 GB | embedding |
| Snowflake Arctic Embed | 110M | FP16 | 2 GB / 10 GB | embedding |
| All-MiniLM | 33M | FP16 | 2 GB / 10 GB | embedding |
| Snowflake Arctic Embed | 33M | FP16 | 2 GB / 10 GB | embedding |
| All-MiniLM | 22M | FP16 | 2 GB / 10 GB | embedding |
| Snowflake Arctic Embed | 22M | FP16 | 2 GB / 10 GB | embedding |
| BGE-M3 | 567M | FP16 | 4 GB / 12 GB | embedding |
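Whichever embedding model you pick, retrieval itself reduces to ranking document vectors by similarity to the query vector. A minimal sketch (toy 3-dimensional vectors stand in for the 384-1024-dimensional embeddings the models above produce; `top_k` and the sample data are illustrative, not from any library):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings; a real embedder produces much wider vectors.
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.8, 0.2, 0.1]]
query = [1.0, 0.0, 0.0]
print(top_k(query, docs))  # → [0, 2]
```

In production the same loop runs over a vector index (FAISS, pgvector, etc.) rather than a Python list, but the ranking logic is unchanged.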

Generation models for local answer quality

| Model | Size | Quant | VRAM (min / optimal) | RTX 3090 tok/s | Data |
|---|---|---|---|---|---|
| GLM 4.7 Flash | 7B | Q4 | 6 GB / 16 GB | 30 | Measured |
| GLM 4.7 Flash | 7B | Q5 | 8 GB / 18 GB | 27 | Measured |
| DeepSeek-R1 | 14B | Q4 | 10 GB / 20 GB | 21 | Measured |
| Ministral 3 | 14B | Q4 | 10 GB / 20 GB | 21 | Measured |
| DeepSeek-R1 | 14B | Q5 | 12 GB / 22 GB | 18.9 | Measured |
| Ministral 3 | 14B | Q5 | 12 GB / 22 GB | 18.9 | Measured |
| GLM 4.7 Flash | 7B | Q8 | 12 GB / 22 GB | 21.6 | Measured |
| Gemma 2 | 2B | Q4 | 2 GB / 4 GB | 42 | Estimated |
| TinyLlama | 1.1B | Q4 | 2 GB / 4 GB | 42 | Estimated |
| Phi-3 | 3.8B | Q4 | 4 GB / 6 GB | 36 | Estimated |
| Gemma 2 | 2B | Q5 | 3 GB / 6 GB | 37.8 | Estimated |
| TinyLlama | 1.1B | Q5 | 3 GB / 6 GB | 37.8 | Estimated |
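The VRAM minimums above roughly track a simple rule of thumb: weight memory is parameter count times bits per weight, plus a fixed allowance for KV cache and runtime buffers. A hedged sketch (the 1.5 GB overhead is an assumption, and real usage grows with context length):

```python
def estimate_vram_gb(params_b, quant_bits, overhead_gb=1.5):
    """Rough VRAM estimate for a quantized model.

    params_b    -- parameter count in billions
    quant_bits  -- effective bits per weight (Q4 ~ 4-4.5 incl. metadata)
    overhead_gb -- assumed fixed allowance for KV cache and buffers
    """
    weights_gb = params_b * quant_bits / 8
    return round(weights_gb + overhead_gb, 1)

# A 14B model at ~4 bits/weight lands near the table's 10 GB minimum
# once context buffers grow:
print(estimate_vram_gb(14, 4.0))  # → 8.5
```

The "optimal" column adds headroom for long contexts, which is why it sits well above this floor.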

Recommended local RAG flow

  1. Pick an embedding model that fits your latency and memory budget.
  2. Use a 7B to 32B generator model that can sustain your expected context length.
  3. Tune retrieval quality first, then upgrade generation model size if needed.
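The flow above can be sketched end to end. Here `embed` and `generate` are toy stand-ins (a bag-of-characters vector and a template string, both assumptions for illustration); in a real deployment they would call one of the embedding models and one of the local generators listed above:

```python
from math import sqrt

def embed(text):
    # Stand-in embedder: letter-frequency vector (assumption, not a real model).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def retrieve(query, corpus, k=1):
    """Step 1-2 of the flow: rank corpus passages by cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) *
                      sqrt(sum(y * y for y in b)) + 1e-9)
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cos(q, embed(d)), reverse=True)
    return ranked[:k]

def generate(question, context):
    # Stand-in generator: a real deployment prompts a local 7B+ LLM here.
    return f"Answer based on: {context[0]}"

corpus = ["GPUs need VRAM for weights and KV cache.",
          "Embedding models map text to vectors."]
question = "How much VRAM do weights use?"
print(generate(question, retrieve(question, corpus)))
```

Because the two stages are decoupled, step 3 of the flow (tune retrieval first) amounts to improving `retrieve` while holding `generate` fixed, then upgrading the generator only if answers still fall short.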