# Best RAG Models for Local Deployments
Strong RAG results depend on pairing the right retrieval model with the right generation model. Use this page as a planning hub, then open each model's detail page for measured performance signals and hardware guidance.
## Embedding models for retrieval quality
| Model | VRAM min/optimal | Category | Detail |
|---|---|---|---|
| MXBAI Embed Large 335M FP16 | 2GB / 10GB | embedding | Open |
| Snowflake Arctic Embed 335M FP16 | 2GB / 10GB | embedding | Open |
| Nomic Embed Text 137M FP16 | 2GB / 10GB | embedding | Open |
| Snowflake Arctic Embed 137M FP16 | 2GB / 10GB | embedding | Open |
| Snowflake Arctic Embed 110M FP16 | 2GB / 10GB | embedding | Open |
| All-MiniLM 33M FP16 | 2GB / 10GB | embedding | Open |
| Snowflake Arctic Embed 33M FP16 | 2GB / 10GB | embedding | Open |
| All-MiniLM 22M FP16 | 2GB / 10GB | embedding | Open |
| Snowflake Arctic Embed 22M FP16 | 2GB / 10GB | embedding | Open |
| BGE-M3 567M FP16 | 4GB / 12GB | embedding | Open |
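Whichever embedding model you choose, retrieval reduces to the same operation: embed the query, score it against stored document vectors, and keep the top matches. The sketch below uses toy 3-dimensional vectors in place of real model output (a production setup would get these from one of the models above); the cosine-similarity ranking logic is the same either way.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy vectors standing in for real embedding-model output.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.2],
    "doc_c": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]
print(top_k(query, docs))  # → ['doc_a', 'doc_b']
```

At real scale this brute-force scan is replaced by a vector index, but the scoring function and top-k cut are identical.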
## Generation models for local answer quality
| Model | VRAM min/optimal | RTX 3090 tok/s | Data | Detail |
|---|---|---|---|---|
| GLM 4.7 Flash 7B Q4 | 6GB / 16GB | 30 | Measured | Open |
| GLM 4.7 Flash 7B Q5 | 8GB / 18GB | 27 | Measured | Open |
| DeepSeek-R1 14B Q4 | 10GB / 20GB | 21 | Measured | Open |
| Ministral 3 14B Q4 | 10GB / 20GB | 21 | Measured | Open |
| DeepSeek-R1 14B Q5 | 12GB / 22GB | 18.9 | Measured | Open |
| Ministral 3 14B Q5 | 12GB / 22GB | 18.9 | Measured | Open |
| GLM 4.7 Flash 7B Q8 | 12GB / 22GB | 21.6 | Measured | Open |
| Gemma 2 2B Q4 | 2GB / 4GB | 42 | Estimated | Open |
| TinyLlama 1.1B Q4 | 2GB / 4GB | 42 | Estimated | Open |
| Phi-3 3.8B Q4 | 4GB / 6GB | 36 | Estimated | Open |
| Gemma 2 2B Q5 | 3GB / 6GB | 37.8 | Estimated | Open |
| TinyLlama 1.1B Q5 | 3GB / 6GB | 37.8 | Estimated | Open |
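The VRAM minimums in the table track a simple rule of thumb: weight memory is roughly parameter count times bits per weight. The sketch below applies that rule; the 4.5 bits/weight figure for Q4 is an assumption meant to account for quantization metadata overhead, and the result covers weights only, so KV cache and activations explain the gap up to the table's stated minimums.

```python
def estimate_weight_gb(params_billion, bits_per_weight):
    """Rough weight-only memory footprint in GB: params * bits / 8.

    Excludes KV cache and activations, which grow with context length.
    """
    return params_billion * bits_per_weight / 8

# Assumption: Q4 quantization costs ~4.5 bits/weight once block
# scales and metadata are included.
print(round(estimate_weight_gb(14, 4.5), 1))  # → 7.9 (GB of weights for a 14B Q4 model)
```

That ~7.9 GB of weights plus a few GB of KV cache is consistent with the 10 GB minimum listed for the 14B Q4 models above.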
## Recommended local RAG flow
- Pick an embedding model that fits your latency and memory budget.
- Use a 7B to 32B generator model that can sustain your expected context length.
- Tune retrieval quality first, then scale up the generation model only if answers still fall short.
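The flow above can be sketched end to end. A toy lexical scorer stands in for the embedding model, and the final generation call is left as a prompt string; in a real deployment, both steps would call models from the tables above (names and corpus contents here are illustrative).

```python
def retrieve(query, corpus, k=2):
    """Rank passages by shared terms with the query.

    A toy lexical scorer standing in for an embedding model.
    """
    q_terms = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(q_terms & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(query, passages):
    """Assemble the grounded prompt handed to the generation model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Illustrative corpus; a real index would hold your own documents.
corpus = {
    "1": "gpus accelerate local inference for quantized models",
    "2": "embedding models map text to vectors for retrieval",
    "3": "sourdough bread needs a long fermentation",
}
passages = retrieve("embedding models retrieval", corpus)
print(build_prompt("How do embedding models support retrieval?", passages))
```

Keeping retrieval and prompt assembly separate like this makes the first tuning step (retrieval quality) testable on its own, before any generation model enters the loop.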