Quant Hub
Structured index of open-source quantized models — no file hosting, just precise metadata · 51 models in index
Llama 3.1 8B Instruct
8B · Meta Llama 3.1
Meta's flagship 8B model with 128K context. Best-in-class for local deployment.
3.2 GB min VRAM · 131K ctx · 235 tok/s
Llama 3.1 70B Instruct
70B · Meta Llama 3.1
Meta's frontier 70B model. Requires 40GB+ VRAM; dual 3090 or M2 Ultra.
33.4 GB min VRAM · 131K ctx · 62 tok/s
Llama 3.2 3B Instruct
3B · Meta Llama 3.2
Tiny but capable. Runs on 4GB VRAM or 8GB RAM, even on phones via llama.cpp.
2.0 GB min VRAM · 131K ctx · 420 tok/s
Qwen2.5 7B Instruct
7B · Alibaba Qwen2.5
Alibaba's highly optimized 7B. Punches well above its weight, especially in coding.
4.8 GB min VRAM · 131K ctx · 245 tok/s
Qwen2.5 14B Instruct
14B · Alibaba Qwen2.5
The sweet spot between performance and resource usage. 16GB VRAM with Q4.
9.2 GB min VRAM · 131K ctx · 138 tok/s
Qwen2.5 32B Instruct
32B · Alibaba Qwen2.5
Near-GPT-4 reasoning on a 24GB VRAM card (Q4_K_S). Groundbreaking value.
16.4 GB min VRAM · 131K ctx · 68 tok/s
DeepSeek-Coder-V2-Lite Instruct
16B · DeepSeek
MoE architecture coding model. Active params ~2.4B, total ~16B. Exceptional code quality.
9.8 GB min VRAM · 164K ctx · 192 tok/s
Phi-3.5 Mini Instruct
3.8B · Microsoft Phi
Microsoft's tiny powerhouse. Best 4B model for on-device deployment.
2.5 GB min VRAM · 131K ctx · 385 tok/s
Mistral Nemo 12B Instruct
12B · Mistral AI
Mistral + NVIDIA collaboration. 128K context, excellent multilingual support.
7.8 GB min VRAM · 131K ctx · 148 tok/s
Gemma 2 9B Instruct
9B · Google Gemma 2
Google's compact Gemma 2 with sliding window attention. Punches above 9B.
5.8 GB min VRAM · 8K ctx · 188 tok/s
Qwen2.5 72B Instruct
72B · Alibaba Qwen2.5
Flagship Qwen2.5. Requires dual 4090 or A100 80GB. Exceptional reasoning at scale.
33.8 GB min VRAM · 131K ctx · 48 tok/s
DeepSeek-R1-Distill-Qwen-14B
14B · DeepSeek
R1 reasoning distilled into 14B. Huge community interest; excellent chain-of-thought.
9.2 GB min VRAM · 131K ctx · 128 tok/s
Llama 3.3 70B Instruct
70B · Meta Llama 3.3
Latest Meta 70B with improved multilingual support. Drop-in upgrade from Llama 3.1 70B.
38.2 GB min VRAM · 131K ctx · 54 tok/s
Mistral Small 24B Instruct
24B · Mistral AI
Mistral's efficient 24B. Strong multilingual; fits on 24GB with Q4.
13.5 GB min VRAM · 33K ctx · 88 tok/s
Qwen2.5-Coder 32B Instruct
32B · Alibaba Qwen2.5
Top-tier open coding model. HumanEval scores competitive with GPT-4o at 32B scale.
16.4 GB min VRAM · 131K ctx · 65 tok/s
Qwen2.5-Coder 7B Instruct
7B · Alibaba Qwen2.5
Best 7B coding model. Ideal for local dev assistants on 8–16GB VRAM.
4.8 GB min VRAM · 131K ctx · 248 tok/s
Qwen2.5 3B Instruct
3B · Alibaba Qwen2.5
Tiny Qwen2.5 for edge devices. Runs on 4GB VRAM or Raspberry Pi class hardware.
2.1 GB min VRAM · 33K ctx · 340 tok/s
Llama 3.2 1B Instruct
1B · Meta Llama 3.2
Ultra-light Llama for mobile and embedded. Sub-2GB VRAM with Q4.
1.0 GB min VRAM · 131K ctx · 520 tok/s
DeepSeek-R1-Distill-Llama-70B
70B · DeepSeek
R1 reasoning in Llama 70B architecture. Top open reasoning model for dual-GPU setups.
38.2 GB min VRAM · 131K ctx · 52 tok/s
Codestral 22B
22B · Mistral AI
Mistral's dedicated code model. 80+ language support, Fill-in-the-Middle capable.
13.2 GB min VRAM · 33K ctx · 72 tok/s
Mixtral 8x7B Instruct
47B MoE · Mistral AI
Classic MoE model. ~13B active params per token; needs 32GB+ VRAM for Q4.
25.2 GB min VRAM · 33K ctx · 62 tok/s
Command R 35B
35B · Cohere
Cohere's RAG-optimised model. Excellent at retrieval-augmented generation.
20.5 GB min VRAM · 131K ctx · 55 tok/s
Yi 1.5 34B Chat
34B · 01.AI Yi
01.AI's strong bilingual (EN/ZH) model. Competitive with Qwen 32B.
19.8 GB min VRAM · 4K ctx · 52 tok/s
Solar 10.7B Instruct
11B · Upstage
Depth-upscaled 10.7B that punches above its weight. Strong on reasoning benchmarks.
6.5 GB min VRAM · 4K ctx · 168 tok/s
StarCoder2 15B
15B · BigCode
BigCode's open code model trained on 600+ programming languages. Great for polyglot dev.
9.2 GB min VRAM · 16K ctx · 115 tok/s
Llama 3.2 11B Vision Instruct
11B · Meta Llama 3.2
Multimodal Llama with image understanding. Vision encoder adds ~2GB VRAM overhead.
9.5 GB min VRAM · 131K ctx · 88 tok/s
Qwen2-VL 7B Instruct
7B · Alibaba Qwen2
Vision-language model with video understanding. Strong OCR and chart reading.
6.0 GB min VRAM · 33K ctx · 95 tok/s
Nous Hermes 3 Llama 3.1 8B
8B · NousResearch
Fine-tuned Llama 3.1 8B with improved roleplay and instruction following.
5.4 GB min VRAM · 131K ctx · 232 tok/s
WizardLM-2 7B
7B · Microsoft / WizardLM
Evol-Instruct fine-tuned Mistral-based 7B. Strong complex instruction handling.
4.8 GB min VRAM · 33K ctx · 218 tok/s
Granite 3.1 8B Instruct
8B · IBM Granite
IBM's enterprise-grade 8B. Strong RAG and tool-use; permissive Apache 2.0 license.
5.0 GB min VRAM · 131K ctx · 195 tok/s
Gemma 2 2B Instruct
2B · Google Gemma 2
Ultra-compact Gemma 2. Runs on 4GB VRAM; great for edge prototyping.
1.8 GB min VRAM · 8K ctx · 450 tok/s
Gemma 2 27B Instruct
27B · Google Gemma 2
Largest open Gemma 2. Strong reasoning; needs 24GB+ VRAM at Q4.
16.2 GB min VRAM · 8K ctx · 58 tok/s
Qwen2.5 0.5B Instruct
0.5B · Alibaba Qwen2.5
Smallest Qwen2.5. Ideal for Raspberry Pi, phones, and ultra-low-latency demos.
0.6 GB min VRAM · 33K ctx · 620 tok/s
Qwen2.5 1.5B Instruct
1.5B · Alibaba Qwen2.5
Tiny Qwen with 32K context. Surprisingly capable for summarisation and chat.
1.4 GB min VRAM · 33K ctx · 480 tok/s
Phi-3 Medium 14B Instruct
14B · Microsoft Phi
Microsoft's mid-size Phi-3. Excellent quality-per-GB on 16GB cards.
8.8 GB min VRAM · 131K ctx · 135 tok/s
Phi-4 Mini Instruct
3.8B · Microsoft Phi
Latest Phi mini with improved math and code. Strong 4B-class performer.
2.5 GB min VRAM · 131K ctx · 395 tok/s
Mistral 7B Instruct v0.3
7B · Mistral AI
Classic Mistral 7B v0.3. Still a reliable baseline for local chat APIs.
4.6 GB min VRAM · 33K ctx · 225 tok/s
DeepSeek-V2-Lite Chat
16B · DeepSeek
MoE general model (~2.4B active). Long context and strong multilingual chat.
9.6 GB min VRAM · 164K ctx · 188 tok/s
DeepSeek-R1-Distill-Qwen-7B
7B · DeepSeek
R1 reasoning in a 7B footprint. Best value for chain-of-thought experiments on 8–12GB VRAM.
5.2 GB min VRAM · 131K ctx · 210 tok/s
DeepSeek-R1-Distill-Qwen-32B
32B · DeepSeek
R1 distilled to 32B. Near-frontier reasoning on a single 24GB card (Q3/Q4).
16.8 GB min VRAM · 131K ctx · 65 tok/s
Llama 3.2 90B Vision Instruct
90B · Meta Llama 3.2
Flagship multimodal Llama. Requires dual 4090 or A100; vision adds ~3GB overhead.
44.2 GB min VRAM · 131K ctx · 28 tok/s
OLMo 2 7B Instruct
7B · Allen AI OLMo
Fully open training pipeline from Allen AI. Great for reproducibility research.
5.3 GB min VRAM · 4K ctx · 150 tok/s
InternLM2 7B Chat
7B · Shanghai AI Lab
Strong bilingual (EN/ZH) 7B from Shanghai AI Lab. Competitive with Qwen 7B.
4.9 GB min VRAM · 33K ctx · 215 tok/s
InternLM2 20B Chat
20B · Shanghai AI Lab
Mid-size InternLM2 with excellent Chinese comprehension. Fits 24GB at Q4.
13.8 GB min VRAM · 33K ctx · 78 tok/s
Aya 23 8B
8B · Cohere For AI
Multilingual specialist covering 23 languages. Strong for non-English local apps.
5.0 GB min VRAM · 8K ctx · 205 tok/s
OpenChat 3.6 8B
8B · OpenChat
C-RLFT fine-tuned Llama 3 8B. Known for natural conversational tone.
5.4 GB min VRAM · 8K ctx · 228 tok/s
Zephyr 7B Beta
7B · HuggingFaceH4
DPO-aligned Mistral 7B. Classic choice for helpful, harmless chat baselines.
5.2 GB min VRAM · 33K ctx · 155 tok/s
Stable LM 2 12B Chat
12B · Stability AI
Stability AI's 12B chat model. Solid general-purpose option for 16GB GPUs.
7.2 GB min VRAM · 4K ctx · 142 tok/s
Falcon 3 10B Instruct
10B · TII UAE
Technology Innovation Institute's latest Falcon. Good mix of multilingual and code ability.
6.2 GB min VRAM · 33K ctx · 155 tok/s
Jamba 1.5 Mini
12B · AI21 Labs
Hybrid SSM-Transformer MoE (~12B active of ~52B total) with 256K context. Efficient long-document QA on 16GB.
7.5 GB min VRAM · 262K ctx · 125 tok/s
DBRX Instruct
132B · Databricks
MoE flagship (~36B active). Needs multi-GPU; strong code and reasoning at scale.
63.2 GB min VRAM · 33K ctx · 18 tok/s
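
Notes such as "16GB VRAM with Q4" in the cards above come down to simple arithmetic on parameter count and bits per weight. A rough sketch of that estimate follows. The function names, the 4.5 bits-per-weight figure for a Q4-style quant, and the 1.5 GB runtime overhead are illustrative assumptions, not values taken from this index, and KV-cache growth with context length is ignored.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * bits_per_weight / 8 bytes, divided by 1e9.
    For MoE models pass the TOTAL parameter count: all experts must be
    resident in memory even though only some are active per token.
    """
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Check whether weights plus an assumed fixed overhead fit in vram_gb.

    overhead_gb is a hypothetical allowance for runtime buffers; real
    overhead depends on the backend, batch size, and context length.
    """
    return estimate_weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Assumed ~4.5 bits per weight for a Q4-style quantization.
    for name, size_b in [("14B", 14), ("32B", 32), ("70B", 70)]:
        gb = estimate_weight_gb(size_b, 4.5)
        print(f"{name}: ~{gb:.1f} GB weights, fits 24 GB card: "
              f"{fits_in_vram(size_b, 4.5, 24)}")
```

Treat the result as a lower bound for planning; the min VRAM values listed in the cards may include overheads that this sketch does not model.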