Gemma 3 12B IT

12B

Google Gemma 3

Mid-size Gemma 3 with vision. Fits 16GB at Q4; excellent multilingual chat.

Consumer GPUMac / Apple Silicon

131K

Max Context

Quant Variants

GGUF Q5_K_M

Best Quality

98.7%

Accuracy Retained

Quantization Variants

Per-quant VRAM, quality loss, and inference speed on RTX 4090

Format	Level	BPW	VRAM	PPL Loss	Speed	Actions
GGUF	Q4_K_M	4.85	8.8 GB	2.9%	105 tok/s	Calc HF
GGUF	Q5_K_M	5.68	10.2 GB	1.3%	92 tok/s	Calc HF
AWQ	INT4	4	8.0 GB	3.9%	128 tok/s	Calc HF

Google Gemma 3

Google Gemma 3 multimodal 4B. 128K context; strong vision + text on 8GB cards.

Alibaba Qwen2.5

The sweet spot between performance and resource usage. 16GB VRAM with Q4.

Mistral AI

Mistral + NVIDIA collaboration. 128K context, excellent multilingual support.

Google Gemma 2

Google's compact Gemma 2 with sliding window attention. Punches above 9B.