Gemma 3 4B IT

Google Gemma 3

Google Gemma 3 multimodal 4B. 128K context; strong vision + text on 8GB cards.

Consumer GPUMac / Apple SiliconCPU / VPS

131K

Max Context

Quant Variants

GGUF Q8_0

Best Quality

99.8%

Accuracy Retained

Quantization Variants

Per-quant VRAM, quality loss, and inference speed on RTX 4090

Format	Level	BPW	VRAM	PPL Loss	Speed	Actions
GGUF	Q4_K_M	4.85	3.4 GB	3.2%	175 tok/s	Calc HF
GGUF	Q8_0	8.5	5.2 GB	0.2%	145 tok/s	Calc HF
AWQ	INT4	4	3.0 GB	4.2%	210 tok/s	Calc HF

Google Gemma 3

Mid-size Gemma 3 with vision. Fits 16GB at Q4; excellent multilingual chat.

Meta Llama 3.1

Meta's flagship 8B model with 128K context. Best-in-class for local deployment.

Alibaba Qwen2.5

Alibaba's highly optimized 7B. Punches well above its weight, especially in coding.

Microsoft Phi

Microsoft's tiny powerhouse. Best 4B model for on-device deployment.