Benchmarks & Insights

Hard numbers, no hype — measured on real hardware

Inference Speed

Tokens/second · Llama 3.1 8B Instruct · batch=1
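
The page does not describe its measurement harness, so the following is only a minimal sketch of how a batch=1 tokens/second figure could be collected. The `generate` callable and the `fake_generate` stand-in are hypothetical placeholders, not the API of any framework listed here; a real backend would be wrapped so that it yields one item per generated token.

```python
import time
from typing import Callable, Iterator


def measure_tokens_per_second(
    generate: Callable[[str, int], Iterator[str]],
    prompt: str,
    max_new_tokens: int = 256,
    warmup_runs: int = 1,
    timed_runs: int = 3,
) -> float:
    """Return generated tokens divided by wall time, one request at a time.

    The timer starts before generation begins, so prompt processing
    is included in the measured time.
    """
    # Warm-up runs are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup_runs):
        for _ in generate(prompt, max_new_tokens):
            pass

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(timed_runs):
        start = time.perf_counter()
        produced = sum(1 for _ in generate(prompt, max_new_tokens))
        total_seconds += time.perf_counter() - start
        total_tokens += produced
    return total_tokens / total_seconds


def fake_generate(prompt: str, max_new_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in backend: emits one dummy token roughly every 1 ms."""
    for i in range(max_new_tokens):
        time.sleep(0.001)
        yield f"tok{i}"


if __name__ == "__main__":
    rate = measure_tokens_per_second(fake_generate, "Hello", max_new_tokens=50)
    print(f"{rate:.1f} tok/s")
```

Averaging over several timed runs and fixing `max_new_tokens` keeps results comparable across backends; whether the published numbers include prompt processing is not stated on the page.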

Perplexity vs Quantization Level

WikiText-2 PPL · Llama 3.1 8B · lower = better (FP16 baseline = 6.14)
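
For reference, the conventional token-level definition of perplexity is given below. This assumes the standard formulation; the page does not state the evaluation settings used (context length, stride, tokenizer), so absolute values may not be directly comparable with numbers from other sources.

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right)
```

Here $x_1, \dots, x_N$ are the tokens of the evaluation text and $p_\theta$ is the model's predicted probability for each token given the preceding ones. A quantized model that tracks the FP16 baseline of 6.14 closely has lost little predictive quality.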

Hardware × Format Matrix

Tokens/second · Llama 3.1 8B Instruct · batch=1

| Hardware | Framework | Quant | Speed (tok/s) | VRAM Used | Notes |
|---|---|---|---|---|---|
| RTX 4090 24G | ExLlamaV2 | EXL2 4.65bpw | 235 | 5.4 GB | Peak consumer performance |
| RTX 4090 24G | vLLM | AWQ INT4 | 218 | 4.9 GB | Best for batch API |
| RTX 4090 24G | llama.cpp | GGUF Q4_K_M | 148 | 5.7 GB | Easiest setup |
| RTX 4060 Ti 16G | ExLlamaV2 | EXL2 4.65bpw | 98 | 5.4 GB | Great budget option |
| RTX 4060 Ti 16G | llama.cpp | GGUF Q4_K_M | 78 | 5.7 GB | Budget-friendly |
| RTX 3090 24G | ExLlamaV2 | EXL2 4.65bpw | 175 | 5.4 GB | Older but capable |
| M3 Max 48G | Ollama | GGUF Q4_K_M | 68 | 5.7 GB | Unified memory advantage |
| M2 Ultra 192G | llama.cpp | GGUF Q4_K_M | 90 | 5.7 GB | Can run 70B models solo |
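
To make the matrix easier to compare, the sketch below loads the rows above into a small data structure, converts throughput to per-token latency, and picks the fastest configuration for each device. The `Result` class and helper names are illustrative only; the figures are copied from the table as published.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Result:
    hardware: str
    framework: str
    quant: str
    tok_per_s: float
    vram_gb: float


# Rows copied from the Hardware × Format Matrix (Llama 3.1 8B Instruct, batch=1).
RESULTS: List[Result] = [
    Result("RTX 4090 24G", "ExLlamaV2", "EXL2 4.65bpw", 235, 5.4),
    Result("RTX 4090 24G", "vLLM", "AWQ INT4", 218, 4.9),
    Result("RTX 4090 24G", "llama.cpp", "GGUF Q4_K_M", 148, 5.7),
    Result("RTX 4060 Ti 16G", "ExLlamaV2", "EXL2 4.65bpw", 98, 5.4),
    Result("RTX 4060 Ti 16G", "llama.cpp", "GGUF Q4_K_M", 78, 5.7),
    Result("RTX 3090 24G", "ExLlamaV2", "EXL2 4.65bpw", 175, 5.4),
    Result("M3 Max 48G", "Ollama", "GGUF Q4_K_M", 68, 5.7),
    Result("M2 Ultra 192G", "llama.cpp", "GGUF Q4_K_M", 90, 5.7),
]


def ms_per_token(tok_per_s: float) -> float:
    """Convert throughput (tokens/second) to average latency per token in ms."""
    return 1000.0 / tok_per_s


def fastest_per_hardware(results: List[Result]) -> Dict[str, Result]:
    """Return the highest-throughput row for each hardware entry."""
    best: Dict[str, Result] = {}
    for r in results:
        if r.hardware not in best or r.tok_per_s > best[r.hardware].tok_per_s:
            best[r.hardware] = r
    return best


if __name__ == "__main__":
    for hw, r in fastest_per_hardware(RESULTS).items():
        print(
            f"{hw}: {r.framework} ({r.quant}), "
            f"{r.tok_per_s:.0f} tok/s, {ms_per_token(r.tok_per_s):.2f} ms/token"
        )
```

Per-token latency is often the more intuitive figure for interactive use, since it maps directly to how quickly text appears on screen.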