IntermediateEdge / Local 8 min read

Qwen2.5-Coder 32B on a Single RTX 4090

The best open coding model that fits in 24GB — quant selection and tuning tips.

Qwen2.5-Coder32BRTX 4090GGUF

Quant choice

Q4_K_M uses ~22GB for weights alone. Drop to Q3_K_M or EXL2 3.5bpw if you need 8K+ context.

text

Q4_K_M: best quality, ~22 GB weights
Q3_K_M: saves ~4 GB, acceptable for code
EXL2 3.5bpw: fastest, ~16 GB weights

Recommended command

llama.cpp with full GPU offload and 4K context is the sweet spot.

bash

./build/bin/llama-server \
  -m ./models/Qwen2.5-Coder-32B-Q4_K_M.gguf \
  -ngl 99 -c 4096 --host 0.0.0.0 --port 8080

Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.