Advanced | Server / VPS | 10 min read
vLLM + AWQ in Production: Tuning Guide
gpu-memory-utilization, max-model-len, and batching knobs for stable API serving.
vLLM, AWQ, production, API
Memory tuning
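Start with --gpu-memory-utilization 0.85, which lets vLLM claim 85% of GPU memory for weights, activations, and the KV cache. If you hit out-of-memory (OOM) errors on long contexts, lower it to 0.75 to leave more headroom.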
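To see why long contexts are what tend to exhaust memory, here is a rough sketch that estimates KV cache size from model dimensions. The helper names (kv_bytes_per_token, kv_mib_per_seq) are hypothetical, and the dimensions used (28 layers, 4 KV heads, head size 128, 2-byte fp16 cache) are assumptions for a Qwen2.5-7B-class model; check them against the model's config.json before trusting the numbers.

```bash
#!/usr/bin/env bash
# Rough KV-cache sizing. The model dimensions used below are assumptions,
# not values read from the model; verify them in config.json.

# kv_bytes_per_token LAYERS KV_HEADS HEAD_DIM DTYPE_BYTES
# Each layer stores one key and one value vector per KV head (the factor 2).
kv_bytes_per_token() {
  local layers=$1 kv_heads=$2 head_dim=$3 dtype_bytes=$4
  echo $(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
}

# kv_mib_per_seq TOKENS BYTES_PER_TOKEN
# Prints the cache size of one sequence in whole MiB.
kv_mib_per_seq() {
  local tokens=$1 bytes_per_token=$2
  echo $(( tokens * bytes_per_token / 1024 / 1024 ))
}

# Assumed: 28 layers, 4 KV heads, head_dim 128, fp16 (2-byte) KV cache.
per_token=$(kv_bytes_per_token 28 4 128 2)
echo "KV cache per token: ${per_token} bytes"
echo "One 32768-token sequence: $(kv_mib_per_seq 32768 "$per_token") MiB"
```

Under these assumptions the cache costs 57344 bytes per token, so one full 32768-token sequence needs 1792 MiB (about 1.75 GiB). Many concurrent full-length sequences would need far more than a single GPU offers, which is why stable serving depends on most requests being much shorter than the context limit.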
bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--max-num-seqs 32

Deployment guides are educational. Each model is subject to its own license; read the official Hugging Face model card before downloading or deploying.