Running Qwen3.6 35B A3B with VLLM on RTX5090

ยท
5 min read
blog
#welcome #blog #meta

I run Qwen 3.6 35B A3B from the new nVIDIA NVFP4 quantization

Get VLLM

Terminal window
git clone https://github.com/vllm-project/vllm.git && cd vllm

Setup venv

Terminal window
uv venv venv-vllm
source venv-vllm/bin/activate
uv pip install -U pip

Install deps

Terminal window
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
python use_existing_torch.py
pip install setuptools-rust setuptools-scm
pip install -r requirements/cuda.txt

Install vllm

Terminal window
pip install --no-build-isolation --editable .

Run VLLM

Terminal window
export MODEL_HOME=$HOME/Models/Qwen3.6-35B-A3B-NVFP4
export MODEL_NAME=Qwen3.6-35B-A3B-NVFP4
export CUDA_HOME=/usr/local/cuda
export FLASHINFER_NVCC="$CUDA_HOME/bin/nvcc"
export FLASHINFER_CUDA_ARCH_LIST="12.0f"
export NVCC_PREPEND_FLAGS="-DCCCL_DISABLE_CTK_COMPATIBILITY_CHECK"
export LIBRARY_PATH="$CUDA_HOME/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib:$LD_LIBRARY_PATH"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=0
vllm serve $MODEL_HOME \
--served-model-name $MODEL_NAME \
--host 0.0.0.0 --port 8082 \
--max-model-len 196608 \
--max-num-seqs 1 \
--max-num-batched-tokens 384 \
--gpu-memory-utilization 0.93 \
--kv-cache-dtype fp8 \
--quantization modelopt \
--async-scheduling \
--enable-chunked-prefill \
--language-model-only \
--skip-mm-profiling \
--no-enable-prefix-caching \
--no-calculate-kv-scales \
--max-cudagraph-capture-size 64 \
--attention-backend flashinfer \
--moe-backend marlin \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--trust-remote-code
## NVIDIA defaults
vllm serve $MODEL_HOME \
--served-model-name $MODEL_NAME \
--host 0.0.0.0 --port 8082 \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype auto \
--quantization modelopt \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--moe-backend marlin \
--gpu-memory-utilization 0.85 \
--max-model-len 65536 \
--max-num-seqs 4 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill \
--async-scheduling \
--enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'
# Gemini version
vllm serve $MODEL_HOME \
--served-model-name $MODEL_NAME \
--host 0.0.0.0 --port 8082 \
--max-model-len 196608 \
--max-num-seqs 1 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.94 \
--kv-cache-dtype fp8 \
--quantization modelopt \
--async-scheduling \
--enable-chunked-prefill \
--language-model-only \
--skip-mm-profiling \
--no-calculate-kv-scales \
--enable-prefix-caching \
--attention-backend flashinfer \
--moe-backend marlin \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--trust-remote-code \
--enforce-eager