Running Qwen3.6 27B with Beellama.cpp on RTX5090

ยท
5 min read
blog
#welcome #blog #meta

I run Qwen 3.6 27B with 80+ tok/sec

Install nVIDIA graphic driver and CUDA beyond this.

Prerequisites

For not shooting yourself in the foot, just install Ubuntu 24.04. I also installed:

  • Driver 570 from nVIDIA website
  • CUDA 13.2 from nVIDIA

Notice the from nVIDIA? Yes, use nVIDIA

Build Beellama.cpp

Configure the build with CUDA.

Terminal window
cmake -B build-5090 -DGGML_CUDA=ON -DGGML_NATIVE=ON \
-DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
-DCMAKE_CUDA_ARCHITECTURES=120 \
-DCMAKE_BUILD_TYPE=Release
cmake --build build-5090 -j

RTX 50xx Blackwell cards have CMAKE_CUDA_ARCHITECTURES=120, refer to other number for other card.

Download the models

We need these files:

  • Qwen3.6-27B-Q5_K_S.gguf: The Unsloth model
  • dflash-draft-3.6-q8_0.gguf: The draft model
  • mmproj-F16.gguf: use this to enable multi-modal, without this, can only do text.

Before download, make an account in Hugging Face. After that, login first:

Terminal window
uvx hf auth login

Download the main model and the draft for DFlash.

Terminal window
## Download model
uvx hf download unsloth/Qwen3.6-27B-GGUF \
--include "Qwen3.6-27B-Q5_K_S.gguf" \
--include "mmproj-F16.gguf" \
--local-dir ~/Models/Qwen3.6-27B
## Draft model
uvx hf download spiritbuun/Qwen3.6-27B-DFlash-GGUF \
--include "dflash-draft-3.6-q8_0.gguf" \
--local-dir ~/Models/Qwen3.6-27B

Make

Terminal window
git clone https://github.com/Anbeeld/beellama.cpp.git && cd beellama.cpp
export CUDA_HOME=/usr/local/cuda
cmake -B build-5090 -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=100
cmake -b build-5090 -j

Make a run script

Create a script in $HOME/.local/bin/run-beellama-5090

#!/bin/sh
## make sure to see only one nVIDIA GPU
export CUDA_VISIBLE_DEVICES=0
export BEELLAMA_RUNTIME=$HOME/Projects/eyay/beellama.cpp/build-5090/bin
export MODEL_HOME=$HOME/Models/Qwen3.6-27B
export MAIN_MODEL=$MODEL_HOME/Qwen3.6-27B-Q5_K_S.gguf
export MMPROJ=$MODEL_HOME/mmproj-F16.gguf
export DRAFT_MODEL=$MODEL_HOME/dflash-draft-3.6-q8_0.gguf
$BEELLAMA_RUNTIME/llama-server \
-m "$MAIN_MODEL" \
--mmproj "$MMPROJ" \
--no-mmproj-offload \
--spec-draft-model "$DRAFT_MODEL" \
--spec-type dflash \
--spec-dflash-cross-ctx 4096 \
--host 0.0.0.0 \
--port 8082 \
-np 1 \
--image-min-tokens 1024 \
--kv-unified \
-ngl all \
--spec-draft-ngl all \
-b 2048 -ub 512 \
--ctx-size 122800 \
--cache-type-k turbo4 --cache-type-v turbo3_tcq \
--flash-attn on \
--cache-ram 0 \
--jinja \
--no-mmap --mlock \
--no-host --metrics \
--log-timestamps --log-prefix -lv 1 --log-colors off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.6 --top-k 20 --min-p 0.0

Make a service

Create a new systemd service in $HOME/.config/systemd/user/llama-server.service

[Unit]
Description=llama.cpp server
After=default.target
[Service]
Type=simple
ExecStart=$HOME/.local/bin/run-beellama-5090
WorkingDirectory=$HOME/Projects/eyay/beellama.cpp/
# This prevents systemd from stopping if it crashes too often
StartLimitIntervalSec=0
# Keep restarting forever
Restart=always
RestartSec=1
# Useful for debugging crashes
Environment=PYTHONUNBUFFERED=1
[Install]
WantedBy=default.target

Install the service.

Terminal window
systemctl daemon-reload --user
systemctl enable llama-server.service --user --now

With --now we also start the service.

Agent Settings

Various coding agent

Pi Coding

$HOME/.pi/agent/models.json
{
"providers": {
"pc-eyay": {
"baseUrl": "http://localhost:8082/v1",
"api": "openai-completions",
"apiKey": "none",
"models": [
{
"id": "Qwen3.6-27B-Q5_K_S.gguf",
"name": "Qwen3.6-27B-Q5_K_S",
"contextWindow": 122800,
"input": [
"text",
"image"
]
}
]
}
}
}

References:

PS: Zed editor seems cannot handle tool calls from Qwen3.6 models.