Install nVIDIA graphic driver and CUDA beyond this.

Prerequisites

For not shooting yourself in the foot, just install Ubuntu 24.04. I also installed:

Driver 570 from nVIDIA website
CUDA 13.2 from nVIDIA

Notice the from nVIDIA? Yes, use nVIDIA

Build Beellama.cpp

Configure the build with CUDA.

1
cmake -B build-5090 -DGGML_CUDA=ON -DGGML_NATIVE=ON \
2
   -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
3
   -DCMAKE_CUDA_ARCHITECTURES=120 \
4
   -DCMAKE_BUILD_TYPE=Release
5

6
cmake --build build-5090 -j

RTX 50xx Blackwell cards have CMAKE_CUDA_ARCHITECTURES=120, refer to other number for other card.

Download the models

We need these files:

Qwen3.6-27B-Q5_K_S.gguf: The Unsloth model
dflash-draft-3.6-q8_0.gguf: The draft model
mmproj-F16.gguf: use this to enable multi-modal, without this, can only do text.

Before download, make an account in Hugging Face. After that, login first:

1
uvx hf auth login

Download the main model and the draft for DFlash.

1
## Download model
2
uvx hf download unsloth/Qwen3.6-27B-GGUF \
3
   --include "Qwen3.6-27B-Q5_K_S.gguf" \
4
   --include "mmproj-F16.gguf" \
5
   --local-dir ~/Models/Qwen3.6-27B
6

7
## Draft model
8
uvx hf download spiritbuun/Qwen3.6-27B-DFlash-GGUF \
9
   --include "dflash-draft-3.6-q8_0.gguf" \
10
   --local-dir ~/Models/Qwen3.6-27B

Make

1
git clone https://github.com/Anbeeld/beellama.cpp.git && cd beellama.cpp
2

3
export CUDA_HOME=/usr/local/cuda
4
cmake -B build-5090 -DGGML_CUDA=ON   -DGGML_NATIVE=ON   -DGGML_CUDA_FA=ON   -DGGML_CUDA_FA_ALL_QUANTS=ON   -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=100
5
cmake -b build-5090 -j

Make a run script

Create a script in $HOME/.local/bin/run-beellama-5090

1
#!/bin/sh
2

3
## make sure to see only one nVIDIA GPU
4
export CUDA_VISIBLE_DEVICES=0
5

6
export BEELLAMA_RUNTIME=$HOME/Projects/eyay/beellama.cpp/build-5090/bin
7
export MODEL_HOME=$HOME/Models/Qwen3.6-27B
8

9
export MAIN_MODEL=$MODEL_HOME/Qwen3.6-27B-Q5_K_S.gguf
10
export MMPROJ=$MODEL_HOME/mmproj-F16.gguf
11
export DRAFT_MODEL=$MODEL_HOME/dflash-draft-3.6-q8_0.gguf
12

13
$BEELLAMA_RUNTIME/llama-server \
14
  -m "$MAIN_MODEL" \
15
  --mmproj "$MMPROJ" \
16
  --no-mmproj-offload \
17
  --spec-draft-model "$DRAFT_MODEL" \
18
  --spec-type dflash \
19
  --spec-dflash-cross-ctx 4096 \
20
  --host 0.0.0.0 \
21
  --port 8082 \
22
  -np 1 \
23
  --image-min-tokens 1024 \
24
  --kv-unified \
25
  -ngl all \
26
  --spec-draft-ngl all \
27
  -b 2048 -ub 512 \
28
  --ctx-size 122800 \
29
  --cache-type-k turbo4 --cache-type-v turbo3_tcq \
30
  --flash-attn on \
31
  --cache-ram 0 \
32
  --jinja \
33
  --no-mmap --mlock \
34
  --no-host --metrics \
35
  --log-timestamps --log-prefix -lv 1  --log-colors off \
36
  --reasoning on \
37
  --chat-template-kwargs '{"preserve_thinking":true}' \
38
  --temp 0.6 --top-k 20 --min-p 0.0

Make a service

Create a new systemd service in $HOME/.config/systemd/user/llama-server.service

1
[Unit]
2
Description=llama.cpp server
3
After=default.target
4

5
[Service]
6
Type=simple
7
ExecStart=$HOME/.local/bin/run-beellama-5090
8
WorkingDirectory=$HOME/Projects/eyay/beellama.cpp/
9
# This prevents systemd from stopping if it crashes too often
10
StartLimitIntervalSec=0
11
# Keep restarting forever
12
Restart=always
13
RestartSec=1
14
# Useful for debugging crashes
15
Environment=PYTHONUNBUFFERED=1
16

17
[Install]
18
WantedBy=default.target

Install the service.

1
systemctl daemon-reload --user
2
systemctl enable llama-server.service --user --now

With --now we also start the service.

Agent Settings

Various coding agent

Pi Coding

1
{
2
  "providers": {
3
    "pc-eyay": {
4
      "baseUrl": "http://localhost:8082/v1",
5
      "api": "openai-completions",
6
      "apiKey": "none",
7
      "models": [
8
        {
9
          "id": "Qwen3.6-27B-Q5_K_S.gguf",
10
          "name": "Qwen3.6-27B-Q5_K_S",
11
          "contextWindow": 122800,
12
          "input": [
13
            "text",
14
            "image"
15
          ]
16
        }
17
      ]
18
    }
19
  }
20
}

References:

Beellama.cpp

PS: Zed editor seems cannot handle tool calls from Qwen3.6 models.

Running Qwen3.6 27B with Beellama.cpp on RTX5090

Prerequisites

Build Beellama.cpp

Download the models

Make

Make a run script

Make a service

Agent Settings

Pi Coding

References:

Related Articles

Running Qwen3.6 35B A3B with VLLM on RTX5090

Running Qwen3.6 27B with Llama.cpp on RTX5090

Malloc in ArchLinux