Running Qwen3.6 27B with Llama.cpp on RTX5090

The official llama.cpp just got a new feature where you can use ngram with mtp. However, this is my most stable runs of llama.cpp with Qwen 3.6 27B on RTX 5090. However, when using only ngram optimization, the configuration never breaks.

Well, may be a little it would stops from time to time when you are in thinking mode running tool call. But, that’s a very random rare occurance. And in Pi just type continue and be done with it.

Prerequisites

For not shooting yourself in the foot, just install Ubuntu 24.04. I also installed:

Driver 570 from nVIDIA website
CUDA 13.2 from nVIDIA

Notice the from nVIDIA? Yes, use nVIDIA drivers and CUDA from their website.

Download the models

We need these files:

Qwen3.6-27B-Q5_K_S.gguf: The Unsloth model
mmproj-F16.gguf: use this to enable multi-modal, without this, can only do text.

Before download, make an account in Hugging Face. After that, login first:

1
uvx hf auth login

Download the main model.

1
## Download model
2
uvx hf download unsloth/Qwen3.6-27B-GGUF \
3
   --include "Qwen3.6-27B-Q5_K_S.gguf" \
4
   --include "mmproj-F16.gguf" \
5
   --local-dir ~/Models/Qwen3.6-27B

Install llama.cpp

Clone the repository and build it with CUDA support.

1
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
2

3
export CUDA_HOME=/usr/local/cuda
4

5
cmake -B build-5090 -DGGML_CUDA=ON \
6
  -DGGML_NATIVE=ON \
7
  -DGGML_CUDA_FA=ON \
8
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
9
  -DCMAKE_BUILD_TYPE=Release \
10
  -DCMAKE_CUDA_ARCHITECTURES=100
11

12
cmake -b build-5090 --config Release -j

RTX 50xx Blackwell cards have CMAKE_CUDA_ARCHITECTURES=100, refer to other number for other card.

Make a run script

Create a script in $HOME/.local/bin/run-qwen3.6-27b-llama-5090

1
#!/bin/sh
2

3

4
export CUDA_VISIBLE_DEVICES=0
5

6
export LLAMA_CPP_HOME=$HOME/Projects/eyay/llama.cpp
7
export MODEL_HOME=/home/spidey/Models/Qwen3.6-27B
8

9
export MAIN_MODEL=$MODEL_HOME/Qwen3.6-27B-Q5_K_S.gguf
10
export MMPROJ_MODEL=$MODEL_HOME/mmproj-F16.gguf
11

12
$LLAMA_CPP_HOME/build-5090-v3/bin/llama-server \
13
  -m "$MAIN_MODEL" \
14
  --mmproj "$MMPROJ_MODEL" \
15
  --no-mmproj-offload \
16
  --image-min-tokens 1024 \
17
  -ctk q8_0 -ctv q8_0 \
18
  -fa on \
19
  --kv-unified \
20
  --host 0.0.0.0 --port 8082 \
21
  -ngl all \
22
  -np 1 \
23
  --spec-type ngram-mod \
24
  --no-mmap --mlock --jinja \
25
  --no-host --metrics \
26
  --log-timestamps --log-prefix -lv 3 \
27
  --no-context-shift \
28
  --reasoning on \
29
  --chat-template-kwargs '{"preserve_thinking":true}' \
30
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0  --presence-penalty 0.0 --repeat-penalty 1.0

Here’s what the parameters told you:

Model Parameters

Main model:

-m "Qwen3.6-27B-Q5_K_S.gguf": loads the main model from the specified path. I use Q5_K_S which is suggested by Beellama team for quality coding experience.

Multimodal projection model:

--mmproj "mmproj-F16.gguf": loads the multimodal projection model from the specified path.
--no-mmproj-offload: disables offloading of the multimodal projection model.
--image-min-tokens 1024: sets the minimum number of tokens for image processing. This is required by Llama.cpp to run with multimodal with this model.

Model optimizations:

-ctk q8_0 -ctv q8_0: sets the quantization level to Q8_0 for both key and value tensors. If you want to have more space, be sure to have the context for key (ctk) have Q8_0 quantization and play with value quantization. In other forks that enable Turboquants, the ctv can even have turbo2 quantization (2bits quantization).
-fa on: enables flash attention. The very first parameter available for optimizing attention computation.
--no-context-shift: Shifting context is broken and you will get repeating answer on a long context.

Drafting optimizations:

--spec-type ngram-mod: Enables ngram-mod drafting optimization.

Runtime optimizations:

--no-mmap: disables swapping model from NVRAM to system memory.
--mlock: locks the memory to prevent swapping to disk.
--no-host: Disable buffering on system RAM; all on GPU.
-ngl 999: Load all layers of the model into GPU memory.
-np 1: Tell llama.cpp to use single-threaded processing. Don’t bother with multi GPU setup.

Qwen recommends:

--jinja: enables jinja template support.
--reasoning-on: enables reasoning capabilities.
--chat-template-kwargs '{"preserve_thinking":true}': sets the chat template to preserve thinking.
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0: these are the values that is suggested by Qwen team for running Qwen3.6-27B for coding tasks.

Make a service

Create a new systemd service in $HOME/.config/systemd/user/llama-server.service

1
[Unit]
2
Description=llama.cpp server
3
After=default.target
4

5
[Service]
6
Type=simple
7
ExecStart=$HOME/.local/bin/run-qwen3.6-27b-llama-5090
8
WorkingDirectory=$HOME/.local/bin/run-qwen3.6-27b-llama-5090
9
# This prevents systemd from stopping if it crashes too often
10
StartLimitIntervalSec=0
11
# Keep restarting forever
12
Restart=always
13
RestartSec=1
14
# Useful for debugging crashes
15
Environment=PYTHONUNBUFFERED=1
16

17
[Install]
18
WantedBy=default.target

Install the service.

1
systemctl daemon-reload --user
2
systemctl enable llama-server.service --user --now

With --now we also start the service.

Agent Settings

Various coding agent

Pi Coding

1
{
2
  "providers": {
3
    "pc-eyay": {
4
      "baseUrl": "http://localhost:8082/v1",
5
      "api": "openai-completions",
6
      "apiKey": "none",
7
      "models": [
8
        {
9
          "id": "Qwen3.6-27B-Q5_K_S.gguf",
10
          "name": "Qwen3.6-27B-Q5_K_S",
11
          "contextWindow": 122800,
12
          "input": [
13
            "text",
14
            "image"
15
          ]
16
        }
17
      ]
18
    }
19
  }
20
}

PS: Zed editor seems cannot handle tool calls from Qwen3.6 models.