The official llama.cpp just got a new feature where you can use ngram with mtp. However, this is my most stable runs of llama.cpp with Qwen 3.6 27B on RTX 5090. However, when using only ngram optimization, the configuration never breaks.
Well, may be a little it would stops from time to time when you are in thinking mode running tool call. But, that’s a very random rare occurance. And in Pi just type continue and be done with it.
Prerequisites
For not shooting yourself in the foot, just install Ubuntu 24.04. I also installed:
- Driver 570 from nVIDIA website
- CUDA 13.2 from nVIDIA
Notice the from nVIDIA? Yes, use nVIDIA drivers and CUDA from their website.
Download the models
We need these files:
- Qwen3.6-27B-Q5_K_S.gguf: The Unsloth model
- mmproj-F16.gguf: use this to enable multi-modal, without this, can only do text.
Before download, make an account in Hugging Face. After that, login first:
uvx hf auth loginDownload the main model.
## Download modeluvx hf download unsloth/Qwen3.6-27B-GGUF \ --include "Qwen3.6-27B-Q5_K_S.gguf" \ --include "mmproj-F16.gguf" \ --local-dir ~/Models/Qwen3.6-27BInstall llama.cpp
Clone the repository and build it with CUDA support.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
export CUDA_HOME=/usr/local/cuda
cmake -B build-5090 -DGGML_CUDA=ON \ -DGGML_NATIVE=ON \ -DGGML_CUDA_FA=ON \ -DGGML_CUDA_FA_ALL_QUANTS=ON \ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CUDA_ARCHITECTURES=100
cmake -b build-5090 --config Release -jRTX 50xx Blackwell cards have CMAKE_CUDA_ARCHITECTURES=100, refer to other number for other card.
Make a run script
Create a script in $HOME/.local/bin/run-qwen3.6-27b-llama-5090
#!/bin/sh
export CUDA_VISIBLE_DEVICES=0
export LLAMA_CPP_HOME=$HOME/Projects/eyay/llama.cppexport MODEL_HOME=/home/spidey/Models/Qwen3.6-27B
export MAIN_MODEL=$MODEL_HOME/Qwen3.6-27B-Q5_K_S.ggufexport MMPROJ_MODEL=$MODEL_HOME/mmproj-F16.gguf
$LLAMA_CPP_HOME/build-5090-v3/bin/llama-server \ -m "$MAIN_MODEL" \ --mmproj "$MMPROJ_MODEL" \ --no-mmproj-offload \ --image-min-tokens 1024 \ -ctk q8_0 -ctv q8_0 \ -fa on \ --kv-unified \ --host 0.0.0.0 --port 8082 \ -ngl all \ -np 1 \ --spec-type ngram-mod \ --no-mmap --mlock --jinja \ --no-host --metrics \ --log-timestamps --log-prefix -lv 3 \ --no-context-shift \ --reasoning on \ --chat-template-kwargs '{"preserve_thinking":true}' \ --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0Here’s what the parameters told you:
Model Parameters
Main model:
-m "Qwen3.6-27B-Q5_K_S.gguf": loads the main model from the specified path. I use Q5_K_S which is suggested by Beellama team for quality coding experience.
Multimodal projection model:
--mmproj "mmproj-F16.gguf": loads the multimodal projection model from the specified path.--no-mmproj-offload: disables offloading of the multimodal projection model.--image-min-tokens 1024: sets the minimum number of tokens for image processing. This is required by Llama.cpp to run with multimodal with this model.
Model optimizations:
-ctk q8_0 -ctv q8_0: sets the quantization level to Q8_0 for both key and value tensors. If you want to have more space, be sure to have the context for key (ctk) have Q8_0 quantization and play with value quantization. In other forks that enable Turboquants, thectvcan even have turbo2 quantization (2bits quantization).-fa on: enables flash attention. The very first parameter available for optimizing attention computation.--no-context-shift: Shifting context is broken and you will get repeating answer on a long context.
Drafting optimizations:
--spec-type ngram-mod: Enables ngram-mod drafting optimization.
Runtime optimizations:
--no-mmap: disables swapping model from NVRAM to system memory.--mlock: locks the memory to prevent swapping to disk.--no-host: Disable buffering on system RAM; all on GPU.-ngl 999: Load all layers of the model into GPU memory.-np 1: Tell llama.cpp to use single-threaded processing. Don’t bother with multi GPU setup.
Qwen recommends:
--jinja: enables jinja template support.--reasoning-on: enables reasoning capabilities.--chat-template-kwargs '{"preserve_thinking":true}': sets the chat template to preserve thinking.--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0: these are the values that is suggested by Qwen team for running Qwen3.6-27B for coding tasks.
Make a service
Create a new systemd service in $HOME/.config/systemd/user/llama-server.service
[Unit]Description=llama.cpp serverAfter=default.target
[Service]Type=simpleExecStart=$HOME/.local/bin/run-qwen3.6-27b-llama-5090WorkingDirectory=$HOME/.local/bin/run-qwen3.6-27b-llama-5090# This prevents systemd from stopping if it crashes too oftenStartLimitIntervalSec=0# Keep restarting foreverRestart=alwaysRestartSec=1# Useful for debugging crashesEnvironment=PYTHONUNBUFFERED=1
[Install]WantedBy=default.targetInstall the service.
systemctl daemon-reload --usersystemctl enable llama-server.service --user --nowWith --now we also start the service.
Agent Settings
Various coding agent
Pi Coding
{ "providers": { "pc-eyay": { "baseUrl": "http://localhost:8082/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "Qwen3.6-27B-Q5_K_S.gguf", "name": "Qwen3.6-27B-Q5_K_S", "contextWindow": 122800, "input": [ "text", "image" ] } ] } }}PS: Zed editor seems cannot handle tool calls from Qwen3.6 models.

