Install nVIDIA graphic driver and CUDA beyond this.
Prerequisites
For not shooting yourself in the foot, just install Ubuntu 24.04. I also installed:
- Driver 570 from nVIDIA website
- CUDA 13.2 from nVIDIA
Notice the from nVIDIA? Yes, use nVIDIA
Build Beellama.cpp
Configure the build with CUDA.
cmake -B build-5090 -DGGML_CUDA=ON -DGGML_NATIVE=ON \ -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \ -DCMAKE_CUDA_ARCHITECTURES=120 \ -DCMAKE_BUILD_TYPE=Release
cmake --build build-5090 -jRTX 50xx Blackwell cards have CMAKE_CUDA_ARCHITECTURES=120, refer to other number for other card.
Download the models
We need these files:
- Qwen3.6-27B-Q5_K_S.gguf: The Unsloth model
- dflash-draft-3.6-q8_0.gguf: The draft model
- mmproj-F16.gguf: use this to enable multi-modal, without this, can only do text.
Before download, make an account in Hugging Face. After that, login first:
uvx hf auth loginDownload the main model and the draft for DFlash.
## Download modeluvx hf download unsloth/Qwen3.6-27B-GGUF \ --include "Qwen3.6-27B-Q5_K_S.gguf" \ --include "mmproj-F16.gguf" \ --local-dir ~/Models/Qwen3.6-27B
## Draft modeluvx hf download spiritbuun/Qwen3.6-27B-DFlash-GGUF \ --include "dflash-draft-3.6-q8_0.gguf" \ --local-dir ~/Models/Qwen3.6-27BMake
git clone https://github.com/Anbeeld/beellama.cpp.git && cd beellama.cpp
export CUDA_HOME=/usr/local/cudacmake -B build-5090 -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=100cmake -b build-5090 -jMake a run script
Create a script in $HOME/.local/bin/run-beellama-5090
#!/bin/sh
## make sure to see only one nVIDIA GPUexport CUDA_VISIBLE_DEVICES=0
export BEELLAMA_RUNTIME=$HOME/Projects/eyay/beellama.cpp/build-5090/binexport MODEL_HOME=$HOME/Models/Qwen3.6-27B
export MAIN_MODEL=$MODEL_HOME/Qwen3.6-27B-Q5_K_S.ggufexport MMPROJ=$MODEL_HOME/mmproj-F16.ggufexport DRAFT_MODEL=$MODEL_HOME/dflash-draft-3.6-q8_0.gguf
$BEELLAMA_RUNTIME/llama-server \ -m "$MAIN_MODEL" \ --mmproj "$MMPROJ" \ --no-mmproj-offload \ --spec-draft-model "$DRAFT_MODEL" \ --spec-type dflash \ --spec-dflash-cross-ctx 4096 \ --host 0.0.0.0 \ --port 8082 \ -np 1 \ --image-min-tokens 1024 \ --kv-unified \ -ngl all \ --spec-draft-ngl all \ -b 2048 -ub 512 \ --ctx-size 122800 \ --cache-type-k turbo4 --cache-type-v turbo3_tcq \ --flash-attn on \ --cache-ram 0 \ --jinja \ --no-mmap --mlock \ --no-host --metrics \ --log-timestamps --log-prefix -lv 1 --log-colors off \ --reasoning on \ --chat-template-kwargs '{"preserve_thinking":true}' \ --temp 0.6 --top-k 20 --min-p 0.0Make a service
Create a new systemd service in $HOME/.config/systemd/user/llama-server.service
[Unit]Description=llama.cpp serverAfter=default.target
[Service]Type=simpleExecStart=$HOME/.local/bin/run-beellama-5090WorkingDirectory=$HOME/Projects/eyay/beellama.cpp/# This prevents systemd from stopping if it crashes too oftenStartLimitIntervalSec=0# Keep restarting foreverRestart=alwaysRestartSec=1# Useful for debugging crashesEnvironment=PYTHONUNBUFFERED=1
[Install]WantedBy=default.targetInstall the service.
systemctl daemon-reload --usersystemctl enable llama-server.service --user --nowWith --now we also start the service.
Agent Settings
Various coding agent
Pi Coding
{ "providers": { "pc-eyay": { "baseUrl": "http://localhost:8082/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "Qwen3.6-27B-Q5_K_S.gguf", "name": "Qwen3.6-27B-Q5_K_S", "contextWindow": 122800, "input": [ "text", "image" ] } ] } }}References:
PS: Zed editor seems cannot handle tool calls from Qwen3.6 models.

