llama.cpp avec Vulkan

farias

Installation de SDK Vulkan :

# wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
# sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list http://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
# sudo apt update
# sudo apt install vulkan-sdk

Mais erreur :

# sudo apt install vulkan-sdk
Lecture des listes de paquets... Fait
Construction de l'arbre des dépendances... Fait
Lecture des informations d'état... Fait      
Certains paquets ne peuvent être installés. Ceci peut signifier
que vous avez demandé l'impossible, ou bien, si vous utilisez
la distribution unstable, que certains paquets n'ont pas encore
été créés ou ne sont pas sortis d'Incoming.
L'information suivante devrait vous aider à résoudre la situation : 

Les paquets suivants contiennent des dépendances non satisfaites :
 crashdiagnosticlayer : Dépend: libyaml-cpp0.7 (>= 0.7.0) mais il n'est pas installable
E: Impossible de corriger les problèmes, des paquets défectueux sont en mode « garder en l'état ».

farias

La boulette j’ai pas pris la bonne version… on recommance :

rm  /etc/apt/sources.list.d/lunarg-vulkan-jammy.list 
wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-noble.list http://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list
sudo apt update
sudo apt install vulkan-sdk

farias

Nouveau build :

#  cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- Found Vulkan: /usr/lib/x86_64-linux-gnu/libvulkan.so (found version "1.4.313") found components: glslc glslangValidator 
-- Vulkan found
-- GL_KHR_cooperative_matrix supported by glslc
-- GL_NV_cooperative_matrix2 supported by glslc
-- GL_NV_cooperative_matrix_decode_vector not supported by glslc
-- GL_EXT_integer_dot_product supported by glslc
-- GL_EXT_bfloat16 supported by glslc
-- Including Vulkan backend
-- ggml version: 0.15.2
-- ggml commit:  5fd2dc2c4
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "3.0.13")  
-- Performing Test OPENSSL_VERSION_SUPPORTED
-- Performing Test OPENSSL_VERSION_SUPPORTED - Success
-- OpenSSL found: 3.0.13
-- Generating embedded license file for target: llama-app
-- Configuring done (5.0s)
-- Generating done (0.6s)

farias

La commande pour le build :

# cmake --build build --config Release -j

farias

Petit test :

# make install
# ldconfig -v
#  llama-bench -m  /models/qwen2.5-1.5b-instruct-q4_k_m.gguf
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Quadro M5000 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Quadro M4000 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 1.5B Q4_K - Medium       | 934.69 MiB |     1.54 B | Vulkan     |  -1 |           pp512 |         53.48 ± 0.42 |
| qwen2 1.5B Q4_K - Medium       | 934.69 MiB |     1.54 B | Vulkan     |  -1 |           tg128 |         63.55 ± 0.73 |

build: 5fd2dc2c4 (9721)

farias

Arret de openwebui :

# systemctl stop openwebui
# systemctl disable openwebui
Removed "/etc/systemd/system/multi-user.target.wants/openwebui.service".

Arret de ollama :

# systemctl stop ollama
# systemctl disable ollama
Removed "/etc/systemd/system/default.target.wants/ollama.service".

farias

Test en ligne de commande :

# llama-server -m /models/qwen2.5-1.5b-instruct-q4_k_m.gguf --host 0.0.0.0

farias

Mon fichier service :

# systemctl status llama-server
● llama-server.service - Llama Server
     Loaded: loaded (/etc/systemd/system/llama-server.service; disabled; preset: enabled)
     Active: active (running) since Fri 2026-06-19 17:27:42 UTC; 29s ago
   Main PID: 37413 (llama-server)
      Tasks: 41 (limit: 94224)
     Memory: 91.7M (peak: 91.7M)
        CPU: 3.103s
     CGroup: /system.slice/llama-server.service
             └─37413 /usr/local/bin/llama-server --model /models/qwen2.5-1.5b-instruct-q4_k_m.gguf --host 0.0.0.0 --port 8080

juin 19 17:27:42 jellyfin systemd[1]: Started llama-server.service - Llama Server.
root@jellyfin:/home/arias/llama.cpp/build# cat /etc/systemd/system/llama-server.service
[Unit]
Description=Llama Server
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/home/XXXX/llama.cpp
Environment="NVM_BIN=/root/.nvm/versions/node/v26.3.1/bin"
Environment="LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
Environment="VULKAN_VERSION=1.4.350.1"
ExecStart=/usr/local/bin/llama-server \
  --model /models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080
Restart=on-failure
RestartSec=5s
StandardOutput=file:/tmp/llama-server.stdout.log
StandardError=file:/tmp/llama-server.stderr.log

[Install]
WantedBy=multi-user.target

farias

Le meilleur modèle semble être https://huggingface.co/Qwen/Qwen3.5-2B pour mes cartes.

farias

https://huggingface.co/unsloth/Qwen3.5-2B-GGUF/resolve/main/Qwen3.5-2B-Q4_0.gguf?download=true

NodeBB

llama.cpp avec Vulkan