llama.cpp benchmark on Tuxedo 17
-
My CUDA version: 12.6
$ nvidia-smi
Wed Jun 17 17:13:28 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   49C    P8             12W /  115W |     803MiB /   6144MiB |     29%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
CUDA update:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-3
-
CUDA update: 12.6 => 13.2 (the version reported by nvidia-smi; the installed toolkit, cuda-toolkit-13-3, is 13.3)
NVIDIA driver update: 560.35.03 (21 August 2024: https://www.nvidia.com/en-us/drivers/details/230918/) => 595.71.05
(base) root@tuxedo17:/home/arias# nvidia-smi
Thu Jun 18 09:37:40 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   42C    P0            752W /  115W |      15MiB /   6144MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
-
Build llama.cpp for CUDA:
# cd llama.cpp
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
# export PATH=$PATH:$CUDA_HOME/bin
# cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=`which nvcc` -DCMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- CUDA Toolkit found
-- The CUDA compiler identification is NVIDIA 13.3.33
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CMAKE_CUDA_ARCHITECTURES=75-virtual;80-virtual;86-real;89-real;90-virtual;120a-real;121a-real CMAKE_CUDA_ARCHITECTURES_NATIVE=
-- Could NOT find NCCL (missing: NCCL_LIBRARY NCCL_INCLUDE_DIR)
-- Warning: NCCL not found, performance for multiple CUDA GPUs will be suboptimal
-- CUDA host compiler is GNU 11.4.0
-- Including CUDA backend
-- ggml version: 0.15.1
-- ggml commit: b4024af6c
-- OpenSSL found: 3.0.2
-- Generating embedded license file for target: llama-app
-- Configuring done
-- Generating done
-- Build files have been written to: /home/arias/GIT/llama.cpp/build
# cmake --build build --config Release -j 20
-
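The configure log above shows the CUDA backend being compiled for seven architectures, while the only GPU in this machine is an RTX 3060 Laptop (compute capability 8.6 according to the llama-bench output in this thread). One possible shortcut, untested here, is to restrict the build to the local architecture through the standard CMake variable `CMAKE_CUDA_ARCHITECTURES`. The helper names `cc_to_arch` and `build_for_local_gpu` are made up for this sketch, and it assumes a driver recent enough to support `nvidia-smi --query-gpu=compute_cap`.

```shell
# Turn a compute capability such as "8.6" into the CMake architecture value "86".
cc_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Configure and build llama.cpp for the first GPU only (hypothetical helper).
# Assumes it is run from the llama.cpp checkout and that nvcc is on PATH.
build_for_local_gpu() {
    cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    arch=$(cc_to_arch "$cc")
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$arch" -DCMAKE_BUILD_TYPE=Release &&
        cmake --build build --config Release -j "$(nproc)"
}

# Usage (not run automatically): build_for_local_gpu
```

The trade-off is that the resulting binaries only target that one GPU generation, which is fine for a single laptop but not for a build meant to be shared.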
New test:
# llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--gemma-3-1b-it-GGUF/snapshots/f9c28bcd85737ffc5aef028638d3341d49869c27/gemma-3-1b-it-Q4_K_M.gguf
ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 1B Q4_K - Medium        | 762.49 MiB |   999.89 M | CUDA       |  -1 |           pp512 |        303.04 ± 1.63 |
| gemma3 1B Q4_K - Medium        | 762.49 MiB |   999.89 M | CUDA       |  -1 |           tg128 |         46.90 ± 1.21 |

build: b4024af6c (9687)
-
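Note the first line of that output: CUDA failed to initialize, yet llama-bench still printed results with `CUDA` in the backend column, so these numbers were measured without the GPU. A small guard like the sketch below makes that failure harder to miss. The function name `cuda_init_ok` is hypothetical, and the check only looks for the message text shown above, which may change between llama.cpp versions.

```shell
# Succeeds only if the captured llama.cpp output does not contain the
# CUDA initialization failure message seen above. Reads the output on stdin.
cuda_init_ok() {
    ! grep -q 'ggml_cuda_init: failed to initialize CUDA'
}

# Usage (MODEL is a placeholder for the path to a .gguf file):
#   llama-bench -m "$MODEL" 2>&1 | tee bench.log
#   cuda_init_ok < bench.log || echo "CUDA was not used, results are CPU-only" >&2
```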
Version:
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Fri_Apr_24_07:22:02_PM_PDT_2026
Cuda compilation tools, release 13.3, V13.3.33
Build cuda_13.3.r13.3/compiler.37862127_0
-
Ouch… nvidia-smi now reports a driver/library version mismatch, so I purge and reinstall the NVIDIA packages:
# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 560.35
# sudo apt-get purge nvidia-*
# ubuntu-drivers install
# nvidia-smi
Thu Jun 18 10:26:10 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   40C    P0            752W /  115W |      15MiB /   6144MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            5003      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+
# llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--gemma-3-1b-it-GGUF/snapshots/f9c28bcd85737ffc5aef028638d3341d49869c27/gemma-3-1b-it-Q4_K_M.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 5806 MiB):
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes, VRAM: 5806 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 1B Q4_K - Medium        | 762.49 MiB |   999.89 M | CUDA       |  -1 |           pp512 |    10984.34 ± 550.24 |
| gemma3 1B Q4_K - Medium        | 762.49 MiB |   999.89 M | CUDA       |  -1 |           tg128 |        225.56 ± 0.57 |

build: b4024af6c (9687)
-
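The `Driver/library version mismatch` error above generally means that the NVIDIA kernel module currently loaded and the userspace libraries on disk come from different driver versions, which is common right after a driver upgrade. Before purging packages, a comparison like the sketch below can confirm it. Assumptions: `/proc/driver/nvidia/version` exists, the NVML library lives in the Ubuntu x86_64 directory used here, and the helper names are invented for this example.

```shell
# Print the first version-like token (e.g. 560.35.03) found in the given text.
extract_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Compare the loaded kernel module with the NVML library found on disk.
# The library directory is an assumption (Ubuntu x86_64 layout).
check_nvidia_versions() {
    kernel=$(extract_version "$(grep NVRM /proc/driver/nvidia/version 2>/dev/null)")
    library=$(extract_version "$(ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*.* 2>/dev/null)")
    echo "kernel module: ${kernel:-unknown}, NVML library: ${library:-unknown}"
    [ -n "$kernel" ] && [ "$kernel" = "$library" ]
}

# Usage (not run automatically): check_nvidia_versions || echo "versions differ"
```

If the two versions differ, a reboot (or reloading the kernel module) is often enough and is worth trying before a full purge and reinstall.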
So, going from CPU (the first run, where CUDA failed to initialize) to CUDA:
pp512: 303.04 t/s => 10984.34 t/s (about x36)
tg128: 46.90 t/s => 225.56 t/s (about x5)
-
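The two ratios can be reproduced from the t/s columns of the benchmark tables above. A minimal sketch using awk (the `speedup` helper is only an illustration):

```shell
# Print after/before with one decimal, forcing the C locale so the decimal
# separator is a dot regardless of the system language.
speedup() {
    LC_ALL=C awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", after / before }'
}

speedup 303.04 10984.34   # pp512 -> 36.2
speedup 46.90 225.56      # tg128 -> 4.8
```

Prompt processing gains far more than token generation: roughly 36x against a little under 5x.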
Models downloaded from: https://huggingface.co/unsloth
-
Testing the model: Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
# llama-server -m /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
0.01.777.599 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.01.777.603 I device_info:
0.01.868.348 I - CUDA0 : NVIDIA GeForce RTX 3060 Laptop GPU (5806 MiB, 5674 MiB free)
0.01.868.364 I - CPU : 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (64044 MiB, 64044 MiB free)
0.01.868.506 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 750,800,860,890,900,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.01.868.514 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.01.868.566 I srv init: running without SSL
0.01.868.646 I srv init: using 15 threads for HTTP server
0.01.869.153 I srv start: binding port with default address family
0.01.870.425 I srv llama_server: loading model
0.01.870.434 I srv load_model: loading model '/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf'
-
New test:
# curl -L --fail -o /models/qwen2.5-1.5b-instruct-q4_k_m.gguf https://huggingface.co/bartowski/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
# llama-server -m /models/qwen2.5-1.5b-instruct-q4_k_m.gguf --host "0.0.0.0" --port "8080" --ctx-size "4096" --threads 8 --alias qwen2.5-1.5b
http://127.0.0.1:8080/health => OK.
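llama-server answers `/health` with an error status while the model is still loading, so a script that starts the server usually has to wait before sending requests. A sketch of such a wait loop, assuming `curl` is installed; `wait_for_health` is a hypothetical helper and the 30-attempt default is arbitrary:

```shell
# Poll a URL once per second until it returns an HTTP success status.
# $1 = URL, $2 = maximum number of attempts (default 30).
wait_for_health() {
    url=$1
    tries=${2:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors such as 503
        if curl -fsS "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Usage: wait_for_health http://127.0.0.1:8080/health && echo "server ready"
```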
-
Test:
$ curl -fsS http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen2.5-1.5b",
      "messages": [
        {"role": "system", "content": "Tu es un assistant DevOps qui répond en français en une phrase."},
        {"role": "user", "content": "Quelle commande systemd affiche les services actifs ?"}
      ],
      "temperature": 0.2,
      "max_tokens": 80
    }' | jq
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "systemctl list-units --active"
      }
    }
  ],
  "created": 1781775300,
  "model": "qwen2.5-1.5b",
  "system_fingerprint": "b9687-b4024af6c",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 8,
    "prompt_tokens": 38,
    "total_tokens": 46,
    "prompt_tokens_details": {
      "cached_tokens": 37
    }
  },
  "id": "chatcmpl-mXP4t0noHYBKTaBSnZfxFdoTycVt42zC",
  "timings": {
    "cache_n": 37,
    "prompt_n": 1,
    "prompt_ms": 21.161,
    "prompt_per_token_ms": 21.161,
    "prompt_per_second": 47.25674590047729,
    "predicted_n": 8,
    "predicted_ms": 40.424,
    "predicted_per_token_ms": 5.053,
    "predicted_per_second": 197.90223629527014
  }
}
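The `timings` block deserves a second look: `cache_n` (37) plus `prompt_n` (1) equals the 38 `prompt_tokens`, so almost the whole prompt came from the cache. The 47 t/s prompt speed therefore reflects a single token and is not comparable to the pp512 benchmark. The per-second figures are simply the token count divided by the elapsed time, as this small awk sketch shows (`tokens_per_second` is an illustrative helper):

```shell
# tokens per second = tokens * 1000 / elapsed milliseconds
tokens_per_second() {
    LC_ALL=C awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n * 1000 / ms }'
}

tokens_per_second 8 40.424   # predicted_n, predicted_ms -> 197.90
tokens_per_second 1 21.161   # prompt_n, prompt_ms       -> 47.26
```

Both results match the `predicted_per_second` and `prompt_per_second` fields of the response, and the generation speed is in the same range as the tg128 result measured with llama-bench on the other small model.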