1. Install the Driver
Visit https://www.nvidia.com/en-us/drivers/ and download the driver version that matches your GPU.
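The .run installer builds a kernel module, so the host needs a compiler and headers for the running kernel, and the in-tree nouveau driver must be out of the way. A minimal sketch for a Debian/Ubuntu host (package names and paths are assumptions; adjust for your distribution):

```bash
# Build tools plus headers matching the running kernel (Debian/Ubuntu naming).
apt-get update
apt-get install -y build-essential linux-headers-$(uname -r)

# Blacklist the in-tree nouveau driver so it cannot claim the GPUs,
# then rebuild the initramfs and reboot before running the installer.
cat > /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
update-initramfs -u
```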
```bash
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/580.76.05/NVIDIA-Linux-x86_64-580.76.05.run
```

```bash
bash NVIDIA-Linux-x86_64-580.76.05.run
```
After installation, listing the GPUs should show every card:

```
nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 5090 (UUID: GPU-92fcdc58-4754-73c7-af6c-56740936817d)
GPU 1: NVIDIA GeForce RTX 5090 (UUID: GPU-e05cb455-7dd3-0db5-ac39-70794aa19d4e)
...
```
The topology shows the GPUs grouped in PIX pairs, with GPU0-3 on NUMA node 0 and GPU4-7 on NUMA node 1:

```
nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     NODE    NODE    SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU1    PIX      X      NODE    NODE    SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU2    NODE    NODE     X      PIX     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU3    NODE    NODE    PIX      X      SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    48-95,144-191   1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    48-95,144-191   1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     48-95,144-191   1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      48-95,144-191   1               N/A
```
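Given that topology, it can help to pin an inference process to the NUMA node local to its GPU to avoid cross-socket memory traffic. A hedged sketch with numactl (node numbers come from the table above; `some-inference-command` is a placeholder):

```bash
# GPU0-3 are local to NUMA node 0, GPU4-7 to node 1 (per nvidia-smi topo -m).
# Bind both the CPUs and memory allocations to the GPU's local node.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 some-inference-command
CUDA_VISIBLE_DEVICES=4 numactl --cpunodebind=1 --membind=1 some-inference-command
```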
2. TensorRT-LLM (TLLM)
Start the container; the Aliyun image mirrors the nvcr.io TensorRT-LLM 0.21.0 release image:

```bash
nerdctl run -it \
  -p 8001:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name tllm \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  registry.cn-beijing.aliyuncs.com/opshub/nvcr-io-nvidia-tensorrt-llm-release:0.21.0
```
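A quick sanity check from the host that the running container can see the GPUs:

```bash
nerdctl exec tllm nvidia-smi -L
```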
Inside the container, serve Qwen2.5-7B-Instruct on GPU 0:

```bash
export CUDA_VISIBLE_DEVICES=0
trtllm-serve /data/models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_num_tokens 16384 \
    --kv_cache_free_gpu_memory_fraction 0.95
```
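Weight loading takes a while, so a readiness poll from the host is handy. This assumes trtllm-serve answers the OpenAI-compatible /v1/models route on the mapped host port 8001; verify both against your version:

```bash
# Poll the assumed OpenAI-compatible route until the server answers.
until curl -sf http://127.0.0.1:8001/v1/models > /dev/null; do
  echo "waiting for trtllm-serve..."
  sleep 5
done
echo "server is up"
```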
If you use nvcr.io/nvidia/tensorrt-llm-release:0.21.0, serving the model fails with:

```
ValueError: Inferred model format _ModelFormatKind.HF, but failed to load config.json: The given huggingface model architecture Qwen2_5_VLForConditionalGeneration is not supported in TRT-LLM yet
```
Serving Qwen2.5-VL-7B-Instruct:

```bash
export CUDA_VISIBLE_DEVICES=0
trtllm-serve /data/models/Qwen2.5-VL-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tp_size 1 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_num_tokens 16384 \
    --kv_cache_free_gpu_memory_fraction 0.95
```
3. vLLM
Start the container; the image mirrors the nvcr.io Triton 25.03 vLLM image:

```bash
nerdctl run -it \
  -p 8002:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name vllm \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  registry.cn-beijing.aliyuncs.com/opshub/nvcr-io-nvidia-tritonserver:25.03-vllm-python-py3
```
Inside the container, serve Qwen2.5-7B-Instruct on GPU 1:

```bash
export CUDA_VISIBLE_DEVICES=1
vllm serve /data/models/Qwen2.5-7B-Instruct \
    --served-model-name /data/models/Qwen2.5-7B-Instruct \
    --port 8000 \
    --gpu_memory_utilization 0.90 \
    --max-model-len 4096 \
    --max-seq-len-to-capture 8192 \
    --max-num-seqs 128 \
    --disable-log-stats \
    --enforce-eager
```
With nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3 the model fails with:

```
ValueError: The checkpoint you are trying to load has model type qwen2_5_vl but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
```

See https://github.com/vllm-project/vllm/issues/13446 . Versions that are too new also fail; 25.03 above is the lowest image version I found that supports Qwen2.5-VL. See https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/introduction/compatibility.html and https://github.com/QwenLM/Qwen2.5-VL .
Serving Qwen2.5-VL-7B-Instruct; note the short --served-model-name here:

```bash
export CUDA_VISIBLE_DEVICES=1
vllm serve /data/models/Qwen2.5-VL-7B-Instruct \
    --served-model-name Qwen2.5-VL-7B-Instruct \
    --port 8000 \
    --gpu_memory_utilization 0.90 \
    --max-model-len 4096 \
    --max-seq-len-to-capture 8192 \
    --max-num-seqs 128 \
    --disable-log-stats \
    --enforce-eager
```
4. SGLang
SGLang provides a Blackwell-optimized image for the 5090; see https://github.com/sgl-project/sglang/issues/5334 .
```bash
nerdctl run -it \
  -p 8003:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name sglang \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  registry.cn-beijing.aliyuncs.com/opshub/lmsysorg-sglang:blackwell
```
Inside the container, serve Qwen2.5-7B-Instruct on GPU 2:

```bash
export CUDA_VISIBLE_DEVICES=2
python3 -m sglang.launch_server \
    --model /data/models/Qwen2.5-7B-Instruct \
    --tp 1 \
    --mem-fraction-static 0.8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000
```
Serving Qwen2.5-VL-7B-Instruct:

```bash
export CUDA_VISIBLE_DEVICES=2
python3 -m sglang.launch_server \
    --model /data/models/Qwen2.5-VL-7B-Instruct \
    --tp 1 \
    --mem-fraction-static 0.8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000
```
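Either launch can be probed from the host through the mapped port; SGLang exposes a /health route (the route name is an assumption, verify against your SGLang version):

```bash
# Assumed health route on the host-mapped port 8003.
curl -sf http://127.0.0.1:8003/health && echo "sglang is up"
```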
5. Framework Memory Usage
Using Qwen2.5-7B-Instruct as the example, with one framework per GPU (tllm on GPU 0, vllm on GPU 1, sglang on GPU 2, per the CUDA_VISIBLE_DEVICES settings above):
```
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05              Driver Version: 580.76.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:08:00.0 Off |                  N/A |
|  0%   34C    P8             24W /  575W |   31016MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        On  |   00000000:0C:00.0 Off |                  N/A |
|  0%   32C    P8             17W /  575W |   28477MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 5090        On  |   00000000:7E:00.0 Off |                  N/A |
|  0%   33C    P8             15W /  575W |   27087MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
...
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    124671      C   /usr/bin/python                               568MiB |
|    0   N/A  N/A    124905      C   /usr/bin/python                             30434MiB |
|    1   N/A  N/A    125873      C   /usr/bin/python3                            28454MiB |
|    2   N/A  N/A    127174      C   sglang::scheduler                           27078MiB |
+-----------------------------------------------------------------------------------------+
```
| Framework | Memory Usage (MiB) |
|---|---|
| tllm (/usr/bin/python) | 30,434 + 568 = 31,002 |
| sglang (sglang::scheduler) | 27,078 |
| vllm (/usr/bin/python3) | 28,454 |
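Per-process figures like these can also be pulled directly with nvidia-smi's query mode instead of reading the full table:

```bash
# CSV listing of compute processes and their GPU memory usage.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```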
6. Functional Testing
To switch quickly between the frameworks under test, the target port is taken from the PORT environment variable (8001 = tllm, 8002 = vllm, 8003 = sglang, matching the port mappings above).
- Test Qwen2.5-7B-Instruct

```bash
curl -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "请简单介绍一下量子计算。"}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'
```
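To read just the generated text out of the JSON response, the reply can be piped through jq (assumes jq on the host and the standard OpenAI response shape):

```bash
curl -s -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "/data/models/Qwen2.5-7B-Instruct",
       "messages": [{"role": "user", "content": "请简单介绍一下量子计算。"}],
       "max_tokens": 200}' \
  | jq -r '.choices[0].message.content'
```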
- Test Qwen2.5-VL-7B-Instruct. The model field must match the name the server registered; for the vllm command above that is the --served-model-name Qwen2.5-VL-7B-Instruct rather than the full path used below.
```bash
curl -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "请描述这张图片的内容"},
          {"type": "image_url", "image_url": {"url": "https://www.gov.cn/xhtml/2019zhuanti/guoqiguohui20201217V1/images/202012251135.png"}}
        ]
      }
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'
```
7. Performance Testing
- Qwen2.5-7B-Instruct performance test

| Metric | tllm | sglang | vllm |
|---|---|---|---|
| QPS | 24 | 21 | 14 |
| New tokens | 4669 | 4167 | 2647 |
| Token throughput | 15720 | 13938 | 8880 |
- Qwen2.5-VL-7B-Instruct performance test

| Metric | tllm | sglang | vllm |
|---|---|---|---|
| QPS | - | 31 | 12 |
| New tokens | - | 2436 | 983 |
| Token throughput | - | 16581 | 6468 |

tllm runs Out of Memory at high concurrency, hence the missing column.
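The benchmark tool behind these numbers is not stated. The hypothetical sketch below only shows the shape of such a load test (fire N concurrent requests, derive QPS from wall-clock time); real figures should come from a proper harness such as the benchmark scripts the frameworks ship:

```bash
# Hypothetical concurrency smoke test; not the tool used for the tables above.
N=64  # concurrent requests
start=$(date +%s)
for i in $(seq 1 $N); do
  curl -s -o /dev/null -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "/data/models/Qwen2.5-7B-Instruct",
         "messages": [{"role": "user", "content": "请简单介绍一下量子计算。"}],
         "max_tokens": 200}' &
done
wait
end=$(date +%s)
elapsed=$(( end - start )); [ "$elapsed" -eq 0 ] && elapsed=1
echo "QPS ~= $(( N / elapsed ))"
```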