1. Install the Driver
Visit https://www.nvidia.com/en-us/drivers/ and download the driver version that matches your GPU.
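The .run installer builds a kernel module, so the host needs a compiler and headers for the running kernel, and the in-tree nouveau driver must be out of the way. A minimal sketch for a Debian/Ubuntu host (package names and paths are assumptions; adjust for your distribution):

```bash
# Build tools plus headers matching the running kernel (Debian/Ubuntu naming).
apt-get update
apt-get install -y build-essential linux-headers-$(uname -r)

# Blacklist the in-tree nouveau driver so it cannot claim the GPUs,
# then rebuild the initramfs and reboot before running the installer.
cat > /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
update-initramfs -u
```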
```bash
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/580.76.05/NVIDIA-Linux-x86_64-580.76.05.run
```

```bash
bash NVIDIA-Linux-x86_64-580.76.05.run
```
After installation, listing the GPUs should show every card:

```
nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 5090 (UUID: GPU-92fcdc58-4754-73c7-af6c-56740936817d)
GPU 1: NVIDIA GeForce RTX 5090 (UUID: GPU-e05cb455-7dd3-0db5-ac39-70794aa19d4e)
...
```
The topology shows the GPUs grouped in PIX pairs, with GPU0-3 on NUMA node 0 and GPU4-7 on NUMA node 1:

```
nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     NODE    NODE    SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU1    PIX      X      NODE    NODE    SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU2    NODE    NODE     X      PIX     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU3    NODE    NODE    PIX      X      SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    48-95,144-191   1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    48-95,144-191   1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     48-95,144-191   1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      48-95,144-191   1               N/A
```
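Given that topology, it can help to pin an inference process to the NUMA node local to its GPU to avoid cross-socket memory traffic. A hedged sketch with numactl (node numbers come from the table above; `some-inference-command` is a placeholder):

```bash
# GPU0-3 are local to NUMA node 0, GPU4-7 to node 1 (per nvidia-smi topo -m).
# Bind both the CPUs and memory allocations to the GPU's local node.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 some-inference-command
CUDA_VISIBLE_DEVICES=4 numactl --cpunodebind=1 --membind=1 some-inference-command
```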
2. TensorRT-LLM (TLLM)
Start the container; the Aliyun image mirrors the nvcr.io TensorRT-LLM 0.21.0 release image:

```bash
nerdctl run -it \
  -p 8001:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name tllm \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  registry.cn-beijing.aliyuncs.com/opshub/nvcr-io-nvidia-tensorrt-llm-release:0.21.0
```
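A quick sanity check from the host that the running container can see the GPUs:

```bash
nerdctl exec tllm nvidia-smi -L
```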
Inside the container, serve Qwen2.5-7B-Instruct on GPU 0:

```bash
export CUDA_VISIBLE_DEVICES=0
trtllm-serve /data/models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_num_tokens 16384 \
    --kv_cache_free_gpu_memory_fraction 0.95
```
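Weight loading takes a while, so a readiness poll from the host is handy. This assumes trtllm-serve answers the OpenAI-compatible /v1/models route on the mapped host port 8001; verify both against your version:

```bash
# Poll the assumed OpenAI-compatible route until the server answers.
until curl -sf http://127.0.0.1:8001/v1/models > /dev/null; do
  echo "waiting for trtllm-serve..."
  sleep 5
done
echo "server is up"
```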
If you use nvcr.io/nvidia/tensorrt-llm-release:0.21.0, serving the model fails with:

```
ValueError: Inferred model format _ModelFormatKind.HF, but failed to load config.json: The given huggingface model architecture Qwen2_5_VLForConditionalGeneration is not supported in TRT-LLM yet
```
Serving Qwen2.5-VL-7B-Instruct:

```bash
export CUDA_VISIBLE_DEVICES=0
trtllm-serve /data/models/Qwen2.5-VL-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tp_size 1 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_num_tokens 16384 \
    --kv_cache_free_gpu_memory_fraction 0.95
```
3. vLLM
Start the container; the image mirrors the nvcr.io Triton 25.03 vLLM image:

```bash
nerdctl run -it \
  -p 8002:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name vllm \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  registry.cn-beijing.aliyuncs.com/opshub/nvcr-io-nvidia-tritonserver:25.03-vllm-python-py3
```
Inside the container, serve Qwen2.5-7B-Instruct on GPU 1:

```bash
export CUDA_VISIBLE_DEVICES=1
vllm serve /data/models/Qwen2.5-7B-Instruct \
    --served-model-name /data/models/Qwen2.5-7B-Instruct \
    --port 8000 \
    --gpu_memory_utilization 0.90 \
    --max-model-len 4096 \
    --max-seq-len-to-capture 8192 \
    --max-num-seqs 128 \
    --disable-log-stats \
    --enforce-eager
```
With nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3 the model fails with:

```
ValueError: The checkpoint you are trying to load has model type qwen2_5_vl but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
```

See https://github.com/vllm-project/vllm/issues/13446 . Versions that are too new also fail; 25.03 above is the lowest image version I found that supports Qwen2.5-VL. See https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/introduction/compatibility.html and https://github.com/QwenLM/Qwen2.5-VL .
Serving Qwen2.5-VL-7B-Instruct; note the short --served-model-name here:

```bash
export CUDA_VISIBLE_DEVICES=1
vllm serve /data/models/Qwen2.5-VL-7B-Instruct \
    --served-model-name Qwen2.5-VL-7B-Instruct \
    --port 8000 \
    --gpu_memory_utilization 0.90 \
    --max-model-len 4096 \
    --max-seq-len-to-capture 8192 \
    --max-num-seqs 128 \
    --disable-log-stats \
    --enforce-eager
```
4. SGLang
SGLang provides a Blackwell-optimized image for the 5090; see https://github.com/sgl-project/sglang/issues/5334 .
```bash
nerdctl run -it \
  -p 8003:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name sglang \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  registry.cn-beijing.aliyuncs.com/opshub/lmsysorg-sglang:blackwell
```
Inside the container, serve Qwen2.5-7B-Instruct on GPU 2:

```bash
export CUDA_VISIBLE_DEVICES=2
python3 -m sglang.launch_server \
    --model /data/models/Qwen2.5-7B-Instruct \
    --tp 1 \
    --mem-fraction-static 0.8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000
```
Serving Qwen2.5-VL-7B-Instruct:

```bash
export CUDA_VISIBLE_DEVICES=2
python3 -m sglang.launch_server \
    --model /data/models/Qwen2.5-VL-7B-Instruct \
    --tp 1 \
    --mem-fraction-static 0.8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000
```
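Either launch can be probed from the host through the mapped port; SGLang exposes a /health route (the route name is an assumption, verify against your SGLang version):

```bash
# Assumed health route on the host-mapped port 8003.
curl -sf http://127.0.0.1:8003/health && echo "sglang is up"
```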
5. Framework Memory Usage
Using Qwen2.5-7B-Instruct as the example, with one framework per GPU (tllm on GPU 0, vllm on GPU 1, sglang on GPU 2, per the CUDA_VISIBLE_DEVICES settings above):
```
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05              Driver Version: 580.76.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:08:00.0 Off |                  N/A |
|  0%   34C    P8             24W /  575W |   31016MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        On  |   00000000:0C:00.0 Off |                  N/A |
|  0%   32C    P8             17W /  575W |   28477MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 5090        On  |   00000000:7E:00.0 Off |                  N/A |
|  0%   33C    P8             15W /  575W |   27087MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
...
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    124671      C   /usr/bin/python                               568MiB |
|    0   N/A  N/A    124905      C   /usr/bin/python                             30434MiB |
|    1   N/A  N/A    125873      C   /usr/bin/python3                            28454MiB |
|    2   N/A  N/A    127174      C   sglang::scheduler                           27078MiB |
+-----------------------------------------------------------------------------------------+
```
| Framework | Memory Usage (MiB) |
|---|---|
| tllm (/usr/bin/python) | 30,434 + 568 = 31,002 |
| sglang (sglang::scheduler) | 27,078 |
| vllm (/usr/bin/python3) | 28,454 |
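Per-process figures like these can also be pulled directly with nvidia-smi's query mode instead of reading the full table:

```bash
# CSV listing of compute processes and their GPU memory usage.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```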
6. Functional Testing
To switch quickly between the frameworks under test, the target port is taken from the PORT environment variable (8001 = tllm, 8002 = vllm, 8003 = sglang, matching the port mappings above).
- Test Qwen2.5-7B-Instruct

```bash
curl -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "请简单介绍一下量子计算。"}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'
```
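To read just the generated text out of the JSON response, the reply can be piped through jq (assumes jq on the host and the standard OpenAI response shape):

```bash
curl -s -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "/data/models/Qwen2.5-7B-Instruct",
       "messages": [{"role": "user", "content": "请简单介绍一下量子计算。"}],
       "max_tokens": 200}' \
  | jq -r '.choices[0].message.content'
```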
- Test Qwen2.5-VL-7B-Instruct. The model field must match the name the server registered; for the vllm command above that is the --served-model-name Qwen2.5-VL-7B-Instruct rather than the full path used below.
```bash
curl -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "请描述这张图片的内容"},
          {"type": "image_url", "image_url": {"url": "https://www.gov.cn/xhtml/2019zhuanti/guoqiguohui20201217V1/images/202012251135.png"}}
        ]
      }
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'
```
7. Performance Testing
- Qwen2.5-7B-Instruct performance test

| Metric | tllm | sglang | vllm |
|---|---|---|---|
| QPS | 24 | 21 | 14 |
| New tokens | 4669 | 4167 | 2647 |
| Token throughput | 15720 | 13938 | 8880 |
- Qwen2.5-VL-7B-Instruct performance test

| Metric | tllm | sglang | vllm |
|---|---|---|---|
| QPS | - | 31 | 12 |
| New tokens | - | 2436 | 983 |
| Token throughput | - | 16581 | 6468 |

tllm runs Out of Memory at high concurrency, hence the missing column.
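The benchmark tool behind these numbers is not stated. The hypothetical sketch below only shows the shape of such a load test (fire N concurrent requests, derive QPS from wall-clock time); real figures should come from a proper harness such as the benchmark scripts the frameworks ship:

```bash
# Hypothetical concurrency smoke test; not the tool used for the tables above.
N=64  # concurrent requests
start=$(date +%s)
for i in $(seq 1 $N); do
  curl -s -o /dev/null -X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "/data/models/Qwen2.5-7B-Instruct",
         "messages": [{"role": "user", "content": "请简单介绍一下量子计算。"}],
         "max_tokens": 200}' &
done
wait
end=$(date +%s)
elapsed=$(( end - start )); [ "$elapsed" -eq 0 ] && elapsed=1
echo "QPS ~= $(( N / elapsed ))"
```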