vLLM Benchmark is a tool shipped with vLLM for measuring model serving performance, and it supports several inference backends. This post records the steps involved in benchmarking a model with vLLM Benchmark.
1. Start the Model Server
```bash
python -m vllm.entrypoints.openai.api_server \
    --model /models/Qwen2.5-7B-Instruct \
    --served-model-name /models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096 \
    --max-seq-len-to-capture 8192 \
    --max-num-seqs 128 \
    --disable-log-stats \
    --no-enable-prefix-caching
```
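Before moving on to the client, it is worth checking that the OpenAI-compatible server is actually serving. A minimal sketch, assuming the host, port, and served model name from the command above:

```bash
# List the models the server exposes; the id should match --served-model-name.
curl -s http://127.0.0.1:8000/v1/models

# Send a single completion request to confirm tokens are generated.
curl -s http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen2.5-7B-Instruct", "prompt": "Hello", "max_tokens": 8}'
```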
2. Start the Client
2.1 Locating the benchmark_serving.py File
Depending on how vLLM was built or installed, benchmark_serving.py may sit in a different directory. The following command finds it:
```bash
find / -name benchmark_serving.py 2>/dev/null
/vllm-workspace/benchmarks/benchmark_serving.py
```
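If find turns up nothing, the script can also be taken from the vLLM source tree, which keeps the benchmark scripts under benchmarks/ (this is a fallback I am suggesting, not part of the original setup):

```bash
# Fetch the benchmark scripts from the vLLM repository.
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
ls benchmark_serving.py
```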
2.2 Benchmarking with Random Input
The --backend option accepts many values: tgi, vllm, lmdeploy, deepspeed-mii, openai, openai-chat, openai-audio, tensorrt-llm, scalellm, sglang, llama.cpp. Pick the one that matches the inference backend you are testing.
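For instance, a server reachable only through its chat endpoint could be benchmarked with the openai-chat backend. A rough sketch, where --base-url and --endpoint reflect my reading of the script's CLI and should be double-checked against python3 benchmark_serving.py --help:

```bash
# Hedged sketch: point the benchmark at an OpenAI-compatible chat endpoint.
# The default endpoint is /v1/completions, so it has to be overridden here.
python3 benchmark_serving.py \
    --backend openai-chat \
    --base-url http://127.0.0.1:8000 \
    --endpoint /v1/chat/completions \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name random \
    --num-prompts 128
```

The actual test below sticks with the vllm backend and random input: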
```bash
python3 benchmark_serving.py \
    --backend vllm \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name random \
    --random-input-len 1024 \
    --num-prompts 1024 \
    --request-rate inf \
    --max-concurrency 32 \
    --result-dir /tmp/results \
    --result-filename Qwen2.5-7B-Instruct.json \
    --save-result
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 119.45
Total input tokens: 1048576
Total generated tokens: 128163
Request throughput (req/s): 8.57
Output token throughput (tok/s): 1072.95
Total Token throughput (tok/s): 9851.38
---------------Time to First Token----------------
Mean TTFT (ms): 445.42
Median TTFT (ms): 196.37
P99 TTFT (ms): 1625.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 27.83
Median TPOT (ms): 29.24
P99 TPOT (ms): 44.15
---------------Inter-token Latency----------------
Mean ITL (ms): 26.18
Median ITL (ms): 13.31
P99 ITL (ms): 359.57
==================================================
```
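Since the run passed --save-result, the same metrics are also written to /tmp/results/Qwen2.5-7B-Instruct.json. Assuming jq is available, the file can be inspected like this (the key names inside the JSON are whatever the script emits, so list them first):

```bash
# List the top-level keys of the saved result, then pretty-print the whole file.
jq 'keys' /tmp/results/Qwen2.5-7B-Instruct.json
jq . /tmp/results/Qwen2.5-7B-Instruct.json
```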
2.3 Benchmarking with the ShareGPT Dataset
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
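A quick sanity check on the download; this assumes the file is a top-level JSON array of conversations, which is how the dataset is normally distributed:

```bash
# File size and number of conversation records.
ls -lh ShareGPT_V3_unfiltered_cleaned_split.json
jq 'length' ShareGPT_V3_unfiltered_cleaned_split.json
```

The ShareGPT benchmark run itself: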
```bash
python3 benchmark_serving.py \
    --backend vllm \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --random-input-len 1024 \
    --num-prompts 1024 \
    --request-rate inf \
    --max-concurrency 32 \
    --result-dir /tmp/results \
    --result-filename Qwen2.5-7B-Instruct.json \
    --save-result
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 108.24
Total input tokens: 225502
Total generated tokens: 201458
Request throughput (req/s): 9.46
Output token throughput (tok/s): 1861.29
Total Token throughput (tok/s): 3944.73
---------------Time to First Token----------------
Mean TTFT (ms): 57.06
Median TTFT (ms): 44.11
P99 TTFT (ms): 158.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.40
Median TPOT (ms): 15.89
P99 TPOT (ms): 29.18
---------------Inter-token Latency----------------
Mean ITL (ms): 16.12
Median ITL (ms): 12.36
P99 ITL (ms): 88.38
==================================================
```
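One knob worth knowing about here: to my knowledge benchmark_serving.py accepts a --sharegpt-output-len flag that overrides the response lengths recorded in the dataset, which makes runs more comparable across models. Treat the flag as an assumption and confirm it with python3 benchmark_serving.py --help before relying on it:

```bash
# Hedged sketch: replay ShareGPT but cap every response at 256 tokens.
# --sharegpt-output-len is assumed to exist in this copy of the script.
python3 benchmark_serving.py \
    --backend vllm \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --sharegpt-output-len 256 \
    --num-prompts 512
```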
3. Tests and Strategies
3.1 Performance Metrics at Different Request Rates
The runs below use the random dataset with the flags shown next, and --request-rate is varied from 4 to 128:
```bash
--dataset-name random \
--num-prompts 512 \
--random-input-len 512 \
--request-rate 4
```
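A minimal sketch of scripting the sweep, reusing the flags from section 2.2 and writing one result file per rate (the loop itself is my addition, not part of the original runs):

```bash
# Sweep the request rate and keep one JSON result per run.
for rate in 4 8 16 32 64 128; do
    python3 benchmark_serving.py \
        --backend vllm \
        --model /models/Qwen2.5-7B-Instruct \
        --dataset-name random \
        --num-prompts 512 \
        --random-input-len 512 \
        --request-rate "${rate}" \
        --result-dir /tmp/results \
        --result-filename "Qwen2.5-7B-Instruct-rate${rate}.json" \
        --save-result
done
```

Measured results across the sweep: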
| Request Rate | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 512 | 3.88 | 481.83 | 28.47 | 11.40 | 12.44 |
| 8 | 512 | 7.66 | 953.97 | 30.52 | 12.01 | 12.98 |
| 16 | 512 | 14.90 | 1858.65 | 38.75 | 16.37 | 17.34 |
| 32 | 512 | 26.21 | 3260.09 | 517.96 | 34.56 | 40.41 |
| 64 | 512 | 28.09 | 3515.68 | 6932.16 | 35.06 | 42.04 |
| 128 | 512 | 30.57 | 3811.86 | 9272.95 | 36.47 | 41.44 |
TTFT (Time to First Token): time from sending a request until its first token is returned.
TPOT (Time per Output Token): average generation time per output token, excluding the first token.
ITL (Inter-token Latency): latency between two consecutive output tokens.
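Taken together these give a rough end-to-end latency estimate per request: latency ≈ TTFT + (output tokens − 1) × TPOT. As an illustration only (a 128-token response is an assumption, and combining P99 values gives a pessimistic back-of-the-envelope figure rather than a true percentile): at request rate 4 that works out to about 28.47 ms + 127 × 11.40 ms ≈ 1.48 s.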
3.2 Performance Metrics on a Different Dataset
The same request-rate sweep, replaying ShareGPT conversations instead of random prompts:
```bash
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 512 \
--random-input-len 512 \
--request-rate 4
```
| Request Rate | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 512 | 3.78 | 818.64 | 126.82 | 19.17 | 58.99 |
| 8 | 512 | 7.09 | 1538.00 | 149.21 | 26.77 | 85.53 |
| 16 | 512 | 11.39 | 2493.77 | 2493.77 | 57.19 | 126.70 |
| 32 | 512 | 12.17 | 2621.23 | 14438.37 | 168.35 | 144.85 |
| 64 | 512 | 12.27 | 2638.42 | 22112.56 | 260.28 | 129.32 |
| 128 | 512 | 12.27 | 2637.73 | 26116.57 | 238.28 | 128.18 |
3.3 Performance Metrics at Different Input Lengths
Back on the random dataset, --random-input-len is varied while the request rate stays at 16:
```bash
--dataset-name random \
--num-prompts 512 \
--request-rate 16 \
--random-input-len 256
```
| Input Length (tokens) | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 256 | 512 | 14.58 | 1821.14 | 288.79 | 71.90 | 125.74 |
| 512 | 512 | 11.34 | 1413.72 | 9793.23 | 48.16 | 509.29 |
| 1024 | 512 | 6.81 | 849.99 | 38449.23 | 212.07 | 876.85 |
| 2048 | 512 | 3.63 | 451.75 | 102938.13 | 898.59 | 915.02 |