vLLM Benchmark is a tool shipped with vLLM for measuring model serving performance, and it supports several inference backends. This post records the steps involved in benchmarking a model with vLLM Benchmark.
1. Start the Model Server
```bash
python -m vllm.entrypoints.openai.api_server \
    --model /models/Qwen2.5-7B-Instruct \
    --served-model-name /models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096 \
    --max-seq-len-to-capture 8192 \
    --max-num-seqs 128 \
    --disable-log-stats \
    --no-enable-prefix-caching
```
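Before moving on to the client, it is worth checking that the OpenAI-compatible server is actually serving. A minimal sketch, assuming the host, port, and served model name from the command above:

```bash
# List the models the server exposes; the id should match --served-model-name.
curl -s http://127.0.0.1:8000/v1/models

# Send a single completion request to confirm tokens are generated.
curl -s http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen2.5-7B-Instruct", "prompt": "Hello", "max_tokens": 8}'
```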
2. Start the Client
2.1 Locating the benchmark_serving.py File
Depending on how vLLM was built or installed, benchmark_serving.py may sit in a different directory. The following command finds it:
```bash
find / -name benchmark_serving.py 2>/dev/null
/vllm-workspace/benchmarks/benchmark_serving.py
```
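If find turns up nothing, the script can also be taken from the vLLM source tree, which keeps the benchmark scripts under benchmarks/ (this is a fallback I am suggesting, not part of the original setup):

```bash
# Fetch the benchmark scripts from the vLLM repository.
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
ls benchmark_serving.py
```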
2.2 Benchmarking with Random Input
The --backend option accepts many values: tgi, vllm, lmdeploy, deepspeed-mii, openai, openai-chat, openai-audio, tensorrt-llm, scalellm, sglang, llama.cpp. Pick the one that matches the inference backend you are testing.
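For instance, a server reachable only through its chat endpoint could be benchmarked with the openai-chat backend. A rough sketch, where --base-url and --endpoint reflect my reading of the script's CLI and should be double-checked against python3 benchmark_serving.py --help:

```bash
# Hedged sketch: point the benchmark at an OpenAI-compatible chat endpoint.
# The default endpoint is /v1/completions, so it has to be overridden here.
python3 benchmark_serving.py \
    --backend openai-chat \
    --base-url http://127.0.0.1:8000 \
    --endpoint /v1/chat/completions \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name random \
    --num-prompts 128
```

The actual test below sticks with the vllm backend and random input: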
```bash
python3 benchmark_serving.py \
    --backend vllm \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name random \
    --random-input-len 1024 \
    --num-prompts 1024 \
    --request-rate inf \
    --max-concurrency 32 \
    --result-dir /tmp/results \
    --result-filename Qwen2.5-7B-Instruct.json \
    --save-result
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 119.45
Total input tokens: 1048576
Total generated tokens: 128163
Request throughput (req/s): 8.57
Output token throughput (tok/s): 1072.95
Total Token throughput (tok/s): 9851.38
---------------Time to First Token----------------
Mean TTFT (ms): 445.42
Median TTFT (ms): 196.37
P99 TTFT (ms): 1625.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 27.83
Median TPOT (ms): 29.24
P99 TPOT (ms): 44.15
---------------Inter-token Latency----------------
Mean ITL (ms): 26.18
Median ITL (ms): 13.31
P99 ITL (ms): 359.57
==================================================
```
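Since the run passed --save-result, the same metrics are also written to /tmp/results/Qwen2.5-7B-Instruct.json. Assuming jq is available, the file can be inspected like this (the key names inside the JSON are whatever the script emits, so list them first):

```bash
# List the top-level keys of the saved result, then pretty-print the whole file.
jq 'keys' /tmp/results/Qwen2.5-7B-Instruct.json
jq . /tmp/results/Qwen2.5-7B-Instruct.json
```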
2.3 Benchmarking with the ShareGPT Dataset
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
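A quick sanity check on the download; this assumes the file is a top-level JSON array of conversations, which is how the dataset is normally distributed:

```bash
# File size and number of conversation records.
ls -lh ShareGPT_V3_unfiltered_cleaned_split.json
jq 'length' ShareGPT_V3_unfiltered_cleaned_split.json
```

The ShareGPT benchmark run itself: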
```bash
python3 benchmark_serving.py \
    --backend vllm \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --random-input-len 1024 \
    --num-prompts 1024 \
    --request-rate inf \
    --max-concurrency 32 \
    --result-dir /tmp/results \
    --result-filename Qwen2.5-7B-Instruct.json \
    --save-result
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 108.24
Total input tokens: 225502
Total generated tokens: 201458
Request throughput (req/s): 9.46
Output token throughput (tok/s): 1861.29
Total Token throughput (tok/s): 3944.73
---------------Time to First Token----------------
Mean TTFT (ms): 57.06
Median TTFT (ms): 44.11
P99 TTFT (ms): 158.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.40
Median TPOT (ms): 15.89
P99 TPOT (ms): 29.18
---------------Inter-token Latency----------------
Mean ITL (ms): 16.12
Median ITL (ms): 12.36
P99 ITL (ms): 88.38
==================================================
```
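One knob worth knowing about here: to my knowledge benchmark_serving.py accepts a --sharegpt-output-len flag that overrides the response lengths recorded in the dataset, which makes runs more comparable across models. Treat the flag as an assumption and confirm it with python3 benchmark_serving.py --help before relying on it:

```bash
# Hedged sketch: replay ShareGPT but cap every response at 256 tokens.
# --sharegpt-output-len is assumed to exist in this copy of the script.
python3 benchmark_serving.py \
    --backend vllm \
    --model /models/Qwen2.5-7B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --sharegpt-output-len 256 \
    --num-prompts 512
```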
3. Tests and Strategies
3.1 Performance Metrics at Different Request Rates
The runs below use the random dataset with the flags shown next, and --request-rate is varied from 4 to 128:
```bash
--dataset-name random \
--num-prompts 512 \
--random-input-len 512 \
--request-rate 4
```
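A minimal sketch of scripting the sweep, reusing the flags from section 2.2 and writing one result file per rate (the loop itself is my addition, not part of the original runs):

```bash
# Sweep the request rate and keep one JSON result per run.
for rate in 4 8 16 32 64 128; do
    python3 benchmark_serving.py \
        --backend vllm \
        --model /models/Qwen2.5-7B-Instruct \
        --dataset-name random \
        --num-prompts 512 \
        --random-input-len 512 \
        --request-rate "${rate}" \
        --result-dir /tmp/results \
        --result-filename "Qwen2.5-7B-Instruct-rate${rate}.json" \
        --save-result
done
```

Measured results across the sweep: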
| Request Rate | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 512 | 3.88 | 481.83 | 28.47 | 11.40 | 12.44 |
| 8 | 512 | 7.66 | 953.97 | 30.52 | 12.01 | 12.98 |
| 16 | 512 | 14.90 | 1858.65 | 38.75 | 16.37 | 17.34 |
| 32 | 512 | 26.21 | 3260.09 | 517.96 | 34.56 | 40.41 |
| 64 | 512 | 28.09 | 3515.68 | 6932.16 | 35.06 | 42.04 |
| 128 | 512 | 30.57 | 3811.86 | 9272.95 | 36.47 | 41.44 |
TTFT (Time to First Token): time from sending a request until its first token is returned.
TPOT (Time per Output Token): average generation time per output token, excluding the first token.
ITL (Inter-token Latency): latency between two consecutive output tokens.
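Taken together these give a rough end-to-end latency estimate per request: latency ≈ TTFT + (output tokens − 1) × TPOT. As an illustration only (a 128-token response is an assumption, and combining P99 values gives a pessimistic back-of-the-envelope figure rather than a true percentile): at request rate 4 that works out to about 28.47 ms + 127 × 11.40 ms ≈ 1.48 s.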
3.2 Performance Metrics on a Different Dataset
The same request-rate sweep, replaying ShareGPT conversations instead of random prompts:
```bash
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 512 \
--random-input-len 512 \
--request-rate 4
```
| Request Rate | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 512 | 3.78 | 818.64 | 126.82 | 19.17 | 58.99 |
| 8 | 512 | 7.09 | 1538.00 | 149.21 | 26.77 | 85.53 |
| 16 | 512 | 11.39 | 2493.77 | 2493.77 | 57.19 | 126.70 |
| 32 | 512 | 12.17 | 2621.23 | 14438.37 | 168.35 | 144.85 |
| 64 | 512 | 12.27 | 2638.42 | 22112.56 | 260.28 | 129.32 |
| 128 | 512 | 12.27 | 2637.73 | 26116.57 | 238.28 | 128.18 |
3.3 Performance Metrics at Different Input Lengths
Back on the random dataset, --random-input-len is varied while the request rate stays at 16:
```bash
--dataset-name random \
--num-prompts 512 \
--request-rate 16 \
--random-input-len 256
```
| Input Length (tokens) | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 256 | 512 | 14.58 | 1821.14 | 288.79 | 71.90 | 125.74 |
| 512 | 512 | 11.34 | 1413.72 | 9793.23 | 48.16 | 509.29 |
| 1024 | 512 | 6.81 | 849.99 | 38449.23 | 212.07 | 876.85 |
| 2048 | 512 | 3.63 | 451.75 | 102938.13 | 898.59 | 915.02 |