
Model Performance Testing with vLLM Benchmark

 ·  ☕ 3 min

vLLM Benchmark is a tool shipped with vLLM for measuring model inference performance, with support for multiple inference backends. This post records the process of running model performance tests with vLLM Benchmark.

1. Start the Model Service

python -m vllm.entrypoints.openai.api_server \
  --model /models/Qwen2.5-7B-Instruct \
  --served-model-name /models/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-seq-len-to-capture 8192 \
  --max-num-seqs 128 \
  --disable-log-stats \
  --no-enable-prefix-caching

2. Start the Client

2.1 Locate benchmark_serving.py

Depending on how vLLM was built or installed, benchmark_serving.py may sit in different directories. It can be located with:

find / -name benchmark_serving.py 2>/dev/null

/vllm-workspace/benchmarks/benchmark_serving.py

2.2 Benchmark with Random Inputs

The --backend option accepts many values (tgi, vllm, lmdeploy, deepspeed-mii, openai, openai-chat, openai-audio, tensorrt-llm, scalellm, sglang, llama.cpp); pick the one that matches the inference backend being tested.

python3 benchmark_serving.py \
  --backend vllm \
  --model /models/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate inf \
  --max-concurrency 32 \
  --result-dir /tmp/results \
  --result-filename Qwen2.5-7B-Instruct.json \
  --save-result
============ Serving Benchmark Result ============
Successful requests:                     1024
Benchmark duration (s):                  119.45
Total input tokens:                      1048576
Total generated tokens:                  128163
Request throughput (req/s):              8.57
Output token throughput (tok/s):         1072.95
Total Token throughput (tok/s):          9851.38
---------------Time to First Token----------------
Mean TTFT (ms):                          445.42
Median TTFT (ms):                        196.37
P99 TTFT (ms):                           1625.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.83
Median TPOT (ms):                        29.24
P99 TPOT (ms):                           44.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.18
Median ITL (ms):                         13.31
P99 ITL (ms):                            359.57
==================================================
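As a quick sanity check on the summary above, the three throughput figures follow directly from the raw counts; re-deriving them from the printed (rounded) numbers agrees with the report to within rounding error:

```python
# Re-derive the throughput lines of the summary from its raw counts.
successful_requests = 1024
duration_s = 119.45
total_input_tokens = 1_048_576
total_generated_tokens = 128_163

request_throughput = successful_requests / duration_s                  # req/s
output_tok_throughput = total_generated_tokens / duration_s            # tok/s
total_tok_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"{request_throughput:.2f} {output_tok_throughput:.2f} {total_tok_throughput:.2f}")
# → 8.57 1072.94 9851.31 (tiny differences from the report come from the rounded duration)
```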

2.3 Benchmark with the ShareGPT Dataset

  • Download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
  • Run the benchmark
python3 benchmark_serving.py \
  --backend vllm \
  --model /models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate inf \
  --max-concurrency 32 \
  --result-dir /tmp/results \
  --result-filename Qwen2.5-7B-Instruct.json \
  --save-result
============ Serving Benchmark Result ============
Successful requests:                     1024
Benchmark duration (s):                  108.24
Total input tokens:                      225502
Total generated tokens:                  201458
Request throughput (req/s):              9.46
Output token throughput (tok/s):         1861.29
Total Token throughput (tok/s):          3944.73
---------------Time to First Token----------------
Mean TTFT (ms):                          57.06
Median TTFT (ms):                        44.11
P99 TTFT (ms):                           158.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.40
Median TPOT (ms):                        15.89
P99 TPOT (ms):                           29.18
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.12
Median ITL (ms):                         12.36
P99 ITL (ms):                            88.38
==================================================

3. Tests and Strategies

3.1 Performance at Different Request Rates

  --dataset-name random \
  --num-prompts 512 \
  --random-input-len 512 \
  --request-rate 4
| Request Rate | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|
| 4 | 512 | 3.88 | 481.83 | 28.47 | 11.40 | 12.44 |
| 8 | 512 | 7.66 | 953.97 | 30.52 | 12.01 | 12.98 |
| 16 | 512 | 14.90 | 1858.65 | 38.75 | 16.37 | 17.34 |
| 32 | 512 | 26.21 | 3260.09 | 517.96 | 34.56 | 40.41 |
| 64 | 512 | 28.09 | 3515.68 | 6932.16 | 35.06 | 42.04 |
| 128 | 512 | 30.57 | 3811.86 | 9272.95 | 36.47 | 41.44 |

Throughput scales nearly linearly up to a request rate of 16, then saturates at roughly 30 req/s while P99 TTFT grows by two orders of magnitude as requests queue up.
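When --request-rate is finite, benchmark_serving.py spaces requests out with exponentially distributed gaps, i.e. a Poisson arrival process (with inf, all requests are sent at once). A minimal sketch of that scheduling, with `arrival_times` as a hypothetical helper rather than part of the tool:

```python
import random

def arrival_times(request_rate: float, num_prompts: int, seed: int = 0):
    """Send times (seconds) for a Poisson arrival process at `request_rate` req/s."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_prompts):
        times.append(t)
        t += rng.expovariate(request_rate)  # exponential gap, mean 1/request_rate
    return times

times = arrival_times(request_rate=4, num_prompts=512)
mean_gap = times[-1] / (len(times) - 1)
print(f"mean inter-arrival gap ≈ {mean_gap:.3f} s")  # close to 1/4 s at rate 4
```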

TTFT (Time to First Token): time from sending the request until the first token comes back.
TPOT (Time per Output Token): average generation time per output token, excluding the first one.
ITL (Inter-token Latency): latency between two consecutive output tokens.
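All three metrics can be derived from the per-token arrival timestamps of one streamed response. A minimal sketch, assuming `token_times` holds each token's arrival time in seconds since the request was sent (`request_metrics` is a hypothetical helper, not part of benchmark_serving.py):

```python
def request_metrics(token_times):
    """TTFT, mean TPOT, and per-gap ITLs (all in ms) from token arrival times
    given in seconds since the request was sent."""
    ttft = token_times[0] * 1000.0
    # TPOT excludes the first token: remaining decode time over remaining tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1) * 1000.0
    # ITL: gap between each pair of consecutive tokens.
    itls = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    return ttft, tpot, itls

# Toy stream: first token after 200 ms, then one token every 20 ms.
ttft, tpot, itls = request_metrics([0.20, 0.22, 0.24, 0.26, 0.28])
print(ttft, tpot, itls)  # TTFT 200 ms, TPOT ≈ 20 ms, ITLs ≈ 20 ms each
```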

3.2 Performance on Different Datasets

  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 512 \
  --random-input-len 512 \
  --request-rate 4
| Request Rate | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|
| 4 | 512 | 3.78 | 818.64 | 126.82 | 19.17 | 58.99 |
| 8 | 512 | 7.09 | 1538.00 | 149.21 | 26.77 | 85.53 |
| 16 | 512 | 11.39 | 2493.77 | 2493.77 | 57.19 | 126.70 |
| 32 | 512 | 12.17 | 2621.23 | 14438.37 | 168.35 | 144.85 |
| 64 | 512 | 12.27 | 2638.42 | 22112.56 | 260.28 | 129.32 |
| 128 | 512 | 12.27 | 2637.73 | 26116.57 | 238.28 | 128.18 |

3.3 Performance at Different Input Lengths

    --dataset-name random \
    --num-prompts 512 \
    --request-rate 16 \
    --random-input-len 256
| Input length | Successful requests | Request throughput (req/s) | Output token throughput (tok/s) | P99 TTFT (ms) | P99 TPOT (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|
| 256 | 512 | 14.58 | 1821.14 | 288.79 | 71.90 | 125.74 |
| 512 | 512 | 11.34 | 1413.72 | 9793.23 | 48.16 | 509.29 |
| 1024 | 512 | 6.81 | 849.99 | 38449.23 | 212.07 | 876.85 |
| 2048 | 512 | 3.63 | 451.75 | 102938.13 | 898.59 | 915.02 |

Throughput drops steadily as the input length grows, and P99 TTFT climbs steeply as prefill work comes to dominate each request.
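The P99 columns in these tables are 99th-percentile values over all requests. As a sketch, here is a simple nearest-rank percentile; interpolating implementations such as numpy's can give slightly different values:

```python
import math

def p99(latencies_ms):
    """Nearest-rank 99th percentile (one common definition; benchmark tools
    often interpolate instead, which can differ slightly)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Toy data: 512 latencies of 10, 20, ..., 5120 ms.
samples = [10.0 * i for i in range(1, 513)]
print(p99(samples))  # → 5070.0
```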
