
Using LMCache to Significantly Improve Model Inference TTFT


1. Introduction to LMCache

TTFT (Time to First Token) is the time from when a request is sent until the model generates its first token. During the Prefill phase, the input context must first be encoded into a KV cache before generation can begin; this heavy up-front computation for the first token is what drives TTFT up.

One way to reduce TTFT is to cache the KV cache computed during the Prefill phase: the next time the same context appears, the cached KV cache can be reused directly, cutting TTFT substantially.

For model inference, https://github.com/LMCache/LMCache is an open-source project built around KV cache reuse. It supports storing the KV cache in memory, on disk, in Redis, via GDS, Nixl, and other storage backends. See https://docs.lmcache.ai/kv_cache/storage_backends/index.html for details.

LMCache also provides a KV cache size calculator at https://lmcache.ai/kv_cache_calculator.html . As a rough estimate, 4k Chinese characters (about 2k tokens) require roughly 106 MB of KV cache, so the storage overhead is substantial. Although LMCache supports LRU, FIFO, LFU, and MRU cache eviction policies, production deployments usually still pair it with a large-capacity storage backend such as Redis, 3FS, or a large disk.
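That ~106 MB figure can be sanity-checked with a back-of-envelope formula: per-token KV size = 2 (K and V) × layers × KV heads × head_dim × bytes per element. The sketch below plugs in Qwen2.5-7B-Instruct's published config (28 layers, 4 KV heads via GQA, head_dim 128) at fp16; the calculator may use slightly different parameters, so treat this as an estimate:

```shell
# Rough KV cache size for Qwen2.5-7B-Instruct at fp16.
LAYERS=28 KV_HEADS=4 HEAD_DIM=128 BYTES=2
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))   # K and V, per token
echo "per-token KV size: ${PER_TOKEN} bytes"              # 57344 (~56 KiB)
echo "2000 tokens: $((PER_TOKEN * 2000 / 1024 / 1024)) MiB"
```

This lands at about 109 MiB for 2k tokens, in the same ballpark as the calculator's 106 MB.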

Next, let's run a few benchmarks to show what LMCache delivers.

2. Caching to Memory

  • Start the LMCache environment
nerdctl run -it \
        -p 8000:8000 \
        --gpus all \
        --ipc=host \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        --name lmcache \
        --volume /data/models:/data/models \
        --entrypoint /bin/bash \
        lmcache/vllm-openai:v0.3.6

The other tests below also use environments created from this image; the test machine has an NVIDIA A100-SXM4-80GB.

  • Set environment variables
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Enable CPU memory backend
export LMCACHE_LOCAL_CPU=True # default
# 5GB of Pinned CPU memory
export LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 # default
  • Start the model service
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
    /data/models/Qwen2.5-7B-Instruct \
    --max-model-len 16384 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
  • First test
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate 16
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  73.02
Total input tokens:                      225502
Total generated tokens:                  202459
Request throughput (req/s):              14.02
Output token throughput (tok/s):         2772.73
Total Token throughput (tok/s):          5861.05
---------------Time to First Token----------------
Mean TTFT (ms):                          66.56
Median TTFT (ms):                        59.31
P99 TTFT (ms):                           163.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.03
Median TPOT (ms):                        21.67
P99 TPOT (ms):                           38.58
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.34
Median ITL (ms):                         16.06
P99 ITL (ms):                            79.21
==================================================
  • Second test
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.48
Total input tokens:                      225502
Total generated tokens:                  202372
Request throughput (req/s):              14.13
Output token throughput (tok/s):         2792.16
Total Token throughput (tok/s):          5903.45
---------------Time to First Token----------------
Mean TTFT (ms):                          38.53
Median TTFT (ms):                        37.94
P99 TTFT (ms):                           56.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.34
Median TPOT (ms):                        16.31
P99 TPOT (ms):                           19.28
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.31
Median ITL (ms):                         15.06
P99 ITL (ms):                            27.39
==================================================
  • Check the logs
(EngineCore_DP0 pid=3888) [2025-09-17 07:15:11,245] LMCache INFO: Stored 776 out of total 776 tokens. size: 0.0414 gb, cost 6.5508 ms, throughput: 6.3264 GB/s; offload_time: 2.4183 ms, put_time: 4.1324 ms (cache_engine.py:309:lmcache.v1.cache_engine)
(APIServer pid=3625) INFO 09-17 07:15:15 [loggers.py:123] Engine 000: Avg prompt throughput: 2302.0 tokens/s, Avg generation throughput: 2162.3 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 96.0%
  • Summary

In the first run, the mean TTFT was 66.56 ms; in the second run it dropped to 38.53 ms, an improvement of roughly 42%.
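For reference, the improvement figure is just the relative TTFT reduction between the cold (first) and warm (second) runs; the arithmetic, using awk for floating point in the shell:

```shell
# Relative TTFT reduction: (cold - warm) / cold.
awk 'BEGIN { cold = 66.56; warm = 38.53; printf "TTFT reduction: %.1f%%\n", (cold - warm) / cold * 100 }'
```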

3. Caching to Disk

  • Set environment variables
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_LOCAL_DISK="file:///data/models/lmcache/"
# 5GB of disk space
export LMCACHE_MAX_LOCAL_DISK_SIZE=5.0
export LMCACHE_LOCAL_CPU=False
export LMCACHE_EXTRA_CONFIG='{"use_odirect": true}'
export LMCACHE_USE_EXPERIMENTAL=True
  • Start the model service
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
    /data/models/Qwen2.5-7B-Instruct \
    --max-model-len 16384 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
  • First test
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate 16
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.98
Total input tokens:                      225502
Total generated tokens:                  202624
Request throughput (req/s):              14.03
Output token throughput (tok/s):         2776.24
Total Token throughput (tok/s):          5865.95
---------------Time to First Token----------------
Mean TTFT (ms):                          65.63
Median TTFT (ms):                        58.48
P99 TTFT (ms):                           163.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.66
Median TPOT (ms):                        21.40
P99 TPOT (ms):                           38.46
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.02
Median ITL (ms):                         15.94
P99 ITL (ms):                            77.89
==================================================
  • Second test
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.45
Total input tokens:                      225502
Total generated tokens:                  202228
Request throughput (req/s):              14.13
Output token throughput (tok/s):         2791.29
Total Token throughput (tok/s):          5903.82
---------------Time to First Token----------------
Mean TTFT (ms):                          37.53
Median TTFT (ms):                        37.06
P99 TTFT (ms):                           54.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.05
Median TPOT (ms):                        16.04
P99 TPOT (ms):                           18.66
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.01
Median ITL (ms):                         14.91
P99 ITL (ms):                            26.14
==================================================
  • Check the logs
(EngineCore_DP0 pid=2531) [2025-09-17 07:08:17,522] LMCache INFO: Stored 12 out of total 12 tokens. size: 0.0006 gb, cost 0.3068 ms, throughput: 2.0891 GB/s; offload_time: 0.2484 ms, put_time: 0.0584 ms (cache_engine.py:309:lmcache.v1.cache_engine)
(APIServer pid=2268) INFO 09-17 07:08:24 [loggers.py:123] Engine 000: Avg prompt throughput: 1464.7 tokens/s, Avg generation throughput: 1653.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 96.0%
  • View the cache files
ls -alh /data/models/lmcache/
-rw-r--r-- 1 root root  3.8M Sep 17 07:08 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f2b3b301b772c39.pt
-rw-r--r-- 1 root root   14M Sep 17 07:08 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f6af89a43044ea.pt
-rw-r--r-- 1 root root  3.9M Sep 17 07:14 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f80c6f8223a03ab.pt
-rw-r--r-- 1 root root  2.0M Sep 17 07:08 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@fe656443e3ca60c.pt
  • Summary

In the first run, the mean TTFT was 65.63 ms; in the second run it dropped to 37.53 ms, an improvement of roughly 43%.

4. Caching to Redis

  • Start Redis
nerdctl run -d --name redis -p 6379:6379 redis:7
  • Set environment variables
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Redis host
export LMCACHE_REMOTE_URL="redis://10.8.1.28:6379"
# Redis Sentinel hosts (for high availability)
# export LMCACHE_REMOTE_URL="redis-sentinel://localhost:26379,localhost:26380,localhost:26381"
# LMCache Server host
# export LMCACHE_REMOTE_URL="lm://localhost:65432"

# How to serialize and deserialize KV cache on remote transmission
export LMCACHE_REMOTE_SERDE="naive" # "naive" (default) or "cachegen"
  • Start the model service
export CUDA_VISIBLE_DEVICES=0
/opt/venv/bin/vllm serve \
    /data/models/Qwen2.5-7B-Instruct \
    --max-model-len 16384 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
  • First test
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate 16
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.92
Total input tokens:                      225502
Total generated tokens:                  201777
Request throughput (req/s):              14.04
Output token throughput (tok/s):         2767.02
Total Token throughput (tok/s):          5859.39
---------------Time to First Token----------------
Mean TTFT (ms):                          68.37
Median TTFT (ms):                        60.88
P99 TTFT (ms):                           168.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.44
Median TPOT (ms):                        21.99
P99 TPOT (ms):                           41.70
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.70
Median ITL (ms):                         16.05
P99 ITL (ms):                            82.06
==================================================
  • Second test
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.42
Total input tokens:                      225502
Total generated tokens:                  202202
Request throughput (req/s):              14.14
Output token throughput (tok/s):         2792.20
Total Token throughput (tok/s):          5906.14
---------------Time to First Token----------------
Mean TTFT (ms):                          34.24
Median TTFT (ms):                        34.04
P99 TTFT (ms):                           48.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.43
Median TPOT (ms):                        15.46
P99 TPOT (ms):                           17.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.39
Median ITL (ms):                         14.78
P99 ITL (ms):                            22.44
==================================================
  • Check the logs
(EngineCore_DP0 pid=6917) [2025-09-17 09:23:36,580] LMCache INFO: Storing KV cache for 12 out of 12 tokens (skip_leading_tokens=0) for request cmpl-benchmark-serving1023-0 (vllm_v1_adapter.py:988:lmcache.integration.vllm.vllm_v1_adapter)
(APIServer pid=6654) INFO 09-17 09:24:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.0%
  • Check the cache
nerdctl exec -it redis redis-cli KEYS "*"
  1) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@40584cb624749b3fkv_bytes"
  2) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-2beea002c97002f8metadata"
  3) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-187f929f91136dd6metadata"
  4) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-bce14c0b271abefmetadata"
  5) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@ce27b7747a0440fkv_bytes"
nerdctl exec -it redis redis-cli MEMORY USAGE "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@40584cb624749b3fkv_bytes"
(integer) 2097272

Each cached value is about 2 MB, comparable to the disk cache file sizes.
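These sizes line up with a per-token estimate for this model. Assuming Qwen2.5-7B's config (28 layers, 4 KV heads via GQA, head_dim 128) at fp16, a full 256-token chunk (LMCACHE_CHUNK_SIZE=256) works out to about 14 MiB, matching the largest disk cache file seen earlier, while a ~2 MiB value holds a partial chunk of a few dozen tokens:

```shell
PER_TOKEN=$((2 * 28 * 4 * 128 * 2))          # K and V per token, fp16
echo "full 256-token chunk: $((256 * PER_TOKEN / 1024 / 1024)) MiB"
echo "tokens in a 2 MiB value: about $((2097152 / PER_TOKEN))"
```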

