1. Introduction to LMCache
TTFT (Time to First Token) is the time from sending a request until the model produces its first token. Because the prefill phase must encode the entire input context into a KV cache before generation can begin, producing the first token requires a large amount of computation, which makes TTFT high.
One way to reduce TTFT is to cache the KV cache computed during prefill: the next time the same context appears, the cached KV cache can be reused directly, cutting TTFT substantially.
For model inference, https://github.com/LMCache/LMCache is an open-source project built around exactly this kind of KV cache reuse. It supports storing the KV cache in many backends: memory, disk, Redis, GDS, Nixl, and more. See https://docs.lmcache.ai/kv_cache/storage_backends/index.html for details.
LMCache also provides a KV cache size calculator at https://lmcache.ai/kv_cache_calculator.html . Estimating with 4k Chinese characters (about 2k tokens), the KV cache takes 106 MB, so the storage overhead is substantial. Although LMCache supports eviction policies such as LRU, FIFO, LFU, and MRU, production deployments usually still pair it with a high-capacity storage backend such as Redis, 3FS, or large disks.
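As a rough cross-check of that figure, the per-token KV footprint can be computed from the model architecture. A minimal sketch in bash, assuming Qwen2.5-7B-Instruct's published configuration (28 layers, 4 KV heads under GQA, head_dim 128, bf16):

```bash
# per_token_bytes = 2 (K and V) x layers x kv_heads x head_dim x bytes_per_element
LAYERS=28; KV_HEADS=4; HEAD_DIM=128; BYTES=2
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))      # 57344 bytes, about 56 KiB
echo "2000 tokens: $((PER_TOKEN * 2000 / 1024 / 1024)) MiB"  # about 109 MiB
```

This lands in the same ballpark as the calculator's 106 MB; the exact number depends on the calculator's assumptions.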
The benchmarks below show what LMCache delivers in practice.
2. Caching to Memory
Start a container from the LMCache image:

```bash
nerdctl run -it \
  -p 8000:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name lmcache \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  lmcache/vllm-openai:v0.3.6
```
All subsequent tests use environments created from this same image; the test machine has an NVIDIA A100-SXM4-80GB GPU.
Clear any LMCache environment variables left over from earlier runs:

```bash
unset $(env | awk -F= '/^LMCACHE_/ {print $1}')
```
Then configure the CPU memory backend:

```bash
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 tokens per KV chunk
export LMCACHE_CHUNK_SIZE=256
# Enable the CPU memory backend
export LMCACHE_LOCAL_CPU=True  # default
# 50 GB of pinned CPU memory
export LMCACHE_MAX_LOCAL_CPU_SIZE=50  # default 5.0
```
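A quick sanity check that the variables are set in the current shell before launching vLLM (optional):

```bash
env | grep '^LMCACHE_'
```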
Start vLLM with the LMCache connector. vLLM's built-in prefix caching is disabled so that any speedup comes from LMCache alone:

```bash
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
  /data/models/Qwen2.5-7B-Instruct \
  --no-enable-prefix-caching \
  --max-model-len 16384 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
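Once the server is up, a small request against the OpenAI-compatible endpoint confirms it responds (a minimal smoke test; the prompt is arbitrary):

```bash
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/data/models/Qwen2.5-7B-Instruct", "prompt": "Hello", "max_tokens": 8}'
```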
Run the benchmark twice with the ShareGPT dataset; the first run populates the cache and the second run replays the same prompts against it:

```bash
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
```
First run (cold cache):

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.91
Total input tokens: 225502
Total generated tokens: 202560
Request throughput (req/s): 14.04
Output token throughput (tok/s): 2778.23
Total Token throughput (tok/s): 5871.13
---------------Time to First Token----------------
Mean TTFT (ms): 62.06
Median TTFT (ms): 55.99
P99 TTFT (ms): 140.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.90
Median TPOT (ms): 20.73
P99 TPOT (ms): 36.28
---------------Inter-token Latency----------------
Mean ITL (ms): 20.39
Median ITL (ms): 15.81
P99 ITL (ms): 72.54
==================================================
```
Second run (served from the CPU memory cache):

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.35
Total input tokens: 225502
Total generated tokens: 202945
Request throughput (req/s): 14.15
Output token throughput (tok/s): 2805.13
Total Token throughput (tok/s): 5922.04
---------------Time to First Token----------------
Mean TTFT (ms): 32.65
Median TTFT (ms): 32.43
P99 TTFT (ms): 44.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.00
Median TPOT (ms): 15.07
P99 TPOT (ms): 16.15
---------------Inter-token Latency----------------
Mean ITL (ms): 14.99
Median ITL (ms): 14.72
P99 ITL (ms): 19.05
==================================================
```
```
(EngineCore_DP0 pid=18318) [2025-09-18 05:07:16,918] LMCache INFO: Retrieved 776 out of total 776 out of total 776 tokens. size: 0.0414 gb, cost 2.0837 ms, throughput: 19.8891 GB/s; (cache_engine.py:519:lmcache.v1.cache_engine)
```
KV-cache activity shows up in the logs; here a full 776-token prefix is retrieved from CPU memory at nearly 20 GB/s.
| Metric | First run | Second run | Reduction |
|---|---|---|---|
| Mean TTFT | 62.06 ms | 32.65 ms | 47% |
| Mean TPOT | 20.90 ms | 15.00 ms | 28% |
| Mean ITL | 20.39 ms | 14.99 ms | 26% |
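A side note on sizing: with roughly 56 KiB of KV per token (from the estimate above), the 50 GB CPU budget holds on the order of 900k tokens:

```bash
# Tokens that fit in 50 GiB of pinned CPU memory at 57344 bytes per token
echo $(( 50 * 1024 * 1024 * 1024 / 57344 ))  # about 936k tokens
```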
3. Caching to Disk
Reset the environment first:

```bash
unset $(env | awk -F= '/^LMCACHE_/ {print $1}')
```
Then configure the local disk backend. Note that LMCACHE_EXTRA_CONFIG must be valid JSON; the quoting below keeps the inner double quotes intact:

```bash
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_LOCAL_DISK="file:///data/models/lmcache/"
# 50 GB of disk space
export LMCACHE_MAX_LOCAL_DISK_SIZE=50
# Disable the CPU memory backend so hits come from disk
export LMCACHE_LOCAL_CPU=False
# Use O_DIRECT for disk I/O
export LMCACHE_EXTRA_CONFIG='{"use_odirect": true}'
export LMCACHE_USE_EXPERIMENTAL=True
```
Start vLLM with the same connector configuration:

```bash
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
  /data/models/Qwen2.5-7B-Instruct \
  --no-enable-prefix-caching \
  --max-model-len 16384 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
And run the benchmark twice as before:

```bash
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
```
First run (cold cache):

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.92
Total input tokens: 225502
Total generated tokens: 202927
Request throughput (req/s): 14.04
Output token throughput (tok/s): 2783.03
Total Token throughput (tok/s): 5875.66
---------------Time to First Token----------------
Mean TTFT (ms): 63.79
Median TTFT (ms): 57.74
P99 TTFT (ms): 145.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 21.34
Median TPOT (ms): 21.15
P99 TPOT (ms): 37.31
---------------Inter-token Latency----------------
Mean ITL (ms): 20.78
Median ITL (ms): 15.88
P99 ITL (ms): 76.45
==================================================
```
Second run (served from disk):

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.49
Total input tokens: 225502
Total generated tokens: 201717
Request throughput (req/s): 14.13
Output token throughput (tok/s): 2782.66
Total Token throughput (tok/s): 5893.42
---------------Time to First Token----------------
Mean TTFT (ms): 39.40
Median TTFT (ms): 37.25
P99 TTFT (ms): 89.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.21
Median TPOT (ms): 15.99
P99 TPOT (ms): 20.42
---------------Inter-token Latency----------------
Mean ITL (ms): 16.17
Median ITL (ms): 14.85
P99 ITL (ms): 34.62
==================================================
```
The logs show chunks being retrieved from disk:

```
(EngineCore_DP0 pid=21129) [2025-09-18 05:23:16,568] LMCache INFO: Retrieved 12 out of total 12 out of total 12 tokens. size: 0.0006 gb, cost 0.6591 ms, throughput: 0.9724 GB/s; (cache_engine.py:519:lmcache.v1.cache_engine)
(APIServer pid=20851) INFO 09-18 05:23:22 [loggers.py:123] Engine 000: Avg prompt throughput: 1700.5 tokens/s, Avg generation throughput: 1847.9 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
```
The cache is written as .pt files on disk:

```bash
ls -alh /data/models/lmcache/
```

```
-rw-r--r-- 1 root root 14M Sep 18 05:20 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f2f002abf32763.pt
-rw-r--r-- 1 root root 14M Sep 18 05:20 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f838a2991593dd7.pt
-rw-r--r-- 1 root root 12M Sep 18 05:20 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@fb7cf79a0adacc1.pt
```
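To gauge the total on-disk footprint, standard tools suffice (nothing LMCache-specific):

```bash
du -sh /data/models/lmcache/        # total size of all cached chunks
ls /data/models/lmcache/ | wc -l    # number of chunk files
```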
| Metric | First run | Second run | Reduction |
|---|---|---|---|
| Mean TTFT | 63.79 ms | 39.40 ms | 38% |
| Mean TPOT | 21.34 ms | 16.21 ms | 24% |
| Mean ITL | 20.78 ms | 16.17 ms | 22% |
4. Caching to Redis
Start a Redis container to act as the remote backend:

```bash
nerdctl run -d --name redis -p 6379:6379 redis:7
```
Reset the LMCache variables again:

```bash
unset $(env | awk -F= '/^LMCACHE_/ {print $1}')
```
Then point LMCache at the Redis backend:

```bash
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 tokens per KV chunk
export LMCACHE_CHUNK_SIZE=256
# Redis host
export LMCACHE_REMOTE_URL="redis://x.x.x.x:6379"
# Redis Sentinel hosts (for high availability)
# export LMCACHE_REMOTE_URL="redis-sentinel://localhost:26379,localhost:26380,localhost:26381"
# LMCache Server host
# export LMCACHE_REMOTE_URL="lm://localhost:65432"
# How to serialize and deserialize KV cache on remote transmission
export LMCACHE_REMOTE_SERDE="naive"  # "naive" (default) or "cachegen"
```
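Before starting vLLM, it is worth confirming that Redis is reachable, for example via the container started above:

```bash
nerdctl exec -it redis redis-cli PING   # expect: PONG
```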
Start vLLM with the same connector configuration:

```bash
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
  /data/models/Qwen2.5-7B-Instruct \
  --no-enable-prefix-caching \
  --max-model-len 16384 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
And run the benchmark twice:

```bash
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
```
First run (cold cache):

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.90
Total input tokens: 225502
Total generated tokens: 202337
Request throughput (req/s): 14.05
Output token throughput (tok/s): 2775.41
Total Token throughput (tok/s): 5868.57
---------------Time to First Token----------------
Mean TTFT (ms): 67.79
Median TTFT (ms): 60.94
P99 TTFT (ms): 165.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 22.20
Median TPOT (ms): 21.75
P99 TPOT (ms): 40.42
---------------Inter-token Latency----------------
Mean ITL (ms): 21.43
Median ITL (ms): 15.96
P99 ITL (ms): 78.68
==================================================
```
Second run (served from Redis):

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.91
Total input tokens: 225502
Total generated tokens: 202978
Request throughput (req/s): 14.04
Output token throughput (tok/s): 2783.98
Total Token throughput (tok/s): 5876.88
---------------Time to First Token----------------
Mean TTFT (ms): 50.34
Median TTFT (ms): 39.07
P99 TTFT (ms): 142.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 18.80
Median TPOT (ms): 17.32
P99 TPOT (ms): 35.65
---------------Inter-token Latency----------------
Mean ITL (ms): 18.17
Median ITL (ms): 15.13
P99 ITL (ms): 66.43
==================================================
```
The logs show chunks being stored to the remote backend:

```
(EngineCore_DP0 pid=23013) [2025-09-18 05:29:58,971] LMCache INFO: Storing KV cache for 776 out of 776 tokens (skip_leading_tokens=0) for request cmpl-benchmark-serving1022-0 (vllm_v1_adapter.py:988:lmcache.integration.vllm.vllm_v1_adapter)
```
The cached chunks appear in Redis as pairs of metadata and kv_bytes keys:

```bash
nerdctl exec -it redis redis-cli KEYS "*"
```

```
4087) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-7542097a982a0d29metadata"
4088) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@ab0b65969d69b56metadata"
4089) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-49f1c05dccbcca9metadata"
4090) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-2ede777a488b6923kv_bytes"
4091) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-27a856291a779d38kv_bytes"
4092) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@7522166acc9c0267kv_bytes"
```
```bash
nerdctl exec -it redis redis-cli MEMORY USAGE "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-27a856291a779d38kv_bytes"
```

```
(integer) 524408
```
This particular key holds only about 512 KB, likely a short trailing chunk; a full 256-token chunk weighs in around 12-14 MB, consistent with the disk cache files seen earlier.
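For an aggregate view of the cache footprint in Redis, standard redis-cli commands work (a sketch; the key count depends on the run):

```bash
nerdctl exec -it redis redis-cli DBSIZE                                # number of cached keys
nerdctl exec -it redis redis-cli INFO memory | grep used_memory_human  # total memory used
```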
| Metric | First run | Second run | Reduction |
|---|---|---|---|
| Mean TTFT | 67.79 ms | 50.34 ms | 25% |
| Mean TPOT | 22.20 ms | 18.80 ms | 15% |
| Mean ITL | 21.43 ms | 18.17 ms | 15% |
5. Control Run Without LMCache
As a control, run the same benchmark against a plain vLLM image without LMCache:

```bash
nerdctl run -it \
  -p 8000:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name vllm \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  vllm/vllm-openai:v0.10.1.1
```
Start the server, again with prefix caching disabled:

```bash
export CUDA_VISIBLE_DEVICES=7
python3 -m vllm.entrypoints.openai.api_server \
  --model /data/models/Qwen2.5-7B-Instruct \
  --served-model-name /data/models/Qwen2.5-7B-Instruct \
  --port 8000 \
  --gpu_memory_utilization 0.8 \
  --max-model-len 4096 \
  --max-seq-len-to-capture 8192 \
  --max-num-seqs 128 \
  --enforce-eager \
  --no-enable-prefix-caching
```
Run the benchmark twice:

```bash
vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
```
First run:

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 73.42
Total input tokens: 225502
Total generated tokens: 203130
Request throughput (req/s): 13.95
Output token throughput (tok/s): 2766.85
Total Token throughput (tok/s): 5838.43
---------------Time to First Token----------------
Mean TTFT (ms): 61.55
Median TTFT (ms): 54.89
P99 TTFT (ms): 174.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 21.49
Median TPOT (ms): 21.27
P99 TPOT (ms): 36.38
---------------Inter-token Latency----------------
Mean ITL (ms): 21.00
Median ITL (ms): 16.70
P99 ITL (ms): 72.61
==================================================
```
Second run:

```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 73.84
Total input tokens: 225502
Total generated tokens: 203659
Request throughput (req/s): 13.87
Output token throughput (tok/s): 2758.13
Total Token throughput (tok/s): 5812.08
---------------Time to First Token----------------
Mean TTFT (ms): 59.70
Median TTFT (ms): 54.41
P99 TTFT (ms): 139.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 21.31
Median TPOT (ms): 21.08
P99 TPOT (ms): 36.62
---------------Inter-token Latency----------------
Mean ITL (ms): 20.78
Median ITL (ms): 16.63
P99 ITL (ms): 71.70
==================================================
```
| Metric | First run | Second run | Reduction |
|---|---|---|---|
| Mean TTFT | 61.55 ms | 59.70 ms | 3% |
| Mean TPOT | 21.49 ms | 21.31 ms | 1% |
| Mean ITL | 21.00 ms | 20.78 ms | 1% |

Without a cache, back-to-back runs differ by only a few percent, which is run-to-run noise rather than a real speedup.
6. Summary
This post used benchmarks to show what LMCache delivers, caching the KV cache to three backends: memory, disk, and Redis.
With the Qwen2.5-7B-Instruct model on an NVIDIA A100-SXM4-80GB, at a request rate of 16 requests per second, the results were:
| Cache backend | TTFT reduction | TPOT reduction | ITL reduction |
|---|---|---|---|
| Memory | 47% | 28% | 26% |
| Disk | 38% | 24% | 22% |
| Redis | 25% | 15% | 15% |
| No LMCache (control) | 3% | 1% | 1% |

The control row confirms the gains come from the cache rather than run-to-run variance, and the faster the backend (memory, then disk, then remote Redis), the larger the reduction.