1. Introduction to LMCache
TTFT (Time To First Token) is the time from when a request is sent until the model produces its first token. Because the prefill phase must encode the entire input context into a KV cache before generation can start, producing the first token requires a large amount of computation, which drives TTFT up.
One way to lower TTFT is to cache the KV cache computed during prefill: the next time the same context arrives, the cached KV cache can be reused directly, cutting TTFT substantially.
For model inference, https://github.com/LMCache/LMCache is an open-source project built around exactly this kind of KV cache caching. It supports storing the KV cache in memory, on disk, or in backends such as Redis, GDS, and Nixl; see https://docs.lmcache.ai/kv_cache/storage_backends/index.html for details.
LMCache also provides a KV cache size calculator at https://lmcache.ai/kv_cache_calculator.html. Estimating with about 4k Chinese characters (roughly 2k tokens), the KV cache takes about 106 MB, so the storage overhead is substantial. Although LMCache supports LRU, FIFO, LFU, and MRU eviction policies, production deployments usually still need a large-capacity backend such as Redis, 3FS, or a big disk.
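That ~106 MB figure can be sanity-checked from the model architecture. A minimal sketch, assuming the published Qwen2.5-7B-Instruct config (28 layers, 4 KV heads via GQA, head_dim 128) and bf16 storage; these numbers are assumptions for illustration, not values taken from the calculator:

```python
# Rough KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# Layer/head/dim values are assumed from the published Qwen2.5-7B-Instruct config;
# bf16 stores 2 bytes per element.
def kv_cache_bytes_per_token(layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()
print(per_token)                 # 57344 bytes, ~56 KiB per token
print(2000 * per_token / 2**20)  # 109.375 -> ~109 MiB for a 2k-token context
```

The ~56 KiB/token estimate is also consistent with the server log later in this post, which reports 776 tokens stored as 0.0414 GB.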
The benchmarks below illustrate what LMCache delivers in practice.
2. Caching to memory
```bash
nerdctl run -it \
  -p 8000:8000 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --name lmcache \
  --volume /data/models:/data/models \
  --entrypoint /bin/bash \
  lmcache/vllm-openai:v0.3.6
```
All the other tests also use environments created from this image; the test device is an NVIDIA A100-SXM4-80GB.
```bash
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 tokens per KV chunk
export LMCACHE_CHUNK_SIZE=256
# Enable the CPU memory backend
export LMCACHE_LOCAL_CPU=True # default
# 5 GB of pinned CPU memory
export LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 # default
```
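With these settings, a back-of-the-envelope sketch of how far the 5 GB pinned-CPU budget stretches, assuming the ~56 KiB-per-token KV footprint estimated above for Qwen2.5-7B in bf16 (an assumption, not a measured value):

```python
# Capacity of the pinned CPU buffer, assuming ~57344 bytes of KV cache per token
# (estimated from the Qwen2.5-7B config, bf16).
BYTES_PER_TOKEN = 57344
CHUNK_TOKENS = 256        # LMCACHE_CHUNK_SIZE
BUDGET = 5.0 * 2**30      # LMCACHE_MAX_LOCAL_CPU_SIZE, 5 GiB

chunk_bytes = CHUNK_TOKENS * BYTES_PER_TOKEN
print(chunk_bytes / 2**20)         # 14.0 -> ~14 MiB per 256-token chunk
print(int(BUDGET // chunk_bytes))  # 365 chunks, i.e. ~93k tokens cacheable
```

So under these assumptions the default 5 GB buffer holds on the order of 90k tokens of context before eviction kicks in.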
```bash
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
  /data/models/Qwen2.5-7B-Instruct \
  --max-model-len 16384 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
```bash
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate 16
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 73.02
Total input tokens: 225502
Total generated tokens: 202459
Request throughput (req/s): 14.02
Output token throughput (tok/s): 2772.73
Total Token throughput (tok/s): 5861.05
---------------Time to First Token----------------
Mean TTFT (ms): 66.56
Median TTFT (ms): 59.31
P99 TTFT (ms): 163.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 22.03
Median TPOT (ms): 21.67
P99 TPOT (ms): 38.58
---------------Inter-token Latency----------------
Mean ITL (ms): 21.34
Median ITL (ms): 16.06
P99 ITL (ms): 79.21
==================================================
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.48
Total input tokens: 225502
Total generated tokens: 202372
Request throughput (req/s): 14.13
Output token throughput (tok/s): 2792.16
Total Token throughput (tok/s): 5903.45
---------------Time to First Token----------------
Mean TTFT (ms): 38.53
Median TTFT (ms): 37.94
P99 TTFT (ms): 56.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.34
Median TPOT (ms): 16.31
P99 TPOT (ms): 19.28
---------------Inter-token Latency----------------
Mean ITL (ms): 16.31
Median ITL (ms): 15.06
P99 ITL (ms): 27.39
==================================================
```
```
(EngineCore_DP0 pid=3888) [2025-09-17 07:15:11,245] LMCache INFO: Stored 776 out of total 776 tokens. size: 0.0414 gb, cost 6.5508 ms, throughput: 6.3264 GB/s; offload_time: 2.4183 ms, put_time: 4.1324 ms (cache_engine.py:309:lmcache.v1.cache_engine)
(APIServer pid=3625) INFO 09-17 07:15:15 [loggers.py:123] Engine 000: Avg prompt throughput: 2302.0 tokens/s, Avg generation throughput: 2162.3 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 96.0%
```
In the first run, mean TTFT was 66.56 ms; in the second run it dropped to 38.53 ms, an improvement of roughly 42%.
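The improvement figure is simply the relative drop in mean TTFT between the cold first run and the warm second run:

```python
# Relative TTFT improvement between a cold (first) and warm (second) benchmark run.
def ttft_gain(cold_ms, warm_ms):
    return (cold_ms - warm_ms) / cold_ms

print(round(ttft_gain(66.56, 38.53), 3))  # 0.421 -> ~42% for the CPU memory backend
```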
3. Caching to disk
```bash
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_LOCAL_DISK="file:///data/models/lmcache/"
# 5 GB of disk space
export LMCACHE_MAX_LOCAL_DISK_SIZE=5.0
export LMCACHE_LOCAL_CPU=False
# Note: the value must be valid JSON, so use double quotes inside
export LMCACHE_EXTRA_CONFIG='{"use_odirect": true}'
export LMCACHE_USE_EXPERIMENTAL=True
```
```bash
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
  /data/models/Qwen2.5-7B-Instruct \
  --max-model-len 16384 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
```bash
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate 16
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.98
Total input tokens: 225502
Total generated tokens: 202624
Request throughput (req/s): 14.03
Output token throughput (tok/s): 2776.24
Total Token throughput (tok/s): 5865.95
---------------Time to First Token----------------
Mean TTFT (ms): 65.63
Median TTFT (ms): 58.48
P99 TTFT (ms): 163.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 21.66
Median TPOT (ms): 21.40
P99 TPOT (ms): 38.46
---------------Inter-token Latency----------------
Mean ITL (ms): 21.02
Median ITL (ms): 15.94
P99 ITL (ms): 77.89
==================================================
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.45
Total input tokens: 225502
Total generated tokens: 202228
Request throughput (req/s): 14.13
Output token throughput (tok/s): 2791.29
Total Token throughput (tok/s): 5903.82
---------------Time to First Token----------------
Mean TTFT (ms): 37.53
Median TTFT (ms): 37.06
P99 TTFT (ms): 54.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.05
Median TPOT (ms): 16.04
P99 TPOT (ms): 18.66
---------------Inter-token Latency----------------
Mean ITL (ms): 16.01
Median ITL (ms): 14.91
P99 ITL (ms): 26.14
==================================================
```
```
(EngineCore_DP0 pid=2531) [2025-09-17 07:08:17,522] LMCache INFO: Stored 12 out of total 12 tokens. size: 0.0006 gb, cost 0.3068 ms, throughput: 2.0891 GB/s; offload_time: 0.2484 ms, put_time: 0.0584 ms (cache_engine.py:309:lmcache.v1.cache_engine)
(APIServer pid=2268) INFO 09-17 07:08:24 [loggers.py:123] Engine 000: Avg prompt throughput: 1464.7 tokens/s, Avg generation throughput: 1653.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 96.0%
```
```bash
ls -alh /data/models/lmcache/
```
```
-rw-r--r-- 1 root root 3.8M Sep 17 07:08 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f2b3b301b772c39.pt
-rw-r--r-- 1 root root 14M Sep 17 07:08 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f6af89a43044ea.pt
-rw-r--r-- 1 root root 3.9M Sep 17 07:14 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f80c6f8223a03ab.pt
-rw-r--r-- 1 root root 2.0M Sep 17 07:08 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@fe656443e3ca60c.pt
```
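These file sizes are consistent with the chunking. Assuming the ~56 KiB/token KV footprint estimated earlier for Qwen2.5-7B in bf16 (an estimate from the model config, not a measurement), a full 256-token chunk is about 14 MiB, matching the largest file, while smaller files would correspond to partial chunks at the end of a sequence:

```python
# Full vs partial chunk sizes on disk, assuming ~57344 bytes of KV cache per
# token (estimated for Qwen2.5-7B, bf16).
BYTES_PER_TOKEN = 57344
print(256 * BYTES_PER_TOKEN / 2**20)          # 14.0 -> a full 256-token chunk is ~14 MiB
print(round(3.8 * 2**20 / BYTES_PER_TOKEN))   # 69 -> a 3.8 MiB file holds roughly 69 tokens
```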
In the first run, mean TTFT was 65.63 ms; in the second run it dropped to 37.53 ms, an improvement of roughly 43%.
4. Caching to Redis
```bash
nerdctl run -d --name redis -p 6379:6379 redis:7
```
```bash
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 tokens per KV chunk
export LMCACHE_CHUNK_SIZE=256
# Redis host
export LMCACHE_REMOTE_URL="redis://10.8.1.28:6379"
# Redis Sentinel hosts (for high availability)
# export LMCACHE_REMOTE_URL="redis-sentinel://localhost:26379,localhost:26380,localhost:26381"
# LMCache Server host
# export LMCACHE_REMOTE_URL="lm://localhost:65432"
# How to serialize and deserialize the KV cache for remote transmission
export LMCACHE_REMOTE_SERDE="naive" # "naive" (default) or "cachegen"
```
```bash
export CUDA_VISIBLE_DEVICES=0
/opt/venv/bin/vllm serve \
  /data/models/Qwen2.5-7B-Instruct \
  --max-model-len 16384 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
```bash
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 1024 \
  --num-prompts 1024 \
  --request-rate 16
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.92
Total input tokens: 225502
Total generated tokens: 201777
Request throughput (req/s): 14.04
Output token throughput (tok/s): 2767.02
Total Token throughput (tok/s): 5859.39
---------------Time to First Token----------------
Mean TTFT (ms): 68.37
Median TTFT (ms): 60.88
P99 TTFT (ms): 168.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 22.44
Median TPOT (ms): 21.99
P99 TPOT (ms): 41.70
---------------Inter-token Latency----------------
Mean ITL (ms): 21.70
Median ITL (ms): 16.05
P99 ITL (ms): 82.06
==================================================
```
```
============ Serving Benchmark Result ============
Successful requests: 1024
Request rate configured (RPS): 16.00
Benchmark duration (s): 72.42
Total input tokens: 225502
Total generated tokens: 202202
Request throughput (req/s): 14.14
Output token throughput (tok/s): 2792.20
Total Token throughput (tok/s): 5906.14
---------------Time to First Token----------------
Mean TTFT (ms): 34.24
Median TTFT (ms): 34.04
P99 TTFT (ms): 48.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.43
Median TPOT (ms): 15.46
P99 TPOT (ms): 17.31
---------------Inter-token Latency----------------
Mean ITL (ms): 15.39
Median ITL (ms): 14.78
P99 ITL (ms): 22.44
==================================================
```
```
(EngineCore_DP0 pid=6917) [2025-09-17 09:23:36,580] LMCache INFO: Storing KV cache for 12 out of 12 tokens (skip_leading_tokens=0) for request cmpl-benchmark-serving1023-0 (vllm_v1_adapter.py:988:lmcache.integration.vllm.vllm_v1_adapter)
(APIServer pid=6654) INFO 09-17 09:24:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.0%
```
```bash
nerdctl exec -it redis redis-cli KEYS "*"
```
```
1) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@40584cb624749b3fkv_bytes"
2) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-2beea002c97002f8metadata"
3) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-187f929f91136dd6metadata"
4) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-bce14c0b271abefmetadata"
5) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@ce27b7747a0440fkv_bytes"
```
```bash
nerdctl exec -it redis redis-cli MEMORY USAGE "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@40584cb624749b3fkv_bytes"
(integer) 2097272
```
This cache entry is about 2 MB, comparable to the disk cache file sizes. In the Redis test, mean TTFT dropped from 68.37 ms in the first run to 34.24 ms in the second, an improvement of roughly 50%.