LLMs/LCM: Translation and Commentary on "CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving"
Overview: CacheGen is an effective solution for reducing the network delay of fetching and processing long contexts, thereby speeding up large language model serving and improving user experience. It exploits the distributional properties of the KV cache and dynamically adjusts the compression level to adapt to changes in network bandwidth, cutting bandwidth usage and transmission delay substantially while maintaining high response quality. CacheGen is complementary to other context-compression techniques and leaves room for further optimization and extension.
>> Background and pain points:
● The challenge of processing long contexts: As large language models (LLMs) take on complex tasks, their inputs often include long contexts that provide richer knowledge and better response quality. The entire context must be processed before any response can be generated, so long contexts greatly increase the computation of the prefill phase, lengthening the time to first token (TTFT) and hurting user experience.
● Network delay of transferring the KV cache: To reduce prefill delay, LLM systems typically cache the KV cache of a context so that later requests can reuse it. However, when the KV cache is not in local GPU memory it must be fetched from another machine, and because it consists of large tensors, the transfer introduces significant extra network delay and becomes the new bottleneck.
>> Solution: CacheGen is a fast context-loading module designed for LLM systems to reduce the network delay of fetching and processing long contexts. It uses two key techniques:
● KV cache encoding and decoding: A custom tensor encoder exploits the distributional properties of the KV cache to compress it into a more compact bitstream, reducing bandwidth usage.
● KV cache streaming: The compression level of the KV cache is adjusted as network bandwidth changes, keeping delay low and generation quality high.
>> Core steps:
KV cache encoding: Compute the deltas between the KV tensors of adjacent tokens, exploiting the locality across nearby tokens for compression. Apply different quantization levels to the delta tensors of different layers, with higher precision for shallower layers and lower precision for deeper layers. Compress the quantized deltas with an arithmetic coder into a compact bitstream (see the sketch after this list).
KV cache streaming and decoding: Split the long context into chunks and encode each chunk independently. Dynamically adjust the compression level of each chunk according to the available network bandwidth so that delay stays within the service-level objective (SLO). When bandwidth is insufficient, send some chunks as text and let the LLM on the receiving side recompute their KV cache.
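The sketch below illustrates this encoding pipeline under stated assumptions; it is not the authors' implementation. The per-layer bin widths are assumed to be profiled by the caller, and zlib stands in for the custom arithmetic coder described in the paper.

```python
import zlib
import torch

def encode_kv_cache(kv, bin_widths, anchor_stride=64):
    """Minimal sketch of a CacheGen-style encoder for one KV tensor (K or V).

    kv           : float tensor of shape [num_layers, num_tokens, hidden].
    bin_widths   : per-layer quantization bin widths; shallower layers get
                   smaller bins (higher precision), deeper layers larger bins.
    anchor_stride: every anchor_stride-th token is kept as a reference ("anchor");
                   other tokens store only their delta to that anchor, which
                   exploits the locality of KV values across nearby tokens.
    """
    num_layers, num_tokens, _ = kv.shape
    streams = []
    for layer in range(num_layers):
        x = kv[layer]                                     # [num_tokens, hidden]
        anchors = x[::anchor_stride]                      # reference tokens
        ref = anchors.repeat_interleave(anchor_stride, dim=0)[:num_tokens]
        deltas = x - ref                                  # small, sharply peaked values
        # Layer-wise quantization: round deltas to integer bin indices.
        q = torch.round(deltas / bin_widths[layer]).to(torch.int16)
        # Entropy-code the integer symbols. CacheGen uses a custom arithmetic
        # coder; zlib is only a stand-in entropy coder here.
        streams.append(zlib.compress(q.cpu().numpy().tobytes()))
    return streams
```

Because the deltas concentrate around zero, the entropy coder spends far fewer bits on them than on the raw KV values, which is where the compression gain comes from.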
>> Advantages:
● Significantly lower TTFT: Compared with loading the context as text, CacheGen reduces TTFT by 3.1-4.7×; compared with the KV quantization baseline, it reduces TTFT by 3.2-3.7×.
● Much lower bandwidth usage: Compared with the KV quantization baseline, CacheGen uses 3.5-4.3× less bandwidth at the same response quality.
● Complementary to other context-compression techniques: CacheGen can further compress the KV caches produced by context-compression methods such as H2O and LLMLingua, reducing bandwidth usage even more.
● Low decoding overhead: CacheGen's decoding runs accelerated on the GPU and is pipelined with transmission, so it has little impact on overall delay (a minimal pipelining sketch follows this summary).
Through these techniques, CacheGen substantially reduces the network delay of long-context processing without noticeably degrading generation quality.
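To make the pipelining idea concrete, the sketch below overlaps the network fetch of the next chunk with the decoding of the current one. The fetch_chunk and decode_chunk callables are hypothetical placeholders rather than CacheGen's API; only the overlap pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def load_kv_pipelined(chunk_ids, fetch_chunk, decode_chunk):
    """Overlap the network fetch of chunk i+1 with the decode of chunk i.

    fetch_chunk(chunk_id)   -> compressed bitstream (network-bound)
    decode_chunk(bitstream) -> KV tensors for that chunk (GPU-bound)
    """
    decoded = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, chunk_ids[0])   # prefetch the first chunk
        for next_id in chunk_ids[1:]:
            bitstream = pending.result()                   # wait for the current chunk
            pending = pool.submit(fetch_chunk, next_id)    # start fetching the next one
            decoded.append(decode_chunk(bitstream))        # decode while it downloads
        decoded.append(decode_chunk(pending.result()))     # decode the last chunk
    return decoded
```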
Translation and Commentary on "CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving"
Paper: https://arxiv.org/pdf/2310.07240
Date: October 11, 2023 (last updated July 19, 2024)
Authors: University of Chicago, Microsoft, and Stanford University
Abstract
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.
1 Introduction
With impressive generative quality, large language models (LLMs) are ubiquitously used [22, 38, 46, 128] in personal assistance, AI healthcare, and marketing. The wide use of LLM APIs (e.g., OpenAI GPT-4 [108]) and the industry-quality open-source models (e.g., Llama [129]), combined with popular application frameworks (e.g., HuggingFace [10], Langchain [83]), further boosts LLMs' popularity. To perform complex tasks, users or applications often prepend an LLM input with a long context containing thousands of tokens or more. For example, some context supplements user prompts with domain-knowledge text so that the LLM can generate responses using specific knowledge not embedded in the LLM itself. As another example, a user prompt can be supplemented with the conversation histories accumulated during the interactions between the user and the LLM. Though short inputs are useful [94, 124], longer inputs often improve response quality and coherence [31, 32, 35, 45, 67, 116, 130, 141], which has fueled the ongoing race to train LLMs that accept ever longer inputs, from 2K tokens in ChatGPT to 100K in Claude [24].

Using long contexts poses a challenge to the response generation latency, as no response can be generated until the whole context is loaded and processed by the LLM. The amount of computation in processing a long context grows super-linearly with the context length [31, 47, 116, 131, 150]. While some recent works increase the throughput of processing long context [17], the delay of processing the context can still be several seconds for long contexts (2 seconds for a 3K context) [17, 58]. In response, many systems reduce the context-processing delay by storing and reusing the KV cache of the context to skip redundant computation when the context is used again (e.g., [23, 58, 82, 156]). Yet, the KV cache of a reused context may not always be in local GPU memory when the next input comes; instead, the KV cache may need to be retrieved from another machine(s) first, causing extra network delays (Figure 1a). For instance, a database of background documents might reside in a separate storage service, and the documents (i.e., context) assisting LLM inference are only to be selected and fetched to the LLM when a relevant query is received [27, 31, 36, 84, 110].

The extra network delay for fetching the KV cache has not yet received much attention. Previous systems assume the KV cache of a context is always kept in the same GPU memory between different requests sharing the same context [58], or the KV cache is small enough to be sent quickly by a fast interconnection [111, 157]. Yet, as elaborated in §3, the delay for fetching a KV cache can be non-trivial, since a KV cache consists of large high-dimensional floating-point tensors, whose sizes grow with both the context length and model size and can easily reach 10s GB. The resulting network delay can be 100s milliseconds to over 10 seconds, hurting the interactive user experience [1, 2, 87]. In short, when loading contexts' KV cache from other machines, solely optimizing computational delay may cause higher response latency, as loading the KV cache increases the network delay.

There have been a few recent efforts to reduce the run-time size of KV cache in GPU memory in order to fit the memory limit or LLM's input limit. Some drop unimportant tokens from KV cache or context text [71, 72, 95, 153], and others apply smart quantization on KV cache tensor [62, 78, 97]. In contrast, we want to reduce the transmission-time size of KV cache to reduce the network delay. Thus, we do not need to keep the tensor format of KV cache and, instead, can encode it into more compact bitstreams.

We present CacheGen, a fast context-loading module in LLM systems for reducing the network delay in fetching and processing long contexts (Figure 1b). It entails two techniques.
KV cache encoding and decoding: CacheGen encodes a pre-computed KV cache into more compact bitstream representations, rather than keeping the tensor shapes of the KV cache. This greatly saves bandwidth and delays when sending a KV cache. Our KV cache encoder employs a custom quantization and arithmetic coding strategy to leverage the distributional properties of KV cache, such as locality of KV tensors across nearby tokens and different sensitivities towards quantization losses at different layers of a KV cache. Furthermore, the decoding (decompression) of KV caches is accelerated by a GPU-based implementation, and the decoding is pipelined with transmission to further reduce its impact on the overall inference delay.
KV cache streaming: CacheGen streams the encoded bitstreams of a KV cache in a way that adapts to changes in network conditions. Before a user query arrives, CacheGen splits a long context into chunks and encodes the KV of each chunk separately at various compression levels (similar to video streaming). When sending a context's KV cache, CacheGen fetches the chunks one by one and adapts the per-chunk compression level to maintain high generation quality while keeping the network delay within a Service-Level Objective (SLO). When the bandwidth is too low, CacheGen can also fall back to sending a chunk in text format and leave it to the LLM to recompute the KV cache of the chunk.
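To illustrate the adaptation logic, here is a minimal sketch that picks a compression level per chunk so the estimated transfer time fits a per-chunk share of the delay budget, and falls back to sending raw text (for the LLM to recompute) when even the most compressed bitstream would miss it. The function names, the dictionary layout of chunks, and the greedy budget split are illustrative assumptions, not CacheGen's actual interface or controller.

```python
def choose_chunk_action(encoded_sizes, text_prefill_s, bandwidth_bps, budget_s):
    """Pick what to send for one context chunk.

    encoded_sizes : dict {compression_level: size_in_bytes}; lower level = higher quality.
    text_prefill_s: estimated time (s) to recompute this chunk's KV from its text.
    bandwidth_bps : current bandwidth estimate in bytes per second.
    budget_s      : delay budget (derived from the SLO) for this chunk, in seconds.
    """
    # Try compression levels from highest quality to lowest.
    for level in sorted(encoded_sizes):
        transfer_s = encoded_sizes[level] / bandwidth_bps
        if transfer_s <= budget_s:
            return ("kv", level, transfer_s)
    # Even the smallest encoded KV misses the budget: fall back to sending the
    # chunk as text and let the LLM recompute its KV cache on the serving side.
    return ("text", None, text_prefill_s)


def stream_context(chunks, bandwidth_estimator, slo_s):
    """Greedy per-chunk adaptation: spread the remaining SLO budget over the chunks."""
    remaining = slo_s
    plan = []
    for i, chunk in enumerate(chunks):
        bw = bandwidth_estimator()                        # bytes/s, re-estimated per chunk
        per_chunk_budget = remaining / (len(chunks) - i)  # share of the remaining budget
        action = choose_chunk_action(chunk["sizes"], chunk["prefill_s"], bw, per_chunk_budget)
        remaining = max(remaining - action[2], 0.0)
        plan.append(action)
    return plan
```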
In short, unlike prior systems that optimize the KV cache in GPU memory, CacheGen focuses on the network delay for sending the KV cache. We compare CacheGen with a range of baselines, including KV quantization [120], loading contexts in text form, and state-of-the-art context compression [72, 153], using three popular LLMs of various sizes (from 7B to 70B) and four datasets of long contexts (662 contexts with 1.4K to 16K tokens). Table 1 gives a preview of the results. Our key findings are:
>> In terms of the delay of transmitting and processing contexts (i.e., time-to-first-token), CacheGen is 3.2-3.7× faster than the quantization baseline at similar generation quality (F1 score and perplexity), and 3.1-4.7× faster than loading the text contexts with less than 2% accuracy drop. Notably, compared with 8-bit quantization, a nearly lossless KV cache compression, CacheGen is still able to reduce the delay of loading context by 1.67-1.81×.
>> In terms of the bandwidth usage for sending KV cache, CacheGen achieves the same generation quality while using 3.5-4.3× less bandwidth than the quantization baseline.
>> When combined with the recent context compression methods [72, 153], CacheGen further reduces the bandwidth usage for sending their KV caches by 3.3-4.2×.
This work does not raise any ethical issues.
Figure 1: When the context is reused, CacheGen speeds up the sharing of its KV cache by compressing (encoding) the KV cache.
Table 1: Performance of CacheGen and the baselines on Mistral-7B with LongChat dataset [90]. Full results are shown in §7.
9 Discussion and Limitations
Compatibility with other KV-cache compression work: Emerging techniques like smart quantization [62, 78, 97] are complementary with CacheGen. After quantization, CacheGen can still apply delta encoding and arithmetic coding, as shown in Figure 10.

Incremental KV cache streaming: Future work includes extending CacheGen to stream KV caches incrementally, akin to Scalable Video Coding (SVC) [61], by initially sending low-quality KV caches and then incrementally improving quality by sending differences.

Context reuse in real-world LLM applications: In §2.2, we explain why contexts are likely reused across requests using anecdotal evidence, but unfortunately, few industry datasets exist to support it. Future work includes finding or creating such datasets.

Evaluation on higher-end GPUs: In §7, we use NVIDIA A40 GPUs to conduct the experiments. We acknowledge that with very high-power GPUs and relatively low bandwidth, CacheGen might not significantly improve over the text context baseline. Furthermore, due to GPU memory limitations, we have not evaluated our ideas on extra-large models such as OPT-175B. Evaluating CacheGen on more powerful GPUs and larger LLMs is left for future work.

Other system designs: §5 covers CacheGen's encoder and streamer design. Other aspects such as which storage device(s) to store KV cache, caching policies, and locating KV cache quickly are discussed in concurrent works [52, 74, 147]. We leave combining CacheGen with these works to future work.

Other limitations: Task-wise, we did not extensively evaluate CacheGen's performance on "free-text generation" tasks such as story generation because the quality metrics are less well-defined than the tasks in our evaluation. Network-wise, our network model does not include conditions with extremely high bandwidths. Additionally, not all LLM applications can cache KV features. Search-based apps, like Google and Bing, use real-time search results as context, and their volatile contexts will unlikely be reused unless for very popular search results. We expect future work to address these issues.
10 Conclusion
We present CacheGen, a context-loading module to minimize overall delays in fetching and processing contexts for LLMs. CacheGen reduces the bandwidth needed to transmit long contexts' KV cache through an encoder tailored to compress KV cache into compact bitstreams. Experiments across three models of various capacities and four datasets with various context lengths show that CacheGen reduces overall delays while maintaining high task performance.