LLMs: Translation and Interpretation of "A Decoder-Only Foundation Model For Time-Series Forecasting"

处女座的程序猿 · Published in Shanghai on 2024-06-19

Overview: This paper proposes TimesFM, a time-series foundation model for zero-shot time-series forecasting.

Background and pain points: In recent years, deep learning models have become the dominant approach to time-series forecasting when sufficient training data is available, but they typically have to be trained separately on each dataset. Meanwhile, large pretrained language models in NLP keep improving on downstream tasks. Time-series data volumes, however, are hard to compare with the text corpora available in NLP, and time series have no well-defined vocabulary or grammar. Can a single foundation model be trained whose zero-shot forecasts on new, unseen datasets are comparable to models trained specifically for each dataset?

Solution

>> Built a large-scale time-series pretraining corpus containing real-world data (Google search trends, Wikipedia page views, etc.) and synthetic data.

>> Proposed the TimesFM model, pretrained with a decoder-style attention architecture and an input patching strategy.

>> The model has 200M parameters and the pretraining data is on the order of 100B timepoints, far smaller than large models in NLP.

>> On a range of unseen forecasting tasks, TimesFM's zero-shot accuracy approaches or exceeds the baseline models trained specifically for each task.

Key ideas

>> Input patches, analogous to words (tokens) in NLP, improve computational efficiency.

>> A decoder-style training strategy supports arbitrary input (context) lengths.

>> The output patch is longer than the input patch, reducing the number of autoregressive steps (see the sketch after this list).

>> A random masking strategy during training covers all possible input-window lengths.

>> Synthetic data adds diversity to the training corpus.

>> A relatively small model already achieves strong results, showing that time-series pretraining can pay off.
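The sketch below is a rough, illustrative rendering of how the ideas above fit together; it is not the actual TimesFM implementation. The patch lengths and the `dummy_decoder` stub are assumptions made for this example only: the context is split into non-overlapping input patches, and each decoding step emits an output patch that is longer than an input patch, so fewer autoregressive steps are needed to cover the forecast horizon.

```python
# Minimal sketch (NOT the TimesFM implementation) of patched, decoder-style
# autoregressive forecasting. INPUT_PATCH/OUTPUT_PATCH values are illustrative.
import numpy as np

INPUT_PATCH = 32    # length of one input patch (a "token" of the time series)
OUTPUT_PATCH = 128  # length of one output patch; longer than the input patch

def dummy_decoder(patches: np.ndarray) -> np.ndarray:
    """Stand-in for the pretrained decoder: maps a sequence of input patches
    to one output patch. Here it naively repeats the last observed value."""
    last_value = patches[-1, -1]
    return np.full(OUTPUT_PATCH, last_value)

def forecast(context: np.ndarray, horizon: int) -> np.ndarray:
    """Autoregressive forecast: each step appends OUTPUT_PATCH new points,
    so ceil(horizon / OUTPUT_PATCH) steps are needed instead of
    ceil(horizon / INPUT_PATCH)."""
    assert len(context) >= INPUT_PATCH, "need at least one full input patch"
    history = context.copy()
    generated = []
    while sum(len(g) for g in generated) < horizon:
        # Split (the tail of) the history into non-overlapping input patches.
        usable = len(history) - (len(history) % INPUT_PATCH)
        patches = history[-usable:].reshape(-1, INPUT_PATCH)
        out = dummy_decoder(patches)          # one output patch per step
        generated.append(out)
        history = np.concatenate([history, out])
    return np.concatenate(generated)[:horizon]

if __name__ == "__main__":
    t = np.arange(512, dtype=float)
    series = 10 + 0.05 * t + np.sin(2 * np.pi * t / 24)  # toy "hourly" series
    print(forecast(series, horizon=96)[:5])
```

With these toy numbers, a 96-step horizon is covered in a single decoding step rather than three, which is the efficiency argument behind making the output patch longer than the input patch.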

Main advantages

>> Handles different application settings uniformly, e.g., forecast horizons and temporal granularities.

>> Runs in zero-shot mode, so it can be applied directly without additional training.

>> Low adoption cost and small compute footprint.

>> Provides a viable template for time-series foundation models.

>> Helps increase the adoption of deep learning for time series in practice.


Translation and Interpretation of "A Decoder-Only Foundation Model For Time-Series Forecasting"

Paper: https://arxiv.org/abs/2310.10688

Date: April 17, 2024

Authors: Google Research

ABSTRACT

Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a decoder style attention model with input patching, using a large time-series corpus comprising both real-world and synthetic datasets. Experiments on a diverse set of previously unseen forecasting datasets suggests that the model can yield accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities.

1 Introduction

Time-series data is ubiquitous in various domains such as retail, finance, manufacturing, healthcare and natural sciences. In many of these domains, one of the most important use-cases of time-series data is forecasting. Time-series forecasting is critical to several scientific and industrial applications, like retail supply chain optimization, energy and traffic prediction, and weather forecasting. In recent times, Deep learning models [SFGJ20, OCCB19] have emerged as a popular approach for forecasting rich, multivariate, time-series data, often outperforming classical statistical approaches such as ARIMA or GARCH [BJ68]. In several forecasting competitions such as the M5 competition [MSA22] and IARAI Traffic4cast contest [KKN+21] deep network based solutions performed very well.

At the same time, we are witnessing a rapid progress in the Natural Language Processing (NLP) domain on large foundation models for downstream NLP tasks. Large language models (LLMs) are growing in popularity because they can be used to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way [RWC+19]. They are trained on massive amounts of data, which allows them to learn the patterns of human language. This makes them very powerful tools that can be used for a variety of downstream tasks, often in a zero-shot learning mode.

This motivates the question: “Can large pretrained models trained on massive amounts of time-series data learn temporal patterns that can be useful for time-series forecasting on previously unseen datasets?” In particular, can we design a time-series foundation model that obtains good zero-shot out-of-the-box forecasting performance? Such a pretrained time-series foundation model, if possible, would bring significant benefits for downstream forecasting users in terms of no additional training burden and significantly reduced compute requirements. It is not immediately obvious that such a foundation model for time-series forecasting is possible. Unlike in NLP, there is no well defined vocabulary or grammar for time-series. Additionally, such a model would need to support forecasting with varying history lengths (context), prediction lengths (horizon) and time granularities. Furthermore, unlike the huge volume of public text data for pretraining language models, vast amounts of time-series data is not readily available. In spite of these issues, we provide evidence to answer the above question in the affirmative. In particular, we design TimesFM, a single foundation model for time-series forecasting that, when applied to a variety of previously-unseen forecasting datasets across different domains, obtains close to state-of-the-art zero-shot accuracy (compared to the best supervised models trained individually for these datasets). Our model can work well across different forecasting history lengths, prediction lengths and time granularities at inference time. The key elements of our foundation model are twofold: 1) a large-scale time-series corpus built using both real-world (mostly time-series data from web search queries and Wikipedia page visits) and synthetic data, which meets the volume and diversity of data needed for training our foundation model, and 2) a decoder style attention architecture with input patching, that can be efficiently pre-trained on this time-series corpus.
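As a hedged illustration of the synthetic portion of such a pretraining corpus (the paper's actual generators, mixture weights and parameter ranges are not described in this excerpt), one could compose random trend, seasonality and noise components along the lines below; every parameter choice here is an assumption made for the example.

```python
# Rough sketch of generating synthetic series for pretraining: each series is
# a random linear trend plus a few random sinusoidal seasonalities plus noise.
# These generators and ranges are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic_series(length: int = 2048) -> np.ndarray:
    t = np.arange(length, dtype=float)
    trend = rng.normal(scale=0.01) * t                 # random linear trend
    seasonal = np.zeros(length)
    for _ in range(rng.integers(1, 4)):                # 1-3 seasonal components
        period = rng.choice([24, 168, 365])            # hourly/weekly/yearly-like periods
        amplitude = rng.uniform(0.5, 5.0)
        phase = rng.uniform(0, 2 * np.pi)
        seasonal += amplitude * np.sin(2 * np.pi * t / period + phase)
    noise = rng.normal(scale=rng.uniform(0.1, 1.0), size=length)
    return trend + seasonal + noise

corpus = [make_synthetic_series() for _ in range(4)]   # tiny toy corpus
print(len(corpus), corpus[0][:3])
```

In the actual corpus described above, such synthetic series would be mixed with the real-world data (search-query trends, Wikipedia page visits) to provide the volume and diversity needed for pretraining.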

Compared to the latest large language models, our time-series foundation model is much smaller in both parameter size (200M parameters) and pretraining data size (O(100B) timepoints); yet we show that even at such scales, it is possible to pretrain a practical foundation model for forecasting whose zero-shot performance comes close to the accuracy of fully-supervised approaches on a diverse set of time-series data. Our work also suggests that unlike recent work [GFQW23] that recommends Large Language Models such as GPT-3 and LLama-2 as out-of-the-box zero-shot forecasters, foundation models trained from scratch exclusively on time-series data can obtain much better zero-shot performance at a tiny fraction of its costs.

7 Conclusion

In this paper, we presented TimesFM, a practical foundation model for forecasting whose zero-shot performance comes close to the accuracy of fully-supervised forecasting models on a diverse set of time-series data. This model is pretrained on real-world and synthetic datasets comprising O(100B) timepoints. We discuss limitations and future work in more detail in Appendix A.1.
