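How to Use XGBoost for Time Series Forecasting

By Jason Brownlee on March 19, 2021 in XGBoost

XGBoost is an efficient implementation of gradient boosting for classification and regression problems.

It is both fast and efficient, performing well, if not the best, on a wide range of predictive modeling tasks, and it is a favorite among winners of data science competitions such as those on Kaggle.

XGBoost can also be used for time series forecasting, although the time series dataset must first be transformed into a supervised learning problem. The model must also be evaluated with a specialized technique called walk-forward validation, because evaluating it with k-fold cross-validation would give optimistically biased results.

In this tutorial, you will discover how to develop an XGBoost model for time series forecasting.

After completing this tutorial, you will know:

XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
Time series datasets can be transformed into supervised learning using a sliding window representation.
How to fit, evaluate, and make predictions with an XGBoost model for time series forecasting.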


Let's get started.

Update Aug/2020: Fixed a bug in the MAE calculation and updated the model configuration to make better predictions (thanks Kaustav!).


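Tutorial Overview

This tutorial is divided into three parts; they are:

XGBoost Ensemble
Time Series Data Preparation
XGBoost for Time Series Forecasting

XGBoost Ensemble

XGBoost is short for Extreme Gradient Boosting and is an efficient implementation of the stochastic gradient boosting machine learning algorithm.

The stochastic gradient boosting algorithm, also called gradient boosting machines or tree boosting, is a powerful machine learning technique that performs well or even best on a wide range of challenging machine learning problems.

"Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks."

Source: XGBoost: A Scalable Tree Boosting System, 2016.

It is an ensemble of decision trees in which new trees correct the errors of the trees that are already part of the model. Trees are added until no further improvement can be made to the model.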
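
The idea that each new tree corrects the errors of the trees before it can be sketched in plain Python. The example below is a simplified illustration of gradient boosting for squared error using one-split trees (stumps) on a single input. It is not XGBoost's actual algorithm, which adds regularization, second-order gradient information, and many optimizations. The names fit_stump and boost, the toy data, and the learning rate of 0.3 are assumptions made for this sketch.

```python
# Simplified gradient boosting for squared error (illustration only, not XGBoost itself).

def fit_stump(xs, residuals):
    """Fit a one-split tree to the residuals by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean of the targets
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # errors of the model so far
        tree = fit_stump(xs, residuals)                  # new tree targets those errors
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


# hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0, 2.5, 4.0, 6.5, 6.0, 8.0, 9.5, 9.0]

model = boost(xs, ys)
base = sum(ys) / len(ys)
mse_base = sum((y - base) ** 2 for y in ys) / len(ys)
mse_boost = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print('MSE of mean-only model: %.3f' % mse_base)
print('MSE after boosting: %.3f' % mse_boost)
```

Each round fits a stump to what the current ensemble still gets wrong, so the training error shrinks as trees are added.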

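XGBoost provides a highly optimized implementation of the stochastic gradient boosting algorithm and exposes a suite of model hyperparameters that control the training process.

"The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings."

Source: XGBoost: A Scalable Tree Boosting System, 2016.

XGBoost is designed for classification and regression on tabular datasets, although it can also be used for time series forecasting.

For more on gradient boosting and the XGBoost implementation, see the tutorial:

A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning

First, the XGBoost library must be installed.

You can install it using pip, as follows: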

sudo pip install xgboost


Once installed, you can confirm that the installation succeeded and that you are using a modern version by running the following code:

# xgboost
import xgboost
print("xgboost", xgboost.__version__)


Running the code, you should see the following version number or higher.

xgboost 1.0.1


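Although the XGBoost library has its own Python API, we can use XGBoost models with the scikit-learn API via the XGBRegressor wrapper class.

An instance of the model can be created and used for model evaluation just like any other scikit-learn class. For example: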

...
# define model
model = XGBRegressor()


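Now that we are familiar with XGBoost, let's look at how to prepare a time series dataset for supervised learning.

Time Series Data Preparation

Time series data can be phrased as supervised learning.

Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We do this by using previous time steps as input variables and the next time step as the output variable.

Let's make this concrete with an example. Imagine we have the following time series: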

time, measure
1, 100
2, 110
3, 108
4, 115
5, 120


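We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time step.

Reorganized in this way, the data would look as follows: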

X, y
?, 100
100, 110
110, 108
108, 115
115, 120
120, ?


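Note that the time column is dropped and that some rows of data, such as the first and the last, cannot be used to train a model.

This representation is called a sliding window, because the window of inputs and expected outputs is shifted forward through time to create new "samples" for a supervised learning model.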
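
The sliding window can be illustrated with a few lines of plain Python applied to the five-value series above. This is only a sketch; the function name sliding_window and its n_in parameter are assumptions for the illustration and are not part of the tutorial's own code.

```python
def sliding_window(series, n_in=1):
    """Return (inputs, output) pairs built from the n_in previous values."""
    samples = []
    for t in range(n_in, len(series)):
        inputs = series[t - n_in:t]   # the n_in observations before time t
        output = series[t]            # the observation to predict
        samples.append((inputs, output))
    return samples


measure = [100, 110, 108, 115, 120]
for inputs, output in sliding_window(measure, n_in=1):
    print(inputs, output)
```

Only complete rows are produced, which is why the rows containing "?" in the table above do not appear.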

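For more on the sliding window approach to preparing time series forecasting data, see the tutorial:

Time Series Forecasting as Supervised Learning

We can use the shift() function in Pandas to automatically create new framings of a time series problem, given the desired length of the input and output sequences.

This is a useful tool because it lets us explore different framings of a time series problem with machine learning algorithms to see which might result in a better-performing model.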
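
As a small illustration of what shift() does, the sketch below frames the five-value example series with two lagged inputs. The column labels "t-2", "t-1", and "t" are names chosen for this example only.

```python
from pandas import DataFrame, concat

df = DataFrame({"measure": [100, 110, 108, 115, 120]})

# shift(2) and shift(1) move values down the frame, so each row
# lines up the two previous observations with the current one
framed = concat([df.shift(2), df.shift(1), df], axis=1)
framed.columns = ["t-2", "t-1", "t"]

# rows containing NaN lack a full input window and are dropped
supervised = framed.dropna()
print(supervised)
```

Changing the number of shifted copies changes the framing, which makes it cheap to compare different input window lengths.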

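The function below takes a time series as a NumPy array with one or more columns and transforms it into a supervised learning problem with the specified number of inputs and outputs.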

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
	n_vars = 1 if type(data) is list else data.shape[1]
	df = DataFrame(data)
	cols = list()
	# input sequence (t-n, ... t-1)
	for i in range(n_in, 0, -1):
		cols.append(df.shift(i))
	# forecast sequence (t, t+1, ... t+n)
	for i in range(0, n_out):
		cols.append(df.shift(-i))
	# put it all together
	agg = concat(cols, axis=1)
	# drop rows with NaN values
	if dropnan:
		agg.dropna(inplace=True)
	return agg.values


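We can use this function to prepare a time series dataset for XGBoost.

For more on the step-by-step development of this function, see the tutorial:

How to Convert a Time Series to a Supervised Learning Problem in Python

Once the dataset is prepared, we must be careful about how it is used to fit and evaluate a model.

For example, it would not be valid to fit the model on data from the future and have it predict the past. The model must be trained on the past and predict the future.

This means that methods that randomize the dataset during evaluation, such as k-fold cross-validation, cannot be used. Instead, we must use a technique called walk-forward validation.

In walk-forward validation, the dataset is first split into train and test sets by selecting a cut point, e.g. all data except the last 12 days is used for training and the last 12 days is used for testing.

If we are interested in making a one-step forecast, e.g. one day ahead, we can evaluate the model by training it on the training dataset and predicting the first step in the test dataset. We then add the real observation from the test set to the training dataset, refit the model, and have the model predict the second step in the test dataset.

Repeating this process for the entire test dataset gives a one-step prediction for each test observation, from which an error measure can be calculated to evaluate the skill of the model.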
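
The procedure just described can be sketched in plain Python with a stand-in forecaster. In this illustration the "model" is simply the mean of the last three observations; the names walk_forward and mean_of_last_three and the toy series are assumptions for the sketch, not the tutorial's XGBoost code.

```python
def walk_forward(series, n_test, forecast_fn):
    """One-step walk-forward validation over the last n_test observations."""
    history = list(series[:-n_test])   # everything before the cut point
    test = list(series[-n_test:])
    predictions = []
    for actual in test:
        # "fit" on the history seen so far and predict the next step
        predictions.append(forecast_fn(history))
        # reveal the real observation before moving to the next step
        history.append(actual)
    mae = sum(abs(a - p) for a, p in zip(test, predictions)) / len(test)
    return mae, predictions


def mean_of_last_three(history):
    """Stand-in forecaster: average of the three most recent values."""
    return sum(history[-3:]) / 3


series = [10, 12, 11, 13, 14, 13, 15, 16]   # hypothetical data
mae, predictions = walk_forward(series, n_test=3, forecast_fn=mean_of_last_three)
print('MAE: %.3f' % mae)
```

The forecaster never sees an observation before it has been predicted, which is the property that k-fold cross-validation would violate.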

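For more on walk-forward validation, see the tutorial:

How To Backtest Machine Learning Models for Time Series Forecasting

The function below performs walk-forward validation.

It takes the entire supervised learning version of the time series dataset and the number of rows to use as the test set as arguments.

It then steps through the test set, calling the xgboost_forecast() function to make a one-step forecast. An error measure is calculated and the details are returned for analysis.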

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
	predictions = list()
	# split dataset
	train, test = train_test_split(data, n_test)
	# seed history with training dataset
	history = [x for x in train]
	# step over each time-step in the test set
	for i in range(len(test)):
		# split test row into input and output columns
		testX, testy = test[i, :-1], test[i, -1]
		# fit model on history and make a prediction
		yhat = xgboost_forecast(history, testX)
		# store forecast in list of predictions
		predictions.append(yhat)
		# add actual observation to history for the next loop
		history.append(test[i])
		# summarize progress
		print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
	# estimate prediction error
	error = mean_absolute_error(test[:, -1], predictions)
	return error, test[:, -1], predictions


The train_test_split() function is called to split the dataset into train and test sets.

We can define this function below.

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
	return data[:-n_test, :], data[-n_test:, :]
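
As a quick check of how this split behaves, the sketch below applies the same function to a small made-up NumPy array of 10 rows and 2 columns. The array contents are hypothetical and chosen only so the rows are easy to recognize.

```python
from numpy import arange

# same function as above, repeated so this snippet runs on its own
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]


data = arange(20).reshape(10, 2)      # hypothetical 10-row dataset
train, test = train_test_split(data, 3)
print(train.shape, test.shape)
print(test)
```

The last n_test rows are held back in their original order, with no shuffling. Note that n_test must be at least 1, since a value of 0 would make the training slice empty.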


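We can use the XGBRegressor class to make a one-step forecast.

The xgboost_forecast() function below implements this. It takes the training dataset and a test input row as input, fits a model, and makes a one-step prediction.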

# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
	# transform list into array
	train = asarray(train)
	# split into input and output columns
	trainX, trainy = train[:, :-1], train[:, -1]
	# fit model
	model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
	model.fit(trainX, trainy)
	# make a one-step prediction
	yhat = model.predict(asarray([testX]))
	return yhat[0]


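Now that we know how to prepare time series data for forecasting and how to evaluate an XGBoost model, we can look at using XGBoost on a real dataset.

XGBoost for Time Series Forecasting

In this section, we will explore how to use XGBoost for time series forecasting.

We will use a standard univariate time series dataset with the intent of using the model to make a one-step forecast.

You can use the code in this section as the starting point for your own project and easily adapt it for multivariate inputs, multivariate forecasts, and multi-step forecasts.

We will use the daily female births dataset, that is, the number of female births recorded each day over one year (1959).

You can download the dataset from here and place it in your current working directory with the filename "daily-total-female-births.csv".

The first few lines of the dataset look as follows: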

"Date","Births"
"1959-01-01",35
"1959-01-02",32
"1959-01-03",30
"1959-01-04",31
"1959-01-05",44
...

First, let's load and plot the dataset.

The complete example is listed below.

# load and plot the time series dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# plot dataset
pyplot.plot(values)
pyplot.show()


Running the example creates a line plot of the dataset.

We can see that there is no obvious trend or seasonality.

Line Plot of Daily Births Time Series Dataset

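A naive model achieves a MAE of about 6.7 births when predicting the last 12 days. This provides a performance baseline: a model with a lower MAE may be considered skillful.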
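
One common form of naive model is persistence, in which the forecast for each step is simply the previous observation. The sketch below shows how such a baseline MAE could be computed. It assumes persistence is the naive model meant here, uses only the five values from the dataset head shown earlier rather than the full file, and the function name persistence_mae is hypothetical. It therefore does not reproduce the 6.7 figure.

```python
def persistence_mae(series, n_test):
    """MAE of predicting each of the last n_test values with the value before it."""
    test = series[-n_test:]
    previous = series[-n_test - 1:-1]   # each test value's predecessor
    errors = [abs(actual - last) for actual, last in zip(test, previous)]
    return sum(errors) / len(errors)


# first five observations from the dataset head (not the full dataset)
births = [35, 32, 30, 31, 44]
baseline = persistence_mae(births, n_test=3)
print('Persistence MAE on the sample: %.3f' % baseline)
```

Running the same function on the full series with n_test=12 would give the baseline against which the XGBoost model is compared.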

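Next, we can evaluate the XGBoost model on the dataset when making one-step forecasts for the last 12 days of data.

We will use only the previous 6 time steps as input to the model and the default model hyperparameters, except that we will set the objective to 'reg:squarederror' (to avoid a warning message) and use 1,000 trees in the ensemble (to avoid underfitting).

The complete example is listed below.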

# forecast daily births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from matplotlib import pyplot

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
	n_vars = 1 if type(data) is list else data.shape[1]
	df = DataFrame(data)
	cols = list()
	# input sequence (t-n, ... t-1)
	for i in range(n_in, 0, -1):
		cols.append(df.shift(i))
	# forecast sequence (t, t+1, ... t+n)
	for i in range(0, n_out):
		cols.append(df.shift(-i))
	# put it all together
	agg = concat(cols, axis=1)
	# drop rows with NaN values
	if dropnan:
		agg.dropna(inplace=True)
	return agg.values

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
	return data[:-n_test, :], data[-n_test:, :]

# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
	# transform list into array
	train = asarray(train)
	# split into input and output columns
	trainX, trainy = train[:, :-1], train[:, -1]
	# fit model
	model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
	model.fit(trainX, trainy)
	# make a one-step prediction
	yhat = model.predict(asarray([testX]))
	return yhat[0]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
	predictions = list()
	# split dataset
	train, test = train_test_split(data, n_test)
	# seed history with training dataset
	history = [x for x in train]
	# step over each time-step in the test set
	for i in range(len(test)):
		# split test row into input and output columns
		testX, testy = test[i, :-1], test[i, -1]
		# fit model on history and make a prediction
		yhat = xgboost_forecast(history, testX)
		# store forecast in list of predictions
		predictions.append(yhat)
		# add actual observation to history for the next loop
		history.append(test[i])
		# summarize progress
		print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
	# estimate prediction error
	error = mean_absolute_error(test[:, -1], predictions)
	return error, test[:, -1], predictions

# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=6)
# evaluate
mae, y, yhat = walk_forward_validation(data, 12)
print('MAE: %.3f' % mae)
# plot expected vs predicted
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
pyplot.show()


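Running the example reports the expected and predicted values for each step in the test set, followed by the MAE for all predicted values.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

We can see that the model performs better than the naive model, achieving a MAE of about 5.96 births compared to 6.7 births.

Can you do better?
You can test different XGBoost hyperparameters and numbers of input time steps to see if you can achieve better performance. Share your results in the comments below.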

>expected=42.0, predicted=44.5
>expected=53.0, predicted=42.5
>expected=39.0, predicted=40.3
>expected=40.0, predicted=32.5
>expected=38.0, predicted=41.1
>expected=44.0, predicted=45.3
>expected=34.0, predicted=40.2
>expected=37.0, predicted=35.0
>expected=52.0, predicted=32.5
>expected=48.0, predicted=41.4
>expected=55.0, predicted=46.6
>expected=50.0, predicted=47.2
MAE: 5.957
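
As a worked check, the MAE can be recomputed by hand from the twelve expected and predicted values printed above. The sketch below does only that arithmetic. Because the printed predictions are rounded to one decimal place, the result (about 5.98) is close to, but not exactly, the reported 5.957; the difference is presumably due to that rounding.

```python
# values copied from the printed output above
expected = [42.0, 53.0, 39.0, 40.0, 38.0, 44.0, 34.0, 37.0, 52.0, 48.0, 55.0, 50.0]
predicted = [44.5, 42.5, 40.3, 32.5, 41.1, 45.3, 40.2, 35.0, 32.5, 41.4, 46.6, 47.2]

# MAE is the average of the absolute differences
absolute_errors = [abs(e - p) for e, p in zip(expected, predicted)]
mae = sum(absolute_errors) / len(absolute_errors)

print('MAE from rounded values: %.3f' % mae)
print('Largest single error: %.1f' % max(absolute_errors))
```

The largest single error comes from the step where 52 births were observed but 32.5 were predicted, which shows how a few large misses can dominate the average.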

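A line plot is created comparing the expected and predicted values for the last 12 days of the dataset.

This gives a visual sense of how well the model performed on the test set.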

Line Plot of Expected vs. Predicted Births Using XGBoost

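Once a final XGBoost model configuration is chosen, the model can be finalized and used to make predictions on new data.

This is called an out-of-sample forecast, i.e. a prediction beyond the end of the training dataset. It is made in the same way as a prediction during model evaluation, because we always want to evaluate a model using the same procedure that we expect to use when making predictions on new data.

The example below demonstrates fitting a final XGBoost model on all available data and making a one-step prediction beyond the end of the dataset.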

# finalize model and make a prediction for daily births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from xgboost import XGBRegressor

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
	n_vars = 1 if type(data) is list else data.shape[1]
	df = DataFrame(data)
	cols = list()
	# input sequence (t-n, ... t-1)
	for i in range(n_in, 0, -1):
		cols.append(df.shift(i))
	# forecast sequence (t, t+1, ... t+n)
	for i in range(0, n_out):
		cols.append(df.shift(-i))
	# put it all together
	agg = concat(cols, axis=1)
	# drop rows with NaN values
	if dropnan:
		agg.dropna(inplace=True)
	return agg.values

# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
train = series_to_supervised(values, n_in=6)
# split into input and output columns
trainX, trainy = train[:, :-1], train[:, -1]
# fit model
model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
model.fit(trainX, trainy)
# construct an input for a new prediction
row = values[-6:].flatten()
# make a one-step prediction
yhat = model.predict(asarray([row]))
print('Input: %s, Predicted: %.3f' % (row, yhat[0]))


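Running the example fits an XGBoost model on all available data.

A new row of input is prepared using the last 6 days of known data, and the next day beyond the end of the dataset is predicted.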

Input: [34 37 52 48 55 50], Predicted: 42.708



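Summary

In this tutorial, you discovered how to develop an XGBoost model for time series forecasting.

Specifically, you learned:

XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
Time series datasets can be transformed into supervised learning using a sliding window representation.
How to fit, evaluate, and make predictions with an XGBoost model for time series forecasting.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.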
