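XGBoost for Regression

By Jason Brownlee on March 7, 2021 in XGBoost

Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.

Shortly after its development and initial release, XGBoost became the go-to method, and often the key component, in winning solutions for a range of machine learning competitions.

Regression predictive modeling problems involve predicting a numerical value such as a dollar amount or a height. XGBoost can be used directly for regression predictive modeling.

In this tutorial, you will discover how to develop and evaluate XGBoost regression models in Python.

After completing this tutorial, you will know:

XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
How to fit a final model and use it to make a prediction on new data.

Let's get started.

Photo by chas B, some rights reserved.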

Tutorial Overview

This tutorial is divided into three parts; they are:

Extreme Gradient Boosting
XGBoost Regression API
XGBoost Regression Example

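Extreme Gradient Boosting

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, "gradient boosting," as the loss is minimized by following its gradient while the model is fit, much like a neural network.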
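To make the idea concrete, here is a minimal sketch of gradient boosting for squared-error loss on a single input feature, written in plain Python. It is an illustration only and not how XGBoost is implemented: the helper names (fit_stump, fit_boosting, predict_boosting, mae), the toy data, and the use of one-split trees ("stumps") are all assumptions made for this example. For squared error, the negative gradient of the loss with respect to the current prediction is the residual, so each new stump is fit to the residuals of the ensemble so far.

```python
# Minimal gradient boosting sketch for squared-error loss (illustrative only).

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # the largest value is skipped so the right-hand side is never empty
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def fit_boosting(x, y, n_estimators=50, learning_rate=0.3):
    """Add stumps one at a time, each fit to the current residuals."""
    base = sum(y) / len(y)  # initial prediction: the mean of the target
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # for squared error, the negative gradient is the residual
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # shrink each stump's contribution by the learning rate
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, learning_rate, stumps


def predict_boosting(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# toy data (assumption): a noiseless quadratic relationship
x = [float(i) for i in range(1, 11)]
y = [xi ** 2 for xi in x]

model = fit_boosting(x, y)
baseline_mae = mae(y, [sum(y) / len(y)] * len(y))
boosted_mae = mae(y, [predict_boosting(model, xi) for xi in x])
print('MAE predicting the mean: %.3f' % baseline_mae)
print('MAE after boosting: %.3f' % boosted_mae)
```

The n_estimators and learning_rate arguments play the same role here as the hyperparameters of the same name in boosting libraries: more trees and a smaller learning rate trade training time for a smoother fit.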

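For more on gradient boosting, see the tutorial:

A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled "XGBoost: A Scalable Tree Boosting System."

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost. [...] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

- XGBoost: A Scalable Tree Boosting System, 2016.

Now that we are familiar with what XGBoost is and why it is important, let's take a closer look at how we can use it in our regression predictive modeling projects.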

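XGBoost Regression API

XGBoost can be installed as a standalone library, and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip Python package manager on most platforms; for example: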

sudo pip install xgboost

You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.

# check xgboost version
import xgboost
print(xgboost.__version__)

Running the script will print the version of the XGBoost library you have installed.

Your version should be the same or higher. If not, you must upgrade your version of the XGBoost library.

1.1.1
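If you want to check the "same or higher" condition in code, note that comparing version strings directly can mislead ('1.10.0' sorts before '1.9.0' as text). The sketch below compares the numeric parts instead. The helper name version_tuple and the hard-coded installed value are assumptions for illustration; in practice you would pass xgboost.__version__.

```python
import re


def version_tuple(version):
    """Convert a version string such as '1.1.1' into a tuple of integers."""
    parts = []
    for part in version.split('.')[:3]:
        # keep only the leading digits, so '0rc1' is read as 0
        match = re.match(r'\d+', part)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


minimum = '1.1.1'
installed = '1.3.3'  # assumption: replace with xgboost.__version__
if version_tuple(installed) >= version_tuple(minimum):
    print('xgboost %s is recent enough' % installed)
else:
    print('xgboost %s is older than %s, consider upgrading' % (installed, minimum))
```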

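It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install in the pip command, as follows: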

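sudo pip install xgboost==1.0.1

If you require specific instructions for your development environment, see the tutorial:

XGBoost Installation Guide

The XGBoost library has its own custom API, although we will use it via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

An XGBoost regression model can be defined by creating an instance of the XGBRegressor class; for example: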

...
# create an xgboost regression model
model = XGBRegressor()

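You can specify hyperparameter values to the class constructor to configure the model.

Perhaps the most commonly configured hyperparameters are the following:

n_estimators: The number of trees in the ensemble, often increased until no further improvements are seen.
max_depth: The maximum depth of each tree, often a value between 1 and 10.
eta: The learning rate used to weight each model, often set to small values such as 0.3, 0.1, 0.01, or smaller (the scikit-learn wrapper also accepts this as learning_rate).
subsample: The fraction of samples (rows) used in each tree, set to a value between 0 and 1, often 1.0 to use all samples.
colsample_bytree: The fraction of features (columns) used in each tree, set to a value between 0 and 1, often 1.0 to use all features.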

For example:

...
# create an xgboost regression model
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)

Good hyperparameter values can be found by trial and error for a given dataset, or by systematic experimentation such as a grid search across a range of values.

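Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it may produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.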

Let's take a look at how to develop an XGBoost ensemble for regression.

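XGBoost Regression Example

In this section, we will look at how we might develop an XGBoost model for a standard regression predictive modeling dataset.

First, let's introduce a standard regression dataset.

We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve an MAE of about 1.9 on the same test harness. This provides the bounds of expected performance on this dataset.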
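The tutorial does not say which naive model was used. A common choice, assumed here, is to always predict the median of the training targets. The sketch below shows how such a baseline could be scored with repeated 10-fold cross-validation in plain Python. The repeated_kfold helper is hypothetical and the target values are synthetic, so the result will not reproduce the 6.6 figure quoted above.

```python
import random
from statistics import mean, median


def repeated_kfold(n_samples, n_splits=10, n_repeats=3, seed=1):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        # reshuffle the rows before each repeat so the folds differ
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_splits):
            # every n_splits-th shuffled index forms one test fold
            test = indices[fold::n_splits]
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


# synthetic target values (assumption: stands in for the house prices)
data_rng = random.Random(7)
y = [data_rng.gauss(22.0, 9.0) for _ in range(506)]

scores = []
for train, test in repeated_kfold(len(y)):
    # the naive model ignores the inputs and predicts the training median
    guess = median(y[i] for i in train)
    scores.append(mean(abs(y[i] - guess) for i in test))

print('Naive MAE: %.3f over %d folds' % (mean(scores), len(scores)))
```

Any real model should beat this number on the same folds; if it does not, the model has no skill on the problem.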

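The dataset involves predicting the house price given details of the house's suburb in the American city of Boston.

There is no need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.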

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())

Running the example confirms the 506 rows of data, with 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

(506, 14)
        0     1     2   3      4      5   ...  8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  ...   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  ...   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  ...   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  ...   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  ...   3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]

Next, let's evaluate a regression XGBoost model with default hyperparameters on the problem.

First, we can split the loaded dataset into input and output columns for training and evaluating a predictive model.

...
# retrieve the NumPy array behind the DataFrame
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]

Next, we can create an instance of the model with a default configuration.

...
# define model
model = XGBRegressor()

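We will evaluate the model using the best practice of repeated k-fold cross-validation with 3 repeats and 10 folds.

This can be achieved by using the RepeatedKFold class to configure the evaluation procedure and calling cross_val_score() to evaluate the model using the procedure and collect the scores.

Model performance will be evaluated using mean absolute error (MAE). Note that MAE is made negative in the scikit-learn library so that it can be maximized. As such, we can ignore the sign and assume all errors are positive.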
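The sign convention is easier to see with numbers. The sketch below computes MAE by hand for two hypothetical sets of predictions and then negates it, as scikit-learn does for the 'neg_mean_absolute_error' scorer. The mae helper and the prediction values are assumptions for illustration; the target values are the first four house prices shown in the dataset summary above.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [24.0, 21.6, 34.7, 33.4]
good = [23.0, 22.6, 33.7, 34.4]  # hypothetical: each prediction is off by 1
poor = [20.0, 25.6, 30.7, 37.4]  # hypothetical: each prediction is off by 4

# negating the error turns "smaller is better" into "larger is better"
neg_good = -mae(y_true, good)
neg_poor = -mae(y_true, poor)
print('good model: MAE %.3f, negated score %.3f' % (mae(y_true, good), neg_good))
print('poor model: MAE %.3f, negated score %.3f' % (mae(y_true, poor), neg_poor))
print('higher negated score wins:', neg_good > neg_poor)
```

Because the better model has the higher (less negative) score, tools that maximize a score can be reused unchanged for error metrics.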

...
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

Once evaluated, we can report the estimated performance of the model when used to make predictions on new data for this problem.

In this case, because the scores were made negative, we can use the absolute() NumPy function to make the scores positive.

We then report a statistical summary of the performance using the mean and standard deviation of the distribution of scores, another good practice.

...
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )

Tying this together, the complete example of evaluating an XGBoost model on the housing regression predictive modeling problem is listed below.

# evaluate an xgboost regression model on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )

Running the example evaluates the XGBoost regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved an MAE of about 2.1.

This is a good score, better than the baseline of about 6.6, meaning the model has skill, and close to the best score of 1.9.

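Mean MAE: 2.109 (0.320)

We may decide to use the XGBoost regression model as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

For example: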

...
# make a prediction
yhat = model.predict(new_data)

We can demonstrate this with a complete example, listed below.

# fit a final xgboost model on the housing dataset and make a prediction
from numpy import asarray
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split dataset into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
new_data = asarray([row])
# make a prediction
yhat = model.predict(new_data)
# summarize prediction
print('Predicted: %.3f' % yhat[0])

Running the example fits the model and makes a prediction for the new row of data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model predicted a value of about 24.

Predicted: 24.019

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

XGBoost: A Scalable Tree Boosting System, 2016.

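Summary

In this tutorial, you discovered how to develop and evaluate XGBoost regression models in Python.

Specifically, you learned:

XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
How to fit a final model and use it to make a prediction on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.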


35 Responses to XGBoost for Regression

Anthony The Koala March 12, 2021 at 5:45 am #

Dear Dr Jason,
The current version of xgboost is:

import xgboost
xgboost.__version__
'1.3.3'

To upgrade or install:

pip install -U xgboost --upgrade

Thank you,
Anthony of Sydney

- Jason Brownlee March 12, 2021 at 6:07 am #
  
  Well done!
  
- Nicholas Roth January 28, 2023 at 8:26 am #
  
  This:
  # split data into input and output columns
  X, y = data[:, :-1], data[:, -1]
  
  should be this:
  # split data into input and output columns
  X, y = data.iloc[:, :-1], data.iloc[:, -1]

Anthony The Koala March 12, 2021 at 5:20 pm #

Dear Dr Jason,
Can XGBoost be used in combination with SVM and random forest classifiers?
Thank you,
Anthony of Sydney

Jason Brownlee March 13, 2021 at 5:25 am #

I don't see why not.

Anthony The Koala March 13, 2021 at 5:03 pm #

Dear Dr Jason,
There are two ways to implement a random forest ensemble: XGBoost's XGBRFClassifier and RandomForestClassifier from sklearn.ensemble, based on the following tutorials:

https://machinelearning.org.cn/random-forest-ensembles-with-xgboost
https://machinelearning.org.cn/random-forest-ensemble-in-python/

The program:

# evaluate xgboost random forest algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBRFClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
#model = XGBRFClassifier(n_estimators=100, subsample=0.9, colsample_bynode=0.2)
#experimenting 
#increasing n_estimators does not improve the accuracy. Same as n_estimators=100
#model = XGBRFClassifier(n_estimators=200, subsample=0.9, colsample_bynode=0.2)
#Changing subsample away from 0.9 decreases accuracy
#Changing colsample_bynode between 0.25 to 0.29 improves accuracy to 0.896
model = XGBRFClassifier(n_estimators=100, subsample=0.9, colsample_bynode=0.28)

# define the model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print("using xgboost's randomforest classifer XGBRFClassifier")
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
#model.fit(X, y)
## make a single prediction
row = [[-8.52381793,5.24451077,-12.14967704,-2.92949242,0.99314133,0.67326595,-0.38657932,1.27955683,-0.60712621,3.20807316,0.60504151,-1.38706415,8.92444588,-7.43027595,-2.33653219,1.10358169,0.21547782,1.05057966,0.6975331,0.26076035]]
from numpy import asarray
row = asarray(row)
#yhat = model.predict(row)
#print('Predicted Class: %d' % yhat[0])
print()
print()
#Now doing the same with sklearn's
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
# define the model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print("using sklearn's  randomforest classifer RandomForestClassifier")
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model.fit(X, y)
## make a single prediction
#row = already define
yhat = model.predict(row)
print('Predicted Class: %d' % yhat[0])

Results:

using xgboost's randomforest classifer XGBRFClassifier
Mean Accuracy: 0.896 (0.037)


using sklearn's  randomforest classifer RandomForestClassifier
Mean Accuracy: 0.917 (0.031)
Predicted Class: 1

Notes:
* sklearn's RandomForestClassifier had the highest accuracy at 0.917, compared with 0.896 for XGBoost's XGBRFClassifier.
- To maximize the accuracy of XGBRFClassifier, the colsample and subsample parameters need to be tuned.
- The best value for subsample was 0.9. Moving subsample away from 0.9 decreased the accuracy.
- Adjusting colsample between 0.25 and 0.29 improved the accuracy from 0.894 to 0.896.

Conclusion: when implementing a random forest classifier, sklearn's version was more accurate than XGBoost's version.

Other remarks that I cannot explain:
* When implementing XGBoost's random forest classifier model, calling model.fit(X,y) in order to predict yhat makes the program "complain". Please see my comment at https://machinelearning.org.cn/random-forest-ensemble-in-python/, posted around 16:00 on March 13, 2021.

The error that I get when copying the identical code and trying to do model.fit(X,y) is:
***
[16:58:45] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
***

Other notes:
numpy.__version__; sklearn.__version__; xgboost.__version__;"....respectively"
'1.20.1'
'0.23.2'
'1.3.3'
'....respectively'

Thank you,
Anthony of Sydney

Jason Brownlee March 14, 2021 at 5:24 am #

Great experiment!

Note, RandomForestClassifier does not use xgboost.

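- Anthony The Koala March 14, 2021 at 1:01 pm #
  
  Dear Dr Jason,
  
  Although my experiment did not prove that XGBoost's random forest classifier ("rfc") is worse than sklearn's, for this particular data and feature set sklearn's rfc did perform slightly better than XGBoost's rfc.
  
  In other words, there may be other conditions that produce the opposite result, where XGBoost's rfc is better than sklearn's rfc.
  
  Conclusion: if modeling with an rfc, try both XGBoost and sklearn and choose the one that performs best.
  
  Thank you,
  Anthony of Sydney
- Jason Brownlee March 15, 2021 at 5:52 am #
  
  Great suggestion.
- Anthony The Koala March 18, 2021 at 4:13 pm #
  
  Dear Dr Jason,
  Regarding your reply "Note, RandomForestClassifier does not use xgboost": are there any packages that are not part of xgboost but make use of xgboost's "...implementation of gradient boosted decision trees designed for speed and performance..." for "...structured or tabular data..."?
  
  Reference: https://machinelearning.org.cn/gentle-introduction-xgboost-applied-machine-learning/
  
  For example, can I
  * use sklearn.svm.SVR with xgboost in order to use xgboost's gradient boosted decision trees?
  * use sklearn.neighbors.KNeighborsRegressor with xgboost in order to use xgboost's gradient boosted decision trees?
  * use sklearn.tree.DecisionTreeRegressor with xgboost in order to use xgboost's gradient boosted decision trees?
  
  Thank you,
  Anthony of Sydney
- Jason Brownlee March 19, 2021 at 6:16 am #
  
  Not as far as I know, xgboost is specific to decision trees.
- Anthony The Koala March 19, 2021 at 7:24 am #
  
  Dear Dr Jason,
  Thank you for your reply.
  When you say "...xgboost is specific to decision trees...", do you mean the specific decision trees found in the xgboost module?
  Thank you,
  Anthony of Sydney
- Jason Brownlee March 19, 2021 at 7:51 am #
  
  No, but I'm sure that is true as well.
- Anthony The Koala March 19, 2021 at 9:50 pm #
  
  Dear Jason,
  Let me put it more clearly:
  is there a way to combine xgboost's gradient boosting algorithm with sklearn's
  sklearn.tree.DecisionTreeClassifier?
  Thank you,
  Anthony of Sydney
- Jason Brownlee March 20, 2021 at 5:21 am #
  
  Not as far as I know.
- Anthony The Koala March 20, 2021 at 5:42 am #
  
  Dear Dr Jason,
  Thank you for your reply and your patience,
  Anthony of Sydney
- Jason Brownlee March 21, 2021 at 6:00 am #
  
  You're welcome.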

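Matthias March 22, 2021 at 8:21 pm #

Hello Dr. Brownlee,

I have long been looking for a suitable model for a regression problem with many inputs. I have now tested XGBoost as well. The results on the training data are very good. The results on the separate test data are worse. On validation data (real data) that is very similar to the training data, the results are poor. I think I am seeing overfitting. The results with RandomForestRegressor are very similar. If this is overfitting, do you have any suggestions for avoiding it?
Best regards

Matthias

- Jason Brownlee March 23, 2021 at 4:56 am #
  
  Perhaps the test set is too small or not representative? Perhaps you can try repeated k-fold cross-validation to estimate model performance?
  
Matthias March 24, 2021 at 7:16 pm #

You may be right, even though I think the validation data differs very little from the training data and there is actually a lot of test data. But there must be some reason. I will repeat the CV again.
Thank you very much!
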
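Tom April 19, 2021 at 7:44 pm #

Hello Jason, thank you for this post and your other tutorials.

In the final code for
# evaluate an xgboost regression model on the housing dataset
I do understand that sklearn is used for the evaluation => model = XGBRegressor(), where XGBRegressor() has default parameter values.

However, in the second final code for
# fit a final xgboost model on the housing dataset and make a prediction
I do not understand how a final XGBOOST model is arrived at.

OK, I assume the word "final" should perhaps be replaced with "default"?

If I guessed right, how is a FINAL model arrived at in the real world?
Is this related to parameter tuning?

Thanks

- Jason Brownlee April 20, 2021 at 5:56 am #
  
  Final here means the model that is fit on all of the data and used to make predictions on new data.
  
  Indeed, in most cases you will want to tune the hyperparameters.
  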
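ttbek November 1, 2021 at 2:50 am #

I don't think it makes sense to cross-validate on all of the data here without holding out a test set. I suppose it does if we assume we are building a final production model, but that is not the assumption we use when comparing models. The housing dataset is particularly sensitive to this because it has outliers, and including them only in the training set or only in the test set makes a big difference compared with being able to include them in both the training and "test" sets (as you do with cross-validation). Perhaps I missed the part of the code that holds out a test set, or I don't understand everything that is done in RepeatedKFold?

I am curious about the following: "Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset."

Are these numbers from your own experiments without a held-out test set? Which model is the "naive" model here? I haven't seen a result of 1.9 on a held-out test set elsewhere, so it would be good if you have a reference (I don't follow the housing dataset competitions closely, but I am trying to understand how the method I currently use compares; I think the average MAE across my runs is about 3, with one outlier run as low as 2.3408, given the randomness involved in the sampling). So it may sometimes do better with a held-out test set than the untuned XGBoost result; for example, https://www.kaggle.com/shreayan98c/boston-house-price-prediction/notebook has an MAE of 2.45 on the test set, but it does not use any cross-validation on the training set (i.e. no validation set).
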
Sofia V. December 9, 2021 at 2:35 am #

Hi everyone!! 🙂

I have a question!

Can we also implement XGBoost Ranker in your code?

Thanks in advance!

Sofia

- Adrian Tam December 10, 2021 at 4:16 am #
  
  It should be possible. Can you try it?
  
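Medlien December 30, 2021 at 10:16 pm #

Hi Jason,

I have two questions about the statement you made above:

"Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset."

1. I learned from your post on the Zero Rule algorithm how to find the MAE of a naive model with a train-test split. How do you do it with cross-validation?

2. How did you arrive at the MAE of the top-performing model, which gives us the upper bound on expected performance for the dataset?

- Medlien January 18, 2022 at 9:29 pm #
  
  Was this a silly question? My apologies, just in case.
  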
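Alex Fontes May 16, 2022 at 9:50 am #

Hi Jason, I am trying to use XGBRegressor in a project, but it always returns the same value for a given input, even after refitting.
So, as a test, I came to this post and used your code above (the Boston housing dataset), and it also returns the same value (which is also the same value you got).

X shape: (506, 13)
y shape: (506,)
input row: [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296.0, 15.3, 396.9, 4.98]
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078
Prediction: 24.0193386078

(P.S. In each of the runs above, the model was refit to (X,y))

Do you get a different prediction on each run with this code?
I am using Python 3.10.3 and my libraries are all up to date... I hope that you or anyone in the community can help point me in a direction to solve this?

Thanks!!!

- James Carmichael May 17, 2022 at 9:55 am #
  
  Hi Alex... Have you tried implementing your model in Google Colaboratory?
  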
Alex Fontes May 18, 2022 at 10:16 am #

Hi James, I appreciate your reply, and thank you for pointing me to this resource.
As an experiment, I wrote some simple code on my computer and then also ran it on Google Colab.

Here is the code (the same on my computer and on Google Colab):

from pandas import read_csv
import xgboost as xgb

path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
ds = read_csv(path, header=None).values

# first 500 rows for training, the remaining 6 rows for testing
ds_train = xgb.DMatrix(ds[:500,:-1], label=ds[:500,-1:])
ds_test = xgb.DMatrix(ds[500:,:-1], label=ds[500:,-1:])

params = {
    'colsample_bynode': 0.8,
    'learning_rate': 1,
    'max_depth': 5,
    'num_parallel_tree': 100,
    'objective': 'reg:squarederror',
    'subsample': 0.8,
}
num_round = 100

# refit the model five times and print the test predictions each time
for _ in range(5):
    bst = xgb.train(params, ds_train, num_round)
    preds = bst.predict(ds_test)
    print(preds)

***********************************************************
These are the predictions on my computer:
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

These are the predictions on Google Colab:
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

So, when I run the same code in different environments, the results are different... but in either case, every time the model is fit to the dataset, it still generates the same predictions... Is this how XGBoost is supposed to behave?

Thanks again for your help.

Emerson de Lemmus July 13, 2022 at 1:16 am #

This line in particular:

# split data into input and output columns
X, y = data[:, :-1], data[:, -1]

caused the following error: pandas.errors.InvalidIndexError: (slice(None, None, None), slice(None, -1, None)). In the example, "data" is not defined, but "dataframe" is.

Here is the code that fixed the error so that the example ran:

# split data into input and output columns
X, y = dataframe.iloc[:, :-1], dataframe.iloc[:, -1]

- James Carmichael July 13, 2022 at 7:46 am #
  
  Thank you for the feedback Emerson!
  
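Lee September 22, 2022 at 7:14 pm #

Why didn't you split the dataset into training and test sets as you do in other regression projects?

- James Carmichael September 23, 2022 at 5:55 am #
  
  Hi Lee... No reason; we agree that you should do so, as it is best practice. This tutorial was demonstrating another concept, but your understanding is correct. Keep up the great work!
  
Atena December 8, 2022 at 8:21 am #

Dear Dr Jason,
Can XGBoost be used for a small dataset with only 5 features and 40 samples?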
