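Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost

By Jason Brownlee on April 27, 2021 in Ensemble Learning

Gradient boosting is a powerful ensemble machine learning algorithm.

It is popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm, or one of the main algorithms, in winning solutions to machine learning competitions such as those on Kaggle.

There are many implementations of gradient boosting available, including the standard implementations in scikit-learn and efficient third-party libraries. Each library has a different interface, and some even use different names for the algorithm.

In this tutorial, you will discover how to use gradient boosting models for classification and regression in Python.

Standardized code examples are provided for the four major implementations of gradient boosting in Python, ready for you to copy and paste into your own predictive modeling project.

After completing this tutorial, you will know:

Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
How to evaluate and use third-party gradient boosting algorithms, including XGBoost, LightGBM, and CatBoost.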


Let's get started.


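Tutorial Overview

This tutorial is divided into five parts; they are:

Gradient Boosting Overview
Gradient Boosting With Scikit-Learn
1. Library Installation
2. Test Problems
3. Gradient Boosting
4. Histogram-Based Gradient Boosting
Gradient Boosting With XGBoost
1. Library Installation
2. XGBoost for Classification
3. XGBoost for Regression
Gradient Boosting With LightGBM
1. Library Installation
2. LightGBM for Classification
3. LightGBM for Regression
Gradient Boosting With CatBoost
1. Library Installation
2. CatBoost for Classification
3. CatBoost for Regression

Gradient Boosting Overview

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines, or GBM for short.

Ensembles are constructed from decision tree models. Trees are added to the ensemble one at a time and fit to correct the prediction errors made by the prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, "gradient boosting," because the loss is reduced by following its gradient as each tree is added, much as a neural network is trained.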
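The additive procedure described above can be sketched in a few lines for the squared-error case. This is a minimal illustration, not how any of the libraries below implement it: the dataset size and the learning_rate, n_trees, and max_depth values are assumptions chosen for the example. For squared error, the negative gradient of the loss is simply the residual, so each tree is fit to the current residuals.

```python
# minimal sketch of gradient boosting for squared error (illustrative values only)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# small synthetic regression problem (sizes are assumptions for this sketch)
X, y = make_regression(n_samples=200, n_features=5, n_informative=3, noise=0.1, random_state=1)

learning_rate = 0.1  # shrinkage applied to each tree's contribution
n_trees = 50         # number of boosting rounds

# start from a constant prediction: the mean of the target
prediction = np.full(y.shape, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    # for the loss 0.5 * (y - f)^2, the negative gradient w.r.t. f is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=1)
    tree.fit(X, residuals)
    # take a small step in the direction suggested by the new tree
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print('Training MSE before boosting: %.3f' % initial_mse)
print('Training MSE after boosting: %.3f' % final_mse)
```

The training error falls as trees are added because each tree moves the predictions a small step toward the targets it was fit on.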

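Gradient boosting is an effective machine learning algorithm and is often the main algorithm, or one of the main algorithms, used to win machine learning competitions (such as those on Kaggle) on tabular and similar structured datasets.

Note: We will not go into the theory behind how the gradient boosting algorithm works in this tutorial.

For more on the gradient boosting algorithm, see the tutorial:

A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning

The algorithm provides hyperparameters that should, and perhaps must, be tuned for a specific dataset. Although there are many hyperparameters to tune, perhaps the most important are as follows:

The number of trees or estimators in the model.
The learning rate of the model.
The row and column sampling rate for stochastic models.
The maximum tree depth.
The minimum tree weight.
The regularization terms alpha and lambda.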
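As a rough guide, most of the hyperparameters listed above map onto constructor arguments of scikit-learn's GradientBoostingClassifier, as sketched below. The values are arbitrary assumptions for illustration, not recommendations. Using min_samples_leaf as a stand-in for minimum tree weight is also an assumption of this sketch, and this class has no direct equivalent of the alpha and lambda regularization terms (XGBoost exposes those as reg_alpha and reg_lambda).

```python
# sketch: mapping the common hyperparameters onto GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

model = GradientBoostingClassifier(
    n_estimators=200,    # number of trees in the ensemble
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    subsample=0.8,       # row sampling rate; values below 1.0 give stochastic gradient boosting
    max_features=0.8,    # fraction of columns considered at each split
    max_depth=3,         # maximum depth of each tree
    min_samples_leaf=5,  # minimum samples per leaf (rough stand-in for minimum weight)
    random_state=1,
)
model.fit(X, y)

params = model.get_params()
train_accuracy = model.score(X, y)
print('Trees fit: %d' % len(model.estimators_))
print('Training accuracy: %.3f' % train_accuracy)
```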

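Note: We will not explore how to configure or tune the gradient boosting algorithm in this tutorial.

For more on tuning the hyperparameters of gradient boosting algorithms, see the tutorial:

How to Configure the Gradient Boosting Algorithm

There are many implementations of the gradient boosting algorithm available in Python. Perhaps the most used implementation is the version provided with the scikit-learn library.

Additional third-party libraries provide computationally efficient alternative implementations of the algorithm that often achieve better results in practice. Examples include the XGBoost library, the LightGBM library, and the CatBoost library.

Do you have a different favorite gradient boosting implementation?
Let me know in the comments below.

When using gradient boosting on your predictive modeling project, you may want to test each implementation of the algorithm.

This tutorial provides examples of each implementation of the gradient boosting algorithm on classification and regression predictive modeling problems that you can copy and paste into your project.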

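Let's take a look at each in turn.

Note: We are not comparing the performance of the algorithms in this tutorial. Instead, we are providing code examples to demonstrate how to use each implementation. As such, we use synthetic test datasets to demonstrate evaluating and making a prediction with each implementation.

This tutorial assumes you have Python and SciPy installed. If you need help, see the tutorial:

How to Set Up Your Python Environment for Machine Learning With Anaconda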


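Gradient Boosting With Scikit-Learn

In this section, we will review how to use the gradient boosting algorithm implementation in the scikit-learn library.

Library Installation

First, let's install the library.

Don't skip this step, as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows: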

sudo pip install scikit-learn


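For additional installation instructions specific to your platform, see:

Installing scikit-learn

Next, let's confirm that the library is installed and that you are using a modern version.

Run the following script to print the library version number.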

# check scikit-learn version
import sklearn
print(sklearn.__version__)


Running the example, you should see the following version number or higher.

0.22.1


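Test Problems

We will demonstrate the gradient boosting algorithm for classification and regression.

As such, we will use synthetic test problems from the scikit-learn library.

Classification Dataset

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples with 10 input features, five of which will be informative and the remaining five redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.

An example of creating and summarizing the dataset is listed below.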

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)


Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)


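Regression Dataset

We will use the make_regression() function to create a test regression dataset.

Like the classification dataset, the regression dataset will have 1,000 examples with 10 input features, five of which will be informative. The remaining five are uninformative rather than redundant: make_regression() has no redundant-feature setting, so these features simply do not contribute to the target.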

# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)


Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)
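To check which inputs actually drive the regression target, make_regression() can also return the coefficients of the underlying linear model via its coef argument. The short sketch below is an optional check that is not part of the original examples. It counts the features with a non-zero coefficient, which should match n_informative.

```python
# sketch: confirm how many features contribute to the regression target
from numpy import count_nonzero
from sklearn.datasets import make_regression

# coef=True also returns the coefficients of the underlying linear model
X, y, coef = make_regression(n_samples=1000, n_features=10, n_informative=5, coef=True, random_state=1)

n_used = count_nonzero(coef)
print(X.shape, y.shape, coef.shape)
print('Features with a non-zero coefficient: %d' % n_used)
```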


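Next, let's look at how we can develop gradient boosting models in scikit-learn.

Gradient Boosting

The scikit-learn library provides the GBM algorithm for classification and regression via the `GradientBoostingClassifier` and `GradientBoostingRegressor` classes.

Let's take a closer look at each in turn.

Gradient Boosting Machine for Classification

The example below first evaluates a GradientBoostingClassifier on the test problem using repeated stratified k-fold cross-validation (10 folds, 3 repeats) and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.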

# gradient boosting for classification in scikit-learn
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = GradientBoostingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = GradientBoostingClassifier()
model.fit(X, y)
# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])


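Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.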

Accuracy: 0.915 (0.025)
Prediction: 1


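Gradient Boosting Machine for Regression

The example below first evaluates a GradientBoostingRegressor on the test problem using repeated k-fold cross-validation (10 folds, 3 repeats) and reports the mean absolute error (MAE). Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.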

# gradient boosting for regression in scikit-learn
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = GradientBoostingRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = GradientBoostingRegressor()
model.fit(X, y)
# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])


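Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset. The MAE is printed as a negative number because the 'neg_mean_absolute_error' scorer negates the error so that larger scores are always better.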

MAE: -11.854 (1.121)
Prediction: -80.661
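If a positive MAE is easier to read, the negated scores can be converted with an absolute value before summarizing them. The sketch below repeats the evaluation above with that one extra step. The mae_scores variable name is an assumption of this example, and the default n_jobs is used for simplicity.

```python
# sketch: report MAE as a positive value
from numpy import absolute
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
model = GradientBoostingRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# the scorer returns negated errors, so every value is less than or equal to zero
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
# flip the sign so that the errors are reported as positive numbers
mae_scores = absolute(n_scores)
print('MAE: %.3f (%.3f)' % (mean(mae_scores), std(mae_scores)))
```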


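Histogram-Based Gradient Boosting

The scikit-learn library provides an alternate implementation of the gradient boosting algorithm, referred to as histogram-based gradient boosting.

This is an alternate approach to implementing gradient tree boosting, inspired by the LightGBM library (described more later). This implementation is provided via the `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor` classes.

The primary benefit of the histogram-based approach to gradient boosting is speed. These implementations are designed to be much faster to fit on training data.

At the time of writing (scikit-learn 0.22), this is an experimental implementation and requires that you add the following line to your code to enable access to these classes. From scikit-learn 1.0 onward the classes are stable and can be imported directly from sklearn.ensemble, so the line is no longer needed.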

from sklearn.experimental import enable_hist_gradient_boosting


Without this line, you will see an error like:

ImportError: cannot import name 'HistGradientBoostingClassifier'


or

ImportError: cannot import name 'HistGradientBoostingRegressor'
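If the same script needs to run on both older and newer scikit-learn releases, one option is to try the direct import first and fall back to the experimental flag only when it fails. This is a sketch of that pattern, not something the original examples do. It assumes the older release raises ImportError on the direct import, as in the messages above.

```python
# sketch: import the histogram-based classifier on old and new scikit-learn releases
try:
    # works when the estimator is available without the experimental flag
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:
    # older releases require the experimental flag to be imported first
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = HistGradientBoostingClassifier()
model.fit(X, y)
# predict the class labels of the first five rows as a quick smoke test
yhat = model.predict(X[:5])
print(yhat)
```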


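Let's take a closer look at how to use this implementation.

Histogram-Based Gradient Boosting Machine for Classification

The example below first evaluates a HistGradientBoostingClassifier on the test problem using repeated stratified k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.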

# histogram-based gradient boosting for classification in scikit-learn
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = HistGradientBoostingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = HistGradientBoostingClassifier()
model.fit(X, y)
# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])


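Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.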

Accuracy: 0.935 (0.024)
Prediction: 1


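Histogram-Based Gradient Boosting Machine for Regression

The example below first evaluates a HistGradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.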

# histogram-based gradient boosting for regression in scikit-learn
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = HistGradientBoostingRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = HistGradientBoostingRegressor()
model.fit(X, y)
# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])


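Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.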

MAE: -12.723 (1.540)
Prediction: -77.837


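Gradient Boosting With XGBoost

XGBoost, which is short for "Extreme Gradient Boosting," is a library that provides an efficient implementation of the gradient boosting algorithm.

The main benefits of the XGBoost implementation are computational efficiency and, often, better model performance.

For more on the benefits and capabilities of XGBoost, see the tutorial:

A Gentle Introduction to XGBoost for Applied Machine Learning

Library Installation

You can install the XGBoost library using the pip Python installer, as follows: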

sudo pip install xgboost


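For additional installation instructions specific to your platform, see:

XGBoost Installation Guide

Next, let's confirm that the library is installed and that you are using a modern version.

Run the following script to print the library version number.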

# check xgboost version
import xgboost
print(xgboost.__version__)


Running the example, you should see the following version number or higher.

1.0.1


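The XGBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the `XGBClassifier` and `XGBRegressor` classes.

Let's take a closer look at each in turn.

XGBoost for Classification

The example below first evaluates an XGBClassifier on the test problem using repeated stratified k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.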

# xgboost for classification
from numpy import asarray
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = XGBClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = XGBClassifier()
model.fit(X, y)
# make a single prediction
row = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]
row = asarray(row).reshape((1, len(row)))
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])


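Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.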

Accuracy: 0.936 (0.019)
Prediction: 1


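XGBoost for Regression

The example below first evaluates an XGBRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.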

# xgboost for regression
from numpy import asarray
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = XGBRegressor(objective='reg:squarederror')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = XGBRegressor(objective='reg:squarederror')
model.fit(X, y)
# make a single prediction
row = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]
row = asarray(row).reshape((1, len(row)))
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])


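Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.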

MAE: -15.048 (1.316)
Prediction: -93.434


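Gradient Boosting With LightGBM

LightGBM, short for Light Gradient Boosted Machine, is a library developed at Microsoft that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of LightGBM is the set of changes to the training algorithm that make the process dramatically faster and, in many cases, result in a more effective model.

For more technical details on the LightGBM algorithm, see the paper:

LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.

Library Installation

You can install the LightGBM library using the pip Python installer, as follows: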

sudo pip install lightgbm


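For additional installation instructions specific to your platform, see:

LightGBM Installation Guide

Next, let's confirm that the library is installed and that you are using a modern version.

Run the following script to print the library version number.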

# check lightgbm version
import lightgbm
print(lightgbm.__version__)


Running the example, you should see the following version number or higher.

2.3.1


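The LightGBM library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the `LGBMClassifier` and `LGBMRegressor` classes.

Let's take a closer look at each in turn.

LightGBM for Classification

The example below first evaluates an LGBMClassifier on the test problem using repeated stratified k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.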

# lightgbm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = LGBMClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = LGBMClassifier()
model.fit(X, y)
# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])


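Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.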

Accuracy: 0.934 (0.021)
Prediction: 1


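LightGBM for Regression

The example below first evaluates an LGBMRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.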

# lightgbm for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = LGBMRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = LGBMRegressor()
model.fit(X, y)
# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])


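Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.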

MAE: -12.739 (1.408)
Prediction: -82.040


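Gradient Boosting With CatBoost

CatBoost is a third-party library developed at Yandex, a Russian company, that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of CatBoost, in addition to computational speed improvements, is support for categorical input variables. This gives the library its name, CatBoost, for "Category Gradient Boosting."

For more technical details on the CatBoost algorithm, see the paper:

CatBoost: Gradient Boosting With Categorical Features Support, 2017.

Library Installation

You can install the CatBoost library using the pip Python installer, as follows: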

sudo pip install catboost


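For additional installation instructions specific to your platform, see:

CatBoost Installation Guide

Next, let's confirm that the library is installed and that you are using a modern version.

Run the following script to print the library version number.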

# check catboost version
import catboost
print(catboost.__version__)


Running the example, you should see the following version number or higher.

0.21


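The CatBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the `CatBoostClassifier` and `CatBoostRegressor` classes.

Let's take a closer look at each in turn.

CatBoost for Classification

The example below first evaluates a CatBoostClassifier on the test problem using repeated stratified k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.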

# catboost for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = CatBoostClassifier(verbose=0, n_estimators=100)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = CatBoostClassifier(verbose=0, n_estimators=100)
model.fit(X, y)
# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])


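Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.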

Accuracy: 0.931 (0.026)
Prediction: 1


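CatBoost for Regression

The example below first evaluates a CatBoostRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.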

# catboost for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = CatBoostRegressor(verbose=0, n_estimators=100)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = CatBoostRegressor(verbose=0, n_estimators=100)
model.fit(X, y)
# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])


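Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.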

MAE: -9.281 (0.951)
Prediction: -74.212



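Summary

In this tutorial, you discovered how to use gradient boosting models for classification and regression in Python.

Specifically, you learned:

Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
How to evaluate and use third-party gradient boosting algorithms, including XGBoost, LightGBM, and CatBoost.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.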


59 Responses to Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost

jaehyeong April 2, 2020 at 11:49 am

Thank you for the great article!

- Jason Brownlee April 2, 2020 at 1:30 pm

  I'm happy it helps.

Santiago April 3, 2020 at 6:41 am

Great article, thank you very much!

- Jason Brownlee April 3, 2020 at 6:59 am

  Thanks!

TOUNSI Youssef April 3, 2020 at 10:52 am

Well organized, keep up the good work!

- Jason Brownlee April 3, 2020 at 1:16 pm

  Thanks!

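Ben April 4, 2020 at 1:24 am

Hi Jason, all of my work is time series regression with utility metering data. I always just look at RMSE because it is in units that make sense to me. Basically, when using from sklearn.metrics import mean_squared_error, I just take math.sqrt(mse). I notice that you use mean absolute error in the code above... Is there anything wrong with looking only at RMSE to find the best model results?

- Jason Brownlee April 4, 2020 at 6:20 am

  No problem! I used to use RMSE all the time myself.

  Recently I prefer MAE, though I can't say why. Perhaps taste. Perhaps because no square root step is required.
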
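Svetlana April 26, 2020 at 2:13 am

The best article. Thanks for such a mind-boggling post.

When you use RepeatedStratifiedKFold, accuracy is usually calculated to find the best-performing model. What if one wants to calculate measures such as recall, precision, sensitivity, and specificity? How do you calculate them for each repeated fold, and the final mean over all folds, as is done for accuracy?

Thanks!

- Jason Brownlee April 26, 2020 at 6:15 am

  You can specify any metric you like for stratified k-fold cross-validation.
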
Edivaldo April 27, 2020 at 2:18 am

Congratulations on the article.
It's great.

- Jason Brownlee April 27, 2020 at 5:37 am

  Thanks!

Svetlana April 27, 2020 at 2:24 am

Do you have an example of the same, or could you show how to do it?

- Jason Brownlee April 27, 2020 at 5:37 am

  The same what?

Fernando May 3, 2020 at 8:03 pm

Thanks! Good tutorial.

- Jason Brownlee May 4, 2020 at 6:19 am

  You're welcome.

Engr. Wasiu Ajao May 9, 2020 at 12:50 am

Well done! Thanks, Jason.

- Jason Brownlee May 9, 2020 at 6:16 am

  Thanks!

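Jie May 11, 2020 at 8:31 pm #

Hi Jason, I have a question about generating the dataset.

If you set informative to 5, does that mean the classifier will detect these 5 attributes with high feature importance scores, while the other 5 redundant attributes get low scores?

And if you set informative to 5 and redundant to 2, is the importance of the other 3 attributes random?

Thanks

- Jason Brownlee May 12, 2020 at 6:44 am #
  
  Not really.
  
  We change informative/redundant to make the problem easier or harder in a general sense.
  
  Tree models are good at sifting out redundant features automatically.
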
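To look at this empirically, here is a small sketch that fits a model and prints its feature importance scores. It assumes scikit-learn's GradientBoostingClassifier and sets shuffle=False in make_classification() so that the columns keep a known order (informative features first, then redundant ones, then noise). The scores depend on the data and the random seed, so treat the output as something to inspect rather than a rule.

```python
# inspect feature importance on a synthetic dataset with a known column order
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# shuffle=False keeps the columns ordered: 5 informative, then 2 redundant, then 3 noise
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=2,
                           shuffle=False, random_state=1)
model = GradientBoostingClassifier(n_estimators=50, random_state=1)
model.fit(X, y)
# impurity-based importance scores, normalized to sum to 1
importances = model.feature_importances_
for i, score in enumerate(importances):
    print('Feature %d: %.3f' % (i, score))
```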
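Fábio Albuquerque August 26, 2020 at 7:14 am #

Can any of the gradient boosting methods handle a multi-dimensional array as the target value (y)?

- Jason Brownlee August 26, 2020 at 1:41 pm #
  
  I believe the scikit-learn gradient boosting implementation supports multi-output regression directly.
  
  Perhaps test and confirm?

  - Fábio Albuquerque August 28, 2020 at 3:38 am #
    
    https://scikit-learn.cn/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor.fit
    
    y: array-like of shape (n_samples,)
    Target values (strings or integers in classification, real numbers in regression). For classification, labels must correspond to classes.
    
    This differs from an estimator that supports multi-output regression directly:
    
    https://scikit-learn.cn/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.fit
    
    y: array-like of shape (n_samples,) or (n_samples, n_outputs)
    Target values (class labels in classification, real numbers in regression).

    - Jason Brownlee August 28, 2020 at 6:54 am #
      
      Nice.
      
      Perhaps try this:
      https://machinelearning.org.cn/multi-output-regression-models-with-python/
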
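Since GradientBoostingRegressor expects a one-dimensional target, one common workaround is to fit one model per output column. The sketch below assumes scikit-learn's MultiOutputRegressor wrapper and a synthetic two-target dataset; it is one possible approach, not the only one.

```python
# multi-output regression by fitting one gradient boosting model per target column
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# synthetic regression problem with two target columns (sizes are illustrative)
X, y = make_regression(n_samples=300, n_features=10, n_informative=5, n_targets=2, random_state=1)
# the wrapper clones the base model and fits a separate copy for each target
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50, random_state=1))
model.fit(X, y)
# predict both targets for a single row
yhat = model.predict(X[:1])
print(yhat.shape)
```

Because each target gets its own independent model, this approach does not exploit any relationship between the outputs.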
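Tony September 1, 2020 at 12:15 am #

Hi Jason,
I'm confused about how the light gradient boosting model works, since in the API they use "num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data])" to fit the model with the training data.

Why is the .fit method available in your code? Is it just because you imported the LGBMRegressor model?

Thanks,

- Jason Brownlee September 1, 2020 at 6:34 am #
  
  Yes, I recommend using the scikit-learn wrapper classes, which make using the model much simpler.
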
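Atis September 14, 2020 at 11:24 pm #

Hi Jason, I am not very happy with the regression results of my LSTM neural network. In particular, the far ends of the y distribution are not predicted very well.

I am wondering whether I could use the principle of gradient boosting to train successive networks that correct the remaining error left by the previous ones.

What do you think of this idea? What would the risks be?

- Jason Brownlee September 15, 2020 at 5:26 am #
  
  Try it!
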
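Paolina October 22, 2020 at 3:03 am #

Hi,
I created an XGBoost model and tuned its parameters using grid search (I know Bayesian optimization is better, but I was obliged to use grid search).

The problem is that I have to respond to this remark: "the robustness of the system is not clear, you have to make it explicit". I don't know how to estimate robustness or what I should read to answer it.
Any help, please.

- Jason Brownlee October 22, 2020 at 6:47 am #
  
  One estimate of model robustness is the variance or standard deviation of the performance metric when the model is evaluated repeatedly on the same test harness.
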
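As a rough illustration of that idea, the sketch below summarizes the spread of accuracy scores across repeated stratified k-fold runs. It uses scikit-learn's GradientBoostingClassifier as a stand-in; any estimator that follows the scikit-learn API, such as a tuned XGBClassifier, could be substituted. The dataset and settings are illustrative assumptions.

```python
# summarize the spread of a performance metric across repeated evaluations
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# small synthetic binary classification problem (sizes are illustrative)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = GradientBoostingClassifier(n_estimators=50, random_state=1)
# 5 folds repeated 3 times gives 15 scores on the same test harness
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Accuracy: mean=%.3f std=%.3f min=%.3f max=%.3f' % (
    mean(n_scores), std(n_scores), n_scores.min(), n_scores.max()))
```

A small standard deviation relative to the mean suggests that performance is stable across different splits of the data.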
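Ron February 15, 2021 at 6:33 am #

Jason,

Another of your posts has been published by Johar Ashfaque: https://medium.com/ai-in-plain-english/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost-58e372d0d34b
I did not find any reference to your article. This is the second one I know of. He seems to have left out histogram-based gradient boosting here.

Ron

- Jason Brownlee February 15, 2021 at 8:11 am #
  
  Thanks for letting me know. It is very disappointing that people rip me off so blatantly.

Sam February 16, 2021 at 6:27 am #

I had to comment on his post too, because it is really shameful. He could have cited you and added his own comments, but instead he recycled the material and added to the code. Honestly, it was a bad copy-paste job.

- Jason Brownlee February 16, 2021 at 8:01 am #
  
  Thanks!!!
  
  Yes, people have no shame. I see it more and more: straight copy-pastes of my tutorials on Medium and similar platforms.
  
  I believe Google can detect the duplicate content and penalizes the copyists with low rankings.

SSS February 19, 2021 at 9:54 am #

How about implementing a gradient boosting classifier using only NumPy?

- Jason Brownlee February 19, 2021 at 10:51 am #
  
  Great suggestion, thanks!

MS March 7, 2021 at 3:59 am #

Hi,
Does CatBoost work with the scikit-learn API?

- Jason Brownlee March 7, 2021 at 5:14 am #
  
  CatBoost can be used via the scikit-learn wrapper class, as in the example above.

MS March 9, 2021 at 3:49 pm #

Thanks

- Jason Brownlee March 10, 2021 at 4:37 am #
  
  You're welcome.
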
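Nick March 13, 2021 at 11:11 am #

Hi Jason, I'm just wondering how to use early stopping with catboost and lightgbm. I am getting an error asking for a validation set. Is cross_val_score incompatible with early stopping?
Cheers!

- Jason Brownlee March 14, 2021 at 5:21 am #
  
  Sorry, I don't have an example. I recommend checking the API documentation.
  
  Yes, CV and early stopping don't play well together. This may give you ideas:
  https://machinelearning.org.cn/faq/single-faq/how-do-i-use-early-stopping-with-k-fold-cross-validation-or-grid-search
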
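One way to sidestep the missing validation set is to let the model hold out part of each training fold itself. The sketch below does not use CatBoost or LightGBM; it swaps in scikit-learn's GradientBoostingClassifier, whose validation_fraction and n_iter_no_change arguments enable built-in early stopping that works inside cross_val_score(). The specific values are illustrative assumptions.

```python
# early stopping inside cross-validation using scikit-learn's built-in validation split
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# small synthetic binary classification problem (sizes are illustrative)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# hold out 10% of each training set internally and stop adding trees once the
# validation score has not improved for 5 consecutive iterations
model = GradientBoostingClassifier(n_estimators=200, validation_fraction=0.1,
                                   n_iter_no_change=5, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit on all data and report how many trees early stopping kept
model.fit(X, y)
print('Trees kept: %d of %d' % (model.n_estimators_, model.n_estimators))
```

Because the validation data is carved out of each training fold, the test fold is never used to decide when to stop, which avoids an optimistic bias in the reported score.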
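Saima April 26, 2021 at 4:00 am #

How can I get precision, recall, and F1 score from this?

- Jason Brownlee April 26, 2021 at 5:38 am #
  
  You can specify the metric to calculate when evaluating the model. I recommend choosing one; see this:
  https://machinelearning.org.cn/tour-of-evaluation-metrics-for-imbalanced-classification/

Mehdi May 3, 2021 at 1:21 pm #

Thanks, Jason. I always enjoy reading your articles.

- Jason Brownlee May 4, 2021 at 6:43 am #
  
  You're welcome!

jtm January 7, 2022 at 10:34 pm #

Thank you for this concise post. However, when trying to reproduce the classification results here, I either get an error from joblib or the run hangs forever. Any ideas? Thanks!

- James Carmichael January 8, 2022 at 11:00 am #
  
  Hi JTM... Are you trying to run a specific code listing from our materials? If so, please indicate the specific code listing and provide the exact error message.
  
  Regards,
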
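Maya April 6, 2022 at 8:04 pm #

Thanks! I would like to ask: when you report the MAE values for regression, do the values in parentheses represent the cross-validation? If so, what does it mean when the value is greater than 1? Ideally, should the maximum value be 1?

Also, when I test a model built with gbr = GradientBoostingRegressor(parameters), the function gbr.score(X_test, y_test) gives a negative value, for example -1.08. Does this mean the model is wrong? What do these negative values mean?

- James Carmichael April 7, 2022 at 9:43 am #
  
  Hi Maya... The following resource may be of interest.
  
  https://machinelearning.org.cn/regression-metrics-for-machine-learning/
  
  When calculating some model evaluation metrics in scikit-learn, such as mean squared error (MSE), the result is negative.
  
  This is confusing, because error scores like MSE cannot actually be negative; the smallest possible value is zero, meaning no error.
  
  The scikit-learn library has a unified model scoring system that assumes all model scores are maximized. For this system to work with scores that are minimized, such as MSE and other measures of error, the minimized scores are inverted by making them negative.
  
  This can also be seen in the specification of the metric, for example in the "neg" used in the metric name "neg_mean_squared_error".
  
  When interpreting the negative error scores, you can ignore the sign and use them directly.
  
  You can learn more here:
  
  Model evaluation: quantifying the quality of predictions
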
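The sign convention can be checked directly. The sketch below, which assumes a synthetic dataset and a simple train/test split, compares the MAE computed by mean_absolute_error() with the value returned by the 'neg_mean_absolute_error' scorer, and also prints the result of score(). For scikit-learn regressors, score() returns the coefficient of determination (R^2) rather than an error, and R^2 is negative when the model does worse than always predicting the mean of the target.

```python
# compare a plain error metric, its negated scorer, and the R^2 returned by score()
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import get_scorer, mean_absolute_error
from sklearn.model_selection import train_test_split

# small synthetic regression problem (sizes are illustrative)
X, y = make_regression(n_samples=300, n_features=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = GradientBoostingRegressor(n_estimators=50, random_state=1)
model.fit(X_train, y_train)
# MAE in the units of the target: always zero or positive, with no upper bound of 1
mae = mean_absolute_error(y_test, model.predict(X_test))
# the scorer returns the same value with the sign flipped
neg_mae = get_scorer('neg_mean_absolute_error')(model, X_test, y_test)
# score() on a regressor is R^2: at most 1.0, and it can be negative
r2 = model.score(X_test, y_test)
print('MAE: %.3f' % mae)
print('Scorer value: %.3f' % neg_mae)
print('R^2 from score(): %.3f' % r2)
```

In the tutorial's output, the number in parentheses is the standard deviation of the scores across folds. Like the MAE itself, it is expressed in the units of the target, so values above 1 are not a problem.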
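Faiy V. May 21, 2022 at 1:10 am #

Hello!

Very useful tutorial!

Can I use the same code for LightGBM Ranker and XGBoost Ranker by changing only the model fitting and some of the parameters?

Thank you in advance!

Sofia

- James Carmichael May 21, 2022 at 11:46 pm #
  
  Hi Faiy V.... A great deal of the code can be reused. Have you already implemented models for both and compared the results? Let us know what you find!

  - Faiy V. May 27, 2022 at 11:14 pm #
    
    Yes, I tried it. A group needs to be defined for ranking! Hope it works!
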
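Sepideh December 13, 2022 at 6:53 am #

Thank you for the article, it was very helpful.
I have a question.
For each algorithm, you first created a model and applied k-fold cross-validation to it, then created another model and used it to predict the target.
In other words, you created 2 different models for each algorithm, cross-validated only one of them, and then used the other one for prediction.
May I ask why you did it this way?
Could you explain why you don't create just one model for both cross-validation and prediction?

Thanks in advance.

- James Carmichael December 13, 2022 at 10:54 am #
  
  You are very welcome, Sepideh! The following may be of interest to you.
  
  https://machinelearning.org.cn/training-validation-test-split-and-cross-validation-done-right/

  - Sepideh December 13, 2022 at 4:08 pm #
    
    Thank you very much.
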
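Part of the answer can be seen in code: cross_val_score() fits clones of the model on each fold and leaves the object passed to it unfitted, so a separate fit on all available data is needed before making predictions. The sketch below assumes scikit-learn's GradientBoostingRegressor and a synthetic dataset; the fitted_after_cv variable is a hypothetical name used only for illustration.

```python
# cross-validation estimates performance; a separate fit produces the final model
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import KFold, cross_val_score
from sklearn.utils.validation import check_is_fitted

# small synthetic regression problem (sizes are illustrative)
X, y = make_regression(n_samples=300, n_features=10, n_informative=5, random_state=1)
model = GradientBoostingRegressor(n_estimators=50, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)
# internally, a fresh clone of the model is fit on each training fold
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
# the original object is still unfitted after cross-validation
try:
    check_is_fitted(model)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False
print('Fitted after cross-validation: %s' % fitted_after_cv)
# fit the final model on all data, then use it to make a prediction
model.fit(X, y)
yhat = model.predict(X[:1])
print('Prediction: %.3f' % yhat[0])
```

The cross-validation scores estimate how the modeling procedure is expected to perform on new data, while the final model is the one actually used for prediction.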
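Natalie September 26, 2023 at 8:09 pm #

Hi, I am using one of your gradient boosting methods on a regression dataset. It improved the RMSE a lot. Thank you.

- James Carmichael September 27, 2023 at 8:00 am #
  
  Thank you for the feedback, Natalie! We appreciate it!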
