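How to Develop an AdaBoost Ensemble in Python

By Jason Brownlee on April 27, 2021, in Ensemble Learning

Boosting is a class of ensemble machine learning algorithms that combine the predictions of many weak learners.

A weak learner is a very simple model that nonetheless has some skill on the dataset. Boosting was a theoretical concept long before a practical algorithm could be developed, and the AdaBoost (adaptive boosting) algorithm was the first successful approach for the idea.

The AdaBoost algorithm uses very short (one-level) decision trees as weak learners, which are added to the ensemble sequentially. Each subsequent model attempts to correct the predictions made by the model before it in the sequence. This is achieved by weighting the training dataset to put more focus on the training examples on which prior models made prediction errors.

In this tutorial, you will discover how to develop AdaBoost ensembles for classification and regression.

After completing this tutorial, you will know:

An AdaBoost ensemble is an ensemble created from decision trees added sequentially to the model.
How to use the AdaBoost ensemble for classification and regression with scikit-learn.
How to explore the effect of AdaBoost model hyperparameters on model performance.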

Let's get started.

Update Aug/2020: Added an example of grid searching model hyperparameters.

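Tutorial Overview

This tutorial is divided into four parts; they are:

AdaBoost Ensemble Algorithm
AdaBoost Scikit-Learn API
1. AdaBoost for Classification
2. AdaBoost for Regression
AdaBoost Hyperparameters
1. Explore Number of Trees
2. Explore Weak Learner
3. Explore Learning Rate
4. Explore Alternate Algorithm
Grid Search AdaBoost Hyperparameters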

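AdaBoost Ensemble Algorithm

Boosting refers to a class of machine learning ensemble algorithms where models are added sequentially and later models in the sequence correct the predictions made by earlier models.

AdaBoost, short for "Adaptive Boosting," is a boosting ensemble machine learning algorithm, and was one of the first successful boosting approaches.

We call the algorithm AdaBoost because, unlike previous algorithms, it adjusts adaptively to the errors of the weak hypotheses.

- A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, 1996.

AdaBoost combines the predictions from short one-level decision trees, called decision stumps, although other algorithms can also be used. Decision stumps are used because the AdaBoost algorithm seeks to use many weak models and correct their predictions by adding further weak models.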
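
To make the idea of a decision stump concrete, the following is a minimal sketch in plain Python, not the scikit-learn implementation. It assumes a single input feature, labels in {-1, +1}, and per-row weights; the function name `fit_stump` and the toy data are made up for illustration.

```python
# minimal sketch of a weighted decision stump on one feature (illustrative only)

def fit_stump(xs, ys, weights):
	"""Return (weighted_error, threshold, polarity) of the best one-split rule.

	The rule predicts `polarity` when x <= threshold and `-polarity` otherwise.
	Labels in ys are assumed to be -1 or +1.
	"""
	best = None
	for threshold in sorted(set(xs)):
		for polarity in (1, -1):
			preds = [polarity if x <= threshold else -polarity for x in xs]
			# weighted error: total weight of the rows the rule gets wrong
			err = sum(w for w, y, p in zip(weights, ys, preds) if y != p)
			if best is None or err < best[0]:
				best = (err, threshold, polarity)
	return best

# toy data (hypothetical): one feature, six rows, equal weights
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, 1, -1, -1, 1]
weights = [1.0 / 6] * 6
err, threshold, polarity = fit_stump(xs, ys, weights)
print(err, threshold, polarity)
```

On this toy data the best rule splits at 3.0 and misclassifies only the last row, so it is better than chance but far from perfect, which is what is meant by a weak learner.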

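The training algorithm starts with one decision tree, finds the examples in the training dataset that were misclassified, and adds more weight to those examples. Another tree is then trained on the same data, now weighted by the misclassification errors. This process is repeated until the desired number of trees has been added.

If a training data point is misclassified, the weight of that training data point is increased (boosted). A second classifier is built using the new weights, which are no longer equal. Again, misclassified training data have their weights boosted and the procedure is repeated.

- Multi-class AdaBoost, 2009.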
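
The re-weighting step described above can be sketched in a few lines of plain Python. This is an illustrative version of one round of discrete, binary AdaBoost with labels in {-1, +1}; the function name `adaboost_round` and the toy predictions are assumptions for the example, and scikit-learn's internal update differs in detail.

```python
# sketch of one AdaBoost re-weighting round (binary labels in {-1, +1})
import math

def adaboost_round(weights, y_true, y_pred):
	"""Return (alpha, new_weights) after one weak learner has made predictions.

	Assumes the weighted error is strictly between 0 and 1.
	"""
	# weighted error of the current weak learner
	err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
	# weight (say) of the weak learner: larger when its error is smaller
	alpha = 0.5 * math.log((1.0 - err) / err)
	# t * p is +1 for a correct row and -1 for a misclassified row,
	# so misclassified rows are scaled up and correct rows are scaled down
	new_weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
	total = sum(new_weights)
	return alpha, [w / total for w in new_weights]

# toy example (hypothetical): five rows, the second one is misclassified
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
weights = [0.2] * 5
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha, weights)
```

After the round, the misclassified row carries a weight of about 0.5 while each correctly classified row drops to about 0.125, so the next weak learner is pushed to focus on the row that was missed.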

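The algorithm was developed for classification and involves combining the predictions made by all decision trees in the ensemble. A similar approach was also developed for regression problems, where predictions are made by averaging the decision trees. The contribution of each model to the ensemble prediction is weighted by the model's performance on the training dataset.

... the new algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy.

- A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, 1996.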
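
As a rough illustration of a weighted majority hypothesis, the sketch below combines the votes of three hypothetical weak learners for a single example. The weights and votes are invented for the example, and `weighted_vote` is a made-up helper rather than a library function.

```python
# sketch of a weighted majority vote over weak learners (labels in {-1, +1})

def weighted_vote(predictions, alphas):
	"""Return +1 or -1 depending on the sign of the weighted sum of votes."""
	score = sum(a * p for a, p in zip(alphas, predictions))
	return 1 if score >= 0 else -1

# hypothetical ensemble: two low-weight learners vote -1, one accurate learner votes +1
alphas = [0.3, 0.4, 1.1]
votes = [-1, -1, 1]
final = weighted_vote(votes, alphas)
print(final)
```

Here the single high-weight learner outvotes the two low-weight learners, whereas a simple unweighted majority would have predicted -1.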

Now that we are familiar with the AdaBoost algorithm, let's look at how we can fit AdaBoost models in Python.

AdaBoost Scikit-Learn API

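AdaBoost ensembles can be implemented from scratch, although this can be challenging for beginners.

For an example, see the tutorial:

Boosting and AdaBoost for Machine Learning

The scikit-learn Python machine learning library provides an implementation of AdaBoost ensembles for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script: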

# check scikit-learn version
import sklearn
print(sklearn.__version__)

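Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.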

0.22.1

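AdaBoost is provided via the AdaBoostRegressor and AdaBoostClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.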
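
As a rough sketch of what repeated cross-validation does, the plain-Python example below generates the train/test index splits and summarizes a list of scores. It is unstratified and simplified compared with scikit-learn's splitters; `repeated_kfold` and `summarize` are hypothetical helper names, and the three scores are made-up numbers used only to show the averaging step.

```python
# sketch of repeated k-fold splitting and score averaging (illustrative only)
import random
from statistics import mean, pstdev

def repeated_kfold(n_samples, n_splits, n_repeats, seed=1):
	"""Yield (train_indices, test_indices) pairs, reshuffling before each repeat."""
	rng = random.Random(seed)
	for _ in range(n_repeats):
		idx = list(range(n_samples))
		rng.shuffle(idx)
		for k in range(n_splits):
			# every n_splits-th shuffled index, starting at k, forms the test fold
			test = idx[k::n_splits]
			held_out = set(test)
			train = [i for i in idx if i not in held_out]
			yield train, test

def summarize(scores):
	"""Return the mean and (population) standard deviation of the scores."""
	return mean(scores), pstdev(scores)

# 10 folds repeated 3 times gives 30 train/test splits, so 30 scores per model
splits = list(repeated_kfold(n_samples=1000, n_splits=10, n_repeats=3))
print(len(splits))

# made-up scores, only to show how the results would be averaged
avg, spread = summarize([0.78, 0.80, 0.82])
print('%.3f (%.3f)' % (avg, spread))
```

Each model is therefore fit and scored 30 times, and the reported number is the mean of those scores together with their standard deviation.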

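Let's take a look at how to develop an AdaBoost ensemble for both classification and regression.

AdaBoost for Classification

In this section, we will look at using AdaBoost for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.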

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

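Next, we can evaluate an AdaBoost algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.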

# evaluate adaboost algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# define the model
model = AdaBoostClassifier()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

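Running the example reports the mean and standard deviation of the accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the AdaBoost ensemble with default hyperparameters achieves a classification accuracy of about 80 percent on this test dataset.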

Accuracy: 0.806 (0.041)

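We can also use the AdaBoost model as a final model and make predictions for classification.

First, the AdaBoost ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.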

# make predictions using adaboost for classification
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# define the model
model = AdaBoostClassifier()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [[-3.47224758,1.95378146,0.04875169,-0.91592588,-3.54022468,1.96405547,-7.72564954,-2.64787168,-1.81726906,-1.67104974,2.33762043,-4.30273117,0.4839841,-1.28253034,-10.6704077,-0.7641103,-3.58493721,2.07283886,0.08385173,0.91461126]]
yhat = model.predict(row)
print('Predicted Class: %d' % yhat[0])

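Running the example fits the AdaBoost ensemble model on the entire dataset, which is then used to make a prediction on a new row of data, as we might when using the model in an application.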

Predicted Class: 0

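Now that we are familiar with using AdaBoost for classification, let's look at the API for regression.

AdaBoost for Regression

In this section, we will look at using AdaBoost for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.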

# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=6)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

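Next, we can evaluate an AdaBoost algorithm on this dataset.

As we did in the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that a negative MAE closer to zero is better, and a perfect model has a MAE of 0.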
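
The sign convention can be illustrated with a small plain-Python sketch. The two functions below are stand-ins written for this example, not scikit-learn's own scorers, and the target values and predictions are made up.

```python
# sketch of why a negated MAE can be maximized (illustrative stand-in functions)

def mean_absolute_error(y_true, y_pred):
	"""Average absolute difference between targets and predictions."""
	return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def neg_mean_absolute_error(y_true, y_pred):
	"""Negated MAE, so that a larger (closer to zero) score means a better model."""
	return -mean_absolute_error(y_true, y_pred)

y_true = [10.0, 20.0, 30.0]
good = [11.0, 19.0, 31.0]   # each prediction is off by 1
bad = [15.0, 25.0, 20.0]    # predictions are off by 5, 5, and 10

score_good = neg_mean_absolute_error(y_true, good)
score_bad = neg_mean_absolute_error(y_true, bad)
print(score_good, score_bad)
```

The better predictions score -1.0 and the worse ones about -6.67, so picking the largest score selects the model with the smallest error.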

The complete example is listed below.

# evaluate adaboost ensemble for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import AdaBoostRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=6)
# define the model
model = AdaBoostRegressor()
# evaluate the model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

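Running the example reports the mean and standard deviation of the MAE of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the AdaBoost ensemble with default hyperparameters achieves a MAE of about 72.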

MAE: -72.327 (4.041)

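We can also use the AdaBoost model as a final model and make predictions for regression.

First, the AdaBoost ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.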

# adaboost ensemble for making predictions for regression
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=6)
# define the model
model = AdaBoostRegressor()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [[1.20871625,0.88440466,-0.9030013,-0.22687731,-0.82940077,-1.14410988,1.26554256,-0.2842871,1.43929072,0.74250241,0.34035501,0.45363034,0.1778756,-1.75252881,-1.33337384,-1.50337215,-0.45099008,0.46160133,0.58385557,-1.79936198]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])

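Running the example fits the AdaBoost ensemble model on the entire dataset, which is then used to make a prediction on a new row of data, as we might when using the model in an application.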

Prediction: -10

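Now that we are familiar with using the scikit-learn API to evaluate and use AdaBoost ensembles, let's look at configuring the model.

AdaBoost Hyperparameters

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the AdaBoost ensemble and their effect on model performance.

Explore Number of Trees

An important hyperparameter for the AdaBoost algorithm is the number of decision trees used in the ensemble.

Recall that each decision tree used in the ensemble is designed to be a weak learner. That is, it has skill over random prediction, but is not highly skillful. As such, one-level decision trees are used, called decision stumps.

The number of trees added to the model must be high for the model to work well, often hundreds, if not thousands.

The number of trees can be set via the "n_estimators" argument and defaults to 50.

The example below explores the effect of the number of trees with values between 10 and 5,000.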

# explore adaboost ensemble number of trees effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	# define number of trees to consider
	n_trees = [10, 50, 100, 500, 1000, 5000]
	for n in n_trees:
		models[str(n)] = AdaBoostClassifier(n_estimators=n)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	# define the evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate the model and collect the results
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	# evaluate the model
	scores = evaluate_model(model, X, y)
	# store the results
	results.append(scores)
	names.append(name)
	# summarize the performance along the way
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

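Running the example first reports the mean accuracy for each configured number of decision trees.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that performance on this dataset improves until about 50 trees and declines after that. This might be a sign of the ensemble overfitting the training dataset after additional trees are added.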

>10 0.773 (0.039)
>50 0.806 (0.041)
>100 0.801 (0.032)
>500 0.793 (0.028)
>1000 0.791 (0.032)
>5000 0.782 (0.031)

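A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

We can see the general trend of model performance against ensemble size.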

Box Plot of AdaBoost Ensemble Size vs. Classification Accuracy

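Explore Weak Learner

A decision tree with one level is used as the weak learner by default.

We can make the models used in the ensemble less weak (more skillful) by increasing the depth of the decision tree.

The example below explores the effect of increasing the depth of the DecisionTreeClassifier weak learner on the AdaBoost ensemble.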

# explore adaboost ensemble tree depth effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	# explore depths from 1 to 10
	for i in range(1,11):
		# define base model
		base = DecisionTreeClassifier(max_depth=i)
		# define ensemble model
		models[str(i)] = AdaBoostClassifier(base_estimator=base)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	# define the evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate the model and collect the results
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	# evaluate the model
	scores = evaluate_model(model, X, y)
	# store the results
	results.append(scores)
	names.append(name)
	# summarize the performance along the way
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

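Running the example first reports the mean accuracy for each configured weak learner tree depth.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that as the depth of the decision trees is increased, the performance of the ensemble on this dataset also increases, leveling off at a depth of about 6 to 8.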

>1 0.806 (0.041)
>2 0.864 (0.028)
>3 0.867 (0.030)
>4 0.889 (0.029)
>5 0.909 (0.021)
>6 0.923 (0.020)
>7 0.927 (0.025)
>8 0.928 (0.028)
>9 0.923 (0.017)
>10 0.926 (0.030)

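A box and whisker plot is created for the distribution of accuracy scores for each configured weak learner depth.

We can see the general trend of model performance against weak learner depth.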

Box Plot of AdaBoost Ensemble Weak Learner Depth vs. Classification Accuracy

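Explore Learning Rate

AdaBoost also supports a learning rate that controls the contribution of each model to the ensemble prediction.

This is controlled by the "learning_rate" argument and is set to 1.0, or full contribution, by default. Smaller or larger values might be appropriate depending on the number of models used in the ensemble. There is a balance between the contribution of the models and the number of trees in the ensemble.

More trees may require a smaller learning rate; fewer trees may require a larger learning rate. It is common to use values between 0 and 1, and sometimes very small values such as 0.1, 0.01, or 0.001 to avoid overfitting.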
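
One way to picture the effect of the learning rate is as a multiplier on the weight each weak learner receives. The sketch below assumes the discrete, binary formulation in which the learner's weight is the learning rate times log((1 - error) / error); the exact formula used inside scikit-learn depends on the algorithm variant and version, so treat this as an approximation. The function names are made up for the example.

```python
# sketch of how a learning rate shrinks a weak learner's contribution
# (assumes the discrete, binary formulation; illustrative only)
import math

def estimator_weight(error, learning_rate=1.0):
	"""Weight given to a weak learner with the given weighted error (0 < error < 1)."""
	return learning_rate * math.log((1.0 - error) / error)

def boost_factor(error, learning_rate=1.0):
	"""Factor by which the weight of a misclassified row is multiplied."""
	return math.exp(estimator_weight(error, learning_rate))

# the same weak learner (20 percent weighted error) under two learning rates
full = estimator_weight(0.2, learning_rate=1.0)
small = estimator_weight(0.2, learning_rate=0.1)
print(full, small)
print(boost_factor(0.2, 1.0), boost_factor(0.2, 0.1))
```

With a learning rate of 1.0, misclassified rows have their weight multiplied by about 4, while a learning rate of 0.1 multiplies it by only about 1.15. Each model then changes the ensemble less, which is why smaller learning rates usually need more trees.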

The example below explores learning rate values between 0.1 and 2.0 in 0.1 increments.

# explore adaboost ensemble learning rate effect on performance
from numpy import mean
from numpy import std
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	# explore learning rates from 0.1 to 2 in 0.1 increments
	for i in arange(0.1, 2.1, 0.1):
		key = '%.3f' % i
		models[key] = AdaBoostClassifier(learning_rate=i)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	# define the evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate the model and collect the results
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	# evaluate the model
	scores = evaluate_model(model, X, y)
	# store the results
	results.append(scores)
	names.append(name)
	# summarize the performance along the way
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.xticks(rotation=45)
pyplot.show()

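Running the example first reports the mean accuracy for each configured learning rate.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see similar results for values between 0.5 and 1.0, and a decrease in model performance after that.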

>0.100 0.767 (0.049)
>0.200 0.786 (0.042)
>0.300 0.802 (0.040)
>0.400 0.798 (0.037)
>0.500 0.805 (0.042)
>0.600 0.795 (0.031)
>0.700 0.799 (0.035)
>0.800 0.801 (0.033)
>0.900 0.805 (0.032)
>1.000 0.806 (0.041)
>1.100 0.801 (0.037)
>1.200 0.800 (0.030)
>1.300 0.799 (0.041)
>1.400 0.793 (0.041)
>1.500 0.790 (0.040)
>1.600 0.775 (0.034)
>1.700 0.767 (0.054)
>1.800 0.768 (0.040)
>1.900 0.736 (0.047)
>2.000 0.682 (0.048)

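A box and whisker plot is created for the distribution of accuracy scores for each configured learning rate.

We can see the general trend of decreasing model performance on this dataset with a learning rate larger than 1.0.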

Box Plot of AdaBoost Ensemble Learning Rate vs. Classification Accuracy

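Explore Alternate Algorithm

The default algorithm used in the ensemble is a decision tree, although other algorithms can be used.

The intent is to use very simple models, called weak learners. Also, the scikit-learn implementation requires that any model used must support weighted samples, because the ensemble is created by fitting models on weighted versions of the training dataset.

The base model can be specified via the "base_estimator" argument (renamed to "estimator" in scikit-learn 1.2, so newer releases may require the new name). In the case of classification, the base model must also support predicting probabilities or probability-like scores. If the specified model does not support a weighted training dataset, you will see an error message like the following: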

ValueError: KNeighborsClassifier doesn't support sample_weight.
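
One simple way to check in advance whether a candidate base model could work is to inspect whether its fit() method accepts a sample_weight argument. The sketch below uses only the standard library and two hypothetical stand-in classes; scikit-learn also ships a helper for this purpose, sklearn.utils.validation.has_fit_parameter, which is not used here so that the example stays self-contained.

```python
# sketch of checking for sample_weight support before building the ensemble
import inspect

def supports_sample_weight(estimator):
	"""Return True if estimator.fit declares a 'sample_weight' parameter."""
	return 'sample_weight' in inspect.signature(estimator.fit).parameters

# hypothetical stand-ins for a model that accepts weights and one that does not
class WeightedModel:
	def fit(self, X, y, sample_weight=None):
		return self

class UnweightedModel:
	def fit(self, X, y):
		return self

print(supports_sample_weight(WeightedModel()))
print(supports_sample_weight(UnweightedModel()))
```

A model for which this check fails would be expected to trigger the error shown above when passed to AdaBoost.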

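An example of a model that supports weighted training is the logistic regression algorithm.

The example below demonstrates an AdaBoost algorithm with a LogisticRegression weak learner.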

# evaluate adaboost algorithm with logistic regression weak learner for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# define the model
model = AdaBoostClassifier(base_estimator=LogisticRegression())
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

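Running the example reports the mean and standard deviation of the accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the AdaBoost ensemble with a logistic regression weak learner achieves a classification accuracy of about 79 percent on this test dataset.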

Accuracy: 0.794 (0.032)

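Grid Search AdaBoost Hyperparameters

Configuring AdaBoost can be challenging because the algorithm has many key hyperparameters that influence the behavior of the model on the training data, and the hyperparameters interact with each other.

As such, it is good practice to use a search process to discover a configuration of the model hyperparameters that works well or best for a given predictive modeling problem. Popular search processes include a random search and a grid search.

In this section, we will look at grid searching common ranges for the key hyperparameters of the AdaBoost algorithm, which you can use as a starting point for your own projects. This can be achieved using the GridSearchCV class and specifying a dictionary that maps model hyperparameter names to the values to search.

In this case, we will grid search two key hyperparameters for AdaBoost: the number of trees used in the ensemble and the learning rate. We will use a range of popular, well-performing values for each hyperparameter.

Each configuration combination will be evaluated using repeated k-fold cross-validation, and configurations will be compared using the mean score, in this case, classification accuracy.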
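
To get a feel for the size of such a search, the grid can be expanded by hand. The sketch below uses only the standard library to enumerate every combination of the two hyperparameters, using the same candidate values as the grid search in this section, and to count the model fits implied by 10 folds and three repeats. It illustrates what a grid search enumerates and is not how GridSearchCV is implemented.

```python
# sketch of expanding a hyperparameter grid into individual configurations
from itertools import product

grid = {
	'n_estimators': [10, 50, 100, 500],
	'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0],
}

# every combination of one value per hyperparameter
names = list(grid)
configs = [dict(zip(names, values)) for values in product(*grid.values())]

# each configuration is fit once per fold per repeat
n_splits, n_repeats = 10, 3
n_fits = len(configs) * n_splits * n_repeats

print(len(configs), n_fits)
print(configs[0])
```

Four values for the number of trees and five learning rates give 20 configurations and 600 model fits, which helps explain why a grid search can take a while and why the ranges are kept modest.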

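The complete example of grid searching the key hyperparameters of the AdaBoost algorithm on our synthetic classification dataset is listed below.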

# example of grid searching key hyperparameters for adaboost on a classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# define the model with default hyperparameters
model = AdaBoostClassifier()
# define the grid of values to search
grid = dict()
grid['n_estimators'] = [10, 50, 100, 500]
grid['learning_rate'] = [0.0001, 0.001, 0.01, 0.1, 1.0]
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the grid search procedure
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')
# execute the grid search
grid_result = grid_search.fit(X, y)
# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

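Running the example may take a while depending on your hardware. At the end of the run, the configuration that achieved the best score is reported first, followed by the scores for all other configurations that were considered.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a configuration with 500 trees and a learning rate of 0.1 performed the best, with a classification accuracy of about 81.3 percent.

The model may perform even better with more trees, such as 1,000 or 5,000, although these configurations were not tested in this case to ensure that the grid search completed in a reasonable time.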

Best: 0.813667 using {'learning_rate': 0.1, 'n_estimators': 500}
0.646333 (0.036376) with: {'learning_rate': 0.0001, 'n_estimators': 10}
0.646667 (0.036545) with: {'learning_rate': 0.0001, 'n_estimators': 50}
0.646667 (0.036545) with: {'learning_rate': 0.0001, 'n_estimators': 100}
0.647000 (0.038136) with: {'learning_rate': 0.0001, 'n_estimators': 500}
0.646667 (0.036545) with: {'learning_rate': 0.001, 'n_estimators': 10}
0.647000 (0.038136) with: {'learning_rate': 0.001, 'n_estimators': 50}
0.654333 (0.045511) with: {'learning_rate': 0.001, 'n_estimators': 100}
0.672667 (0.046543) with: {'learning_rate': 0.001, 'n_estimators': 500}
0.648333 (0.042197) with: {'learning_rate': 0.01, 'n_estimators': 10}
0.671667 (0.045613) with: {'learning_rate': 0.01, 'n_estimators': 50}
0.715000 (0.053213) with: {'learning_rate': 0.01, 'n_estimators': 100}
0.767667 (0.045948) with: {'learning_rate': 0.01, 'n_estimators': 500}
0.716667 (0.048876) with: {'learning_rate': 0.1, 'n_estimators': 10}
0.767000 (0.049271) with: {'learning_rate': 0.1, 'n_estimators': 50}
0.784667 (0.042874) with: {'learning_rate': 0.1, 'n_estimators': 100}
0.813667 (0.032092) with: {'learning_rate': 0.1, 'n_estimators': 500}
0.773333 (0.038759) with: {'learning_rate': 1.0, 'n_estimators': 10}
0.806333 (0.040701) with: {'learning_rate': 1.0, 'n_estimators': 50}
0.801000 (0.032491) with: {'learning_rate': 1.0, 'n_estimators': 100}
0.792667 (0.027560) with: {'learning_rate': 1.0, 'n_estimators': 500}

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Boosting and AdaBoost for Machine Learning

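Summary

In this tutorial, you discovered how to develop AdaBoost ensembles for classification and regression.

Specifically, you learned:

An AdaBoost ensemble is an ensemble created from decision trees added sequentially to the model.
How to use the AdaBoost ensemble for classification and regression with scikit-learn.
How to explore the effect of AdaBoost model hyperparameters on model performance.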
