How to Tune the Number and Size of Decision Trees with XGBoost in Python

By Jason Brownlee on August 27, 2020 in XGBoost

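Gradient boosting involves creating and adding decision trees sequentially, with each tree attempting to correct the mistakes of the learners that came before it.

This raises the question of how many trees (weak learners or estimators) to configure in your gradient boosting model and how big each tree should be.

In this post you will discover how to design a systematic experiment to select the number and size of decision trees to use on your problem.

After reading this post you will know:

How to evaluate the effect of adding more decision trees to your XGBoost model.
How to evaluate the effect of using larger decision trees in your XGBoost model.
How to investigate the relationship between the number and depth of trees on your problem.

Kick-start your project with my new book XGBoost With Python, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.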

How to Tune the Number and Size of Decision Trees with XGBoost in Python
Photo by USFWSmidwest, some rights reserved.

Problem Description: Otto Dataset

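In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset train.csv.zip from the Data page and place the unzipped train.csv file into your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities, one for each of the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).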
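To make the evaluation metric concrete, here is a minimal pure-Python sketch of multiclass log loss. The function name multiclass_log_loss and the toy probabilities are assumptions made for illustration; they are not taken from the competition or from scikit-learn.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # clip so that a predicted probability of exactly 0 or 1 cannot produce log(0)
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)

# three examples and three classes (toy values, not Otto data)
y_true = [0, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.5, 0.2],
]
loss = multiclass_log_loss(y_true, y_prob)

# a model that always predicts a uniform distribution over K classes scores ln(K)
uniform_loss = multiclass_log_loss([0], [[1 / 3, 1 / 3, 1 / 3]])

print(loss)
print(uniform_loss)
```

Lower is better, and a confident wrong prediction is penalized much more heavily than a hesitant one.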

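The competition was completed in May 2015, and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).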
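The encoding step can be sketched in plain Python as follows. The helper encode_labels is hypothetical; it mimics the kind of mapping scikit-learn's LabelEncoder produces, assigning integers to the distinct labels in sorted order.

```python
def encode_labels(labels):
    """Map each distinct string label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes

# toy labels in the style of the Otto target column
y = ["Class_2", "Class_1", "Class_3", "Class_1"]
encoded, classes = encode_labels(y)

print(encoded)
print(classes)
```

Keeping the sorted list of classes around makes it possible to map predicted integers back to the original label strings.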

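Tune the Number of Decision Trees in XGBoost

Most implementations of gradient boosting are configured by default with a relatively small number of trees, such as hundreds or thousands.

The general reason is that on most problems, adding more trees beyond a certain limit does not improve the performance of the model.

This is due to the way the boosted tree model is constructed: sequentially, with each new tree attempting to model and correct the errors made by the sequence of trees before it. Quickly, the model reaches a point of diminishing returns.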
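The sequential process can be illustrated with a toy sketch. This is not how XGBoost is implemented; it is a bare-bones least-squares boosting loop over one-split trees on a made-up regression problem, and the names fit_stump and boost are hypothetical. It only shows the pattern of each new tree fitting the errors left by the previous ones, with gains shrinking as trees are added.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # skip the largest value so the right-hand side of a split is never empty
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees, learning_rate=0.5):
    """Return the training mean squared error after each added stump."""
    predictions = [0.0] * len(y)
    history = []
    for _ in range(n_trees):
        # each new stump is fit to the errors of the current ensemble
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        predictions = [
            p + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, predictions)
        ]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        history.append(mse)
    return history

# toy data: y = x squared on 20 evenly spaced points
x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]
history = boost(x, y, n_trees=30)

for n in (1, 5, 10, 20, 30):
    print(n, history[n - 1])
```

Note that this tracks training error only; the experiments in this post measure cross-validated log loss, which is the quantity that matters when choosing the number of trees.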

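We can demonstrate this point of diminishing returns easily on the Otto dataset.

The number of trees (or rounds) in an XGBoost model is specified to the XGBClassifier or XGBRegressor class in the n_estimators argument. The default in the XGBoost library is 100.

Using scikit-learn we can perform a grid search of the n_estimators model parameter, evaluating a series of values from 50 to 350 with a step size of 50 (50, 100, 150, 200, 250, 300, 350).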

# grid search
model = XGBClassifier()
n_estimators = range(50, 400, 50)
param_grid = dict(n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
result = grid_search.fit(X, label_encoded_y)

We could perform this grid search on the Otto dataset using 10-fold cross validation, requiring 70 models to be trained (7 configurations * 10 folds).
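A quick way to check the size of a grid before launching it is to expand the range and multiply by the number of folds. The variable names below mirror the snippet above; nothing is trained here.

```python
# expand the range used in the grid search above
n_estimators = list(range(50, 400, 50))
n_splits = 10

# one model is fit per configuration per cross-validation fold
total_fits = len(n_estimators) * n_splits

print(n_estimators)
print(total_fits)
```

GridSearchCV also refits the best configuration once on the full dataset when its refit argument is left at the default of True, so the real count is one higher.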

The full code listing is provided below.

# XGBoost on Otto dataset, Tune n_estimators
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X (columns 0-93: the id column plus the 93 features) and y (the class label)
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = range(50, 400, 50)
param_grid = dict(n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(n_estimators, means, yerr=stds)
pyplot.title("XGBoost n_estimators vs Log Loss")
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators.png')

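Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example prints the following results.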

Best: -0.001152 using {'n_estimators': 250}
-0.010970 (0.001083) with: {'n_estimators': 50}
-0.001239 (0.001730) with: {'n_estimators': 100}
-0.001163 (0.001715) with: {'n_estimators': 150}
-0.001153 (0.001702) with: {'n_estimators': 200}
-0.001152 (0.001702) with: {'n_estimators': 250}
-0.001152 (0.001704) with: {'n_estimators': 300}
-0.001153 (0.001706) with: {'n_estimators': 350}

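We can see that the cross validation log loss scores are negative. This is because the scikit-learn cross validation framework negates them. Internally, the framework requires that every metric being optimized is maximized, whereas log loss is a metric to be minimized. Negating the scores turns the search into a maximization problem.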
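Recovering ordinary log loss from the reported scores is a matter of flipping the sign. The sketch below copies the mean scores from the output above into a dictionary (the variable names are assumptions) and shows that maximizing the negated score and minimizing the log loss select the same configuration.

```python
# mean_test_score values copied from the printed results above
neg_scores = {
    50: -0.010970,
    100: -0.001239,
    150: -0.001163,
    200: -0.001153,
    250: -0.001152,
    300: -0.001152,
    350: -0.001153,
}

# negate to recover log loss, where lower is better
log_loss = {n: -score for n, score in neg_scores.items()}

best_by_score = max(neg_scores, key=neg_scores.get)
best_by_loss = min(log_loss, key=log_loss.get)

print(best_by_score, best_by_loss, log_loss[best_by_loss])
```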

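The best number of trees was n_estimators=250, resulting in a log loss of 0.001152, but this is not a significant difference from n_estimators=200. In fact, if we plot the results, there is little relative difference between 100 and 350 trees.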
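One way to act on this observation is to prefer the smallest model whose mean score lies within one standard deviation of the best. This heuristic is not part of the original experiment; it is an assumption introduced here for illustration, applied to the means and standard deviations printed above.

```python
# (n_estimators, mean negated log loss, standard deviation) from the output above
results = [
    (50, -0.010970, 0.001083),
    (100, -0.001239, 0.001730),
    (150, -0.001163, 0.001715),
    (200, -0.001153, 0.001702),
    (250, -0.001152, 0.001702),
    (300, -0.001152, 0.001704),
    (350, -0.001153, 0.001706),
]

# best configuration by mean score (the first one wins on ties)
best_n, best_mean, best_std = max(results, key=lambda row: row[1])

# accept any configuration within one standard deviation of the best mean
threshold = best_mean - best_std
candidates = [n for n, mean, _ in results if mean >= threshold]
smallest = min(candidates)

print(best_n, candidates, smallest)
```

Under this rule every value from 100 upward is indistinguishable from the best, and only n_estimators=50 is clearly worse.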

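Below is a line graph showing the relationship between the number of trees and the mean (inverted) log loss, with the standard deviation shown as error bars.

Tune the Number of Trees in XGBoost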

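Tune the Size of Decision Trees in XGBoost

In gradient boosting, we can control the size of the decision trees, also called the number of layers or the depth.

Shallow trees are expected to perform poorly on their own because they capture few details of the problem; they are generally referred to as weak learners. Deeper trees generally capture too many details of the problem and overfit the training dataset, limiting the ability to make good predictions on new data.

Generally, boosting algorithms are configured with weak learners, decision trees with few layers, sometimes as simple as a single split at the root node, in which case the model is called a decision stump rather than a decision tree.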
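To give a feel for how quickly tree size grows with depth, the sketch below computes the maximum number of leaves and split nodes in a binary tree of a given depth. The helper names are hypothetical, and these are upper bounds: a fitted tree may contain fewer nodes.

```python
def max_leaves(depth):
    """Upper bound on leaves in a binary tree with the given depth."""
    return 2 ** depth

def max_splits(depth):
    """Upper bound on internal (split) nodes in the same tree."""
    return 2 ** depth - 1

# a decision stump has depth 1: one split and two leaves
sizes = {depth: (max_splits(depth), max_leaves(depth)) for depth in (1, 3, 5, 7, 9)}

for depth, (splits, leaves) in sizes.items():
    print(depth, splits, leaves)
```

Each extra level of depth can double the number of leaves, which is why small changes in depth have a large effect on model capacity.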

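The maximum depth can be specified in the XGBClassifier and XGBRegressor wrapper classes for XGBoost in the max_depth parameter. This parameter takes an integer value and defaults to 3 in the library version used for this post (more recent XGBoost releases default to 6).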

model = XGBClassifier(max_depth=3)

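We can tune this hyperparameter of XGBoost using the grid search infrastructure in scikit-learn on the Otto dataset. Below we evaluate odd values for max_depth between 1 and 9 (1, 3, 5, 7, 9).

Each of the 5 configurations is evaluated using 10-fold cross validation, resulting in 50 models being constructed. The full code listing is provided below.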

# XGBoost on Otto dataset, Tune max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X (columns 0-93: the id column plus the 93 features) and y (the class label)
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
max_depth = range(1, 11, 2)
print(max_depth)
param_grid = dict(max_depth=max_depth)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(max_depth, means, yerr=stds)
pyplot.title("XGBoost max_depth vs Log Loss")
pyplot.xlabel('max_depth')
pyplot.ylabel('Log Loss')
pyplot.savefig('max_depth.png')

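Running this example prints the log loss for each max_depth.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The best configuration was max_depth=5, resulting in a log loss of 0.001236.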

Best: -0.001236 using {'max_depth': 5}
-0.026235 (0.000898) with: {'max_depth': 1}
-0.001239 (0.001730) with: {'max_depth': 3}
-0.001236 (0.001701) with: {'max_depth': 5}
-0.001237 (0.001701) with: {'max_depth': 7}
-0.001237 (0.001701) with: {'max_depth': 9}

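Reviewing the plot of log loss scores, we can see a marked jump from max_depth=1 to max_depth=3, then fairly even performance for the remaining values of max_depth.

Although the best score was observed for max_depth=5, it is worth noting that there was practically little difference between using max_depth=3 or max_depth=7.

This suggests a point of diminishing returns in max_depth on a problem, which you can tease out using grid search. A graph of max_depth values against (inverted) log loss is plotted below.

Tune Max Tree Depth in XGBoost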

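Tune the Number of Trees and Max Depth in XGBoost

There is a relationship between the number of trees in the model and the depth of each tree.

We would expect that deeper trees would result in fewer trees being required in the model, and conversely that simpler trees (such as decision stumps) would require many more trees to achieve similar results.

We can investigate this relationship by evaluating a grid of n_estimators and max_depth configuration values. To avoid the evaluation taking too long, we will limit the total number of configuration values evaluated. Parameters were chosen to tease out the relationship rather than optimize the model.

We will create a grid of 4 different n_estimators values (50, 100, 150, 200) and 4 different max_depth values (2, 4, 6, 8), and each combination will be evaluated using 10-fold cross validation. A total of 4*4*10 or 160 models will be trained and evaluated.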
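The grid can be enumerated by hand to confirm its size. The sketch below uses itertools.product with max_depth as the outer loop, which matches the order scikit-learn's ParameterGrid uses (parameter names sorted alphabetically, with the last one varying fastest). The list and variable names are assumptions for illustration.

```python
from itertools import product

n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
n_splits = 10

# max_depth is the outer loop, so n_estimators varies fastest
grid = [
    {"max_depth": depth, "n_estimators": trees}
    for depth, trees in product(max_depth, n_estimators)
]
total_fits = len(grid) * n_splits

print(len(grid), total_fits)
print(grid[:4])
```

Knowing this ordering matters later, because the flat list of mean scores has to be reshaped into one row per max_depth value before plotting.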

The full code listing is provided below.

# XGBoost on Otto dataset, Tune n_estimators and max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X (columns 0-93: the id column plus the 93 features) and y (the class label)
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
# plot results
scores = numpy.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_max_depth.png')

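Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the code produces a listing of the log loss for each parameter pair.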

Best: -0.001141 using {'n_estimators': 200, 'max_depth': 4}
-0.012127 (0.001130) with: {'n_estimators': 50, 'max_depth': 2}
-0.001351 (0.001825) with: {'n_estimators': 100, 'max_depth': 2}
-0.001278 (0.001812) with: {'n_estimators': 150, 'max_depth': 2}
-0.001266 (0.001796) with: {'n_estimators': 200, 'max_depth': 2}
-0.010545 (0.001083) with: {'n_estimators': 50, 'max_depth': 4}
-0.001226 (0.001721) with: {'n_estimators': 100, 'max_depth': 4}
-0.001150 (0.001704) with: {'n_estimators': 150, 'max_depth': 4}
-0.001141 (0.001693) with: {'n_estimators': 200, 'max_depth': 4}
-0.010341 (0.001059) with: {'n_estimators': 50, 'max_depth': 6}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 6}
-0.001163 (0.001688) with: {'n_estimators': 150, 'max_depth': 6}
-0.001154 (0.001679) with: {'n_estimators': 200, 'max_depth': 6}
-0.010342 (0.001059) with: {'n_estimators': 50, 'max_depth': 8}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 8}
-0.001161 (0.001688) with: {'n_estimators': 150, 'max_depth': 8}
-0.001153 (0.001679) with: {'n_estimators': 200, 'max_depth': 8}

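We can see that the best result was achieved with n_estimators=200 and max_depth=4, similar to the best values found in the previous two rounds of standalone parameter tuning (n_estimators=250, max_depth=5).

We can plot the relationship between each series of max_depth values for a given n_estimators.

Tune the Number of Trees and Max Tree Depth in XGBoost

The lines overlap, making it hard to see the relationship, but generally we can see the interaction we expect: fewer boosted trees are required as the tree depth increases.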
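Because the plotted lines overlap, a small table can make the interaction easier to read. The sketch below reuses the mean scores printed above and, for each max_depth, finds the smallest n_estimators that reaches a log loss target. The target of 0.00127 is a hypothetical value picked for illustration, not one used in the original experiment.

```python
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]

# mean negated log loss in the printed order: one block of four per max_depth
means = [
    -0.012127, -0.001351, -0.001278, -0.001266,
    -0.010545, -0.001226, -0.001150, -0.001141,
    -0.010341, -0.001237, -0.001163, -0.001154,
    -0.010342, -0.001237, -0.001161, -0.001153,
]

# split the flat list into one row per max_depth value
width = len(n_estimators)
rows = [means[i * width:(i + 1) * width] for i in range(len(max_depth))]

target = 0.00127  # hypothetical log loss target
trees_needed = {}
for depth, row in zip(max_depth, rows):
    reached = [n for n, score in zip(n_estimators, row) if -score <= target]
    trees_needed[depth] = min(reached) if reached else None

print(trees_needed)
```

With this target, trees of depth 2 need 200 boosting rounds while trees of depth 4 or more get there in 100, which is the interaction described above.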

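Further, we would expect the increased complexity provided by deeper individual trees to result in greater overfitting of the training data, which would be exacerbated by having more trees, in turn resulting in a lower cross validation score. We don't see this here because our trees are not that deep and we do not have too many of them. Exploring this expectation is left as an exercise.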

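Summary

In this post, you discovered how to tune the number and depth of decision trees when using gradient boosting with XGBoost in Python.

Specifically, you learned:

How to tune the number of decision trees in an XGBoost model.
How to tune the depth of decision trees in an XGBoost model.
How to jointly tune the number of trees and tree depth in an XGBoost model.

Do you have any questions about the number or size of decision trees in your gradient boosting model or about this post? Ask your questions in the comments and I will do my best to answer.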
