Boosting Over Bagging: Enhancing Predictive Accuracy with Gradient Boosting Regressors

By Vinod Chugani on February 28, 2025 in Intermediate Data Science

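Ensemble learning techniques fall primarily into two categories: bagging and boosting. Bagging improves stability and accuracy by aggregating independent predictions, whereas boosting corrects the errors of prior models sequentially, improving performance with each iteration. This post begins a deep dive into boosting, starting with the Gradient Boosting Regressor. Through its application on the Ames Housing dataset, we will show how boosting uniquely enhances models, setting the stage for exploring various boosting techniques in upcoming posts.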

Let's get started.

Photo by Erol Ahmed. Some rights reserved.

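Overview

This post is divided into four parts; they are:

What is Boosting?
Comparing Model Performance: The Decision Tree Baseline to Gradient Boosting Ensembles
Optimizing Gradient Boosting with Learning Rate Adjustments
Final Optimization: Tuning Learning Rate and Number of Trees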

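What is Boosting?

Boosting is an ensemble technique that combines multiple models to create a strong learner. Unlike ensemble methods that can build their models in parallel, boosting adds models sequentially, with each new model focusing on the areas where the previous models performed poorly. This systematically improves the ensemble's accuracy with each iteration, which makes it particularly effective for complex datasets.

Key Features of Boosting

Sequential Learning: Boosting builds one model at a time. Each new model learns from the shortcomings of the previous ones, progressively improving the ensemble's ability to capture the complexity of the data.
Error Correction: New learners focus on the instances that were previously mispredicted, continuously strengthening the ensemble's ability to capture difficult patterns in the data.
Model Complexity: The ensemble's complexity grows as more models are added, enabling it to capture intricate data structures effectively.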
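The sequential, error-correcting loop described above can be sketched by hand. The snippet below is a simplified illustration for squared-error loss on synthetic data, not scikit-learn's implementation; the data, the tree depth, the number of rounds, and the shrinkage value are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only, not the Ames data)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # shrinks the contribution of each new tree
n_rounds = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
mse_per_round = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)      # the new learner is trained on those errors
    prediction += learning_rate * tree.predict(X)
    mse_per_round.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_per_round[-1]:.4f}")
```

Each round fits a small tree to the current residuals and adds a scaled-down copy of its predictions, so the training error shrinks step by step.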

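Boosting vs. Bagging

Bagging builds multiple models, usually independently of one another, and combines their outputs to improve the ensemble's overall performance, primarily by reducing the risk of overfitting to noise in the training data. Boosting, in contrast, focuses on improving predictive accuracy by learning from errors sequentially, which allows it to adapt more closely to the data.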
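For contrast with boosting, here is a minimal hand-written sketch of bagging: each tree is trained independently on its own bootstrap sample and the predictions are averaged. The synthetic data and the choice of 50 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (illustrative only, not the Ames data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)
X_test = np.linspace(0, 10, 100).reshape(-1, 1)
y_test_true = np.sin(X_test).ravel()

n_trees = 50
all_preds = []
for i in range(n_trees):
    # Bootstrap sample: draw rows with replacement; trees never see each other
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=i).fit(X[idx], y[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)
bagged = all_preds.mean(axis=0)  # aggregate by averaging

mean_individual_mse = np.mean((all_preds - y_test_true) ** 2)
bagged_mse = np.mean((bagged - y_test_true) ** 2)
print(f"Average MSE of the individual trees: {mean_individual_mse:.4f}")
print(f"MSE of the bagged prediction:        {bagged_mse:.4f}")
```

Because the trees are independent, they could be trained in parallel. Averaging smooths out the noise that each fully grown tree fits on its own sample.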

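Boosting Regressors in scikit-learn

Scikit-learn provides several boosting implementations tailored to different needs and data scenarios:

AdaBoost Regressor: Employs a sequence of weak learners and adjusts their focus based on the errors of the previous models, improving where past models fell short.
Gradient Boosting Regressor: Builds models one at a time, with each new model trained to correct the residuals (errors) left by the previous ones, improving accuracy through careful adjustment.
HistGradientBoosting Regressor: An optimized form of gradient boosting designed for larger datasets, which speeds up computation by binning feature values into histograms.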

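Each method applies the core principle of boosting to improve on its component learners, showing the versatility and power of this approach for predictive modeling challenges. In the following sections of this post, we will demonstrate a practical application of the Gradient Boosting Regressor using the Ames Housing dataset.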

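Comparing Model Performance: The Decision Tree Baseline to Gradient Boosting Ensembles

Moving from the theory of boosting to its practical application, this section demonstrates the Gradient Boosting Regressor on a carefully preprocessed Ames Housing dataset. The preprocessing steps are identical across all the tree-based models, so any observed improvement can be attributed directly to the model itself, which makes for a fair comparison.

The code below sets up the comparison by first establishing a baseline with a single Decision Tree, which is not an ensemble method. This baseline lets us show clearly the incremental benefit of the actual ensemble methods. We then configure two versions each of Bagging, Random Forest, and the Gradient Boosting Regressor, with 100 and 200 trees respectively, to explore the improvement these ensemble techniques offer over the baseline.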

# Import necessary libraries for preprocessing and modeling
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, FunctionTransformer
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor, RandomForestRegressor

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Adjust data types for categorical variables
for col in ['MSSubClass', 'YrSold', 'MoSold']:
    Ames[col] = Ames[col].astype('object')

# Exclude 'PID' and 'SalePrice' from features and specifically handle the 'Electrical' column
numeric_features = Ames.select_dtypes(include=['int64', 'float64']).drop(columns=['PID', 'SalePrice']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns.difference(['Electrical'])
electrical_feature = ['Electrical']

# Manually specify the categories for ordinal encoding according to the data dictionary
ordinal_order = {
    'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],  # Electrical system
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],  # General shape of property
    'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],  # Type of utilities available
    'LandSlope': ['Sev', 'Mod', 'Gtl'],  # Slope of property
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the quality of the material on the exterior
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the present condition of the material on the exterior
    'BsmtQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Height of the basement
    'BsmtCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # General condition of the basement
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],  # Walkout or garden level basement walls
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of basement finished area
    'BsmtFinType2': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of second basement finished area
    'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Heating quality and condition
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Kitchen quality
    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],  # Home functionality
    'FireplaceQu': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Fireplace quality
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],  # Interior finish of the garage
    'GarageQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage quality
    'GarageCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage condition
    'PavedDrive': ['N', 'P', 'Y'],  # Paved driveway
    'PoolQC': ['None', 'Fa', 'TA', 'Gd', 'Ex'],  # Pool quality
    'Fence': ['None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']  # Fence quality
}

# Extract list of ALL ordinal features from dictionary
ordinal_features = list(ordinal_order.keys())

# List of ordinal features except Electrical
ordinal_except_electrical = [feature for feature in ordinal_features if feature != 'Electrical']

# Define transformations for various feature types
electrical_transformer = Pipeline(steps=[
    ('impute_electrical', SimpleImputer(strategy='most_frequent')),
    ('ordinal_electrical', OrdinalEncoder(categories=[ordinal_order['Electrical']]))
])

numeric_transformer = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean'))
])

# Updated categorical imputer using SimpleImputer
categorical_imputer = SimpleImputer(strategy='constant', fill_value='None')

ordinal_transformer = Pipeline([
    ('impute_ordinal', categorical_imputer),
    ('ordinal', OrdinalEncoder(categories=[ordinal_order[feature] for feature in ordinal_features if feature in ordinal_except_electrical]))
])

nominal_features = [feature for feature in categorical_features if feature not in ordinal_features]
categorical_transformer = Pipeline([
    ('impute_nominal', categorical_imputer),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor for numeric, ordinal, nominal, and specific electrical data
preprocessor = ColumnTransformer(
    transformers=[
        ('electrical', electrical_transformer, ['Electrical']),
        ('num', numeric_transformer, numeric_features),
        ('ordinal', ordinal_transformer, ordinal_except_electrical),
        ('nominal', categorical_transformer, nominal_features)
])

# Define model pipelines including Gradient Boosting Regressor
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
    # Note: BaggingRegressor's 'base_estimator' argument was renamed to 'estimator' in scikit-learn 1.2
    'Bagging Regressor (100 Decision Trees)': BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42),
                                                      n_estimators=100, random_state=42),
    'Bagging Regressor (200 Decision Trees)': BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42),
                                                      n_estimators=200, random_state=42),
    'Random Forest (Default of 100 Trees)': RandomForestRegressor(random_state=42),
    'Random Forest (200 Trees)': RandomForestRegressor(n_estimators=200, random_state=42),
    'Gradient Boosting Regressor (Default of 100 Trees)': GradientBoostingRegressor(random_state=42),
    'Gradient Boosting Regressor (200 Trees)': GradientBoostingRegressor(n_estimators=200, random_state=42)
}

# Evaluate models using cross-validation and print results
results = {}
for name, model in models.items():
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'], cv=5)
    results[name] = round(scores.mean(), 4)
    print(f"{name}: Mean CV R² = {results[name]}")

Below are the cross-validation results, showing how each model performs in terms of mean R² value.

Decision Tree (1 Tree): Mean CV R² = 0.7663
Bagging Regressor (100 Decision Trees): Mean CV R² = 0.8957
Bagging Regressor (200 Decision Trees): Mean CV R² = 0.897
Random Forest (Default of 100 Trees): Mean CV R² = 0.8954
Random Forest (200 Trees): Mean CV R² = 0.8969
Gradient Boosting Regressor (Default of 100 Trees): Mean CV R² = 0.9027
Gradient Boosting Regressor (200 Trees): Mean CV R² = 0.9061

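The results from our ensemble models reveal several key insights into the behavior and performance of advanced regression techniques:

Baseline and Improvement: Starting from a basic Decision Tree Regressor as the baseline, with an R² of 0.7663, we observe a substantial gain as we introduce more complex models. Both the Bagging and Random Forest Regressors, at either number of trees, reach scores of roughly 0.895 to 0.897, which illustrates the power of ensemble methods in combining multiple learners to reduce error.
Gradient Boosting Regressor's Edge: The Gradient Boosting Regressor stands out. With the default setting of 100 trees it achieves an R² of 0.9027, and increasing the number of trees to 200 raises the score to 0.9061. This shows the effectiveness of GBR in this context and highlights how additional learners yield sequential improvement.
Marginal Gains from More Trees: Increasing the number of trees improves performance in every case here, but the incremental gain shrinks as the ensemble grows. Going from 100 to 200 trees adds only 0.0013 for Bagging, 0.0015 for Random Forest, and 0.0034 for Gradient Boosting, which suggests a point of diminishing returns where additional computation buys only minimal improvement.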
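One way to look at diminishing returns without refitting a model for every ensemble size is staged_predict, which yields the ensemble's predictions after each additional tree. The sketch below uses synthetic data and a single train/test split as assumptions for illustration; it is not the cross-validation setup used on the Ames data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data and a simple hold-out split (illustrative only)
X, y = make_regression(n_samples=600, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

gbr = GradientBoostingRegressor(n_estimators=300, random_state=42)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
staged_r2 = [r2_score(y_test, pred) for pred in gbr.staged_predict(X_test)]

for n in (50, 100, 200, 300):
    print(f"{n:>3} trees: held-out R² = {staged_r2[n - 1]:.4f}")
```

Plotting staged_r2 against the number of trees typically shows a steep initial rise followed by a long, flat tail.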

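The results highlight the Gradient Boosting Regressor's strong performance. It makes effective use of the comprehensive preprocessing and of the sequential improvement strategy that characterizes boosting. Next, we will explore how adjusting the learning rate can refine the model's performance and improve its predictive accuracy.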

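Optimizing Gradient Boosting with Learning Rate Adjustments

The learning_rate is specific to boosting models such as the Gradient Boosting Regressor, which sets them apart from models like Decision Trees and Random Forests that have no direct equivalent of this parameter. Adjusting the learning_rate lets us look more closely at the mechanics of boosting and strengthen the model's predictive power by fine-tuning how quickly it learns from each successive tree.

What is the Learning Rate?

In the Gradient Boosting Regressor and other algorithms based on gradient descent, the "learning rate" is a crucial hyperparameter that controls how fast the model learns. At its core, the learning rate determines the size of the steps the model takes toward the optimal solution during training. Here is a breakdown:

Step Size: The learning rate determines the magnitude of each update during training. In gradient boosting, it scales the contribution of every new tree to the ensemble. A higher learning rate makes larger updates, so the model learns faster but risks overshooting the optimal solution. A lower learning rate makes smaller updates, so the model learns more slowly but potentially with greater precision.
Impact on Model Training:
- Convergence: A learning rate that is too high may cause training to converge too quickly to a suboptimal solution, or not to converge at all because it overshoots the minimum.
- Accuracy and Overfitting: A learning rate that is too low makes the model learn slowly, so it needs more trees to reach similar accuracy, and adding many trees may lead to overfitting if not monitored.
Tuning: Choosing the right learning rate balances speed against accuracy. It is usually selected by trial and error or by more systematic approaches such as GridSearchCV and RandomizedSearchCV, because the learning rate can significantly affect both the model's performance and its training time.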
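The step-size effect can be seen by holding the number of trees fixed and varying only the learning rate. This is a small sketch on synthetic data, with 50 trees and three learning rates chosen as assumptions for illustration. It reports training error only, so it shows how fast each model fits the data, not how well it generalizes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (illustrative only, not the Ames data)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

train_mse = {}
for lr in (0.01, 0.1, 0.5):
    # Same number of trees each time; only the step size changes
    gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=lr, random_state=42)
    gbr.fit(X, y)
    train_mse[lr] = mean_squared_error(y, gbr.predict(X))
    print(f"learning_rate={lr}: training MSE after 50 trees = {train_mse[lr]:.1f}")
```

With so few trees, the smallest learning rate leaves most of the error uncorrected, which is why a low learning rate is usually paired with a larger n_estimators.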

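By adjusting the learning rate, data scientists can control how quickly a boosting model adapts to the complexity of its errors. This makes the learning rate a powerful tool for fine-tuning model performance, especially in boosting algorithms, where each new tree is built to correct the residuals (errors) left by the previous trees.

To optimize the learning_rate, we start with GridSearchCV, a systematic method that explores a predefined set of values ([0.001, 0.01, 0.1, 0.2, 0.3]) to find the setting that gives the best accuracy.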

# Experiment with GridSearchCV
from sklearn.model_selection import GridSearchCV

# Parameter grid for GridSearchCV
param_grid = {
    'regressor__learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3]
}

# Setup the GridSearchCV
# model_pipeline is the last pipeline built in the loop above: Gradient Boosting with 200 trees
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='r2', verbose=1)

# Fit the GridSearchCV to the data
grid_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Grid Search
print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best score (Grid Search):", round(grid_search.best_score_, 4))

Below are the results from our GridSearchCV, which focuses solely on optimizing the learning_rate parameter.

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best parameters (Grid Search): {'regressor__learning_rate': 0.1}
Best score (Grid Search): 0.9061

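Using GridSearchCV, we found that a learning_rate of 0.1 yielded the best result, matching the default setting. Among the values tested, then, neither a higher nor a lower learning rate improved the model for our dataset and preprocessing setup. The pipeline tuned here is the 200-tree model from the previous section, which is why the best score equals the 0.9061 reported there.

Following this, we use RandomizedSearchCV to widen the search. Unlike GridSearchCV, RandomizedSearchCV samples at random from a continuous range, so it can try values that fall between the standard grid points and give a fuller picture of how subtle changes in learning_rate affect performance.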

# Experiment with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Parameter distribution for RandomizedSearchCV
param_dist = {
    'regressor__learning_rate': uniform(0.001, 0.299)  # Uniform distribution between 0.001 and 0.3
}

# Setup the RandomizedSearchCV
random_search = RandomizedSearchCV(model_pipeline, param_distributions=param_dist,
                                   n_iter=50, cv=5, scoring='r2', verbose=1, random_state=42)

# Fit the RandomizedSearchCV to the data
random_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Random Search
print("Best parameters (Random Search):", random_search.best_params_)
print("Best score (Random Search):", round(random_search.best_score_, 4))
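A detail in the parameter distribution above is easy to misread: scipy.stats.uniform takes a starting point (loc) and a width (scale), not a lower and an upper bound. The short check below, with an arbitrary sample size chosen for illustration, confirms that uniform(0.001, 0.299) covers the interval from 0.001 to 0.3.

```python
from scipy.stats import uniform

# uniform(loc, scale) is supported on [loc, loc + scale]
dist = uniform(loc=0.001, scale=0.299)
low, high = dist.support()
print(f"Support: [{low}, {high}]")

# Draw a reproducible sample and inspect its range
samples = dist.rvs(size=1000, random_state=42)
print(f"Smallest draw: {samples.min():.4f}")
print(f"Largest draw:  {samples.max():.4f}")
```

Writing uniform(0.001, 0.3) instead would extend the search range to 0.301.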

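Compared with GridSearchCV, RandomizedSearchCV identified a slightly different optimal learning_rate of approximately 0.158, which improved our model's performance. This improvement highlights the value of a randomized search, particularly when fine-tuning, because it can try a more diverse set of candidate values and may arrive at a better configuration.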

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters (Random Search): {'regressor__learning_rate': 0.1579021730580391}
Best score (Random Search): 0.9134

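The optimization through RandomizedSearchCV proved effective: it identified a learning rate that lifts the R² score to 0.9134, up from 0.9061 with the grid search. These experiments in tuning learning_rate with GridSearchCV and RandomizedSearchCV illustrate the delicate balance required when tuning gradient boosting models. They also show the benefit of trying both systematic and randomized search strategies.

Encouraged by these gains, we now extend our focus to tuning learning_rate and n_estimators simultaneously. This next phase explores the combined effect of these two key parameters on the Gradient Boosting Regressor's performance in search of an even better setting.

Final Optimization: Tuning Learning Rate and Number of Trees

Building on our previous findings, we now take a more comprehensive approach and tune learning_rate and n_estimators together. This dual-parameter tuning explores how the two parameters work together, potentially improving the Gradient Boosting Regressor further.

We begin with GridSearchCV to explore combinations of learning_rate and n_estimators systematically. This provides a structured way to assess how changing both parameters affects the model's accuracy.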

# Build on previous blocks of code
# 'preprocessor' is already set up as your preprocessing pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(random_state=42))
])

# Parameter grid for GridSearchCV
param_grid = {
    'regressor__learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
    'regressor__n_estimators': [100, 200, 300, 400, 500]
}

# Setup the GridSearchCV
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='r2', verbose=1)

# Fit the GridSearchCV to the data
grid_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Grid Search
print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best score (Grid Search):", round((grid_search.best_score_), 4))

The GridSearchCV process evaluated 25 different combinations across 5 folds, for a total of 125 fits.

Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best parameters (Grid Search): {'regressor__learning_rate': 0.1, 'regressor__n_estimators': 500}
Best score (Grid Search): 0.9089

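It confirmed that a learning_rate of 0.1, the default setting, remains effective. However, it suggested that increasing to 500 trees slightly improves our model, raising the R² score to 0.9089. This is a modest gain over the R² of 0.9061 obtained earlier with 200 trees and a learning_rate of 0.1. Interestingly, our previous randomized search reached a better result of 0.9134 with only 200 trees and a learning_rate of approximately 0.158, which illustrates the potential benefit of searching beyond a fixed grid of values.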
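The best_params_ attribute reports only the winning combination. To see how the two parameters interact across the whole grid, the cv_results_ attribute can be reshaped into a table. This is a sketch on synthetic data with a deliberately small grid and a bare regressor, all assumptions for illustration. With the pipeline used in this post, the column names would carry the regressor__ prefix, for example param_regressor__learning_rate.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid to keep the example fast
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
param_grid = {"learning_rate": [0.01, 0.1, 0.3], "n_estimators": [50, 100]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42), param_grid, cv=3, scoring="r2"
)
search.fit(X, y)

# One row per candidate; pivot into a learning_rate x n_estimators table
results = pd.DataFrame(search.cv_results_)
results["learning_rate"] = results["param_learning_rate"].astype(float)
results["n_estimators"] = results["param_n_estimators"].astype(int)
score_table = results.pivot(
    index="learning_rate", columns="n_estimators", values="mean_test_score"
)
print(score_table.round(4))
```

Reading across a row shows the effect of adding trees at a fixed learning rate, and reading down a column shows the effect of the learning rate at a fixed ensemble size.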

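To make sure we have explored the parameter space thoroughly, and possibly to find an even better configuration, we now use RandomizedSearchCV. By sampling from distributions of parameter values rather than a fixed grid, this method allows a more exploratory, less rigid search.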

# Build on previous blocks of code
from scipy.stats import uniform, randint

# Parameter distribution for RandomizedSearchCV
param_dist = {
    'regressor__learning_rate': uniform(0.001, 0.299),  # Uniform distribution between 0.001 and 0.3
    'regressor__n_estimators': randint(100, 501)  # Uniform distribution of integers from 100 to 500
}

# Setup the RandomizedSearchCV
random_search = RandomizedSearchCV(model_pipeline, param_distributions=param_dist,
                                   n_iter=50, cv=5, scoring='r2', verbose=1, random_state=42)

# Fit the RandomizedSearchCV to the data
random_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Random Search
print("Best parameters (Random Search):", random_search.best_params_)
print("Best score (Random Search):", round((random_search.best_score_), 4))

The RandomizedSearchCV extended our search, testing 50 different configurations across 5 folds for a total of 250 fits.

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters (Random Search): {'regressor__learning_rate': 0.12055843054286139, 'regressor__n_estimators': 287}
Best score (Random Search): 0.9158

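It identified a more effective setting, with a learning_rate of approximately 0.121 and n_estimators of 287, achieving our best R² score so far of 0.9158. This underscores the potential of randomized parameter tuning to find settings that a more rigid grid might miss.

To validate the performance gain from our tuning efforts, we now run a final cross-validation using a Gradient Boosting Regressor configured with the best parameters: n_estimators set to 287 and a learning_rate of approximately 0.121.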

# Build on previous blocks of code
# Cross check model performance of Gradient Boosting Regressor with tuned parameters

# 'preprocessor' is already set up as your preprocessing pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(n_estimators=287, learning_rate=0.12055843054286139, random_state=42))
])

# Using the full dataset X, y
X = Ames.drop(columns='SalePrice')
y = Ames['SalePrice']

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model_pipeline, X, y, cv=5, scoring='r2')

# Output the mean cross-validated score of tuned model
print("Performance of Gradient Boosting Regressor with tuned parameters:", round(cv_scores.mean(), 4))

The final output confirms the performance of our tuned Gradient Boosting Regressor.

Performance of Gradient Boosting Regressor with tuned parameters: 0.9158

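By optimizing both learning_rate and n_estimators, we achieved an R² score of 0.9158. This score reproduces the result of the randomized search and confirms the improvement gained through parameter tuning, and it shows the Gradient Boosting Regressor's ability to adapt and perform consistently across the dataset.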
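The tuned values do not have to be copied by hand from the search output. As an alternative sketch, the fitted search object's best_params_ can be passed straight to a fresh copy of the model. The synthetic data, the small search budget, and the names base_model and tuned_model are assumptions for illustration.

```python
from scipy.stats import randint, uniform
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic data and a small search to keep the example fast
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

base_model = GradientBoostingRegressor(random_state=42)
param_dist = {
    "learning_rate": uniform(0.001, 0.299),  # values between 0.001 and 0.3
    "n_estimators": randint(50, 151),        # integers from 50 to 150
}
search = RandomizedSearchCV(
    base_model, param_dist, n_iter=5, cv=3, scoring="r2", random_state=42
)
search.fit(X, y)

# Rebuild the model from the winning parameters instead of typing them in
tuned_model = clone(base_model).set_params(**search.best_params_)
cv_scores = cross_val_score(tuned_model, X, y, cv=3, scoring="r2")

print("Best parameters:", search.best_params_)
print("Search best score:", round(search.best_score_, 4))
print("Re-validated score:", round(cv_scores.mean(), 4))
```

Because the folds and the random_state are the same in both steps, the re-validated score matches the search's best score, just as 0.9158 appears twice in the results above.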

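Summary

This post explored the capabilities of the Gradient Boosting Regressor (GBR), from the foundational concepts of boosting to advanced optimization techniques on the Ames Housing dataset. It focused on GBR's key parameters, the number of trees and the learning rate, which are essential for the model's accuracy and efficiency. Using both systematic and randomized approaches, it demonstrated how to fine-tune these parameters with GridSearchCV and RandomizedSearchCV, raising the mean cross-validated R² from 0.9027 with default settings to 0.9158.

Specifically, you learned:

The fundamentals of boosting and how it differs from other ensemble techniques such as bagging.
How to achieve incremental improvements by experimenting with a range of models.
Techniques for tuning the learning rate and the number of trees for the Gradient Boosting Regressor.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.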

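About Vinod Chugani

Born in India and raised in Japan, I am a Third Culture Kid with a global perspective. I studied Economics at Duke University, where I was fortunate to be elected to Phi Beta Kappa in my junior year. Over the years I have gained diverse professional experience, spending a decade navigating Wall Street's intricate fixed income sector and then leading a global distribution venture on Main Street. Currently, I channel my passion for data science, machine learning, and AI as a mentor at the New York City Data Science Academy. I value the opportunity to spark curiosity and share knowledge, whether through live learning sessions or in-depth one-on-one interactions. With a foundation in finance and entrepreneurship and my current immersion in the data field, I approach the future with a sense of purpose and assurance. I look forward to further exploration, continuous learning, and the opportunity to contribute meaningfully to the ever-evolving fields of data science and machine learning, especially here at MLM.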
