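From Single Trees to Forests: Enhancing Real Estate Predictions with Ensembles

By Vinod Chugani on February 28, 2025 in Intermediate Data Science

This post dives into the application of tree-based models, particularly decision trees, bagging, and random forests, to the Ames Housing dataset. It begins by emphasizing the critical role of preprocessing, a fundamental step that ensures the data is configured for what these models require. The path from a single decision tree to a robust ensemble of trees highlights the transformative effect that multiple trees can have on predictive performance. As we work through model evaluation and enhancement, the goal is to give you practical insights and advanced strategies to refine your approach to machine learning and real estate price prediction.

Kick-start your project with my book Next-Level Data Science. It provides self-study tutorials with working code.

Let's get started.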

Photo by Steven Kamenar. Some rights reserved.

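Overview

This post is divided into four parts; they are:

Laying the Groundwork: Preprocessing Techniques for Tree-Based Models
Assessing the Basics: Evaluating a Decision Tree Regressor
Improving Predictions: Introduction to Bagging with Decision Trees
Advanced Ensembles: Comparing Bagging and Random Forest Regressors

Laying the Groundwork: Preprocessing Techniques for Tree-Based Models

Preprocessing is crucial in any data science workflow, especially when working with tree-based models. The first part of this post brings together key techniques from earlier discussions, such as ordinal encoding (from the post "Decision Trees and Ordinal Encoding: A Practical Guide"), one-hot encoding, and several imputation methods, to make sure our dataset is properly prepared for tree models. To demonstrate these principles in practice, let's walk through an example that applies these preprocessing techniques to the Ames Housing dataset.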

# Import necessary libraries for preprocessing
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Convert the below numeric features to categorical features
Ames['MSSubClass'] = Ames['MSSubClass'].astype('object')
Ames['YrSold'] = Ames['YrSold'].astype('object')
Ames['MoSold'] = Ames['MoSold'].astype('object')

# Exclude 'PID' and 'SalePrice' from features and specifically handle the 'Electrical' column
numeric_features = Ames.select_dtypes(include=['int64', 'float64']).drop(columns=['PID', 'SalePrice']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns.difference(['Electrical'])
electrical_feature = ['Electrical']

# Manually specify the categories for ordinal encoding according to the data dictionary
ordinal_order = {
    'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],  # Electrical system
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],  # General shape of property
    'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],  # Type of utilities available
    'LandSlope': ['Sev', 'Mod', 'Gtl'],  # Slope of property
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the quality of the material on the exterior
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the present condition of the material on the exterior
    'BsmtQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Height of the basement
    'BsmtCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # General condition of the basement
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],  # Walkout or garden level basement walls
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of basement finished area
    'BsmtFinType2': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of second basement finished area
    'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Heating quality and condition
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Kitchen quality
    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],  # Home functionality
    'FireplaceQu': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Fireplace quality
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],  # Interior finish of the garage
    'GarageQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage quality
    'GarageCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage condition
    'PavedDrive': ['N', 'P', 'Y'],  # Paved driveway
    'PoolQC': ['None', 'Fa', 'TA', 'Gd', 'Ex'],  # Pool quality
    'Fence': ['None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']  # Fence quality
}

# Extract list of ALL ordinal features from dictionary
ordinal_features = list(ordinal_order.keys())

# List of ordinal features except Electrical
ordinal_except_electrical = [feature for feature in ordinal_features if feature != 'Electrical']

# Helper function to fill 'None' for missing categorical data
def fill_none(X):
    return X.fillna("None")

# Pipeline for 'Electrical': Fill missing value with mode then apply ordinal encoding
electrical_transformer = Pipeline(steps=[
    ('impute_electrical', SimpleImputer(strategy='most_frequent')),
    ('ordinal_electrical', OrdinalEncoder(categories=[ordinal_order['Electrical']]))
])

# Pipeline for numeric features: Impute missing values using mean
numeric_transformer = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean'))
])

# Pipeline for ordinal features: Fill missing values with 'None' then apply ordinal encoding
ordinal_transformer = Pipeline(steps=[
    ('fill_none', FunctionTransformer(fill_none, validate=False)),
    ('ordinal', OrdinalEncoder(categories=[ordinal_order[feature] for feature in ordinal_features if feature in ordinal_except_electrical]))
])

# Pipeline for nominal categorical features: Fill missing values with 'None' then apply one-hot encoding
nominal_features = [feature for feature in categorical_features if feature not in ordinal_features]
categorical_transformer = Pipeline(steps=[
    ('fill_none', FunctionTransformer(fill_none, validate=False)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor for numeric, ordinal, nominal, and specific electrical data
preprocessor = ColumnTransformer(
    transformers=[
        ('electrical', electrical_transformer, ['Electrical']),
        ('num', numeric_transformer, numeric_features),
        ('ordinal', ordinal_transformer, ordinal_except_electrical),
        ('nominal', categorical_transformer, nominal_features)
])

# Apply the preprocessing pipeline to Ames
transformed_data = preprocessor.fit_transform(Ames).toarray()

# Generate column names for the one-hot encoded features
onehot_features = preprocessor.named_transformers_['nominal'].named_steps['onehot'].get_feature_names_out()

# Combine all feature names
all_feature_names = ['Electrical'] + list(numeric_features) + list(ordinal_except_electrical) + list(onehot_features)

# Convert the transformed array to a DataFrame
transformed_df = pd.DataFrame(transformed_data, columns=all_feature_names)

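With the data loaded and the initial transformations in place, we now have a structured approach to handling missing values and encoding categorical variables appropriately. The summary below outlines the key preprocessing tasks we have completed, which lay a solid foundation for the modeling stages that follow.

Data categorization:
- Converted "MSSubClass", "YrSold", and "MoSold" from numeric to categorical data types to reflect what they actually represent.
Excluding non-predictor columns:
- Removed "PID" (a unique identifier) and "SalePrice" (the target) from the feature set so that only predictor variables remain.
Handling missing values:
- Numeric features: imputed missing values with the mean, which leaves each feature's mean unchanged.
- Categorical features: filled missing values with "None" for all categorical features except "Electrical", as guided by the data dictionary.
- "Electrical" feature: imputed its one missing value with the mode, as guided by the data dictionary.
Encoding categorical data:
- Ordinal features: encoded with a predefined order to respect the ranking inherent in the data (for example, "ExterQual" from Poor to Excellent).
- Nominal features: applied one-hot encoding, which creates a binary column for each category.
Pipelines for streamlined processing:
- Set up separate pipelines for the numeric, ordinal, and nominal features (plus one for "Electrical") to streamline the transformations and ensure they are applied consistently across the dataset.
Combined preprocessing:
- Used a `ColumnTransformer` to apply all pipelines in one step, which makes the data transformation process more efficient and manageable.
Applying the transformations and inspecting the result:
- Applied the preprocessing pipeline to the dataset, converted the transformed array back into a DataFrame, and named the columns systematically, especially after one-hot encoding, for easy identification and analysis.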

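Looking at the transformed DataFrame printed below, we can see clearly how the preprocessing steps have changed the data. The transformation ensures that every feature is properly formatted and ready for the next steps of our analysis. Note how each categorical and numeric feature has been handled so as to retain as much information as possible.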

# Optional command for an expanded view
# pd.set_option('display.max_columns', None)

# View the transformed data
print(transformed_df)

Electrical  GrLivArea  LotFrontage  ...  YrSold_2008  YrSold_2009  YrSold_2010
0            4.0      856.0    68.510628  ...          0.0          0.0          1.0
1            4.0     1049.0    42.000000  ...          0.0          1.0          0.0
2            4.0     1001.0    60.000000  ...          0.0          0.0          0.0
3            4.0     1039.0    80.000000  ...          0.0          1.0          0.0
4            4.0     1665.0    70.000000  ...          0.0          1.0          0.0
...          ...        ...          ...  ...          ...          ...          ...
2574         2.0      952.0    68.510628  ...          0.0          1.0          0.0
2575         3.0     1733.0    68.510628  ...          0.0          1.0          0.0
2576         3.0     2002.0    82.000000  ...          0.0          0.0          0.0
2577         4.0     1842.0    68.510628  ...          0.0          0.0          0.0
2578         4.0     1911.0    80.000000  ...          0.0          0.0          0.0

[2579 rows x 2819 columns]

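The original dataset has now expanded to 2,819 columns. A quick calculation lets us cross-check the expected number of columns after the transformation.

# Quick check of the number of columns after preprocessing
print(len(numeric_features) + len(ordinal_features) + Ames[nominal_features].fillna("None").nunique().sum())

This check counts one column for each numeric and ordinal feature (including "Electrical") plus one column for every distinct category of each nominal feature. The total matches the width of the transformed DataFrame, confirming that all transformations were applied as intended.

2819

Ensuring the integrity of the data at this stage is crucial for building reliable models.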
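The counting logic behind that check can be illustrated on a small, made-up frame. The data and the `toy_nominal` name below are hypothetical; only the fill rule and the arithmetic mirror the post. The sketch fills missing values with "None", sums the number of distinct categories per column, and compares that with the width the one-hot encoder actually produces.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal columns with missing values (not Ames data)
toy_nominal = pd.DataFrame({
    "Alley": ["Grvl", None, "Pave", None],
    "RoofStyle": ["Gable", "Hip", "Gable", "Flat"],
})

# Same fill rule as fill_none in the post: missing becomes the category "None"
filled = toy_nominal.fillna("None")

# Expected one-hot width: distinct categories summed over all columns
expected_columns = int(filled.nunique().sum())

# Actual width produced by the encoder
encoder = OneHotEncoder(handle_unknown="ignore")
actual_columns = encoder.fit_transform(filled).shape[1]

print(expected_columns, actual_columns)  # 6 6
```

"Alley" contributes three columns (Grvl, Pave, None) and "RoofStyle" contributes three (Gable, Hip, Flat), so both counts come to six.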

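Assessing the Basics: Evaluating a Decision Tree Regressor

In the second part of this post, we build on the foundation above and evaluate the performance of a basic decision tree model.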

# Build on the previous block of code
# Import additional libraries for modeling and evaluation
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Define the full model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', DecisionTreeRegressor(random_state=42))
])

# Evaluate the model using cross-validation
scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Output the result
print("Decision Tree Regressor Mean CV R²:", round(scores.mean(),4))

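By applying cross-validation, we obtain a baseline to compare against the more sophisticated models later in this post.

Decision Tree Regressor Mean CV R²: 0.7663

An R² score of 0.7663 indicates that our model explains roughly 77% of the variance in house prices, a decent but not outstanding starting point. This baseline will help us appreciate the incremental gains offered by the more sophisticated ensemble methods we explore next.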
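For readers who want to see what the score measures, here is a minimal sketch that computes R² by hand on four made-up prices. The numbers and the helper name `r2_score_manual` are assumptions for illustration. When no `scoring` or `cv` argument is passed, as in the code above, `cross_val_score` uses the regressor's own `score` method (R²) and 5-fold cross-validation, then the post averages the five fold scores.

```python
def r2_score_manual(y_true, y_pred):
    """Return R² = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices and predictions, in thousands of dollars
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

r2 = r2_score_manual(y_true, y_pred)
print(round(r2, 4))  # 0.968
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean price, and negative values mean worse than that.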

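Improving Predictions: Introduction to Bagging with Decision Trees

Building on our initial model, this section explores how bagging can improve predictive performance. Bagging, or Bootstrap Aggregating, is an ensemble technique that improves stability and accuracy by reducing variance and helping to prevent overfitting. Rather than cloning the same decision tree many times, bagging builds multiple trees, each trained on a different bootstrap sample of the dataset. These samples are drawn with replacement, so each tree learns from a slightly different slice of the data, which gives the ensemble a diversity of perspectives. The predictions of the individual trees are then averaged to produce the final prediction. We will compare a single decision tree with a Bagging Regressor that uses multiple trees to demonstrate the power of ensemble learning.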
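The two ingredients of bagging, bootstrap sampling and aggregation, can be sketched with the standard library alone. This is an illustration under stated assumptions, not how `BaggingRegressor` is implemented internally: the row count, the seed, and the three "tree predictions" are made up.

```python
import random

rng = random.Random(42)   # fixed seed so the sketch is repeatable
n_rows = 1000             # hypothetical training-set size
row_ids = list(range(n_rows))

def bootstrap_sample(ids, rng):
    """Draw len(ids) row indices with replacement."""
    return rng.choices(ids, k=len(ids))

# One bootstrap sample per tree in a hypothetical 10-tree ensemble
samples = [bootstrap_sample(row_ids, rng) for _ in range(10)]
unique_fractions = [len(set(s)) / n_rows for s in samples]

# Each sample has as many rows as the original, but only part of them are distinct
print([round(f, 3) for f in unique_fractions])

# Aggregation step for regression: average the per-tree predictions
tree_predictions = [201.0, 195.0, 210.0]  # hypothetical outputs of three trees
bagged_prediction = sum(tree_predictions) / len(tree_predictions)
print(bagged_prediction)  # 202.0
```

Because rows are drawn with replacement, each sample is expected to contain roughly 63% of the distinct rows (1 - 1/e), with the rest being repeats. Every tree therefore sees a somewhat different dataset, and averaging their predictions smooths out the errors of any single tree.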

# Import Bagging Regressor and build on the previous blocks of code
# Compare how performance is affected by Bagging (i.e. by increasing the number of trees)
from sklearn.ensemble import BaggingRegressor

models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
    # scikit-learn 1.2+ names this parameter `estimator`; older releases call it `base_estimator`
    'Bagging Regressor (10 Trees)': BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42),
                                                     n_estimators=10, random_state=42)
}

results = {}
for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:", results)

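By leveraging multiple decision trees, bagging raises the mean cross-validated R² by about 11 percentage points over the single decision tree (from 0.7663 to 0.8781), which shows how an ensemble can improve model performance.

Cross-validation scores: {'Decision Tree (1 Tree)': 0.7663, 'Bagging Regressor (10 Trees)': 0.8781}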
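The size of that improvement can be expressed in more than one way, and the short calculation below spells out three of them using the two scores reported above. The variable names are only for illustration.

```python
tree_r2 = 0.7663     # single decision tree, from the output above
bagging_r2 = 0.8781  # bagging with 10 trees, from the output above

absolute_gain = bagging_r2 - tree_r2             # gain in R² points
relative_gain = absolute_gain / tree_r2          # gain relative to the baseline R²
unexplained_cut = absolute_gain / (1 - tree_r2)  # share of unexplained variance removed

print(f"Absolute gain: {absolute_gain:.4f}")                   # 0.1118
print(f"Relative gain: {relative_gain:.1%}")                   # 14.6%
print(f"Unexplained variance removed: {unexplained_cut:.1%}")  # 47.8%
```

The gain of about 11 corresponds to R² percentage points. Relative to the baseline score it is closer to 15%, and viewed as a reduction in the variance the single tree left unexplained it is nearly half.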

To investigate this further, we will examine how performance changes with the number of trees in the ensemble.

# Build on previous blocks of code
# Compare how performance is affected by Bagging in increments of 10 trees

# Number of trees to test
n_trees = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Define the model pipelines with various regressors
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42)
}

# Adding Bagging models for each tree count
for n in n_trees:
    # scikit-learn 1.2+ names this parameter `estimator`; older releases call it `base_estimator`
    models[f'Bagging Regressor {n} Trees'] = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=n,
        random_state=42
    )

results = {}
for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:")
for name, score in results.items():
    print(f"{name}: {score}")

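As we increase the number of trees in the Bagging Regressor, we see a marked initial improvement in model performance. However, the marginal gains level off after a certain point. For example, while R² rises clearly when going from 1 tree to 20 trees, the incremental improvements beyond 20 trees are much smaller.

Cross-validation scores:
Decision Tree (1 Tree): 0.7663
Bagging Regressor 10 Trees: 0.8781
Bagging Regressor 20 Trees: 0.8898
Bagging Regressor 30 Trees: 0.8911
Bagging Regressor 40 Trees: 0.8922
Bagging Regressor 50 Trees: 0.8931
Bagging Regressor 60 Trees: 0.8933
Bagging Regressor 70 Trees: 0.8936
Bagging Regressor 80 Trees: 0.895
Bagging Regressor 90 Trees: 0.8954
Bagging Regressor 100 Trees: 0.8957

This trend illustrates the law of diminishing returns in model complexity and highlights an important consideration in machine learning: beyond a certain level of complexity, the additional computational cost may not justify the marginal gain in performance.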
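The diminishing returns can be read directly off the reported scores. The sketch below copies the numbers from the output above and prints the gain contributed by each step up in ensemble size; nothing is re-trained here.

```python
# Mean cross-validated R² by number of trees, copied from the output above
scores = {1: 0.7663, 10: 0.8781, 20: 0.8898, 30: 0.8911, 40: 0.8922, 50: 0.8931,
          60: 0.8933, 70: 0.8936, 80: 0.895, 90: 0.8954, 100: 0.8957}

tree_counts = sorted(scores)
gains = {}
for prev, curr in zip(tree_counts, tree_counts[1:]):
    # Gain in R² from moving to the next larger ensemble
    gains[(prev, curr)] = round(scores[curr] - scores[prev], 4)

for (prev, curr), gain in gains.items():
    print(f"{prev:>3} -> {curr:>3} trees: +{gain:.4f}")

gain_1_to_20 = round(scores[20] - scores[1], 4)
gain_20_to_100 = round(scores[100] - scores[20], 4)
print(gain_1_to_20, gain_20_to_100)  # 0.1235 0.0059
```

The first 20 trees add about 0.12 to R², while the next 80 trees together add less than 0.006, even though they cost four times as much to train.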

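Advanced Ensembles: Comparing Bagging and Random Forest Regressors

In the final part of this post on tree-based modeling techniques, we turn to a comparative analysis of two popular ensemble methods: Bagging Regressors and Random Forests. Both build on the ensemble learning concept explored in the previous sections, but they differ in how the individual trees are constructed.

Random Forest is an extension of bagging that builds many decision trees during training. In plain bagging, each tree is grown on a bootstrap sample of the data and may use any feature at any split. Random Forest adds another layer of randomness by considering only a random subset of features when splitting each node. This extra randomness helps produce more diverse trees, which often yields a model that generalizes better.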
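How much feature randomness a Random Forest actually uses is controlled by its `max_features` parameter. As a hedged sketch on synthetic data (the `make_regression` data, the tree count, and the candidate names are assumptions, not part of the original experiment), the code below compares plain bagging with two Random Forest settings. In current scikit-learn releases, `RandomForestRegressor` defaults to `max_features=1.0`, which the library documentation describes as equivalent to bagged trees, so a forest built with default settings may behave almost like bagging.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; this is not the Ames dataset)
X, y = make_regression(n_samples=300, n_features=12, n_informative=4,
                       noise=10.0, random_state=42)

candidates = {
    # Bootstrap samples only; the default base model is a decision tree
    "Bagging": BaggingRegressor(n_estimators=30, random_state=42),
    # max_features=1.0: every split sees all features, so only the bootstrap adds randomness
    "Random Forest (all features)": RandomForestRegressor(
        n_estimators=30, max_features=1.0, random_state=42),
    # max_features="sqrt": each split considers a random subset of the features
    "Random Forest (sqrt features)": RandomForestRegressor(
        n_estimators=30, max_features="sqrt", random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # Mean R² over 5 folds, matching the scoring used elsewhere in the post
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {cv_scores[name]:.4f}")
```

Which setting scores best depends on the data, so no expected numbers are shown here. The point of the sketch is that feature subsampling only comes into play when `max_features` is set below the total number of features.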

Let's use the Ames Housing dataset to evaluate and compare the two methods, focusing on how increasing the number of trees affects the cross-validated R² score.

# Build on previous blocks of code
# Evaluate performance of Random Forest against Bagging Regressor
from sklearn.ensemble import RandomForestRegressor

# Number of trees to test
n_trees = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Define the model pipelines with various regressors
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
}

# Adding Bagging and Random Forest models for each tree count
for n in n_trees:
    # scikit-learn 1.2+ names this parameter `estimator`; older releases call it `base_estimator`
    models[f'Bagging Regressor {n} Trees'] = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=n,
        random_state=42
    )
    models[f'Random Forest {n} Trees'] = RandomForestRegressor(
        n_estimators=n,
        random_state=42
    )

results = {}
for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:")
for name, score in results.items():
    print(f"{name}: {score}")

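Examining the cross-validation scores reveals an interesting pattern. Both the Bagging and Random Forest models improve substantially on the single decision tree, which underlines the strength of ensemble methods.

Cross-validation scores:
Decision Tree (1 Tree): 0.7663
Bagging Regressor 10 Trees: 0.8781
Random Forest 10 Trees: 0.8762
Bagging Regressor 20 Trees: 0.8898
Random Forest 20 Trees: 0.8893
Bagging Regressor 30 Trees: 0.8911
Random Forest 30 Trees: 0.8897
Bagging Regressor 40 Trees: 0.8922
Random Forest 40 Trees: 0.8909
Bagging Regressor 50 Trees: 0.8931
Random Forest 50 Trees: 0.8922
Bagging Regressor 60 Trees: 0.8933
Random Forest 60 Trees: 0.8931
Bagging Regressor 70 Trees: 0.8936
Random Forest 70 Trees: 0.8932
Bagging Regressor 80 Trees: 0.895
Random Forest 80 Trees: 0.8943
Bagging Regressor 90 Trees: 0.8954
Random Forest 90 Trees: 0.8948
Bagging Regressor 100 Trees: 0.8957
Random Forest 100 Trees: 0.8954

Interestingly, the two methods perform at very similar levels as the number of trees increases, and neither is meaningfully better than the other. Bagging is marginally ahead at every tree count in these runs, but never by more than 0.002 in R². This similarity may stem from characteristics of the Ames Housing dataset that naturally limit the benefit of the extra randomization Random Forest introduces. If a dataset has a few highly predictive features, random feature selection may not noticeably improve generalization compared with bagging, which considers all features at every split.

These results suggest that although Random Forest often outperforms bagging by using feature randomization to reduce the correlation between trees, the dynamics of a particular dataset and problem can cancel out that advantage. When computational efficiency is a concern, bagging may therefore be preferred for its simplicity and comparable performance. The comparison underscores the importance of understanding your dataset and modeling goals when choosing an ensemble strategy.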

Further Reading

APIs

sklearn.ensemble.BaggingRegressor API
sklearn.ensemble.RandomForestRegressor API

Tutorials

Ames Housing Dataset and Data Dictionary

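Summary

This blog post provided a detailed exploration of tree-based modeling techniques using the Ames Housing dataset. It began with essential preprocessing steps such as encoding and handling missing values, then moved on to evaluating a decision tree model and enhancing it with bagging. It concluded with a comparative analysis of Bagging and Random Forest regressors, highlighting the incremental gains and relative performance as the number of trees varies. Each section built on the previous one, offering practical examples and insights that together give a thorough picture of tree-based predictive modeling.

Specifically, you learned:

- Preprocessing is crucial for tree-based models and includes techniques such as converting data types to categorical, handling missing values, and applying appropriate encodings.
- Evaluating a basic decision tree model with cross-validation provides a solid baseline for assessing more complex tree-based models.
- Bagging and Random Forest enhance the performance of decision trees, substantially improving predictive accuracy through ensembling.

Do you have any questions? Please ask them in the comments below, and I will do my best to answer.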

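About Vinod Chugani

Born and raised in India, I bring a global perspective to my work. My academic journey at Duke University included majoring in Economics, with the honor of being inducted into Phi Beta Kappa in my junior year. Over the years, I have gained diverse professional experience, spending a decade in Wall Street's Fixed Income sector and then leading a global distribution venture on Main Street. Currently, I channel my passion for data science, machine learning, and AI as a mentor at the New York City Data Science Academy. I value the opportunity to spark curiosity and share knowledge, whether through online learning sessions or in-depth one-on-one interactions. With a foundation in finance and entrepreneurship and my current immersion in the data realm, I approach the future with a sense of purpose and confidence. I look forward to further exploration, continued learning, and the opportunity to contribute meaningfully to the ever-evolving fields of data science and machine learning, especially here at MLM.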
