Bagging and Random Forest for Imbalanced Classification

By Jason Brownlee on January 5, 2021 in Imbalanced Classification

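Bagging is an ensemble algorithm that fits multiple models on different subsets of a training dataset, then combines the predictions from all models.

Random forest is an extension of bagging that also randomly selects the subset of features considered at each split point. Both bagging and random forest have proven effective on a wide range of predictive modeling problems.

Although effective, they are not suited to classification problems with a skewed class distribution. Nevertheless, many modifications to the algorithms have been proposed that adapt their behavior and make them better suited to a severe class imbalance.

In this tutorial, you will discover how to use bagging and random forest for imbalanced classification.

After completing this tutorial, you will know:

How to use bagging with random undersampling for imbalanced classification.
How to use random forest with class weighting and random undersampling for imbalanced classification.
How to use the Easy Ensemble, which combines bagging and boosting, for imbalanced classification.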

Let's get started.

Update Jan/2021: Updated links for API documentation.

Bagging and Random Forest for Imbalanced Classification
Photo by Don Graham, some rights reserved.

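Tutorial Overview

This tutorial is divided into three parts; they are:

Bagging for Imbalanced Classification
  1. Standard Bagging
  2. Bagging With Random Undersampling
Random Forest for Imbalanced Classification
  1. Standard Random Forest
  2. Random Forest With Class Weighting
  3. Random Forest With Bootstrap Class Weighting
  4. Random Forest With Random Undersampling
Easy Ensemble for Imbalanced Classification
  1. Easy Ensemble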

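Bagging for Imbalanced Classification

Bootstrap Aggregation, or Bagging for short, is an ensemble machine learning algorithm.

It involves first selecting random samples of a training dataset with replacement, meaning that a given sample may contain zero, one, or more than one copy of each example in the training dataset. This is called a bootstrap sample. One weak learner model is then fit on each data sample. Typically, decision tree models that do not use pruning (e.g. may overfit their training set slightly) are used as weak learners. Finally, the predictions from all of the fit weak learners are combined to make a single prediction (e.g. aggregated).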
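
The bootstrap step described above can be sketched in a few lines of NumPy. This is only an illustration: the dataset size of 10 and the seed are arbitrary choices, not values from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examples = 10  # arbitrary toy size, chosen for readability

# draw row indices with replacement: this is one bootstrap sample
bootstrap_idx = rng.integers(0, n_examples, size=n_examples)

# count how many copies of each original example ended up in the sample
counts = np.bincount(bootstrap_idx, minlength=n_examples)

print("bootstrap indices:", bootstrap_idx)
print("copies per example:", counts)
print("examples left out:", np.flatnonzero(counts == 0))
```

Some examples appear more than once and others not at all, which is what lets each weak learner see a slightly different view of the data.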

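Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the bagged model's prediction.

- Page 192, Applied Predictive Modeling, 2013.

The process of creating new bootstrap samples and fitting and adding trees to the ensemble can continue until no further improvement is seen in the ensemble's performance on a validation dataset.

This simple procedure often results in better performance than a single well-configured decision tree algorithm.

Bagging as-is will create bootstrap samples that do not take the skewed class distribution of an imbalanced classification dataset into account. As such, although the technique performs well in general, it may not perform well if a severe class imbalance is present.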

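Standard Bagging

Before we dive into exploring extensions to bagging, let's evaluate a standard bagged decision tree ensemble and use it as a point of comparison.

We can use the BaggingClassifier class from scikit-learn, which bags decision trees by default, to create a model that roughly matches the description above.

First, let's define a synthetic imbalanced binary classification problem with 10,000 examples, 99 percent of which are in the majority class and 1 percent in the minority class.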

...
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
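
Before modeling, it can help to confirm that the generated dataset really has the intended skew. The check below is a small addition that is not part of the original listing; it assumes only the standard library Counter class and the same make_classification call.

```python
from collections import Counter
from sklearn.datasets import make_classification

# same dataset definition as used throughout the tutorial
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)

# summarize the shape of the data and the number of examples per class
counter = Counter(y)
print(X.shape, y.shape)
print(counter)
```

The minority class should account for roughly 1 percent of the rows.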

We can then define the standard bagged decision tree ensemble model, ready for evaluation.

...
# define model
model = BaggingClassifier()

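We can then evaluate this model using repeated stratified k-fold cross-validation, with three repeats and 10 folds.

We will use the mean ROC AUC score across all folds and repeats to evaluate the performance of the model.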

...
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
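
To see what the stratified procedure does with such a skewed dataset, the sketch below counts the minority examples in each test fold. The label vector is a stand-in built by hand (an assumption of 9,900 majority and 100 minority rows) so that the snippet does not depend on the generated dataset.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# stand-in labels with a 99:1 ratio (assumed: 9,900 majority, 100 minority)
y = np.array([0] * 9900 + [1] * 100)
X = np.zeros((len(y), 1))  # the splitter only needs the number of rows

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# number of minority examples that land in each test fold
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]

print("number of train/test splits:", len(minority_per_fold))
print("minority examples per test fold:", sorted(set(minority_per_fold)))
```

Stratification keeps the class ratio the same in every fold, so each ROC AUC score is computed on a test set that contains minority examples.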

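Tying this together, the complete example of evaluating a standard bagged ensemble on the imbalanced classification dataset is listed below.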

# bagged decision trees on an imbalanced classification problem
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = BaggingClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

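Running the example evaluates the model and reports the mean ROC AUC score.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieves a score of about 0.87.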

Mean ROC AUC: 0.871

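Bagging With Random Undersampling

There are many ways to adapt bagging for use with imbalanced classification.

Perhaps the most straightforward approach is to apply data resampling on the bootstrap sample prior to fitting the weak learner model. This might involve oversampling the minority class or undersampling the majority class.

An easy way to overcome the class imbalance problem in the resampling stage of bagging is to take the classes of the instances into account when they are randomly drawn from the original dataset.

- Page 175, Learning from Imbalanced Data Sets, 2018.

Oversampling the minority class in the bootstrap is referred to as OverBagging; likewise, undersampling the majority class in the bootstrap is referred to as UnderBagging, and combining both approaches is referred to as OverUnderBagging.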
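
The UnderBagging idea can be sketched from scratch to make the steps concrete. This is a simplified illustration and not the imbalanced-learn implementation: the function names fit_underbagging and predict_minority_proba, the toy Gaussian dataset, and the convention that the minority class is labeled 1 are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_underbagging(X, y, n_estimators=10, seed=1):
    """Fit trees on bootstrap samples whose majority class (0) is undersampled."""
    rng = np.random.default_rng(seed)
    trees = []
    for i in range(n_estimators):
        # 1. standard bootstrap: draw row indices with replacement
        idx = rng.integers(0, len(y), size=len(y))
        minority = idx[y[idx] == 1]
        majority = idx[y[idx] == 0]
        if len(minority) == 0 or len(majority) < len(minority):
            continue  # skip degenerate samples in this simplified sketch
        # 2. random undersampling: keep as many majority rows as minority rows
        majority = rng.choice(majority, size=len(minority), replace=False)
        rows = np.concatenate([minority, majority])
        # 3. fit an unpruned decision tree on the balanced sample
        trees.append(DecisionTreeClassifier(random_state=i).fit(X[rows], y[rows]))
    return trees

def predict_minority_proba(trees, X):
    """Average the predicted minority-class probability over all trees."""
    return np.mean([tree.predict_proba(X)[:, 1] for tree in trees], axis=0)

# hypothetical toy data: 990 majority points around 0, 10 minority points around 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (990, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 990 + [1] * 10)

trees = fit_underbagging(X, y)
probs = predict_minority_proba(trees, X)
print(len(trees), probs.shape)
```

Each tree sees a balanced sample, while the ensemble as a whole still draws on many different majority examples.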

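The imbalanced-learn library provides an implementation of UnderBagging.

Specifically, it provides a version of bagging that uses a random undersampling strategy on the majority class within a bootstrap sample in order to balance the two classes. This is provided in the BalancedBaggingClassifier class.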

...
# define model
model = BalancedBaggingClassifier()

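Next, we can evaluate a modified version of the bagged decision tree ensemble that performs random undersampling of the majority class prior to fitting each decision tree.

We would expect the use of random undersampling to improve the performance of the ensemble.

The default number of trees (n_estimators) for this model and the previous one is 10. In practice, it is a good idea to test larger values for this hyperparameter, such as 100 or 1,000.

The complete example is listed below.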

# bagged decision trees with random undersampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.ensemble import BalancedBaggingClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = BalancedBaggingClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

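Running the example evaluates the model and reports the mean ROC AUC score.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see a lift in mean ROC AUC from about 0.87 without any data resampling to about 0.96 with random undersampling of the majority class.

This is not a true apples-to-apples comparison, as the two results come from implementations of the algorithm in two different libraries, but it makes the general point that balancing the bootstrap sample prior to fitting a weak learner offers some benefit when the class distribution is skewed.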

Mean ROC AUC: 0.962

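Although the BalancedBaggingClassifier class uses a decision tree by default, you can test different models, such as k-nearest neighbors and more. You can set the base_estimator argument when defining the class to use a different weak learner classifier model. Note that recent releases of scikit-learn and imbalanced-learn rename this argument to estimator.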
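
As a hedged sketch of swapping the weak learner, the example below bags k-nearest neighbors models using scikit-learn's own BaggingClassifier, so it runs without imbalanced-learn. The weak learner is passed as the first positional argument, which avoids depending on whether the installed version calls the parameter base_estimator or estimator. To my knowledge BalancedBaggingClassifier accepts the weak learner in the same position, but treat that as an assumption and check your installed version. The smaller dataset (2,000 rows, 95:5 split) and 5-fold evaluation are choices made here only to keep the run short.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# smaller, less skewed dataset than the tutorial's, chosen to keep the run short
X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.95], flip_y=0, random_state=4)

# weak learner passed positionally to sidestep the parameter rename
model = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
    n_estimators=10, random_state=1)

# an integer cv uses stratified folds for classifiers
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('Mean ROC AUC: %.3f' % mean(scores))
```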

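Random Forest for Imbalanced Classification

Random forest is another ensemble of decision tree models and may be considered an improvement upon bagging.

Like bagging, random forest involves selecting bootstrap samples from the training dataset and fitting a decision tree on each. The main difference is that not all features (variables or columns) are used; instead, a small, randomly selected subset of features (columns) is considered at each split point while a tree is built. This has the effect of de-correlating the decision trees (making them more independent), which in turn improves the ensemble prediction.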
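
The feature subsampling step can be illustrated on its own. In this sketch the number of features (16) is arbitrary and the square root rule is a commonly used default for classification, stated here as an assumption rather than as the setting of any particular library version.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features = 16  # arbitrary number of input columns for illustration
max_features = int(np.sqrt(n_features))  # common rule of thumb for classification

# each split point considers its own random subset of the columns
for split in range(3):
    candidates = rng.choice(n_features, size=max_features, replace=False)
    print("split %d considers features %s" % (split, sorted(candidates.tolist())))
```

Because different splits and different trees look at different columns, no single strong feature can dominate every tree.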

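Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the forest's prediction. Since the algorithm randomly selects predictors at each split, tree correlation will necessarily be lessened.

- Page 199, Applied Predictive Modeling, 2013.

Again, random forest is very effective on a wide range of problems, but like bagging, the performance of the standard algorithm is not great on imbalanced classification problems.

In learning extremely imbalanced data, there is a significant probability that a bootstrap sample contains few or even none of the minority class, resulting in a tree with poor performance for predicting the minority class.

- Using Random Forest to Learn Imbalanced Data, 2004.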
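
The chance of a bootstrap sample missing the minority class entirely follows from simple arithmetic: each of the n draws misses the m minority examples with probability 1 - m/n, and the draws are independent. The helper name prob_no_minority and the example counts below are illustrative assumptions.

```python
def prob_no_minority(n_examples, n_minority):
    """Chance that a bootstrap sample of size n_examples misses every minority example."""
    return (1.0 - n_minority / n_examples) ** n_examples

# 10,000 rows with 1, 5, or 100 minority examples
for n_minority in (1, 5, 100):
    print(n_minority, prob_no_minority(10000, n_minority))
```

With a single minority example, more than a third of bootstrap samples contain no minority data at all. With 100 minority examples, as in this tutorial's dataset, an empty sample is vanishingly unlikely, although each tree still sees only about 1 percent minority rows.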

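Standard Random Forest

Before we dive into extensions of the random forest ensemble algorithm that make it better suited to imbalanced classification, let's fit and evaluate a random forest algorithm on our synthetic dataset.

We can use the RandomForestClassifier class from scikit-learn with a small number of trees, in this case, 10.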

...
# define model
model = RandomForestClassifier(n_estimators=10)

The complete example of fitting a standard random forest ensemble on the imbalanced dataset is listed below.

# random forest for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = RandomForestClassifier(n_estimators=10)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

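Running the example evaluates the model and reports the mean ROC AUC score.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved a mean ROC AUC of about 0.87.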

Mean ROC AUC: 0.869

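Random Forest With Class Weighting

A simple technique for modifying a decision tree for imbalanced classification is to change the weight that each class has when calculating the "impurity" score of a chosen split point.

Impurity measures how mixed the groups of samples are for a given split in the training dataset and is typically measured with Gini or entropy. The calculation can be biased so that a mixture in favor of the minority class is preferred, allowing some false positives for the majority class.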
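
A small worked example shows how class weights change an impurity score. The sketch assumes that weights act as multipliers on the class counts before the class proportions are computed; the function name weighted_gini and the node counts are hypothetical.

```python
def weighted_gini(counts, weights):
    """Gini impurity of a node after scaling each class count by its weight."""
    weighted = [c * w for c, w in zip(counts, weights)]
    total = sum(weighted)
    return 1.0 - sum((v / total) ** 2 for v in weighted)

node_counts = [99, 1]  # hypothetical node: 99 majority, 1 minority example

unweighted = weighted_gini(node_counts, [1.0, 1.0])
reweighted = weighted_gini(node_counts, [1.0, 99.0])  # minority weighted up 99x
print(unweighted, reweighted)
```

Without weights the node looks almost pure, so the tree has little incentive to split off the lone minority example. With the minority class weighted up, the same node looks maximally mixed and a split that isolates the minority example becomes attractive.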

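This modification of random forest is referred to as Weighted Random Forest.

Another approach to make random forest more suitable for learning from extremely imbalanced data follows the idea of cost-sensitive learning. Since the RF classifier tends to be biased towards the majority class, we shall place a heavier penalty on misclassifying the minority class.

- Using Random Forest to Learn Imbalanced Data, 2004.

This can be achieved by setting the class_weight argument on the RandomForestClassifier class.

This argument takes a dictionary with a mapping of each class value (e.g. 0 and 1) to its weighting. The argument value 'balanced' can be provided to automatically use weights that are inversely proportional to the class frequencies in the training dataset, giving focus to the minority class.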
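
The 'balanced' heuristic is documented in scikit-learn as n_samples / (n_classes * bincount(y)). The sketch below reproduces that arithmetic with NumPy on a stand-in label vector (assumed 9,900 majority and 100 minority rows) and builds the equivalent dictionary that could be passed as class_weight.

```python
import numpy as np

# stand-in labels with a 99:1 ratio (assumed: 9,900 majority, 100 minority)
y = np.array([0] * 9900 + [1] * 100)

# weight for each class: n_samples / (n_classes * class_count)
counts = np.bincount(y)
balanced = len(y) / (len(counts) * counts)

# the same weights expressed as a class_weight dictionary
class_weight = {int(label): float(w) for label, w in enumerate(balanced)}
print(class_weight)
```

The minority class receives a weight about 99 times larger than the majority class, mirroring the 99:1 class ratio.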

...
# define model
model = RandomForestClassifier(n_estimators=10, class_weight='balanced')

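We can test this modification of random forest on our test problem. Although it is not specific to random forest, we would expect some modest improvement.

The complete example is listed below.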

# class balanced random forest for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = RandomForestClassifier(n_estimators=10, class_weight='balanced')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

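Running the example evaluates the model and reports the mean ROC AUC score.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved a very small lift in mean ROC AUC, from 0.869 to 0.871.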

Mean ROC AUC: 0.871

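Random Forest With Bootstrap Class Weighting

Given that each decision tree is constructed from a bootstrap sample (e.g. random selection with replacement), the class distribution in the data sample will be different for each tree.

As such, it might be interesting to change the class weighting based on the class distribution in each bootstrap sample, instead of the entire training dataset.

This can be achieved by setting the class_weight argument to the value 'balanced_subsample'.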
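
To illustrate why per-sample weights differ from tree to tree, the sketch below draws three bootstrap samples and recomputes the balanced weights on each. It is an approximation of the idea rather than the library's internal code, and the stand-in labels (9,900 majority, 100 minority) are an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0] * 9900 + [1] * 100)  # stand-in labels with a 99:1 ratio

subsample_weights = []
for tree in range(3):
    # bootstrap sample of the labels for this tree
    idx = rng.integers(0, len(y), size=len(y))
    counts = np.bincount(y[idx], minlength=2)
    # balanced weights computed from this sample only
    weights = len(idx) / (2 * counts)
    subsample_weights.append(weights)
    print("tree %d: minority count=%d, weights=%s" % (tree, counts[1], weights.round(3)))
```

The minority count varies from one bootstrap sample to the next, so the minority weight varies with it.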

We can test this modification and compare the results to the 'balanced' case above; the complete example is listed below.

# bootstrap class balanced random forest for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = RandomForestClassifier(n_estimators=10, class_weight='balanced_subsample')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

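Running the example evaluates the model and reports the mean ROC AUC score.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved a modest lift in mean ROC AUC from 0.87 to about 0.88.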

Mean ROC AUC: 0.884

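Random Forest With Random Undersampling

Another useful modification to random forest is to perform data resampling on the bootstrap sample in order to explicitly change the class distribution.

The BalancedRandomForestClassifier class from the imbalanced-learn library implements this and performs random undersampling of the majority class in each bootstrap sample. This is generally referred to as Balanced Random Forest.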

...
# define model
model = BalancedRandomForestClassifier(n_estimators=10)

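We would expect this to have a more dramatic effect on model performance, given the broader success of data resampling techniques.

We can test this modification of random forest on our synthetic dataset and compare the results. The complete example is listed below.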

# random forest with random undersampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.ensemble import BalancedRandomForestClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = BalancedRandomForestClassifier(n_estimators=10)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

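Running the example evaluates the model and reports the mean ROC AUC score.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved a large lift in mean ROC AUC, from about 0.88 to about 0.97.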

Mean ROC AUC: 0.970

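Easy Ensemble for Imbalanced Classification

When considering bagged ensembles for imbalanced classification, a natural thought might be to use random resampling of the majority class to create multiple datasets with a balanced class distribution.

Specifically, a dataset can be created from all of the examples in the minority class and a randomly selected sample from the majority class. Then a model or weak learner can be fit on this dataset. The process can be repeated multiple times and the average prediction across the ensemble of models can be used to make predictions.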
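
The subset construction described above can be sketched with NumPy index arrays. The stand-in labels (9,900 majority, 100 minority) and the choice of 10 subsets are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0] * 9900 + [1] * 100)  # stand-in labels with a 99:1 ratio

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

subsets = []
for _ in range(10):
    # every subset keeps all minority rows plus an equally sized random majority sample
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    subsets.append(np.concatenate([minority_idx, sampled_majority]))

# how much of the majority class is used across all subsets combined
covered = np.unique(np.concatenate(subsets))
print("rows per subset:", len(subsets[0]))
print("distinct majority rows used across subsets:", int((y[covered] == 0).sum()))
```

Each subset is balanced, yet together the subsets cover far more majority examples than any single undersampled dataset would.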

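This is exactly the approach proposed by Xu-Ying Liu, et al. in their 2008 paper titled "Exploratory Undersampling for Class-Imbalance Learning."

The selective construction of the subsamples is seen as a type of undersampling of the majority class. Generating multiple subsamples allows the ensemble to overcome the downside of undersampling, in which valuable information is discarded from the training process.

... under-sampling is an efficient strategy to deal with class-imbalance. However, the drawback of under-sampling is that it throws away many potentially useful data.

- Exploratory Undersampling for Class-Imbalance Learning, 2008.

The authors propose variations of the approach, such as the Easy Ensemble and the Balance Cascade.

Let's take a closer look at the Easy Ensemble.

Easy Ensemble

The Easy Ensemble involves creating balanced samples of the training dataset by selecting all examples from the minority class and a subset from the majority class.

Rather than fitting a single decision tree on each subset, boosted decision trees are used, specifically the AdaBoost algorithm.

AdaBoost works by first fitting a decision tree on the dataset, then determining the errors made by the tree and weighting the examples in the dataset by those errors, so that more attention is paid to the misclassified examples and less to the correctly classified examples. A subsequent tree is then fit on the weighted dataset, with the aim of correcting the errors. The process is repeated for a given number of decision trees.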
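
One round of the reweighting step can be worked through by hand. The sketch uses the classic discrete AdaBoost update for binary labels; library implementations may differ in details such as learning rate and normalization. The labels and the predictions of the first tree are hypothetical.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # hypothetical tree misses both positives

# start with uniform example weights
weights = np.full(len(y_true), 1.0 / len(y_true))

# weighted error of the tree and the resulting model weight (alpha)
missed = (y_true != y_pred)
error = weights[missed].sum() / weights.sum()
alpha = np.log((1.0 - error) / error)

# boost the weights of the misclassified examples, then renormalize
weights = weights * np.exp(alpha * missed)
weights = weights / weights.sum()

print("error=%.2f alpha=%.3f" % (error, alpha))
print("updated weights:", weights)
```

After the update, the two misclassified examples together carry half of the total weight, so the next tree is pushed to classify them correctly.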

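This means that samples that are difficult to classify receive increasingly larger weights until the algorithm identifies a model that correctly classifies these samples. Therefore, each iteration of the algorithm is required to learn a different aspect of the data, focusing on regions that contain difficult-to-classify samples.

- Page 389, Applied Predictive Modeling, 2013.

The EasyEnsembleClassifier class from the imbalanced-learn library provides an implementation of the Easy Ensemble technique.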

...
# define model
model = EasyEnsembleClassifier(n_estimators=10)

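We can evaluate the technique on our synthetic imbalanced classification problem.

Given the use of a type of random undersampling, we would expect the technique to perform well in general.

The complete example is listed below.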

# easy ensemble for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.ensemble import EasyEnsembleClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = EasyEnsembleClassifier(n_estimators=10)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

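Running the example evaluates the model and reports the mean ROC AUC score.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the ensemble performs well on the dataset, achieving a mean ROC AUC of about 0.97, on par with the score achieved on this dataset by random forest with random undersampling (0.970).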

Mean ROC AUC: 0.968

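Although an AdaBoost classifier is used on each subsample, alternate classifier models can be used by setting the base_estimator argument on the model (renamed estimator in recent imbalanced-learn releases).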

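Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Using Random Forest to Learn Imbalanced Data, 2004.
Exploratory Undersampling for Class-Imbalance Learning, 2008.

Books

Applied Predictive Modeling, 2013.
Learning from Imbalanced Data Sets, 2018.

API

sklearn.ensemble.BaggingClassifier
sklearn.ensemble.RandomForestClassifier
imblearn.ensemble.BalancedBaggingClassifier
imblearn.ensemble.BalancedRandomForestClassifier
imblearn.ensemble.EasyEnsembleClassifier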

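Summary

In this tutorial, you discovered how to use bagging and random forest for imbalanced classification.

Specifically, you learned:

How to use bagging with random undersampling for imbalanced classification.
How to use random forest with class weighting and random undersampling for imbalanced classification.
How to use the Easy Ensemble, which combines bagging and boosting, for imbalanced classification.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.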

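36 Responses to Bagging and Random Forest for Imbalanced Classification

marco February 14, 2020 at 3:13 am

Hello Jason,
I found the Pandas Profiling Tool on the web.
It seems helpful for performing analysis (it also creates an HTML file with charts).
Do you think it is enough for the data analysis phase, before starting data preparation and modeling?
Thanks

- Jason Brownlee February 14, 2020 at 6:39 am
  
  I don't know about that tool, sorry.
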
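Frank February 15, 2020 at 8:31 pm

Hello Jason,
The generated sample dataset follows a normal distribution, right?
What if the dataset follows a skewed distribution, or is completely irregular?

- Jason Brownlee February 16, 2020 at 6:06 am
  
  For ensembles of decision trees, the distribution does not matter.
  
  - sst March 11, 2020 at 4:11 pm
    
    In which method is the classification model trained on a dataset whose distribution has been modified compared with the original training dataset?
    Bagging
    Boosting
    Both
    Neither
    
    - Jason Brownlee March 12, 2020 at 8:38 am
      
      Sorry, I don't understand your question. Perhaps you can rephrase or elaborate?
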
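Carlos February 16, 2020 at 5:31 am

Hello Jason,

As someone asked when you described the XGBoost case, can this approach be applied to multi-class problems?

Another question: what effect does this approach have on model calibration? It changes the distribution of the output probabilities, right?

Thanks,
Carlos.
P.S. Regarding the earlier question, this "profiling tool" is a new pandas feature that creates a more detailed HTML output. It is like a "turbo" version of pandas.describe. :-)

- Jason Brownlee February 16, 2020 at 6:16 am
  
  I believe so. Try it and see.
  
  Yes, if you want probabilities, you may need to explore calibration.
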
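Igor Franzoni April 9, 2020 at 10:19 pm

Hello, Jason!

First of all, thank you for your post!

In this problem, you decided to use repeated stratified k-fold cross-validation. Is it more convenient than plain stratified k-fold cross-validation for imbalanced problems?

Best!

- Jason Brownlee April 10, 2020 at 8:29 am
  
  Repeats can give a less biased estimate of performance in most cases.
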
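Igor Franzoni April 9, 2020 at 11:27 pm

Hi Jason,

Another question that came to mind is whether ROC AUC is appropriate for this problem. I see that it is increasing, but it would also be interesting to check the precision-recall curve, right? I am facing a problem where ROC AUC is high (about 0.9) but the area under the precision-recall curve is very low (0.005)...

Thanks!

- Jason Brownlee April 10, 2020 at 8:32 am
  
  I used it only for simplicity, because we are focused on the algorithms, not on solving a problem.
  
  This will help you choose a metric:
  https://machinelearning.org.cn/tour-of-evaluation-metrics-for-imbalanced-classification/
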
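Steven Larsson May 19, 2020 at 7:54 am

Is there a guide on how to code the Easy Ensemble from scratch?
I want to combine a sampling algorithm with XGB and then wrap that in an ensemble to get an "advanced" Easy Ensemble, but I don't know how to go about it.

- Jason Brownlee May 19, 2020 at 1:23 pm
  
  Sorry, I don't have one.
  
  I expect it would not be too hard to implement. Let me know how you go!
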
suyash July 23, 2020 at 6:59 pm

from imblearn.ensemble import BalanceCascade

Error
cannot import name 'BalanceCascade' from 'imblearn.ensemble'

I am using Python 3.8.3. I have installed imblearn as well, but I cannot import Balance Cascade.
Please help me resolve this.

- Jason Brownlee July 24, 2020 at 6:26 am
  
  See here:
  https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.ensemble.BalanceCascade.html
  
  Perhaps ensure your version of imbalanced-learn is up to date.

suyash July 24, 2020 at 9:47 pm

What is the difference between Balance Cascade and the Balanced Bagging Classifier?

- Jason Brownlee July 25, 2020 at 6:18 am
  
  I don't recall; I recommend checking the literature.

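elham November 9, 2020 at 1:09 am

I don't understand at which stage BalancedBaggingClassifier performs the undersampling. Are the positive and negative classes balanced when each random sample is drawn from the main dataset, or is the main dataset balanced first and the random samples drawn afterwards?

- Jason Brownlee November 9, 2020 at 6:13 am
  
  Good question.
  
  Bagging draws samples from the dataset many times in order to create each tree in the ensemble.
  
  Balanced bagging ensures that each sample drawn and used to train a tree is balanced.
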
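Carlos G November 22, 2020 at 12:25 am

Hi Jason,

Thanks for the great post! There is a significant performance difference between random forest with undersampling and random forest with balanced class weighting. What is the reason for this? I would have expected class weighting to serve a similar purpose to undersampling, but without losing information by leaving data points out of the training set.

Do you have any resources you can suggest so I can learn more about the difference between the two approaches?

- Jason Brownlee November 22, 2020 at 6:55 am
  
  You're welcome.
  
  We cannot explain why one model performs better than another on a given dataset. If we could, we would be able to choose algorithms for a dataset, but we can't.
  
  The difference between the methods is only understood at the implementation level, as described in the tutorial above.
  
  - Carlos G November 25, 2020 at 11:42 pm
    
    Thanks Jason. Can you elaborate on your point? I understand that predicting actual performance on a specific problem is nearly impossible, but can't we choose the algorithm most likely to suit a particular dataset based on our understanding of how the algorithms work?
    
    - Jason Brownlee November 26, 2020 at 6:35 am
      
      Knowing how an algorithm works (theory/implementation) does not help you configure it or choose when to use it. If it did, academics would win every Kaggle competition. They don't. They rarely do well because they are stuck on their favorite methods.
      
      Instead, the best process available is trial and error (experimentation): discover what works best for a dataset. This is the field of applied machine learning.
      
      Perhaps this will help:
      https://machinelearning.org.cn/faq/single-faq/what-algorithm-config-should-i-use
      
- Tim Martin October 3, 2021 at 5:22 am
  
  Although we don't know why underbagging works better than weighting in this specific case, there is a theoretical explanation for why the approach works. The "Bayesian argument of Wallace et al." section here (https://www.svds.com/tbt-learning-imbalanced-classes/) explains it very simply.
  
  Hope this helps.
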
nabila March 23, 2021 at 3:28 pm

Hello Jason,
Thanks for the great post! I would like to ask: how can SMOTE be used with bagging SVM and boosting SVM in Python?

- Jason Brownlee March 24, 2021 at 5:49 am
  
  You can develop a pipeline that contains the SMOTE data transform and any model you like.
  
  See this tutorial to get started:
  https://machinelearning.org.cn/smote-oversampling-for-imbalanced-classification/

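Nisha November 14, 2021 at 2:30 pm

When I try to use BalancedRandomForestClassifier from imblearn, I get the error "AttributeError: can't set attribute". This post is from January 2021, so I am wondering whether anyone else has the same problem and, if so, how to fix it. Any suggestions?

- Adrian Tam November 14, 2021 at 3:07 pm
  
  I checked, but did not find any error.
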
Nisha November 15, 2021 at 1:48 am

Could you post the version of the imblearn package used for this code?

- Adrian Tam November 15, 2021 at 2:56 am
  
  imbalanced-learn 0.8.1

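Eva December 10, 2021 at 11:41 pm

Hi Jason,

I don't understand the difference between "resampling" and "sampling with replacement". Do they mean the same thing? I am using MATLAB and the function "fitcensemble" to create my RF model. It has the options "Replace" and "Resample", which can each be set to "on" or "off", so that implies they are different things, but I don't understand the distinction.

Thanks in advance!

- Adrian Tam December 15, 2021 at 5:44 am
  
  No. Given a deck of 52 cards, drawing 5 cards from it is sampling without replacement. Drawing one card, putting it back, and repeating 5 times is sampling with replacement.
  
- James Carmichael December 21, 2021 at 11:45 pm
  
  Hi Eva... Thank you for your question! The following is a great resource for understanding these terms and their application.
  
  https://web.ma.utexas.edu/users/parker/sampling/repl.htm
  
  Let me know if you need more information.
  
  Regards,
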
Mike April 18, 2024 at 5:39 am

Jason, this guy has stolen your content. Search Google for the article "Random Forest for Learning Imbalanced Data" by Manish Prasad. He copied it word for word.

- James Carmichael April 18, 2024 at 8:48 am
  
  Thank you, Mike!
