Beyond GridSearchCV: Advanced Hyperparameter Tuning Strategies for Scikit-learn Models

By Iván Palomares Carrascosa on June 21, 2025 in Practical Machine Learning

Image by Author | Ideogram

Introduction

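Ever felt like you were looking for a needle in a haystack? That is part of building and optimizing machine learning models, particularly complex ones such as ensembles and neural networks, where several hyperparameters must be set manually before training. Hyperparameters such as the learning rate, the number of estimators in an ensemble, or the maximum depth of a decision tree can yield models with very different performance depending on how their values are set, and finding the best configuration for each of them is not easy.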

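Fortunately, Scikit-learn provides several classes that implement hyperparameter tuning strategies combining a search algorithm with cross-validation. A previous article introduced basic strategies such as GridSearchCV. Here we explore three additional strategies and how to use them with Scikit-learn models: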

Randomized search (RandomizedSearchCV)
Bayesian search (BayesSearchCV)
Successive halving (HalvingGridSearchCV and HalvingRandomSearchCV)

Randomized Search

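Grid search exhaustively evaluates every combination in a grid of candidate values that we define for several hyperparameters, and returns the best combination in that grid. The RandomizedSearchCV class instead draws a fixed number of hyperparameter settings at random, either from lists of values or from probability distributions that we specify. This is usually more efficient when there are many hyperparameters to tune and their ranges are wide.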
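
To make the idea of sampling from distributions concrete, here is a minimal sketch that uses Scikit-learn's ParameterSampler, the utility that generates random settings of this kind, without training any model. The particular distributions and ranges below are illustrative assumptions, not values taken from this article.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative search space (assumed values, chosen only for this sketch)
param_distributions = {
    "n_estimators": randint(50, 301),   # integers from 50 to 300 (upper bound is exclusive)
    "max_features": uniform(0.1, 0.9),  # floats in [0.1, 1.0]: uniform(loc, scale)
    "bootstrap": [True, False],         # lists are sampled uniformly
}

# Draw five random settings; random_state makes the draw reproducible
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for setting in samples:
    print(setting)
```

Each printed dictionary is one candidate configuration that a randomized search would then score with cross-validation.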

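To see this in practice, let's load Scikit-learn's digits dataset (a small set of 8x8 handwritten-digit images, not the full MNIST dataset) for image classification, and import the Python modules and classes needed to train a random forest classifier and tune its hyperparameters.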

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score

Load the digits data and split it into training and test sets:

digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

We initialize the random forest classifier (not yet trained) and define the hyperparameter space to sample from.

rf = RandomForestClassifier(random_state=42)

param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
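
To get a feel for how much work randomized search saves on this space, the short sketch below counts the candidates an exhaustive grid search would have to evaluate. It repeats the search space so that it runs on its own; the fold count and number of random draws are the values used in this article's search.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as in the article, repeated so the sketch is self-contained
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

n_grid_candidates = len(ParameterGrid(param_dist))  # every combination in the grid
cv_folds = 5        # folds used for cross-validation
n_random_draws = 20  # settings drawn by the randomized search

grid_fits = n_grid_candidates * cv_folds
random_fits = n_random_draws * cv_folds

print("Grid candidates:", n_grid_candidates)
print("Model fits for exhaustive grid search:", grid_fits)
print("Model fits for randomized search:", random_fits)
```

The trade-off is that randomized search only sees a sample of the space, so it may miss the single best combination in the grid.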

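Now we define the object that drives the tuning process, passing in the random forest instance, the hyperparameter space just defined, the number of random settings to try (n_iter), and the number of train-validation folds used for cross-validation (cv). Once it is defined, the fit() method runs the whole process, after which the best setting found is available in best_params_.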

search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

search.fit(X_train, y_train)

print("Best Parameters:", search.best_params_)

best_rf = search.best_estimator_
y_pred = best_rf.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

In my run, the "best" ensemble found used the following hyperparameter setting and reached close to 98% accuracy on the test data.

Best Parameters: {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 20, 'bootstrap': False}
Test Accuracy: 0.9777777777777777
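
Beyond the single best setting, it is often useful to look at how the other sampled candidates scored. The following is a minimal, self-contained sketch that ranks candidates using the cv_results_ attribute of a fitted search. To keep it quick, it uses a smaller search space, fewer iterations, and fewer folds than the article's example; those reduced values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)

# Deliberately small search so the sketch runs quickly (illustrative values)
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [10, 25, 50],
        "max_depth": [None, 5, 10],
    },
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

# cv_results_ holds one entry per sampled candidate
results = search.cv_results_
order = np.argsort(results["rank_test_score"])  # best-ranked candidates first

for i in order[:3]:
    mean = results["mean_test_score"][i]
    std = results["std_test_score"][i]
    print(f"{mean:.4f} +/- {std:.4f}  {results['params'][i]}")
```

Comparing the mean scores with their standard deviations across folds helps judge whether the top candidate is clearly better or only marginally ahead.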

Bayesian Search

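This strategy also samples from the defined search space instead of enumerating it, but it does so more intelligently: it uses the results of earlier evaluations to pick promising points and regions, which can make it more effective than randomized search on challenging problems and datasets. The class it needs is not part of the core Scikit-learn library. It lives in a separate extension library for advanced optimization strategies that follows the Scikit-learn API. This add-on library is called skopt, short for scikit-optimize (you may need to install it first with pip install scikit-optimize).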

Below is a working example that uses it to tune another random forest classifier on the same dataset.

from skopt import BayesSearchCV
from skopt.space import Integer

from sklearn.ensemble import RandomForestClassifier

search_space = {
    'n_estimators': Integer(100, 300),
    'max_depth': Integer(5, 30),
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 4)
}

opt = BayesSearchCV(
    estimator=RandomForestClassifier(),
    search_spaces=search_space,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

opt.fit(X_train, y_train)

As you can see, the workflow is very similar to that of RandomizedSearchCV.
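
To build some intuition for what "picking promising points" means, here is a toy sketch of a surrogate-model loop on a one-dimensional problem. It is not how skopt is implemented internally: the objective function, the Gaussian process settings, and the upper-confidence-bound rule with a weight of 2.0 are all assumptions chosen for illustration, with the toy objective standing in for a cross-validation score.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    # Toy stand-in for a cross-validation score (hypothetical), highest at x = 2.0
    return -(x - 2.0) ** 2


rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 5.0, 201).reshape(-1, 1)  # points we may evaluate

# Start from a few random evaluations, as a randomized search would
X_obs = rng.uniform(0.0, 5.0, size=(3, 1))
y_obs = objective(X_obs).ravel()

for _ in range(10):
    # Fit a surrogate model to everything observed so far
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True, random_state=0
    )
    gp.fit(X_obs, y_obs)

    # Score every candidate: predicted value plus a bonus for uncertainty
    mean, std = gp.predict(candidates, return_std=True)
    acquisition = mean + 2.0 * std

    # Evaluate the objective only at the most promising candidate
    x_next = candidates[np.argmax(acquisition)].reshape(1, -1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = X_obs[np.argmax(y_obs), 0]
print("Best x found:", best_x, "with score:", y_obs.max())
```

The uncertainty bonus is what separates this from a purely greedy search: points far from anything evaluated so far still get a chance, while regions the surrogate already predicts to be poor are mostly skipped.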

Successive Halving

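Successive halving uses adaptive resource allocation: it starts with many candidate configurations, each evaluated on a small budget, and gradually narrows them down. The key point is that the computational budget per candidate increases step by step as poor configurations are discarded, which concentrates resources on the most promising candidates. This usually makes the process more efficient than a conventional grid or randomized search.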
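
The shape of this process can be sketched with a few lines of arithmetic. The helper below is a hypothetical illustration written for this article, not part of Scikit-learn, and it does not reproduce every detail of the library's bookkeeping: it simply divides the number of candidates by a factor each round while multiplying the per-candidate budget by the same factor.

```python
import math


def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Return (candidates, resources per candidate) for each round.

    Hypothetical helper for illustration only.
    """
    schedule = []
    resources = min_resources
    while resources <= max_resources:
        schedule.append((n_candidates, resources))
        if n_candidates == 1:
            break  # a single survivor: nothing left to eliminate
        n_candidates = math.ceil(n_candidates / factor)  # keep the top 1/factor
        resources *= factor                              # give survivors more budget
    return schedule


# Assumed starting point: 300 candidates, a budget of 1 to 300 units, factor 2
schedule = halving_schedule(n_candidates=300, min_resources=1,
                            max_resources=300, factor=2)

for round_index, (candidates, resources) in enumerate(schedule):
    print(f"round {round_index}: {candidates} candidates x {resources} resources")
```

Most candidates are only ever evaluated with a tiny budget, and only a handful reach the expensive final rounds.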

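Scikit-learn has two classes that implement this strategy: HalvingGridSearchCV and HalvingRandomSearchCV. The former starts from every parameter combination in a grid and prunes the poorly performing ones early; the latter starts from randomly sampled configurations and applies the same pruning to them.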

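Both take a resource argument that names the quantity to increase as the pool of candidates shrinks. By default this is the number of training samples (n_samples); in the example below it is set to a hyperparameter, n_estimators, which must then be left out of the search space.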

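# Required: the halving search classes are experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401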
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    'max_depth': randint(5, 30),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 4),
    'bootstrap': [True, False]
}

search = HalvingRandomSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_dist,
    resource='n_estimators',
    max_resources=300,
    factor=2,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

search.fit(X_train, y_train)

print("Best Parameters:", search.best_params_)

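We can then inspect the best configuration found, which includes not only the hyperparameters in the search space but also the one used as the resource, n_estimators in this case.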

Best Parameters: {'bootstrap': False, 'max_depth': 16, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 256}
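
The example above uses HalvingRandomSearchCV. For completeness, here is a minimal sketch of the grid-based variant on the same dataset. The grid, the small resource budget, and the number of folds are assumptions picked to keep the run short, not tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
# Required: the halving search classes are experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV

X, y = load_digits(return_X_y=True)

# Small illustrative grid: 2 x 2 = 4 candidates
param_grid = {
    "max_depth": [5, None],
    "min_samples_leaf": [1, 4],
}

search = HalvingGridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    resource="n_estimators",   # the budget is the number of trees
    min_resources="exhaust",   # start high enough that the last round uses max_resources
    max_resources=40,
    factor=2,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Candidates per round:", search.n_candidates_)
print("Resources per round:", search.n_resources_)
print("Best Parameters:", search.best_params_)
```

The n_candidates_ and n_resources_ attributes show how the pool shrank and how the budget grew from one round to the next.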

Summary

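This article presented three advanced strategies for tuning the hyperparameters of machine learning models with Scikit-learn and the scikit-optimize extension: randomized search, Bayesian search, and successive halving. All of them go beyond the traditional grid search approach.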
