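Feature Selection with Stochastic Optimization Algorithms

By Jason Brownlee on October 12, 2021 in Optimization

Typically, a simpler and better-performing machine learning model can be developed by removing input features (columns) from the training dataset.

This is called feature selection, and there are many different types of algorithms that can be used.

The problem of feature selection can be framed as an optimization problem. When there are few input features, all possible combinations of input features can be evaluated and the best subset found definitively. When there is a large number of input features, a stochastic optimization algorithm can be used to explore the search space and find an effective subset of features.

In this tutorial, you will discover how to use optimization algorithms for feature selection in machine learning.

After completing this tutorial, you will know:

The problem of feature selection can be broadly defined as an optimization problem.
How to enumerate all possible subsets of input features for a dataset.
How to apply stochastic optimization to select an optimal subset of input features.

Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

How to Use Optimization for Feature Selection
Photo by Gregory “Slobirdr” Smith, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Optimization for Feature Selection
Enumerate All Feature Subsets
Optimize Feature Subsets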

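Optimization for Feature Selection

Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to lower the computational cost of modeling and, in some cases, to improve the performance of the model. There are many different types of feature selection algorithms, although they can broadly be grouped into two main types: wrapper methods and filter methods.

Wrapper feature selection methods create many models with different subsets of input features and select the features that result in the best-performing model according to a performance metric. These methods are unconcerned with variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) the input variables that will be used in the model.

Wrapper Feature Selection: Search for well-performing subsets of features.
Filter Feature Selection: Select subsets of features based on their relationship with the target.

For more on choosing feature selection algorithms, see the tutorial:

How to Choose a Feature Selection Method For Machine Learning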
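
As a rough illustration of the filter approach described above, the sketch below scores each feature against the target and keeps the top two. The choice of scikit-learn's SelectKBest with the f_classif score, the dataset sizes, and k=2 are assumptions made for this example only; they are not taken from the tutorial.

```python
# filter-style feature selection: score each feature, keep the best k
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=200, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# keep the two features with the highest ANOVA F-statistic (k=2 is an assumption)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
# indexes of the columns that were retained
kept = selector.get_support(indices=True)
print(X_new.shape, kept)
```

Note that no model is fit here: each feature is scored on its own, which is what distinguishes a filter method from a wrapper method.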

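A popular wrapper method is the Recursive Feature Elimination, or RFE, algorithm.

RFE searches for a subset of features by starting with all features in the training dataset and successively removing features until the desired number remains.

This is achieved by fitting the machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until the specified number of features remains.

For more on RFE, see the tutorial:

Recursive Feature Elimination (RFE) for Feature Selection in Python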
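
The RFE procedure described above can be sketched with scikit-learn's RFE class. This is a minimal example, assuming a decision tree as the core model and three features to keep; both choices, and the small dataset, are illustrative rather than anything the tutorial prescribes.

```python
# wrapper-style feature selection with recursive feature elimination
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=200, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# repeatedly fit the tree, rank features by importance, and drop the weakest
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1), n_features_to_select=3)
rfe.fit(X, y)
# support_ is a boolean mask over the original columns
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)
```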

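The problem of wrapper feature selection can be framed as an optimization problem. That is, find the subset of input features that results in the best model performance.

RFE is one approach to solving this problem systematically, although it may be limited by a large number of features.

An alternative approach, when the number of features is very large, is to use a stochastic optimization algorithm, such as a stochastic hill climbing algorithm. When the number of features is relatively small, it may be possible to enumerate all possible subsets of features.

Few Input Variables: Enumerate all possible subsets of features.
Many Input Features: Use a stochastic optimization algorithm to find good subsets of features.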
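
To get a feel for why enumeration only suits a small number of features, the sketch below counts the non-empty subsets for a few feature counts. The helper name n_subsets and the example counts are hypothetical, chosen only for illustration.

```python
# count the non-empty feature subsets for a given number of features
def n_subsets(n_features):
	# each feature is either included or excluded; drop the empty subset
	return 2 ** n_features - 1

for n in (5, 20, 500):
	print(n, n_subsets(n))
```

With five features there are only 31 subsets to evaluate, whereas the count for 500 features is far beyond anything that could be enumerated.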

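Now that we are familiar with the idea that feature selection can be explored as an optimization problem, let's look at how we might enumerate all possible feature subsets.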

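Enumerate All Feature Subsets

When the number of input variables is relatively small and the model evaluation is relatively fast, it may be possible to enumerate all possible subsets of input variables.

This means evaluating the performance of a model using a test harness for each possible unique group of input variables.

We will explore how to do this with a worked example.

First, let's define a small binary classification dataset with few input features. We can use the make_classification() function to define a dataset with 1,000 rows and five input variables, two of which are informative and three of which are redundant.

The example below defines the dataset and summarizes its shape.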

# define a small classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it has the desired shape.

(1000, 5) (1000,)

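Next, we can establish a baseline in performance using a model evaluated on the entire dataset.

We will use a DecisionTreeClassifier as the model because its performance is quite sensitive to the choice of input variables.

We will evaluate the model using good practices, such as repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.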

# evaluate a decision tree on the entire small dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

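Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation of the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved an accuracy of about 80.5 percent.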

Mean Accuracy: 0.805 (0.030)

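Next, we can try to improve model performance by using a subset of the input features.

First, we must choose a representation to enumerate.

In this case, we will enumerate a list of boolean values, with one value for each input feature: True if the feature is used as input and False if it is not.

For example, with five input features the sequence [True, True, True, True, True] would use all input features, and [True, False, False, False, False] would use only the first input feature.

We can enumerate all sequences of boolean values of length 5 using the product() function from Python's itertools module. We must specify the valid values [True, False] and the number of steps in the sequence, which is equal to the number of input variables.

The function returns an iterable that we can enumerate directly, one sequence at a time.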

...
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
	...
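
To see what this enumeration yields, here is a small sketch with three columns instead of five (the smaller count is an assumption made to keep the output short). It lists every boolean mask and then drops the all-False one.

```python
# enumerate every include/exclude mask for a small number of columns
from itertools import product

n_cols = 3
masks = list(product([True, False], repeat=n_cols))
# drop the mask that selects no columns at all
non_empty = [m for m in masks if any(m)]
# 8 masks in total, 7 once the all-False mask is removed
print(len(masks), len(non_empty))
# the first mask selects every column, the last selects none
print(masks[0], masks[-1])
```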

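For a given sequence of boolean values, we can enumerate it and transform it into a sequence of column indexes, one for each True in the sequence.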

...
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]

If the sequence has no column indexes (all False), then we can skip that sequence.

# check for no columns (all False)
if len(ix) == 0:
	continue

We can then use the column indexes to choose the columns in the dataset.

...
# select columns
X_new = X[:, ix]

This subset of the dataset can then be evaluated as we did before.

...
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize scores
result = mean(scores)

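If the accuracy for the model is at least as good as the best sequence found so far, we can store it. Because the comparison uses >=, a tie goes to the subset evaluated later.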

...
# check if it is better than the best so far
if best_score is None or result >= best_score:
	# better result
	best_subset, best_score = ix, result

And that's it.

Tying this together, the complete example of feature selection by enumerating all possible feature subsets is listed below.

# feature selection by enumerating all possible subsets of features
from itertools import product
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
	# convert into column indexes
	ix = [i for i, x in enumerate(subset) if x]
	# check for no columns (all False)
	if len(ix) == 0:
		continue
	# select columns
	X_new = X[:, ix]
	# define model
	model = DecisionTreeClassifier()
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# summarize scores
	result = mean(scores)
	# report progress
	print('>f(%s) = %f ' % (ix, result))
	# check if it is better than the best so far
	if best_score is None or result >= best_score:
		# better result
		best_subset, best_score = ix, result
# report best
print('Done!')
print('f(%s) = %f' % (best_subset, best_score))

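Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the best subset of features involved the features at indexes [0, 3, 4], which resulted in a mean classification accuracy of about 83.0 percent. This is better than the result reported previously using all input features.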

>f([0, 1, 2, 3, 4]) = 0.813667
>f([0, 1, 2, 3]) = 0.827667
>f([0, 1, 2, 4]) = 0.815333
>f([0, 1, 2]) = 0.824000
>f([0, 1, 3, 4]) = 0.821333
>f([0, 1, 3]) = 0.825667
>f([0, 1, 4]) = 0.807333
>f([0, 1]) = 0.817667
>f([0, 2, 3, 4]) = 0.830333
>f([0, 2, 3]) = 0.819000
>f([0, 2, 4]) = 0.828000
>f([0, 2]) = 0.818333
>f([0, 3, 4]) = 0.830333
>f([0, 3]) = 0.821333
>f([0, 4]) = 0.816000
>f([0]) = 0.639333
>f([1, 2, 3, 4]) = 0.823667
>f([1, 2, 3]) = 0.821667
>f([1, 2, 4]) = 0.823333
>f([1, 2]) = 0.818667
>f([1, 3, 4]) = 0.818000
>f([1, 3]) = 0.820667
>f([1, 4]) = 0.809000
>f([1]) = 0.797000
>f([2, 3, 4]) = 0.827667
>f([2, 3]) = 0.755000
>f([2, 4]) = 0.827000
>f([2]) = 0.516667
>f([3, 4]) = 0.824000
>f([3]) = 0.514333
>f([4]) = 0.777667
Done!
f([0, 3, 4]) = 0.830333

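Now that we know how to enumerate all possible feature subsets, let's look at how we might use a stochastic optimization algorithm to choose a subset of features.

Optimize Feature Subsets

We can apply a stochastic optimization algorithm to the search space of subsets of input features.

First, let's define a larger problem with many more features, making model evaluation too slow and the search space too large to enumerate all subsets.

We will define a classification problem with 10,000 rows and 500 input features, 10 of which are relevant and the remaining 490 of which are redundant.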

# define a large classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it has the desired shape.

(10000, 500) (10000,)

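We can establish a baseline in performance by evaluating a model on the dataset with all input features.

Because the dataset is large and the model is slow to evaluate, we will modify the evaluation of the model to use 3-fold cross-validation, i.e. fewer folds and no repeats.

The complete example is listed below.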

# evaluate a decision tree on the entire larger dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = StratifiedKFold(n_splits=3)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

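Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation of the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved an accuracy of about 91.3 percent.

This provides a baseline that we would expect to outperform using feature selection.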

Mean Accuracy: 0.913 (0.001)

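We will use a simple stochastic hill climbing algorithm as the optimization algorithm.

First, we must define the objective function. It takes the dataset and a subset of features to use as input and returns an estimated model accuracy from 0 (worst) to 1 (best). It is a maximizing optimization problem.

This objective function is simply the decoding of the sequence and the model evaluation step from the previous section.

The objective() function below implements this and returns both the score and the decoded subset of columns, which is helpful for reporting.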

# objective function
def objective(X, y, subset):
	# convert into column indexes
	ix = [i for i, x in enumerate(subset) if x]
	# check for no columns (all False)
	if len(ix) == 0:
		# return a tuple so callers can always unpack (score, indexes)
		return 0.0, ix
	# select columns
	X_new = X[:, ix]
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
	# summarize scores
	result = mean(scores)
	return result, ix
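
As a quick check of how such an objective function behaves, the sketch below calls it on a hand-written mask and on an all-False mask. The small five-feature dataset and the masks are assumptions for illustration, and this version returns a (score, indexes) tuple in both branches so that callers can always unpack two values.

```python
# calling the objective function on hand-written feature masks
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def objective(X, y, subset):
	# convert the boolean mask into column indexes
	ix = [i for i, x in enumerate(subset) if x]
	# an empty subset cannot be modeled, so give it the worst score
	if len(ix) == 0:
		return 0.0, ix
	# evaluate a decision tree on the selected columns only
	model = DecisionTreeClassifier()
	scores = cross_val_score(model, X[:, ix], y, scoring='accuracy', cv=3)
	return mean(scores), ix

# small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=300, n_features=5, n_informative=2, n_redundant=3, random_state=1)
score, ix = objective(X, y, [True, False, False, True, True])
print(ix, score)
empty_score, empty_ix = objective(X, y, [False] * 5)
print(empty_ix, empty_score)
```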

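We also need a function that can take a step in the search space.

Given an existing solution, it must modify it and return a new solution in close proximity. In this case, we will achieve this by randomly flipping the inclusion/exclusion of columns in the sequence.

Each position in the sequence is considered independently and flipped probabilistically, where the probability of flipping is a hyperparameter.

The mutate() function below implements this given a candidate solution (a sequence of booleans) and a mutation hyperparameter, creating and returning a modified solution (a step in the search space).

The larger the p_mutate value (in the range 0 to 1), the larger the step in the search space.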

# mutation operator
def mutate(solution, p_mutate):
	# make a copy
	child = solution.copy()
	for i in range(len(child)):
		# check for a mutation
		if rand() < p_mutate:
			# flip the inclusion
			child[i] = not child[i]
	return child
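
Because each position flips independently, the expected number of flips per mutation is the number of features multiplied by p_mutate. The sketch below checks this empirically; the 500 features, the probability of 10/500, the 1,000 trials, and the fixed seed are all assumptions chosen for illustration.

```python
# estimate how many positions a single mutation flips on average
from numpy import mean
from numpy.random import rand, seed, choice

def mutate(solution, p_mutate):
	# copy the solution and flip each position with probability p_mutate
	child = solution.copy()
	for i in range(len(child)):
		if rand() < p_mutate:
			child[i] = not child[i]
	return child

# fix the seed so the run is repeatable (an assumption for this sketch)
seed(1)
n_features, p_mutate = 500, 10.0 / 500.0
solution = choice([True, False], size=n_features)
# count the positions that differ between parent and child, over many trials
flips = [int((mutate(solution, p_mutate) != solution).sum()) for _ in range(1000)]
print('expected: %.1f, observed mean: %.2f' % (n_features * p_mutate, mean(flips)))
```

The observed mean should sit close to the expected value, which gives a direct way to reason about step size when choosing p_mutate.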

We can now implement the hill climbing algorithm.

The initial solution is a randomly generated sequence, which is then evaluated.

...
# generate an initial point
solution = choice([True, False], size=X.shape[1])
# evaluate the initial point
solution_eval, ix = objective(X, y, solution)

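We then loop for a fixed number of iterations, creating mutated versions of the current solution, evaluating them, and keeping them if the score is at least as good. Each iteration reports the number of columns in the candidate just evaluated alongside the best score found so far.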

...
# run the hill climb
for i in range(n_iter):
	# take a step
	candidate = mutate(solution, p_mutate)
	# evaluate candidate point
	candidate_eval, ix = objective(X, y, candidate)
	# check if we should keep the new point
	if candidate_eval >= solution_eval:
		# store the new point
		solution, solution_eval = candidate, candidate_eval
	# report progress
	print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))

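The hillclimbing() function below implements this, taking the dataset, objective function, and hyperparameters as arguments and returning the best subset of dataset columns and the estimated performance of the model.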

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
	# generate an initial point
	solution = choice([True, False], size=X.shape[1])
	# evaluate the initial point
	solution_eval, ix = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = mutate(solution, p_mutate)
		# evaluate candidate point
		candidate_eval, ix = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
		# report progress
		print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
	return solution, solution_eval

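We can then call this function and pass in our synthetic dataset to perform optimization for feature selection.

In this case, we will run the algorithm for 100 iterations and flip each position with a probability of 10/500, which makes about 10 flips to the sequence for a given mutation. This is quite conservative.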

...
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)

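At the end of the run, we convert the boolean sequence into column indexes (so that we could fit a final model if we wanted) and report the performance of the best subsequence.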

...
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))
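
Once the column indexes are available, fitting a final model on them might look like the sketch below. The five-feature dataset and the hand-written mask stand in for a real search result and are assumptions for illustration; the key point is that new rows must be reduced to the same columns before prediction.

```python
# fit a final model on a chosen feature subset and use it for prediction
from numpy import array
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=300, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# hypothetical search result: a boolean mask over the five columns
subset = array([True, False, False, True, True])
# convert the mask into column indexes; this list names the selected features
ix = [i for i, x in enumerate(subset) if x]
# fit the final model on the selected columns only
model = DecisionTreeClassifier()
model.fit(X[:, ix], y)
# select the same columns before making predictions
yhat = model.predict(X[:5, ix])
print(ix, yhat)
```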

Tying this all together, the complete example is listed below.

# stochastic optimization for feature selection
from numpy import mean
from numpy.random import rand
from numpy.random import choice
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# objective function
def objective(X, y, subset):
	# convert into column indexes
	ix = [i for i, x in enumerate(subset) if x]
	# check for no columns (all False)
	if len(ix) == 0:
		# return a tuple so callers can always unpack (score, indexes)
		return 0.0, ix
	# select columns
	X_new = X[:, ix]
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
	# summarize scores
	result = mean(scores)
	return result, ix

# mutation operator
def mutate(solution, p_mutate):
	# make a copy
	child = solution.copy()
	for i in range(len(child)):
		# check for a mutation
		if rand() < p_mutate:
			# flip the inclusion
			child[i] = not child[i]
	return child

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
	# generate an initial point
	solution = choice([True, False], size=X.shape[1])
	# evaluate the initial point
	solution_eval, ix = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = mutate(solution, p_mutate)
		# evaluate candidate point
		candidate_eval, ix = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
		# report progress
		print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
	return solution, solution_eval

# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))

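Running the example reports, for each iteration, the number of features in the candidate subset and the best mean classification accuracy found so far. The best subset is then reported at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the best performance was achieved with a subset of 239 features and a classification accuracy of approximately 91.8 percent.

This is better than a model evaluated on all input features.

Although the result is better, we know we can do better still, perhaps by tuning the hyperparameters of the optimization algorithm or by using an alternate optimization algorithm.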

...
>80 f(240) = 0.918099
>81 f(236) = 0.918099
>82 f(238) = 0.918099
>83 f(236) = 0.918099
>84 f(239) = 0.918099
>85 f(240) = 0.918099
>86 f(239) = 0.918099
>87 f(245) = 0.918099
>88 f(241) = 0.918099
>89 f(239) = 0.918099
>90 f(239) = 0.918099
>91 f(241) = 0.918099
>92 f(243) = 0.918099
>93 f(245) = 0.918099
>94 f(239) = 0.918099
>95 f(245) = 0.918099
>96 f(244) = 0.918099
>97 f(242) = 0.918099
>98 f(238) = 0.918099
>99 f(248) = 0.918099
>100 f(238) = 0.918099
Done!
Best: f(239) = 0.918099

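Summary

In this tutorial, you discovered how to use optimization algorithms for feature selection in machine learning.

Specifically, you learned:

The problem of feature selection can be broadly defined as an optimization problem.
How to enumerate all possible subsets of input features for a dataset.
How to apply stochastic optimization to select an optimal subset of input features.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.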

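27 Responses to Feature Selection with Stochastic Optimization Algorithms

fabou December 25, 2020 at 5:38 am

Merry Christmas! I can't wait to try this out.

- Jason Brownlee December 25, 2020 at 5:45 am
  
  Thanks!
  
  Let me know how you go.
  
  - Fabrice BOUCAHREL December 31, 2020 at 10:36 pm
    
    After some time, I decided to use optimization to group the categories of categorical features. I'm working on it.
    In my opinion, the mutation step is quite important. I chose to mutate the category groupings in 3 random ways: merge 2 groups, split 1 group, or move a category from one group to another. This has to be done for all categorical features in the dataset.
    Of course, the evaluation has to be done with cross-validation.
    
    - Jason Brownlee January 1, 2021 at 5:28 am
      
      Sounds interesting, good luck with your project!
      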
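Saber Abid December 25, 2020 at 6:14 am

Thanks Jason for this very valuable article.
May I ask a question: if we want to model this problem and relate it to a mathematical model, what would the constraints and the objective function be?

Thanks again

Saber

- Keith December 25, 2020 at 7:42 pm
  
  For the objective function, see the paper titled "Feature Importance Ranking for Deep Learning", published at NeurIPS 2020.
  
- Jason Brownlee December 26, 2020 at 5:01 am
  
  You're welcome.
  
  Not sure what you mean exactly. The objective function is the evaluation of the model on the dataset.
  
Jhon Connor December 25, 2020 at 7:41 am

Is there any article that shows the pseudocode of the algorithm?

- Jason Brownlee December 26, 2020 at 5:03 am
  
  Which algorithm are you referring to?
  
Van Tai December 25, 2020 at 11:32 pm

Nice work. Thanks, and Merry Christmas!

- Jason Brownlee December 26, 2020 at 5:11 am
  
  Thanks!
  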
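marco December 26, 2020 at 4:34 am

Hi Jason,
Merry Christmas.
I have a question about scatter plot matrices.
I'd like to know whether they apply to classification or regression?
Thanks

- Jason Brownlee December 26, 2020 at 5:13 am
  
  Yes.
  
  You can create pair-wise scatter plots of any data you like, e.g. pairs of input variables for a regression or classification dataset.
  
marco December 26, 2020 at 4:35 am

Another question,
I've seen many improvements to HistGradientBoostingClassifier in the new sklearn release (0.24).
What are the main differences between XGBoost and HistGradientBoostingClassifier?
Do you have an example of HistGradientBoostingClassifier?
Is it better to use XGBoost or HistGradientBoostingClassifier?
Thanks
Marco

- Jason Brownlee December 26, 2020 at 5:14 am
  
  I do, and I have more posts written and scheduled.
  
  Perhaps start here:
  https://machinelearning.org.cn/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost/
  
Md Sajid December 26, 2020 at 7:27 pm

Thanks for sharing, a very informative article.

- Jason Brownlee December 27, 2020 at 5:00 am
  
  You're welcome.
  
Asad Khan January 30, 2021 at 8:36 pm

We hope that one day you will implement particle swarm optimization (PSO) for feature selection.

- Jason Brownlee January 31, 2021 at 5:33 am
  
  Thanks for the suggestion.
  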
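weipanpan August 22, 2021 at 12:24 am

Hello, I'd like to ask what the result [239] in the final step of optimizing feature subsets means?

- Adrian Tam August 23, 2021 at 5:07 am
  
  It means: the code generated a dataset with 500 features and applied a decision tree for classification. According to the validation, it is best to use only 239 features for the decision tree.
  
weipanpan August 23, 2021 at 7:14 pm

Thank you for taking the time to reply. I have another question: since the 239 features are not shown specifically, how can I determine which 239 features were selected?

- Adrian Tam August 24, 2021 at 8:31 am
  
  There is no easy way to list them, but scikit-learn's decision tree can print its structure, from which you can see the names of the features. There is an example in the documentation: https://scikit-learn.cn/stable/auto_examples/tree/plot_unveil_tree_structure.htm
  
Ayman AlMutlaq March 6, 2022 at 9:08 am

Hi Jason,

Thanks for the tutorial. I am trying to optimize feature weights in analogy-based estimation (similar to a KNN regressor) by optimizing the similarity distance. The objective function will be to minimize the accuracy measure.

I need to ask how to use cross-validation to validate my final model?

- James Carmichael March 6, 2022 at 1:02 pm
  
  Hi Ayman... The following may be of interest to you.
  
  https://machinelearning.org.cn/repeated-k-fold-cross-validation-with-python/
  
Rakesh October 19, 2022 at 3:48 am

Hi Jason,
Thanks a lot for the information. Honestly, I am learning ML from you, and I always look forward to your posts. This time I am trying feature selection methods on 200 features. I have one question. At the end of your code, the result is that a subset of 238 features gave the best result. But which combination of features is it? How can I know? Could you please reply on this? Thanks.

- James Carmichael October 19, 2022 at 6:53 am
  
  Hi Rakesh... You may find the following of interest.
  
  https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e
  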
