Nearest Shrunken Centroids With Python

By Jason Brownlee on June 19, 2020 in Python Machine Learning

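Nearest Centroids is a linear classification machine learning algorithm.

It predicts the class label for a new example by finding which class-based centroid from the training dataset the example is closest to.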
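The idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than the scikit-learn implementation; the function names fit_centroids and predict_nearest and the tiny dataset are assumptions made for the example.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the sorted class labels and one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == label].mean(axis=0) for label in classes])
    return classes, centroids

def predict_nearest(X, classes, centroids):
    """Assign each row of X the label of its closest centroid (Euclidean distance)."""
    # distances has shape (n_rows, n_classes)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]

# hypothetical dataset: two well-separated groups in two dimensions
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X_train, y_train)

# classify two new rows, one near each group
X_new = np.array([[0.9, 1.1], [4.1, 3.9]])
print(predict_nearest(X_new, classes, centroids))  # prints [0 1]
```

Each class is summarized by a single mean vector, which is why the model is so cheap to fit and to store.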

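The Nearest Shrunken Centroids algorithm is an extension that shifts the class-based centroids toward the centroid of the entire training dataset and removes those input variables that are less useful for discriminating between the classes.

As such, the Nearest Shrunken Centroids algorithm performs an automatic form of feature selection, making it appropriate for datasets with very large numbers of input variables.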
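The shrinking step can also be sketched with NumPy. This is a simplified illustration, not the scikit-learn implementation: it soft-thresholds the raw offset of each class centroid from the overall centroid, whereas the published method and scikit-learn first standardize each offset by a within-class standard deviation. The function name shrink_centroids and the toy data are assumptions made for the example.

```python
import numpy as np

def shrink_centroids(X, y, threshold):
    """Soft-threshold each class centroid toward the overall centroid (simplified)."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    centroids = np.array([X[y == label].mean(axis=0) for label in classes])
    # per-feature offset of each class centroid from the overall centroid
    offsets = centroids - overall
    # reduce every offset by `threshold`, clipping at zero so it never changes sign
    shrunken = np.sign(offsets) * np.maximum(np.abs(offsets) - threshold, 0.0)
    # a feature is kept only if at least one class still has a non-zero offset
    kept = np.any(shrunken != 0.0, axis=0)
    return overall + shrunken, kept

# hypothetical data: feature 0 separates the classes, feature 1 barely does
X = np.array([[0.0, 1.0], [0.2, 1.1], [3.0, 1.0], [3.2, 1.2]])
y = np.array([0, 0, 1, 1])
centroids, kept = shrink_centroids(X, y, threshold=0.5)
print(centroids)
print(kept)
```

In this toy case the second feature is dropped: both class centroids end up at the overall centroid for that feature, so it can no longer influence which centroid is nearest.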

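In this tutorial, you will discover the Nearest Shrunken Centroids classification machine learning algorithm.

After completing this tutorial, you will know:

Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

Let's get started.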

Nearest Shrunken Centroids With Python
Photo by Giuseppe Milo, some rights reserved.

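Tutorial Overview

This tutorial is divided into three parts; they are:

Nearest Centroids Algorithm
Nearest Centroids With Scikit-Learn
Tuning Nearest Centroid Hyperparameters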

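Nearest Centroids With Scikit-Learn

Nearest Shrunken Centroids is available in the scikit-learn Python machine learning library via the NearestCentroid class.

The class allows the distance metric used by the algorithm to be configured via the “metric” argument, which defaults to ‘euclidean’ for the Euclidean distance.

This can be changed to other built-in metrics such as ‘manhattan’.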

...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean')

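By default, no shrinkage is used, but shrinkage can be specified via the “shrink_threshold” argument, which takes a floating point value. The values explored in this tutorial lie between 0 and 1, although larger thresholds are also accepted and shrink more aggressively.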

...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean', shrink_threshold=0.5)
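To get a feel for what the threshold does, we can fit the model with a few values and inspect the fitted “centroids_” attribute. The sketch below counts the features whose centroids coincide across all classes, which are the features that no longer affect the prediction. The helper name count_removed_features, the tolerance of 1e-12 and the chosen thresholds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid

# same synthetic dataset as used in the rest of the tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)

def count_removed_features(threshold):
    """Fit a shrunken model and count features whose centroids are identical across classes."""
    model = NearestCentroid(shrink_threshold=threshold)
    model.fit(X, y)
    # centroids_ has shape (n_classes, n_features)
    spread = model.centroids_.max(axis=0) - model.centroids_.min(axis=0)
    return int(np.sum(spread < 1e-12))

# larger thresholds should remove at least as many features as smaller ones
for threshold in [0.1, 0.5, 2.0]:
    print(threshold, count_removed_features(threshold))
```

The same check can be used on your own data to see how many input variables survive a given threshold.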

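We can demonstrate Nearest Shrunken Centroids with a worked example.

First, let's define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 20 input variables.

The example below creates and summarizes the dataset.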

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 20) (1000,)

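We can fit and evaluate a Nearest Shrunken Centroids model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.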
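As a quick sanity check on what this test harness does, the sketch below enumerates the splits that RepeatedStratifiedKFold generates for the dataset defined above. The variable names are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

# same synthetic dataset as used in the rest of the tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)

# 10 folds repeated 3 times, with a different shuffle for each repeat
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X, y))
print(len(splits))  # 10 folds x 3 repeats = 30 train/test splits

# every split partitions the rows into a training part and a held-out part
train_ix, test_ix = splits[0]
print(len(train_ix), len(test_ix))

# stratification keeps the class balance of each held-out fold close to the full dataset
print(y.mean(), y[test_ix].mean())
```

The model is therefore fit and scored 30 times, and the reported accuracy is the mean of those 30 scores.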

We will use the default configuration of Euclidean distance and no shrinkage.

...
# create the nearest centroid model
model = NearestCentroid()

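The complete example of evaluating the Nearest Shrunken Centroids model on the synthetic binary classification task is listed below.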

# evaluate a nearest centroid model on the dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

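Running the example evaluates the Nearest Shrunken Centroids algorithm on the synthetic dataset and reports the mean accuracy across the three repeats of 10-fold cross-validation.

The dataset and the cross-validation splits are seeded and the model itself is deterministic, so your results should be close to those shown here. Small differences can still arise from differences in library versions or numerical precision.

In this case, we can see that the model achieved a mean accuracy of about 71 percent.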

Mean Accuracy: 0.711 (0.055)

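We may decide to use Nearest Shrunken Centroids as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

We can demonstrate this with the complete example listed below.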

# make a prediction with a nearest centroid model on the dataset
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# fit model
model.fit(X, y)
# define new data
row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat[0])

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 0

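Next, we can look at configuring the model hyperparameters.

Tuning Nearest Centroid Hyperparameters

The hyperparameters for the Nearest Shrunken Centroids method must be configured for your specific dataset.

Perhaps the most important hyperparameter is the shrinkage, controlled via the “shrink_threshold” argument. It is a good idea to test values between 0 and 1 on a grid with a spacing such as 0.1 or 0.01.

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.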

# grid search shrinkage for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

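Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may differ slightly depending on library versions and numerical precision.

In this case, we can see that we achieved slightly better results than the default, with 71.4 percent vs. 71.1 percent. We can see that the model assigned a shrink_threshold value of 0.53.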

Mean Accuracy: 0.714
Config: {'shrink_threshold': 0.53}
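It can also be useful to look at the score for every threshold rather than only the best one. The sketch below reads the per-candidate scores from the “cv_results_” attribute of the fitted search. To keep it fast, it uses a coarse grid of five thresholds and a single 5-fold split; both choices are assumptions for illustration and differ from the search above, so the scores will not match exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import NearestCentroid

# same synthetic dataset as used in the rest of the tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)

# coarse grid and a single 5-fold split (assumptions to keep the sketch quick)
grid = {'shrink_threshold': [0.1, 0.25, 0.5, 0.75, 1.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(NearestCentroid(), grid, scoring='accuracy', cv=cv)
results = search.fit(X, y)

# report the mean cross-validated accuracy for each candidate threshold
mean_scores = results.cv_results_['mean_test_score']
for params, mean_score in zip(results.cv_results_['params'], mean_scores):
    print('%.2f -> %.3f' % (params['shrink_threshold'], mean_score))
```

If the scores are nearly flat across the grid, the choice of threshold matters little for accuracy on this dataset, and a larger threshold that keeps fewer features may be preferable.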

The other key configuration is the distance measure used, which can be chosen based on the distribution of the input variables.

Any of the built-in distance measures can be used, as listed in the metrics.pairwise.pairwise_distances API.

Common distance measures include:

‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’

For more on how these distance measures are calculated, see the tutorial:

4 Distance Measures for Machine Learning
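As a quick illustration of how the Euclidean and Manhattan measures differ, the sketch below computes both between two hypothetical points using scikit-learn's pairwise_distances() function.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# two hypothetical points, each given as a single-row matrix
a = np.array([[0.0, 0.0]])
b = np.array([[3.0, 4.0]])

# Euclidean: straight-line distance, sqrt(3^2 + 4^2) = 5
euclidean = pairwise_distances(a, b, metric='euclidean')
print(euclidean)  # prints [[5.]]

# Manhattan: sum of absolute differences, |3| + |4| = 7
manhattan = pairwise_distances(a, b, metric='manhattan')
print(manhattan)  # prints [[7.]]
```

Manhattan distance grows with the sum of the per-feature differences, so it is less dominated by a single large difference than Euclidean distance.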

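Given that our input variables are numeric, our dataset only supports ‘euclidean’ and ‘manhattan’.

We can include these metrics in our grid search; the complete example is listed below.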

# grid search shrinkage and distance metric for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
grid['metric'] = ['euclidean', 'manhattan']
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

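Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

Your specific results may differ slightly depending on library versions and numerical precision.

In this case, we can see that we get a slightly better accuracy of 75 percent using no shrinkage and the Manhattan distance instead of the Euclidean distance.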

Mean Accuracy: 0.750
Config: {'metric': 'manhattan', 'shrink_threshold': 0.0}

A good extension to these experiments would be to add data normalization or standardization as part of a modeling Pipeline.
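A minimal sketch of that extension is shown below. It assumes standardization with the StandardScaler class and the default model configuration; the step names ‘scale’ and ‘model’ are arbitrary labels chosen for the example. Placing the scaler inside the Pipeline means it is fit on the training folds only, which avoids leaking information from the held-out fold.

```python
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# same synthetic dataset as used in the rest of the tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)

# standardize the inputs, then fit the nearest centroid model
pipeline = Pipeline(steps=[('scale', StandardScaler()), ('model', NearestCentroid())])

# same evaluation procedure as the earlier examples
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

The pipeline can be passed to GridSearchCV in the same way as the bare model, with the model hyperparameters addressed as ‘model__shrink_threshold’ and ‘model__metric’ in the grid.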

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

API

Articles

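Summary

In this tutorial, you discovered the Nearest Shrunken Centroids classification machine learning algorithm.

Specifically, you learned:

Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.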

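6 Responses to Nearest Shrunken Centroids With Python

RK_pat October 15, 2020 at 6:16 pm #

Hello, how can we see the centroid for each dimension or feature, and the main centroid for each target class? Do you know how to create a score for each entry?

- Jason Brownlee October 16, 2020 at 5:52 am #
  
  Via the “centroids_” attribute on the model.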
  https://scikit-learn.cn/stable/modules/generated/sklearn.neighbors.NearestCentroid.html
  
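Cameron August 18, 2022 at 12:36 am #

Hello,
Thanks for your article. How do I change the line of code that defines the “make_classification” dataset so that it loads my own numeric dataset (gene expression data) instead?
Thank you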

- James Carmichael August 18, 2022 at 11:02 am #
  
  Hi Cameron… You may find the following helpful:
  
  https://machinelearning.org.cn/how-to-load-data-in-python-with-scikit-learn/
  
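Amit April 20, 2024 at 1:55 pm #

Thanks for the great tutorial. I am still unclear on how to determine which features have been shrunk to zero by the shrinkage threshold, and how many features remain.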

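- James Carmichael April 21, 2024 at 10:18 am #
  
  Hi Amit… Understanding how shrunken centroids work, especially in the context of feature selection through shrinkage, requires a grasp of techniques such as shrunken centroids regularized discriminant analysis (also known as the “nearest shrunken centroids” method). The method is commonly used in settings such as gene expression data classification, where the number of features (genes) can be far larger than the number of samples. Let's walk through how the method works and how to identify which features have been effectively eliminated by the shrinkage process.
  
  ### What Are Shrunken Centroids?
  
  Shrunken centroids, popularized by the Prediction Analysis of Microarrays (PAM) method, are used mainly for classification. The method shrinks the class centroids toward the overall centroid of all classes by an amount determined by a shrinkage parameter (often called lambda). The goal is to improve classification accuracy by reducing variance without significantly increasing bias.
  
  ### How Does Shrinkage Work?
  
  1. **Centroid calculation**: For each class, compute the centroid of the features for the samples belonging to that class. This is the mean of each feature over the samples of that class.
  
  2. **Overall centroid**: Compute the overall centroid of the features across all classes.
  
  3. **Shrinkage**: Each class centroid component is “shrunk” toward the overall centroid. The degree of shrinkage depends on the shrinkage parameter (lambda). The higher the value of lambda, the greater the shrinkage.
  
  4. **Effect of shrinkage**: If shrinkage pulls a centroid component all the way to the overall centroid (or very close to it), the corresponding feature contributes nothing useful to distinguishing the classes. When a feature's centroid components for all classes are shrunk to the overall centroid, the feature has little or no discriminating power under the shrinkage penalty and can be considered shrunk to zero.
  
  ### How to Identify Features Shrunk to Zero?
  
  To determine which features have been shrunk to zero, look at the differences between the class centroids and the overall centroid after shrinkage.
  
  1. **Thresholding**: After shrinkage is applied, any feature whose class centroids (for all classes) are sufficiently close to the overall centroid can be considered shrunk to zero. Closeness can be determined by a threshold related to the lambda value.
  
  2. **Practical implementation**: If you are using R's pamr package for PAM, it typically provides tools to visualize or directly identify which features have coefficients reduced to zero. For example, using pamr in R:
  
       library(pamr)
       fit <- pamr.train(data, labels)
       thresholded <- pamr.threshold(fit)
       print(thresholded$features)
  
  Here, thresholded$features will list the features remaining after the shrinkage threshold is applied. Features not listed are those that were shrunk to zero.
  
  ### Considerations
  
  - **Choice of lambda**: The choice of the shrinkage parameter lambda is critical. It can usually be selected by cross-validation to optimize classification performance while minimizing overfitting.
  
  - **Impact on feature selection**: The method acts as a form of feature selection, retaining after shrinkage only the features that contribute to class separation.
  
  Knowing which features have been shrunk to zero in a shrunken centroids model helps to simplify the model and focus on the most relevant attributes. This makes the model both easier to interpret and more efficient, especially with high-dimensional data.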
