Introduction to AutoML: Automating Machine Learning Workflows

By Abid Ali Awan on July 26, 2024 in Machine Learning Process

Image by Author

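AutoML is a class of tools, aimed at both technical and non-technical users, that simplifies the process of training machine learning models. You provide a dataset, and the tool returns the best-performing model it can find for your use case. You don't have to spend hours writing code or trying out different techniques; it handles that work on its own.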

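In this tutorial, we will learn about AutoML and TPOT, a Python AutoML tool for building machine learning pipelines. We will also learn how to build a machine learning classifier, save the model, and use it for model inference.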

What is AutoML?

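AutoML, or Automated Machine Learning, refers to tools that take a dataset and perform the remaining tasks in the background to produce a high-performing machine learning model. These tasks include data preprocessing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. Even non-technical users can build highly sophisticated machine learning models with an AutoML tool.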

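By searching over candidate algorithms and configurations, an AutoML system automatically discovers a well-performing model and configuration for a given dataset, reducing the time and effort required to develop machine learning models.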
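To make the idea of automated model selection concrete, here is a minimal sketch in plain Python. The toy data, the threshold-rule "models", and the helper names are all hypothetical, and the candidates have no fitting step, so this only illustrates the "score every candidate with cross-validation, keep the best" loop rather than any real AutoML library.

```python
import random

# Hypothetical toy data: one numeric feature, label is 1 when the feature exceeds 5.
rng = random.Random(0)
X = [rng.uniform(0, 10) for _ in range(200)]
y = [1 if x > 5 else 0 for x in X]

# Hypothetical candidate "models": threshold rules with different cut-offs.
candidates = {f"threshold={t}": t for t in (2, 4, 5, 6, 8)}


def accuracy(threshold, xs, ys):
    """Fraction of samples where the rule `x > threshold` matches the label."""
    hits = sum((1 if x > threshold else 0) == label for x, label in zip(xs, ys))
    return hits / len(xs)


def cross_val_score(threshold, xs, ys, k=5):
    """Average accuracy over k contiguous folds of the data."""
    fold = len(xs) // k
    scores = [
        accuracy(threshold, xs[i * fold:(i + 1) * fold], ys[i * fold:(i + 1) * fold])
        for i in range(k)
    ]
    return sum(scores) / k


# Score every candidate and keep the one with the highest average score.
scores = {name: cross_val_score(t, X, y) for name, t in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```

A real AutoML tool runs the same kind of loop over full pipelines (preprocessing, feature selection, estimator, hyperparameters) instead of a handful of fixed rules.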

1. Getting Started with TPOT

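TPOT (Tree-based Pipeline Optimization Tool) is one of the simplest and most popular AutoML tools. It uses genetic programming to optimize machine learning pipelines, automatically exploring hundreds of candidate pipelines to identify the most effective one for a given dataset.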
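The evolutionary idea can be sketched with a toy search in plain Python. This is not TPOT's actual algorithm: TPOT evolves tree-structured pipelines with its own selection, crossover, and mutation operators, whereas this sketch uses simple truncation selection and mutation. The search space, the `fitness` function, and all helper names are hypothetical stand-ins.

```python
import random

rng = random.Random(55)


def random_pipeline():
    """Hypothetical search space: a 'pipeline' is just a pair of hyperparameters."""
    return {"max_features": rng.uniform(0.05, 1.0), "min_samples_leaf": rng.randint(1, 20)}


def fitness(p):
    """Stand-in score. A real AutoML tool would cross-validate the pipeline here."""
    return -abs(p["max_features"] - 0.9) - 0.01 * abs(p["min_samples_leaf"] - 4)


def mutate(p):
    """Return a copy of the pipeline with one hyperparameter nudged at random."""
    child = dict(p)
    if rng.random() < 0.5:
        child["max_features"] = min(1.0, max(0.05, child["max_features"] + rng.gauss(0, 0.1)))
    else:
        child["min_samples_leaf"] = max(1, child["min_samples_leaf"] + rng.choice([-1, 1]))
    return child


population_size, generations = 20, 5
population = [random_pipeline() for _ in range(population_size)]
history = [max(map(fitness, population))]

for _ in range(generations):
    # Selection: keep the better half. Variation: refill with mutated copies.
    survivors = sorted(population, key=fitness, reverse=True)[: population_size // 2]
    offspring = [mutate(rng.choice(survivors)) for _ in range(population_size - len(survivors))]
    population = survivors + offspring
    history.append(max(map(fitness, population)))

best = max(population, key=fitness)
print(best, history)
```

Because the best candidates always survive into the next generation, the best score recorded in `history` never decreases from one generation to the next.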

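You can install TPOT with the following command. The leading ! runs it from a notebook cell; omit it when installing from a terminal.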

!pip install tpot==0.12.2

Load the Python libraries needed to load and process the data and to train the classification model.

import numpy as np
import pandas as pd
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

2. Loading the Data

In this tutorial, we use the Mushroom Dataset from Kaggle, which contains 9 features used to determine whether a mushroom is poisonous.

We will load the dataset with Pandas and randomly select 1000 samples from it.

data = pd.read_csv('mushroom_cleaned.csv')
data = data.sample(n=1000, random_state=55)
data.head()
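Fixing `random_state` is what makes the 1000-row sample reproducible. The sketch below shows the same idea with Python's standard library; the row indices are hypothetical, and since pandas draws from NumPy's generator rather than `random`, the actual rows selected would differ from this stand-in.

```python
import random

rows = list(range(5000))  # hypothetical row indices standing in for the full dataset


def sample_rows(rows, n, seed):
    """Draw n rows without replacement using a dedicated seeded generator."""
    return random.Random(seed).sample(rows, n)


first = sample_rows(rows, 1000, seed=55)
second = sample_rows(rows, 1000, seed=55)
other = sample_rows(rows, 1000, seed=7)

print(first == second)  # same seed, same subset
print(first == other)   # a different seed gives a different subset
```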

3. Data Processing

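The "class" column is our target variable. It takes two values, 0 or 1, where 0 means non-poisonous and 1 means poisonous. We use it to separate the independent variables (X) from the dependent variable (y), and then split the data into training and test sets.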

X = data.drop('class', axis=1)
y = data['class'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
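With `test_size=0.2` and 1000 sampled rows, the split leaves 800 rows for training and 200 for testing. The following is a minimal plain-Python sketch of a shuffled hold-out split; `simple_train_test_split` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import random


def simple_train_test_split(items, test_size, seed):
    """Shuffle a copy of `items` and cut off a `test_size` fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


train, test = simple_train_test_split(range(1000), test_size=0.2, seed=55)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, which is what lets the test score act as an estimate of performance on unseen data.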

4. Building and Fitting the TPOT Classifier

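We will initialize the TPOT classifier and train it on the training set. During fitting, TPOT tries a variety of models and techniques and returns the best-performing pipeline.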

# Initialize TPOTClassifier
tpot = TPOTClassifier(verbosity=2, generations=5, population_size=20, random_state=55)

# Fit the classifier to the training data
tpot.fit(X_train, y_train)

While fitting, TPOT reports the scores for each generation and, at the end, the best pipeline it found.
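The size of the search implied by `generations=5` and `population_size=20` can be estimated with a little arithmetic. This sketch assumes the defaults described in TPOT's documentation for this API (offspring size equal to the population size, 5-fold cross-validation); treat the result as an upper-bound estimate, since duplicate or invalid pipelines may be skipped.

```python
population_size = 20
generations = 5
offspring_size = population_size  # assumed: documented default when offspring_size is not set
cv_folds = 5                      # assumed: documented default for the `cv` parameter

# Initial population plus one batch of offspring per generation.
pipelines_evaluated = population_size + generations * offspring_size
# Each pipeline is fitted once per cross-validation fold.
model_fits = pipelines_evaluated * cv_folds
print(pipelines_evaluated, model_fits)
```

Raising either parameter widens the search at a roughly proportional cost in training time.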

Let's evaluate the best pipeline on the test dataset using the .score function.

# Evaluate the model on the test set
print(tpot.score(X_test, y_test))

The best pipeline scores 0.875 on the test set, so we have a reasonably accurate model.

0.875
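To put the score in context, the sketch below converts it back into counts, assuming the score is plain accuracy (the default for a TPOT classifier) on the 200-row test set produced by the 80/20 split. The standard error uses a simple binomial approximation, which is an assumption added here and not part of TPOT's output.

```python
import math

n_samples = 1000
test_size = 0.2
accuracy = 0.875

n_test = round(n_samples * test_size)   # rows held out for testing
n_correct = round(accuracy * n_test)    # correctly classified test rows
n_wrong = n_test - n_correct

# Rough uncertainty of an accuracy estimated from n_test samples (binomial approximation).
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
print(n_test, n_correct, n_wrong, round(std_err, 4))
```

With only 200 test rows, the estimate carries an uncertainty of a couple of percentage points, so small differences between runs should not be over-interpreted.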

5. Saving the TPOT Pipeline and Model

To save the TPOT pipeline, we use the .export function and give it a file name with the .py extension.

tpot.export('tpot_mashroom_pipeline.py')

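The pipeline is saved as a Python file containing the code for the best pipeline. To run it, you need to make a few changes: the path to the dataset, the column separator, and the name of the target column.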

tpot_mashroom_pipeline.py

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=55)

# Average CV score on the training set was: 0.8800000000000001
exported_pipeline = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(criterion="entropy", max_features=0.9000000000000001, n_estimators=100), threshold=0.1),
    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.9500000000000001, min_samples_leaf=4, min_samples_split=2, n_estimators=100)
)

# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 55)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
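The exported script expects the label column to be called 'target', while the mushroom data uses 'class'. One option is to edit the script; another is to prepare a copy of the CSV with the column renamed. The sketch below shows the second option with the standard library only. The two-row extract and its feature column names are hypothetical placeholders, not the real file contents.

```python
import csv
import io

# Hypothetical two-row extract; the real file has more columns and rows.
raw = "cap-diameter,stem-height,class\n1372,3.8,1\n1461,3.8,0\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)
# Rename the label column so it matches what the exported script expects.
header = ["target" if name == "class" else name for name in header]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(reader)  # copy the remaining data rows unchanged
converted = out.getvalue()
print(converted)
```

For a comma-separated file like this one, the `sep` placeholder in the exported script would be replaced with ','.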

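You can also save the fitted model as a pickle file using the joblib library. The file stores the fitted pipeline object, including its learned parameters, so it can be loaded later for inference without retraining.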

import joblib

joblib.dump(tpot.fitted_pipeline_, 'tpot_mashroom_pipeline.pkl')
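joblib builds on Python's pickle protocol, so the save-and-load round trip can be illustrated with the standard library alone. In this sketch the "model" is a hypothetical stand-in whose learned state is a single threshold. Note that the file holds the serialized state, while the code that uses it must still be available when loading; for a real pipeline, that means the same libraries need to be installed in the loading environment.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: its learned state is a single threshold.
learned_state = {"threshold": 4.2}


def predict(state, values):
    """Inference code lives in the program, not in the saved file."""
    return [1 if v > state["threshold"] else 0 for v in values]


samples = [1.0, 4.2, 5.5, 9.0]
before = predict(learned_state, samples)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_state.pkl")
    with open(path, "wb") as f:
        pickle.dump(learned_state, f)      # save the learned state to disk
    with open(path, "rb") as f:
        restored_state = pickle.load(f)    # load it back

after = predict(restored_state, samples)
print(before == after)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.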

6. Loading the TPOT Pipeline and Model Inference

We will load the saved model with the joblib.load function and predict the first 10 samples of the test dataset.

model = joblib.load('tpot_mashroom_pipeline.pkl')

print(y_test[0:10])
print(model.predict(X_test[0:10]))

The model is accurate on these samples: the predicted labels match the actual labels for all ten.

[1 1 1 1 1 1 0 1 0 1]
[1 1 1 1 1 1 0 1 0 1]
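Rather than comparing the two printed arrays by eye, the agreement can be counted programmatically. The sketch below copies the two label lists from the output above into plain Python lists.

```python
actual = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
predicted = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]

# Count positions where the prediction equals the true label.
matches = sum(a == p for a, p in zip(actual, predicted))
agreement = matches / len(actual)
mismatched_positions = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
print(matches, agreement, mismatched_positions)
```

Ten samples are only a spot check; the score on the full test set remains the more reliable measure.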

Summary

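In this tutorial, we learned about AutoML and how anyone, including non-technical users, can use it. We also learned how to use TPOT, a Python AutoML tool that automates data processing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. In the end, it took only two lines of code to train and obtain the best-performing model and pipeline. We can even save the model and use it to build AI applications.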
