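TPOT for Automated Machine Learning in Python

By Jason Brownlee on September 7, 2020 in Python Machine Learning

Automated Machine Learning (AutoML) refers to techniques for automatically discovering well-performing models for predictive modeling tasks with very little user involvement.

TPOT is an open-source library for performing AutoML in Python. It uses the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms, and a genetic programming stochastic global search procedure to efficiently discover a top-performing model pipeline for a given dataset.

In this tutorial, you will discover how to use TPOT for AutoML with Scikit-Learn machine learning algorithms in Python.

After completing this tutorial, you will know:

TPOT is an open-source library for AutoML that uses scikit-learn for data preparation and machine learning models.
How to use TPOT to automatically discover top-performing models for classification tasks.
How to use TPOT to automatically discover top-performing models for regression tasks.

Let's get started.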


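Tutorial Overview

This tutorial is divided into four parts; they are:

TPOT for Automated Machine Learning
Install and Use TPOT
TPOT for Classification
TPOT for Regression

TPOT for Automated Machine Learning

Tree-based Pipeline Optimization Tool, or TPOT for short, is a Python library for automated machine learning.

TPOT uses a tree-based structure to represent a model pipeline for a predictive modeling problem, including data preparation, modeling algorithms, and model hyperparameters.

… an evolutionary algorithm called the Tree-based Pipeline Optimization Tool (TPOT) that automatically designs and optimizes machine learning pipelines.

Source: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

An optimization procedure is then performed to find the tree structure that performs best for a given dataset. Specifically, TPOT uses a genetic programming algorithm, which is designed to perform stochastic global optimization on programs represented as trees.

TPOT uses a version of genetic programming to automatically design and optimize a series of data transformations and machine learning models that attempt to maximize the classification accuracy for a given supervised learning data set.

Source: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.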
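
To make the idea of a stochastic, population-based search more concrete, here is a minimal toy sketch in plain Python. It is not TPOT's algorithm: it uses mutation only, represents a pipeline as a flat dictionary rather than a tree, and relies on a made-up scoring function (toy_score) that stands in for a cross-validated score. The search space and all function names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical search space: each "pipeline" is one choice per key.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["tree", "linear", "knn"],
    "depth": [1, 2, 3, 4, 5],
}


def random_pipeline(rng):
    """Sample one random pipeline configuration."""
    return {key: rng.choice(options) for key, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Copy a pipeline and re-sample one randomly chosen element."""
    child = dict(pipeline)
    key = rng.choice(list(SEARCH_SPACE))
    child[key] = rng.choice(SEARCH_SPACE[key])
    return child


def toy_score(pipeline):
    """Made-up fitness that stands in for a cross-validated score."""
    score = 0.5
    score += 0.2 if pipeline["scaler"] == "standard" else 0.0
    score += 0.2 if pipeline["model"] == "tree" else 0.0
    score -= 0.02 * abs(pipeline["depth"] - 3)
    return score


def evolve(generations=5, population_size=10, seed=1):
    """Run a mutation-only evolutionary search and track the best score."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Keep the best half unchanged, refill with mutated copies of survivors.
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 2]
        n_children = population_size - len(survivors)
        children = [mutate(rng.choice(survivors), rng) for _ in range(n_children)]
        population = survivors + children
        history.append(max(toy_score(p) for p in population))
    return max(population, key=toy_score), history


best, history = evolve()
print(best)
print(history)
```

Because the best half of each generation is carried over unchanged, the best score in history can never decrease from one generation to the next. Genetic programming as used by TPOT additionally recombines parts of different pipelines (crossover) and operates on tree-shaped programs.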

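The figure below, taken from the TPOT paper, shows the elements involved in the pipeline search, including data cleaning, feature selection, feature processing, feature construction, model selection, and hyperparameter optimization.

Overview of the TPOT Pipeline Search
Taken from: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

Now that we are familiar with what TPOT is, let's look at how we can install and use TPOT to find an effective model pipeline.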

Install and Use TPOT

The first step is to install the TPOT library, which can be done using pip, as follows:

pip install tpot

Once installed, we can import the library and print the version number to confirm it was installed successfully:

# check tpot version
import tpot
print('tpot: %s' % tpot.__version__)

Running the example prints the version number.

Your version number should be the same or higher.

tpot: 0.11.1

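Using TPOT is straightforward.

It involves creating an instance of the TPOTRegressor or TPOTClassifier class, configuring it for the search, and then exporting the model pipeline that was found to achieve the best performance on your dataset.

Configuring the class involves two main elements.

The first is how models will be evaluated, e.g. the cross-validation scheme and the performance metric. I recommend explicitly specifying a cross-validation class with your chosen configuration and the performance metric to use.

For example, RepeatedKFold for regression with the "neg_mean_absolute_error" metric: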

...
# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTRegressor(... scoring='neg_mean_absolute_error', cv=cv)

Or RepeatedStratifiedKFold for classification with the "accuracy" metric:

...
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTClassifier(... scoring='accuracy', cv=cv)

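The other element is the nature of the stochastic global search procedure.

As an evolutionary algorithm, this involves setting configuration such as the size of the population, the number of generations to run, and potentially the crossover and mutation rates. The population size and number of generations control the extent of the search; the crossover and mutation rates can be left at their default values if evolutionary search is new to you.

For example, a modest population size of 100 and 5 or 10 generations is a good starting point (the examples in this tutorial use a population size of 50).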

...
# define search
model = TPOTClassifier(generations=5, population_size=50, ...)
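
As a rough guide to how much work such a configuration implies, the sketch below estimates the number of pipelines and model fits. It assumes the relationship described in the TPOT documentation for this API, where a search evaluates population_size + generations × offspring_size pipelines and offspring_size defaults to population_size. The helper names are hypothetical, and the real count can differ in practice, for example when pipelines fail or time out.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Estimated number of pipelines evaluated during one search (assumed formula)."""
    if offspring_size is None:
        # Assumption: offspring_size defaults to population_size.
        offspring_size = population_size
    return population_size + generations * offspring_size


def model_fits(generations, population_size, n_splits, n_repeats, offspring_size=None):
    """Each pipeline is fit once per cross-validation split."""
    n_pipelines = pipelines_evaluated(generations, population_size, offspring_size)
    return n_pipelines * n_splits * n_repeats


print(pipelines_evaluated(generations=5, population_size=50))  # 300
print(model_fits(generations=5, population_size=50, n_splits=10, n_repeats=3))  # 9000
```

This is why even a small search can take several minutes: with 10 folds and three repeats, every candidate pipeline is fit 30 times.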

At the end of a search, a pipeline is found that performs the best.

This pipeline can be exported as code into a Python file that you can later copy and paste into your own project.

...
# export the best model
model.export('tpot_model.py')

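Now that we are familiar with how to use TPOT, let's look at some worked examples with real data.

TPOT for Classification

In this section, we will use TPOT to discover a model for the sonar dataset.

The sonar dataset is a standard machine learning dataset comprised of 208 rows of data with 60 numerical input variables and a target variable with two class values, i.e. binary classification.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 53 percent. A top-performing model can achieve an accuracy of about 88 percent on this same test harness. This provides the bounds of expected performance on this dataset.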
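
The naive figure of about 53 percent is consistent with always predicting the most frequent class. The sketch below shows that calculation in plain Python. The class counts (111 mines labeled 'M' and 97 rocks labeled 'R') are an assumption based on the commonly reported description of the sonar dataset and are not stated in this tutorial.

```python
from collections import Counter

# Assumed class counts for the sonar dataset: 111 mines ('M') and 97 rocks ('R').
labels = ['M'] * 111 + ['R'] * 97

# A naive classifier always predicts the most frequent class.
counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, round(baseline_accuracy, 3))  # M 0.534
```

Any pipeline found by a search should score well above this baseline to be considered skillful.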

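The dataset involves predicting whether sonar returns indicate a rock or a simulated mine.

There is no need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.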

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

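Next, let's use TPOT to find a good model for the sonar dataset.

First, we can define the method for evaluating models. We will use the good practice of repeated stratified k-fold cross-validation with three repeats and 10 folds.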

...
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

We will use a population size of 50 for five generations for the search and use all cores on the system by setting "n_jobs" to -1.

...
# define search
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)

Finally, we can start the search and ensure that the best-performing model is saved at the end of the run.

...
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_sonar_best_model.py')

Tying this together, the complete example is listed below.

# example of tpot for the sonar classification dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from tpot import TPOTClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_sonar_best_model.py')

Running the example may take a few minutes, and you will see a progress bar on the command line.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The accuracy of top-performing models will be reported along the way.

Generation 1 - Current best internal CV score: 0.8650793650793651
Generation 2 - Current best internal CV score: 0.8650793650793651
Generation 3 - Current best internal CV score: 0.8650793650793651
Generation 4 - Current best internal CV score: 0.8650793650793651
Generation 5 - Current best internal CV score: 0.8667460317460318

Best pipeline: GradientBoostingClassifier(GaussianNB(input_matrix), learning_rate=0.1, max_depth=7, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=10, n_estimators=100, subsample=0.9000000000000001)

In this case, we can see that the top-performing pipeline achieved a mean accuracy of about 86.7 percent. This is a skillful model, and close to a top-performing model on this dataset.

The top-performing pipeline is then saved to a file named "tpot_sonar_best_model.py".

Opening this file, you can see that there is some generic code for loading a dataset and fitting the pipeline. An example is listed below.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: 0.8667460317460318
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianNB()),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=7, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=10, n_estimators=100, subsample=0.9000000000000001)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Note: as-is, this code does not execute, by design; the data file path and column separator are placeholders. It is a template that you can copy and paste into your project.

In this case, we can see that the best-performing model is a pipeline comprised of a Gaussian Naive Bayes model, wrapped in TPOT's StackingEstimator so that its predictions are passed along as additional input features, followed by a Gradient Boosting model.

We can adapt this code to fit a final model on all available data and make a prediction for new data.

The complete example is listed below.

# example of fitting a final model and making a prediction on the sonar dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# Average CV score on the training set was: 0.8667460317460318
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianNB()),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=7, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=10, n_estimators=100, subsample=0.9000000000000001)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)
# fit the model
exported_pipeline.fit(X, y)
# make a prediction on a new row of data
row = [0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032]
yhat = exported_pipeline.predict([row])
print('Predicted: %.3f' % yhat[0])

Running the example fits the best-performing model on the dataset and makes a prediction for a single row of new data.

Predicted: 1.000
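
The printed value is an integer class label produced by the label encoding step, not a probability. The plain-Python sketch below mirrors how scikit-learn's LabelEncoder assigns integers, namely by the sorted order of the distinct class values. It assumes the sonar labels are the strings 'M' (mine) and 'R' (rock), which this tutorial does not state explicitly, and the helper name is hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer by sorted order (as LabelEncoder does)."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# Assumed label values for the sonar dataset: 'M' for mine, 'R' for rock.
mapping = fit_label_mapping(['R', 'M', 'R', 'M', 'M'])
inverse = {index: label for label, index in mapping.items()}

predicted = 1  # the integer printed by the example
print(mapping)             # {'M': 0, 'R': 1}
print(inverse[predicted])  # R
```

Under that assumption, a prediction of 1 corresponds to a rock. With the real encoder, keeping the fitted LabelEncoder object and calling its inverse_transform() method recovers the original label in the same way.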

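TPOT for Regression

In this section, we will use TPOT to discover a model for the auto insurance dataset.

The auto insurance dataset is a standard machine learning dataset comprised of 63 rows of data with one numerical input variable and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 66. A top-performing model can achieve a MAE of about 28 on this same test harness. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the total amount in claims (thousands of Swedish Kronor) given the number of claims for different geographical regions.

There is no need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.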

# summarize the auto insurance dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 63 rows of data with one input variable.

(63, 1) (63,)

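Next, we can use TPOT to find a good model for the auto insurance dataset.

First, we can define the method for evaluating models. We will use the good practice of repeated k-fold cross-validation with three repeats and 10 folds.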

...
# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
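
To show what n_splits and n_repeats mean in practice, here is a plain-Python sketch of repeated k-fold splitting for a dataset of 63 rows. It is an illustration of the idea only: the function name is hypothetical, and the shuffled order will not match the indices that scikit-learn's RepeatedKFold produces for the same seed.

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=3, seed=1):
    """Yield (train_indices, test_indices) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        # Each repeat uses a fresh shuffle of all row indices.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Fold sizes differ by at most one row.
        base, extra = divmod(n_samples, n_splits)
        fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]
        start = 0
        for size in fold_sizes:
            test = indices[start:start + size]
            train = indices[:start] + indices[start + size:]
            yield train, test
            start += size


splits = list(repeated_kfold(n_samples=63))
print(len(splits))  # 30
print(sorted({len(test) for _, test in splits}))  # [6, 7]
```

Each of the three repeats partitions the 63 rows into 10 folds, so every row is used for testing exactly once per repeat and every candidate model is fit and scored 30 times.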

We will use a population size of 50 for five generations for the search and use all cores on the system by setting "n_jobs" to -1.

...
# define search
model = TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error', cv=cv, verbosity=2, random_state=1, n_jobs=-1)

Finally, we can start the search and ensure that the best-performing model is saved at the end of the run.

...
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_insurance_best_model.py')

Tying this together, the complete example is listed below.

# example of tpot for the insurance regression dataset
from pandas import read_csv
from sklearn.model_selection import RepeatedKFold
from tpot import TPOTRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error', cv=cv, verbosity=2, random_state=1, n_jobs=-1)
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_insurance_best_model.py')

Running the example may take a few minutes, and you will see a progress bar on the command line.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The MAE of top-performing models will be reported along the way.

Generation 1 - Current best internal CV score: -29.147625969129034
Generation 2 - Current best internal CV score: -29.147625969129034
Generation 3 - Current best internal CV score: -29.147625969129034
Generation 4 - Current best internal CV score: -29.147625969129034
Generation 5 - Current best internal CV score: -29.147625969129034

Best pipeline: LinearSVR(input_matrix, C=1.0, dual=False, epsilon=0.0001, loss=squared_epsilon_insensitive, tol=0.001)

In this case, we can see that the top-performing pipeline achieved a mean MAE of about 29.15 (the scores are reported as negative values because the metric is "neg_mean_absolute_error"). This is a skillful model, and close to a top-performing model on this dataset.
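
The sign convention can be illustrated with a small plain-Python sketch. Scikit-learn's "neg_" scorers negate an error so that a larger score is always better, which lets a search maximize the score regardless of the metric. The target values, predictions, and model names below are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mean_absolute_error(y_true, y_pred):
    """Negated MAE, so that a larger score means a better model."""
    return -mean_absolute_error(y_true, y_pred)


# Made-up targets and predictions from two hypothetical models.
y_true = [100.0, 50.0, 20.0]
predictions = {
    "model_a": [90.0, 60.0, 25.0],  # absolute errors 10, 10, 5
    "model_b": [70.0, 80.0, 50.0],  # absolute errors 30, 30, 30
}

scores = {name: neg_mean_absolute_error(y_true, y_pred)
          for name, y_pred in predictions.items()}
best_name = max(scores, key=scores.get)
print(best_name)  # model_a

# Converting the score reported by the search back to a plain MAE.
reported_score = -29.147625969129034
print(round(-reported_score, 2))  # 29.15
```

Maximizing the negated score selects the model with the smallest error, and flipping the sign of the reported score recovers the MAE in the original units.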

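The top-performing pipeline is then saved to a file named "tpot_insurance_best_model.py".

Opening this file, you can see that there is some generic code for loading a dataset and fitting the pipeline. An example is listed below.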

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -29.147625969129034
exported_pipeline = LinearSVR(C=1.0, dual=False, epsilon=0.0001, loss="squared_epsilon_insensitive", tol=0.001)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Note: as-is, this code does not execute, by design; the data file path and column separator are placeholders. It is a template that you can copy and paste into your project.

In this case, we can see that the best-performing model is a pipeline comprised of a single linear support vector machine model (LinearSVR).

We can adapt this code to fit a final model on all available data and make a prediction for new data.

The complete example is listed below.

# example of fitting a final model and making a prediction on the insurance dataset
from pandas import read_csv
from sklearn.svm import LinearSVR
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# Average CV score on the training set was: -29.147625969129034
exported_pipeline = LinearSVR(C=1.0, dual=False, epsilon=0.0001, loss="squared_epsilon_insensitive", tol=0.001)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)
# fit the model
exported_pipeline.fit(X, y)
# make a prediction on a new row of data
row = [108]
yhat = exported_pipeline.predict([row])
print('Predicted: %.3f' % yhat[0])

Running the example fits the best-performing model on the dataset and makes a prediction for a single row of new data.

Predicted: 389.612

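Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to use TPOT for AutoML with Scikit-Learn machine learning algorithms in Python.

Specifically, you learned:

TPOT is an open-source library for AutoML that uses scikit-learn for data preparation and machine learning models.
How to use TPOT to automatically discover top-performing models for classification tasks.
How to use TPOT to automatically discover top-performing models for regression tasks.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.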


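41 Responses to TPOT for Automated Machine Learning in Python

xiaoning September 9, 2020 at 9:06 am #

Awesome!!!

- Jason Brownlee September 9, 2020 at 1:31 pm #

  Thanks!

Prox September 9, 2020 at 9:11 am #

Thank you for the tutorial! Very helpful!
The only result I get from the TPOT for regression code is the following:
WARNING:stopit:Code block execution exceeded 2 seconds timeout
Traceback (most recent call last):
…
stopit.utils.TimeoutException

What is the problem?

- Jason Brownlee September 9, 2020 at 1:31 pm #

  Thanks.

  Looks like a warning, ignore it for now.

Hutudi September 9, 2020 at 12:28 pm #

Does it work for high dimensions?

- Jason Brownlee September 9, 2020 at 1:34 pm #

  I don't see why not.

Piotr September 9, 2020 at 4:41 pm #

Jason, very nice tutorial. I like the AutoML series. TPOT and Auto-Sklearn were among the first AutoML packages. The ability to search for the best model is really helpful and speeds up the data science process. Nowadays, other aspects of ML are becoming important, such as explainability. ML models cannot be black boxes; they should provide information about how they work and why they make their predictions. This helps greatly in understanding the data and the model. There is an AutoML package that generates extensive explanations for models: https://github.com/mljar/mljar-supervised I hope you will find it valuable and will describe it for your readers. Best wishes.

- Jason Brownlee September 10, 2020 at 6:24 am #

  Thanks for sharing.

Michael Klein September 9, 2020 at 8:24 pm #

Thank you for improving my (non-technical) understanding of the concepts that developers use.

For complex real-world challenges such as climate change or urban logistics, what accuracy can be obtained, given that naivety is at odds with currently accepted theories of physics, and that the underlying technology and people's social philosophies change over time?

- Jason Brownlee September 10, 2020 at 6:29 am #

  You're welcome.

  Sorry, I don't follow your question. I'm not sure we can address climate change with a simple predictive model.

ndcharles September 10, 2020 at 4:00 pm #

Many thanks, especially as hyperparameter tuning is giving me a headache right now.

However, in your classification model you encode y as a string. Is there a reason? (I thought all model parameters should be numeric.)

# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))

- Jason Brownlee September 11, 2020 at 5:50 am #

  Thanks.

  Yes, I ensure that the variable is a string before passing it to the label encoder to be ordinal encoded. It's an old habit.

shaheen mohammed saleh September 12, 2020 at 5:23 pm #

How many algorithms or models are in TPOT, many or few? Do you prefer automatically discovering well-performing models or choosing them manually?

- Jason Brownlee September 13, 2020 at 6:01 am #

  It searches many combinations of scikit-learn models.

shaheen mohammed saleh September 12, 2020 at 5:28 pm #

If you prefer automatically discovering well-performing models, which one do you prefer, and why? Thanks.

1- Autosklearn
2- TPOT
3- Hyperopt-sklearn

- Jason Brownlee September 13, 2020 at 6:01 am #

  Perhaps try each of them on your project and choose the one you like or that best meets your requirements.

Grzegorz Kępisty September 15, 2020 at 10:41 pm #

Good afternoon Jason,

Great article and examples!

Question: As I understand it, the idea of stacking is: data -> several algorithms -> intermediate outputs -> next algorithm -> final prediction. In your classification example, the best model is a stacked Gaussian Naive Bayes (GaussianNB) followed by Gradient Boosting. Is there only one GNB model inside (which seems too simple), or am I missing something?

Regards!

- Jason Brownlee September 16, 2020 at 6:22 am #

  Sometimes simple models perform well or best.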

Anthony The Koala September 17, 2020 at 5:19 pm #

Dear Dr Jason,
I ran the first example and the output is not the same.

Generation 5 - Current best internal CV score: 0.8779365079365081

Best pipeline: MLPClassifier(StandardScaler(input_matrix), alpha=0.0001, learning_rate_init=0.001)
TPOTClassifier(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),

Compared with your example:

Generation 5 - Current best internal CV score: 0.8667460317460318

Best pipeline: LinearSVR(input_matrix, C=1.0, dual=False, epsilon=0.0001, loss=squared_epsilon_insensitive, tol=0.001)

In other words, why is my best classifier MLPClassifier with a score of 0.877, while yours is LinearSVR with a score of 0.8667, even though I ran the same code?

Thank you,
Anthony of Sydney

- Jason Brownlee September 18, 2020 at 6:39 am #

  Well done.

  Yes, this is a common question:
  https://machinelearning.org.cn/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code

Anthony The Koala September 17, 2020 at 7:05 pm #

Dear Dr Jason,
Similarly, for regression I got:

Generation 5 - Current best internal CV score: -28.976067798113224

Best pipeline: RidgeCV(OneHotEncoder(OneHotEncoder(ExtraTreesRegressor(input_matrix, bootstrap=True, max_features=0.25, min_samples_leaf=11, min_samples_split=20, n_estimators=100), minimum_fraction=0.2, sparse=False, threshold=10), minimum_fraction=0.05, sparse=False, threshold=10))

whereas your experiment produced LinearBestSVR with a score of -29.148.

The code used on my computer is the same, but the results are slightly different.

Thank you,
Anthony of Sydney

- Jason Brownlee September 18, 2020 at 6:42 am #

  Yes, this is to be expected given the stochastic nature of the optimization algorithm.

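ahmed December 5, 2020 at 8:58 am #

AttributeError: 'TPOTClassifier' object has no attribute '_optimized_pipeline'

- Jason Brownlee December 5, 2020 at 1:20 pm #

  Perhaps these tips will help:
  https://machinelearning.org.cn/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

Max March 25, 2021 at 9:05 pm #

Dear Dr Jason,
Is it possible to optimize the pipeline steps (such as the scaler or encoder), or can it only optimize the model hyperparameters?

- Jason Brownlee March 26, 2021 at 6:25 am #

  I think so; I believe TPOT can do this or supports it.

Kate April 20, 2021 at 7:24 pm #

How can these results be put into production?

- Jason Brownlee April 21, 2021 at 5:56 am #

  Perhaps find the configuration of the best model, then fit a final model with that configuration on all data:
  https://machinelearning.org.cn/train-final-machine-learning-model/

Ben Bartling May 23, 2021 at 10:13 pm #

Jason, in the TPOT regressor example, when I try the code, from sklearn.model_selection import train_test_split is automatically imported in the boilerplate .py file that TPOT generates. Should we use it to validate the model? Or, since you do not use it in this example, what other validation methods are available for regression? Does TPOT also work for regression problems made up of time series data?

- Jason Brownlee May 24, 2021 at 5:45 am #

  It may be helpful as starting-point code for model validation.

  Using this framework for time series may be risky, as I don't think it would respect the temporal ordering of the samples, which would lead to an invalid evaluation.

Mehdi July 6, 2021 at 12:05 am #

Hi Jason,
Thank you for the tutorial.
I am using TPOT for hyperparameter optimization of GradientBoostingRegressor, but I get the following error:

Terminals are required to have a unique name. Consider using the argument 'name' to rename your second GradientBoostingRegressor__learning_rate=0 terminal.

Could you share your thoughts?

Thanks.

- Jason Brownlee July 6, 2021 at 5:49 am #

  Perhaps try another model type?
  Perhaps check that all of your libraries are up to date?
  Perhaps contact the tpot project?
  Perhaps post your code and error on stackoverflow?

Maryam Zeinolabedini Rezaabad August 25, 2021 at 8:46 pm #

Hi Jason,

Thank you very much for the tutorial.

I have two questions.
1. Is there a way to see the details of the best model, or of all models, built in each generation (e.g. the genetic programming tree structure)?

2. In genetic programming, is the initial random population generated from the definitions we provide (the real data), or not?

Thanks in advance,

- Adrian Tam August 27, 2021 at 5:34 am #

  If you want the details from each generation, consider the checkpoint parameter in tpot. However, you may need to write some code to visualize the details of the checkpoints.

Jaret May 9, 2022 at 11:35 pm #

Dear Dr Jason,
Your ideas have helped me a lot. However, I would like to know how to get the results in terms of MSE instead of MAE. I look forward to your reply.
Thank you for any help you can provide.

- James Carmichael May 10, 2022 at 12:09 pm #

  Jaret… You can change the following

  scoring='neg_mean_absolute_error'

  to

  scoring='neg_mean_squared_error'

  - Jaret May 11, 2022 at 12:28 am #

    Dear Dr Jason,
    Thank you for your reply; I now understand scoring. But how can I see the first model that reads a CSV file (and what do n_splits=10, n_repeats=3, random_state=1 mean)? As someone new to machine learning, I have found your advice very helpful; thank you for your generosity.

John White August 30, 2022 at 3:11 am #

Hi,

After receiving new training data, or when we decide to retrain the current model, is it necessary, or does it make sense, to run TPOT again? Thanks

-John

Akansha October 25, 2023 at 12:09 am #

Dear Dr Jason,
I tried the classification example, but the result is not the same. Instead, I get the error "A pipeline has not yet been optimized. Please call fit() first".

- James Carmichael October 25, 2023 at 9:08 am #

  Akansha… The following resource may be helpful:

  https://stackoverflow.com/questions/57347026/runtimeerror-a-pipeline-has-not-yet-been-optimized-please-call-fit-first-pro

Akansha October 30, 2023 at 5:35 pm #

Hi James,
Thanks for your reply. I looked at the resource. They mention that there may be a problem with the data itself. So I checked the link to the sonar dataset, and it does not work. I will try using a dataset on my computer to see if that works.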
