Easier Experimenting in Python

By Adrian Tam on June 21, 2022 in Python for Machine Learning

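When we work on a machine learning project, we often need to experiment with multiple alternatives. Some features in Python let us try out different options without much effort. In this tutorial, we are going to see some tips to make our experiments faster.

After finishing this tutorial, you will learn:

How to leverage duck typing to swap functions and objects easily
How to make components drop-in replacements for each other to help experiments run faster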

Let's get started.

Easier experimenting in Python. Photo by Jake Givens. Some rights reserved.

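Overview

This tutorial is in three parts; they are:

Workflow of a machine learning project
Functions as objects
Caveats

Workflow of a Machine Learning Project

Consider a very simple machine learning project as follows: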

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = SVC()
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)

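This is a typical machine learning project workflow. We have a stage of preprocessing the data, then training a model, and afterward, evaluating the result. But in each step, we may want to try something different. For example, we may wonder whether normalizing the data would improve the result. So we may rewrite the code above into the following: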

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)

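So far, so good. But what if we keep experimenting with different datasets, different models, or different score functions? Each time we flip between using a scaler and not using one, we have to change a lot of code, and it is easy to make mistakes.

Because Python supports duck typing, we can see that the following two classifier models implement the same interface: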

clf = SVC()
clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])

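Therefore, we can simply select between these two versions and keep everything else intact. We can say these two models are drop-in replacements for each other.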
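To make the idea of a shared interface concrete, here is a minimal pure-Python sketch. The classes `MajorityClassifier` and `Rescaled` and the helper `evaluate` are hypothetical stand-ins invented for illustration, not scikit-learn classes; they only mimic the `fit`/`score` methods that `SVC` and `Pipeline` both provide.

```python
class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def score(self, X, y):
        # Fraction of samples whose true label equals the memorized label
        return sum(t == self.label_ for t in y) / len(y)


class Rescaled:
    """Hypothetical wrapper: rescales features, then delegates to the wrapped model."""

    def __init__(self, model, factor=10.0):
        self.model = model
        self.factor = factor

    def _scale(self, X):
        return [[v / self.factor for v in row] for row in X]

    def fit(self, X, y):
        self.model.fit(self._scale(X), y)
        return self

    def score(self, X, y):
        return self.model.score(self._scale(X), y)


def evaluate(clf, X_train, y_train, X_val, y_val):
    # Works with any object that offers fit() and score(); its type is never checked
    clf.fit(X_train, y_train)
    return clf.score(X_val, y_val)


X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = ["a", "a", "a", "b"]
X_val = [[1.5], [3.5]]
y_val = ["a", "b"]

for clf in (MajorityClassifier(), Rescaled(MajorityClassifier())):
    print(type(clf).__name__, evaluate(clf, X_train, y_train, X_val, y_val))
```

The function `evaluate` never asks which class it received. Any object with compatible `fit` and `score` methods can be passed in, which is what makes the two objects interchangeable.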

Making use of this property, we can create a toggle variable to control the design choice we make:

USE_SCALER = True

if USE_SCALER:
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])
else:
    clf = SVC()

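By toggling the variable USE_SCALER between True and False, we can select whether a scaler should be applied. A more complex example would be to select among different scalers and classifier models, such as: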

SCALER = "standard"
CLASSIFIER = "svc"

if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError

A complete example is as follows:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# toggle between options
SCALER = "maxmin"    # "standard", "maxmin", or None
CLASSIFIER = "cart"  # "svc" or "cart"

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Create model
if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError

# Train
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
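As the number of options grows, the if/elif chains can become long. One possible alternative, sketched below, is to keep the options in a dictionary and look them up by name. The functions `standardize`, `min_max`, `identity`, and `pick` are hypothetical pure-Python stand-ins written for this illustration; in the scikit-learn example, the dictionary values would be classes such as `StandardScaler` and `MinMaxScaler` instead.

```python
import statistics


def standardize(values):
    # Shift to zero mean and scale to unit (population) standard deviation
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]


def min_max(values):
    # Rescale linearly so the smallest value maps to 0 and the largest to 1
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def identity(values):
    # The "no scaler" option: return the data unchanged
    return list(values)


# Option name -> interchangeable function (all take a list, return a list)
SCALERS = {"standard": standardize, "maxmin": min_max, None: identity}


def pick(options, name):
    # Mirror the if/elif chain: unknown names raise NotImplementedError
    try:
        return options[name]
    except KeyError:
        raise NotImplementedError(f"unknown option: {name!r}") from None


SCALER = "maxmin"    # "standard", "maxmin", or None
scale = pick(SCALERS, SCALER)
print(scale([1.0, 2.0, 3.0]))
```

Adding a new option then means adding one dictionary entry rather than another branch.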

If you go one step further, you may even skip the toggle variable and use a string directly for a quick experiment:

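import numpy as np
import scipy.stats as stats

# Covariance matrix and Cholesky decomposition
cov = np.array([[1, 0.8], [0.8, 1]])
L = np.linalg.cholesky(cov)

# Generate 100 pairs of bi-variate Gaussian random numbers
# A non-empty string is truthy, so this condition is False and the SciPy branch runs
if not "USE SCIPY":
    z = np.random.randn(100,2)
    x = z @ L.T
else:
    x = stats.multivariate_normal(mean=[0, 0], cov=cov).rvs(100)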

...
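The string trick relies on Python treating any non-empty string as true. The short sketch below isolates that behavior; the function names and the "USE FAST PATH" label are hypothetical and exist only for this illustration.

```python
def pick_branch():
    # A non-empty string is truthy, so `not "..."` is False: the else branch runs
    if not "USE FAST PATH":
        return "fast"
    else:
        return "slow"


def pick_branch_flipped():
    # Deleting `not` flips the choice; the string only serves as a readable label
    if "USE FAST PATH":
        return "fast"
    else:
        return "slow"


print(pick_branch())          # slow
print(pick_branch_flipped())  # fast
```

The string is never compared with anything. It works as a self-documenting constant, and adding or removing `not` is the whole toggle.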

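Functions as Objects

In Python, functions are first-class citizens. You can assign a function to a variable. Indeed, functions are objects in Python, as are classes (the classes themselves, not only instances of classes). Therefore, we can use the same technique as above to experiment with similar functions.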

import numpy as np

DIST = "normal"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)

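The above is similar to calling np.random.normal(size=(10,5)), but we hold the function in a variable so that it is convenient to swap one function for another. Note that since we call the functions with the same argument, we have to make sure all variations accept it. If they do not, we may need a few additional lines of code to make a wrapper. For example, to generate samples from Student's t distribution, we need an additional parameter for the degrees of freedom: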

import numpy as np

DIST = "t"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
elif DIST == "t":
    def t_wrapper(size):
        # Student's t distribution with 3 degrees of freedom
        return np.random.standard_t(df=3, size=size)
    rangen = t_wrapper
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)

The above works because np.random.normal, np.random.uniform, and t_wrapper as we defined it are all drop-in replacements for each other.
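A wrapper does not have to be a hand-written function. As one alternative, `functools.partial` from the standard library can pre-fill the extra arguments. The sketch below uses the standard `random` module instead of NumPy so that it runs without third-party packages; the `SAMPLERS` dictionary and the `draw` helper are hypothetical names chosen for this illustration.

```python
import functools
import random

# Each entry takes no arguments once its parameters are pre-filled,
# so all three can be called the same way
SAMPLERS = {
    "normal": functools.partial(random.gauss, 0.0, 1.0),         # mu, sigma
    "uniform": functools.partial(random.uniform, 0.0, 1.0),      # a, b
    "exponential": functools.partial(random.expovariate, 1.0),   # rate lambd
}


def draw(sampler, n):
    # Call the zero-argument sampler n times and collect the results
    return [sampler() for _ in range(n)]


DIST = "exponential"
random_data = draw(SAMPLERS[DIST], 5)
print(random_data)
```

In the NumPy example, `functools.partial(np.random.standard_t, df=3)` would play the same role as `t_wrapper`, since the result can still be called with `size=(10,5)`.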

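Caveats

Machine learning differs from other programming projects because there is more uncertainty in the workflow. When you build a web page or a game, you have a picture in mind of what you want to achieve. But a machine learning project involves some exploratory work.

In other projects, you would probably use a source control system like git or Mercurial to manage the history of your source code. In machine learning projects, however, we are trying out different combinations of many steps. Using git to manage the different variations may not fit, and it can also be overkill. Therefore, using a toggle variable to control the flow should allow us to try out different things faster. This is especially handy when we are working on a project in a Jupyter notebook.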
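Once the choices live in toggle variables, trying every combination can be a loop over their allowed values. The following is a minimal sketch using `itertools.product`; `run_experiment` is a hypothetical placeholder that only formats a label, whereas a real version would build, fit, and score the model for the given options.

```python
import itertools

SCALERS = ["standard", "maxmin", None]
CLASSIFIERS = ["svc", "cart"]


def run_experiment(scaler, classifier):
    # Hypothetical placeholder: a real version would train and evaluate a model here
    return f"scaler={scaler}, classifier={classifier}"


# Every (scaler, classifier) pair: 3 x 2 = 6 combinations
results = [run_experiment(s, c) for s, c in itertools.product(SCALERS, CLASSIFIERS)]
for line in results:
    print(line)
```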

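However, as we put multiple versions of code together, the program becomes clumsy and less readable. It is better to do some clean-up after we have settled on what to do. This will help us with maintenance in the future.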