How to Use Optimization Algorithms to Manually Fit Regression Models

By Jason Brownlee on October 12, 2021 in Optimization

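Regression models are fit on training data using linear regression and local search optimization algorithms.

Models like linear regression and logistic regression are trained with well-established procedures (least squares for linear regression, maximum likelihood estimation for logistic regression), which are the most efficient ways to find coefficients that minimize model error.

Nevertheless, it is possible to use alternate **optimization algorithms to fit a regression model** to a training dataset. This can be a useful exercise to learn more about how regression functions and about the central role of optimization in applied machine learning. It may also be required when the data does not meet the expectations of a least squares optimization procedure.

In this tutorial, you will discover how to manually optimize the coefficients of regression models.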

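After completing this tutorial, you will know:

How to develop the inference models for regression from scratch.
How to optimize the coefficients of a linear regression model for predicting numeric values.
How to optimize the coefficients of a logistic regression model using stochastic hill climbing.

Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.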

How to Use Optimization Algorithms to Manually Fit Regression Models
Photo by Christian Collins, some rights reserved.

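Tutorial Overview

This tutorial is divided into three parts; they are:

Optimize Regression Models
Optimize a Linear Regression Model
Optimize a Logistic Regression Model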

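Optimize Regression Models

Regression models, like linear regression and logistic regression, are well-understood algorithms from the field of statistics.

Both algorithms are linear, meaning the output of the model is based on a weighted sum of the inputs. Linear regression is designed for "regression" problems that require a number to be predicted, and logistic regression is designed for "classification" problems that require a class label to be predicted.

These regression models involve the use of an optimization algorithm to find a set of coefficients, one for each input to the model, that minimizes the prediction error. Because the models are linear and well understood, efficient optimization algorithms can be used.

In the case of linear regression, the coefficients can be found by least squares optimization, which can be solved using linear algebra. In the case of logistic regression, a local search optimization algorithm is commonly used.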
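
To make the "solved using linear algebra" remark concrete, here is a minimal sketch of the closed-form least squares solution for the simplest case of a single input variable. The function name fit_simple_least_squares() and the toy data are assumptions made for illustration; they are not part of the tutorial's code, and libraries solve the general multi-input case with matrix routines instead.

```python
# closed-form least squares for one input variable, using plain Python lists
# (illustrative helper, not part of the tutorial's own code)
def fit_simple_least_squares(xs, ys):
    # means of the input and the target
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # the intercept makes the fitted line pass through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

# hypothetical toy data generated exactly from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_least_squares(xs, ys)
print('slope=%.3f intercept=%.3f' % (slope, intercept))
```

No search is involved here: the coefficients come directly from sums over the data, which is why least squares is hard to beat on efficiency when its assumptions hold.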

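It is possible to use any arbitrary optimization algorithm to train linear and logistic regression models.

That is, we can define a regression model and use a given optimization algorithm to find a set of coefficients for the model that results in a minimum of prediction error or a maximum of classification accuracy.

Using alternate optimization algorithms is expected to be less efficient on average than using the recommended optimization. Nevertheless, it may be more efficient in some specific cases, such as when the input data does not meet the expectations of the model, for example inputs that follow a Gaussian distribution and are uncorrelated with one another.

It can also be an interesting exercise to demonstrate the central nature of optimization in training machine learning algorithms, and specifically regression models.

Next, let's explore how to train a linear regression model using stochastic hill climbing.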

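Optimize a Linear Regression Model

The linear regression model might be the simplest predictive model that learns from data.

The model has one coefficient for each input, and the predicted output is simply the weighted sum of the inputs and their coefficients, plus an intercept.

In this section, we will optimize the coefficients of a linear regression model.

First, let's define a synthetic regression problem that we can use as the focus of optimizing the model.

We can use the make_regression() function to define a regression problem with 1,000 rows and 10 input variables.

The example below creates the dataset and summarizes the shape of the data.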

# define a regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example prints the shape of the created dataset, confirming our expectations.

(1000, 10) (1000,)

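Next, we need to define a linear regression model.

Before we optimize the model coefficients, we must develop the model and our confidence in how it works.

Let's start by developing a function that calculates the activation of the model for a given input row of data from the dataset.

This function takes the row of data and the coefficients for the model and calculates the weighted sum of the input, plus an extra y-intercept (also called the offset or bias) coefficient. The predict_row() function below implements this.

We are intentionally using simple Python lists and an imperative programming style instead of NumPy arrays or list comprehensions to make the code more readable for Python beginners. Feel free to optimize it.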

# linear regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	return result
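
As a quick confidence check, the sketch below evaluates predict_row() on a small row whose answer can be worked out by hand, and compares it with a compact equivalent. The row, the coefficient values, and the predict_row_compact() name are hypothetical and used only for illustration.

```python
# linear regression prediction for one row (same logic as predict_row() above)
def predict_row(row, coefficients):
    # start from the bias, the last coefficient
    result = coefficients[-1]
    # add each input multiplied by its coefficient
    for i in range(len(row)):
        result += coefficients[i] * row[i]
    return result

# compact equivalent: zip() stops at the end of the shorter sequence (the row),
# so the trailing bias coefficient is never multiplied by an input
def predict_row_compact(row, coefficients):
    return coefficients[-1] + sum(c * x for c, x in zip(coefficients, row))

# hypothetical row with two inputs and coefficients [w0, w1, bias]
row = [1.0, 2.0]
coefficients = [0.5, -1.0, 3.0]
# by hand: 3.0 + 0.5 * 1.0 + (-1.0) * 2.0 = 1.5
print(predict_row(row, coefficients))
print(predict_row_compact(row, coefficients))
```

Both calls should agree with the hand calculation, which confirms that the last coefficient is treated as the intercept and not as a weight.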

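Next, we can call the `predict_row()` function for each row in a given dataset. The `predict_dataset()` function below implements this.

Again, we are intentionally using a simple imperative coding style for readability instead of list comprehensions.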

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

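Finally, we can use the model to make predictions on our synthetic dataset to confirm it is all working correctly.

We can generate a random set of model coefficients using the rand() function.

Recall that we need one coefficient for each input (10 inputs in this dataset) plus an extra weight for the y-intercept coefficient.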

...
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)

We can then use these coefficients with the dataset to make predictions.

...
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)

We can evaluate the mean squared error of these predictions.

...
# calculate model prediction error
score = mean_squared_error(y, yhat)
print('MSE: %f' % score)
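
For readers who want to see what mean_squared_error() computes, the following is a minimal pure-Python sketch of the same quantity: the average of the squared differences between the expected and predicted values. The helper name mean_squared_error_manual() and the toy values are assumptions for illustration only.

```python
# mean squared error computed by hand (illustrative stand-in for the library call)
def mean_squared_error_manual(y_true, y_pred):
    total = 0.0
    # accumulate the squared difference for each pair of values
    for actual, predicted in zip(y_true, y_pred):
        total += (actual - predicted) ** 2
    # average over the number of examples
    return total / len(y_true)

# hypothetical expected and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
# squared errors: 0.25, 0.25, 0.0, 1.0, so the mean is 1.5 / 4 = 0.375
print('MSE: %f' % mean_squared_error_manual(y_true, y_pred))
```

Because the errors are squared, a few large misses dominate the score, which is why random coefficients can produce a very large value.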

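That's it.

We can tie all of this together and demonstrate our linear regression model for regression predictive modeling. The complete example is listed below.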

# linear regression model
from numpy.random import rand
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# linear regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	return result

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# calculate model prediction error
score = mean_squared_error(y, yhat)
print('MSE: %f' % score)

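Running the example generates a prediction for each example in the training dataset, then prints the mean squared error for the predictions.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We would expect a large error given a set of random weights, and that is what we see in this case, with a mean squared error of about 7,307.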

MSE: 7307.756740

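We can now optimize the coefficients of the model to achieve low error on this dataset.

First, we need to split the dataset into train and test sets. It is important to hold back some data not used in optimizing the model so that we can prepare a reasonable estimate of the performance of the model when used to make predictions on new data.

We will use 67 percent of the data for training and the remaining 33 percent as a test set for evaluating the performance of the model.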

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
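
To show what a shuffled hold-out split does, here is a rough pure-Python sketch. It is not how train_test_split() is implemented; the split_rows() helper, its seed argument, and the toy data are assumptions made for illustration.

```python
from random import Random

# split parallel lists of rows and targets into train and test portions
# (illustrative sketch only, not scikit-learn's implementation)
def split_rows(X, y, test_size=0.33, seed=1):
    # shuffle row indices with a seeded generator so the split is repeatable
    indices = list(range(len(X)))
    Random(seed).shuffle(indices)
    # the first n_test shuffled indices become the test set
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# hypothetical toy data: 100 rows with one input each, target = 2 * input
X = [[float(i)] for i in range(100)]
y = [2.0 * i for i in range(100)]
X_train, X_test, y_train, y_test = split_rows(X, y)
print(len(X_train), len(X_test))
```

The key property is that each row stays paired with its target after shuffling. With scikit-learn, passing a fixed random_state to train_test_split() makes the split repeatable in the same spirit.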

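Next, we can develop a stochastic hill climbing algorithm.

The optimization algorithm requires an objective function to optimize. It must take a set of coefficients and return a score that is to be minimized or maximized, corresponding to a better model.

In this case, we will evaluate the mean squared error of the model with a given set of coefficients and return the error score, which must be minimized.

The objective() function below implements this, given the dataset and a set of coefficients, and returns the error of the model.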

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# calculate mean squared error
	score = mean_squared_error(y, yhat)
	return score

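Next, we can define the stochastic hill climbing algorithm.

The algorithm requires an initial solution (e.g. random coefficients) and iteratively makes small changes to the solution, checking whether each change results in a better-performing model. The amount of change made to the current solution is controlled by a step_size hyperparameter. This process continues for a fixed number of iterations, also provided as a hyperparameter.

The hillclimbing() function below implements this, taking the dataset, objective function, initial solution, and hyperparameters as arguments, and returns the best set of coefficients found and the estimated performance.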

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval <= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]
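
The accept-if-not-worse loop can be easier to follow in isolation. The sketch below runs the same idea on a toy two-dimensional objective with a known minimum, using only the standard library. The toy_objective() and toy_hillclimbing() names, the target point, and the seed are assumptions for illustration and are not part of the tutorial's code.

```python
from random import Random

# toy objective: squared distance to a known target (minimum of 0.0 at the target)
def toy_objective(point):
    target = [1.0, -2.0]
    return sum((p - t) ** 2 for p, t in zip(point, target))

# stochastic hill climbing that also records the best score after every iteration
def toy_hillclimbing(objective, solution, n_iter, step_size, rng):
    solution_eval = objective(solution)
    history = [solution_eval]
    for _ in range(n_iter):
        # perturb every coordinate with Gaussian noise scaled by step_size
        candidate = [v + rng.gauss(0.0, 1.0) * step_size for v in solution]
        candidate_eval = objective(candidate)
        # keep the candidate only if it is at least as good as the current point
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
        history.append(solution_eval)
    return solution, solution_eval, history

# seeded generator so repeated runs give the same result
rng = Random(1)
best, best_eval, history = toy_hillclimbing(toy_objective, [0.0, 0.0], 2000, 0.15, rng)
print('best=%s eval=%.5f' % (best, best_eval))
```

Because a worse candidate is never accepted, the recorded score can only stay the same or decrease. That makes the method simple, but it also means it can stall once most random steps make things worse.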

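We can then call this function, passing in an initial set of coefficients as the initial solution and the training dataset as the dataset to optimize the model against.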

...
# define the total iterations
n_iter = 2000
# define the step size (scale of the Gaussian perturbation)
step_size = 0.15
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train MSE: %f' % (score))

Finally, we can evaluate the best model on the test dataset and report the performance.

...
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# calculate mean squared error
score = mean_squared_error(y_test, yhat)
print('Test MSE: %f' % (score))

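Tying this together, the complete example of optimizing the coefficients of a linear regression model on the synthetic regression dataset is listed below.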

# optimize linear regression coefficients for regression dataset
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# linear regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	return result

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# calculate mean squared error
	score = mean_squared_error(y, yhat)
	return score

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval <= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 2000
# define the step size (scale of the Gaussian perturbation)
step_size = 0.15
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train MSE: %f' % (score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# calculate mean squared error
score = mean_squared_error(y_test, yhat)
print('Test MSE: %f' % (score))

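Running the example will report the iteration number and mean squared error each time there is an improvement made to the model.

At the end of the search, the performance of the best set of coefficients on the training dataset is reported, and the performance of the same model on the test dataset is calculated and reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optimization algorithm found a set of coefficients that achieved an error of about 0.08 on both the train and test datasets.

The fact that the algorithm found a model with very similar performance on the train and test datasets is a good sign, showing that the model did not overfit (over-optimize) the training dataset. This means the model generalizes well to new data.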

...
>1546 0.35426
>1567 0.32863
>1572 0.32322
>1619 0.24890
>1665 0.24800
>1691 0.24162
>1715 0.15893
>1809 0.15337
>1892 0.14656
>1956 0.08042
Done!
Coefficients: [ 1.30559829e-02 -2.58299382e-04  3.33118191e+00  3.20418534e-02
  1.36497902e-01  8.65445367e+01  2.78356715e-02 -8.50901499e-02
  8.90078243e-02  6.15779867e-02 -3.85657793e-02]
Train MSE: 0.080415
Test MSE: 0.080779

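Now that we are familiar with how to manually optimize the coefficients of a linear regression model, let's look at how we can extend the example to optimize the coefficients of a logistic regression model for classification.

Optimize a Logistic Regression Model

A logistic regression model is an extension of linear regression for classification predictive modeling.

Logistic regression is for binary classification tasks, meaning datasets that have two class labels, class=0 and class=1.

The output first involves calculating the weighted sum of the inputs, then passing this weighted sum through a logistic function, also called a sigmoid function. The result is a probability between 0 and 1 that the example belongs to class=1.

In this section, we will build on what we learned in the previous section to optimize the coefficients of regression models for classification. We will develop the model and test it with random coefficients, then use stochastic hill climbing to optimize the model coefficients.

First, let's define a synthetic binary classification problem that we can use as the focus of optimizing the model.

We can use the make_classification() function to define a binary classification problem with 1,000 rows and five input variables.

The example below creates the dataset and summarizes the shape of the data.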

# define a binary classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example prints the shape of the created dataset, confirming our expectations.

(1000, 5) (1000,)

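Next, we need to define a logistic regression model.

Let's start by updating the predict_row() function to pass the weighted sum of the input and coefficients through a logistic function.

The logistic function is defined as: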

logistic = 1.0 / (1.0 + exp(-result))

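Where result is the weighted sum of the inputs and the coefficients, and exp() is e (Euler's number) raised to the power of the provided value, implemented via the exp() function.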
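
A few values help build intuition for the formula: a weighted sum of zero maps to exactly 0.5, large positive sums approach 1, and large negative sums approach 0. The sketch below also shows one common way to avoid an overflow in exp() for very negative inputs. The stable_logistic() helper is an assumption added for illustration; the tutorial itself uses the direct formula.

```python
from math import exp

# direct translation of the formula above
def logistic(z):
    return 1.0 / (1.0 + exp(-z))

# algebraically equivalent form that never calls exp() on a large positive number
# (illustrative helper, not part of the tutorial's code)
def stable_logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

# a weighted sum of zero gives a probability of exactly one half
print(logistic(0.0))
# moderate inputs give the same answer from both versions
print(logistic(2.0), stable_logistic(2.0))
# extreme inputs are handled by the stable version
print(stable_logistic(-1000.0), stable_logistic(1000.0))
# the direct formula raises OverflowError for a very negative input
try:
    logistic(-1000.0)
    overflowed = False
except OverflowError:
    overflowed = True
print('direct formula overflowed:', overflowed)
```

With the small step sizes used in this tutorial the weighted sums are unlikely to reach such extremes, so the direct formula is adequate here.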

The updated predict_row() function is listed below.

# logistic regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	# logistic function
	logistic = 1.0 / (1.0 + exp(-result))
	return logistic

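That's it in terms of changes from linear regression to logistic regression.

As with linear regression, we can test the model with a set of random model coefficients.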

...
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)

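The predictions made by the model are probabilities for an example belonging to class=1.

We can round the predictions to the integer values 0 and 1 to obtain the expected class labels.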

...
# round predictions to labels
yhat = [round(y) for y in yhat]
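
One detail worth knowing is that Python 3's built-in round() rounds exact halves to the nearest even integer, so a probability of exactly 0.5 becomes class 0. The sketch below compares round() with an explicit threshold; the probability values are hypothetical and chosen only to show the difference.

```python
# hypothetical predicted probabilities, including one exactly at the boundary
probabilities = [0.1, 0.5, 0.5000001, 0.9]

# built-in rounding: an exact 0.5 is rounded to the even integer 0
rounded = [round(p) for p in probabilities]

# explicit threshold: an exact 0.5 is assigned to class 1
thresholded = [1 if p >= 0.5 else 0 for p in probabilities]

print(rounded)
print(thresholded)
```

The two approaches differ only for probabilities of exactly 0.5, which is rare with continuous inputs, so rounding is a reasonable shortcut in this tutorial.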

We can evaluate the classification accuracy of these predictions.

...
# calculate accuracy
score = accuracy_score(y, yhat)
print('Accuracy: %f' % score)
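
Classification accuracy is simply the fraction of predicted labels that match the expected labels. As a minimal sketch of what accuracy_score() reports, the helper below counts the matches by hand; the accuracy_manual() name and the toy labels are assumptions for illustration.

```python
# classification accuracy computed by hand (illustrative stand-in for the library call)
def accuracy_manual(y_true, y_pred):
    correct = 0
    # count the positions where the predicted label matches the expected label
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            correct += 1
    # fraction of correct predictions
    return correct / len(y_true)

# hypothetical labels: four of the five predictions are correct
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print('Accuracy: %f' % accuracy_manual(y_true, y_pred))
```

Note that accuracy only changes when a prediction crosses the 0.5 boundary, so many different coefficient sets can share exactly the same score.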

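That's it.

We can tie all of this together and demonstrate our simple logistic regression model for binary classification. The complete example is listed below.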

# logistic regression function for binary classification
from math import exp
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# logistic regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	# logistic function
	logistic = 1.0 / (1.0 + exp(-result))
	return logistic

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
print('Accuracy: %f' % score)

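Running the example generates a prediction for each example in the training dataset, then prints the classification accuracy for the predictions.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Given a set of random weights and a dataset with an equal number of examples in each class, we would expect about 50 percent accuracy, and that is approximately what we see in this case.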

Accuracy: 0.540000

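We can now optimize the weights of the model to achieve good accuracy on this dataset.

The stochastic hill climbing algorithm used for linear regression can be used again for logistic regression.

The important difference is an update to the objective() function to round the predictions and evaluate the model using classification accuracy instead of mean squared error.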

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# round predictions to labels
	yhat = [round(y) for y in yhat]
	# calculate accuracy
	score = accuracy_score(y, yhat)
	return score

The hillclimbing() function must also be updated to maximize the score of solutions, rather than minimize it as we did for linear regression.

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]
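
Flipping the comparison is one way to switch from minimizing to maximizing. An alternative, sketched below on a toy one-dimensional problem, is to leave a minimizing hill climber unchanged and negate the score instead. The hillclimb_minimize() and score() names, the peak location, and the seed are assumptions for illustration and are not part of the tutorial's code.

```python
from random import Random

# generic minimizing hill climber over a single number (toy setting)
def hillclimb_minimize(objective, solution, n_iter, step_size, rng):
    solution_eval = objective(solution)
    for _ in range(n_iter):
        # take a Gaussian step scaled by step_size
        candidate = solution + rng.gauss(0.0, 1.0) * step_size
        candidate_eval = objective(candidate)
        # keep the candidate if it is no worse (lower or equal score)
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
    return solution, solution_eval

# hypothetical score to maximize, with its peak of 1.0 at x = 3.0
def score(x):
    return 1.0 - (x - 3.0) ** 2

# maximizing score(x) is the same as minimizing -score(x)
best_x, best_neg = hillclimb_minimize(lambda x: -score(x), 0.0, 2000, 0.1, Random(1))
best_score = -best_neg
print('x=%.3f score=%.5f' % (best_x, best_score))
```

Either approach gives the same search behavior. Negating the score keeps a single copy of the search loop, while flipping the comparison, as done in this tutorial, keeps the reported scores directly readable as accuracy.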

Finally, the coefficients found by the search can be evaluated using classification accuracy at the end of the run.

...
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %f' % (score))

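Tying this all together, the complete example of using stochastic hill climbing to maximize the classification accuracy of a logistic regression model is listed below.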

# optimize logistic regression model with a stochastic hill climber
from math import exp
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# logistic regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	# logistic function
	logistic = 1.0 / (1.0 + exp(-result))
	return logistic

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# round predictions to labels
	yhat = [round(y) for y in yhat]
	# calculate accuracy
	score = accuracy_score(y, yhat)
	return score

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 2000
# define the step size (scale of the Gaussian perturbation)
step_size = 0.1
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train Accuracy: %f' % (score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %f' % (score))

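Running the example will report the iteration number and classification accuracy each time there is an improvement made to the model.

At the end of the search, the performance of the best set of coefficients on the training dataset is reported, and the performance of the same model on the test dataset is calculated and reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optimization algorithm found a set of weights that achieved about 87.3 percent accuracy on the training dataset and about 83.9 percent accuracy on the test dataset.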

...
>200 0.85672
>225 0.85672
>230 0.85672
>245 0.86418
>281 0.86418
>285 0.86716
>294 0.86716
>306 0.86716
>316 0.86716
>317 0.86716
>320 0.86866
>348 0.86866
>362 0.87313
>784 0.87313
>1649 0.87313
Done!
Coefficients: [-0.04652756  0.23243427  2.58587637 -0.45528253 -0.4954355  -0.42658053]
Train Accuracy: 0.873134
Test Accuracy: 0.839394

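Summary

In this tutorial, you discovered how to manually optimize the coefficients of regression models.

Specifically, you learned:

How to develop the inference models for regression from scratch.
How to optimize the coefficients of a linear regression model for predicting numeric values.
How to optimize the coefficients of a logistic regression model using stochastic hill climbing.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.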
