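Gradient Descent With AdaGrad From Scratch

By Jason Brownlee on October 12, 2021 in Optimization

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that it uses the same step size (learning rate) for each input variable. This can be a problem on objective functions that have different amounts of curvature in different dimensions, which in turn may require a different sized step to a new point.

Adaptive Gradients, or **AdaGrad** for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension to be adapted automatically based on the gradients (partial derivatives) seen for that variable over the course of the search.

In this tutorial, you will discover how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

After completing this tutorial, you will know:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable of the objective function, called adaptive gradients or AdaGrad.
How to implement the AdaGrad optimization algorithm from scratch, apply it to an objective function, and evaluate the results.

Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Gradient Descent With AdaGrad From Scratch
Photo by Maurits Verbiest, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Gradient Descent
Adaptive Gradient (AdaGrad)
Gradient Descent With AdaGrad
1. Two-Dimensional Test Problem
2. Gradient Descent Optimization With AdaGrad
3. Visualization of AdaGrad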

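Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the objective function.

"First-order methods rely on gradient information to help direct the search for a minimum …"

- Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the "derivative," is the rate of change or slope of the objective function at a specific point, e.g. for a specific input.

If the objective function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate objective function may also be taken as a vector and is referred to generally as the "gradient."

Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the objective function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient "downhill" on the objective function to locate the minimum of the function.

The gradient descent algorithm requires an objective function that is being optimized and the derivative function for the objective function. The objective function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the objective function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the objective function, assuming we are minimizing the objective function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the objective function.

x = x - step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size** (*alpha*): Hyperparameter that controls how far to move in the search space against the gradient in each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.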
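The effect of the step size can be seen in a minimal sketch of plain gradient descent. The one-dimensional function f(x) = x^2, the helper name gradient_descent(), the start point and the three step sizes are all assumptions chosen for illustration; they are not part of the tutorial's test problem.

```python
# minimal sketch: plain gradient descent on the one-dimensional function f(x) = x^2

# objective function (illustrative, one-dimensional)
def objective(x):
    return x ** 2.0

# derivative of the objective function
def derivative(x):
    return 2.0 * x

# hypothetical helper: repeatedly apply x = x - step_size * f'(x)
def gradient_descent(start, step_size, n_iter):
    x = start
    for _ in range(n_iter):
        x = x - step_size * derivative(x)
    return x

# same start point and iteration budget, three different step sizes
for step_size in [0.001, 0.1, 1.1]:
    x_final = gradient_descent(1.0, step_size, 20)
    print('step_size=%.3f x=%.5f f(x)=%.5f' % (step_size, x_final, objective(x_final)))
```

On this particular function each update multiplies x by (1 - 2 * step_size), so a step size of 0.001 barely moves from the start point in 20 iterations, 0.1 shrinks x steadily toward 0.0, and 1.1 overshoots the minimum on every step and moves further away.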

Now that we are familiar with the gradient descent optimization algorithm, let's take a look at AdaGrad.

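Adaptive Gradient (AdaGrad)

The Adaptive Gradient algorithm, or AdaGrad for short, is an extension to the gradient descent optimization algorithm.

The algorithm was described by John Duchi, et al. in their 2011 paper titled "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization."

It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

"The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate."

- Page 307, Deep Learning, 2016.

A problem with the gradient descent algorithm is that the step size (learning rate) is the same for each variable or dimension in the search space. Better performance may be possible with a step size tailored to each variable, allowing smaller movements in dimensions where the gradient is consistently steep and larger movements in dimensions where the gradient is consistently shallow.

AdaGrad is designed specifically to explore the idea of automatically tailoring the step size for each dimension in the search space.

"The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of x"

- Page 77, Algorithms for Optimization, 2019.

This is achieved by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This process is then repeated for each dimension in the search space.

"Adagrad dulls the influence of parameters with consistently high gradients, thereby increasing the influence of parameters with infrequent updates."

- Page 77, Algorithms for Optimization, 2019.

AdaGrad is suited to objective functions where the curvature of the search space differs between dimensions, allowing a more effective optimization given the customization of the step size in each dimension.

The algorithm requires that you set an initial step size for all input variables as per normal, such as 0.1 or 0.001, or similar. The benefit of the algorithm, however, is that it is not as sensitive to the initial learning rate as the gradient descent algorithm.

"Adagrad is far less sensitive to the learning rate parameter alpha. The learning rate parameter is typically set to a default value of 0.01."

- Page 77, Algorithms for Optimization, 2019.

An internal variable is then maintained for each input variable that is the sum of the squared partial derivatives for the input variable observed during the search.

This sum of the squared partial derivatives is then used to calculate the step size for the variable by dividing the initial step size value (e.g. the hyperparameter value specified at the start of the run) by the square root of the sum of the squared partial derivatives.

cust_step_size = step_size / sqrt(s)

It is possible for the square root of the sum of squared partial derivatives to result in a value of 0.0, resulting in a divide by zero error. Therefore, a tiny value can be added to the denominator to avoid this possibility, such as 1e-8.

cust_step_size = step_size / (1e-8 + sqrt(s))

Where *cust_step_size* is the calculated step size for an input variable for a given point during the search, *step_size* is the initial step size, *sqrt()* is the square root operation, and *s* is the sum of the squared partial derivatives for the input variable seen during the search so far.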
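The step size calculation can be sketched in a few lines. The helper name custom_step_size(), the eps argument and the partial derivative values below are assumptions made for illustration.

```python
# minimal sketch: the per-variable step size used by AdaGrad
from math import sqrt

# hypothetical helper implementing cust_step_size = step_size / (eps + sqrt(s))
def custom_step_size(step_size, s, eps=1e-8):
    return step_size / (eps + sqrt(s))

step_size = 0.1
s = 0.0
# made-up partial derivatives seen for one variable over three iterations
for partial in [2.0, 2.0, 2.0]:
    # add the squared partial derivative to the running sum
    s += partial ** 2.0
    print('s=%.1f cust_step_size=%.5f' % (s, custom_step_size(step_size, s)))

# with s = 0.0 the tiny constant prevents a divide by zero error
print(custom_step_size(step_size, 0.0))
```

As the sum s grows, the calculated step size for the variable shrinks. The very large step size returned for s = 0.0 is harmless in the update itself, because s can only be zero when every partial derivative seen so far, including the current one, is zero.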

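The custom step size is then used to calculate the value for the variable in the next point or solution in the search.

x(t+1) = x(t) - cust_step_size * f'(x(t))

This process is then repeated for each input variable until a new point in the search space is created and can be evaluated.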
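A single update across all variables might look like the sketch below. The helper name adagrad_step() and the example point and gradient are hypothetical; the second dimension is simply assumed to be ten times steeper than the first.

```python
# minimal sketch: one AdaGrad update applied to every input variable
from math import sqrt

# hypothetical helper: returns the new point and the updated sums of squared partial derivatives
def adagrad_step(x, grad, s, step_size=0.1, eps=1e-8):
    new_x, new_s = list(), list()
    for x_i, g_i, s_i in zip(x, grad, s):
        # include the current partial derivative in the running sum
        s_i = s_i + g_i ** 2.0
        # calculate the custom step size for this variable
        cust_step_size = step_size / (eps + sqrt(s_i))
        # move this variable against its partial derivative
        new_x.append(x_i - cust_step_size * g_i)
        new_s.append(s_i)
    return new_x, new_s

# made-up point and gradient, second dimension ten times steeper
x = [1.0, 1.0]
grad = [2.0, 20.0]
s = [0.0, 0.0]
x, s = adagrad_step(x, grad, s)
print(x, s)
```

In this sketch the steeper dimension receives a step size ten times smaller (0.005 versus 0.05), so both variables end up moving by about 0.1 on this first update despite the very different gradient magnitudes.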

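Importantly, the partial derivative for the current solution (iteration of the search) is included in the sum of the squared partial derivatives.

We could maintain an array of partial derivatives or squared partial derivatives for each input variable, but this is not necessary. Instead, we simply maintain the sum of the squared partial derivatives and add new values to this sum along the way.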
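The equivalence of the two bookkeeping approaches can be checked with a short sketch. The partial derivative values are made up for illustration.

```python
# minimal sketch: a running sum gives the same result as storing the full history
from math import isclose

# made-up partial derivatives for one input variable over four iterations
partials = [1.2, -0.7, 0.4, -0.1]

history = list()
running = 0.0
for g in partials:
    # option 1: keep every partial derivative and recompute the sum each time
    history.append(g)
    total_from_history = sum(h ** 2.0 for h in history)
    # option 2: keep only a single running sum per variable
    running += g ** 2.0
    print('history=%.2f running=%.2f' % (total_from_history, running))
    assert isclose(total_from_history, running)
```

The running sum needs one number per input variable regardless of how many iterations are run, which is why the implementation later in the tutorial stores only the sums.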

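Now that we are familiar with the AdaGrad algorithm, let's explore how we might implement it and evaluate its performance.

Gradient Descent With AdaGrad

In this section, we will explore how to implement the gradient descent optimization algorithm with adaptive gradients.

Two-Dimensional Test Problem

First, let's define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.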

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

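We can create a three-dimensional plot of the function to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.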

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

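Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

Three-Dimensional Plot of the Test Objective Function

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.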

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

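Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Two-Dimensional Contour Plot of the Test Objective Function

Now that we have a test objective function, let's look at how we might implement the AdaGrad optimization algorithm.

Gradient Descent Optimization With AdaGrad

We can apply gradient descent with adaptive gradients to the test problem.

First, we need a function that calculates the derivative for this function.

f(x) = x^2
f'(x) = x * 2

The derivative of x^2 is x * 2, so the gradient of the objective function has one such partial derivative per dimension: x * 2 and y * 2.

The *derivative()* function below implements this.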

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])
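A hand-written derivative function is easy to get wrong, so it can be worth comparing it against a numerical estimate before using it in a search. The sketch below assumes a hypothetical helper named numerical_gradient() that uses central differences, and an arbitrary test point.

```python
# minimal sketch: check the analytical gradient against a central-difference estimate
from numpy import asarray
from numpy import allclose

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical helper: estimate each partial derivative with central differences
def numerical_gradient(f, x, y, h=1e-6):
    dx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return asarray([dx, dy])

# arbitrary test point inside the bounds of the problem
point = (0.3, -0.8)
analytic = derivative(*point)
numeric = numerical_gradient(objective, *point)
print(analytic, numeric)
print(allclose(analytic, numeric, atol=1e-6))
```

If the two gradients disagree by more than a small tolerance, the analytical derivative is the first place to look for a mistake.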

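Next, we can implement gradient descent with adaptive gradients.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of the dimension.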

...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the sum of the squared partial derivatives for each dimension to 0.0.

...
# list of the sum square gradients for each variable
sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a fixed number of iterations of the search optimization algorithm, defined by the "*n_iter*" hyperparameter.

...
# run the gradient descent
for it in range(n_iter):
	...

The first step is to calculate the gradient for the current solution using the *derivative()* function.

...
# calculate gradient
gradient = derivative(solution[0], solution[1])

We then need to calculate the square of the partial derivative of each variable and add it to the running sum of these values.

...
# update the sum of the squared partial derivatives
for i in range(gradient.shape[0]):
	sq_grad_sums[i] += gradient[i]**2.0

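We can then use the sum of the squared partial derivatives and the gradient to calculate the next point.

We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that moves downhill from the current point, against the gradient in each dimension, using the custom step sizes.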

...
# build a solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
	# calculate the step size for this variable
	alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
	# calculate the new position in this variable
	value = solution[i] - alpha * gradient[i]
	# store this variable
	new_solution.append(value)

This new solution can then be evaluated using the *objective()* function and the performance of the search can be reported.

...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

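And that's it.

We can tie all of this together into a function named *adagrad()* that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and the initial learning rate, and returns the final solution and its evaluation.

The complete function is listed below.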

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

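Note: we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.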
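One possible vectorized adaptation is sketched below. The function name adagrad_vectorized() and the eps argument are assumptions made for illustration; the sketch replaces the per-variable loops with NumPy array operations and omits the progress report, and is otherwise intended to follow the same steps as the function above.

```python
# minimal sketch: a vectorized variant of the adagrad() function using NumPy arrays
from numpy import asarray
from numpy import sqrt
from numpy import zeros
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical vectorized version of the gradient descent algorithm with adagrad
def adagrad_vectorized(objective, derivative, bounds, n_iter, step_size, eps=1e-8):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    solution_eval = objective(solution[0], solution[1])
    # array of the sum square gradients for each variable
    sq_grad_sums = zeros(bounds.shape[0])
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sums of the squared partial derivatives for all variables at once
        sq_grad_sums += gradient**2.0
        # calculate all custom step sizes and the new position in one expression
        solution = solution - step_size / (eps + sqrt(sq_grad_sums)) * gradient
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
    return [solution, solution_eval]

# same seed, bounds and hyperparameters as used in the tutorial
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = adagrad_vectorized(objective, derivative, bounds, 50, 0.1)
print('f(%s) = %f' % (best, score))
```

With the same seed and hyperparameters, the final solution should closely match the one found by the loop-based version, although small differences in numerical precision are possible.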

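We can then define our hyperparameters and call the *adagrad()* function to optimize our test objective function.

In this case, we will use 50 iterations of the algorithm and an initial learning rate of 0.1, both chosen after a little trial and error.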

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with adaptive gradients is listed below.

# gradient descent optimization with adagrad for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))

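Running the example applies the AdaGrad optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.00000 at the five decimal places reported.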

>0 f([-0.06595599  0.34064899]) = 0.12039
>1 f([-0.02902286  0.27948766]) = 0.07896
>2 f([-0.0129815   0.23463749]) = 0.05522
>3 f([-0.00582483  0.1993997 ]) = 0.03979
>4 f([-0.00261527  0.17071256]) = 0.02915
>5 f([-0.00117437  0.14686138]) = 0.02157
>6 f([-0.00052736  0.12676134]) = 0.01607
>7 f([-0.00023681  0.10966762]) = 0.01203
>8 f([-0.00010634  0.09503809]) = 0.00903
>9 f([-4.77542704e-05  8.24607972e-02]) = 0.00680
>10 f([-2.14444463e-05  7.16123835e-02]) = 0.00513
>11 f([-9.62980437e-06  6.22327049e-02]) = 0.00387
>12 f([-4.32434258e-06  5.41085063e-02]) = 0.00293
>13 f([-1.94188148e-06  4.70624414e-02]) = 0.00221
>14 f([-8.72017797e-07  4.09453989e-02]) = 0.00168
>15 f([-3.91586740e-07  3.56309531e-02]) = 0.00127
>16 f([-1.75845235e-07  3.10112252e-02]) = 0.00096
>17 f([-7.89647442e-08  2.69937139e-02]) = 0.00073
>18 f([-3.54597657e-08  2.34988084e-02]) = 0.00055
>19 f([-1.59234984e-08  2.04577993e-02]) = 0.00042
>20 f([-7.15057749e-09  1.78112581e-02]) = 0.00032
>21 f([-3.21102543e-09  1.55077005e-02]) = 0.00024
>22 f([-1.44193729e-09  1.35024688e-02]) = 0.00018
>23 f([-6.47513760e-10  1.17567908e-02]) = 0.00014
>24 f([-2.90771361e-10  1.02369798e-02]) = 0.00010
>25 f([-1.30573263e-10  8.91375193e-03]) = 0.00008
>26 f([-5.86349941e-11  7.76164047e-03]) = 0.00006
>27 f([-2.63305247e-11  6.75849105e-03]) = 0.00005
>28 f([-1.18239380e-11  5.88502652e-03]) = 0.00003
>29 f([-5.30963626e-12  5.12447017e-03]) = 0.00003
>30 f([-2.38433568e-12  4.46221948e-03]) = 0.00002
>31 f([-1.07070548e-12  3.88556303e-03]) = 0.00002
>32 f([-4.80809073e-13  3.38343471e-03]) = 0.00001
>33 f([-2.15911255e-13  2.94620023e-03]) = 0.00001
>34 f([-9.69567190e-14  2.56547145e-03]) = 0.00001
>35 f([-4.35392094e-14  2.23394494e-03]) = 0.00000
>36 f([-1.95516389e-14  1.94526160e-03]) = 0.00000
>37 f([-8.77982370e-15  1.69388439e-03]) = 0.00000
>38 f([-3.94265180e-15  1.47499203e-03]) = 0.00000
>39 f([-1.77048011e-15  1.28438640e-03]) = 0.00000
>40 f([-7.95048604e-16  1.11841198e-03]) = 0.00000
>41 f([-3.57023093e-16  9.73885702e-04]) = 0.00000
>42 f([-1.60324146e-16  8.48035867e-04]) = 0.00000
>43 f([-7.19948720e-17  7.38448972e-04]) = 0.00000
>44 f([-3.23298874e-17  6.43023418e-04]) = 0.00000
>45 f([-1.45180009e-17  5.59929193e-04]) = 0.00000
>46 f([-6.51942732e-18  4.87572776e-04]) = 0.00000
>47 f([-2.92760228e-18  4.24566574e-04]) = 0.00000
>48 f([-1.31466380e-18  3.69702307e-04]) = 0.00000
>49 f([-5.90360555e-19  3.21927835e-04]) = 0.00000
Done!
f([-5.90360555e-19  3.21927835e-04]) = 0.000000
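To see why the first variable converges so much faster than the second in this run, the custom step size of each variable can be tracked across iterations. The sketch below reuses the tutorial's seed, bounds and step size; the list alpha_history and the shortened run of 10 iterations are assumptions made for illustration.

```python
# minimal sketch: track the per-variable step size (alpha) across AdaGrad iterations
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# derivative of the objective function x^2 + y^2
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
step_size = 0.1
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
# hypothetical record of the step sizes: one [alpha_x, alpha_y] entry per iteration
alpha_history = list()
for it in range(10):
    gradient = derivative(solution[0], solution[1])
    alphas = list()
    for i in range(gradient.shape[0]):
        # update the sum of the squared partial derivatives
        sq_grad_sums[i] += gradient[i]**2.0
        # calculate the step size for this variable
        alphas.append(step_size / (1e-8 + sqrt(sq_grad_sums[i])))
    # move each variable using its own step size
    solution = asarray([solution[i] - alphas[i] * gradient[i] for i in range(solution.shape[0])])
    alpha_history.append(alphas)
    print('>%d alpha=%s solution=%s' % (it, alphas, solution))
```

Because the sums of squared partial derivatives can only grow, the step size of each variable never increases from one iteration to the next. In this run, the first variable starts closer to the optima and sees smaller gradients, so it keeps the larger step size of the two throughout.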

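Visualization of AdaGrad

We can plot the progress of the search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *adagrad()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.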

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the learning rate for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

...
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

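Tying this all together, the complete example of performing the AdaGrad optimization on the test problem and plotting the results on a contour plot is listed below.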

# example of plotting the adagrad search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the learning rate for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

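Running the example performs the search as before, except in this case, a contour plot of the objective function is created and a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

Contour Plot of the Test Objective Function With AdaGrad Search Results Shown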

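Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011.

Books

Algorithms for Optimization, 2019.
Deep Learning, 2016.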

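Summary

In this tutorial, you discovered how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

Specifically, you learned:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable of the objective function, called adaptive gradients or AdaGrad.
How to implement the AdaGrad optimization algorithm from scratch, apply it to an objective function, and evaluate the results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.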

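2 Responses to Gradient Descent With AdaGrad From Scratch

Marc November 20, 2023 at 1:30 am #

Thanks for the article.

Regarding the following part
"[…] allowing larger movements in dimensions with consistently steep gradient and smaller movements in dimensions with less steep gradients."

Shouldn't AdaGrad use larger steps in flat directions (where the gradient is small) and smaller steps in steep directions? That way it makes more progress in flat regions, where overshooting the minimum is unlikely, while moving slowly in steep directions avoids accidentally jumping out of the minimum.

- James Carmichael November 20, 2023 at 9:15 am #
  
  Hi Marc… You are very welcome! The following resource will allow you to dive deeper into AdaGrad.
  
  https://optimization.cbe.cornell.edu/index.php?title=AdaGrad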
