Gradient Descent With Adadelta from Scratch

By Jason Brownlee on October 12, 2021 in Optimization

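Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that it uses the same step size (learning rate) for each input variable. AdaGrad and RMSProp are extensions to gradient descent that add a self-adaptive learning rate for each parameter of the objective function.

Adadelta can be considered a further extension of gradient descent that builds upon AdaGrad and RMSProp. It changes the calculation of the custom step size so that the units are consistent and, in turn, an initial learning rate hyperparameter is no longer required.

In this tutorial, you will discover how to develop the gradient descent with Adadelta optimization algorithm from scratch.

After completing this tutorial, you will know:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adadelta.
How to implement the Adadelta optimization algorithm from scratch, apply it to an objective function, and evaluate the results.

Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.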

Gradient Descent With Adadelta from Scratch
Photo by Robert Minkler, some rights reserved.

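Tutorial Overview

This tutorial is divided into three parts; they are:

Gradient Descent
Adadelta Algorithm
Gradient Descent With Adadelta
1. Two-Dimensional Test Problem
2. Gradient Descent Optimization With Adadelta
3. Visualization of Adadelta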

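Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm because it explicitly makes use of the first-order derivative of the objective function.

First-order methods rely on gradient information to help direct the search for a minimum ...

- Page 69, Algorithms for Optimization, 2019.

The "derivative" is the rate of change or slope of the objective function at a specific point, e.g. for a specific input.

If the objective function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate objective function may also be taken as a vector and is referred to generally as the gradient.

Gradient: First-order derivative of a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the objective function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient "downhill" on the objective function to locate the minimum of the function.

The gradient descent algorithm requires an objective function that is being optimized and the derivative function for the objective function. The objective function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the objective function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the objective function, assuming we are minimizing the objective function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the objective function.

x = x - step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size** (*alpha*): Hyperparameter that controls how far to move in the search space against the gradient on each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.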
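
The effect of the step size can be seen with a minimal sketch of plain gradient descent on the one-dimensional function f(x) = x^2. The helper name gradient_descent(), the starting point, and the three step sizes are illustrative assumptions rather than part of the tutorial's code.

```python
# minimal sketch: plain gradient descent on f(x) = x^2 with different step sizes

# objective function
def objective(x):
    return x**2.0

# derivative of objective function
def derivative(x):
    return 2.0 * x

# hypothetical helper: apply x = x - step_size * f'(x) a fixed number of times
def gradient_descent(start, step_size, n_iter):
    x = start
    for _ in range(n_iter):
        x = x - step_size * derivative(x)
    return x

# compare a small, a moderate and an overly large step size (illustrative values)
for step_size in [0.01, 0.1, 1.1]:
    x = gradient_descent(1.0, step_size, 20)
    print('step_size=%.2f x=%.5f f(x)=%.5f' % (step_size, x, objective(x)))
```

With the smallest step size the search is still far from the minimum after 20 iterations, while with the largest step size each update overshoots the minimum and lands farther away than where it started.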

Now that we are familiar with the gradient descent optimization algorithm, let's take a look at Adadelta.

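Adadelta Algorithm

Adadelta (or "ADADELTA") is an extension to the gradient descent optimization algorithm.

The algorithm was described in the 2012 paper by Matthew Zeiler titled "ADADELTA: An Adaptive Learning Rate Method."

Adadelta is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

It is best understood as an extension of the AdaGrad and RMSProp algorithms.

AdaGrad is an extension of gradient descent that calculates a step size (learning rate) for each parameter of the objective function each time an update is made. The step size is calculated by first summing the squared partial derivatives for the parameter seen so far during the search, then dividing the initial step size hyperparameter by the square root of that sum.

The calculation of the custom step size for one parameter with AdaGrad is as follows:

cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t)))

Where cust_step_size(t+1) is the calculated step size for an input variable at a given point during the search, step_size is the initial step size, sqrt() is the square root operation, and s(t) is the sum of the squared partial derivatives for the input variable seen during the search so far (including the current iteration).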
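
As a rough illustration of the AdaGrad formula, the sketch below accumulates the squared partial derivatives for a single variable and reports the resulting custom step size. The function name adagrad_step_size(), the initial step size of 0.1, and the gradient values are made up for the example.

```python
# sketch: AdaGrad custom step size for one variable
from math import sqrt

# hypothetical helper implementing step_size / (1e-8 + sqrt(s(t)))
def adagrad_step_size(step_size, squared_grad_sum):
    return step_size / (1e-8 + sqrt(squared_grad_sum))

# initial step size (illustrative value)
step_size = 0.1
# running sum of squared partial derivatives
s = 0.0
step_sizes = list()
# made-up partial derivatives seen on three successive iterations
for g in [2.0, 1.0, 0.5]:
    s += g**2.0
    step_sizes.append(adagrad_step_size(step_size, s))
    print('gradient=%.1f s=%.2f cust_step_size=%.5f' % (g, s, step_sizes[-1]))
```

Because the sum can only grow, the custom step size can only shrink, which is the continual decay that RMSProp and Adadelta set out to address.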

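RMSProp can be thought of as an extension of AdaGrad in that it uses a decaying average or moving average of the squared partial derivatives instead of the sum in the calculation of the step size for each parameter. This is achieved by adding a new hyperparameter called "rho" that acts like momentum for the partial derivatives.

The calculation of the decaying moving average of the squared partial derivative for one parameter is as follows:

s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))

Where s(t+1) is the mean squared partial derivative for one parameter for the current iteration of the algorithm, s(t) is the decaying moving average of the squared partial derivative from the previous iteration, f'(x(t))^2 is the squared partial derivative for the current parameter, and rho is a hyperparameter, typically with a value of 0.9 like momentum.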
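
The decaying average can be sketched in a few lines. The function name decaying_average() and the sequence of gradients are assumptions chosen to show the behavior, not values from the tutorial.

```python
# sketch: decaying moving average of the squared partial derivative for one variable

# hypothetical helper implementing s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))
def decaying_average(s, gradient, rho=0.9):
    return (s * rho) + (gradient**2.0 * (1.0 - rho))

s = 0.0
history = list()
# made-up partial derivatives: three large gradients followed by two zero gradients
for g in [2.0, 2.0, 2.0, 0.0, 0.0]:
    s = decaying_average(s, g)
    history.append(s)
    print('gradient=%.1f s=%.5f' % (g, s))
```

Unlike the AdaGrad sum, the average rises while gradients are large and falls again once they become small, so the step size is not forced to shrink for the whole search.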

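Adadelta is a further extension of RMSProp designed to improve the convergence of the algorithm and to remove the need for a manually specified initial learning rate.

The idea presented in this paper was derived from ADAGRAD in order to improve upon the two main drawbacks of the method: 1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.

- ADADELTA: An Adaptive Learning Rate Method, 2012.

The decaying moving average of the squared partial derivative is calculated for each parameter, as with RMSProp. The key difference is in the calculation of the step size for a parameter, which uses the decaying average of the squared change (delta) in the parameter as the numerator.

This choice of numerator was made to ensure that both parts of the calculation have the same units.

After independently deriving the RMSProp update, the authors noticed that the units in the update equations for gradient descent, momentum and Adagrad do not match. To fix this, they use an exponentially decaying average of the square updates.

- Pages 78-79, Algorithms for Optimization, 2019.

First, the custom step size is calculated as the square root of the decaying average of the squared change (delta) divided by the square root of the decaying average of the squared partial derivatives.

cust_step_size(t+1) = (ep + sqrt(delta(t))) / (ep + sqrt(s(t)))

Where cust_step_size(t+1) is the custom step size for a parameter for a given update, ep is a hyperparameter that is added to the numerator and denominator to avoid a divide by zero error, delta(t) is the decaying moving average of the squared change to the parameter (calculated in the last iteration), and s(t) is the decaying moving average of the squared partial derivative (calculated in the current iteration).

The ep hyperparameter is set to a small value such as 1e-3 or 1e-8. In addition to avoiding a divide by zero error, it also helps with the first step of the algorithm, when the decaying moving average of the squared change and the decaying moving average of the squared gradient are both zero.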
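
To make the role of ep in the first step concrete, the sketch below evaluates the step size formula on the very first update, when delta is still zero and s has just received its first squared gradient. The helper name first_step_size(), the gradient of 1.0, and the rho of 0.99 are illustrative assumptions.

```python
# sketch: custom step size on the very first Adadelta update
from math import sqrt

# hypothetical helper: delta(0) is zero and s holds only the first squared gradient
def first_step_size(gradient, rho, ep):
    delta = 0.0
    s = (1.0 - rho) * gradient**2.0
    return (ep + sqrt(delta)) / (ep + sqrt(s))

# compare the two example values of ep for the same gradient (illustrative values)
for ep in [1e-3, 1e-8]:
    print('ep=%g first step size=%.10f' % (ep, first_step_size(1.0, 0.99, ep)))
```

With delta at zero the numerator is just ep, so ep effectively sets the scale of the first move. A very small ep gives a very small first step, and the search then takes longer to build up speed.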

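Next, the change to the parameter is calculated as the custom step size multiplied by the partial derivative.

change(t+1) = cust_step_size(t+1) * f'(x(t))

Next, the decaying average of the squared change to the parameter is updated using the "rho" hyperparameter.

delta(t+1) = (delta(t) * rho) + (change(t+1)^2 * (1.0-rho))

Where delta(t+1) is the decaying average of the squared change to the variable to be used in the next iteration, change(t+1) was calculated in the step before, and rho is a hyperparameter that acts like momentum and has a value around 0.9.

Finally, the new value for the variable is calculated using the change.

x(t+1) = x(t) - change(t+1)

This process is then repeated for each variable of the objective function, then the entire process is repeated to navigate the search space for a fixed number of algorithm iterations.

Now that we are familiar with the Adadelta algorithm, let's explore how we might implement it and evaluate its performance.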

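Gradient Descent With Adadelta

In this section, we will explore how to implement the gradient descent optimization algorithm with Adadelta.

Two-Dimensional Test Problem

First, let's define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension, and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.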

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

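We can create a three-dimensional plot of the function to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.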

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
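# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot() is used because newer Matplotlib releases no longer accept figure.gca(projection='3d')
axis = figure.add_subplot(projection='3d')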
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

Three-Dimensional Plot of the Test Objective Function

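We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.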

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

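Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Two-Dimensional Contour Plot of the Test Objective Function

Now that we have a test objective function, let's look at how we might implement the Adadelta optimization algorithm.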

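Gradient Descent Optimization With Adadelta

We can apply gradient descent with Adadelta to the test problem.

First, we need a function that calculates the derivative of the objective function.

f(x) = x^2
f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function below implements this.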

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])
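
One way to gain confidence in a hand-written derivative is to compare it with a finite-difference estimate. The sketch below does this for the test function; the helper numerical_gradient(), the test point, and the perturbation size h are assumptions added for illustration.

```python
# sketch: check the analytic derivative against a central finite difference
from numpy import asarray
from numpy import allclose

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical helper: estimate each partial derivative by perturbing one input at a time
def numerical_gradient(x, y, h=1e-6):
    dx = (objective(x + h, y) - objective(x - h, y)) / (2.0 * h)
    dy = (objective(x, y + h) - objective(x, y - h)) / (2.0 * h)
    return asarray([dx, dy])

# arbitrary test point inside the bounds
point = (0.3, -0.7)
analytic = derivative(*point)
numeric = numerical_gradient(*point)
print('analytic=%s numeric=%s match=%s' % (analytic, numeric, allclose(analytic, numeric, atol=1e-6)))
```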

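Next, we can implement gradient descent optimization.

First, we can select a random point within the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of the dimension.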

...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
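
As a quick sanity check of the initialization, the sketch below generates a starting point and confirms it lies inside the bounds. The variable name inside is an assumption introduced for the check.

```python
# sketch: generate a random starting point and confirm it lies within the bounds
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# seed the pseudo random number generator
seed(1)
# one row per dimension: [minimum, maximum]
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# scale uniform samples in [0, 1) to the range of each dimension
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# check every coordinate against its own minimum and maximum
inside = bool(((solution >= bounds[:, 0]) & (solution <= bounds[:, 1])).all())
print('start=%s inside_bounds=%s' % (solution, inside))
```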

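Next, we need to initialize the decaying average of the squared partial derivatives and of the squared change for each dimension to 0.0 values.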

...
# list of the average square gradients for each variable
sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
# list of the average parameter updates
sq_para_avg = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a fixed number of iterations of the search optimization algorithm, defined by the "*n_iter*" hyperparameter.

...
# run the gradient descent
for it in range(n_iter):
	...

The first step is to calculate the gradient for the current solution using the *derivative()* function.

...
# calculate gradient
gradient = derivative(solution[0], solution[1])

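We then need to calculate the square of the partial derivative and update the decaying moving average of the squared partial derivatives with the "rho" hyperparameter.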

...
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
	# calculate the squared gradient
	sg = gradient[i]**2.0
	# update the moving average of the squared gradient
	sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

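We can then use the decaying moving average of the squared partial derivatives and the gradient to calculate the step size for the next point. We will do this one variable at a time.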

...
# build solution
new_solution = list()
for i in range(solution.shape[0]):
	...

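First, we will calculate the custom step size for this variable on this iteration using the decaying moving average of the squared changes, the decaying moving average of the squared partial derivatives, and the "ep" hyperparameter.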

...
# calculate the step size for this variable
alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))

Next, we can use the custom step size and the partial derivative to calculate the change to the variable.

...
# calculate the change
change = alpha * gradient[i]

We can then use the change and the "rho" hyperparameter to update the decaying moving average of the squared change.

...
# update the moving average of squared parameter changes
sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))

Finally, we can change the variable and store the result before moving on to the next variable.

...
# calculate the new position in this variable
value = solution[i] - change
# store this variable
new_solution.append(value)

This new solution can then be evaluated using the objective() function, and the performance of the search can be reported.

...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

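And that's it.

We can tie all of this together into a function named adadelta() that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and rho, and returns the final solution and its evaluation.

The ep hyperparameter can also be passed as an argument, although it has a sensible default value of 1e-3.

The complete function is listed below.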

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

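Note: we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.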
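
As one possible take on that suggestion, here is a sketch of a vectorized variant that updates all variables at once with NumPy arrays. The name adadelta_vectorized() is an assumption, progress printing is omitted for brevity, and the update rules are intended to mirror the loop-based function above rather than replace it.

```python
# sketch: a vectorized variant of the adadelta() function using NumPy arrays
from numpy import asarray
from numpy import sqrt
from numpy import zeros
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical vectorized version: every variable is updated in one array operation
def adadelta_vectorized(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # decaying averages of the squared gradients and squared changes
    sq_grad_avg = zeros(bounds.shape[0])
    sq_para_avg = zeros(bounds.shape[0])
    for _ in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the decaying average of the squared partial derivatives
        sq_grad_avg = (sq_grad_avg * rho) + (gradient**2.0 * (1.0 - rho))
        # calculate the custom step size for every variable
        alpha = (ep + sqrt(sq_para_avg)) / (ep + sqrt(sq_grad_avg))
        # calculate the change and update the decaying average of the squared changes
        change = alpha * gradient
        sq_para_avg = (sq_para_avg * rho) + (change**2.0 * (1.0 - rho))
        # move to the new point
        solution = solution - change
    return [solution, objective(solution[0], solution[1])]

# run the search with the same settings used in this tutorial
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = adadelta_vectorized(objective, derivative, bounds, 120, 0.99)
print('f(%s) = %f' % (best, score))
```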

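We can then define our hyperparameters and call the adadelta() function to optimize our test objective function.

In this case, we will use 120 iterations of the algorithm and a value of 0.99 for the rho hyperparameter, chosen after a little trial and error.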

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with Adadelta is listed below.

# gradient descent optimization with adadelta for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('Done!')
print('f(%s) = %f' % (best, score))

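Running the example applies the Adadelta optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 105 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0 when rounded to five decimal places.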

...
>100 f([-1.45142626e-07 2.71163181e-03]) = 0.00001
>101 f([-1.24898699e-07 2.56875692e-03]) = 0.00001
>102 f([-1.07454197e-07 2.43328237e-03]) = 0.00001
>103 f([-9.24253035e-08 2.30483111e-03]) = 0.00001
>104 f([-7.94803792e-08 2.18304501e-03]) = 0.00000
>105 f([-6.83329263e-08 2.06758392e-03]) = 0.00000
>106 f([-5.87354975e-08 1.95812477e-03]) = 0.00000
>107 f([-5.04744185e-08 1.85436071e-03]) = 0.00000
>108 f([-4.33652179e-08 1.75600036e-03]) = 0.00000
>109 f([-3.72486699e-08 1.66276699e-03]) = 0.00000
>110 f([-3.19873691e-08 1.57439783e-03]) = 0.00000
>111 f([-2.74627662e-08 1.49064334e-03]) = 0.00000
>112 f([-2.3572602e-08 1.4112666e-03]) = 0.00000
>113 f([-2.02286891e-08 1.33604264e-03]) = 0.00000
>114 f([-1.73549914e-08 1.26475787e-03]) = 0.00000
>115 f([-1.48859650e-08 1.19720951e-03]) = 0.00000
>116 f([-1.27651224e-08 1.13320504e-03]) = 0.00000
>117 f([-1.09437923e-08 1.07256172e-03]) = 0.00000
>118 f([-9.38004754e-09 1.01510604e-03]) = 0.00000
>119 f([-8.03777865e-09 9.60673346e-04]) = 0.00000
Done!
f([-8.03777865e-09 9.60673346e-04]) = 0.000001
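
The value of rho was chosen by trial and error. A sketch of what such a comparison might look like is given below; the helper adadelta_quiet() is a hypothetical, print-free restatement of the same update rules, and the candidate rho values are arbitrary.

```python
# sketch: compare a few rho values on the test problem from the same starting point
from numpy import asarray
from numpy import sqrt
from numpy import zeros
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical helper: same update rules as adadelta(), returning only the final score
def adadelta_quiet(bounds, n_iter, rho, ep=1e-3):
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    sq_grad_avg = zeros(len(bounds))
    sq_para_avg = zeros(len(bounds))
    for _ in range(n_iter):
        gradient = derivative(solution[0], solution[1])
        sq_grad_avg = (sq_grad_avg * rho) + (gradient**2.0 * (1.0 - rho))
        change = (ep + sqrt(sq_para_avg)) / (ep + sqrt(sq_grad_avg)) * gradient
        sq_para_avg = (sq_para_avg * rho) + (change**2.0 * (1.0 - rho))
        solution = solution - change
    return objective(solution[0], solution[1])

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
scores = dict()
# candidate rho values (illustrative)
for rho in [0.9, 0.95, 0.99]:
    # reseed so that every run starts from the same point
    seed(1)
    scores[rho] = adadelta_quiet(bounds, 120, rho)
    print('rho=%.2f f=%.7f' % (rho, scores[rho]))
```

Reseeding before each run keeps the starting point fixed, so any difference in the final score comes from rho alone.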

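Visualization of Adadelta

We can plot the progress of the Adadelta search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the adadelta() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.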

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

...
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

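Tying all of this together, the complete example of performing the Adadelta optimization on the test problem and plotting the results on a contour plot is listed below.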

# example of plotting the adadelta search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

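Running the example performs the search as before, except in this case, a contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

Contour Plot of the Test Objective Function With Adadelta Search Results Shown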

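Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

ADADELTA: An Adaptive Learning Rate Method, 2012.

Books

Algorithms for Optimization, 2019.
Deep Learning, 2016.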

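Summary

In this tutorial, you discovered how to develop the gradient descent with Adadelta optimization algorithm from scratch.

Specifically, you learned:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adadelta.
How to implement the Adadelta optimization algorithm from scratch, apply it to an objective function, and evaluate the results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.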
