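How to Develop an Encoder-Decoder Model with Attention in Keras

By Jason Brownlee on August 27, 2020 in Long Short-Term Memory Networks

The encoder-decoder architecture for recurrent neural networks is proving to be powerful on many sequence-to-sequence prediction problems in natural language processing, such as machine translation and caption generation.

Attention is a mechanism that addresses a limitation of the encoder-decoder architecture on long sequences. In general, it also speeds up learning and lifts the skill of the model on sequence-to-sequence prediction problems.

In this tutorial, you will discover how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.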

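After completing this tutorial, you will know:

How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
How to robustly compare the performance of encoder-decoder networks with and without attention.

Kick-start your project with my new book Long Short-Term Memory Networks With Python, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Note (May 2020): The underlying APIs have changed and this tutorial may no longer be current. You may require older versions of Keras and TensorFlow, e.g. Keras 2 and TF 1.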

How to Develop an Encoder-Decoder Model with Attention for Sequence-to-Sequence Prediction in Keras
Photo by Angela and Andrew, some rights reserved.

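Tutorial Overview

This tutorial is divided into 6 parts; they are:

Encoder-Decoder with Attention
Test Problem for Attention
Encoder-Decoder Without Attention
Custom Keras Attention Layer
Encoder-Decoder with Attention
Comparison of Models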

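Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda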

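Encoder-Decoder with Attention

The encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems.

As its name suggests, it is comprised of two sub-models:

Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed-length vector called a context vector.
Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.

A problem with the architecture is that performance is poor on long input or output sequences. The reason is believed to be the fixed-size internal representation used by the encoder.

Attention is an extension to the architecture that addresses this limitation. It first provides a richer context from the encoder to the decoder. It also provides a learning mechanism through which the decoder can learn where to pay attention in that richer encoding when predicting each time step of the output sequence.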
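
The idea of "where to pay attention" can be sketched in plain Python. This is only an illustrative sketch, not the layer used later in this tutorial: the encoder outputs and alignment scores are made-up values, and softmax() and context_vector() are hypothetical helper names. The scores are normalized into weights that sum to 1, and the context for one output time step is the weighted sum of the encoder outputs.

```python
from math import exp

# softmax: turn raw alignment scores into weights that sum to 1
def softmax(scores):
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# weighted sum of encoder outputs, one weight per input time step
def context_vector(encoder_outputs, weights):
    n_dims = len(encoder_outputs[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_outputs)) for d in range(n_dims)]

# hypothetical encoder outputs for 3 input time steps, 2 dimensions each
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# hypothetical alignment scores for one output time step
scores = [2.0, 0.5, 0.1]
weights = softmax(scores)
context = context_vector(encoder_outputs, weights)
print(weights)
print(context)
```

The input time step with the largest score receives the largest weight, so it contributes most to the context for that output step. A different set of scores is computed for each output time step.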

For more on attention in the encoder-decoder architecture, see the following post:

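Test Problem for Attention

Before we develop models with attention, we will first define a contrived, scalable test problem that we can use to determine whether attention provides any benefit.

In this problem, we will generate sequences of random integers as input and matching output sequences comprised of a subset of the integers in the input sequence.

For example, an input sequence might be [1, 6, 2, 7, 3] and the expected output sequence might be the first two random integers in the sequence, [1, 6].

We will define the problem such that the input and output sequences are the same length, padding the output sequences with "0" values as needed.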
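
The padding rule described above can be sketched for the example sequence as follows. The helper name make_target() is hypothetical and used only for illustration; it is not part of the tutorial code.

```python
# build the padded target for an input sequence (hypothetical helper name)
def make_target(sequence_in, n_out):
    if n_out > len(sequence_in):
        raise ValueError('n_out cannot exceed the input length')
    # keep the first n_out values, then pad with zeros to the input length
    return sequence_in[:n_out] + [0] * (len(sequence_in) - n_out)

sequence_in = [1, 6, 2, 7, 3]
sequence_out = make_target(sequence_in, 2)
print(sequence_out)
```

For this input the padded target is [1, 6, 0, 0, 0], the same length as the input.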

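First, we need a function to generate sequences of random integers. We will use the Python randint() function to generate random integers between 0 and a maximum value, and use this range as the cardinality of the problem (e.g. the number of features or an axis of difficulty).

The generate_sequence() function below will generate a random sequence of integers of a fixed length and with the specified cardinality.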

from random import randint

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)


Running this example generates a sequence of 5 time steps, where each value in the sequence is a random integer between 0 and 49.

[43, 3, 28, 34, 33]


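Next, we need a function to one hot encode the discrete integer values into binary vectors.

If a cardinality of 50 is used, then each integer will be represented by a 50-element vector of 0 values with a 1 at the index of the specified integer value.

The one_hot_encode() function below will one hot encode a given sequence of integers.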

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)


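We also need to be able to decode an encoded sequence. This is needed to turn a prediction from the model, or an encoded expected sequence, back into a sequence of integers that we can read and evaluate.

The one_hot_decode() function below will decode a one hot encoded sequence back into a sequence of integers.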

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]


We can test out these operations in the example below.

from random import randint
from numpy import array
from numpy import argmax

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)
# one hot encode
encoded = one_hot_encode(sequence, 50)
print(encoded)
# decode
decoded = one_hot_decode(encoded)
print(decoded)


Running the example first prints a randomly generated sequence, then the one hot encoded version, and finally the decoded sequence again.

[3, 18, 32, 11, 36]
[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[3, 18, 32, 11, 36]


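Finally, we need a function that can create input and output pairs of sequences to train and evaluate models.

The function below, named get_pair(), will return one input and output sequence pair given a specified input length, output length, and cardinality. Both sequences have the same length, the length of the input sequence, but the output sequence is taken as the first n elements of the input sequence and padded with zero values to the required length.

The sequences of integers are then one hot encoded and reshaped into the 3D format required by recurrent neural networks, with the dimensions: samples, time steps, and features. In this case, samples is always 1 because we only generate one input-output pair, time steps is the input sequence length, and features is the cardinality of each time step.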

# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
	# generate random sequence
	sequence_in = generate_sequence(n_in, n_unique)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, n_unique)
	y = one_hot_encode(sequence_out, n_unique)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y
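
The function above always returns a single sample. As a hedged sketch of how the samples dimension could grow, the example below stacks several pairs into one batch using plain nested lists. The helpers get_batch(), one_hot(), and shape() are hypothetical names that are not part of the tutorial code, and the tutorial itself trains on one pair at a time.

```python
from random import randint

# generate a sequence of random integers
def generate_sequence(length, n_unique):
    return [randint(0, n_unique - 1) for _ in range(length)]

# one hot encode a sequence as nested lists
def one_hot(sequence, n_unique):
    return [[1 if i == value else 0 for i in range(n_unique)] for value in sequence]

# hypothetical helper: build n_samples pairs as [samples][time steps][features]
def get_batch(n_samples, n_in, n_out, n_unique):
    X, y = list(), list()
    for _ in range(n_samples):
        sequence_in = generate_sequence(n_in, n_unique)
        sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in - n_out)]
        X.append(one_hot(sequence_in, n_unique))
        y.append(one_hot(sequence_out, n_unique))
    return X, y

# report the 3D shape of a nested list
def shape(nested):
    return (len(nested), len(nested[0]), len(nested[0][0]))

X, y = get_batch(4, 5, 2, 50)
print(shape(X), shape(y))
```

With 4 samples, 5 time steps, and a cardinality of 50, both nested lists have the shape (4, 5, 50). Wrapping them with NumPy's array() would give arrays in the same samples, time steps, features layout.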


We can put all of this together and demonstrate the data preparation code.

from random import randint
from numpy import array
from numpy import argmax

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
	# generate random sequence
	sequence_in = generate_sequence(n_in, n_unique)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, n_unique)
	y = one_hot_encode(sequence_out, n_unique)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# generate random sequence
X, y = get_pair(5, 2, 50)
print(X.shape, y.shape)
print('X=%s, y=%s' % (one_hot_decode(X[0]), one_hot_decode(y[0])))


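Running the example generates a single input-output pair and prints the shape of both arrays.

The generated pair is then printed in decoded form, where we can see that the first two integers of the sequence are reproduced in the output sequence, followed by a padding of zero values.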

(1, 5, 50) (1, 5, 50)
X=[12, 20, 36, 40, 12], y=[12, 20, 0, 0, 0]


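Encoder-Decoder Without Attention

In this section, we will develop an encoder-decoder model without attention as a baseline of performance on the problem.

We will fix the problem definition at input and output sequences of 5 time steps, the first 2 elements of the input sequence as the output sequence, and a cardinality of 50.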

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2


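We can develop a simple encoder-decoder model in Keras by taking the output of an encoder LSTM model, repeating it n times to match the number of time steps in the output sequence, and then using a decoder to predict the output sequence.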
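
The "repeat it n times" step can be pictured with a small plain-Python sketch. It only illustrates the shape change and does not call Keras; repeat_vector() is a hypothetical helper and the 3-element vector stands in for the encoder output.

```python
# repeat one fixed-length vector n times to form a sequence
def repeat_vector(vector, n):
    return [list(vector) for _ in range(n)]

# hypothetical 3-element encoder output, used only for illustration
encoded = [0.2, -0.1, 0.7]
decoder_input = repeat_vector(encoded, 5)
print(len(decoder_input), len(decoder_input[0]))
```

Every time step of the decoder input is the same copy of the encoded vector, so the decoder sees one fixed summary of the input at every output step. This is the limitation that attention is meant to relax.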

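For more detail on how to define an encoder-decoder architecture in Keras, see the post:

Encoder-Decoder Long Short-Term Memory Networks

We will configure the encoder and decoder with the same number of units, in this case 150. We will use the efficient Adam implementation of gradient descent and optimize the categorical cross entropy loss function, given that the problem is technically a multi-class classification problem.

The configuration for the model was found after a little trial and error and is by no means optimized.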
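
To make the loss concrete, here is a minimal plain-Python sketch of categorical cross entropy averaged over the time steps of one sequence. The eps guard against log(0) and the example probabilities are assumptions made for illustration; this does not reproduce the exact Keras implementation.

```python
from math import log

# cross entropy for one time step: minus the log probability of the true class
def step_cross_entropy(target, predicted, eps=1e-7):
    return -sum(t * log(max(p, eps)) for t, p in zip(target, predicted))

# average the per-step loss over the time steps of one sequence
def sequence_cross_entropy(targets, predictions):
    losses = [step_cross_entropy(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

# hypothetical example with 3 classes and 2 time steps
targets = [[0, 1, 0], [1, 0, 0]]
confident = [[0.05, 0.9, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]
print(sequence_cross_entropy(targets, confident))
print(sequence_cross_entropy(targets, uncertain))
```

The loss is lower when more probability mass falls on the correct class at each time step, which is what training pushes the softmax outputs towards.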

The code for the encoder-decoder architecture in Keras is listed below.

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


We will train the model on 5,000 random input-output pairs of integer sequences.

# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)


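Once trained, we will evaluate the model on 100 new randomly generated integer sequences and only mark a prediction correct when the entire output sequence matches the expected value.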

# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
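
Whole-sequence matching is a stricter measure than accuracy counted per time step, which is roughly what the 'accuracy' metric printed during training reflects. The sketch below contrasts the two on made-up decoded sequences; the function names and the data are hypothetical and are not model output.

```python
# fraction of sequences where every time step matches
def sequence_accuracy(expected, predicted):
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)

# fraction of individual time steps that match
def timestep_accuracy(expected, predicted):
    matches = [e_t == p_t for e, p in zip(expected, predicted) for e_t, p_t in zip(e, p)]
    return sum(matches) / len(matches)

# hypothetical decoded sequences, not model output
expected = [[4, 9, 0, 0, 0], [7, 2, 0, 0, 0]]
predicted = [[4, 9, 0, 0, 0], [7, 7, 0, 0, 0]]
print(sequence_accuracy(expected, predicted))
print(timestep_accuracy(expected, predicted))
```

Here one wrong value out of ten time steps leaves per-step accuracy at 0.9 while whole-sequence accuracy drops to 0.5. A model that predicts the zero padding well can therefore look strong per step and still score poorly on the stricter measure.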


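Finally, we will print 10 examples of expected output sequences and the sequences predicted by the model.

Putting all of this together, the complete example is listed below.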

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))


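Running this example will not take long, perhaps a few minutes on the CPU; no GPU is required.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The accuracy of the model was reported at just under 20%.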

Accuracy: 19.00%


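We can see from the sample outputs that the model gets the first number in the output sequence correct in most or all cases, but struggles with the second number. All zero padding values are predicted correctly.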

Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [43, 31, 0, 0, 0] Predicted [43, 31, 0, 0, 0]
Expected: [14, 22, 0, 0, 0] Predicted [14, 14, 0, 0, 0]
Expected: [39, 31, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [6, 4, 0, 0, 0] Predicted [6, 4, 0, 0, 0]
Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [39, 33, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [23, 2, 0, 0, 0] Predicted [23, 23, 0, 0, 0]
Expected: [19, 28, 0, 0, 0] Predicted [19, 3, 0, 0, 0]
Expected: [32, 33, 0, 0, 0] Predicted [32, 32, 0, 0, 0]
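
As a quick check of that observation, the sketch below counts matches at each output position for the 10 spot-check pairs printed above. It uses only the values shown in that sample output; the variable names are illustrative.

```python
# the 10 spot-check pairs printed above: (expected, predicted)
samples = [
    ([47, 0, 0, 0, 0], [47, 47, 0, 0, 0]),
    ([43, 31, 0, 0, 0], [43, 31, 0, 0, 0]),
    ([14, 22, 0, 0, 0], [14, 14, 0, 0, 0]),
    ([39, 31, 0, 0, 0], [39, 39, 0, 0, 0]),
    ([6, 4, 0, 0, 0], [6, 4, 0, 0, 0]),
    ([47, 0, 0, 0, 0], [47, 47, 0, 0, 0]),
    ([39, 33, 0, 0, 0], [39, 39, 0, 0, 0]),
    ([23, 2, 0, 0, 0], [23, 23, 0, 0, 0]),
    ([19, 28, 0, 0, 0], [19, 3, 0, 0, 0]),
    ([32, 33, 0, 0, 0], [32, 32, 0, 0, 0]),
]

# count matches at each output position
n_steps = len(samples[0][0])
hits = [0] * n_steps
for expected, predicted in samples:
    for t in range(n_steps):
        if expected[t] == predicted[t]:
            hits[t] += 1
print(hits)
```

For this sample the counts are [10, 2, 10, 10, 10]: the first value and the padding are always right, while the second value is right in only 2 of 10 cases. In 7 of the 8 mismatches the model simply repeated the first value.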


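Custom Keras Attention Layer

Now we need to add attention to the encoder-decoder model.

At the time of writing, Keras does not have attention built into the library, but it is coming soon.

Until attention is officially available in Keras, we can either develop our own implementation or use an existing third-party implementation.

To speed things up, let's use an existing third-party implementation.

Zafarali Ahmed, an intern at Datalogue, developed a custom layer for Keras that provides support for attention. It was presented in a 2017 post titled "How to Visualize Your Recurrent Neural Network with Attention in Keras" and in a GitHub project called "keras-attention".

The custom attention layer is called AttentionDecoder and is available in the custom_recurrents.py file in the GitHub project. We can reuse this code under the project's GNU Affero General Public License v3.0.

A copy of the custom layer is listed below for completeness. Copy it and paste it into a new, separate file in your current working directory called 'attention_decoder.py'.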

import tensorflow as tf
from keras import backend as K
from keras import regularizers, constraints, initializers, activations
from keras.layers.recurrent import Recurrent, _time_distributed_dense
from keras.engine import InputSpec

tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)

class AttentionDecoder(Recurrent):

    def __init__(self, units, output_dim,
                 activation='tanh',
                 return_probabilities=False,
                 name='AttentionDecoder',
                 kernel_initializer='glorot_uniform',
                 recurrent_initializer='orthogonal',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        """
        Implements an AttentionDecoder that takes in a sequence encoded by an
        encoder and outputs the decoded states
        :param units: dimension of the hidden state and the attention matrices
        :param output_dim: the number of labels in the output space

        references:
            Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio.
            "Neural machine translation by jointly learning to align and translate."
            arXiv preprint arXiv:1409.0473 (2014).
        """
        self.units = units
        self.output_dim = output_dim
        self.return_probabilities = return_probabilities
        self.activation = activations.get(activation)
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.recurrent_initializer = initializers.get(recurrent_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.recurrent_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)

        self.kernel_constraint = constraints.get(kernel_constraint)
        self.recurrent_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)

        super(AttentionDecoder, self).__init__(**kwargs)
        self.name = name
        self.return_sequences = True  # must return sequences

    def build(self, input_shape):
        """
          See Appendix 2 of Bahdanau 2014, arXiv:1409.0473
          for model details that correspond to the matrices here.
        """

        self.batch_size, self.timesteps, self.input_dim = input_shape

        if self.stateful:
            super(AttentionDecoder, self).reset_states()

        self.states = [None, None]  # y, s

        """
            Matrices for creating the context vector
        """

        self.V_a = self.add_weight(shape=(self.units,),
                                   name='V_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.W_a = self.add_weight(shape=(self.units, self.units),
                                   name='W_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_a = self.add_weight(shape=(self.input_dim, self.units),
                                   name='U_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_a = self.add_weight(shape=(self.units,),
                                   name='b_a',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the r (reset) gate
        """
        self.C_r = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_r = self.add_weight(shape=(self.units, self.units),
                                   name='U_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_r = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_r = self.add_weight(shape=(self.units, ),
                                   name='b_r',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        """
            Matrices for the z (update) gate
        """
        self.C_z = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_z = self.add_weight(shape=(self.units, self.units),
                                   name='U_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_z = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_z = self.add_weight(shape=(self.units, ),
                                   name='b_z',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the proposal
        """
        self.C_p = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_p = self.add_weight(shape=(self.units, self.units),
                                   name='U_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_p = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_p = self.add_weight(shape=(self.units, ),
                                   name='b_p',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for making the final prediction vector
        """
        self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim),
                                   name='C_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_o = self.add_weight(shape=(self.units, self.output_dim),
                                   name='U_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                                   name='W_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_o = self.add_weight(shape=(self.output_dim, ),
                                   name='b_o',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        # For creating the initial state:
        self.W_s = self.add_weight(shape=(self.input_dim, self.units),
                                   name='W_s',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)

        self.input_spec = [
            InputSpec(shape=(self.batch_size, self.timesteps, self.input_dim))]
        self.built = True

    def call(self, x):
        # store the whole sequence so we can "attend" to it at each timestep
        self.x_seq = x

        # apply a dense layer over the time dimension of the sequence
        # do it here because it doesn't depend on any previous steps
        # therefore we can save computation time:
        self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
                                             input_dim=self.input_dim,
                                             timesteps=self.timesteps,
                                             output_dim=self.units)

        return super(AttentionDecoder, self).call(x)

    def get_initial_state(self, inputs):
        # apply the matrix on the first time step to get the initial s0.
        s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s))

        # from keras.layers.recurrent to initialize a vector of (batchsize,
        # output_dim)
        y0 = K.zeros_like(inputs)  # (samples, timesteps, input_dims)
        y0 = K.sum(y0, axis=(1, 2))  # (samples, )
        y0 = K.expand_dims(y0)  # (samples, 1)
        y0 = K.tile(y0, [1, self.output_dim])

        return [y0, s0]

    def step(self, x, states):

        ytm, stm = states

        # repeat the hidden state to the length of the sequence
        _stm = K.repeat(stm, self.timesteps)

        # now multiply the weight matrix with the repeated hidden state
        _Wxstm = K.dot(_stm, self.W_a)

        # calculate the attention probabilities
        # this relates how much other timesteps contributed to this one.
        et = K.dot(activations.tanh(_Wxstm + self._uxpb),
                   K.expand_dims(self.V_a))
        at = K.exp(et)
        at_sum = K.sum(at, axis=1)
        at_sum_repeated = K.repeat(at_sum, self.timesteps)
        at /= at_sum_repeated  # vector of size (batchsize, timesteps, 1)

        # calculate the context vector
        context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1)
        # ~~~> calculate new hidden state
        # first calculate the "r" gate:

        rt = activations.sigmoid(
            K.dot(ytm, self.W_r)
            + K.dot(stm, self.U_r)
            + K.dot(context, self.C_r)
            + self.b_r)

        # now calculate the "z" gate
        zt = activations.sigmoid(
            K.dot(ytm, self.W_z)
            + K.dot(stm, self.U_z)
            + K.dot(context, self.C_z)
            + self.b_z)

        # calculate the proposal hidden state:
        s_tp = activations.tanh(
            K.dot(ytm, self.W_p)
            + K.dot((rt * stm), self.U_p)
            + K.dot(context, self.C_p)
            + self.b_p)

        # new hidden state:
        st = (1-zt)*stm + zt * s_tp

        yt = activations.softmax(
            K.dot(ytm, self.W_o)
            + K.dot(stm, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)

        if self.return_probabilities:
            return at, [yt, st]
        else:
            return yt, [yt, st]

    def compute_output_shape(self, input_shape):
        """
            For Keras internal compatibility checking
        """
        if self.return_probabilities:
            return (None, self.timesteps, self.timesteps)
        else:
            return (None, self.timesteps, self.output_dim)

    def get_config(self):
        """
            For rebuilding models on load time.
        """
        config = {
            'output_dim': self.output_dim,
            'units': self.units,
            'return_probabilities': self.return_probabilities
        }
        base_config = super(AttentionDecoder, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
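
The hidden-state update in step() above, st = (1-zt)*stm + zt * s_tp, is an element-wise interpolation between the previous state and the proposal. The plain-Python sketch below shows the two extremes and a midpoint; gated_update() and the example vectors are hypothetical and used only for illustration.

```python
# interpolate element-wise between the previous state and the proposal
def gated_update(prev_state, proposal, update_gate):
    return [(1 - z) * s + z * p for s, p, z in zip(prev_state, proposal, update_gate)]

# hypothetical 3-element vectors
prev_state = [0.5, -0.2, 0.1]
proposal = [0.9, 0.4, -0.3]
print(gated_update(prev_state, proposal, [0.0, 0.0, 0.0]))
print(gated_update(prev_state, proposal, [1.0, 1.0, 1.0]))
print(gated_update(prev_state, proposal, [0.5, 0.5, 0.5]))
```

A gate value of 0 keeps the previous state unchanged, a value of 1 replaces it with the proposal, and values in between blend the two. In the layer itself the gate zt is produced by a sigmoid, so it always falls between these extremes.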



We can make use of this custom layer in our projects by importing it as follows:

from attention_decoder import AttentionDecoder

The layer implements attention as described by Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate.”
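To make the attention calculation concrete, the following is a minimal NumPy sketch of one decoder step of additive (Bahdanau-style) attention. It is not the layer's code: it mirrors the score, softmax, and context steps in the layer's step() function for a single sample, and the function name additive_attention, the toy dimensions, and the random weights are assumptions made for illustration. Unlike the layer, it subtracts the maximum score before exponentiating for numerical stability.

```python
import numpy as np

def additive_attention(s_prev, h_seq, W_a, U_a, v_a, b_a):
    """One decoder step of additive attention for a single sample.

    s_prev: (units,) previous decoder hidden state
    h_seq:  (timesteps, input_dim) encoder outputs
    W_a:    (units, units), U_a: (input_dim, units)
    v_a, b_a: (units,)
    Returns the attention weights (timesteps,) and the context (input_dim,).
    """
    # score every encoder timestep against the previous decoder state
    scores = np.tanh(s_prev @ W_a + h_seq @ U_a + b_a) @ v_a
    # softmax over timesteps (shifted by the max for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # context vector: attention-weighted sum of the encoder outputs
    context = weights @ h_seq
    return weights, context

# toy dimensions and random weights, chosen only for illustration
rng = np.random.default_rng(1)
timesteps, input_dim, units = 5, 4, 3
h_seq = rng.normal(size=(timesteps, input_dim))
s_prev = rng.normal(size=units)
W_a = rng.normal(size=(units, units))
U_a = rng.normal(size=(input_dim, units))
v_a = rng.normal(size=units)
b_a = np.zeros(units)

weights, context = additive_attention(s_prev, h_seq, W_a, U_a, v_a, b_a)
print(weights.shape, context.shape)
```

The weights sum to one across the input time steps, so the context is a weighted average of the encoder outputs that is recomputed at every output step.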

The code is explained well in the original post and linked to both the LSTM and attention equations.

A limitation of this implementation is that it must output sequences that are the same length as the input sequences, the specific limitation that the encoder-decoder architecture was designed to overcome.

Importantly, the new layer manages both the repeating of the decoding as performed by the second LSTM, as well as the softmax output for the model as was performed by the Dense output layer in the encoder-decoder model without attention. This greatly simplifies the code for the model.

It is important to note that the custom layer is built upon the Recurrent layer in Keras, which, at the time of writing, is marked as legacy code, and presumably will be removed from the project at some point.

Encoder-Decoder With Attention

Now that we have an implementation of attention that we can use, we can develop an encoder-decoder model with attention for our contrived sequence prediction problem.

The model with the attention layer is defined below. We can see that the layer handles some of the machinery of the encoder-decoder model itself, making defining the model simpler.

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


That’s it. The rest of the example is the same.

The complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from attention_decoder import AttentionDecoder

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))


Running the example prints the skill of the model on 100 randomly generated input-output pairs.

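Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.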

With the same resources and same amount of training, the model with attention performs much better.

Accuracy: 95.00%

Spot-checking some sample outputs and predicted sequences, we can see very few errors, even in cases when there is a zero value in the first two elements.

Expected: [48, 47, 0, 0, 0] Predicted [48, 47, 0, 0, 0]
Expected: [7, 46, 0, 0, 0] Predicted [7, 46, 0, 0, 0]
Expected: [32, 30, 0, 0, 0] Predicted [32, 2, 0, 0, 0]
Expected: [3, 25, 0, 0, 0] Predicted [3, 25, 0, 0, 0]
Expected: [45, 4, 0, 0, 0] Predicted [45, 4, 0, 0, 0]
Expected: [49, 9, 0, 0, 0] Predicted [49, 9, 0, 0, 0]
Expected: [22, 23, 0, 0, 0] Predicted [22, 23, 0, 0, 0]
Expected: [29, 36, 0, 0, 0] Predicted [29, 36, 0, 0, 0]
Expected: [0, 29, 0, 0, 0] Predicted [0, 29, 0, 0, 0]
Expected: [11, 26, 0, 0, 0] Predicted [11, 26, 0, 0, 0]
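The accuracy reported by the example counts a prediction as correct only when the whole output sequence matches. As a rough sketch of how that strict score differs from a per-position score, the snippet below recomputes both from the ten spot-check pairs printed above. The helper names sequence_accuracy and position_accuracy are assumptions for illustration and are not part of the tutorial code.

```python
# expected/predicted pairs copied from the spot-check output above
pairs = [
    ([48, 47, 0, 0, 0], [48, 47, 0, 0, 0]),
    ([7, 46, 0, 0, 0], [7, 46, 0, 0, 0]),
    ([32, 30, 0, 0, 0], [32, 2, 0, 0, 0]),
    ([3, 25, 0, 0, 0], [3, 25, 0, 0, 0]),
    ([45, 4, 0, 0, 0], [45, 4, 0, 0, 0]),
    ([49, 9, 0, 0, 0], [49, 9, 0, 0, 0]),
    ([22, 23, 0, 0, 0], [22, 23, 0, 0, 0]),
    ([29, 36, 0, 0, 0], [29, 36, 0, 0, 0]),
    ([0, 29, 0, 0, 0], [0, 29, 0, 0, 0]),
    ([11, 26, 0, 0, 0], [11, 26, 0, 0, 0]),
]

# percentage of pairs where the whole sequence matches
def sequence_accuracy(pairs):
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    return 100.0 * correct / len(pairs)

# percentage of individual positions that match
def position_accuracy(pairs):
    total = sum(len(expected) for expected, _ in pairs)
    correct = sum(e == p for expected, predicted in pairs
                  for e, p in zip(expected, predicted))
    return 100.0 * correct / total

seq_acc = sequence_accuracy(pairs)
pos_acc = position_accuracy(pairs)
print('Sequence accuracy: %.2f%%' % seq_acc)
print('Position accuracy: %.2f%%' % pos_acc)
```

On these ten pairs, one wrong value out of fifty positions gives 98.00% per-position accuracy but only 90.00% whole-sequence accuracy, which is why the whole-sequence score is the more demanding measure.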


Comparison of Models

Although we are getting better results from the model with attention, the results were reported from a single run of each model.

In this case, we seek a more robust finding by repeating the evaluation of each model multiple times and reporting the average performance over those runs. For more information on this robust approach to evaluating neural network models, see the post:

How to Evaluate the Skill of Deep Learning Models

We can define a function to create each type of model, as follows.

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model


We can then define a function to fit and evaluate the accuracy of a fit model and return the accuracy score.

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0


Putting this together, we can repeat the process of creating, training, and evaluating each type of model multiple times and reporting the mean accuracy over the repeats. To keep running times down, we will repeat each model evaluation 10 times, although if you have the resources, you could increase this to 30 or 100 times.

The complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector
from attention_decoder import AttentionDecoder

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
n_repeats = 10
# evaluate encoder-decoder model
print('Encoder-Decoder Model')
results = list()
for _ in range(n_repeats):
	model = baseline_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))
# evaluate encoder-decoder with attention model
print('Encoder-Decoder With Attention Model')
results = list()
for _ in range(n_repeats):
	model = attention_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))


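Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.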

Running this example prints the accuracy for each model repeat to give you an idea of the progress of the run.

Encoder-Decoder Model
20.0
23.0
23.0
18.0
28.000000000000004
28.999999999999996
23.0
26.0
21.0
20.0
Mean Accuracy: 23.10%

Encoder-Decoder With Attention Model
98.0
91.0
94.0
93.0
96.0
99.0
97.0
94.0
99.0
96.0
Mean Accuracy: 95.70%


We can see that even averaged over 10 runs, the attention model still shows much better performance than the encoder-decoder model without attention: a mean accuracy of 95.70% vs. 23.10%.
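A mean on its own hides how much the runs vary. As a small sketch, the per-run accuracies printed above can be summarized with both the mean and the sample standard deviation using Python's statistics module. The summarize() helper is an assumption for illustration, and the two values printed as 28.000000000000004 and 28.999999999999996 are entered as 28.0 and 29.0.

```python
from statistics import mean, stdev

# per-run accuracies copied from the output above
baseline_scores = [20.0, 23.0, 23.0, 18.0, 28.0, 29.0, 23.0, 26.0, 21.0, 20.0]
attention_scores = [98.0, 91.0, 94.0, 93.0, 96.0, 99.0, 97.0, 94.0, 99.0, 96.0]

# report the mean and sample standard deviation of a list of scores
def summarize(name, scores):
    score_mean = mean(scores)
    score_std = stdev(scores)
    print('%s: mean=%.2f%% stdev=%.2f' % (name, score_mean, score_std))
    return score_mean, score_std

baseline_mean, baseline_std = summarize('Encoder-Decoder', baseline_scores)
attention_mean, attention_std = summarize('With Attention', attention_scores)
```

If the spread of each model is small compared to the gap between the two means, as it appears to be here, the difference is unlikely to be an artifact of a lucky or unlucky run.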

A good extension to this evaluation would be to capture the model loss each epoch for each model, take the average, and compare how the loss changes over time for the architecture with and without attention.

I expect that this trace would show attention achieving better skill much faster and sooner than the non-attentional model, further highlighting the benefit of the approach.
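One possible way to set up that comparison is sketched below, under some assumptions. Each call to model.fit() in Keras returns a History object, so the loss of a single training step could be collected with something like model.fit(X, y, epochs=1, verbose=0).history['loss'][0] inside the training loop. The helpers average_traces() and smooth() are hypothetical names, and the numbers in the demo are placeholders, not measured results.

```python
import numpy as np

# average equal-length loss traces from repeated runs, step by step
def average_traces(traces):
    return np.asarray(traces, dtype=float).mean(axis=0)

# moving average to tame the noise of single-sample training steps
def smooth(trace, window):
    kernel = np.ones(window) / window
    return np.convolve(trace, kernel, mode='valid')

# placeholder traces for two runs of four steps each (NOT real results)
demo_traces = [
    [4.0, 3.0, 2.0, 1.0],
    [2.0, 1.0, 2.0, 1.0],
]
avg_trace = average_traces(demo_traces)
smooth_trace = smooth(avg_trace, 2)
print(avg_trace)
print(smooth_trace)
```

Computing one averaged, smoothed trace per architecture and plotting the two on the same axes (for example, with Matplotlib) would show which model reduces the loss sooner.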

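Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

Specifically, you learned:

How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
How to robustly compare the performance of encoder-decoder networks with and without attention.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.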


358 Responses to How to Develop an Encoder-Decoder Model with Attention in Keras

Chetan October 17, 2017 at 6:11 am #

The timing of this post couldn’t have been more accurate. I’ve spent hours and days on google looking for a reliable Keras implementation of attention. Can’t wait to test this on my specific problem definition. Thanks a ton Jason!

- Jason Brownlee October 17, 2017 at 4:03 pm #
  
  I’m glad to hear that Chetan!
  
  Let me know how you go.
  
ChrisJew October 17, 2017 at 10:35 pm #

test soft

- Jason Brownlee October 18, 2017 at 5:36 am #
  
  Your test worked.
  
Mateo October 18, 2017 at 11:30 pm #

Thanks for the post!

Unfortunately the kernel crashes on my laptop! I don’t know why (no RAM issues)
I use Keras==2.0.8 and TF==1.3.0

- Jason Brownlee October 19, 2017 at 5:37 am #
  
  Ouch. Perhaps there is something up with your environment.
  
  This post might help if you need to set things up from scratch
  https://machinelearning.org.cn/setup-python-environment-machine-learning-deep-learning-anaconda/
  
Ravi Annaswamy October 20, 2017 at 7:09 pm #

Jason, very nice tutorial on probably the most important and most powerful neural application architecture (seq2seq with attention – since it is equivalent to a self programming turing machine – it sees an input stream of symbols, then can move back and forth using attention and write out a stream of symbols).

In fact theoretically it is super-turing, because it works with continuous (real) representation instead of Turing symbolic notation. google ‘recurrent networks super turing’ for proofs.

I am looking forward to attention being integrated into Keras and your revised code later, but no one can match your ability to setup the problem, generate data, explain step by step.. Keep up the great work.

Ravi Annaswamy

回复
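- Jason Brownlee October 21, 2017 at 5:29 am #
  
  Thanks Ravi, I really appreciate your support! You made my day 🙂
  
  Reply
Ravi Annaswamy October 20, 2017 at 8:12 pm #

Jason, I think that to show the power of sequence mapping, we need to try two things:
1. The length of the input sequence should be variable (not always 5). For example, you could set the maximum length to 10, but generate sequences of length 4 to 10 (with the rest padded with zeros).
2. The output should not be just a zeroing of values, but something more complex, for example the first and last non-zero values of the sequence...

Reply
- Ravi Annaswamy October 20, 2017 at 8:15 pm #
  
  Like the example built here:
  https://talbaumel.github.io/attention/
  
  Reply
  - Ravi Annaswamy October 20, 2017 at 9:49 pm #
    
    I am modifying your excellent code to illustrate this extended task and will post it soon.
    
    Reply
- Jason Brownlee October 21, 2017 at 5:34 am #
  
  Yes, you could easily modify the above example to meet those requirements.
  
  Reply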
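Ravi Annaswamy October 20, 2017 at 10:25 pm #

Dr. Jason,

You have done an excellent job with the application and the framework code.

I wanted to show the great value of this architecture and the modularity of your code by trying a harder problem. It is harder in two ways.

First, we want the length of the input sequence to vary from one example to the next.

Second, we want the output to require attention and long-term memory across the entire length!

So we propose this task.

Given an input sequence of variable length, padded with zeros...
[6, 8, 7, 2, 2, 6, 6, 4, 0, 0]
I want the network to pick out and output the first and last non-zero values of the sequence
[6, 4, 0, 0, 0, 0, 0, 0, 0, 0]

To make the memory task more interesting, we want it to output these two numbers in reverse order!

Input
[6, 8, 7, 2, 2, 6, 6, 4, 0, 0]
Output
[4, 6, 0, 0, 0, 0, 0, 0, 0, 0]

This requires the algorithm to figure out that we are selecting the first and last values of the sequence, and then to write them out in reverse order! It really needs some kind of Turing machine that can move back and forth over the sequence and decide when to write what! Can a seq2seq LSTM with attention do it?
Let's try it out.

Here are some of the training cases created:
[5, 5, 3, 3, 2, 0, 0, 0, 0, 0] [2, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 7, 7, 4, 3, 9, 0, 0, 0, 0] [9, 4, 0, 0, 0, 0, 0, 0, 0, 0]
[2, 6, 7, 6, 5, 0, 0, 0, 0, 0] [5, 2, 0, 0, 0, 0, 0, 0, 0, 0]
[9, 8, 2, 8, 8, 7, 9, 1, 5, 0] [5, 9, 0, 0, 0, 0, 0, 0, 0, 0]

To achieve this, I made a few modifications to your excellent code.

1. To use 0 as the padding character, the unique symbols are shifted to run from 1 to n_unique-1.

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-2)+1 for _ in range(length)]

I think you should adopt the above mechanism in your original code as well, so that 0 is reserved as the padding symbol and the generated sequences contain only the values 1 to n_unique-1. I think this will raise your test accuracy to 100%.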

2. To simplify the domain so that training is faster, I restricted the range of values:

n_features = 8
n_timesteps_in = 10
n_timesteps_out = 2

That is, the input has a maximum of 10 positions, of which 4 to 9 may hold the non-zero sequence, as shown below.
The input uses an alphabet of only 8 symbols instead of the 50 you used.

3. Accordingly, get_pair was modified to generate the sequences described above:

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality, verbose=False): # edited this function to add a verbose flag
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	real_length = randint(4,n_in-1) # I added this
	sequence_in = sequence_in[:real_length] + [0 for _ in range(n_in-real_length)] # I added this
	sequence_out = [sequence_in[real_length-1]]+[sequence_in[0]] + [0 for _ in range(n_in-2)] # I edited this
	if verbose: # added for testing
		print(sequence_in,sequence_out) # added this
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

4. With these changes,
for _ in range(5):
	a=get_pair(10,2,10,verbose=True)

produces

[6, 8, 7, 2, 2, 6, 6, 4, 0, 0] [4, 6, 0, 0, 0, 0, 0, 0, 0, 0]
[5, 5, 3, 3, 2, 0, 0, 0, 0, 0] [2, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 7, 7, 4, 3, 9, 0, 0, 0, 0] [9, 4, 0, 0, 0, 0, 0, 0, 0, 0]
[2, 6, 7, 6, 5, 0, 0, 0, 0, 0] [5, 2, 0, 0, 0, 0, 0, 0, 0, 0]
[9, 8, 2, 8, 8, 7, 9, 1, 5, 0] [5, 9, 0, 0, 0, 0, 0, 0, 0, 0]
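
The rule behind these pairs can be checked independently of the model. The following is a minimal sketch, not part of the original comment; the helper name expected_output is hypothetical, and it assumes zeros appear only as trailing padding.

```python
def expected_output(sequence_in):
    """Target for the task: last non-zero value, then first value, zero-padded."""
    # assumption: 0 is used only for padding, never as a real symbol
    non_zero = [value for value in sequence_in if value != 0]
    target = [non_zero[-1], non_zero[0]]
    return target + [0] * (len(sequence_in) - len(target))

# pairs copied from the listing above
examples = [
    ([6, 8, 7, 2, 2, 6, 6, 4, 0, 0], [4, 6, 0, 0, 0, 0, 0, 0, 0, 0]),
    ([5, 5, 3, 3, 2, 0, 0, 0, 0, 0], [2, 5, 0, 0, 0, 0, 0, 0, 0, 0]),
    ([9, 8, 2, 8, 8, 7, 9, 1, 5, 0], [5, 9, 0, 0, 0, 0, 0, 0, 0, 0]),
]
for seq_in, seq_out in examples:
    print(seq_in, expected_output(seq_in) == seq_out)
```

A check like this is handy when changing the generator, since a silent off-by-one in real_length would otherwise only show up as poor accuracy.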

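5. Results of training on this dataset

Encoder-Decoder Model
20.0
12.0
18.0
19.0
9.0
10.0
16.0
12.0
12.0
11.0

Encoder-Decoder With Attention Model
100.0
100.0
100.0
100.0
100.0
100.0
100.0
100.0
100.0
100.0

Yes!

This shows the ability of recurrent neural network models to learn arbitrary programs from example input and output pairs!
Of course, we could increase the length of the sequences and the value of n_unique to make the task harder, but I do not expect dramatic failures as we gradually increase them to reasonable values.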
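
The ten repeats per model can be summarised with a mean and a spread rather than compared run by run. This is a minimal sketch using only the accuracies reported above; the summarise helper is hypothetical, and the standard deviation is the sample standard deviation from Python's statistics module.

```python
from statistics import mean, stdev

# accuracies (percent) copied from the results reported above
baseline = [20.0, 12.0, 18.0, 19.0, 9.0, 10.0, 16.0, 12.0, 12.0, 11.0]
attention = [100.0] * 10

def summarise(name, scores):
    # report the mean and sample standard deviation across repeated runs
    print('%s: mean=%.1f%% std=%.1f%% (n=%d)' % (name, mean(scores), stdev(scores), len(scores)))

summarise('Encoder-Decoder', baseline)
summarise('Encoder-Decoder With Attention', attention)
```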

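I am really glad you put this excellent example together. If it adds value, please feel free to add this extended application to your excellent posts/books. Also, please review the changes to make sure I have not made any mistakes.

My only complaint is that the Keras attention implementation is very slow. (I think a PyTorch implementation would be much faster because it avoids several layers of abstraction... but I may be wrong, I will give it a try...)

Ravi

Full code attached for reproduction: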

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector
from attention_decoder import AttentionDecoder

# generate a sequence of random integers in 1..n_unique-1 (0 is reserved for padding)
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-2)+1 for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality, verbose=False):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	real_length = randint(4,n_in-1)
	sequence_in = sequence_in[:real_length] + [0 for _ in range(n_in-real_length)]
	sequence_out = [sequence_in[real_length-1]]+[sequence_in[0]] + [0 for _ in range(n_in-2)]
	if verbose:
		print(sequence_in,sequence_out)
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0

# configure problem
n_features = 8
n_timesteps_in = 10
n_timesteps_out = 2
n_repeats = 10

# evaluate encoder-decoder model
print('Encoder-Decoder Model')
results = list()
for _ in range(n_repeats):
	model = baseline_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))
# evaluate encoder-decoder with attention model
print('Encoder-Decoder With Attention Model')
results = list()
for _ in range(n_repeats):
	model = attention_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

Reply
- Jason Brownlee October 21, 2017 at 5:38 am #
  
  Well done!
  
  Reply
  - Bhuvana June 27, 2019 at 12:11 am #
    
    I get the following error when importing the .py file:
    
    from attention_decoder import AttentionDecoder
    
    ImportError Traceback (most recent call last)
    in ()
    ----> 1 from attention_decoder import AttentionDecoder as ad
    
    ImportError: cannot import name 'AttentionDecoder'
    
    Reply
    - Jason Brownlee June 27, 2019 at 7:52 am #
      
      Sorry to hear that, I have some suggestions here:
      https://machinelearning.org.cn/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
      
      Reply
      - Bilal Chandio October 11, 2020 at 6:48 pm #
        
        Dimensions must be equal, but are 150 and 50 for 'AttentionDecoder_1/MatMul_4' (op: 'MatMul') with input shapes: [?,150], [50,150].
        
        I am using Keras==2.0
        and tensorflow==1.13.1
        
        Please specify for which versions the code is fully functional.
        
        Thanks in advance.
      - Jason Brownlee October 12, 2020 at 6:40 am #
        
        I recommend using the new attention layers in TensorFlow 2.
- Ashima February 16, 2019 at 2:35 am #
  
  Hi @Ravi, @Jason,
  
  Thanks for the great post. Is it possible to give variable timesteps as the input for RepeatVector for variable input length?
  
  For instance, instead of defining a fixed size of n_timesteps_in as 10, I want to read the entire input sequence as a whole.
  
  model.add(RepeatVector(n_timesteps_in))
  
  Reply
  - Jason Brownlee February 16, 2019 at 6:20 am #
    
    Not really.
    
    Reply
    - kuldeep January 3, 2020 at 8:30 pm #
      
      I have a problem. I am building an NMT model for English to Hindi, but its predictions are very bad. How can I improve them?
      
      Reply
      - Jason Brownlee January 4, 2020 at 8:30 am #
        
        Here are some suggestions:
        https://machinelearning.org.cn/improve-deep-learning-performance/
ravi annaswamy October 20, 2017 at 10:44 pm #

Here is a verbose evaluation:

for _ in range(5):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features,verbose=True)
	yhat = model.predict(X, verbose=0)
	print(one_hot_decode(yhat[0]))

[5, 5, 1, 6, 1, 4, 5, 0, 0, 0] [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[5, 5, 4, 7, 2, 1, 3, 0, 0, 0] [3, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[3, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[3, 4, 7, 6, 3, 1, 3, 1, 1, 0] [1, 3, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 3, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 4, 1, 4, 7, 2, 2, 3, 4, 0] [4, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 5, 1, 4, 7, 6, 3, 7, 7, 0] [7, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[7, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Reply
- Meeklai October 24, 2017 at 8:07 pm #
  
  Awesome work Ravi!
  
  Would you please upload this code to https://gist.github.com?
  
  Reply
Meeklai October 21, 2017 at 2:57 am #

First of all, thank you so much for this article, which is well worth reading. It clarified a lot for me about how to implement an autoencoder model in Keras.

There is just one point that confuses me, and I hope you can explain it. Why do you need to transform the original vector of integers into a 2D matrix containing a one hot vector for each integer? Can't you just send the original vector of integers into the encoder as input?

Thank you again for this worthwhile article, Dr. Brownlee

Reply
- Jason Brownlee October 21, 2017 at 5:43 am #
  
  You can, but the one hot encoding is richer and often results in better model skill.
  
  Reply
  - Meeklai October 23, 2017 at 2:21 am #
    
    Thank you Dr. Brownlee. Would one hot encoding still be better when the cardinality is much greater than in this example? For instance, fitting an encoder on lots of text documents would result in a huge number of encoder keys.
    
    Reply
    - Jason Brownlee October 23, 2017 at 5:49 am #
      
      In that case, it might be better to use a distributed representation like a word embedding
      https://machinelearning.org.cn/develop-word-embeddings-python-gensim/
      
      Reply
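
To make the trade-off in this exchange concrete: a one hot vector needs one entry per vocabulary item, whereas an embedding stores one short dense row per item, and looking up a row is equivalent to multiplying the one hot vector by the embedding table. The sketch below illustrates that equivalence in plain Python; the table values, sizes, and helper names are all hypothetical, and a real embedding layer would learn its table during training.

```python
import random

random.seed(1)
vocab_size, embed_dim = 6, 3  # hypothetical sizes

# hypothetical embedding table: one dense row per vocabulary index
embedding = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]

def one_hot(index, size):
    # vector of zeros with a single 1 at the given index
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    # multiply a row vector (length rows) by a matrix (rows x cols)
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix))) for c in range(n_cols)]

token = 4
via_one_hot = vector_times_matrix(one_hot(token, vocab_size), embedding)
via_lookup = embedding[token]
print(len(one_hot(token, vocab_size)), 'values per token as one hot')
print(len(via_lookup), 'values per token as an embedding')
```

With a vocabulary of tens of thousands of words, the one hot input grows with the vocabulary while the embedding width stays fixed, which is the motivation for the suggestion above.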
Hendrik October 24, 2017 at 7:00 pm #

In the case of multiple LSTM layers, is the AttentionDecoder layer supposed to come only once, after all the LSTMs, or must it be inserted after each LSTM layer?

Reply
- Jason Brownlee October 25, 2017 at 6:44 am #
  
  The attention is only used directly after the encoder.
  
  Reply
  - AP May 18, 2018 at 7:31 pm #
    
    Hi Jason, following up on Hendrik’s question, how can I stack multiple LSTM layers with attention? Do I initialise the first decoder layer as AttentionDecoder and follow it up with Keras’s LSTM layers? Thanks for the super informative post!
    
    Reply
    - Jason Brownlee May 19, 2018 at 7:37 am #
      
      Attention would only be required on the first level of the decoder. LSTM layers may then be added after that.
      
      Reply
Trialcritic October 25, 2017 at 8:21 am #

Usually, when people have 5 input and 2 output steps, we use

model.add(LSTM(size, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_out)) # this is different from input steps
model.add(LSTM(size, return_sequences=True))

This makes sense, as suggested

“we need to repeat the single vector outputted from the encoder network to obtain a sequence which has the same length with the output sequences”.

I wonder if this must be changed.

Reply
- Jason Brownlee October 25, 2017 at 3:57 pm #
  
  Yes, the RepeatVector approach is not a pure encoder-decoder as defined in the first papers, but often performs as well or better in my experience.
  
  Reply
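
For readers unsure what the RepeatVector step contributes: conceptually it copies the single encoder output vector once per output time step, so the decoder LSTM receives a sequence of the desired output length. The following is a plain-Python illustration of that idea for one sample, not the Keras implementation; repeat_vector and the context values are hypothetical.

```python
def repeat_vector(context, n_timesteps_out):
    # copy the fixed-length context vector once per output time step
    return [list(context) for _ in range(n_timesteps_out)]

# hypothetical encoder output for one sample (3 hidden units)
context = [0.2, -0.1, 0.7]

# two output steps, as in the 5-in / 2-out example discussed above
decoder_inputs = repeat_vector(context, 2)
for step, vector in enumerate(decoder_inputs):
    print(step, vector)
```

Every decoder step sees the same vector, which is why this design relies on the decoder's own recurrent state to tell the steps apart.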
Aayushee November 3, 2017 at 8:17 pm #

Hi Jason,

Thanks for such a well explained post on this topic. You mention the limitation that output sequences are the same length as the input sequences in the case of the attention encoder-decoder model used.
Could you please give an idea of what should be done in an attention based model when output and input lengths are not the same? I was wondering if we can use a RepeatVector(output_timesteps) in the current attention model on the encoder output and then feed it to the AttentionDecoder?

Reply
- Jason Brownlee November 4, 2017 at 5:29 am #
  
  This implementation of attention cannot handle input and output sequences with different lengths, sorry.
  
  Reply
  - Sravan Malla May 27, 2019 at 7:42 pm #
    
    Hi Jason, if this implementation of attention cannot handle input and output sequences with different lengths, then it can't be used for a language translation task, right? Please advise.
    
    Reply
    - Jason Brownlee May 28, 2019 at 8:13 am #
      
      Probably not.
      
      Reply
caichao November 4, 2017 at 11:43 pm #

By running your example (the “with attention” part), I got the following error:
ValueError: Dimensions must be equal, but are 150 and 50 for 'AttentionDecoder/MatMul_4' (op: 'MatMul') with input shapes: [?,150], [50,150].

Reply
- Jason Brownlee November 5, 2017 at 5:16 am #
  
  Ensure you have the latest version of Keras.
  
  Reply
  - caichao November 5, 2017 at 11:51 am #
    
    My keras version is 2.0.2
    
    Reply
    - Jason Brownlee November 6, 2017 at 4:48 am #
      
      Perhaps try 2.0.8 or higher?
      
      Reply
      - caichao November 6, 2017 at 11:47 pm #
        
        also when I upgrade keras to 2.0.9
        I got the following problem
        
        from keras.layers.recurrent import Recurrent, _time_distributed_dense
        “unresolved reference _time_distributed_dense”
      - Jason Brownlee November 7, 2017 at 9:50 am #
        
        Interesting, perhaps the example requires Keras 2.0.8. This was the version I used when developing the example.
      - j May 4, 2021 at 9:12 pm #
        
        Recurrent is not found in TensorFlow 2; I got an error when importing it:
        
        ImportError: cannot import name 'Recurrent'
        
        The line itself is “from tensorflow.keras.layers import Recurrent”
        
        How do you import that layer? Any idea?
      - Jason Brownlee May 5, 2021 at 6:10 am #
        
        I believe the above tutorial is not compatible with the latest version of the APIs.
      - Woo March 2, 2022 at 11:33 pm #
        
        Then how can I use this tutorial? I tried to find some ways, but failed.
      - James Carmichael March 3, 2022 at 1:42 pm #
        
        Hi Woo…Please provide more detail regarding what exactly failed in your implementation of the code listings so that I can better assist you.
  - caichao November 5, 2017 at 12:08 pm #
    
    also when I upgrade keras to 2.0.9
    I got the following problem
    
    from keras.layers.recurrent import Recurrent, _time_distributed_dense
    “unresolved reference _time_distributed_dense”
    
    Reply
kamal November 6, 2017 at 12:54 am #

Hi Jason. Thank you for your great tutorials. I have 2 questions:

1) Is there any Dense layer after the decoder in the attention code?
2) Should the number of input features be equal to the number of output features or not (their lengths should be equal, as you mentioned)?

Thank you, again

Reply
- Jason Brownlee November 6, 2017 at 4:53 am #
  
  Yes, there is normally a dense output after the decoder (or a part of the decoder).
  
  Features can vary. Normally/often you would have more input features than output features.
  
  Reply
Nandini November 29, 2017 at 5:33 pm #

Hi Jason,

from keras.models import Model

How does this Model() class work in Keras?

Reply
- Jason Brownlee November 30, 2017 at 8:07 am #
  
  Great question, you can learn more in this post
  https://machinelearning.org.cn/keras-functional-api-deep-learning/
  
  Reply
Basma November 30, 2017 at 9:57 pm #

Hi Jason,

Thank you so much for this great tutorial. I’m actually trying to build an encoder with attention, so the attention should be in the encoder part. Can you please explain how this can be adapted?

Many thanks 🙂

Reply
- Jason Brownlee December 1, 2017 at 7:32 am #
  
  Generally, attention is in the decoder, not the encoder. Sorry, I don’t have an example of an encoder with attention.
  
  Reply
  - Basma December 6, 2017 at 7:52 pm #
    
    Hi Jason,
    
    I’m trying to use this great implementation for seq2seq to encode text. I have a dialogue turn from user A that I’ll decode to get a dialogue turn from user B. I am using the following code:
    
    seq2seq = Sequential()
    seq2seq.add(Embedding(output_dim=args.emb_dim,
                          input_dim=MAX_NB_WORDS,
                          input_length=MAX_SEQUENCE_LENGTH,
                          weights=[embedding_matrix],
                          mask_zero=True,
                          trainable=True))
    
    seq2seq.add(LSTM(units=args.hidden_size, return_sequences=True))
    seq2seq.add(AttentionDecoder(args.hidden_size, args.emb_dim))
    seq2seq.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    
    But actually I don’t know how I can compare the decoded vector to the turn vector that I already have.
    
    My dialogue vectors are already encoded using the Keras text preprocessing texts_to_sequences function and padded.
    
    Many thanks!
    
    Reply
    - Jason Brownlee December 7, 2017 at 7:53 am #
      
      I assume you are outputting a sequence of integers. These integers must be mapped back to words using whatever scheme you used to encode your training data.
      
      Reply
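
The mapping back to words described in this reply can be done by inverting the word-to-integer dictionary used for encoding. The sketch below assumes such a dictionary is available (for example, the Keras Tokenizer exposes one as its word_index attribute); the vocabulary shown and the decode_sequence helper are hypothetical.

```python
# hypothetical word-to-integer mapping used when encoding the training data;
# index 0 is assumed to be reserved for padding
word_index = {'hello': 1, 'how': 2, 'are': 3, 'you': 4}

# invert the mapping so predicted integers can be turned back into words
index_word = {index: word for word, index in word_index.items()}

def decode_sequence(int_sequence, index_word, pad=0):
    # skip padding and join the remaining words into a string
    return ' '.join(index_word[i] for i in int_sequence if i != pad)

predicted = [2, 3, 4, 0, 0]  # e.g. the argmax of each output time step
print(decode_sequence(predicted, index_word))
```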
Leo January 2, 2018 at 8:58 pm #

Hi Jason,

Thanks for this tutorial. I’m trying a word embedding seq2seq model, but I’m stuck on how to build the model.
I use a tokenizer and pad_sequences to encode Xtrain and ytrain, and then process ytrain through to_categorical.
The format of the input fed into the model is just like the one in this tutorial: one sample of Xtrain and ytrain for each epoch.
It seems there’s something wrong with the embedding layer, but I can’t figure out why.

model = Sequential()
model.add(Embedding(vocab_size, 150, input_length=max_length))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

ValueError: Error when checking input: expected embedding_8_input to have shape (None, 148) but got array with shape (148, 1)

Reply
- Leo January 2, 2018 at 9:35 pm #
  
  Sorry, I have another question. Can I just fit the model directly instead of using a for loop to train the model for each epoch?
  If I just fit the model directly, I get another error message:
  Error when checking target: expected AttentionDecoder to have 3 dimensions, but got array with shape (200, 1321)
  
  Thank you very much.
  
  Reply
  - Jason Brownlee January 3, 2018 at 5:36 am #
    
    You can, but you must change the shape of the data you are feeding to the network.
    
    Reply
- Jason Brownlee January 3, 2018 at 5:34 am #
  
  This post might help to get you started with embedding layers
  https://machinelearning.org.cn/use-word-embedding-layers-deep-learning-keras/
  
  Remember to one hot encode your input data.
  
  Reply
  - Leo January 4, 2018 at 3:34 pm #
    
    Hi Jason,
    
    Thanks for your suggestion, it works. However, I changed the problem in this tutorial a little bit, and got stuck again.
    
    In this tutorial, the problem is defined as follows: given Xtrain, for example [3, 5, 12, 10, 18, 20], we echo the first two elements, so ytrain looks like [3, 5, 0, 0, 0, 0].
    
    Now, I want to find a specific pair of consecutive numbers in a sequence, but the pair is located at a different position within each sequence.
    
    For example, what I want is [16, Z], where Z is any number and [16, Z] occurs within a sequence, Xtrain.
    
    So, Xtrain and ytrain look like
    Xtrain ytrain
    [3, 1, 10, 14, 8, 20, 16, 7, 9, 19] [16, 7, 0, 0, 0, 0, 0, 0, 0, 0]
    [6, 1, 23, 16, 9, 12, 22, 8, 0, 17] [16, 9, 0, 0, 0, 0, 0, 0, 0, 0]
    [9, 13, 15, 12, 16, 2, 5, 1, 10, 8] [16, 2, 0, 0, 0, 0, 0, 0, 0, 0]
    
    I think the key point is to transform the format of Xtrain and ytrain. One hot encoding Xtrain remains the same as in this tutorial. But right now I have no idea how to fit ytrain into the model. I tried several ways to transform the format of ytrain, such as:
    1. One hot encoding ytrain, but it doesn’t work.
    2. One hot encoding the location of [16, Z], but it seems nonsensical.
    3. Changing the format of ytrain to, for example, [0, 0, 0, 0, 0, 0, 16, 7, 0, 0], and then one hot encoding this sequence, but it still doesn’t work.
    
    Do you have any suggestion or idea on such a problem? Thank you very much.
    
    Reply
    - Jason Brownlee January 5, 2018 at 5:17 am #
      
      You could model it as a summarization task with the full sequence in and 2 elements out.
      
      For that you can use an encoder-decoder without attention.
      
      Reply
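
One way to read the suggestion above is to generate training pairs where the input is the full sequence and the target is just the two elements of interest. The following sketch builds such pairs in plain Python; make_pair is a hypothetical helper, the marker value 16 comes from the question, and it assumes the marker appears exactly once and never in the last position.

```python
from random import randint, seed

def make_pair(length, n_unique, marker=16):
    # candidate symbols: 1..n_unique-1, excluding the marker itself
    values = [v for v in range(1, n_unique) if v != marker]
    sequence = [values[randint(0, len(values) - 1)] for _ in range(length)]
    # place the marker anywhere except the last position, so a follower exists
    position = randint(0, length - 2)
    sequence[position] = marker
    # target is the marker followed by the value right after it
    return sequence, [marker, sequence[position + 1]]

seed(7)
for _ in range(3):
    x, y = make_pair(10, 25)
    print(x, y)
```

Both the input and the two-element target would still need to be one hot encoded before being fed to a model, as in the tutorial.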
Paul January 7, 2018 at 2:27 am #

Is there an updated version of this example that uses TimeDistributed in Keras instead of _time_distributed_dense?

Reply
- Jason Brownlee January 7, 2018 at 5:11 am #
  
  Not at this stage, I am waiting for attention to be officially supported by Keras.
  
  Reply
  - Denis January 12, 2018 at 3:50 am #
    
    Regarding the _time_distributed_dense issue: I copied the code of the _time_distributed_dense function from keras/python/keras/layers/recurrent.py into attention_decoder.py (before the AttentionDecoder class), and Jason's code worked for me.
    
    Reply
    - Jason Brownlee January 12, 2018 at 5:54 am #
      
      Great!
      
      Reply
    - Jacob February 21, 2018 at 9:39 am #
      
      Could you send me the code to copy into the AttentionDecoder class? ninjajake@gmail.com
      
      Reply
    - Alimur Razi Rana August 9, 2018 at 3:02 pm #
      
      This trick helped me a lot. For anyone who wants the code: https://github.com/keras-team/keras/blob/b587aeee1c1be3633a56b945af3e7c2c303369ca/keras/layers/recurrent.py
      
      Reply
  - Abdi December 10, 2022 at 4:34 am #
    
    It is now available on Keras.io: https://keras.org.cn/api/layers/attention_layers/attention/
    Is that what you meant? But I don't quite understand the parameters used between the encoder and the decoder. Is there an example?
    
    Reply
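Nipun Batra January 13, 2018 at 8:28 am #

Hi Jason,
Thank you very much for this excellent post (as always!).

I was wondering: can we learn a model for continuous data? That is, can we feed in the raw sequences instead of one hot encoding the input and output? I ask because I have not yet seen a seq2seq with attention used for continuous data. I would like to write a simple model, based on your post, to denoise a sine signal. This should be a case of seq2seq on sequences of the same length.

Reply
- Jason Brownlee January 14, 2018 at 6:33 am #
  
  Sure, but it may be a harder problem for the model to learn.
  
  Reply
  - Nipun Batra January 14, 2018 at 11:50 pm #
    
    Thanks! When I used the code from this post (with attention) on continuous data, I did not see any reduction in the loss. That made me wonder whether the attention implementation shared here only works for discrete data?
    
    Reply
    - Jason Brownlee January 15, 2018 at 6:59 am #
      
      Nice.
      
      No, it is independent of the data.
      
      Reply
- Shreya Bhatia April 2, 2020 at 2:49 am #
  
  Hey Nipun Batra... I am working on my capstone project, which is based on continuous data. I was wondering whether you could share with me the encoder-decoder code that you managed to get working for your purpose. Thanks, Shreya
  
  Reply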
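moses January 30, 2018 at 5:40 pm #

How should the decoder distinguish between 0 and an empty row?
(Zero is the one hot encoded vector whose first entry is 1, while empty is a row of zeros?)

Reply
- Jason Brownlee January 31, 2018 at 9:39 am #
  
  Sorry, I'm not sure I follow. Could you provide more context?
  
  Reply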
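
The distinction raised in this question can be shown with a few lines of plain Python. The sketch below re-implements the tutorial's one hot helpers without NumPy (these versions are illustrative, not the tutorial's code) and shows that the symbol 0 and an all-zero row are different as vectors, yet an argmax-based decoder maps both to 0.

```python
def one_hot_encode(sequence, n_unique):
    # one row per value, with a single 1 at the index of the value
    return [[1 if i == value else 0 for i in range(n_unique)] for value in sequence]

def argmax(vector):
    # index of the first maximum, matching the usual argmax convention
    return max(range(len(vector)), key=lambda i: vector[i])

n_unique = 5
zero_symbol = one_hot_encode([0], n_unique)[0]  # the integer 0 as a one hot vector
empty_row = [0] * n_unique                      # a row with nothing set

print(zero_symbol, '->', argmax(zero_symbol))
print(empty_row, '->', argmax(empty_row))
```

In the tutorial's data every time step is one hot encoded, so an all-zero row never occurs in the targets; padding is represented by the symbol 0.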
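moses January 31, 2018 at 12:55 am #

One more.

If the number of input features differs from the number of output features, we have to change this line:
model.add(TimeDistributed(Dense(n_features, activation='softmax')))

Do we need to do anything more?

Reply
- Jason Brownlee January 31, 2018 at 9:46 am #
  
  Features or output time steps?
  
  Reply
  - moses February 1, 2018 at 10:31 pm #
    
    My question was about n_features (for the input and the output), but I think the length of the sequences matters too. I have solved the first issue as I described, but I am not sure whether that is enough.
    
    Reply
Nathan D. February 2, 2018 at 5:13 am #

Hi Jason,

This is a great demonstration, thank you very much.

I was wondering, do you know of any way to retrieve the attention vector *at*? Since it is not a parameter of the model, accessing it with keras.backend.get_value() does not seem to work. Thanks.

Reply
Amy February 15, 2018 at 4:16 pm #

Hi Jason,

Great tutorial! It helped my understanding a lot. How can this attention seq2seq be modified into a batch version? Thanks!

Reply
- Jason Brownlee February 16, 2018 at 8:32 am #
  
  What impact would a batch approach have on attention?
  
  Reply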
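haya March 13, 2018 at 6:27 am #

Hi Jason,
In the step function of the AttentionDecoder, can we use a Keras LSTM layer instead of building it from scratch?

Reply
- Jason Brownlee March 13, 2018 at 6:33 am #
  
  Not sure I follow. Perhaps try it?
  
  Reply
Paul March 20, 2018 at 2:21 am #

It looks like the poor scores without attention are mainly due to an optimization problem. I was able to reach over 95% accuracy just by using a reasonable batch size (32).

Reply
- Jason Brownlee March 20, 2018 at 6:26 am #
  
  Nice, thanks for the tip Paul. What configuration did you use?
  
  Reply
Dan March 26, 2018 at 8:37 am #

I just want to let you know that your hard work on this site is appreciated. It has been very useful for me in learning such complex material 😀

Thank you very much!

Reply
- Jason Brownlee March 26, 2018 at 10:05 am #
  
  Thanks, I'm glad to hear that.
  
  Reply
Eduardo April 23, 2018 at 5:44 pm #

Hello, thank you for your site. It really helped me with my bachelor's thesis.

Can we train an SVM using the context vector?

Reply
- Jason Brownlee April 24, 2018 at 6:23 am #
  
  You're welcome.
  
  Sure. What would be the purpose?
  
  Reply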
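jimbung April 23, 2018 at 8:08 pm #

Hi Jason,
I ran into a problem when running the code.

Traceback (most recent call last):
File "gutils.py", line 50, in
model.add(AttentionDecoder(150, n_features))
…
# calculate the attention probabilities
# this relates how much other timesteps contributed to this one.
et = K.dot(activations.tanh(_Wxstm + self._uxpb),
K.expand_dims(self.V_a))
…
File "/home/wanjb/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 421, in make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.

My environment:
Anaconda: conda 4.4.10
Python 3.6.4 :: Anaconda, Inc.

Could you take a look at this? Thanks!

Reply
- Jason Brownlee April 24, 2018 at 6:32 am #
  
  Sorry to hear that.
  
  - Can you confirm that TensorFlow and Keras are up to date?
  - Can you confirm that you copied all of the code?
  - Can you confirm that you ran the code from the command line?
  
  Reply

jimbung April 24, 2018 at 1:22 pm #

Hi Jason,

The problem has been solved as described by "Denis January 12, 2018 at 3:50 am".
Thanks Denis!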
def _time_distributed_dense(x, w, b=None, dropout=None,
                           input_dim=None, output_dim=None, timesteps=None):
    '''Apply y.w + b for every temporal slice y of x.
    '''
    if not input_dim:
        # won't work with TensorFlow
        input_dim = K.shape(x)[2]
    if not timesteps:
        # won't work with TensorFlow
        timesteps = K.shape(x)[1]
    if not output_dim:
        # won't work with TensorFlow
        output_dim = K.shape(w)[1]

    if dropout:
        # apply the same dropout pattern at every timestep
        ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
        dropout_matrix = K.dropout(ones, dropout)
        expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
        x *= expanded_dropout_matrix

    # collapse time dimension and batch dimension together
    x = K.reshape(x, (-1, input_dim))

    x = K.dot(x, w)
    if b is not None:
        x = x + b
    # reshape to 3D tensor
    x = K.reshape(x, (-1, timesteps, output_dim))
    return x

Jason Brownlee April 24, 2018 at 2:50 pm #

Great!

Reply
Jack October 17, 2022 at 7:46 pm #

Hi,
name 'K' is not defined

Reply

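Jorn May 2, 2018 at 2:53 am #

Thank you very much for another great post! Is this architecture suitable for time series forecasting where you have multiple feature sequences to predict a single target sequence? The feature sequences are longer than the target sequence to be predicted.
All the examples I have seen so far show one feature sequence as input mapped to one target sequence as output.

Reply
- Jason Brownlee May 2, 2018 at 5:46 am #
  
  Perhaps, try it and see.
  
  Reply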
Ahmad Aldali May 5, 2018 at 5:39 am #

Hello Jason...
Thanks for this information...
I have a question...
Can I use this implementation in my translation model?
I use an encoder-decoder as follows:

"""
embedded_output = embedding_layer(embedding_input)

# ================================ Encoder ================================
encoder = LSTM(lstm_units, return_sequences=True, return_state=True, name='encoder')
encoder_outputs, state_h, state_c = encoder(embedded_output)
encoder_states = [state_h, state_c]

#….
embedding_Ar_input = Input(shape=(MAX_Ar_SEQUENCE_LENGTH,))
embedded_Ar_output = embedding_Ar_layer(embedding_Ar_input)

# ================================ Decoder ================================
# We set up the decoder to return full output sequences,
decoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True, name='decoder')

decoder_outputs, _, _ = decoder_lstm(embedded_Ar_output, initial_state=encoder_states)

# SoftMax
decoder_dense = Dense(output_vector_length, activation='softmax', name='softmax')
outputs_model = decoder_dense(attention)

"""
What is n_features? What does it represent?

Reply
- Ahmad Aldali May 5, 2018 at 5:42 am #
  
  Does n_features refer to the maximum decoder sequence length?
  
  Reply
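fatime May 7, 2018 at 8:10 pm #

Hello Jason, can you tell me what verbose does?

Reply
- Jason Brownlee May 8, 2018 at 6:11 am #
  
  It turns on output during training so that you can see what the model is doing (e.g. skill and progress).
  
  Reply
  - fatime May 8, 2018 at 8:14 pm #
    
    So, what is the difference between verbose = 1, 2, or none? And which attention mechanism is best for machine translation?
    
    Reply
    - Jason Brownlee May 9, 2018 at 6:20 am #
      
      Verbose 0 turns off verbose output, verbose 1 shows a progress bar, and verbose 2 shows one line per epoch.
      
      See this post on good NMT architectures:
      https://machinelearning.org.cn/configure-encoder-decoder-model-neural-machine-translation/
      
      Reply
fatime May 8, 2018 at 8:14 pm #

So, what is the difference between verbose = 1, 2, or none? And which attention mechanism is best for machine translation?

Reply
radhika May 8, 2018 at 8:49 pm #

Hello, can we use this model to translate one language into another?

Reply
- Jason Brownlee May 9, 2018 at 6:23 am #
  
  Here is an example:
  https://machinelearning.org.cn/develop-neural-machine-translation-system-keras/
  
  Reply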
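YoonseokHeo May 17, 2018 at 5:59 pm #

Thanks for the great tutorial.
In the custom Keras attention layer (the AttentionDecoder class), I wonder if you could tell me why the predicted word (yt) at time step t is implemented this way: as a sum of the previously generated word (ytm), the previous hidden state (stm), and the computed context vector (context), each multiplied by its weights.

Your implementation is as follows:
yt = activations.softmax(
K.dot(ytm, self.W_o)
+ K.dot(stm, self.U_o)
+ K.dot(context, self.C_o)
+ self.b_o)

I could not find any reference (other than the first definition) for computing the next word like this: P(yt|y1,…yt-1, X) = g(yi-1, si, ci)

I am not sure whether this equation corresponds to the way you compute yt.

Reply
- Jason Brownlee May 18, 2018 at 6:20 am #
  
  As noted, I did not implement the custom attention layer. I am not the best person to answer questions about it.
  
  Reply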
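chris May 20, 2018 at 1:49 am #

Hello Jason, I am trying to understand LSTMs and I am a beginner. Could you explain the following code in simpler terms?
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))

I understand the RepeatVector and TimeDistributed parts. What I do not understand is the 150 hidden units: must they be the same? What if there were only 1? If you have a source that explains the structure visually, that would be great. Thanks in advance.

Reply
- Jason Brownlee May 20, 2018 at 6:39 am #
  
  You can change the number of units as needed.
  
  Reply
Rui May 28, 2018 at 1:04 am #

How can we apply this to multivariate time series?

Reply
- Jason Brownlee May 28, 2018 at 6:02 am #
  
  Sure.
  
  Reply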
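Santy May 30, 2018 at 4:47 pm #

Hi Jason!

I am trying to understand image caption generation with attention. I have seen your tutorial on image caption generation. Could you recommend some resources so that I can implement it in Keras with an attention model?

Thanks!

Reply
- Jason Brownlee May 31, 2018 at 6:13 am #
  
  I am waiting for Keras to officially support attention.
  
  Reply
Santy June 14, 2018 at 1:58 am #

Hi Jason!

I have gone through your tutorial on image caption generation at the following link.

https://machinelearning.org.cn/develop-a-deep-learning-caption-generation-model-in-python/

Could this attention model be used for image caption generation, with a CNN as the encoder and an RNN as the decoder?

Please advise.

Thanks.

Reply
- Jason Brownlee June 14, 2018 at 6:10 am #
  
  Perhaps.
  
  I hope to provide more attention examples once Keras officially supports attention.
  
  Reply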
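Gang July 24, 2018 at 3:36 am #

Thanks for your great tutorial. I have learned a lot from your site.

I tried your code. It seems that just removing the first LSTM from the baseline model gives perfect predictions on this example. I am not sure the attention layer is needed here.

model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(Dense(n_features, activation='softmax'))

Reply
- Jason Brownlee July 24, 2018 at 6:22 am #
  
  Perhaps not, it is just a demonstration.
  
  Reply
Atilla August 2, 2018 at 11:00 pm #

Hello Jason, I would like the model to use a Bidirectional LSTM for the encoder and a Bidirectional LSTM for the decoder. In theory, can Bi-LSTMs be used in the model you presented?

Reply
- Jason Brownlee August 3, 2018 at 6:02 am #
  
  Sure.
  
  Reply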
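Elias Lousseief August 11, 2018 at 1:21 am #

Hello J! Thanks for a really great hands-on tutorial... It works as expected, and attention does improve the results... However, when inspecting the at vector in attention_decoder, it does not show the expected activations...

For example

Input: [29, 9, 47, 0, 12], output: [29, 9, 0, 0, 0] (correct)

The at vector for the first output (rounded): [6.2*10^(-12), 5.6*10^(-7), 1.5, 90.0, 8.4]

I would have expected the first of these numbers to be the largest, since it should influence the output more than the other four... What do you think about this? Could you check the at vector and see whether you get the same result?

Reply
- Jason Brownlee August 11, 2018 at 6:13 am #
  
  Nice observation, it might need further investigation.
  
  Reply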
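
For context on what such a vector represents: attention weights are typically produced by a softmax over per-time-step alignment scores, so they are non-negative and sum to 1 (the values reported above sum to roughly 100, which would be consistent with weights expressed as percentages, though that is an assumption). The sketch below shows this normalisation and the resulting weighted context vector in plain Python; the scores and encoder states are hypothetical and are not taken from the AttentionDecoder code.

```python
import math

def softmax(scores):
    # subtract the largest score first for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    # weighted sum of the encoder states, one weight per input time step
    size = len(encoder_states[0])
    return [sum(w * state[i] for w, state in zip(weights, encoder_states)) for i in range(size)]

scores = [0.1, 0.3, 2.0, 6.0, 3.5]  # hypothetical alignment scores, one per input step
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]  # hypothetical

weights = softmax(scores)
context = context_vector(weights, encoder_states)
print([round(100 * w, 1) for w in weights])
print(context)
```

Note that the weights index encoder hidden states, not raw inputs, and each LSTM state already summarises the steps before it, so the largest weight need not sit on the position being copied.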
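- Niranjan September 13, 2018 at 4:10 am #
  
  I see the same thing. Most of the probability is on the last 3 digits and never on the first 2.
  
  Thanks Jason for the great tutorial! Very helpful.
  
  Reply
  - Jason Brownlee September 13, 2018 at 8:06 am #
    
    You're welcome.
    
    Reply
Ling August 11, 2018 at 4:31 am #

Jason, well done!

Reply
- Jason Brownlee August 11, 2018 at 6:13 am #
  
  You're welcome.
  
  Reply
Nilanjan August 14, 2018 at 7:06 pm #

Hi Jason,

Thanks for the great post. I have a small question. To give you some context, I am working with text data, where the input and output lengths differ considerably. So I wanted to check whether this code can be adapted to apply attention when the encoder and decoder lengths differ. It would be very helpful if you could point me to a resource where this has been implemented, or guide me on how to modify the Attention class to include this feature.

Thanks,
Nilanjan

Reply
- Jason Brownlee August 15, 2018 at 5:57 am #
  
  You may need to use a different implementation of attention.
  
  Reply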
Md. Zakir Hossain August 14, 2018 at 11:37 pm #

Hi Jason,

Thank you very much for the very helpful post. I also looked at your code for image captioning:

def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

How can we use this attention layer here? I would be very grateful for your help.

Reply
- Jason Brownlee August 15, 2018 at 6:04 am #
  
  Sorry, I do not have an example of attention for image captioning. I do not want to give off-the-cuff advice without developed code to back it up.
  
  Reply
Raza 2018年8月28日下午4:38 #

上述的注意力解码器似乎无法与 keras 2.1.6 或 2.0.0 以上版本一起工作，为什么会这样？

回复
- Jason Brownlee 2018年8月29日上午8:07 #
  
  很抱歉听到这个消息。也许可以联系新层的开发者？
  
  回复
  - Raza 2018年8月29日下午8:32 #
    
    上述方法是否可以用于纠正句子中的上下文错误？
    例如
    输入：fishing is suffering from fever
    预期输出：patient is suffering from fever.
    
    如果不能，您会针对此类问题陈述提出什么建议？
    
    回复
    - Jason Brownlee 2018年8月30日上午6:28 #
      
      也许可以。试试看？
      
      回复
Alice 2018年8月29日下午3:26 #

你好，Jason，

您在这篇文章中使用了随机数列表。

我有一个数字列表（不是随机的）按顺序排列。如何使用我自己的数字作为输入数据来预测下一个数字作为输出？您能举个例子吗？

谢谢！

此致，
Alice

回复
- Jason Brownlee 2018年8月30日上午6:26 #
  
  你到底遇到了什么问题？
  
  回复
Yasir 2018年9月5日上午8:48 #

你好，我想知道您是否有复制机制的例子。谢谢。

回复
- Jason Brownlee 2018年9月5日下午2:41 #
  
  您说的“复制机制”是指什么？
  
  回复
- jackzy 2018年9月7日下午7:25 #
  
  您是指指针网络吗？
  
  回复
victor eloy 2018年9月13日下午11:04 #

一个小建议，如果您将 LSTM 单元替换为 GRU 单元，在添加注意力层后，您将能够获得 100% 的准确率（这非常棒）。

回复
- Jason Brownlee 2018年9月14日上午6:36 #
  
  太棒了！不错的提示。
  
  回复
Gary 2018年9月20日上午1:34 #

您好，我有两个问题

1) 如果我有一个 CNN 接 LSTM，然后我只添加一个注意力层，那么这个架构仍然是编码器-解码器吗？

2) 编码器-解码器模型可以用于序列标记（输出仅为 IOB 标签）吗？如果可以，为什么在命名实体识别等任务中它们不像 LSTM-CRF 那样常用？

回复
- Jason Brownlee 2018年9月20日上午8:05 #
  
  当然。我认为 CNN-LSTM 是一个编码器-解码器。
  
  是的，CNN 和混合模型在序列分类方面表现得非常好。我的新书中有人类活动识别的例子。
  
  此外，它们在文本序列方面也表现得非常好，例如，在情感分析（一项序列分类任务）中达到了最先进水平。
  
  回复
Dave 2018年10月11日上午11:34 #

嗨，Jason，
我很喜欢您关于注意力的帖子，并且成功地运行了您的示例，然后对其进行了修改以使用一些真实世界的数据并取得了有趣的初步结果。当我保存模型然后尝试重新加载它时，遇到了一个问题。似乎它不识别 AttentionDecoder。其他人是否遇到过这种情况？您知道有什么修复方法吗？
谢谢，
Dave

回复
- Jason Brownlee 2018年10月11日下午4:14 #
  
  我没有尝试保存模型，也许它需要特殊的处理才能保存自定义层。
  
  回复
Judd Brau 2018年10月29日下午12:16 #

嗨，Jason，

在这篇文章中，您使用 AttentionDecoder 模型进行 Seq2Seq 学习，但是否可以使用此模型为文本分类获取上下文向量？例如，是否可以使用它将可变长度的 LSTM 输出转换为前馈神经网络的输入？

回复
- Jason Brownlee 2018年10月29日下午2:13 #
  
  我不确定我是否明白，抱歉。也许您可以详细说明一下？
  
  回复
  - Judd brau 2018年10月30日上午8:47 #
    
    当然。这个模型是否可以用来获取一个代表整个文本含义的向量？我不是专家，但当我研究文本分类时，我看到许多论文都在讨论注意力机制，特别是这篇：http://univagora.ro/jour/index.php/ijccc/article/view/3142/pdf
    
    您在这篇文章中展示的模型是否也可以用于此目的？
    
    回复
    - Jason Brownlee 2018年10月30日下午2:10 #
      
      也许可以。抱歉，我没有针对文本分类的带有注意力的 LSTM 的教程。
      
      回复
Jairo 2018年11月17日上午1:25 #

感谢您的帮助，Jason。您认为使用函数式 API 实现注意力而不是使用预构建的层和 Sequential 是可行的吗？

回复
- Jason Brownlee 2018年11月17日上午5:47 #
  
  当然可以。
  
  回复
Zh LM 2018年11月18日下午10:59 #

嗨，Jason，

为什么我们在一个 epoch 训练一个样本，而不是更多样本？

# 训练 LSTM
for epoch in range(5000):
    # 生成新的随机序列
    X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
    # 对此序列进行一个 epoch 的模型拟合
    model.fit(X, y, epochs=1, verbose=2)

回复
- Jason Brownlee 2018年11月19日上午6:46 #
  
  我们正在手动控制训练 epochs。
  
  回复
  - Zh LM 2018年11月19日下午6:50 #
    
    但是当我修改不带注意力的 Seq2Seq 代码时，如下所示
    
    batch_size = 10
    epochs = 10
    
    # 训练 LSTM
    X_data = []
    y_data = []
    for sample in range(5000):
        # 生成新的随机序列
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        X_data.append(X)
        y_data.append(y)
    
    X_data = array(X_data).reshape(5000, X.shape[1], X.shape[2])
    y_data = array(y_data).reshape(5000, X.shape[1], X.shape[2])
    
    model.fit(X_data, y_data, batch_size = batch_size, epochs = epochs, verbose = 2)
    
    我的测试准确率（91.00%）远高于您的（19.00%），
    这是否意味着当您在一个 epoch 训练一个样本时，您的网络训练得不好？
    
    回复
    - Jason Brownlee 2018年11月20日上午6:34 #
      
      也许吧。
      
      回复
Malik 2018年12月14日上午12:12 #

我有一个关于 ATTENTION 的问题，您已经分享了“
多变量时间序列预测与 Keras 中的 LSTM”
https://machinelearning.org.cn/multivariate-time-series-forecasting-lstms-keras/

我的问题是，“ATTENTION”是否比相同的例子中的 LSTM 更好？我是否需要根据 ATTENTION 进行修改？

我只是想更好地理解 ATTENTION。

回复
- Jason Brownlee 2018年12月14日上午5:32 #
  
  或许可以试试并比较结果。
  
  回复
Koon Wai Choong 2018年12月16日下午10:03 #

嗨，Jason，

一如既往的精彩教程！

您能解释一下如何填充我自己的时间序列而不是使用 generate_sequence() 吗？

X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)

我尝试使用我自己的时间序列作为 X, y

谢谢您，先生

此致，

Joe

回复
- Jason Brownlee 2018年12月17日上午6:21 #
  
  也许这会有帮助。
  https://machinelearning.org.cn/convert-time-series-supervised-learning-problem-python/
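作为上述链接思路的一个简化示意，可以用滑动窗口把自己的时间序列切成输入/输出对。下面的函数名 series_to_supervised 和示例数据都是为说明而假设的；若要接入本教程的模型，还需要自行完成独热编码并整理成三维数组。

```python
def series_to_supervised(series, n_in, n_out):
    """用滑动窗口把序列切成 (输入窗口, 输出窗口) 样本对。"""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])                   # 前 n_in 个值作为输入
        y.append(series[i + n_in:i + n_in + n_out])    # 随后 n_out 个值作为输出
    return X, y

# 假设的示例序列
series = [10, 20, 30, 40, 50, 60]
X, y = series_to_supervised(series, n_in=3, n_out=2)
for xi, yi in zip(X, y):
    print(xi, "->", yi)
```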
  
  回复
  - Koon Wai Choong 2018年12月17日下午7:05 #
    
    明白了。谢谢。
    
    回复
Jenna Ma 2018年12月21日下午1:28 #

到目前为止，Keras 库中有注意力吗？
您在这篇文章中说它很快就会出现。🙂
先谢谢您了。

回复
- Jason Brownlee 2018年12月21日下午3:18 #
  
  它似乎还没有（！！！），也许等到 TensorFlow 2.0 发布时。
  
  回复
xiaoxx 2018年12月25日上午7:52 #

嗨，Jason，

当我尝试运行完全相同的代码时，我遇到了这个问题

ValueError: Dimensions must be equal, but are 150 and 50 for ‘AttentionDecoder/MatMul_4’ (op: ‘MatMul’) with input shapes: [?,150], [50,150].

您能告诉我发生了什么吗？

回复
- Jason Brownlee 2018年12月26日上午6:40 #
  
  也许 API 已经更改，并且代码不再适用于最新版本的 Keras？
  
  您使用的是哪个版本的 Keras？
  
  回复
xiaoxx 2018年12月27日上午4:43 #

keras 版本：2.2.0

tensorflow 版本：1.12.0

回复
Zalman 2019年1月2日上午5:14 #

嗨，Jason，
首先，感谢您发表这篇文章！

我有一个简单的网络，并且我正在尝试使用这个 AttentionDecoder，我得到了
“Input 0 is incompatible with layer AttentionDecoder”

我的网络
model = Sequential()

model.add(LSTM(500, input_shape=(None, 145), init="he_normal", return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(500, input_shape=(None, 145), init="he_normal", return_sequences=False))
model.add(Dropout(0.2))

model.add(AttentionDecoder(500, 145, 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

有什么想法吗？

回复
- Jason Brownlee 2019年1月2日上午6:42 #
  
  该层可能不再支持最新版本的 Keras。
  
  回复
  - Zalman 2019年1月2日上午6:43 #
    
    我正在使用 keras 2.1.2，它兼容吗？
    
    回复
    - Jason Brownlee 2019年1月2日上午6:44 #
      
      应该兼容。也许可以尝试 2.xx 范围内的另一个版本。
      
      回复
Zalman 2019年1月2日上午6:56 #

那么您对“Input 0 is incompatible with layer AttentionDecoder”有什么想法吗？

回复
- Jason Brownlee 2019年1月2日上午7:48 #
  
  目前没有，也许可以尝试另一个 Keras 版本？
  
  回复
  - Zalman 2019年1月2日上午8:15 #
    
    您使用的是什么版本？我将尝试相同的版本。
    
    谢谢！
    
    回复
    - Jason Brownlee 2019年1月2日下午12:00 #
      
      2.0.8 或附近的版本。
      
      回复
Kushal Davendra 2019年1月5日下午9:24 #

嗨，Jason，

我正在尝试使用您的注意力网络来学习带注意力的 Seq2Seq 机器翻译。我的源语言输出词汇量为 32,000，目标词汇量为 34,000。以下步骤会耗尽 RAM（可以理解，因为它试图管理一个 34K x 34K 的浮点矩阵）：它失败了，因为它超出了 2G 的 protobuf 限制。

self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                           name='W_o',
                           initializer=self.recurrent_initializer,
                           regularizer=self.recurrent_regularizer,
                           constraint=self.recurrent_constraint)

这是我的模型
n_units:128, src_vocab_size:32000,tar_vocab_size:34000,src_max_length:11, tar_max_length:11

def define_model(n_units, src_vocab_size, tar_vocab_size, src_max_length, tar_max_length):
    model = Sequential()
    model.add(Embedding(src_vocab_size, n_units, input_length=src_max_length, mask_zero=True))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(AttentionDecoder(n_units, tar_vocab_size))
    return model

是否有任何解决方案可以解决向网络添加 Output_dim * Output_dim 变量的 add_weight 步骤？

回复
- Jason Brownlee 2019年1月6日上午10:17 #
  
  也许使用更小的数据样本或尝试渐进式加载？
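上面提到的内存问题可以用一段简单的算术来估计。下面的示意假设权重以 float32（每个值 4 字节）存储，这是一个假设而非已确认的事实；词汇量 34,000 和 2G 的 protobuf 限制均取自上面的评论。

```python
output_dim = 34000        # 目标词汇量，取自上面的评论
bytes_per_float32 = 4     # 假设权重为 float32

# W_o 的形状为 (output_dim, output_dim)
n_weights = output_dim * output_dim
size_bytes = n_weights * bytes_per_float32
size_gib = size_bytes / 1024 ** 3

protobuf_limit = 2 * 1024 ** 3  # 评论中提到的 2G 限制

print("权重个数:", n_weights)
print("约占 GiB:", round(size_gib, 2))
print("超过 2G 限制:", size_bytes > protobuf_limit)
```

按这个估计，单是这一个矩阵就约为 4.31 GiB，因此仅仅减少训练样本可能并不能避开该限制，因为权重大小取决于输出维度而不是样本数量。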
  
  回复
Jenna Ma 2019年1月8日下午6:49 #

太棒的教程！
在输入和输出之间添加 0 以确保 n_timestep 相等非常棒。这对我帮助很大！谢谢！
由于 Keras 删除了 _time_distributed_dense， AttentionDecoder 的开发者已经更新了他的代码，提供了一个 tdd.py。您可能想更新这篇帖子，以便在更高的 Keras 版本上成功使用此教程。🙂

回复
- Jason Brownlee 2019年1月9日上午8:42 #
  
  谢谢提示。
  
  我注意到 Keras 中的内置注意力实现即将发布。也许在下一个 Keras 版本中！
  
  回复
  - NISHANK GARG 2019年2月10日下午9:28 #
    
    感谢您这篇精彩的文章。
    
    请更新这篇关于 Keras 和注意力的帖子。我急需。
    
    回复
    - Jason Brownlee 2019年2月11日上午7:58 #
      
      谢谢。我正在等待 Keras 正式支持注意力。
      
      回复
Kartik Sharma 2019年1月12日上午7:41 #

__init__() 需要 2 个位置参数，但给出了 3 个

请帮忙！
谢谢

回复
- Jason Brownlee 2019年1月13日上午5:37 #
  
  最新的 Keras API 可能不支持此注意力层。
  
  回复
Victor Calle 2019年1月15日上午7:06 #

Jason 你好！您能否解释一下如何获取带注意力的编码器-解码器模型中编码器部分的结果？

回复
Navneet Singh 2019年1月22日上午5:31 #

嗨，Jason，
我在“attention_decoder.py”文件的第 158 行收到一个错误

行
self.b_p = self.add_weight(shape=(self.units, ),
                           name='b_p',
                           initializer=self.bias_initializer,
                           regularizer=self.bias_regularizer,
                           constraint=self.bias_constraint)

错误如下：
ValueError: Dimensions must be equal, but are 150 and 50 for 'AttentionDecoder/MatMul_4' (op: 'MatMul') with input shapes: [?,150], [50,150].

您能否帮助我解决这个错误，这将非常有帮助。
提前感谢。

回复
nandini 2019年1月25日下午6:26 #

我的聊天机器人应用程序有一个使用 rnn 的需求，（即）在对大量数据进行训练后，对于聊天机器人应用程序，我们需要记住之前的对话，至少是当前对话之前的 3 句话。

有可能吗？如果可能，请就此需求提供建议，如何进一步实现这一目标。

请提供任何与此需求相关的链接或文章。

先谢谢了

回复
- Jason Brownlee 2019年1月26日上午6:11 #
  
  抱歉，我没有关于聊天机器人的教程。我无法给您好的建议。
  
  回复
Maddy 2019年2月19日上午7:27 #

Jason 你好，非常感谢您的帖子！太棒了。

当我尝试使用相同的注意力模型来预测单变量多步时间序列时，例如，使用 [1, 2, 4, 2, 3] 来预测 [2, 4, 2, 3, 6]，预测输出全为 1 [1,1,1,1,1]。您知道我该如何修复模型吗？（因为这是一个时间序列问题，我在数据准备期间没有像您帖子中列出的那样进行独热编码。）谢谢！

model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

回复
- Jason Brownlee 2019年2月19日上午7:29 #
  
  也许注意力层在最新版本的 Keras 中不受支持？
  
  回复
Max Power 2019年3月2日下午12:57 #

嗨，Jason，

您的页面确实很棒，感谢您为此付出的所有努力。特别是这篇文章非常有帮助。
我仍然在为某事而苦恼：如何更改 TimeDistributionLayer 以考虑输出中的序列任意性？我想使用 Jaccard 距离，因此不关心 output_elements 的顺序，只要它们都在其中。

我想做类似下面的事情，但没有成功
model.add(Dense(2D(n_timeSteps_out ,n_Features), activation=’relu, axis=0′))
model.add(Dense(n_Features , activation=’softmax’))

感谢您在此处付出的所有努力！

祝好，
Max

回复
- Jason Brownlee 2019年3月3日上午7:57 #
  
  在使用此注意力实现时，输入和输出的数量必须匹配。
  
  也许可以尝试直接使用编码器-解码器，然后更改 RepeatVector 中的值来更改输出步数？
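“输入和输出步数必须匹配”这一点，在本教程的测试问题中是通过用 0 填充输出序列来满足的。下面是一个纯 Python 的最小示意；pad_target 是为说明而引入的假设性辅助函数，并非教程原有代码。

```python
from random import randint

def generate_sequence(length, n_unique):
    """生成指定长度的随机整数序列。"""
    return [randint(0, n_unique - 1) for _ in range(length)]

def pad_target(sequence_in, n_out):
    """取前 n_out 个值作为目标，其余位置补 0，使输出与输入等长。"""
    return sequence_in[:n_out] + [0] * (len(sequence_in) - n_out)

sequence_in = generate_sequence(5, 50)
sequence_out = pad_target(sequence_in, 2)
print(sequence_in, "->", sequence_out)

# 评论中出现过的例子
print(pad_target([29, 9, 47, 0, 12], 2))  # [29, 9, 0, 0, 0]
```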
  
  回复
zied 2019年3月21日上午3:19 #

当我尝试运行代码时，我遇到了这个错误
TypeError: The added layer must be an instance of class Layer. Found:
而且我找不到解决方案。
感谢您的帮助。

回复
- Jason Brownlee 2019年3月21日上午8:20 #
  
  很抱歉听到这个消息，我没有见过这个错误。
  
  也许可以确认您的 Keras 库是最新的，并且您复制了所有代码？
  
  回复
  - zied 2019年3月21日下午7:29 #
    
    非常感谢您的回复，我使用的是旧版本的 keras 2.0.8，新版本会产生一个错误，因为我无法导入 _time_distributed_dense。
    我将注意力层添加到了您的机器翻译代码中
    
    def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    
        model = Sequential()
        model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
        model.add(LSTM(n_units))
        model.add(AttentionDecoder(n_units, tar_vocab))
        return model
    
    回复
    - saria 2019年3月29日上午11:25 #
      
      这是因为新版本的 Keras 不再支持 _time_distributed_dense。
      
      您可以用这个 StackOverflow 解决它
      
      https://stackoverflow.com/questions/45631235/importerror-cannot-import-name-timedistributeddense-in-keras
      
      回复
      - Olatunji Omisore 2020年11月25日下午3:54 #
        
        你好，
        
        请问您是如何解决这个问题的？我遇到了同样的问题，并且一段时间以来无法继续进行项目。
        
        谢谢
saria 2019年3月29日上午11:23 #

谢谢，Jason，感谢您发布的精彩帖子。
您能否解释一下，我们是根据什么逻辑来选择 n_timesteps_out 的？
请举一个真实的例子来说明这个数字的来源。

谢谢！

回复
- Jason Brownlee 2019年3月29日下午2:02 #
  
  是的，我有几十个例子，您可以从这里开始
  https://machinelearning.org.cn/start-here/#deep_learning_time_series
  
  回复
saria 2019年3月29日上午11:27 #

我的意思是不同的 n_timesteps_out 值，例如您在这里选择了 2，在什么情况下我们可以选择不同的数字？

回复
- Jason Brownlee 2019年3月29日下午2:03 #
  
  您可以进行敏感性分析，以发现最适合您的特定模型和数据集的方法。
  
  回复
saria 2019年3月29日上午11:43 #

如果您能给我一个除了翻译和文本生成之外的实际应用场景，那将非常棒。据我所知，在翻译中，我们将根据另一种语言中的每个句子在第一种语言中对应的句子来构建seq_out。
在文本生成中，我认为我们应该将相同的seq_in提供给seq_out。

但是，既然您选择了timestamp_out=2，我想知道您为什么这样做？特别是在实际应用场景中，我们可能会选择timestamp_out=1、timestamp_out=2、timestamp_out=3等等，无论哪种都可以。

回复
saria 2019年4月11日下午2:11 #

嗨，Jason，
我在这里有一个问题。我的模型如下，没有使用注意力层。

inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)

将它改为下面的代码来嵌入注意力层到我的模型中是否有意义？

inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
attention = AttentionDecoder(LATENT_SIZE, n_features)
autoencoder = Model(inputs, attention)

它抱怨缺少一个参数！

谢谢你的帮助

回复
- Jason Brownlee 2019年4月11日下午2:22 #
  
  抱歉，我无法调试您的代码，这可能会有帮助。
  https://machinelearning.org.cn/faq/single-faq/can-you-read-review-or-debug-my-code
  
  回复
ben kubi 2019年4月17日上午1:41 #

你好
我尝试运行这段代码，但总是遇到这个错误：
ImportError: cannot import name ‘_time_distributed_dense’

回复
- Jason Brownlee 2019年4月17日上午7:03 #
  
  我相信它不再支持最新的Keras版本了。
  
  回复
NookLook 2019年4月17日上午10:33 #

嗨，Jason，
在处理具有可变时间步长的输入时，我可以修改输入，例如：
input = Input(shape=(None, n_features)), 然后接着
encoded = LSTM(….)(input)

但是下一行的重复应该怎么做呢？
decode = RepeatVector(???)(encoded)

我尝试设置None和shape[1]，但都没有成功。

回复
- Jason Brownlee 2019年4月18日上午8:45 #
  
  也许可以试试这个教程
  https://machinelearning.org.cn/how-to-develop-lstm-models-for-time-series-forecasting/
  
  回复
Antonio 2019年5月1日上午12:44 #

尊敬的Jason博士，

您认为带有注意力的LSTM编码器-解码器模型在时间序列预测（飞机燃油消耗预测）方面是否有潜力，该模型使用了多变量输入（8个传感器数据变量）和单变量多步输出（未来X个时间步的燃油消耗）？

祝好

回复
- Jason Brownlee 2019年5月1日上午7:07 #
  
  也许可以从一个标准的LSTM模型，甚至是一个线性模型开始，然后逐步深入。
  
  回复
Alexey 2019年5月2日上午2:05 #

在自编码器中使用注意力（2014年论文）是否是作弊？因为解码器将知道所有编码器的状态，并且可以以100%的准确率进行解码，瓶颈将毫无用处。我说的对吗？

回复
- Jason Brownlee 2019年5月2日上午8:06 #
  
  我认为没有。
  
  回复
Asha 2019年5月10日上午12:14 #

嗨，Jason，
感谢您提供精彩的教程！我正在尝试使用它来解决文本纠错问题，即输入是一个错误的句子，输出是正确的句子。
在典型的编码器-解码器架构中，我知道编码器的单元状态必须传递给解码器。我不确定这是否在这个模型中发生。
您能确认一下吗？

回复
- Jason Brownlee 2019年5月10日上午8:18 #
  
  这听起来是个有趣的问题。
  
  也许可以将您的方法与文献中其他人描述的方法进行比较？
  
  回复
Adam Oudad 2019年5月11日上午2:17 #

嗨，感谢这个教程，

为什么不带注意力的编码器-解码器无法正确预测序列中的第二个整数？这是梯度消失问题吗？我不认为标准的LSTM在如此简单的自编码器应用中表现如此糟糕（准确率只有20%）。

感谢任何建议。

回复
- Jason Brownlee 2019年5月11日上午6:18 #
  
  这个问题被设计成对编码器-解码器模型来说很难，而对于带有注意力的相同模型来说则很容易。
  
  这可能是许多原因之一，例如容量不足。
  
  回复
Mayra 2019年5月11日上午3:28 #

嗨，Jason，

非常感谢您的博文。它非常有帮助。您能否就以下事项给我您的意见？是否有可能在Keras中开发一个带有注意力的LSTM自编码器模型来重建输入？关于如何调整示例中演示的方法，是否有任何提示？

提前感谢，

回复
- Jason Brownlee 2019年5月11日上午6:19 #
  
  是的，请看这篇文章
  https://machinelearning.org.cn/lstm-autoencoders/
  
  回复
Alexander 2019年5月14日上午10:29 #

嗨，Jason，
非常感谢您提供如此精彩的教程。
我想实现一个带有注意力的编码器-解码器模型并使用teacher forcing。 Francois Chollet实现了一个seq2seq模型，包含1个编码器输入、1个解码器输入和1个解码器目标序列，其中解码器的两个序列相差一个时间步（teacher forcing）。
据我所知，GRU默认使用teacher forcing（Bahdanau et al (2015), p.13, A.2.2）。
我对于自定义层是否使用真实y值来条件化预测y感到困惑。
在您的注意力模型中，AttentionDecoder在每个时间步的LSTM层中的每个单元接收一个编码值，即150 x 5，因为return_sequence=True是硬编码的。
第47行：model.add(AttentionDecoder(150, n_features))
在Zafarali Ahmed的自定义层代码中，我猜这些编码序列在call(self, x)定义中的第200行被保存在了cell中，代码如下：
self.x_seq = x
对吗？
在step function定义中，我发现了y的第一个提示。
第227行 ytm, stm = states
我注意到y值是如何导入的，但这些是在同一个Recurrent cell（第67行 self.states = [None, None] # y, s）内部构建的。
因此，我找不到任何地方导入了真实值。只有cell本身的预测值被用于step function（第278行）。这是正确的吗？
我的方法是，在call函数中用编码序列和真实值（但偏移一个时间步）的列表替换x。您怎么看？

回复
- Jason Brownlee 2019年5月15日上午8:16 #
  
  您必须实现teacher forcing，它不是自动提供的。
  
  您可以在这里了解更多关于teacher forcing的信息：
  https://machinelearning.org.cn/teacher-forcing-for-recurrent-neural-networks/
  
  我经常推荐使用基于自编码器的编码器-解码器用于LSTM，您可以使用这里的动态RNN方法。
  https://machinelearning.org.cn/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
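就“teacher forcing 需要自己实现”这一点而言，核心是把真实目标序列右移一个时间步作为解码器输入。下面是一个不依赖 Keras 的最小示意；函数名 make_decoder_pair 和起始标记 0 都是为说明而假设的。

```python
def make_decoder_pair(target, start_token):
    """构造 teacher forcing 所需的解码器输入和目标。

    解码器输入 = 起始标记 + 目标序列去掉最后一个元素，
    因此在时间步 t，解码器看到的是 t-1 步的真实值。
    """
    decoder_input = [start_token] + target[:-1]
    decoder_target = list(target)
    return decoder_input, decoder_target

# 假设的整数编码目标序列
target = [4, 5, 22, 5]
decoder_input, decoder_target = make_decoder_pair(target, start_token=0)
print("解码器输入:", decoder_input)   # [0, 4, 5, 22]
print("解码器目标:", decoder_target)  # [4, 5, 22, 5]
```

训练时把这两个序列分别作为解码器的输入和期望输出；推理时没有真实值可用，只能逐步把模型自己的预测回填为下一步的输入。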
  
  回复
Bernardo 2019年5月28日下午8:08 #

你好 Jason，很棒的教程！

我想请教您一个建议。我正在为我的机器学习论文学习。
基本上，我有单词，每个单词的字符都用一个整数表示。
例如，我有单词：“develop”，它表示为序列：[4 5 22 5 12 15 16]。
我正在训练一个循环神经网络，它以序列“develo”作为输入，并预测下一个字符“p”。我尝试使用您的注意力层，所以将[4 5 22 5 12 15]作为X，将[16 0 0 0 0]作为y。在这种情况下，准确率非常低，只有25%，这取决于数据集的大小；但我从未获得过高结果。也许我没有正确使用注意力层。
所以，我正在训练RNN，它以序列[4 5 22 5 12 15 16]作为X，以序列[0 0 0 0 0 0 16]作为y。现在，准确率非常高，但我认为这是因为我出现了过拟合。
您认为注意力层在我的情况下可以使用得当吗？如何使用？
谢谢！

回复
- Jason Brownlee 2019年5月29日上午8:41 #
  
  不需要注意力，我认为这篇帖子会帮到您：
  https://machinelearning.org.cn/develop-character-based-neural-language-model-keras/
  
  回复
Jack 2019年6月4日下午1:19 #

嗨，Jason，
感谢您提供如此简洁的教程，我有一些问题想请教您，我想知道AttentionDecoder是否可以在CNN-LSTM编码器-解码器模型中使用，您的教程中的示例（https://machinelearning.org.cn/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/）是吗？我想知道如何使用注意力来改进CNN-LSTM模型，请给我一些详细的说明，谢谢。

回复
- Jason Brownlee 2019年6月4日下午2:26 #
  
  可能吧。抱歉，我无法为您准备示例。
  
  回复
Alex 2019年6月21日上午7:59 #

嗨 Jason！很棒的教程，谢谢！

正如Zh LM所说：
选择合适的批次大小或其他参数通常会提高准确率。因此，在这个简单的例子中，注意力的作用更多是加快收敛，而不是提高最终准确率。

Encoder-Decoder Model
Train on 4500 samples, validate on 500 samples
…….
Epoch 150/150
4500/4500 [==============================] - 3s 563us/step - loss: 9.9794e-05 - acc: 1.0000 - val_loss: 0.0067 - val_acc: 0.9976
100.0
Mean Accuracy: 100.00%

Encoder-Decoder With Attention Model
Train on 4500 samples, validate on 500 samples
....
Epoch 150/150
4500/4500 [==============================] - 3s 742us/step - loss: 1.8149e-05 - acc: 1.0000 - val_loss: 0.0021 - val_acc: 0.9992
100.0
Mean Accuracy: 100.00%

回复
- Jason Brownlee 2019年6月21日下午2:01 #
  
  谢谢。
  
  回复
Saichand 2019年7月17日下午4:14 #

嗨，Jason，

这是一个很棒的教程。我注意到keras.layers.recurrent不再工作了。新的解决方案是什么？当我在lstm层之后使用AttentionDecoder(256, 300)时，我遇到了这个错误：TypeError: __init__() missing 1 required positional argument: 'cell'。

回复
- Jason Brownlee 2019年7月18日上午8:20 #
  
  您可能需要使用此代码配合旧版本的Keras。
  
  回复
  - Saichand 2019年7月19日下午8:33 #
    
    我们能使用注意力层来识别重复/非重复句子吗？如果可以，怎么做？
    我目前为每个句子使用lstm层，然后将它们连接起来，再将连接层通过密集层给出预测。现在我想使用注意力层来改进我的预测。我应该在哪里以及如何使用注意力层？
    请帮助我。
    
    回复
    - Jason Brownlee 2019年7月20日上午10:52 #
      
      您可以使用一个普通的python程序，用一个if语句来检测重复的句子。
      
      回复
  - jorge 2021年9月20日上午11:52 #
    
    抱歉，您不能使用RNN代替recurrent。
    
    回复
joyce 2019年7月22日下午12:01 #

你好Jason。
在时间序列预测问题上，我想使用多注意力编码器-解码器模型。
但是，正如您上面实现的，我可以在我的编码器模型之前使用注意力层吗？
因为我首先想检查我的字符哪个更重要，然后我将在我的编码器模型之后和解码器模型之前使用注意力层。

所以，我的问题是。我不知道哪个是正确的。请帮帮我。谢谢。

回复
- Jason Brownlee 2019年7月22日下午2:07 #
  
  也许可以从这里描述的更简单的模型开始。
  https://machinelearning.org.cn/how-to-develop-lstm-models-for-time-series-forecasting/
  
  回复
Pranjal 2019年7月26日下午4:13 #

Jason，这个教程虽然非常有帮助，但现在已经很老了（2017年）。如果您有时间，能否使用tensorflow.keras层制作一个更新的教程，其中您使用TensorFlow的注意力实现？因为我在任何地方都找不到面向初学者的教程。另外，由于我的应用程序是用于生产目的的，使用可能存在漏洞的过时软件包确实无济于事。谢谢。

回复
- Jason Brownlee 2019年7月27日上午6:06 #
  
  我希望在Keras官方支持注意力时编写新的教程。
  https://github.com/keras-team/keras/pull/11421
  
  回复
joyce 2019年8月5日下午11:43 #

https://github.com/andhus/keras/pull/6/files#diff-b4e22ccac72c2e1c47c8ea1ad67cf592
这是最新的。

回复
- Jason Brownlee 2019年8月6日上午6:39 #
  
  我正在关注这里。
  https://github.com/keras-team/keras/pull/11421
  
  回复
Kevin 2019年8月24日上午5:40 #

极好的文章，谢谢！

回复
- Jason Brownlee 2019年8月24日上午8:02 #
  
  谢谢Kevin。
  
  回复
Anurag 2019年9月29日上午12:25 #

嗨Jason Brownlee，您的文章非常有用，但在执行相同的代码时（当我使用keras 2.0.8时），我遇到了这个错误：

model.add(AttentionDecoder(150, n_features))
Traceback (most recent call last):

File "", line 1, in
model.add(AttentionDecoder(150, n_features))

TypeError: __init__() missing 1 required positional argument: 'cell'

回复
- Jason Brownlee 2019年9月29日上午6:13 #
  
  抱歉，我不知道错误的根本原因。
  
  也许可以尝试发布到 stackoverflow？
  
  回复
Edgar 2019年11月8日上午10:20 #

嗨Jason，好文章。

我有一个关于注意力的疑问，并且在哪里都找不到答案。也许我理解错了。

我读了很多关于使用LSTM/GRU与注意力相结合的文章，它们都将输入视为序列x1, x2, x3等。其中，x2在x1之后，x3在x2之后，依此类推。

我的疑问是，如果输入不是序列，而是值集，并且它们的顺序无关紧要，那么注意力将如何工作？

例如，集合{1,5,10,15}与{10,1,15,5}表示相同的含义，显然对于这两种情况，输出（y）都是相同的。

对于第一个集合，假设注意力表明第二个位置的元素5是最重要的，那么对于第二个集合，这个结果是否会相同？（最后一个位置的元素5是最重要的）。

注意力能处理这个问题吗？

感谢您的时间。

回复
- Jason Brownlee 2019年11月8日下午1:49 #
  
  注意力假定输入是序列。否则，它真的没有意义。
  
  如果顺序不重要，您会在正常神经网络的加权输入中获得类似注意力的行为。
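下面用一个简化的示意来说明“权重跟随元素而不是位置”的情形。这里假设打分只取决于每个元素自身的值，没有循环编码器；score 函数是为说明而虚构的，并不是本教程中 AttentionDecoder 的做法。在本教程的模型里，编码器是 LSTM，每个时间步的输出依赖于之前的输入，因此顺序会影响结果。

```python
import math

def softmax(scores):
    """把打分转换为总和为 1 的权重。"""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score(value):
    """假设的打分函数：值越接近 5，打分越高。"""
    return -abs(value - 5)

a = [1, 5, 10, 15]
b = [10, 1, 15, 5]   # 与 a 元素相同，顺序不同

wa = softmax([score(v) for v in a])
wb = softmax([score(v) for v in b])

# 元素 5 在两种排列下得到的权重
w5_a = wa[a.index(5)]
w5_b = wb[b.index(5)]
print(round(w5_a, 4), round(w5_b, 4))
print("a 中权重最大的位置:", wa.index(max(wa)))
print("b 中权重最大的位置:", wb.index(max(wb)))
```

在这种简化设定下，元素 5 无论排在第二位还是最后一位，都会得到同样的权重，最大权重的位置随元素移动。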
  
  回复
jorge 2019年11月21日上午2:21 #

嗨 Jason

再次感谢您的辛勤工作，是否可以在另一个时间序列数据集（如空气污染）上使用注意力，例如预测PM2.5？

回复
- Jason Brownlee 2019年11月21日上午6:09 #
  
  我看不出为什么不。
  
  回复
Xu Zhang 2019年12月3日上午11:39 #

非常感谢您的精彩文章。

如果您能发布一篇关于自注意力（self-attention）的教程，不仅适用于序列数据，也适用于图像分类和其他应用，那将非常有帮助。非常感谢！

回复
- Jason Brownlee 2019年12月3日下午1:34 #
  
  感谢您的建议！
  
  回复
Andrew 2019年12月3日下午1:26 #

嗨Jason，我是Keras的新手，我在网上发现，在自定义层（AttentionDecoder）中，call()应该包含所有张量计算，但在您的示例中，step接管了call的功能，您能否给我一些解释，非常感谢。

回复
- Jason Brownlee 2019年12月3日下午1:37 #
  
  是的，我认为这个代码示例现在有点过时了。
  
  回复
  - Andrew 2019年12月3日下午2:14 #
    
    非常感谢，我工作了两年了，您的博客一直激励着我。
    
    回复
    - Jason Brownlee 2019年12月4日上午5:28 #
      
      谢谢。
      
      回复
juntay 2019年12月13日下午9:24 #

您好，我是一名学生，很高兴看到您的帖子。我的问题场景是根据一系列天气类别（one-hot 编码）来预测当前的光伏发电值（浮点类型）。您的模型可以修改以适应我的问题吗？如果可以，如何进行更改？

回复
- Jason Brownlee 2019年12月14日上午6:19 #
  
  也许可以从这里的模型开始。
  https://machinelearning.org.cn/how-to-develop-lstm-models-for-time-series-forecasting/
  
  回复
  - juntay 2019年12月16日下午1:30 #
    
    谢谢，Jason。★
    
    回复
    - Jason Brownlee 2019年12月16日下午1:39 #
      
      不客气。
      
      回复
Divesh 2019年12月13日下午10:12 #

你好 jason,
极好的文章，您能否撰写一篇关于某些最新编码器-解码器架构中使用的复制机制（copy net）的文章？

回复
- Jason Brownlee 2019年12月14日上午6:19 #
  
  感谢您的建议。
  
  回复
mohammadreza 2019年12月15日上午9:16 #

嗨，Jason，
我想在我的代码中添加解码注意力，但我不知道怎么做。您能帮帮我吗？
我的代码在这里：
from __future__ import print_function

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

batch_size = 64 # 训练的批次大小。
epochs = 100 # 训练的轮数。
latent_dim = 256 # 编码空间的潜在维度。
num_samples = 10000 # 训练的样本数。
# 数据在磁盘上的txt文件路径。
data_path = 'fra-eng/fra.txt'

# 矢量化数据。
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    # 我们使用“制表符”作为目标的“开始序列”字符，
    # 而“\n”作为“结束序列”字符。
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

input_token_index = dict(
[(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
[(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data 比 decoder_input_data 提前一个时间步
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data 将提前一个时间步
            # 并且不包含开始字符。
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
# 定义输入序列并处理。
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# 我们丢弃encoder_outputs，只保留状态。
encoder_states = [state_h, state_c]

# 设置解码器，使用encoder_states作为初始状态。
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# 我们将解码器设置为返回完整的输出序列，
# 并返回内部状态。我们在训练模型中不使用
# 返回状态，但在推理时会使用它们。
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# 定义将
# encoder_input_data & decoder_input_data 转换为 decoder_target_data 的模型
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# 运行训练
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# 保存模型
model.save('s2s.h5')

# 下一步：推理模式（采样）。
# 以下是步骤
# 1) 编码输入并检索初始解码器状态
# 2) 使用此初始状态运行一步解码器
# 并将“序列开始”标记作为目标。
# 输出将是下一个目标标记
# 3) 使用当前目标标记和当前状态重复

# 定义采样模型
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
[decoder_inputs] + decoder_states_inputs,
[decoder_outputs] + decoder_states)

# 反向查找标记索引以将序列解码回
# 可读内容。
reverse_input_char_index = dict(
(i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
(i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # 将输入编码为状态向量。
    states_value = encoder_model.predict(input_seq)

    # 生成长度为 1 的空目标序列。
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # 使用开始字符填充目标序列的第一个字符。
    target_seq[0, 0, target_token_index['\t']] = 1.

    # 对一批序列进行采样循环
    # (为简化起见，此处假设批次大小为 1)。
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # 采样一个标记
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # 退出条件：达到最大长度
        # 或找到停止字符。
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # 更新目标序列（长度为 1）。
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # 更新状态
        states_value = [h, c]

    return decoded_sentence

for seq_index in range(100):
    # 提取一个序列（训练集的一部分）
    # 用于尝试解码。
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('输入句子:', input_texts[seq_index])
    print('解码句子:', decoded_sentence)

谢谢你

回复
- Jason Brownlee 2019年12月16日上午6:05 #
  
  这是我在这里回答的一个常见问题
  https://machinelearning.org.cn/faq/single-faq/can-you-read-review-or-debug-my-code
  
  回复
Mina 2019年12月17日上午2:32 #

大家好，我是机器学习新手，我对 epoch 的使用感到困惑。这两种方法有什么区别？
for epoch in range(5000):
    # 生成新的随机序列
    X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
    # 对此序列进行一个 epoch 的模型拟合
    model.fit(X, y, epochs=1, verbose=2)
或者
model.fit(X,y,epochs=5000,verbose=2)

回复
- Jason Brownlee 2019年12月17日上午6:38 #
  
  一次遍历训练数据 vs 5K 次遍历训练数据。
  
  关于 epoch 的更多信息请参见此处
  https://machinelearning.org.cn/difference-between-a-batch-and-an-epoch/
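这两种写法的差别可以用一个不依赖 Keras 的小实验来体会：循环写法每次迭代都生成一个新的随机样本，而单次 fit 写法是在同一个样本上重复 5000 遍。下面只统计模型“看到”的样本，不做任何训练；随机种子和序列参数都是为说明而假设的。

```python
from random import randint, seed

def generate_sequence(length, n_unique):
    """生成指定长度的随机整数序列。"""
    return [randint(0, n_unique - 1) for _ in range(length)]

seed(1)  # 固定种子，便于复现
n_iterations = 5000

# 做法一：每次迭代生成一个新样本，各训练一个 epoch
seen_loop = [tuple(generate_sequence(5, 50)) for _ in range(n_iterations)]

# 做法二：只生成一个样本，在其上训练 5000 个 epoch
fixed = tuple(generate_sequence(5, 50))
seen_fixed = [fixed for _ in range(n_iterations)]

print("做法一看到的不同样本数:", len(set(seen_loop)))
print("做法二看到的不同样本数:", len(set(seen_fixed)))
```

做法一相当于在 5000 个不同样本上各更新一次，做法二则是在单个样本上反复更新，后者很容易只记住这一个样本。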
  
  回复
Jenna 2020年1月8日下午8:32 #

Jason，您好。抱歉再次打扰您。
我测试了这段注意力代码，用于我的 seq2seq 预测项目，该项目输入和输出都是浮点数据，但效果很差。我猜想这也许是专门为 one-hot 编码数据设计的？
您认为注意力机制是否可以在理论上改进 LSTM 编码器-解码器模型在多步时间序列预测中的性能？例如，本博文所示的 LSTM 编码器-解码器模型：https://machinelearning.org.cn/how-to-develop-lstm-models-for-time-series-forecasting/

谢谢！

回复
- Jason Brownlee 2020年1月9日上午7:25 #
  
  我没有试过，抱歉。
  
  回复
  - Jenna 2020年1月10日下午6:17 #
    
    尊敬的Jason博士，
    感谢您的回复。
    我阅读了一些关于时间序列预测中注意力机制的文章，我认为注意力机制在自然语言处理中的成功启发研究人员将其应用于时间序列预测问题。期待您能写一篇关于这个领域的博文。
    感谢您发布的精彩博文！
    
    回复
    - Jason Brownlee 2020年1月11日上午7:21 #
      
      谢谢。
      
      回复
Jay 2020年1月11日上午9:53 #

嗨，Jason，

我遇到了一个错误，例如：在最初两次迭代中，图无法按拓扑顺序排序，但训练仍在继续。我认为这可能会破坏整个过程。您知道是什么原因吗？

回复
- Jason Brownlee 2020年1月12日上午7:58 #
  
  抱歉，我不太清楚。我认为注意力层不再适用于最新版本的代码库。
  
  回复
Erin 2020年1月22日下午10:10 #

嗨，Jason，

谢谢这个很棒的教程。🙂
不过我有一个问题：我是否正确地认为我们不需要为模型添加激活函数，因为 AttentionDecoder 中已经包含了“tanh”？

谢谢，祝您有美好的一天！
Erin

回复
- Jason Brownlee 2020年1月23日上午6:33 #
  
  我相信注意力机制使用的是带有内部激活的模型。
  
  回复
Alireza 2020年3月31日上午6:00 #

嗨 Jason，

首先，非常感谢您提供全面且有用的教程。请继续保持。

我有一个问题：可以在回归问题（时间序列预测）中使用注意力机制吗？

如果可以，怎么做？它应该只以编码器-解码器的形式出现吗？
特别是现在有了 tf.keras.attention，我们也可以使用这个内置层吗？

谢谢，保持安全

回复
- Jason Brownlee 2020年3月31日上午8:19 #
  
  不客气。
  
  当然，试试看。我现在没有这方面的例子。
  
  回复
najeh 2020年4月6日上午11:14 #

嗨，Jason，
感谢您的精彩教程！我想知道 LSTM 中的注意力机制与 seq2seq 模型中的注意力机制有什么区别？
谢谢你。

回复
- Jason Brownlee 2020年4月6日下午1:32 #
  
  相同的注意力方法可以用于不同的模型架构。
  
  回复
Natalko 2020年4月17日上午5:38 #

我有一些问题。我尝试运行您的代码进行时间序列预测。
我试图根据单词的先前出现次数来预测单词的流行度/出现次数（仅时间序列预测）。
模型应该是通用的。一个模型用于所有单词。但即使我尝试为每个单词单独运行模型，模型仍然对每次预测返回“1”。

我已去除趋势并应用了监督学习（x-1 , x ）。
我还将时间序列分割成样本，每个样本的长度为 5。

这是我的模型代码。也许我应该添加一些其他参数？

提前感谢！

i = Input(shape=(samples_train.shape[1],samples_train.shape[2]), dtype='float32')
enc = Bidirectional(GRU(150, return_sequences=True), merge = 'concat')(i)
dec = AttentionDecoder(150,samples_train.shape[2])(enc)
model = Model( inputs = i , outputs = dec )
model.compile(loss='mse', optimizer='adam')

回复
- Jason Brownlee 2020年4月17日上午6:28 #
  
  也许可以尝试其他问题设置、数据预处理、模型、模型配置和训练配置？
  
  回复
gloria 2020年4月21日下午8:49 #

嗨，Jason，
感谢您的教程！我想知道这是用于 Tensorflow 2.0 还是 Tensorflow 1.0？
谢谢你。

回复
- Jason Brownlee 2020年4月22日上午5:54 #
  
  所有示例都适用于 TF 2。
  
  回复
Vinay Kumar 2020年5月3日下午4:33 #

嗨，Jason，

很棒的文章，谢谢。

我对 RNN 非常陌生。我正在构建一个二元分类模型，我想知道需要进行哪些更改才能实现该结果。我相信在我的情况下不需要 one-hot 编码。本质上，我可能需要输入自己的数据，而不是随机序列生成器。我通过阅读 https://machinelearning.org.cn/how-to-develop-lstm-models-for-time-series-forecasting/ 开始构建 RNN 分类器
并参考 https://machinelearning.org.cn/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/ 将损失函数更新为 Binary Cross-Entropy。

普通的 RNN 模型可以工作，我尝试添加注意力层，但遇到了维度不匹配的错误。
ValueError: (‘Error when checking target: expected AttentionDecoder to have 3 dimensions, but got array with shape (70, 1)’, ‘occurred at index 0’)
我感觉这与编码有关，但不确定。
您能帮我弄清楚吗？

谢谢

回复
- Jason Brownlee 2020年5月3日下午5:11 #
  
  很高兴它有帮助。
  
  您可以查看此处的适用于时间序列分类的 LSTM 示例，您可以根据需要进行改编
  https://machinelearning.org.cn/start-here/#deep_learning_time_series
  
  回复
AR 2020年5月5日上午4:59 #

嗨 Jason

现在 TensorFlow 2.1 中提供了注意力机制。您能否准备一个关于如何使用它的教程？

回复
- Jason Brownlee 2020年5月5日上午6:35 #
  
  好建议！
  
  回复
Khushbu 2020年5月7日下午10:54 #

你好 Jason，

感谢您提供的出色教程。
我想使用基于注意力的编码器-解码器模型，例如 tf.keras。TensorFlow 具有 AdditiveAttention 层（https://tensorflowcn.cn/api_docs/python/tf/keras/layers/AdditiveAttention?version=nightly）。

我按照教程实现了编码器和解码器（https://blog.keras.org.cn/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html）。

现在，我想在编码器状态和解码器状态之间添加 AdditiveAttention。
根据文档，我需要将解码器状态作为查询，将编码器状态作为值传递给 AdditiveAttention()([decoder_states, encoder_States])，它将根据分布在 encoder_states 上的权重返回上下文向量。

那么如何将这三个步骤连接起来呢？

您能否提供一个基于 AdditiveAttention 的小型教程？

谢谢你。

回复
- Jason Brownlee 2020年5月8日上午6:35 #
  
  不客气。
  
  干得好！
  
  我将写一篇关于这个主题的教程并弄清楚。
  
  回复
Alejandro Oñate 2020年5月14日上午4:58 #

它看起来很棒而且易于使用。

但是，这个例子说明的是一个自编码器风格的模型。如何将其应用于经典的序列到序列解码器-编码器模型？

我使用了类似的模型（但有多层）
https://machinelearning.org.cn/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

谢谢！

回复
- Jason Brownlee 2020年5月14日上午5:58 #
  
  我希望将来能介绍它。
  
  回复
ARJUN 2020年6月3日下午9:23 #

导入注意力解码器时出现此错误

from keras.layers.recurrent import Recurrent, _time_distributed_dense
ImportError: cannot import name ‘_time_distributed_dense’ from ‘keras.layers.recurrent’ (

请帮忙！

回复
- Jason Brownlee 2020年6月4日上午6:20 #
  
  我认为此教程需要旧版本的 tensorflow。
  
  回复
Rakesh 2020年6月10日上午12:05 #

你好 Jason，

我稍微修改了您的实现，以便处理不同的输入和输出词汇量大小。您建议使用 repeat vector 层来解决不同的输入和输出大小问题吗？

inputs=Input(shape=(inputSequenceLength,))
embedding=Embedding(input_dim=inputVocabSize,output_dim=embedding_dim,embeddings_initializer=Constant(embeddings_initializer),input_length=input_length,trainable=trainable)(inputs)
x=Bidirectional(LSTM(units=128,return_sequences=True))(embedding)
x=LSTM(units=128)(x)
x=RepeatVector(outputSequenceLength)(x)
outputs=AttentionDecoder(128, outputVocabSize)(x)
model = Model(inputs=inputs, outputs=outputs)

谢谢 Rakesh

回复
- Jason Brownlee 2020年6月10日上午6:17 #
  
  干得不错。
  
  我认为自编码器方法对于编码器-解码器来说更容易理解/实现，并且同样有效。
  
  回复
Arindam Mondal 2020年6月12日下午4:00 #

非常好的描述。我有一个疑问：您的 LSTM 电子书是否包含带有注意力机制的编码器-解码器？

回复
- Jason Brownlee 2020年6月13日上午5:47 #
  
  不，目前书中不包含注意力机制。
  
  回复
nutan June 12, 2020 at 5:03 pm #

Hi Jason,

I am running this example in Colab, so I copied everything into the same notebook.
I get an error on this line:
model.add(AttentionDecoder(150, n_features))

---------------------------------------------------------------------------
OperatorNotAllowedInGraphError Traceback (most recent call last)
in ()
47 model = Sequential()
48 model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
---> 49 model.add(AttentionDecoder(150, n_features))
50
51 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

9 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in _disallow_in_graph_mode(self, task)
535 raise errors.OperatorNotAllowedInGraphError(
"{} is not allowed in Graph execution. Use Eager execution or decorate"
--> 537 " this function with @tf.function.".format(task))
538
539 def _disallow_bool_casting(self):

OperatorNotAllowedInGraphError: using a tf.Tensor as a Python bool is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

There is no comparison operation, so I don't know why.

Please let me know.

Reply
- nutan June 12, 2020 at 5:53 pm #
  
  Hi Jason,
  
  Solved:
  
  I copied this code from one of the links above:
  
  from keras import backend as K
  
  def time_distributed_dense(x, w, b=None, dropout=None,
                             input_dim=None, output_dim=None, timesteps=None):
      '''Apply y.w + b for every temporal slice y of x.
      '''
      if not input_dim:
          # won't work with TensorFlow
          input_dim = K.shape(x)[2]
      if not timesteps:
          # won't work with TensorFlow
          timesteps = K.shape(x)[1]
      if not output_dim:
          # won't work with TensorFlow
          output_dim = K.shape(w)[1]
  
      if dropout:
          # apply the same dropout pattern at every timestep
          ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
          dropout_matrix = K.dropout(ones, dropout)
          expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
          x *= expanded_dropout_matrix
  
      # collapse time dimension and batch dimension together
      x = K.reshape(x, (-1, input_dim))
  
      x = K.dot(x, w)
      print("in time_distributed_dense... 3")
      print("b shape ", b.shape)
      print("Type of b ", type(b))
      print("Type of x ", type(x))
      print("x shape", x.shape)
  
      # if b:  (this truth test on a tensor raises the error; commented out)
      x = x + b
  
      # reshape to 3D tensor
      print("in time_distributed_dense... 4")
      x = K.reshape(x, (-1, timesteps, output_dim))
      print("in time_distributed_dense...last")
      return x
  
  ---------------------------------------------------------------------------
  The "if b" test throws the error above. We can comment it out to get the code running.
  
  Thanks
  
  Reply
  - Jason Brownlee June 13, 2020 at 5:53 am #
    
    Well done.
    
    Reply
- Jason Brownlee June 13, 2020 at 5:50 am #
  
  I expect the code no longer works with modern versions of the libraries.
  
  Reply
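
A note on the "if b" failure reported in this thread: the mechanism can be reproduced without TensorFlow. In the sketch below, FakeSymbolicTensor is a hypothetical stand-in (not a TensorFlow class) that, like a graph-mode tensor, refuses to be converted to a Python bool, and add_bias is an illustrative helper. It suggests that testing "b is not None" may be a gentler fix than commenting the check out, because the bias then stays optional.

```python
class FakeSymbolicTensor:
    """Hypothetical stand-in for a graph-mode tensor (not a real TF class)."""

    def __init__(self, value):
        self.value = value

    def __bool__(self):
        # Mimics a symbolic tensor: its truth value is unknown at graph-build time.
        raise TypeError("using a tensor as a Python bool is not allowed")

    def __add__(self, other):
        return FakeSymbolicTensor(self.value + other.value)


def add_bias(x, b=None):
    """Add b to x only when a bias was supplied."""
    # `if b:` would call b.__bool__() and raise; an identity test does not.
    if b is not None:
        x = x + b
    return x


x = FakeSymbolicTensor(2.0)
b = FakeSymbolicTensor(0.5)

try:
    if b:  # the pattern that fails
        pass
    truthiness_ok = True
except TypeError:
    truthiness_ok = False

print(truthiness_ok)         # False
print(add_bias(x, b).value)  # 2.5
print(add_bias(x).value)     # 2.0
```

The identity test never asks the object for its truth value, which is why it sidesteps the error in this toy setting.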
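Arindam Mondal June 12, 2020 at 11:00 pm #

Hi Jason,
Very well explained indeed. But "_time_distributed_dense" cannot be imported even with TensorFlow version 2.0.0. Can you help me?

Reply
- Jason Brownlee June 13, 2020 at 6:03 am #
  
  Yes, this tutorial is no longer up to date.
  
  Reply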
David June 14, 2020 at 10:30 am #

For anyone having problems with attention when using TensorFlow >= 2.2, check whether this tutorial helps: https://medium.com/@dmunozc/using-keras-attention-with-tensorflow-2-2-69da8f8ae7db

Reply
- Jason Brownlee June 15, 2020 at 5:59 am #
  
  Thanks for sharing!
  
  Reply
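Alejandro Oñate July 2, 2020 at 8:04 pm #

Hello, I would like to understand how the model connects the LSTM encoder and decoder layers.

model = Sequential()
model.add(……

Does it know how to connect the hidden states? The classic model is much more complex (https://machinelearning.org.cn/develop-encoder-decoder-model-sequence-sequence-prediction-keras/), and I would like to know whether this option connects them automatically or connects them in some other (also valid) way.

Thanks!

Reply
- Jason Brownlee July 3, 2020 at 6:14 am #
  
  Perhaps start here:
  https://machinelearning.org.cn/start-here/#lstm
  
  Reply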
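Kingsley Udeh July 4, 2020 at 1:13 am #

Hi Jason,

Can I use the custom Keras attention layer implementation in a time series problem where I need to predict the next hour, or the current time step? At the moment the attention concept seems well suited to seq2seq models, but I want the output sequence to have only one time step.

If that is possible, could I use a CNN model as the encoder, followed by attention, recurrent, and dense layers in my network architecture?

Thanks in advance.

Reply
- Jason Brownlee July 4, 2020 at 6:02 am #
  
  It is probably better to use the attention layer provided by TensorFlow now.
  
  Reply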
Sk July 7, 2020 at 6:08 pm #

Hello,
Can we use TensorFlow Addons [ https://tensorflowcn.cn/addons/api_docs/python/tfa/ ] instead of the custom attention layer? If so, which function should be used?

Reply
- Sk July 7, 2020 at 6:09 pm #
  
  https://tensorflowcn.cn/addons/api_docs/python/tfa/seq2seq/
  
  Reply
- Jason Brownlee July 8, 2020 at 6:28 am #
  
  Sorry, I am not familiar with TensorFlow Addons.
  
  Reply
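Robert July 9, 2020 at 5:17 pm #

Hello Jason, everyone,

I am working on a small project involving ECG signals.

Input (ECG signals and images), analysis, and learning, using an attention mechanism and a CNN from tensors and Keras.

Previously I tried the code of miguel-lozano-220, one of the PhysioNet 2017 challenge entries, to understand how it works, but I ran into dimension problems when training ended and validation began (using the PhysioNet 2017 database).

Then I found this code, which is very good, and I am thinking of modifying it. Can this code handle such input, or is there a good guide that could walk me through an example of this kind of application?

Reply
- Jason Brownlee July 10, 2020 at 5:51 am #
  
  Perhaps you could try the attention layers provided by Keras:
  https://keras.org.cn/api/layers/attention_layers/
  
  Reply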
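Murat Karakaya July 20, 2020 at 4:32 am #

Could you update the post to include the Keras Attention layer?

Reply
- Jason Brownlee July 20, 2020 at 6:18 am #
  
  Thanks for the suggestion. I hope to write a new tutorial on the topic soon.
  
  Reply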
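Chung-Hao Ku July 24, 2020 at 6:33 pm #

Hi Jason, I would like to ask about the method called "step" in this attention implementation, which computes the attention scores and the context vector. I do not see how it is used anywhere in the implementation. When I looked at the Keras subclassing framework I could not find this method either, and it is not a Python built-in. Could you give me a clue about where in the code this method is used? Many thanks.

Reply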
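
On where "step" gets called: in Keras 2.0-style recurrent layers, the layer's call() typically hands self.step to the backend's RNN loop, which invokes it once per time step and threads the states through. The pure-Python sketch below imitates that callback pattern. The names run_rnn and ToyDecoder are hypothetical and exist only for illustration; they are not the tutorial's code or a Keras API.

```python
def run_rnn(step_fn, inputs, initial_states):
    """Toy driver loop: call step_fn once per time step, carrying states along."""
    states = initial_states
    outputs = []
    for x_t in inputs:
        y_t, states = step_fn(x_t, states)
        outputs.append(y_t)
    return outputs, states


class ToyDecoder:
    """Hypothetical layer whose step() is never called directly by user code."""

    def step(self, x_t, states):
        # One time step: combine the current input with the previous state.
        (prev_sum,) = states
        new_sum = prev_sum + x_t
        return new_sum, [new_sum]

    def call(self, inputs):
        # The layer passes its own step method to the loop as a callback.
        outputs, _ = run_rnn(self.step, inputs, initial_states=[0])
        return outputs


print(ToyDecoder().call([1, 2, 3]))  # [1, 3, 6]
```

Under this assumption, the method never appears in user code because the base class's loop is its only caller.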
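Sameer kumar August 20, 2020 at 7:29 pm #

Can I build the decoder with a bidirectional LSTM together with attention?

Reply
- Jason Brownlee August 21, 2020 at 6:26 am #
  
  I don't see why not.
  
  Reply
Sameer kumar August 20, 2020 at 7:46 pm #

How can I use a bidirectional LSTM instead of an LSTM in the attention decoder?
I am working on an image captioning project.
Please help.

Reply
- Jason Brownlee August 21, 2020 at 6:27 am #
  
  This may help you get started:
  https://machinelearning.org.cn/develop-bidirectional-lstm-sequence-classification-python-keras/
  
  Reply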
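Prisha August 25, 2020 at 12:51 am #

Hi,

I tried to use this, but I get a memory error when loading the model. Can you tell me why? Do you have any other example of an encoder-decoder with an attention layer, since the time distributed function cannot be found?

Reply
- Jason Brownlee August 25, 2020 at 6:43 am #
  
  I'm sorry to hear that.
  
  I hope to write more on this topic soon.
  
  Reply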
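A_P September 10, 2020 at 6:54 pm #

Hi Jason,
In your example, the output (y) is part of the sequence (X): the first two numbers of X.
How should y be handled when it is a binary classification target and is not contained in the sequence built from X?

Thanks!

Reply
- Jason Brownlee September 11, 2020 at 5:52 am #
  
  Good question, see this tutorial on time series classification:
  https://machinelearning.org.cn/how-to-develop-rnn-models-for-human-activity-recognition-time-series-classification/
  
  Reply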
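AS September 12, 2020 at 4:29 am #

I really appreciate all your tutorials; you are an excellent teacher. I have learned about LSTMs, attention, and CNNs from your posts. Thank you for providing such a great resource!

Reply
- Jason Brownlee September 12, 2020 at 6:20 am #
  
  Thanks!
  
  Reply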
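Kingsley Udeh September 25, 2020 at 9:25 pm #

Hi Dr. Jason,

Thank you for your excellent work in deep learning practice and research.

I was wondering whether you have implemented the encoder-decoder architecture with the Keras attention layer yet? Could you also consider adding a self-attention layer for time series regression problems?

Thanks again!

Reply
- Jason Brownlee September 26, 2020 at 6:19 am #
  
  Sure.
  
  Reply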
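Bilal CHandio October 11, 2020 at 7:37 pm #

Thanks for such a great implementation of attention. Could you explain how to get validation accuracy for this model? Does passing validation data to model.fit() work? I hope this will work for my text classification problem.

Reply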
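Cheng October 14, 2020 at 11:31 pm #

Hi Jason, your blog has helped me a lot. In my research topic, however, the sequences are given as high-precision decimals, whereas in your blog the sequences are integers that are easy to one-hot encode. The model can turn the prediction problem into a classification problem, but how should my data be encoded?

x=[113.654,112.1120,110.2354,108.3314………99.1014]
y=[12.3251,13.5564,15.6312,16.3544,………20.3314]

The above is one sample of the data. How should I process it and then use the x sequence to predict the y sequence?

Looking forward to your reply.

Best regards,

Cheng

Reply
- Jason Brownlee October 15, 2020 at 6:09 am #
  
  Thanks.
  
  You can use any precision you like; start here:
  https://machinelearning.org.cn/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply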
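
For real-valued sequences like the sample in this thread, one common option is to skip one-hot encoding and treat the task as regression on scaled values. The sketch below is a minimal illustration under stated assumptions: the numbers are made up (the original sample is elided), and fit_minmax, scale, and unscale are hypothetical helpers, not part of the tutorial.

```python
def fit_minmax(values):
    """Return (min, max) of the training values."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values into [0, 1] using training-set bounds."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(values, lo, hi):
    """Invert scale() so predictions return to the original units."""
    return [v * (hi - lo) + lo for v in values]


# Made-up real-valued sequence, standing in for the elided sample.
x = [113.5, 112.0, 110.5, 108.0, 99.5]

lo, hi = fit_minmax(x)
x_scaled = scale(x, lo, hi)

# Reshape to [timesteps, features] with one feature per time step,
# the layout a recurrent layer expects for a single sample.
x_steps = [[v] for v in x_scaled]

print(min(x_scaled), max(x_scaled))   # 0.0 1.0
print(len(x_steps), len(x_steps[0]))  # 5 1
```

With inputs prepared this way, the output layer would usually be linear and trained with a regression loss such as mean squared error, rather than softmax with cross-entropy.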
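Shahad October 21, 2020 at 6:38 pm #

Hi Jason,

Thank you for your high-quality work; I have learned a lot from your tutorials.

I was wondering: if I have a stacked encoder-decoder with multiple hidden layers, where should the attention layer go?

For example, if my encoder and decoder each have 3 hidden layers, should the attention layer go after the decoder's third layer or before its first layer? Is there a standard way to handle this?

Reply
- Jason Brownlee October 22, 2020 at 6:38 am #
  
  You're welcome!
  
  Good question, attention is used at the top of the decoder.
  
  Reply
  - Shahad October 22, 2020 at 4:12 pm #
    
    Thank you very much!
    
    One last thought. If I want to use an autoencoder for time series dimensionality reduction, would it make sense to use an attention layer to get a richer/better latent space? Perhaps self-attention is more applicable here.
    
    I really look forward to hearing your thoughts.
    
    Reply
    - Jason Brownlee October 23, 2020 at 6:03 am #
      
      Perhaps try with and without attention. You could also start here:
      https://machinelearning.org.cn/lstm-autoencoders/
      
      Reply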
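Hoda October 30, 2020 at 1:49 am #

Hello Dr. Jason,
Thank you very much for this great article.
Could you show us how to add a self-attention layer to the encoder-decoder model?

Reply
- Jason Brownlee October 30, 2020 at 6:55 am #
  
  Thanks for the suggestion. I hope to write about this topic soon.
  
  Reply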
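Alvin October 31, 2020 at 10:57 am #

Hi Jason,

Thank you very much for these beautifully illustrated tutorials. I have personally learned a lot from your explanations of many complex concepts.

I was wondering whether you plan to update this tutorial to make it compatible with the latest TensorFlow 2? Also, I noticed that TF2 includes implementations of attention layers (e.g. MultiHeadAttention). It would be great if you could provide a tutorial on how to use these built-in attention layers for the task. That would be very helpful for non-experts like me!

Reply
- Jason Brownlee October 31, 2020 at 1:55 pm #
  
  You're welcome!
  
  Yes, I hope to write an updated version of this tutorial soon.
  
  Reply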
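Olatunji Omisore November 25, 2020 at 4:04 pm #

Hi Jason,

Thank you very much for your tutorial. I am implementing a CNN-LSTM project. Although my training accuracy is usually above 95%, my test accuracy is unfortunately unimpressive (below 70%). I thought about adding an attention layer to the network and have tried many things. I found your code and tutorial useful, but the deprecation of _time_distributed_dense in TensorFlow 2 prevents me from adapting your code to my implementation.

Could you please offer an alternative way to do this?

Many thanks.

Reply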
Laith March 25, 2021 at 5:46 pm #

Hello,

Is there Keras code that can help implement an encoder-decoder model using transformers?

Best wishes

Reply
- Jason Brownlee March 26, 2021 at 6:20 am #
  
  I do not have an example at the moment.
  
  Reply
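Minh March 28, 2021 at 12:09 pm #

Hello, with the current version of Keras you can no longer import Recurrent from keras.layers.recurrent. Do you have any solution other than downgrading Keras?

Reply
- Jason Brownlee March 29, 2021 at 6:15 am #
  
  No, sorry.
  
  Reply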
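Masud May 29, 2021 at 7:24 am #

Thanks for this great tutorial. Do you plan to update the code with a standard attention layer?

Reply
- Jason Brownlee May 30, 2021 at 5:44 am #
  
  I hope to write a series of new tutorials on using the standard Keras attention layers soon.
  
  Reply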
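Wang Hui May 30, 2021 at 3:15 am #

I am working on a diabetes classification problem, and my data has shape (125000, 219). Can I use your approach for classification? If so, how; if not, why not? Many thanks!

Reply
- Jason Brownlee May 30, 2021 at 5:50 am #
  
  I don't think this is appropriate; try an MLP model directly.
  
  Reply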
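MS June 28, 2021 at 5:40 pm #

Hello Jason.
Bahdanau et al. compute the attention weights as (a = v tanh(w1 ht + w2 hs)), where ht is the query and hs is the value. I use a custom Keras attention layer, with __init__, build, and call functions, that takes the query and values and returns the context vector along with the attention weights. I use teacher forcing for this seq2seq problem. The query is the decoder's last hidden state and the values are all of the encoder's hidden states. My question is: does the encoder output serve as the values, i.e. all of the encoder's hidden states? And what should the query be then? Should it be encoder_h, i.e. the encoder's last hidden state?

Reply
- Jason Brownlee June 29, 2021 at 4:46 am #
  
  Sorry, I can't recall off the top of my head; perhaps look at the attention layers built into tf.keras.
  
  Reply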
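
The additive score quoted in this thread, a = v tanh(w1 ht + w2 hs), can be traced by hand on scalars. In the sketch below every number (the weights w1, w2, v, the query h_t, and the encoder states) is an arbitrary assumption chosen for illustration. It follows the usual formulation in which the query is a decoder hidden state and the values are the encoder's hidden states, one per input time step.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(h_t, encoder_states, w1, w2, v):
    """Scalar Bahdanau-style attention: score_s = v * tanh(w1*h_t + w2*h_s)."""
    scores = [v * math.tanh(w1 * h_t + w2 * h_s) for h_s in encoder_states]
    weights = softmax(scores)
    # Context is the attention-weighted sum of the values (encoder states).
    context = sum(a * h_s for a, h_s in zip(weights, encoder_states))
    return weights, context


# Arbitrary illustrative numbers (assumptions, not learned parameters).
encoder_states = [0.1, 0.9, -0.4]  # values: one hidden state per input step
h_t = 0.5                          # query: a decoder hidden state
weights, context = additive_attention(h_t, encoder_states, w1=1.0, w2=1.0, v=2.0)

print(round(sum(weights), 6))       # 1.0
print(weights.index(max(weights)))  # 1
```

In real layers ht and hs are vectors and w1, w2, v are learned matrices and a vector, but the flow (score, softmax, weighted sum) is the same.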
frozenade November 24, 2021 at 11:14 pm #

Hi Jason,

I am getting the error
TypeError: __init__() missing 1 required positional argument: 'cell'

at
super(AttentionDecoder, self).__init__(**kwargs)

when I use this code:
model = define_model(all_vocab_size, all_length, 256, encoder, decoder, attention)

# define NMT model
def define_model(vocab, timesteps, n_units, encoder, decoder, attention):
    model = Sequential()
    model.add(Embedding(vocab, n_units, input_length=timesteps, mask_zero=True))
    # model.add(Embedding(vocab, n_units, weights=[embedding_vectors], input_length=timesteps, trainable=False))
    if encoder == "LSTM":
        model.add(LSTM(n_units, return_sequences=False, dropout=0.5, recurrent_dropout=0.5))
    elif encoder == "GRU":
        model.add(GRU(n_units, return_sequences=False, dropout=0.5, recurrent_dropout=0.5))

    model.add(RepeatVector(timesteps))
    if decoder == "LSTM":
        model.add(LSTM(n_units, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))
    elif decoder == "GRU":
        model.add(GRU(n_units, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

    model.add(BatchNormalization())
    if attention == "ATTNDECODER":
        model.add(AttentionDecoder(n_units, vocab))
    else:
        model.add(TimeDistributed(Dense(vocab, activation='softmax',
                                        # kernel_regularizer=regularizers.l2(0.01),
                                        # activity_regularizer=regularizers.l2(0.01)
                                        )))
    return model

What am I missing?

Reply
- Adrian Tam November 25, 2021 at 2:31 pm #
  
  This should still work with Keras 2.0. After that, however, the Recurrent class became an alias for the "RNN" class and the syntax changed, which is why you see the error. Unfortunately, rewriting the code is not that simple. Perhaps you should downgrade Keras to get it running.
  
  Reply
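
Because the import of Recurrent fails on newer Keras releases, a guarded import can at least turn the raw ImportError into a clear message. This is only a sketch: load_recurrent_base is a hypothetical helper, and it does not make the tutorial's layer work on new versions; it just reports the situation.

```python
import importlib


def load_recurrent_base():
    """Try the Keras 2.0-era import; return (class_or_None, message)."""
    try:
        module = importlib.import_module("keras.layers.recurrent")
        return getattr(module, "Recurrent"), "found legacy Recurrent base class"
    except (ImportError, AttributeError) as err:
        # Raised when Keras is absent or no longer exposes this name.
        return None, "legacy Recurrent class unavailable: %s" % err


base_class, message = load_recurrent_base()
print(message)
if base_class is None:
    print("The custom AttentionDecoder needs an older Keras, as noted in this thread.")
```

A check like this could sit at the top of attention_decoder.py so that the failure explains itself instead of surfacing as a bare import error.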
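Quentin July 12, 2022 at 9:13 am #

Hi Jason,
This tutorial is very helpful. May I use this code in my project?

Reply
- James Carmichael July 13, 2022 at 7:43 am #
  
  Hi Quentin... Yes, but please understand that all of the code and material on my site and in my books was developed and provided for educational purposes.
  
  I take no responsibility for the code, what it might do, or how you might use it.
  
  If you use my code or material in your own project, please credit the source, including:
  
  The author name, e.g. "Jason Brownlee".
  The title of the tutorial or book.
  The name of the website, e.g. "Machine Learning Mastery".
  The URL of the tutorial or book.
  The date you accessed or copied the code.
  For example:
  
  Jason Brownlee, Machine Learning Algorithms in Python, Machine Learning Mastery, Available from https://machinelearning.org.cn/machine-learning-with-python/, accessed April 15th, 2018.
  Also, if your work is public, contact me; I'd love to see it out of general interest.
  
  Reply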
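Ram July 14, 2022 at 1:05 pm #

Hi Jason,
How can we use this code, or the attention mechanism, for image classification with a 2D CNN?

Reply
- James Carmichael July 15, 2022 at 8:33 am #
  
  Hi Ram... The following resource may help you:
  
  https://blog.paperspace.com/image-classification-with-attention/
  
  Reply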
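Kostas September 3, 2022 at 3:07 am #

Hi Jason,
Great article, but I cannot get the code to run. This may be because of the Python and TensorFlow versions I am using.
Could you tell me which versions of Python and TensorFlow I should use?

Thanks for your time.
Kostas

Reply
- James Carmichael September 4, 2022 at 10:02 am #
  
  Hi Kostas... What error message are you getting? That will help us better assist you.
  
  Reply
  - Kostas September 11, 2022 at 6:56 pm #
    
    Hi James,
    When running the code from the command line I get this error message: "ImportError: cannot import name 'Recurrent' from 'keras.layers.recurrent' (C:\Users\papav\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\layers\recurrent.py)"
    
    Further searching on the internet turned up reports that this error message means the Keras and TensorFlow versions are wrong, because Recurrent has been deprecated in recent versions of Keras/TensorFlow. That is why I asked about the appropriate versions of all this constantly evolving software (Python, Keras, TensorFlow).
    
    Reply
    - Jack October 17, 2022 at 6:35 pm #
      
      Dear Dr. James Carmichael and Kostas,
      
      I am getting the same error. The error message says cannot import 'Recurrent' from 'keras.layers.recurrent' (D:\Users\72771\Anaconda3\lib\site-packages\keras\layers\recurrent.py). How can I fix it?
      
      Reply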
Abdi November 30, 2022 at 2:51 am #

Dear Jason,

A great piece of example code. But we now know that Keras has developed SDPA (Scaled Dot Product Attention). My question is how to define k, v, and q in order to use the attention layer in the decoder, if that is possible. Or can the SDPA module be used for self-attention in a transformer?

If the answer is no, then, as you mentioned earlier, which "attention decoder" functions are available in Keras?

My second question: if I have a CNN network as the encoder, will this "attention decoder" function still work properly?

Reply
- James Carmichael November 30, 2022 at 8:58 am #
  
  Hi Abdi... The following resource may help clarify:
  
  https://machinelearning.org.cn/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras/
  
  Reply
  - Abdi December 10, 2022 at 4:58 am #
    
    Thank you, dear James.
    I studied that tutorial, but I could not use it in a model as
    add.mode.attentionlayer(…)
    
    My goal is to use it in a Sequential model like this:
    
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(RepeatVector(n_outputs))
    model.add(Attention_layer(**kwargs))  # <--- if possible
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    
    Or, if there is no such model (I have searched a lot), can I use the attention_decoder code correctly with Keras 2.9?
    
    Reply
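
On how q, k, and v map onto an encoder-decoder: in the usual cross-attention setup the queries come from the decoder states, while the keys and values both come from the encoder states. The pure-Python sketch below computes scaled dot-product attention for one query under that assumption. The vectors are arbitrary illustrative numbers, and scaled_dot_product_attention is a hand-written helper, not the Keras layer.

```python
import math


def scaled_dot_product_attention(query, keys, values):
    """Attention for one query vector over a list of key/value vectors."""
    d_k = len(query)
    # Similarity of the query with every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context: weighted sum of the value vectors, dimension by dimension.
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context


# Arbitrary illustrative vectors (assumptions).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # used as keys and values
decoder_state = [0.0, 2.0]                              # used as the query

weights, context = scaled_dot_product_attention(
    decoder_state, keys=encoder_states, values=encoder_states)

print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Self-attention is the special case where the query, keys, and values all come from the same sequence. A layer of this kind takes two inputs, which is one reason it tends to fit the functional API better than a Sequential stack.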
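Abdi December 2, 2022 at 2:08 am #

Thanks. I have studied it before; now I would like to use it to replace Zafar-Ali's attention_decoder function, if possible. Have you done this before?

Reply
Abdi December 2, 2022 at 2:24 am #

A few questions, dear Jason:

1. If we run the code in Colab, is saving the attention_decoder.py file in the mounted drive enough to use it?

2. As mentioned above, the statement "from keras.layers.recurrent import Recurrent, _time_distributed_dense" does not work. So if we install Keras 2.0.8 in Colab, will there be problems with other code that needs a newer or the latest version of Keras?

Reply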

358 Responses to How to Develop an Encoder-Decoder Model with Attention in Keras
