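How to Prepare a French-to-English Dataset for Machine Translation

By Jason Brownlee on April 30, 2020 in Deep Learning for Natural Language Processing

Machine translation is the challenging task of converting text in a source language into coherent, matching text in a target language.

Neural machine translation systems, such as encoder-decoder recurrent neural networks, are achieving state-of-the-art results with a single end-to-end system trained directly on source and target language text.

Standard datasets are needed to develop, explore, and become familiar with neural machine translation systems.

In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.

After completing this tutorial, you will know:

The Europarl dataset is comprised of the proceedings of the European Parliament in 11 languages.
How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
How to reduce the vocabulary size of both the French and English data in order to reduce the complexity of the translation task.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.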

Photo by Giuseppe Milo, some rights reserved.

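Tutorial Overview

This tutorial is divided into 5 parts; they are:

Europarl Machine Translation Dataset
Download French-English Dataset
Load Dataset
Clean Dataset
Reduce Vocabulary

Python Environment

This tutorial assumes you have a Python SciPy environment installed with Python 3.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning With Anaconda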

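Europarl Machine Translation Dataset

Europarl is a standard dataset used for statistical machine translation and, more recently, neural machine translation.

It is comprised of the proceedings of the European Parliament, hence the name of the dataset as the contraction “Europarl”.

The proceedings are the transcriptions of speakers at the European Parliament, which are translated into 11 different languages.

It is a collection of the proceedings of the European Parliament, dating back to 1996. Altogether, the corpus comprises about 30 million words for each of the 11 official languages of the European Union.

- Europarl: A Parallel Corpus for Statistical Machine Translation, 2005.

The raw data is available on the European Parliament website in HTML format.

The creation of the dataset was led by Philipp Koehn, author of the book “Statistical Machine Translation”.

The dataset is made available for free to researchers on the website “European Parliament Proceedings Parallel Corpus 1996-2011”, and often appears as part of machine translation challenges, such as the machine translation task in the 2014 Workshop on Statistical Machine Translation.

The most recent version of the dataset is version 7, released in 2012, comprising data from 1996 to 2011.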

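Download French-English Dataset

We will focus on the parallel French-English dataset.

This is a prepared corpus of aligned French and English sentences recorded between 1996 and 2011.

The dataset has the following statistics:

Sentences: 2,007,723
French words: 51,388,643
English words: 50,196,035

You can download the dataset from here:

French-English Parallel Corpus (194 MB)

Once downloaded, you should have the file “fr-en.tgz” in your current working directory.

You can unzip this archive file using the tar command, as follows: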

tar zxvf fr-en.tgz
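
If the tar command is not available on your system, the same archive can probably be unpacked from Python with the standard-library tarfile module. The sketch below is an assumption-laden alternative rather than part of the original tutorial: the helper name extract_archive() and the target directory argument are made up for illustration.

```python
import os
import tarfile

# hypothetical helper: unpack a gzip-compressed tar archive
def extract_archive(archive_path, target_dir='.'):
    # 'r:gz' opens a gzip-compressed tar file for reading
    with tarfile.open(archive_path, mode='r:gz') as archive:
        names = archive.getnames()
        # only extract archives from sources you trust
        archive.extractall(path=target_dir)
    return names

# unpack the download only if it is present in the working directory
if os.path.exists('fr-en.tgz'):
    print(extract_archive('fr-en.tgz'))
```

The function returns the member names so you can confirm that both europarl-v7.fr-en.en and europarl-v7.fr-en.fr were extracted.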

You will now have two files, as follows:

English: europarl-v7.fr-en.en (288 MB)
French: europarl-v7.fr-en.fr (331 MB)

Below is a sample of the English file.

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.

Below is a sample of the French file.

Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.
Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.

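Load Dataset

Let's start off by loading the data files.

We can load each file as a string. Because the files contain Unicode characters, we must specify an encoding when loading the files as text. In this case, we will use UTF-8, which easily handles the Unicode characters in both files.

The function below, named load_doc(), will load a given file and return it as a blob of text.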

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
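
The same loader can also be written with a with block, which closes the file even if reading raises an error. This is only an equivalent sketch; the name load_doc_safely() is used here purely to keep it distinct from load_doc() and does not appear in the tutorial.

```python
# alternative sketch of load_doc() using a context manager
def load_doc_safely(filename):
    # the file is closed automatically when the with block exits
    with open(filename, mode='rt', encoding='utf-8') as file:
        return file.read()
```

Either version returns the whole file as a single string, so the rest of the tutorial works unchanged.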

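Next, we can split the file into sentences.

Generally, one utterance is stored on each line. We can treat these as sentences and split the file on newline characters. The to_sentences() function below will split a loaded document.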

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

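When preparing our model later, we will need to know the length of the sentences in the dataset. We can write a short function to calculate the shortest and longest sentences.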

# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)
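
To see what these two helpers report, here is a small usage sketch on an in-memory string instead of the real files. The toy document is invented for illustration; note that lengths are counted in whitespace-separated tokens and that a blank line counts as a sentence of length 0.

```python
# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')

# shortest and longest sentence lengths, in whitespace-separated tokens
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

# toy document with a blank line in the middle
doc = 'Resumption of the session\n\nMadam President, on a point of order.\n'
sentences = to_sentences(doc)
print(len(sentences))
# -> 3
print(sentence_lengths(sentences))
# -> (0, 7)
```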

We can tie all of this together to load and summarize the English and French data files. The complete example is listed below.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

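Running the example summarizes the number of lines or sentences in each file and the length of the longest and shortest lines in each file.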

English data: sentences=2007723, min=0, max=668
French data: sentences=2007723, min=0, max=693

Importantly, we can see that the number of lines, 2,007,723, matches the expectation.
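
The minimum length of 0 in the output suggests that each file contains some empty lines. The tutorial does not filter them, but if you want to drop them, both files have to be filtered together so the line-by-line alignment is preserved. A minimal sketch, assuming two equally long lists of sentences (the function name drop_empty_pairs() and the toy data are hypothetical):

```python
# keep only pairs where both sides contain at least one token
def drop_empty_pairs(english, french):
    if len(english) != len(french):
        raise ValueError('files are not aligned line by line')
    pairs = [(en, fr) for en, fr in zip(english, french) if en.strip() and fr.strip()]
    kept_english = [en for en, _ in pairs]
    kept_french = [fr for _, fr in pairs]
    return kept_english, kept_french

# toy example: the second English line is empty, so the whole pair is dropped
english = ['Resumption of the session', '', 'Madam President']
french = ['Reprise de la session', 'Texte', 'Madame la Présidente']
print(drop_empty_pairs(english, french))
```

Filtering each file on its own would shift the remaining lines and break the pairing between English and French sentences.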

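Clean Dataset

The data needs some minimal cleaning before being used to train a neural translation model.

Looking at some samples of text, some minimal text cleaning may include:

Tokenizing text by white space.
Normalizing case to lowercase.
Removing punctuation from each word.
Removing non-printable characters.
Converting accented French characters to their unaccented Latin (ASCII) equivalents.
Removing words that contain non-alphabetic characters.

These are just some basic operations as a starting point; you may know of or require more elaborate data cleaning operations.

The function clean_lines() below implements these cleaning operations. Some notes:

We use the Unicode API to normalize Unicode characters, which converts accented French characters to their unaccented Latin equivalents.
We use an inverse regex match to retain only the printable characters in each word.
We use a translation table to translate characters as is, but exclude all punctuation characters.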
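
The first note can be seen in isolation with a short sketch. NFD normalization splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with errors ignored then drops the mark. The helper name to_ascii() is made up for this illustration.

```python
from unicodedata import normalize

# decompose accented characters, then drop everything that is not ASCII
def to_ascii(text):
    decomposed = normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Je déclare reprise la session du Parlement européen'))
# -> Je declare reprise la session du Parlement europeen
print(to_ascii('cœur'))
# -> cur
```

Note the second result: a character with no decomposition, such as the ligature œ, is dropped entirely rather than converted to "oe", so this approach can silently shorten some words.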

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# normalize unicode characters
		line = normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
		# remove non-printable chars from each token
		line = [re_print.sub('', w) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	return cleaned
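
To make the effect of the remaining steps concrete, the sketch below applies the same operations to a single line. The function name clean_line() and the sample strings are illustrative only; the steps mirror clean_lines() above.

```python
import re
import string
from unicodedata import normalize

# same steps as clean_lines(), applied to a single line
def clean_line(line):
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # normalize unicode characters to ASCII
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
    # tokenize on white space and convert to lower case
    words = [word.lower() for word in line.split()]
    # remove punctuation and non-printable characters from each token
    words = [word.translate(table) for word in words]
    words = [re_print.sub('', word) for word in words]
    # keep only purely alphabetic tokens
    words = [word for word in words if word.isalpha()]
    return ' '.join(words)

print(clean_line("adjourned on Friday 17 December 1999, a minute' s silence"))
# -> adjourned on friday december a minute s silence
print(clean_line('during this part-session'))
# -> during this partsession
```

The dates disappear because the final filter drops any token that is not purely alphabetic, and hyphenated words are joined because the hyphen is removed as punctuation.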

Once normalized, we save the lists of clean lines directly in binary format using the pickle API. This will speed up loading for further operations later.
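
As a quick check of the idea, the sketch below saves a small list with pickle and loads it back. It uses with blocks so the files are closed promptly, and a temporary directory so nothing is written next to your data; the function names save_sentences() and load_sentences() are hypothetical. Keep in mind that pickle files should only be loaded from sources you trust.

```python
import os
import tempfile
from pickle import dump, load

# save a list of sentences; the with block closes the file after writing
def save_sentences(sentences, filename):
    with open(filename, 'wb') as file:
        dump(sentences, file)

# load the list back from disk
def load_sentences(filename):
    with open(filename, 'rb') as file:
        return load(file)

# round trip through a temporary file
path = os.path.join(tempfile.mkdtemp(), 'sample.pkl')
sentences = ['resumption of the session', 'reprise de la session']
save_sentences(sentences, path)
restored = load_sentences(path)
print(restored == sentences)
# -> True
```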

Reusing the loading and splitting functions developed in the previous sections, the complete example is listed below.

import string
import re
from pickle import dump
from unicodedata import normalize

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# normalize unicode characters
		line = normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
		# remove non-printable chars from each token
		line = [re_print.sub('', w) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	return cleaned

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'english.pkl')
# spot check
for i in range(10):
	print(sentences[i])

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'french.pkl')
# spot check
for i in range(10):
	print(sentences[i])

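After running, the clean sentences are saved in the english.pkl and french.pkl files respectively.

As part of the run, we also print the first few lines of each list of clean sentences, reproduced below.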

English

resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr kumar ponnambalam who had visited the european parliament just a few months ago

French

reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m kumar ponnambalam qui avait rendu visite au parlement europeen il y a quelques mois a peine

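My reading of French is very limited, but at least as far as the English is concerned, further improvements could be made, such as dropping or concatenating the hanging “s” tokens left behind by possessives (for example, “minute s silence”).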
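
One possible way to handle the hanging “s” is a small post-processing step that attaches a lone “s” token to the word before it. This is a crude heuristic offered as a sketch, not something the tutorial implements, and the function name merge_hanging_s() is hypothetical.

```python
# hypothetical post-processing step: attach a lone 's' to the previous token
def merge_hanging_s(line):
    merged = []
    for token in line.split():
        if token == 's' and merged:
            # glue the stray 's' onto the preceding word
            merged[-1] = merged[-1] + 's'
        else:
            merged.append(token)
    return ' '.join(merged)

print(merge_hanging_s('please rise then for this minute s silence'))
# -> please rise then for this minutes silence
```

The result treats the possessive like a plural, which loses some meaning; dropping the token instead would be an equally simple alternative.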

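Reduce Vocabulary

As part of the data cleaning, it is important to constrain the vocabulary of both the source and target languages.

The difficulty of the translation task grows with the size of the vocabularies, which in turn affects model training time and the size of the dataset required to make the model viable.

In this section, we will reduce the vocabulary of both the English and French text and mark all out-of-vocabulary (OOV) words with a special token.

We can start by loading the pickled clean lines saved in the previous section. The load_clean_sentences() function below will load and return a list for a given filename.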

# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

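Next, we can count the occurrences of each word in the dataset. For this we can use a Counter object, which is a Python dictionary keyed on words that updates a count each time a new occurrence of a word is added.

The to_vocab() function below creates a vocabulary for a given list of sentences.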

# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab
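
A short sketch of how the Counter behaves may help here. The two sentences are toy data; the important detail is that update() is given a list of tokens, since passing a plain string would count individual characters instead of words.

```python
from collections import Counter

vocab = Counter()
for line in ['the house rose', 'the house observed a minute of silence']:
    # update() with a list of tokens adds one count per token
    vocab.update(line.split())

print(vocab['the'])
# -> 2
print(vocab['silence'])
# -> 1
# a word that was never seen has a count of zero rather than raising an error
print(vocab['parliament'])
# -> 0
```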

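We can then process the created vocabulary and remove all words from the Counter that occur fewer times than a specific threshold.

The trim_vocab() function below does this; it accepts a minimum occurrence count as a parameter and returns an updated vocabulary.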

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurance):
	tokens = [k for k,c in vocab.items() if c >= min_occurance]
	return set(tokens)
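
As a small usage sketch with made-up counts, note that the threshold is inclusive: a word that occurs exactly min_occurance times is kept.

```python
from collections import Counter

# keep words that occur at least min_occurance times
def trim_vocab(vocab, min_occurance):
    tokens = [k for k, c in vocab.items() if c >= min_occurance]
    return set(tokens)

# invented counts for illustration
vocab = Counter({'the': 7, 'parliament': 5, 'ponnambalam': 1})
print(sorted(trim_vocab(vocab, 5)))
# -> ['parliament', 'the']
```

The function returns a set rather than a Counter, which makes the membership tests in the next step fast.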

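Finally, we can update the sentences, remove all words not in the trimmed vocabulary, and mark their removal with a special token, in this case the string “unk”.

The update_dataset() function below performs this operation and returns a list of updated lines that can then be saved to a new file.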

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines
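
The replacement step can be tried on a single line with a hand-picked vocabulary. This is a compact sketch of the same logic using a list comprehension; the vocabulary set below is invented for the example.

```python
# mark all OOV words with "unk" for all lines
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        # keep known tokens, replace everything else with 'unk'
        new_tokens = [token if token in vocab else 'unk' for token in line.split()]
        new_lines.append(' '.join(new_tokens))
    return new_lines

vocab = {'was', 'mr', 'who', 'had', 'visited'}
print(update_dataset(['was mr kumar ponnambalam who had visited'], vocab))
# -> ['was mr unk unk who had visited']
```

Each unknown word is replaced one for one, so the number of tokens in every sentence stays the same.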

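We can tie all of this together to reduce the vocabulary of both the English and French datasets and save the results to new data files.

We will use a minimum occurrence of 5, but you are free to explore other minimum occurrence counts suitable for your application.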
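
One way to explore other thresholds is to report, for each candidate value, how many words survive and what fraction of all tokens they cover. The sketch below is only a suggestion under that assumption; the function name vocab_report() and the toy sentences are not from the tutorial.

```python
from collections import Counter

# report vocabulary size and token coverage for several thresholds
def vocab_report(lines, thresholds):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    total_tokens = sum(counts.values())
    report = []
    for threshold in thresholds:
        kept = {word: c for word, c in counts.items() if c >= threshold}
        # fraction of all tokens that would not be replaced by 'unk'
        coverage = sum(kept.values()) / total_tokens
        report.append((threshold, len(kept), coverage))
    return report

lines = ['the house rose', 'the house observed a minute of silence', 'the session']
for threshold, size, coverage in vocab_report(lines, [1, 2, 3]):
    print('min=%d vocab=%d coverage=%.2f' % (threshold, size, coverage))
```

On the real data you would pass the list loaded from english.pkl or french.pkl and look for a threshold where the vocabulary shrinks substantially while coverage stays high.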

The complete code example is listed below.

from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurance):
	tokens = [k for k,c in vocab.items() if c >= min_occurance]
	return set(tokens)

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines

# load English dataset
filename = 'english.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])

# load French dataset
filename = 'french.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])

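First, the size of the English vocabulary is reported, followed by the updated size. The updated dataset is saved to the file ‘english_vocab.pkl’, and a spot check of some updated examples, with out-of-vocabulary words replaced by “unk”, is printed.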

English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl

We can see that the size of the vocabulary was reduced by roughly 60 percent, to a little over 40,000 words.

resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr unk unk who had visited the european parliament just a few months ago

Next, the same procedure is performed on the French dataset, saving the result to the file ‘french_vocab.pkl’.

French Vocabulary: 141642
New French Vocabulary: 58800
Saved: french_vocab.pkl

We see a similar shrinking of the size of the French vocabulary.
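
The size of the reduction can be checked directly from the vocabulary sizes reported in the two runs. A small arithmetic sketch (the helper name percent_removed() is made up):

```python
# percentage of the vocabulary removed by trimming
def percent_removed(before, after):
    return 100.0 * (before - after) / before

# vocabulary sizes as reported in the output above
print('English: %.1f%% removed' % percent_removed(105357, 41746))
# -> English: 60.4% removed
print('French: %.1f%% removed' % percent_removed(141642, 58800))
# -> French: 58.5% removed
```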

reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m unk unk qui avait rendu visite au parlement europeen il y a quelques mois a peine

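Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the Europarl machine translation dataset and how to prepare the data for modeling.

Specifically, you learned:

The Europarl dataset is comprised of the proceedings of the European Parliament in 11 languages.
How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
How to reduce the vocabulary size of both the French and English data in order to reduce the complexity of the translation task.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.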

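56 Responses to How to Prepare a French-to-English Dataset for Machine Translation

Gerrit Govaerts January 8, 2018 at 6:54 pm

A bit off topic, but here are some very sharp observations on the remarkable success of recurrent and convolutional neural networks, and on why basic multilayer perceptrons may not be worth the effort: http://www.stochasticlifestyle.com/algorithm-efficiency-comes-problem-information/

- Jason Brownlee January 9, 2018 at 5:26 am

  Thanks for sharing.
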
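Klaas January 9, 2018 at 6:20 am

Jason, this is excellent work once again. Thank you very much for sharing. It is very helpful for people like me who lack a science/math background but are still very interested in learning this material.
Thank you very much for your work!
Best regards

- Jason Brownlee January 9, 2018 at 3:17 pm

  Thanks, I'm glad it helps.

  - riyaj atar December 10, 2020 at 3:27 am

    Thanks Jason. Another great tutorial.
    I have a question about creating a tfds-format dataset of a parallel corpus for a machine translation task.
    Could you give some steps on how to create this format for our own dataset?
    Thank you for your time and effort.
    Thanks. Stay healthy.

    - Jason Brownlee December 10, 2020 at 6:30 am

      Thanks.

      Sorry, I don't understand your question. Could you elaborate on the problem you are having?
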
Vidyush Bakshi January 9, 2018 at 9:38 pm

Great work once again, and well explained!

- Jason Brownlee January 10, 2018 at 5:25 am

  Thanks.

Canbey Bilgili January 13, 2018 at 1:25 am

Great article. This is a good resource for preparing data. Thanks!

- Jason Brownlee January 13, 2018 at 5:34 am

  Glad to hear it.

LeeX January 22, 2018 at 2:28 am

Chinese researchers really appreciate your tutorials!

- Jason Brownlee January 22, 2018 at 4:46 am

  Thanks!

Nixon February 7, 2018 at 4:13 am

Hi bro, I am a beginner. How can I learn machine learning easily? Please help me.

- Jason Brownlee February 7, 2018 at 9:27 am

  Start here:
  https://machinelearning.org.cn/start-here/

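mzeid February 26, 2018 at 11:48 am

Hi Jason,

This is really a great article. I am trying to follow your guide with English > Arabic data, but the function ‘clean_lines(lines)’ returns nothing when used with Arabic text. Is there a workaround for Arabic?

Thanks in advance!

- Jason Brownlee February 26, 2018 at 2:55 pm

  Sorry, I have not worked with Arabic. Perhaps the function needs to be updated to support Unicode characters?
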
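machine_translator April 6, 2018 at 9:19 pm

Thank you very much for the clear tutorial on how to prepare data for machine translation. What are the next steps? Is there a similar tutorial for those steps?

- Jason Brownlee April 7, 2018 at 6:31 am

  Yes, I cover the whole project in my book:
  https://machinelearning.org.cn/deep-learning-for-nlp/

  - Rokaya May 16, 2022 at 4:37 pm

    I am working on a multilingual translation model. Could you help me? Where can I buy your book?

    - James Carmichael May 17, 2022 at 9:55 am

      Hi Rokaya… You may find the following resource very helpful:

      https://machinelearning.org.cn/deep-learning-for-nlp/
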
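Zayed April 11, 2018 at 8:45 am

A great and useful tutorial.

I want to save the files as plain text ‘.txt’ files in UTF-8 format; I don't need the pickle files.

Do I need to change the code above to output text files?

- Jason Brownlee April 11, 2018 at 4:15 pm

  Perhaps you can save the vocabulary as one word per line.

  You can save the translations one per line.

  To do that, you can write a function that saves a list to an ASCII file using the standard Python API and call it instead of the pickle function.

  - Zayed April 12, 2018 at 2:53 am

    Thanks Jason for your reply. I don't even need the vocabulary. I am stuck on this function.

    # save a list of clean sentences to file
    def save_clean_sentences(sentences, filename)
    dump(sentences, open(filename, ‘wb’))
    print(‘Saved: %s’ % filename)

    I tried this and can see the spot check results in PyCharm (10 lines for each language), but nothing is written to the file.

    def save_clean_sentences(sentences, filename)
    f = open(filename, ‘r+’)
    for line in f
    f.write(line[i], ‘r+’)
    f.write(‘\n’)

    What am I missing?

    Thanks again for taking the time to support me.

    - Jason Brownlee April 12, 2018 at 8:49 am

      Perhaps make sure you close the file after writing?
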
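Zayed April 14, 2018 at 3:38 am

I have a question about removing punctuation from the data. In the example above, you can see sentences like these:

“please rise then for this minute s silence”
“the house rose and observed a minute s silence”

As you can see, the apostrophe in the sentence has been removed. Does that mean that if I try to translate the same sentence with the apostrophe, “please rise then for this minute’s silence”, the neural decoder will fail to pick the correct French translation, or that the translation will differ because the source text is slightly different?

Would the translation differ if the same source sentence ended with a full stop or started with a capital letter? For example:

Please rise then for this minute’s silence
Please rise then for this minute’s silence.

Is it standard practice to remove punctuation from the training data? Does it improve overall quality? Any advice is welcome!

- Jason Brownlee April 14, 2018 at 6:50 am

  I removed it (or it was removed from the training data beforehand, I don't recall) to simplify the problem.

  I would recommend adding it back (do not strip it from the training data if it is present, or get data with punctuation) in order to learn translation with punctuation.

  It is standard practice when focusing on the translation part, but not for models used in real-world work.

  Alternatively, you could develop a model to add punctuation.

  - Zayed April 14, 2018 at 7:17 am

    Thanks Jason! That makes sense.

    I have another question about lowercasing. If all of the training data is lowercased, can the neural decoder capitalize the start of the target sentence, or keep unknown words that were not seen in the training data as they are? Or will it lowercase all words during decoding/translation? For example, we have this sentence:

    IBM is providing AI services.

    If trained only on lowercased data, can the neural decoder keep IBM and AI as they are?

    Thanks again for your support; I hope you don't mind my frequent questions. Also, please let me know which of your books covers neural machine translation in detail. I am mainly interested in creating a neural machine translation system and a neural spell checker. Do your books cover neural spell checking?

    Thanks again!

    - Jason Brownlee April 15, 2018 at 6:17 am

      If all of the training data is lowercase, then the model only knows lowercase.

      If case is important, you can train with case preserved, or train a model to add case to lowercase strings, or use some other clever idea…
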
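Dominique Lahaix September 20, 2018 at 7:50 am #

Hi Jason – I have a question you may be able to help with.

We built a system that uses ML to classify short documents automatically. We did this for English and now need to do the same for French. We used supervised learning with a large set of manually labeled documents.

Unfortunately, our French training set is much smaller... so I am wondering whether we could:

– translate the training set and use it as a (supplementary) training set for French
– translate the model itself (I don't even know whether this is an option)

Have you heard of people using a translated training set to build a model? How well does it work?
Thanks
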
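- Jason Brownlee September 20, 2018 at 8:11 am #
  
  Sounds like a good idea!
  
  Perhaps generate new data to train on, as augmented versions of your existing documents.
  
  Also, when working with a smaller dataset, consider using regularization methods to ensure you don't overfit the training data.
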
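simran December 26, 2018 at 3:44 pm #

Greetings, sir,
I am doing PhD research on corpus linguistics using machine learning. I need help developing a preprocessing algorithm for a corpus prior to translation.
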
- Jason Brownlee December 27, 2018 at 5:39 am #
  
  It is not an algorithm, but rather a series of preprocessing steps chosen to best suit your specific dataset.

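Prashant Kumar Singh March 14, 2019 at 9:45 pm #

Hi Jason, is it possible to translate from English into other languages, such as French?

My project is as follows:

I work in existdb and generate PDF files that are published on web pages worldwide. I would like to change the language of the PDF content by country. Can this be achieved with the approach in your blog?

How would I combine Python (as in your blog) with existdb (an open source database) for the conversion task? Or is there some other way to do this? Please help me understand.

Thanks,
Prashant
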
- Jason Brownlee March 15, 2019 at 5:30 am #
  
  Thanks for the suggestion.

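  - Prashant Kumar Singh March 20, 2019 at 4:07 pm #
    
    Hi Jason,
    
    Could you answer the following question? It would help me.
    
    Question: Is it possible to connect a machine learning model to my web page, which is based on exist (XML) database content? Please suggest the steps I should follow.
    
    Thanks,
    Prashant
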
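    - Jason Brownlee March 21, 2019 at 7:58 am #
      
      I don't see why not.
      
      This sounds like an engineering question, and the answer depends on the specifics of your production environment. I don't have a worked example, sorry.
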
anvesh June 10, 2019 at 7:27 pm #

Can we take a pretrained English-to-French model, train it on my own small dataset, and then translate English into any other language?

- Jason Brownlee June 11, 2019 at 7:52 am #
  
  It might help as a starting point, but further training would be required.

Dani Gross June 15, 2019 at 6:44 pm #

Hi Jason,
Thanks for your tutorial!

How do you handle sentence alignment for this corpus, given that it contains empty strings?

- Jason Brownlee June 16, 2019 at 7:12 am #
  
  Sorry, I don't have a tutorial on sentence alignment, so I can't give you good off-the-cuff advice.

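One simple way to deal with empty lines, sketched here under the assumption that the two files are line-aligned (line i of the English file corresponds to line i of the French file), is to filter the lines as pairs instead of cleaning each file separately. Dropping empty lines from only one file would shift every later line out of alignment. The function name `drop_empty_pairs` is hypothetical.

```python
def drop_empty_pairs(src_lines, tgt_lines):
    """Keep only the line pairs in which both sides contain text."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError('both files must have the same number of lines')
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines)
            if s.strip() and t.strip()]
    # unzip back into two parallel lists
    src = [s for s, _ in kept]
    tgt = [t for _, t in kept]
    return src, tgt

english = ['resumption of the session', '', 'thank you']
french = ['reprise de la session', 'merci', '']
print(drop_empty_pairs(english, french))
```

This only removes pairs with a missing side; it does not attempt to repair sentences that were aligned incorrectly.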
Rishai August 27, 2019 at 9:07 am #

Hi Jason, thanks for all these tutorials. Do you have one on the next step, converting tokens/words into integer vectors so they can be passed to an Embedding layer?

- Jason Brownlee August 27, 2019 at 2:07 pm #
  
  Yes, this is a good place to start:
  https://machinelearning.org.cn/start-here/#nlp

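For readers who want a feel for that step before following the link, here is a library-free sketch of integer encoding. The helper names `build_word_index` and `encode` are hypothetical, and the choice to reserve 0 for padding and to pad at the end is an assumption, though it is a common convention for input to an embedding layer.

```python
def build_word_index(sentences):
    """Map each distinct word to an integer, reserving 0 for padding."""
    index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def encode(sentences, word_index, max_len):
    """Turn sentences into fixed-length lists of integers."""
    encoded = []
    for sentence in sentences:
        # words missing from the index are skipped in this sketch
        ids = [word_index[w] for w in sentence.split() if w in word_index]
        ids = ids[:max_len]                  # truncate long sentences
        ids += [0] * (max_len - len(ids))    # pad short ones with zeros
        encoded.append(ids)
    return encoded

index = build_word_index(['resumption of the session', 'the session'])
print(encode(['the session', 'of the session now'], index, 3))
```

The resulting lists all have length `max_len`, so they can be stacked into a 2D integer array of shape (number of sentences, `max_len`).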
Sreenivas Kashyap October 26, 2019 at 3:45 am #

Hi Jason,
Great work. I am developing a Kannada-to-English translation model, but the tokenizer does not work when splitting the text.
The output is as follows:

Saved: english-kannada.pkl
[tom woke up] => []
[give me half] => []
[we needed it] => []
[tom liked you] => []
[just go inside] => []
[do you remember] => []
[i just got back] => []
[see you at] => []

Could you suggest how I can resolve this issue?

- Jason Brownlee October 26, 2019 at 4:41 am #
  
  Sorry to hear that, I have some suggestions here:
  https://machinelearning.org.cn/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

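One possible cause, offered as a guess rather than a diagnosis: if the cleaning step normalizes text to ASCII and ignores characters that cannot be encoded, text in a non-Latin script such as Kannada is deleted entirely. The sketch below assumes that kind of cleaning step. The function names `to_ascii` and `to_nfc` are hypothetical.

```python
from unicodedata import normalize

def to_ascii(line):
    """Decompose accents, then drop every character with no ASCII form."""
    line = normalize('NFD', line).encode('ascii', 'ignore')
    return line.decode('UTF-8')

def to_nfc(line):
    """Normalize the text without discarding non-ASCII characters."""
    return normalize('NFC', line)

print(repr(to_ascii('écoute')))   # accents are stripped, letters survive
print(repr(to_ascii('ಕನ್ನಡ')))     # nothing survives for Kannada script
print(repr(to_nfc('ಕನ್ನಡ')))       # the text is kept
```

If this is what is happening, any later filter that keeps only ASCII letters or printable ASCII characters would need the same adjustment for the target language.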
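Nithin December 5, 2019 at 5:58 am #

Hello Jason,

Thanks for the tutorial and the good explanation.
I would appreciate it if you could clarify the following doubts:
1) How does marking low-frequency words as "unk" affect model accuracy, given that the "unk" token may now make up a significant proportion of the data?
2) What are your thoughts on character-level versus word-level encoding?
3) For larger vocabularies (around 0.1M), are there techniques and software that use a sparse representation of words instead of one-hot encoding, to reduce memory requirements during training?

Thanks
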
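- Jason Brownlee December 5, 2019 at 6:44 am #
  
  Remove those words from the vocabulary, then mark out-of-vocabulary words as unk when preprocessing text for the model.
  
  As far as I know, modeling at the word level is currently more effective.
  
  A vocabulary of 100K is modest. Don't worry about it.
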
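The procedure described in this reply can be sketched as follows. The function names are hypothetical and the three-line corpus is a toy example; `unk_share` is added only to show how one might measure what fraction of the data ends up as "unk" for a given threshold.

```python
from collections import Counter

def build_vocab(lines, min_occurrence):
    """Keep words that occur at least min_occurrence times."""
    counts = Counter(w for line in lines for w in line.split())
    return {w for w, c in counts.items() if c >= min_occurrence}

def mark_unknown(lines, vocab, unk='unk'):
    """Replace every out-of-vocabulary word with the unk token."""
    return [' '.join(w if w in vocab else unk for w in line.split())
            for line in lines]

def unk_share(lines, unk='unk'):
    """Fraction of all tokens that are the unk token."""
    tokens = [w for line in lines for w in line.split()]
    return tokens.count(unk) / len(tokens) if tokens else 0.0

lines = ['the session resumed', 'the session adjourned', 'the house rose']
vocab = build_vocab(lines, min_occurrence=2)
marked = mark_unknown(lines, vocab)
print(marked)
print('unk share: %.2f' % unk_share(marked))
```

Checking the unk share at several thresholds is one way to choose a cutoff that shrinks the vocabulary without replacing too much of the text.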
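  - Nithin December 7, 2019 at 4:55 am #
    
    Hello Jason,
    
    Thanks for your answer.
    We are using an encoder-decoder model on Europarl. We use 2 GRU layers, each with 128 units, and a time-distributed layer. Following the tutorial, the French vocabulary is about 50K (with a threshold of 5). We want to train this model on a single GPU with 12GB of GPU memory, but with a batch size of 16 or 32 the GPU memory fills up and we get an out-of-memory error.
    We suspect the most likely cause is the one-hot representation of each word, which has a dimension of 50K.
    The time-distributed layer also has the shape (None, 528 (maximum French sentence length), 50K (output vector size = French vocabulary size)).
    We would like to know whether there is any way to avoid this (e.g. libSVM), or whether there is a more efficient representation for RNNs with large vocabularies that would help us train with large batch sizes.
    
    Thank you
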
    - Jason Brownlee December 7, 2019 at 5:41 am #
      
      Perhaps try using a generator to load the batches progressively?

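      - Nithin December 7, 2019 at 7:59 am #
        
        Yes, we are using a generator. The model runs out of memory when creating a tensor of size [19136, 58802] ([(maximum French sentence length) * batch size, French vocabulary size]), where the maximum French sentence length is about 600 and the batch size is 1024. This seems right, because with an 8-byte float representation that matrix would be close to 8GB. So we would like to know whether there is any way around this, and what the state-of-the-art approach to these problems is.
        
        Thank you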
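      - Jason Brownlee December 8, 2019 at 6:03 am #
        
        Perhaps try using a smaller vocabulary?
        Perhaps try using a smaller batch size?
        Perhaps try using a shorter sentence length?
        Perhaps try training on a machine with more memory?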
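The arithmetic behind the memory estimate in this exchange can be checked with a few lines. The sketch takes the tensor shape quoted in the comment as given and compares dense one-hot targets with integer class indices. The helper name `tensor_bytes` is hypothetical.

```python
def tensor_bytes(rows, cols, bytes_per_value):
    """Memory needed for a dense rows x cols tensor."""
    return rows * cols * bytes_per_value

rows, vocab = 19136, 58802   # shape quoted in the comment above

sizes = [
    ('one-hot, 8-byte floats', tensor_bytes(rows, vocab, 8)),
    ('one-hot, 4-byte floats', tensor_bytes(rows, vocab, 4)),
    ('integer ids, 4 bytes each', tensor_bytes(rows, 1, 4)),
]
for name, size in sizes:
    print('%s: %.4f GB' % (name, size / 1e9))
```

Storing targets as integer class indices avoids building the one-hot tensor. In Keras, the `sparse_categorical_crossentropy` loss accepts integer targets for this purpose. The model's softmax output still has one value per vocabulary word at each time step, so a smaller vocabulary, shorter sequences, or a smaller batch remain the main ways to shrink that part.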
Jane January 16, 2020 at 1:28 pm #

Is this considered a seq2seq model?

- Jason Brownlee January 16, 2020 at 1:34 pm #
  
  Yes.

S.Gowri pooja May 17, 2020 at 11:09 am #

Hi Jason. Thanks for sharing this post. How does this preprocessing model vary for different languages?

- Jason Brownlee May 18, 2020 at 6:07 am #
  
  Good question, I don't know off the top of my head.

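ARABA AMAN December 24, 2021 at 6:12 pm #

I am working on machine translation between Amharic and Afar.
I prepared the dataset on separate sheets, each with its corresponding target language.

So how do I train a neural machine translation model from the two file names?
That is, how do I feed these cleaned sentences into a model for training?
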
- James Carmichael February 18, 2022 at 1:07 pm #
  
  Hi Araba... I recommend using a seq2seq model, as discussed here:
  
  https://machinelearning.org.cn/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

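As a starting point for the "two files" part of the question, the sketch below loads two pickled lists of cleaned sentences and pairs them up line by line. It assumes the files were written with `pickle.dump` (as in the `save_clean_sentences` function quoted earlier in this thread) and that both lists are in the same order. The function names and the file names in the usage note are hypothetical.

```python
from pickle import load

def load_clean_sentences(filename):
    """Load a pickled list of sentences (only open files you trust)."""
    with open(filename, 'rb') as f:
        return load(f)

def load_pairs(src_filename, tgt_filename):
    """Return a list of (source, target) sentence pairs."""
    src = load_clean_sentences(src_filename)
    tgt = load_clean_sentences(tgt_filename)
    if len(src) != len(tgt):
        raise ValueError('source and target files differ in length')
    return list(zip(src, tgt))
```

For example, `pairs = load_pairs('amharic.pkl', 'afar.pkl')` would give a list of sentence pairs that can then be integer encoded and split into training and test sets before fitting an encoder-decoder model.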
