A Gentle Introduction to Calculating the BLEU Score for Text in Python

By Jason Brownlee on December 19, 2019 in Deep Learning for Natural Language Processing

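BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.

Although it was developed for translation, it can also be used to evaluate text generated for a range of natural language processing tasks.

In this tutorial, you will discover how to evaluate and score candidate text with the BLEU score using the NLTK library in Python.

After completing this tutorial, you will know:

A gentle introduction to the BLEU score and an intuition for what is being calculated.
How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Update May/2019: Updated to reflect changes to the API in NLTK 3.4.1+.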


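Tutorial Overview

This tutorial is divided into 4 parts; they are:

Bilingual Evaluation Understudy Score
Calculate BLEU Scores
Cumulative and Individual BLEU Scores
Worked Examples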


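Bilingual Evaluation Understudy Score

The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence.

A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.

The score was developed for evaluating the predictions made by automatic machine translation systems. It is not perfect, but it offers 5 compelling benefits:

It is quick and inexpensive to calculate.
It is easy to understand.
It is language independent.
It correlates highly with human evaluation.
It has been widely adopted.

The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper "BLEU: a Method for Automatic Evaluation of Machine Translation".

The approach works by counting the n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram or unigram is each token and a bigram comparison is each word pair. The comparison is position-independent: an n-gram counts as a match wherever it occurs in the reference.

The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.

- BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

The counting of matching n-grams is modified to ensure that it takes the occurrence of the words in the reference text into account, rather than rewarding a candidate translation that generates an abundance of reasonable words. This is referred to in the paper as modified n-gram precision.

Unfortunately, MT systems can overgenerate "reasonable" words, resulting in improbable, but high-precision, translations [...] Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision.

- BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.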
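
To make the clipping idea concrete, here is a minimal sketch in plain Python (no NLTK) of modified unigram precision against a single reference. The function name `modified_unigram_precision` is an illustrative assumption, and the toy sentences are modeled on the "the the the" example discussed in the BLEU paper.

```python
from collections import Counter
from fractions import Fraction

def modified_unigram_precision(reference, candidate):
    """Clipped unigram precision of a candidate against one reference."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Each candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return Fraction(clipped, len(candidate))

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the'] * 7

# Unclipped precision would be 7/7; clipping limits the credit for 'the' to 2.
print(modified_unigram_precision(reference, candidate))  # prints 2/7
```

Without clipping, the degenerate candidate would score a perfect unigram precision, which is exactly the overgeneration problem the paper describes.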

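The score is intended for comparing sentences, but a modified version that normalizes n-grams by their occurrence is also proposed for better scoring blocks of multiple sentences.

We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, pn, for the entire test corpus.

- BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.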
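
The corpus-level calculation quoted above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: one reference per candidate sentence, and hypothetical helper names `ngrams` and `corpus_modified_precision` that are not part of NLTK.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_modified_precision(references, candidates, n):
    """Pool clipped n-gram counts over all sentences, then divide once."""
    clipped_total = 0
    candidate_total = 0
    for reference, candidate in zip(references, candidates):
        ref_counts = Counter(ngrams(reference, n))
        cand_counts = Counter(ngrams(candidate, n))
        clipped_total += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        candidate_total += sum(cand_counts.values())
    return clipped_total / candidate_total if candidate_total else 0.0

# Toy corpus of two sentences, one reference each (an assumption for brevity).
references = [['this', 'is', 'a', 'test'], ['the', 'cat', 'sat', 'down']]
candidates = [['this', 'is', 'a', 'test'], ['the', 'cat', 'stood']]

p1 = corpus_modified_precision(references, candidates, 1)  # 6 of 7 unigrams match
p2 = corpus_modified_precision(references, candidates, 2)  # 4 of 5 bigrams match
print(p1, p2)
```

Note that the counts are summed before dividing, so this is not the same as averaging per-sentence precisions; longer sentences contribute more n-grams to the total.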

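A perfect score is not possible in practice, as a translation would have to match the reference exactly. Even human translators cannot achieve this. The number and quality of the references used to calculate the BLEU score mean that comparing scores across datasets can be troublesome.

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. [...] on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.

- BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

In addition to translation, we can use the BLEU score for other language generation problems tackled with deep learning methods, such as:

Language generation.
Image caption generation.
Text summarization.
Speech recognition.

And much more.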

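Calculate BLEU Scores

The Python Natural Language Toolkit library, or NLTK, provides an implementation of the BLEU score that you can use to evaluate your generated text against a reference.

Sentence BLEU Score

NLTK provides the sentence_bleu() function for evaluating a candidate sentence against one or more reference sentences.

The reference sentences must be provided as a list of sentences where each reference is a list of tokens. The candidate sentence is provided as a list of tokens. For example:

from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]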
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score)

Running this example prints a perfect score, because the candidate matches one of the references exactly.

1.0

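Corpus BLEU Score

NLTK also provides a function called corpus_bleu() for calculating the BLEU score for multiple sentences such as a paragraph or a document.

The references must be specified as a list of documents where each document is a list of references and each alternative reference is a list of tokens, e.g. a list of lists of lists of tokens. The candidate documents must be specified as a list where each document is a list of tokens, e.g. a list of lists of tokens.

This is a little confusing; here is an example of two references for one document.

# two references for one document
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]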
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score)

Running the example prints a perfect score, as before.

1.0
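
The nesting is easy to get wrong when starting from plain strings, since a raw string passed where a token list is expected would be iterated character by character. The sketch below builds the two structures from raw text; the sentences and variable names are made up for illustration, and whitespace tokenization is an assumption.

```python
# Raw text: each document has one candidate and one or more references.
raw_references = [
    ['this is a test', 'this is test'],   # two references for document 1
    ['the cat sat on the mat'],           # one reference for document 2
]
raw_candidates = ['this is a test', 'the cat is on the mat']

# Whitespace tokenization is an assumption; a real project may need a proper tokenizer.
references = [[ref.split() for ref in doc_refs] for doc_refs in raw_references]
candidates = [cand.split() for cand in raw_candidates]

# There must be exactly one list of references per candidate document.
assert len(references) == len(candidates)
print(references[0])
print(candidates[0])
```

The resulting `references` and `candidates` have the list-of-lists-of-lists and list-of-lists shapes described above, so they could be passed as `corpus_bleu(references, candidates)`.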

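Cumulative and Individual BLEU Scores

The BLEU score calculations in NLTK allow you to specify the weighting of different n-grams in the calculation of the BLEU score.

This gives you the flexibility to calculate different types of BLEU score, such as individual and cumulative n-gram scores.

Let's take a look.

Individual N-Gram Scores

An individual N-gram score is the evaluation of matching grams of one specific order, such as single words (1-gram) or word pairs (2-gram or bigram).

The weights are specified as a tuple where each index refers to the gram order. To calculate the BLEU score only for 1-gram matches, you can specify a weight of 1 for 1-gram and 0 for 2, 3 and 4 (1, 0, 0, 0). For example: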

# 1-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(score)

Running this example prints a score of 0.75, because three of the four candidate words appear in the reference.

0.75

We can repeat this example for individual n-grams from 1 to 4 as follows:

# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Running the example gives the following results.

Individual 1-gram: 1.000000
Individual 2-gram: 1.000000
Individual 3-gram: 1.000000
Individual 4-gram: 1.000000

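Although we can calculate the individual BLEU scores, this is not how the method was intended to be used, and the scores do not carry much meaning or seem very interpretable.

Cumulative N-Gram Scores

Cumulative scores refer to the calculation of individual n-gram scores at all orders from 1 to n and combining them with a weighted geometric mean.

By default, the sentence_bleu() and corpus_bleu() functions calculate the cumulative 4-gram BLEU score, also called BLEU-4.

The weights for BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores. For example: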

# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)

Running this example prints the following score.

1.0547686614863434e-154
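
The tiny value is not noise. The candidate shares no 3-grams or 4-grams with the reference, so two of the four precisions are zero and the geometric mean collapses. The sketch below reproduces the printed number under the assumption that each zero precision is replaced by the smallest positive float (`sys.float_info.min`); that substitution is inferred from the output rather than from anything stated in this tutorial, so treat it as a guess about the NLTK version used here.

```python
import math
import sys

# Modified precisions for candidate 'this is a test' vs reference 'this is small test'.
p1 = 3 / 4   # unigrams: 'this', 'is', 'test' match
p2 = 1 / 3   # bigrams: only ('this', 'is') matches
# Assumption: zero 3-gram and 4-gram precisions are swapped for the smallest positive float.
p3 = p4 = sys.float_info.min

weights = (0.25, 0.25, 0.25, 0.25)
# Work in log space; multiplying the tiny values directly would underflow to 0.0.
# Candidate and reference have the same length, so no length penalty is applied.
log_score = sum(w * math.log(p) for w, p in zip(weights, (p1, p2, p3, p4)))
score = math.exp(log_score)
print(score)  # approximately 1.05e-154
```

In other words, a score of this magnitude should be read as "effectively zero because some n-gram order had no matches", not as a meaningful measurement.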

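The cumulative and individual 1-gram BLEU use the same weights, e.g. (1, 0, 0, 0). The 2-gram weights assign 50% to each of the 1-gram and 2-gram scores, and the 3-gram weights assign 33% to each of the 1, 2 and 3-gram scores.

Let's make this concrete by calculating the cumulative scores for BLEU-1, BLEU-2, BLEU-3 and BLEU-4: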

# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

Running the example prints the following scores. They are quite different from, and more expressive than, the standalone individual n-gram scores.

Cumulative 1-gram: 0.750000
Cumulative 2-gram: 0.500000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
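
These cumulative values can be checked by hand as a weighted geometric mean of the n-gram precisions. The sketch below is not NLTK's implementation: the helper `cumulative_bleu` is a hypothetical name, the precisions are counted by hand for this one sentence pair, and no length penalty is applied because candidate and reference have the same length.

```python
import math

def cumulative_bleu(precisions, weights):
    """Weighted geometric mean of n-gram precisions (no length penalty)."""
    used = [(p, w) for p, w in zip(precisions, weights) if w > 0]
    # A zero precision with non-zero weight drives the geometric mean to zero.
    if any(p == 0 for p, _ in used):
        return 0.0
    return math.exp(sum(w * math.log(p) for p, w in used))

# Hand-counted for candidate 'this is a test' vs reference 'this is small test'.
precisions = [3 / 4, 1 / 3, 0 / 2, 0 / 1]

bleu1 = cumulative_bleu(precisions, (1, 0, 0, 0))
bleu2 = cumulative_bleu(precisions, (0.5, 0.5, 0, 0))
bleu3 = cumulative_bleu(precisions, (0.33, 0.33, 0.33, 0))
print('%f %f %f' % (bleu1, bleu2, bleu3))  # prints 0.750000 0.500000 0.000000
```

BLEU-2 is the square root of 3/4 times 1/3, which is 0.5, while BLEU-3 and BLEU-4 vanish because no 3-gram or 4-gram in the candidate matches the reference.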

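It is common to report the cumulative BLEU-1 to BLEU-4 scores when describing the skill of a text generation system.

Worked Examples

In this section, we try to develop further intuition for the BLEU score with some examples.

We work at the sentence level with the following single reference sentence:

the quick brown fox jumped over the lazy dog

First, let's look at a perfect match.

# perfect match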
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)

Running this example prints a perfect match.

1.0

Next, let's change one word, "quick" to "fast".

# one word different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)

The result is a drop in score, to about 0.75.

0.7506238537503395
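
This number can be reproduced without NLTK. The sketch below counts clipped n-gram matches for n = 1 to 4 and takes their geometric mean with equal weights; the helper names `ngrams` and `modified_precision` are illustrative assumptions, and the code handles only a single reference of the same length as the candidate.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = 'the quick brown fox jumped over the lazy dog'.split()
candidate = 'the fast brown fox jumped over the lazy dog'.split()

# Precisions for n = 1..4 work out to 8/9, 6/8, 5/7 and 4/6.
precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
# Same length as the reference, so no length penalty applies here.
score = math.exp(sum(0.25 * math.log(p) for p in precisions))
print(precisions)
print(score)  # approximately 0.7506
```

A single wrong word removes one unigram match but two matches at every higher order, which is why the score falls by a quarter rather than by one ninth.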

Try changing two words, both "quick" to "fast" and "lazy" to "sleepy".

# two words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)

Running this example, we can see a roughly linear drop in score.

0.4854917717073234

What about if all words are different in the candidate?

# all words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
score = sentence_bleu(reference, candidate)
print(score)

We achieve the worst possible score.

0.0

Now, let's try a candidate that has fewer words than the reference (e.g. drop the last two words), but where the words are all correct.

# shorter candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the']
score = sentence_bleu(reference, candidate)
print(score)

The score is much like the score when one word was wrong above (about 0.75), rather than the score when two words were wrong.

0.7514772930752859
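
Every n-gram in the shortened candidate also occurs in the reference, so all four precisions are 1 and the whole drop comes from BLEU's brevity penalty. In the BLEU paper this penalty is 1 when the candidate is longer than the reference and exp(1 - r/c) otherwise, where r is the reference length and c the candidate length. A minimal sketch follows; the function name `brevity_penalty_sketch` is an assumption made for illustration.

```python
import math

def brevity_penalty_sketch(reference_length, candidate_length):
    """Penalty from the BLEU paper: 1 if the candidate is longer, else exp(1 - r/c)."""
    if candidate_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / candidate_length)

# Reference has 9 tokens; the shortened candidate has 7.
print(brevity_penalty_sketch(9, 7))   # approximately 0.7515
# A candidate longer than the reference is not penalized for length.
print(brevity_penalty_sketch(9, 11))  # prints 1.0
```

The value for a 7-token candidate matches the score printed above, which is what one would expect when the precisions are all perfect.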

What about if we make the candidate two words longer than the reference?

# longer candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'from', 'space']
score = sentence_bleu(reference, candidate)
print(score)

Again the score drops: the two extra words lower every n-gram precision, and the result (about 0.79) is closer to the one-word-different case than to the two-words-different case.

0.7860753021519787

Finally, let's compare a candidate that is way too short: only two words in length.

# very short
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick']
score = sentence_bleu(reference, candidate)
print(score)

Running this example first prints a warning message indicating that the 3-gram and above part of the evaluation (up to 4-gram) cannot be performed. This is fair, given that the two-token candidate contains nothing longer than a 2-gram.

UserWarning:
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  warnings.warn(_msg)

Next, we can see a very low score.

0.0301973834223185
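
The warning points to SmoothingFunction(). NLTK's own smoothing methods are not reproduced here; the following is only a minimal sketch of one simple idea, replacing a zero match count with a small constant so that the geometric mean stays positive. The function name, the `epsilon` value of 0.1 and the hand-counted n-gram statistics are all assumptions for illustration.

```python
import math

def smoothed_geometric_mean(matches, totals, epsilon=0.1):
    """Geometric mean of n-gram precisions, with zero match counts replaced by epsilon."""
    log_sum = 0.0
    for match, total in zip(matches, totals):
        numerator = match if match > 0 else epsilon  # floor for orders with no matches
        log_sum += math.log(numerator / total) / len(matches)
    return math.exp(log_sum)

# Hand-counted for candidate 'this is a test' vs reference 'this is small test':
# clipped matches and candidate n-gram totals for n = 1..4.
matches = [3, 1, 0, 0]
totals = [4, 3, 2, 1]

score = smoothed_geometric_mean(matches, totals)
print(score)
```

With NLTK itself, a smoothing method is chosen by passing one of the SmoothingFunction() methods through the smoothing_function argument of sentence_bleu(). Smoothed and unsmoothed scores are not comparable, so the same setting should be used for every system being compared.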

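I would encourage you to continue to play with examples.

The math is pretty simple, and I would also encourage you to read the paper and explore calculating the sentence-level score yourself in a spreadsheet.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the BLEU score for evaluating and scoring candidate text against reference text in machine translation and other language generation tasks.

Specifically, you learned:

A gentle introduction to the BLEU score and an intuition for what is being calculated.
How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.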


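114 Responses to A Gentle Introduction to Calculating the BLEU Score for Text in Python

ngc December 25, 2017 at 10:08

Hi Dr. Brownlee, I was wondering whether BLEU can be used as a criterion for early stopping?

- Jason Brownlee December 26, 2017 at 05:12

  Sure.

  - Bashayer January 7, 2020 at 1:41 am

    How can I download the BLEU score?

    - Jason Brownlee January 7, 2020 at 7:24 am

      Sorry, I don't follow. Can you elaborate?

Sasikanth January 8, 2018 at 5:01 pm

Hi Jason,

Good to know about BLEU (Bilingual Evaluation Understudy). Is there a package for this in R?

Thanks

- Jason Brownlee January 9, 2018 at 5:24 am

  I don't know, sorry.

Davin Chern January 31, 2018 at 6:34 pm

Hi Jason,

Thank you for your great introduction to BLEU.

When I tried to calculate the BLEU score for multiple sentences with corpus_bleu(), I found something strange.

Suppose I have a paragraph with two sentences and I try to translate them both. Here are two cases:

Case 1
references = [[['a', 'b', 'c', 'd']], [['e', 'f', 'g']]]
candidates = [['a', 'b', 'c', 'd'], ['e', 'f', 'g']]
score = corpus_bleu(references, candidates)

Case 2
references = [[['a', 'b', 'c', 'd', 'x']], [['e', 'f', 'g', 'y']]]
candidates = [['a', 'b', 'c', 'd', 'x'], ['e', 'f', 'g', 'y']]
score = corpus_bleu(references, candidates)

I assumed both cases would give me a result of 1.0, but only the second does, while the first gives 0.84. In fact, when both sentences have length >= 4 the answer is always 1.0, so I think it is because the second sentence of case 1 has no 4-gram.

In practice, when dealing with sentences shorter than 4 tokens, do we have to make corpus_bleu() ignore the redundant n-gram cases by setting appropriate weights?

Thank you very much for your help!

- Jason Brownlee February 1, 2018 at 7:18 am

  Yes, ideally. I'd also recommend reporting the 1, 2, 3 and 4-gram scores separately.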
Daniel Pietschmann June 25, 2018 at 12:22 am

Dear Jason Brownlee,

Thank you very much for this great tutorial, it is really helpful!

Unfortunately, I get an error when using this code:
"# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))"

This is the error message: "Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
warnings.warn(_msg)"

I tried adding a smoothing function:
"from nltk.translate.bleu_score import SmoothingFunction
chencherry = SmoothingFunction()
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=chencherry.method4))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0), smoothing_function=chencherry.method4))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0), smoothing_function=chencherry.method4))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=chencherry.method4))"

That helped and the error message is gone, but the scores I get now differ from yours:
"Cumulative 1-gram: 0.750000
Cumulative 2-gram: 0.500000
Cumulative 3-gram: 0.358299
Cumulative 4-gram: 0.286623"

I don't quite understand where the problem is, or why I am getting different results now.
I would be very grateful if you could explain what is wrong with my code!

Thanks a lot in advance 🙂

- Jason Brownlee June 25, 2018 at 6:23 am

  You need a longer example of text.
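Gaurav Gupta July 24, 2018 at 4:55 am

Great tutorial!

- Jason Brownlee July 24, 2018 at 6:24 am

  Thanks.

sawsan August 18, 2018 at 9:27 pm

Thank you,

- Jason Brownlee August 19, 2018 at 6:18 am

  You're welcome.

Francesco September 7, 2018 at 2:42 am

In the "lazy fox" example, changing "quick" to "fast" causes a significant drop in the BLEU score, yet the two sentences mean the same thing.

I wonder whether we could mitigate this effect by using word vectors instead of the words themselves. Are you aware of any BLEU algorithm that uses word embeddings?

- Jason Brownlee September 7, 2018 at 8:09 am

  Agreed.

  Scoring based on meaning is a great idea. Sorry, I have not seen anything on this.

- Todd September 22, 2020 at 3:36 am

  Hey, that is exactly what I was thinking..
  Have you run any experiments on this? I think word embeddings would definitely give more reasonable results.

Aziz October 12, 2018 at 12:10 pm

Hi Jason, thanks for your great tutorial.

I think there is a mistake in the tutorial: it contradicts our intuition rather than matching it. The scores for the candidates that are two words shorter or longer are very similar to the score for one word different, not two words different.

- Jason Brownlee October 13, 2018 at 6:07 am

  Really?

  Well, there is no perfect measure for all cases.

Chen Mei February 1, 2019 at 12:47 am

How do I calculate ROUGE, CIDEr and METEOR values in Python?

- Jason Brownlee February 1, 2019 at 5:39 am

  Sorry, I don't have examples of calculating these measures.

  - Zara July 31, 2019 at 7:35 am

    Could you create a tutorial on how to calculate METEOR, TER and ROUGE?

    - Jason Brownlee July 31, 2019 at 2:05 pm

      Great suggestion, thanks!

      - Micky February 12, 2020 at 11:06 am

        When will you create a tutorial on how to calculate METEOR, TER and ROUGE, sir?
      - Jason Brownlee February 12, 2020 at 1:36 pm

        There is no fixed schedule at this stage.

Dave Howcroft March 8, 2019 at 1:36 am

I think it is misleading to suggest BLEU for generation and image captioning. BLEU seems to work for what it was designed for (evaluating MT during development), but there is no evidence to support it as a good metric for NLG. For example, see Ehud Reiter's paper from last year: https://aclanthology.info/papers/J18-3002/j18-3002

One problem is that BLEU is usually calculated with only a few reference texts, but there is also reason to believe that we cannot reasonably expand the reference set to cover enough of the space of valid texts for a given meaning to make it a good metric (see this study of similar metrics for grammatical error correction: https://aclanthology.info/papers/P18-1059/p18-1059).

- Jason Brownlee March 8, 2019 at 7:53 am

  I think you are right; perplexity might be a better measure for language generation tasks.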
Madhav April 12, 2019 at 3:54 pm

Hi Jason,

I am working on automatic question generation. Can I use BLEU as an evaluation metric? If yes, how does it fit the problem? If not, what other metric would you suggest?

- Jason Brownlee April 13, 2019 at 6:21 am

  Perhaps, or ROUGE or a similar score.

  Perhaps check recent papers on the topic and see what is commonly used.

Shubham May 9, 2019 at 3:04 am

>>> from nltk.translate.bleu_score import sentence_bleu
>>> reference = [['this', 'is', 'small', 'test']]
>>> candidate = ['this', 'is', 'a', 'test']
>>> print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
1.0547686614863434e-154

- Jason Brownlee May 9, 2019 at 6:47 am

  Nice work!

Justen May 21, 2019 at 5:18 pm

I am having the same problem.
The example you give scores 0.707106781187,
whereas I get an extremely low score of 1.0547686614863434e-154.
What is going on?

- Justen May 21, 2019 at 5:21 pm

  Sorry, too much typing.
  Shubham gave this code as an example.
  >>> from nltk.translate.bleu_score import sentence_bleu
  >>> reference = [['this', 'is', 'small', 'test']]
  >>> candidate = ['this', 'is', 'a', 'test']
  >>> print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
  1.0547686614863434e-154

  You run the same code but get a BLEU score of 0.707106781187.
  What is going on?

  - Jason Brownlee May 22, 2019 at 7:43 am

    I get the same result too; perhaps the API has changed recently?

    I will schedule time to update the post.

- Jason Brownlee May 22, 2019 at 7:39 am

  That is surprising. Is your library up to date? Did you copy all of the code?

  - Justen May 23, 2019 at 12:31 am

    Yes, exactly the same code and exactly the same example.
    Because of this "bug", many BLEU scores evaluate to zero or nearly zero, e.g. 1.0547686614863434e-154. I have not found the cause yet.

    - Jason Brownlee May 23, 2019 at 6:04 am

      I will investigate.

- Sanjita Suresh July 2, 2019 at 3:29 am

  Were you able to find a solution? I am facing the same problem.
Pavithra June 12, 2019 at 9:00 pm

Is it possible to find the BLEU score of a machine translation model?

- Jason Brownlee June 13, 2019 at 6:15 am

  Sure.

Pavithra June 13, 2019 at 1:50 pm

Could you share how to determine the BLEU score of a model? I built a machine translation model with Moses.

Thanks.

- Jason Brownlee June 13, 2019 at 2:36 pm

  The tutorial above shows how to calculate it.

Sanjita Suresh July 2, 2019 at 3:27 am

Thank you for the excellent tutorial.
I get different BLEU scores in Google Colab and in Jupyter Notebook.

prediction_en = 'A man in an orange hat presenting something'
reference_en= 'A man in an orange hat starring at something'

The code used:

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

sentence_bleu(reference_en, prediction_en , smoothing_function=smoother.method4)

For this, I get a BLEU score of 0.768521 in Google Colab, but in Jupyter Notebook I get 1.3767075064676063e-231 without smoothing and 0.3157039 with smoothing.

Can you help me see what I am doing wrong?

- Jason Brownlee July 2, 2019 at 7:35 am

  I recommend not using notebooks.
  https://machinelearning.org.cn/faq/single-faq/why-dont-use-or-recommend-notebooks

- alex September 14, 2019 at 1:22 am

  Try adding weights for both Colab and your local machine, something like this:

  score = sentence_bleu(reference, candidate, weights=(1, 0,0,0))

  I get the same result.

- Tohida February 23, 2022 at 1:57 am

  The score in Colab is now very low. Can you tell me why?
  prediction_en = 'A man in an orange hat presenting something.'
  reference_en= 'A man in an orange hat starring at something.'

  from nltk.translate.bleu_score import sentence_bleu
  from nltk.translate.bleu_score import SmoothingFunction
  smoother = SmoothingFunction()

  sentence_bleu(reference_en, prediction_en,smoothing_function=smoother.method4)
  Output: 0.013061727262337088

  - James Carmichael February 23, 2022 at 12:21 pm

    Hi Tohida... Are you seeing different results in Colab and in another environment?

Quang Le July 16, 2019 at 11:20 pm

Hi Jason, I am training 2 neural machine translation models with fairseq-py (model A and model B, each with different improvements). When I evaluate the models with the BLEU score, model A scores 25.9 and model B scores 25.7. I then filtered the data by length into 4 ranges: 1 to 10 words, 11 to 20 words, 21 to 30 words and 31 to 40 words. When I re-evaluate on each filtered set, every BLEU score of model B is higher than that of model A. Do you think this is normal?

- Jason Brownlee July 17, 2019 at 8:26 am

  Yes, a finer-grained evaluation might be more meaningful for you.
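Shyam Yadav December 15, 2019 at 6:13 am

What is the difference between BLEU-1, BLEU-2, BLEU-3 and BLEU-4? Are they 1-gram, 2-gram, and so on? I have another doubt: for BLEU with n = 4, what is the difference between weights=(0.25, 0.25, 0.25, 0.25) and weights=(0, 0, 0, 1)?

- Jason Brownlee December 16, 2019 at 6:02 am

  They evaluate word sequences of different lengths.

  You can see the difference in some of the examples above.

  - Shyam Yadav December 17, 2019 at 8:50 pm

    I am still confused about when to use (0,0,0,1) and when to use (0.25, 0.25, 0.25, 0.25).

    - Jason Brownlee December 18, 2019 at 6:04 am

      Good question.

      Use 0,0,0,1 when you only care about correct 4-grams.

      Use 0.25,0.25,0.25,0.25 when you care about 1-, 2-, 3- and 4-grams, all with the same weight.

      - Shyam Yadav December 19, 2019 at 4:58 am

        OK, thank you.
      - shab May 9, 2022 at 7:38 pm

        Why are unigrams considered when unigrams just represent the words themselves, whereas the cumulative n-gram scores... are bi-, tri- or four-gram scores better than unigram scores?
      - James Carmichael May 10, 2022 at 12:09 pm

        Hi Shab... Please rephrase your question so that we can better assist you.

Shyam Yadav December 26, 2019 at 6:20 am

Can you tell me how to calculate weights = (0.5,0.5,0,0) with pen and paper? It can be for any reference and prediction.

- Shyam Yadav December 26, 2019 at 6:22 am

  I used this:

  references = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
  candidates = ['the', 'quick', 'fox', 'jumped', 'on', 'the', 'dog']

  score = sentence_bleu(references, candidates, weights = (0.5,0.5,0,0), smoothing_function=SmoothingFunction().method0)
  print(score)

  and got the following output:

  0.4016815092325757

  Can you walk me through the math step by step? Thank you so much!

  - Jason Brownlee December 26, 2019 at 7:43 am

    Yes, you can see the calculation in the paper.

- Jason Brownlee December 26, 2019 at 7:42 am

  Great question!

  The paper referenced in the tutorial will show you the calculation.

  - Shyam Yadav December 26, 2019 at 5:54 pm

    Where is the tutorial? Where can I see the calculation in the paper? Can you give me the link?

    - Jason Brownlee December 27, 2019 at 6:32 am

      I don't have a tutorial on the calculation in the paper.

      - Shyam Yadav December 28, 2019 at 12:28 am

        "The paper referenced in the tutorial will show you the calculation."

        What about this? Is there a link? Or could you provide a tutorial, or a figure showing how the cumulative BLEU score is calculated? Please let me know if that is possible.
      - Jason Brownlee December 28, 2019 at 7:48 am

        I may cover the topic in the future.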
safia February 21, 2020 at 7:03 pm

Hello Jason,
Could you respond to or verify these assumptions (link) about BLEU?
https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213

- Jason Brownlee February 22, 2020 at 6:23 am

  Sorry, I don't have the capacity to review third-party tutorials for you.

  Perhaps you can summarize the problem you are having in a sentence or two?

  - safia March 1, 2020 at 11:58 pm

    Oh, my apologies. Thank you for all your tutorials. They are very helpful to us.

    - Jason Brownlee March 2, 2020 at 6:16 am

      You're welcome.

sawsan February 27, 2020 at 4:53 pm

Please, how do I use a smoothing function at the corpus level?

- Jason Brownlee February 28, 2020 at 5:58 am

  Specify the smoothing when calculating the score, after your model has made predictions for the whole corpus.

  - sawsan February 28, 2020 at 4:34 pm

    Thank you

    - Jason Brownlee February 29, 2020 at 7:07 am

      You're welcome.

sawsan March 2, 2020 at 5:06 pm

Please, I applied the formula to this example:

from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'sitting','on', 'the' ,'mat']]
candidate = ['on', 'the', 'mat', 'is','a','cat']
score = sentence_bleu(reference, candidate,weights=(0.25, 0.25, 0.25,0.25))
print(score)
5.5546715329196825e-78

But according to the formula in this blog post
https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4
the score I get is 0.454346419
bleu=EXP(1-7/6)*EXP(LN(0.83)*0.25+LN(0.4)*0.25+LN(0.25)*0.25)
Why are the results different?
Can you help me?

- Jason Brownlee March 3, 2020 at 5:56 am

  I am not familiar with that blog; perhaps contact the author directly.

sawsan Asjea March 2, 2020 at 5:47 pm

Hi Jason.
I think the results will match if you do this:
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'sitting','on', 'the' ,'mat']]
candidate = ['on', 'the', 'mat', 'is','a','cat']
score = sentence_bleu(reference, candidate,weights=(0.25, 0.25, 0.25))
print(score)
0.4548019047027907

Is it correct to drop the last weight of 0.25? How do you think I should interpret this?

amel March 26, 2020 at 4:50 am

Great tutorial, thank you Jason.

- Jason Brownlee March 26, 2020 at 8:02 am

  Thanks!
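Amrutesh April 14, 2020 at 4:23 am

I built a neural network to generate multiple captions.
I am using the flickr8k dataset, so I have 5 captions as candidates.
How do I generate a BLEU score for multiple captions?

- Jason Brownlee April 14, 2020 at 6:27 am

  See this tutorial for an example.
  https://machinelearning.org.cn/develop-a-deep-learning-caption-generation-model-in-python/

Sunny April 25, 2020 at 4:52 am

How can I use BLEU to compare two text generation models, say an LSTM and an n-gram model, using the generated text? What is the reference in this case?

- Jason Brownlee April 25, 2020 at 7:04 am

  Calculate the score for each model on the same dataset, then compare.

  - Sunny April 26, 2020 at 8:11 am

    Yes, but what is the reference? I know the candidate is the output text, but what is the reference? If I use my whole training set to generate text, would my original 50,000 lines of text be the reference?

    - Jason Brownlee April 27, 2020 at 5:22 am

      The expected text output for the test data is used as the reference.

Dooji May 5, 2020 at 6:33 am

Hello! Thank you for your post. I am using BLEU to evaluate a summarization model. As a result, the sentences generated by my model and the ground truth summaries do not line up, and their counts differ. I wonder whether this would be a problem if I want to use corpus_bleu, because in the documentation each sentence in hyp seems to have a corresponding reference sentence.

- Jason Brownlee May 5, 2020 at 6:37 am

  From memory, I think it is fine, as long as there are some n-grams to compare.

  - Dr. Abdulnaser November 21, 2020 at 12:00 pm

    Thank you Jassin for the informative tutorial. Can I use BLEU for machine translation post-editing?

    - Jason Brownlee November 21, 2020 at 1:05 pm

      Yes, you can use BLEU to evaluate a machine translation model.

ghaith July 22, 2020 at 4:46 am

Why is the loss so high?
Does it reach 2?
I know it must be below 1.

- Jason Brownlee July 22, 2020 at 5:45 am

  Perhaps try re-training the model.

Asha October 2, 2020 at 4:15 pm

Hello sir,

Great tutorial!

Can I calculate the BLEU score for translations that involve transliterated text?

- Jason Brownlee October 3, 2020 at 6:05 am

  Yes.

DHILIP KUMAR T.P October 16, 2020 at 5:06 am

Hello Jason,

Can we apply the BLEU score to a speech recognition system?

- Jason Brownlee October 16, 2020 at 5:57 am

  I don't know off the top of my head, sorry. My guess is no, and I would recommend checking the literature.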
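Felipe November 29, 2020 at 3:33 am

Hello, Dr. Brownlee.

I would like to know whether it is normal or correct to use the BLEU score during the training phase of a deep learning model, or whether it should only be used in the testing phase.

I have a deep learning model and three datasets: training, validation and test.

- Jason Brownlee November 29, 2020 at 8:15 am

  It is used in the testing phase, for model evaluation.

Azaz Ur Rehman Butt March 1, 2021 at 9:08 pm

Dear Jason, I am working on an image captioning task and my BLEU score is below 0.6. Is that acceptable for my model, or do I need to improve it?

- Jason Brownlee March 2, 2021 at 5:43 am

  Perhaps compare it to other, simpler models on the same dataset to see whether it has relative skill.

Stanislav March 17, 2021 at 5:41 am

In another sentence_bleu tutorial, I noticed that the weights for 3-gram were defined as (1, 0, 1, 0). Can you explain this? I have no idea what the first number in the tuple is for.

- Jason Brownlee March 17, 2021 at 6:11 am

  Sorry, I don't understand your question. Can you elaborate?

KG17 April 22, 2021 at 11:19 pm

This is great information, but I have a question about calculating the BLEU score for whole documents. The examples you show are for sentences, and I am interested in comparing .txt documents. Do you perhaps have an example, or could you explain how to do this, because it is not entirely clear to me from the explanation.
Thank you very much, any advice would be appreciated!

- Jason Brownlee April 23, 2021 at 5:04 am

  Thanks.

  Perhaps average over sentences? See the examples above.

srz May 5, 2021 at 5:54 am

Hey Jason. Thanks for such a concise and clear explanation.
However, I have been working on language models lately and noticed that people report BLEU scores as high as 36 and 50. How is that possible when the perfect score is 1?
A Google Cloud article says that a good BLEU score is above 50.
Where am I going wrong in my understanding?
Thank you

- Jason Brownlee May 5, 2021 at 6:14 am

  You're welcome.

  Perhaps they are reporting the BLEU score multiplied by 100.

  - srz May 6, 2021 at 7:30 am

    Oh, thank you so much. Yes, they present it as a percentage rather than as a decimal.

    - Jason Brownlee May 7, 2021 at 6:23 am

      You're welcome.

MUHAMMAD KAMRAN September 2, 2021 at 6:30 pm

Hey Justin, how are you... If a caption generation model gives a BLEU score of 0.9, is that possible and acceptable, or is there something wrong with the model?

- Jason Brownlee September 3, 2021 at 5:29 am

  You must decide whether a given model is appropriate for your specific project.

Bambang Setiawan November 15, 2021 at 2:21 pm

Hi

I am using an RNN with Tensorflow/Keras to create a conversation model.

Do you know how to add the BLEU score when compiling the model?

Thanks

- Adrian Tam November 16, 2021 at 1:58 am

  There does not seem to be a BLEU implementation in Keras. You may need to check whether a third-party implementation exists, or write your own function for it.

wajahat February 23, 2022 at 8:24 pm

Is there a code implementation like this for CIDEr or METEOR?

- James Carmichael February 24, 2022 at 12:57 pm

  Hi Wajahat... I am not familiar with either.

Daniel Kleine June 5, 2023 at 10:44 pm

"Running this example prints a score of 0.5."
-> You mean 0.75, right?

MUHOOZI C.DENIS December 6, 2024 at 10:12 am

Thank you, Doctor, your work is excellent.

- James Carmichael December 7, 2024 at 5:40 am

  You are very welcome, Muhoozi! We appreciate it!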
