How to Prepare Movie Review Data for Sentiment Analysis (Text Classification)

By Jason Brownlee on December 21, 2020 in Deep Learning for Natural Language Processing

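Text data preparation is different for each problem.

Preparation starts with simple steps, like loading data, but quickly gets difficult because the cleaning tasks are very specific to the data you are working with. You need help knowing where to begin and in what order to work through the steps from raw data to data ready for modeling.

In this tutorial, you will discover how to prepare movie review text data for sentiment analysis, step by step.

After completing this tutorial, you will know:

How to load text data and clean it to remove punctuation and other non-word characters.
How to develop a vocabulary, tailor it, and save it to file.
How to prepare movie reviews using cleaning and a predefined vocabulary and save them to new files ready for modeling.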

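Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Update Oct/2017: Fixed a small bug when skipping non-matching files, thanks Jan Zett.
Update Dec/2017: Fixed a small typo in the full example, thanks Ray and Zain.
Update Aug/2020: Updated the link to the movie review dataset.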

How to Prepare Movie Review Data for Sentiment Analysis
Photo by Kenneth Lu, some rights reserved.

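Tutorial Overview

This tutorial is divided into 5 parts; they are:

Movie Review Dataset
Load Text Data
Clean Text Data
Develop Vocabulary
Save Prepared Data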


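1. Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The reviews were originally released in 2002, but an updated and cleaned-up version, referred to as "v2.0", was released in 2004.

The dataset comprises 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the "polarity dataset".

Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.

- A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

The data has been cleaned up somewhat, for example:

The dataset contains only English reviews.
All text has been converted to lowercase.
There is white space around punctuation like periods, commas, and brackets.
Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-82%).

More sophisticated data preparation may see results as high as 86% with 10-fold cross validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments with modern methods.

... depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)

- A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

You can download the dataset from here:

Movie Review Polarity Dataset (review_polarity.tar.gz, 3MB)

After unzipping the file, you will have a directory called "txt_sentoken" with two sub-directories, "neg" and "pos", containing the negative and positive reviews respectively. Reviews are stored one per file, with a naming convention from cv000 to cv999 in each of neg and pos.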
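
If you prefer to unpack the archive from Python rather than with a desktop tool, the step can be sketched as below. This is a minimal sketch under assumptions: the helper name extract_reviews() is hypothetical, and it assumes the archive unpacks to a top-level "txt_sentoken" folder with "neg" and "pos" inside, as described above.

```python
import tarfile
from pathlib import Path

# hypothetical helper: unpack the archive and count the reviews per class
def extract_reviews(archive_path, dest_dir='.'):
    # open the gzip-compressed tar archive and unpack everything
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(dest_dir)
    # count the .txt files in each class folder as a sanity check
    root = Path(dest_dir) / 'txt_sentoken'
    counts = {}
    for label in ('neg', 'pos'):
        counts[label] = len(list((root / label).glob('*.txt')))
    return counts

# example call, assuming the archive sits in the current working directory:
# print(extract_reviews('review_polarity.tar.gz'))
```

If the counts for the two folders differ from what the dataset description leads you to expect, the download or the extraction is worth checking before going further.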

Next, let's look at loading the text data.

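2. Load Text Data

In this section, we will look at loading individual text files, then processing the directories of files.

We will assume that the review data has been downloaded and is available in the current working directory in the folder "txt_sentoken".

We can load an individual text file by opening it, reading in the ASCII text, and closing the file. This is standard file handling. For example, we can load the first negative review file "cv000_29416.txt" as follows: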

# load one file
filename = 'txt_sentoken/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()

This loads the document as ASCII and preserves any white space, like new lines.
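
The same read can also be written with a with-statement and an explicit encoding. The sketch below is an alternative, not the approach used in the rest of this tutorial; the function name read_text() and the choice of 'utf-8' are assumptions, and it runs on a throwaway file so it does not need the dataset.

```python
import os
import tempfile

# hypothetical variant of the file read above
def read_text(filename, encoding='utf-8'):
    # the with-statement closes the file even if read() raises an error
    with open(filename, 'r', encoding=encoding) as file:
        return file.read()

# write a small throwaway file to read back
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as file:
    file.write('first line .\nsecond line .\n')

text = read_text(sample_path)
# repr() makes the preserved new lines visible
print(repr(text))
```

Naming the encoding avoids depending on the platform default, which can differ between machines.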

We can turn this into a function called load_doc() that takes the filename of the document to load and returns the text.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

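We have two directories, each with 1,000 documents. We can process each directory in turn by first getting a list of the files in the directory using the listdir() function, then loading each file in turn.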
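
One detail worth knowing is that listdir() returns names in arbitrary order. If you want repeatable runs, the listing step can be sketched as below; the helper name list_review_files() and the demo file names are made up for illustration.

```python
import os
import tempfile
from os import listdir

# hypothetical helper: list only the .txt files, in a stable order
def list_review_files(directory):
    # keep files with the expected extension
    names = [name for name in listdir(directory) if name.endswith('.txt')]
    # sort so that every run visits the files in the same order
    return sorted(names)

# build a throwaway directory with two review-like files and one other file
demo_dir = tempfile.mkdtemp()
for name in ('cv001_b.txt', 'cv000_a.txt', 'notes.md'):
    with open(os.path.join(demo_dir, name), 'w') as file:
        file.write('placeholder')

print(list_review_files(demo_dir))
```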

For example, we can load each document in the negative directory, using the load_doc() function to do the actual loading.

from os import listdir

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# specify directory to load
directory = 'txt_sentoken/neg'
# walk through all files in the folder
for filename in listdir(directory):
	# skip files that do not have the right extension
	if not filename.endswith(".txt"):
		continue
	# create the full path of the file to open
	path = directory + '/' + filename
	# load document
	doc = load_doc(path)
	print('Loaded %s' % filename)

Running this example prints the filename of each review after it is loaded.

...
Loaded cv995_23113.txt
Loaded cv996_12447.txt
Loaded cv997_5152.txt
Loaded cv998_15691.txt
Loaded cv999_14636.txt

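We can turn the processing of the documents into a function as well and use it as a template later for developing a function to clean all documents in a folder. For example, below we define a process_docs() function to do the same thing.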

from os import listdir

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load all docs in a directory
def process_docs(directory):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load document
		doc = load_doc(path)
		print('Loaded %s' % filename)

# specify directory to load
directory = 'txt_sentoken/neg'
process_docs(directory)

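Now that we know how to load the movie review text data, let's look at cleaning it.

3. Clean Text Data

In this section, we will look at what data cleaning we might want to do to the movie review data.

We will assume that we will be using a bag-of-words model or perhaps a word embedding that does not require too much preparation.

Split into Tokens

First, let's load one document and look at the raw tokens split by white space. We will use the load_doc() function developed in the previous section. We can use the split() function to split the loaded document into tokens separated by white space.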

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)

Running the example gives a long list of raw tokens from the document.

...
'years', 'ago', 'and', 'has', 'been', 'sitting', 'on', 'the', 'shelves', 'ever', 'since', '.', 'whatever', '.', '.', '.', 'skip', 'it', '!', "where's", 'joblo', 'coming', 'from', '?', 'a', 'nightmare', 'of', 'elm', 'street', '3', '(', '7/10', ')', '-', 'blair', 'witch', '2', '(', '7/10', ')', '-', 'the', 'crow', '(', '9/10', ')', '-', 'the', 'crow', ':', 'salvation', '(', '4/10', ')', '-', 'lost', 'highway', '(', '10/10', ')', '-', 'memento', '(', '10/10', ')', '-', 'the', 'others', '(', '9/10', ')', '-', 'stir', 'of', 'echoes', '(', '8/10', ')']

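Just looking at the raw tokens can give us a lot of ideas of things to try, such as:

Remove punctuation from words (e.g. "what's").
Remove tokens that are just punctuation (e.g. "-").
Remove tokens that contain numbers (e.g. "10/10").
Remove tokens that have one character (e.g. "a").
Remove tokens that don't have much meaning (e.g. "and").

Some ideas:

We can filter out punctuation from tokens using the string translate() function.
We can remove tokens that are just punctuation or contain numbers by using an isalpha() check on each token.
We can remove English stop words using the list loaded using NLTK.
We can filter out short tokens by checking their length.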
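
Before applying these ideas to a whole review, it can help to watch them work on a handful of tokens. The sketch below uses a small hand-picked list (taken from the raw output above) and only the standard library, so the stop word step is left out; the variable names are illustrative.

```python
import string

# a few raw tokens of the kind seen in the review above
tokens = ["where's", 'joblo', '-', '(', '7/10', ')', 'a', 'nightmare', '3']

# delete every punctuation character from each token
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# drop tokens that are now empty or contain digits
alphabetic = [w for w in stripped if w.isalpha()]

# drop one-character tokens
longer = [w for w in alphabetic if len(w) > 1]

print(stripped)
print(alphabetic)
print(longer)
```

Note that tokens made only of punctuation become empty strings after translate(), and isalpha() returns False for an empty string, which is why the second step removes them.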

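Below is an updated version that cleans this review. It assumes the NLTK stop word corpus is already installed, for example by running nltk.download('stopwords') once.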

from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# remove punctuation from each token
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)

Running the example gives a much cleaner list of tokens.

...
'explanation', 'craziness', 'came', 'oh', 'way', 'horror', 'teen', 'slasher', 'flick', 'packaged', 'look', 'way', 'someone', 'apparently', 'assuming', 'genre', 'still', 'hot', 'kids', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'sitting', 'shelves', 'ever', 'since', 'whatever', 'skip', 'wheres', 'joblo', 'coming', 'nightmare', 'elm', 'street', 'blair', 'witch', 'crow', 'crow', 'salvation', 'lost', 'highway', 'memento', 'others', 'stir', 'echoes']

We can put this into a function called clean_doc() and test it on another review, this time a positive review.

from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

Again, the cleaning procedure seems to produce a good set of tokens, at least as a first cut.

...
'comic', 'oscar', 'winner', 'martin', 'childs', 'shakespeare', 'love', 'production', 'design', 'turns', 'original', 'prague', 'surroundings', 'one', 'creepy', 'place', 'even', 'acting', 'hell', 'solid', 'dreamy', 'depp', 'turning', 'typically', 'strong', 'performance', 'deftly', 'handling', 'british', 'accent', 'ians', 'holm', 'joe', 'goulds', 'secret', 'richardson', 'dalmatians', 'log', 'great', 'supporting', 'roles', 'big', 'surprise', 'graham', 'cringed', 'first', 'time', 'opened', 'mouth', 'imagining', 'attempt', 'irish', 'accent', 'actually', 'wasnt', 'half', 'bad', 'film', 'however', 'good', 'strong', 'violencegore', 'sexuality', 'language', 'drug', 'content']

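There are many more cleaning steps we could take, and I leave them to your imagination.

Next, let's look at how we can manage a preferred vocabulary of tokens.

4. Develop Vocabulary

When working with predictive models of text, like a bag-of-words model, there is pressure to reduce the size of the vocabulary.

The larger the vocabulary, the sparser the representation of each word or document.

A part of preparing text for sentiment analysis involves defining and tailoring the vocabulary of words supported by the model.

We can do this by loading all of the documents in the dataset and building a set of words. We may decide to support all of these words, or perhaps discard some. The final chosen vocabulary can then be saved to file for later use, such as filtering words in new documents in the future.

We can keep track of the vocabulary in a Counter, which is a dictionary of words and their counts with some additional convenience functions.

We need to develop a new function to process a document and add it to the vocabulary. The function needs to load a document by calling the previously developed load_doc() function. It needs to clean the loaded document using the previously developed clean_doc() function, then add all of the tokens to the Counter and update the counts. We can do this last step by calling the update() function on the Counter object.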
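
The counting behavior is easy to check in isolation. The sketch below feeds two made-up token lists to a Counter, standing in for two cleaned reviews; the words themselves are only examples.

```python
from collections import Counter

vocab = Counter()
# each call adds the tokens of one (pretend) cleaned review
vocab.update(['film', 'good', 'film'])
vocab.update(['film', 'bad'])

# counts accumulate across calls
print(vocab['film'])
# unseen words report a count of 0 instead of raising an error
print(vocab['missing'])
# most_common(n) returns the n highest counts as (word, count) pairs
print(vocab.most_common(1))

# caution: passing a string counts its characters, not the word
chars = Counter()
chars.update('film')
print(len(chars))
```

The last lines are the reason update() should be given a list of tokens, as clean_doc() returns, rather than a raw string.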

Below is a function called add_doc_to_vocab() that takes a document filename and a Counter vocabulary as arguments.

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

Finally, we can take our process_docs() template above for processing all documents in a directory and update it to call add_doc_to_vocab().

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

We can put all of this together and develop a full vocabulary from all documents in the dataset.

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

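Running the example creates a vocabulary from all documents in the dataset, including positive and negative reviews.

We can see that there are a little over 46,000 unique words across all reviews and that the top 3 words are "film", "one", and "movie".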

46557
[('film', 8860), ('one', 5521), ('movie', 5440), ('like', 3553), ('even', 2555), ('good', 2320), ('time', 2283), ('story', 2118), ('films', 2102), ('would', 2042), ('much', 2024), ('also', 1965), ('characters', 1947), ('get', 1921), ('character', 1906), ('two', 1825), ('first', 1768), ('see', 1730), ('well', 1694), ('way', 1668), ('make', 1590), ('really', 1563), ('little', 1491), ('life', 1472), ('plot', 1451), ('people', 1420), ('movies', 1416), ('could', 1395), ('bad', 1374), ('scene', 1373), ('never', 1364), ('best', 1301), ('new', 1277), ('many', 1268), ('doesnt', 1267), ('man', 1266), ('scenes', 1265), ('dont', 1210), ('know', 1207), ('hes', 1150), ('great', 1141), ('another', 1111), ('love', 1089), ('action', 1078), ('go', 1075), ('us', 1065), ('director', 1056), ('something', 1048), ('end', 1047), ('still', 1038)]

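Perhaps the least common words, those that only appear once across all reviews, are not predictive. Perhaps some of the most common words are not useful either.

These are good questions and really should be tested with a specific predictive model.

Generally, words that only appear once or a few times across 2,000 reviews are probably not predictive and can be removed from the vocabulary, greatly cutting down on the tokens we need to model.

We can do this by stepping through the words and their counts and only keeping those with a count at or above a chosen threshold. Here we will use 5 occurrences.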

# keep tokens with at least 5 occurrences
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

This reduces the vocabulary from 46,557 to 14,803 words, a huge drop. Perhaps a minimum of 5 occurrences is too aggressive; you can experiment with different values.
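
One way to experiment is to wrap the filter in a small function and print the vocabulary size for several thresholds. The sketch below does this on a toy Counter with invented counts; the function name filter_vocab() is hypothetical, and with the real data you would pass in the vocab built above.

```python
from collections import Counter

# hypothetical helper: keep words seen at least min_occurrence times
def filter_vocab(vocab, min_occurrence):
    return [word for word, count in vocab.items() if count >= min_occurrence]

# toy counts, invented for illustration only
toy_vocab = Counter({'film': 9, 'movie': 5, 'plot': 4, 'hokum': 2, 'joblo': 1})

# report how many words survive each candidate threshold
for threshold in (1, 2, 5):
    kept = filter_vocab(toy_vocab, threshold)
    print(threshold, len(kept))
```

Comparing the sizes side by side makes it easier to pick a threshold, although the final choice is best judged by how a model performs with each vocabulary.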

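We can then save the chosen vocabulary of words to a new file. I like to save the vocabulary as ASCII with one word per line.

Below we define a function called save_list() that saves a list of items, in this case tokens, to file, one per line.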

def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

The complete example for defining and saving the vocabulary is listed below.

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# save list to file
def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
# keep tokens with at least 5 occurrences
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

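Running this final snippet after creating the vocabulary will save the chosen words to file.

It is a good idea to take a look at, and even study, your chosen vocabulary in order to get ideas for better preparing this data, or text data in the future.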

hasnt
updating
figuratively
symphony
civilians
might
fisherman
hokum
witch
buffoons
...

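Next, we can look at using the vocabulary to create a prepared version of the movie review dataset.

5. Save Prepared Data

We can use the data cleaning and chosen vocabulary to prepare each movie review and save the prepared versions of the reviews ready for modeling.

This is a good practice as it decouples data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.

We can start off by loading the vocabulary from "vocab.txt".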

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

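Next, we can clean the reviews, use the loaded vocabulary to filter out unwanted tokens, and save the clean reviews in a new file.

One approach could be to save all the positive reviews in one file and all the negative reviews in another file, with the filtered tokens separated by white space and each review on its own line.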
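
The target file format can be pictured with a toy example before any real reviews are involved. In the sketch below the vocabulary and the two token lists are invented; it only illustrates the filter-then-join idea described above.

```python
# a tiny stand-in vocabulary and two pretend cleaned reviews
vocab = {'film', 'plot', 'good'}
reviews = [['film', 'joblo', 'good', 'plot'], ['hokum', 'film']]

# keep only in-vocabulary tokens and join each review into one line
lines = [' '.join(w for w in tokens if w in vocab) for tokens in reviews]

# one review per line, ready to be written to a file
data = '\n'.join(lines)
print(data)
```

Because the vocabulary is a set, each membership check is fast, which matters once every token of every review is tested against it.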

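First, we can define a function to process a document, clean it, filter it, and return it as a single line that could be saved in a file. Below we define the doc_to_line() function to do just that, taking a filename and a vocabulary (as a set) as arguments.

It calls the previously defined load_doc() function to load the document and clean_doc() to tokenize the document.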

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

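Next, we can define a new version of process_docs() to step through all reviews in a folder and convert them to lines by calling doc_to_line() for each document. A list of lines is then returned.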

# load all docs in a directory
def process_docs(directory, vocab):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

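We can then call process_docs() for both the directories of positive and negative reviews, then call save_list() from the previous section to save each list of processed reviews to a file.

The complete code listing is provided below.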

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# save list to file
def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# prepare negative reviews
negative_lines = process_docs('txt_sentoken/neg', vocab)
save_list(negative_lines, 'negative.txt')
# prepare positive reviews
positive_lines = process_docs('txt_sentoken/pos', vocab)
save_list(positive_lines, 'positive.txt')

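Running the example saves two new files, "negative.txt" and "positive.txt", that contain the prepared negative and positive reviews respectively.

The data is ready for use in a bag-of-words or even a word embedding model.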
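
When it is time to model, the prepared files need to be read back in together with class labels. The sketch below is one possible way to do that, not part of the original listing: the function names load_prepared() and load_dataset() are hypothetical, the labels 0 for negative and 1 for positive are an assumption, and the demo uses two tiny stand-in files rather than the real output.

```python
import os
import tempfile

# hypothetical helper: read one prepared review per line
def load_prepared(filename):
    with open(filename, 'r') as file:
        # split on '\n' so a review that ended up empty still keeps its slot
        return file.read().split('\n')

# hypothetical helper: combine both classes into documents and labels
def load_dataset(negative_file, positive_file):
    negative = load_prepared(negative_file)
    positive = load_prepared(positive_file)
    docs = negative + positive
    # assumed labeling: 0 = negative, 1 = positive
    labels = [0] * len(negative) + [1] * len(positive)
    return docs, labels

# demo with two tiny stand-in files
demo_dir = tempfile.mkdtemp()
neg_path = os.path.join(demo_dir, 'negative.txt')
pos_path = os.path.join(demo_dir, 'positive.txt')
with open(neg_path, 'w') as file:
    file.write('\n'.join(['bad plot', 'boring film']))
with open(pos_path, 'w') as file:
    file.write('\n'.join(['great film']))

docs, labels = load_dataset(neg_path, pos_path)
print(len(docs), labels)
```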

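Extensions

This section lists some extensions that you may wish to explore.

Stemming. We could reduce each word in the documents to its stem using a stemming algorithm like the Porter stemmer.
N-Grams. Instead of working with individual words, we could work with a vocabulary of word pairs, called bigrams. We could also investigate the use of larger groups, such as triplets (trigrams) and more (n-grams).
Encode Words. Instead of saving tokens as-is, we could save the integer encoding of the words, where the index of the word in the vocabulary represents a unique integer number for the word. This will make it easier to work with the data when modeling.
Encode Documents. Instead of saving tokens in documents, we could encode the documents using a bag-of-words model and encode each word as a boolean present/absent flag or use more sophisticated scoring, such as TF-IDF.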
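
As a starting point for two of these extensions, bigrams and integer encoding can be sketched with the standard library alone. The helper names below are hypothetical, the example line is invented, and reserving index 0 (for instance for padding) is an assumption rather than a requirement.

```python
# hypothetical helper: join each pair of neighboring tokens into a bigram
def to_bigrams(tokens):
    return [tokens[i] + ' ' + tokens[i + 1] for i in range(len(tokens) - 1)]

# hypothetical helper: map each vocabulary word to a unique integer
def build_index(vocab):
    # sort for a repeatable mapping; start at 1 so that 0 stays free
    return {word: i for i, word in enumerate(sorted(vocab), start=1)}

# hypothetical helper: replace tokens by their integers, skipping unknown words
def encode(tokens, index):
    return [index[w] for w in tokens if w in index]

# one prepared review line, invented for illustration
line = 'film good plot film'
tokens = line.split()

bigrams = to_bigrams(tokens)
index = build_index(set(tokens))
encoded = encode(tokens, index)

print(bigrams)
print(index)
print(encoded)
```

The same build_index() idea works for a bigram vocabulary if the bigram strings are passed in instead of single words.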

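If you try any of these extensions, I'd love to know.
Share your results in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

APIs

Summary

In this tutorial, you discovered how to prepare movie review text data for sentiment analysis, step by step.

Specifically, you learned:

How to load text data and clean it to remove punctuation and other non-word characters.
How to develop a vocabulary, tailor it, and save it to file.
How to prepare movie reviews using cleaning and a predefined vocabulary and save them to new files ready for modeling.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.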


48 条回复对如何为情感分析（文本分类）准备电影评论数据

Alexander 2017 年 10 月 16 日下午 6:42 #

谢谢 Jason。非常有价值的工作。请告诉我，我们如何实现 N-gram 扩展？我们可以使用像 GloVe 这样的预训练模型吗？

回复
- Jason Brownlee 2017 年 10 月 17 日上午 5:40 #
  
  我希望很快在博客上有一个例子。
  
  回复
Alexander 2017 年 10 月 17 日下午 5:11 #

谢谢你。

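Lin Li October 20, 2017 at 1:20 am #

Thank you, Dr. Jason.
I used the Keras built-in function to load the IMDB dataset, that is, "(X_train, y_train), (X_test, y_test) = imdb.load_data()" after "from keras.datasets import imdb". What confuses me is the difference between the IMDB dataset I load with "imdb.load_data()" and the IMDB dataset you use in this post. The former contains 25,000 highly polar movie reviews, while the latter contains only 2,000 reviews. So what exactly is the IMDB dataset? If I want to build a deep learning model for sentiment analysis, which dataset should I use?
I look forward to your reply. Thanks!

- Jason Brownlee October 20, 2017 at 5:42 am #
  
  They are different datasets, both intended for educational purposes only, e.g. learning how to develop models.
  
  I recommend collecting data that is representative of the problem you are trying to solve.
  
  - Li Lin October 22, 2017 at 2:09 am #
    
    Thank you for your reply!
    When I load the IMDB dataset with the built-in function, it is already preprocessed and the words are represented as integer indices. Is there a way to get the raw data?
    
    - Jason Brownlee October 22, 2017 at 5:31 am #
      
      Perhaps here:
      http://ai.stanford.edu/~amaas/data/sentiment/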
      
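Alexander October 20, 2017 at 7:06 pm #

Jason, please help me.
If we develop an LSTM RNN with an Embedding layer, can the network learn the relationships between words?

- Jason Brownlee October 21, 2017 at 5:28 am #
  
  Yes. The embedding itself will learn a representation of how words are used.
  
  An LSTM can learn the importance of words in different positions, depending on the application.
  
  Do you have a specific domain in mind?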
  
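Jan Zett October 21, 2017 at 12:44 am #

Hey Jason, thanks for your great work. I really like your blog and have already learned a lot!
I'm not sure whether you have noticed, but there is a small bug in your code. It isn't important, but when you try to skip files in the directory that do not end with .txt, you use next instead of continue, which does not have the intended effect in this case.

- Jason Brownlee October 21, 2017 at 5:42 am #
  
  Thanks Jan, fixed! I guess I was thinking in Ruby or something...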
  
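Alexander October 21, 2017 at 5:52 am #

Thanks for the feedback, Jason. It is very interesting. I am trying to understand it and to apply these ideas to a different domain... Thank you for the inspiration.

- Jason Brownlee October 22, 2017 at 5:14 am #
  
  Thanks Alexander.

Debendra October 22, 2017 at 2:19 am #

Thank you very much for writing this post... It is a great help to those who are relearning data science.

- Jason Brownlee October 22, 2017 at 5:32 am #
  
  Thanks Debendra.

Vengadesan Nammalvar October 24, 2017 at 10:13 am #

You have inspired many budding machine learning professionals to achieve their career goals.

- Jason Brownlee October 24, 2017 at 3:59 pm #
  
  Glad to hear it.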
  
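Ray November 10, 2017 at 4:54 am #

Hi Jason, your work and examples are always detailed and useful. You are a teacher in a thousand. I find your examples thorough, useful, and transferable.

Please note that line 74 of the complete code is missing a colon... just something to be aware of for those who copy and paste to run it locally.

- Jason Brownlee November 10, 2017 at 10:42 am #
  
  Thanks Ray!
  
  Do you mean this line:
  
  positive_lines = process_docs(txt_sentoken/pos', vocab)
  
  If so, how is it missing a colon?
  
  - Zain December 19, 2017 at 9:28 am #
    
    Ray actually means that the opening quote is missing before txt_sentoken/pos'.
    
    - Jason Brownlee December 19, 2017 at 3:58 pm #
      
      Fixed, thanks guys!
      
    - Zain December 20, 2017 at 5:37 am #
      
      And line 46 should be
      tokens = [w for w in tokens if w not in vocab]
      
      Thank you for these great tutorials... they are really helpful!
      
      - Jason Brownlee December 20, 2017 at 5:54 am #
        
        I don't think so. We are trying to keep only the words in the document that are present in the vocabulary.

mohit January 31, 2018 at 9:25 pm #

I want this code for Python 2.7.

- Jason Brownlee February 1, 2018 at 7:21 am #
  
  Try it and see.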
  
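Neeraj joon February 1, 2018 at 7:06 pm #

Hey Jason, all I want to do is input a review and have the code return one word saying whether it is negative or positive. I have searched the whole internet and could not find it. Can this code do that with a few modifications? If not, where can I find such code?

- Jason Brownlee February 2, 2018 at 8:11 am #
  
  You can adapt it for that purpose. See this post:
  https://machinelearning.org.cn/develop-word-embedding-model-predicting-movie-review-sentiment/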
  
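HaRRy May 23, 2018 at 1:01 am #

Hi Dr. Jason... I am fairly new to data science. Currently, I am doing a project in Rapid Miner using Search Twitter and sentiment analysis... I am trying to find a way to show that Marvel movies are better than DC movies, and I am also trying to extract new attributes from the collected data. For example, what are the words (common words) used to describe the Avengers? What are the words used to describe positive, negative, and neutral? So far... I have no idea how to do this... I have collected the data using Search Twitter and sentiment analysis... but the part after that... is a mystery. Can you help me?

- Jason Brownlee May 23, 2018 at 6:28 am #
  
  Sounds like fun.
  
  Sorry, I don't have good advice on collecting Twitter data.

ADiNoS June 7, 2018 at 3:54 am #

Hey Jason Brownlee, thanks for your great work. I really appreciate it.

- Jason Brownlee June 7, 2018 at 6:34 am #
  
  Thanks.

Saji September 24, 2018 at 9:34 am #

Hello Jason, thanks for your great work. I have a question, though. I am about to create a project that involves automatic text classification of documents. My problem is that the project will have dynamically defined categories. This means that if I now have a database containing 5 categories and a new category is added, I have to add another database for that category. What I want is for my project to adapt to new categories automatically, without adding an extra database for each new category. Thanks.

- Jason Brownlee September 24, 2018 at 2:09 pm #
  
  I'm not sure off hand; this might require some very careful design.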
  
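Katherine Munro February 13, 2019 at 7:24 am #

Hi Jason,

Great work, as always. I'm surprised nobody has commented on this, but once you change your process_docs method to load pre-made documents, you lose the chance to create a new vocabulary. I think that's why the code at the end of your tutorial works for me but the vocab sizes are all 0 (unless I have some other problem).

Also, I added a get_corpus_vocab function, which is basically the earlier version of your process_docs, from when it could still be used to build a new vocabulary. Perhaps it should be added? I can add/send the full code if you like.

Katherine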

# Get entire corpus vocab
def get_corpus_vocab(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

And then all the methods are called with the following lines:

# Define vocab counter
vocab = Counter()
get_corpus_vocab('review_polarity/txt_sentoken/neg', vocab)
get_corpus_vocab('review_polarity/txt_sentoken/pos', vocab)
print("Vocab length: ", len(vocab), "and top 20 words:")
print(vocab.most_common(20))

# keep vocab tokens with at least 5 occurrences
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print("Vocab length after filtering for num occurrences: ", len(tokens))

# Save vocab
save_list(tokens, 'review_polarity/txt_sentoken/vocab2.txt')

# load vocabulary
vocab_filename = 'review_polarity/txt_sentoken/vocab2.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# prepare negative reviews
negative_lines = process_docs('review_polarity/txt_sentoken/neg', vocab)
save_list(negative_lines, 'review_polarity/negative2.txt')
# prepare positive reviews
positive_lines = process_docs('review_polarity/txt_sentoken/pos', vocab)
save_list(positive_lines, 'review_polarity/positive2.txt')

- Jason Brownlee February 13, 2019 at 8:06 am #
  
  Very cool, thanks for sharing!

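Shyju March 1, 2019 at 6:49 pm #

Thanks for this article.

What tools/methods/models can be used to infer useful information from customer reviews for an event organizer?

- Jason Brownlee March 2, 2019 at 9:30 am #
  
  Perhaps look into text mining tools to extract more than just sentiment?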
  
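Data_enthusiast April 13, 2019 at 4:25 pm #

What should we do with the test dataset? The dataset is divided into two parts, train and test. If I load the training dataset and further split it into two groups to train the model, how do I then use the test dataset? Could you explain and help?

- Jason Brownlee April 14, 2019 at 5:43 am #
  
  Do you mean in general, or in this tutorial specifically?
  
  In this tutorial, I show exactly how to load and handle the data.

Teerth August 6, 2019 at 8:44 pm #

Thanks Jason for this wonderful tutorial. I am working on a movie system based on sentiment reviews. Do you have other tutorials on training, classification (Naive Bayes), and prediction on the data?
That would be very helpful.
Thanks

- Jason Brownlee August 7, 2019 at 7:52 am #
  
  Many, perhaps start here:
  https://machinelearning.org.cn/start-here/

TEERTH GATECHA August 9, 2019 at 5:04 pm #

Sorry, could you be more specific? I am so confused. I need to build an online movie-based sentiment system, and I am stuck after data preprocessing.

- Jason Brownlee August 10, 2019 at 7:11 am #
  
  Perhaps the tutorial above can provide a good template for your project?

TEERTH GATECHA August 9, 2019 at 5:45 pm #

So now I have two files, positive.txt and negative.txt. What should I do next?

- Jason Brownlee August 10, 2019 at 7:11 am #
  
  Perhaps try loading them into memory.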
  
Wendy T Alphin January 9, 2020 at 3:39 am #

Very useful write-up. Thanks for sharing!

- Jason Brownlee January 9, 2020 at 7:31 am #
  
  Thanks!

Ibrahima November 10, 2021 at 12:32 pm #

Thank you very much, this is a great explanation.
