Word Embeddings in Language Models

By Adrian Tam on August 18, 2025 in Building Transformer Models

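Natural language processing (NLP) has long been a fundamental area of computer science. With the introduction of word embeddings, however, its trajectory changed dramatically. Before embeddings, NLP relied mainly on rule-based approaches that treated words as discrete symbols. With word embeddings, computers gained the ability to work with language through vector space representations. In this post, you will learn:

How word embeddings turn words into dense vectors
How to use pretrained word embeddings
How to train your own word embeddings
How word embeddings are used in modern language models

Let's get started!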

Word Embeddings in Language Models
Photo by Satoshi Hirayama. Some rights reserved.

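Overview

This post is divided into five parts; they are:

Understanding Word Embeddings
Using Pretrained Word Embeddings
Training Word2Vec with Gensim
Training Word2Vec with PyTorch
Embeddings in Transformer Models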

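Understanding Word Embeddings

Word embeddings represent words as dense vectors in a continuous space, where semantically similar words lie close to each other. The core principle is that words appearing in similar contexts should have similar vector representations. This idea was popularized by models such as Word2Vec, GloVe, FastText, and ELMo.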
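To make "close to each other" concrete, closeness between embedding vectors is commonly measured with cosine similarity. The following is a minimal sketch; the three-dimensional vectors in `toy_vectors` are made up for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for this illustration only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

In this toy setup, "cat" and "dog" point in nearly the same direction, so their similarity is much higher than that of "cat" and "car". Real embeddings behave the same way, only in tens or hundreds of dimensions.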

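Word embedding models are usually trained with unsupervised learning, because the ideal vector for each word is unknown (otherwise, we could simply use it). The goal is to learn the word co-occurrence patterns in the training corpus.

Word2Vec, introduced in the paper "Efficient Estimation of Word Representations in Vector Space," pioneered this approach. It uses a neural network to predict words from their local context and comes in two variants:

Continuous Bag of Words (CBOW): predicts the target word given its context
Skip-gram: predicts the context words given the target word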
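The difference between the two variants is easiest to see in the training pairs each one consumes. The sketch below builds both kinds of pairs from a single tokenized sentence; the function names `cbow_pairs` and `skipgram_pairs` and the window size of 2 are assumptions made for this illustration, not part of any library.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs: the context predicts the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        context = left + right
        if context:
            yield context, target

def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word) pairs: the target predicts each context word."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word

tokens = "the quick brown fox jumps".split()
cbow = list(cbow_pairs(tokens))
skipgram = list(skipgram_pairs(tokens))
print(cbow[2])        # the pair centered on "brown"
print(skipgram[:3])   # the first few (target, context) pairs
```

CBOW produces one example per position with the whole context as input, while skip-gram produces one example per (target, context word) combination, so the same text yields more skip-gram examples.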

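Skip-gram usually performs better on smaller datasets and rare words, while CBOW is faster and works well on larger datasets. By showing that embedding vectors can satisfy relations such as "king - man + woman ≈ queen," Word2Vec demonstrated that vectors can capture semantic relationships between words.

GloVe (Global Vectors for Word Representation) takes a different approach. Instead of training a neural network to predict words, it builds a word co-occurrence matrix and factorizes it to obtain the embeddings. GloVe combines the advantages of:

Global matrix factorization methods (such as latent semantic analysis)
Local context window methods (such as Word2Vec)

The resulting embeddings capture both semantic and syntactic relationships between words, and often outperform Word2Vec on tasks that require broader semantic understanding.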
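The co-occurrence matrix mentioned above can be sketched as a table of counts. The code below covers only the counting step, with a hypothetical helper `cooccurrence_counts` and an assumed window of 2; GloVe itself goes further, weighting pairs by distance and fitting vectors to the logarithm of these counts, which is not shown here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over the lazy fox",
]
counts = cooccurrence_counts(corpus)
print(counts[("quick", "brown")])
print(counts[("lazy", "the")])
```

Because the window is symmetric, the count for a pair equals the count for the reversed pair. Storing counts in a `Counter` keyed by word pairs is a sparse representation; a dense matrix over a 400,000-word vocabulary would be far too large.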

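FastText improves on Word2Vec by learning vectors for character n-grams rather than only for whole words. This approach captures subword information, addresses the out-of-vocabulary (OOV) problem, and performs better on morphologically rich languages.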
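As a rough illustration of the subword idea, the sketch below extracts character n-grams from a word wrapped in boundary markers. The helper name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters is an assumption; the actual FastText library also hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"   # "<" and ">" mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# Two related words share some n-grams, so they share part of their representation
shared = set(char_ngrams("running")) & set(char_ngrams("runner"))
print(sorted(shared))
```

A word that never appeared in training can still get a vector by combining the vectors of its n-grams, which is how the subword approach sidesteps the OOV problem.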

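ELMo is a more recent model that uses a deep bidirectional LSTM to produce context-dependent word vectors. Unlike the earlier models, ELMo's vector for a word is not fixed but varies with the surrounding context. Although ELMo is used less today now that large language models are available, its core idea, that a word's meaning should depend on context, underlies modern language models.

Using Pretrained Word Embeddings

You can easily use pretrained word embeddings from popular libraries. Here is an example using the `gensim` library with GloVe embeddings: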

from gensim.models import KeyedVectors

# Load pretrained GloVe embeddings
model = KeyedVectors.load_word2vec_format('glove.6B.50d.txt', binary=False, no_header=True)

# Find similar words
similar_words = model.most_similar('king')
print(similar_words)
print()

# Word analogies
result = model.most_similar(positive=['king', 'woman'], negative=['man'])
print(result)

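To run this code, you need to download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ and extract the file `glove.6B.50d.txt` from the zip archive `glove.6B.zip`. The file contains 50-dimensional vectors for 400,000 words, trained on a corpus of 6 billion tokens. Because GloVe files lack the header line of the word2vec text format, the code passes `no_header=True`.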

When you run this code, you will see output like the following:

[('prince', 0.8236179351806641), ('queen', 0.7839043140411377), ('ii', 0.7746230363845825),
('emperor', 0.7736247777938843), ('son', 0.766719400882721), ('uncle', 0.7627150416374207),
('kingdom', 0.7542160749435425), ('throne', 0.7539913654327393), ('brother', 0.7492411136627197),
('ruler', 0.7434253692626953)]

[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172),
('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904), ('princess', 0.7424570322036743),
('kingdom', 0.7337412238121033), ('monarch', 0.721449077129364), ('eldest', 0.7184861898422241),
('widow', 0.7099431157112122)]

The first output shows that, under this embedding model, "king" is most similar to "prince." The second output shows that "queen" is the word closest to "king + woman - man."
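The analogy query boils down to vector arithmetic followed by a nearest-neighbor search. The sketch below reproduces that idea on made-up two-dimensional vectors; the `analogy` helper and the values in `toy` are assumptions for illustration, and this is not a description of how gensim computes `most_similar` internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), skipping the query words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Hypothetical 2-D vectors: first axis roughly "royalty", second axis roughly "gender"
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.05],
}
best = analogy(toy, positive=["king", "woman"], negative=["man"])
print(best)
```

Subtracting "man" and adding "woman" flips the second axis while leaving the first one untouched, so the query lands on the vector for "queen". Trained embeddings are never this clean, which is why the real output above lists "queen" with a similarity of about 0.85 rather than 1.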

Training Word2Vec with Gensim

Gensim provides a simple interface for training your own Word2Vec model. Here is how:

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Prepare your text data
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over the lazy fox",
    # ... more sentences
]

# Preprocess the sentences
tokenized_sentences = [simple_preprocess(sentence) for sentence in sentences]

# Train the model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,  # dimension of the word vectors
    window=5,         # context window size
    min_count=1,      # ignore words with frequency < min_count
    workers=4,        # number of CPU cores to use
    sg=0              # 0 for CBOW, 1 for Skip-gram
)

# Save the model
model.save("word2vec.model")

# Use the model
model = Word2Vec.load("word2vec.model")
vector = model.wv['quick']  # get the vector for a word
similar_words = model.wv.most_similar('quick')
print(similar_words)

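Running this code as-is will not give you a good model. To get useful embeddings, you need a large corpus to train on. Rather than extending the Python list `sentences`, you will probably want to rewrite the code to read text from files on disk.

Assuming you have done that, gensim trains a Word2Vec model and saves it to the file `word2vec.model`. Once training is finished, you can load it back and use it to get the vector for a word, as shown at the end of the code above.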
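One way to read the corpus from disk, as suggested above, is a small re-iterable class that yields one tokenized line at a time. This is a sketch under assumptions: the class name `CorpusReader` is hypothetical, the corpus is assumed to hold one sentence per line, and plain lowercasing and whitespace splitting stand in for `simple_preprocess`.

```python
import os
import tempfile

class CorpusReader:
    """Re-iterable stream of token lists, one per non-empty line of a text file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be scanned repeatedly
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:
                    yield tokens

# Demo with a temporary file standing in for a real corpus
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The quick brown fox\n\nA lazy dog\n")
    corpus_path = tmp.name

sentences = CorpusReader(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)   # a second iteration reads the file again
os.remove(corpus_path)
print(first_pass)
```

A class with `__iter__` is used instead of a plain generator because training scans the corpus more than once (to build the vocabulary and then for each epoch), and a generator would be exhausted after the first pass. An object like this could be passed as the `sentences` argument of `Word2Vec`. Gensim also provides a `LineSentence` class in `gensim.models.word2vec` for a similar purpose; check its documentation before relying on it.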

Training Word2Vec with PyTorch

You can also implement Word2Vec from scratch in PyTorch. Here is a basic implementation:

import torch
import torch.nn as nn
import torch.optim as optim

class Word2VecModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        
    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        out = self.linear(embeds)
        return out

# Prepare your text data
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over the lazy fox",
    # ... more sentences
]

# Create a dataset for training
skipgram_size = 2
dataset = []
vocab = set()
for sentence in sentences:
    tokens = sentence.split()
    vocab.update(tokens)
    for i in range(len(tokens)):
        context = tokens[max(0, i-skipgram_size):i] + tokens[i+1:i+skipgram_size+1]
        target = tokens[i]
        dataset.append((context, target))

vocab_to_idx = {word: idx for idx, word in enumerate(sorted(vocab))}
vocab_size = len(vocab)

# Training setup
embedding_dim = 50
model = Word2VecModel(vocab_size, embedding_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
num_epochs = 10

# Training loop
for epoch in range(num_epochs):
    for context, target in dataset:
        context_idx = [vocab_to_idx[word] for word in context]
        target_idx = [vocab_to_idx[target]] * len(context)
        optimizer.zero_grad()
        output = model(torch.tensor(target_idx))
        loss = criterion(output, torch.tensor(context_idx))
        loss.backward()
        optimizer.step()

# Save the model
torch.save(model.state_dict(), "word2vec.pt")

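This code trains a Word2Vec skip-gram model. The training data are windows of words from the text corpus. You should apply some preprocessing to clean up the vocabulary, for example, removing punctuation and converting all words to lowercase. Note how the variables `context` and `target` are used. For a window such as "the quick brown fox jumps" in the example above, the model takes the center word ("brown") as input and is asked to predict each of the other words in the same window. The loss function for training is the cross-entropy loss.

This example will probably not give you a good model, since you need a much larger corpus and more training epochs. Note, however, that the model has an embedding layer and a linear layer. The embedding layer, created with `nn.Embedding`, holds the word embedding matrix you are interested in.

Also note that the embedding layer is just a matrix of numbers. You need a lookup table, such as `vocab_to_idx` in the code above, to convert a word to an index, and then use that index to fetch the embedding vector. The lookup table should be saved together with the model, because you cannot use the embeddings if you cannot map words to the correct indices.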
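Saving the lookup table can be as simple as writing it to a JSON file next to the model weights. The sketch below assumes a toy `vocab_to_idx` and a hypothetical file name `word2vec_vocab.json`; a temporary directory is used only so that the example leaves the working directory untouched.

```python
import json
import os
import tempfile

vocab_to_idx = {"brown": 0, "dog": 1, "fox": 2, "quick": 3}   # toy mapping

# Save the lookup table alongside the model weights (file name is hypothetical)
vocab_path = os.path.join(tempfile.mkdtemp(), "word2vec_vocab.json")
with open(vocab_path, "w", encoding="utf-8") as f:
    json.dump(vocab_to_idx, f)

# Later: restore the mapping and build the reverse table (index -> word)
with open(vocab_path, encoding="utf-8") as f:
    restored = json.load(f)
idx_to_vocab = {idx: word for word, idx in restored.items()}

print(restored["fox"], idx_to_vocab[3])
```

With the restored mapping and the model from the listing above reloaded, the vector for a word would be the row `model.embeddings.weight[restored["fox"]]`.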

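Embeddings in Transformer Models

From the examples above, you learned that word embeddings can be trained and that you can create an `nn.Embedding` layer for this purpose. In fact, most modern language models use this approach. Let's take the BERT model as an example.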

from transformers import BertModel, BertConfig

config = BertConfig()
model = BertModel(config=config)
print(model)
print(model.embeddings.word_embeddings.state_dict())

When you run this code, you will see:

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
OrderedDict({'weight': tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0373, -0.0254, -0.0057,  ...,  0.0262, -0.0122,  0.0050],
        [-0.0222,  0.0076,  0.0077,  ...,  0.0085,  0.0052,  0.0209],
        ...,
        [-0.0253, -0.0047,  0.0141,  ..., -0.0262, -0.0303, -0.0488],
        [-0.0029, -0.0301, -0.0286,  ..., -0.0130, -0.0312, -0.0125],
        [ 0.0507, -0.0257, -0.0376,  ...,  0.0087, -0.0076,  0.0027]])})

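The BERT model is complex and has many components. The word embedding layer is named `word_embeddings`. After creating the model, you can refer to it as `model.embeddings.word_embeddings`. Its parameters show a vocabulary of 30,522 tokens, each with a 768-dimensional vector. The second print statement dumps the embedding matrix, which you should expect to have the shape `(30522, 768)`. Because the model is built from a `BertConfig` rather than loaded from a checkpoint, these weights are randomly initialized, not trained.

In the previous post, you learned that a language model needs a tokenizer to split the input text into tokens. The tokenizer also assigns a token ID to each token. This token ID is the row index into the embedding matrix. When you feed input text to this model, you feed it as a sequence of token IDs. The embedding layer is usually the first layer of the model. It converts the sequence of token IDs into a sequence of embedding vectors by replacing each token ID with the corresponding row of the embedding matrix.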
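The row lookup described above can be sketched in plain Python. The matrix values and token IDs below are made up; an `nn.Embedding` layer performs the same indexing on tensors, with the matrix as a trainable parameter.

```python
# Toy embedding matrix: 5 token IDs, 3 dimensions each (values are made up)
embedding_matrix = [
    [0.0, 0.0, 0.0],    # ID 0, e.g. a padding token
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def embed(token_ids, matrix):
    """Replace each token ID with the matrix row at that index."""
    return [matrix[token_id] for token_id in token_ids]

token_ids = [3, 1, 1, 4]          # hypothetical output of a tokenizer
sequence = embed(token_ids, embedding_matrix)
print(sequence)
```

A sequence of 4 token IDs becomes a sequence of 4 vectors, and a repeated ID (here, 1) always maps to the same row. This is why these embeddings are context-independent on their own; the later layers of the model are what make the representation depend on context.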

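Summary

In this post, you learned about word embeddings and their applications. In particular, you learned that:

Word embeddings represent words as dense vectors in a continuous space, with semantically similar words close together.
Pretrained word embeddings are readily available through popular libraries.
You can train custom word embeddings with Gensim or PyTorch.
Modern transformer models use learned embeddings through `nn.Embedding` layers.
Embeddings are essential for capturing semantic relationships between words.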
