Applications with Context Vectors

By Muhammad Asad Iqbal Khan on May 15, 2025 in Hugging Face Transformers

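Context vectors are a powerful tool for advanced NLP tasks. They let you capture the contextual meaning of words, such as identifying the correct sense of a word in a sentence when the word has multiple meanings. In this post, we will explore some example applications of context vectors. Specifically: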

You will learn how to extract contextual keywords from a document
You will learn how to generate a document summary using context vectors

Kick-start your project with my book NLP with Hugging Face Transformers. It provides self-study tutorials with working code.

Let's get started.

Applications with Context Vectors
Photo by Erik Karits. Some rights reserved.

Overview

This post is divided into two parts:

Contextual Keyword Extraction
Contextual Text Summarization

Contextual Keyword Extraction

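Contextual keyword extraction is a technique for identifying the most important words in a document based on their contextual relevance. Imagine you have a document and want to highlight its most representative words. One approach is to find the words whose meaning is most similar to that of the document as a whole. This technique is useful for a variety of NLP tasks, such as information retrieval, document clustering, and text summarization.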

Let's implement a simple contextual keyword extraction system by comparing each word in the document with the document as a whole:

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

def get_context_vectors(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=True)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Get the tokens (for reference)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

    # Forward pass, get all hidden states from each layer
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, output_hidden_states=True)
    hidden_states = outputs.hidden_states

    # Each element in hidden states has shape (batch_size, sequence_length, hidden_size)
    # Here takes the first element in the batch from the last layer
    last_layer_vectors = hidden_states[-1][0].numpy()  # Shape: (sequence_length, hidden_size)

    return tokens, last_layer_vectors

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def extract_contextual_keywords(document, model, tokenizer, top_n=5):
    """extract contextual keywords from a document"""
    # Split the document into sentences (simple split by period)
    sentences = [s.strip() for s in document.split(".") if s.strip()]

    # Process each sentence to get context vectors
    all_tokens = []
    all_vectors = []
    for sentence in sentences:
        if not sentence:
            continue   # Skip empty sentences

        # Get context vectors
        tokens, vectors = get_context_vectors(sentence, model, tokenizer)

        # Store tokens and vectors (excluding special tokens [CLS] and [SEP])
        all_tokens.extend(tokens[1:-1])
        all_vectors.extend(vectors[1:-1])

    # Convert to numpy arrays, then calculate the document vector as average of all token vectors
    all_vectors = np.array(all_vectors)
    doc_vector = np.mean(all_vectors, axis=0)

    # Calculate similarity between each token vector and the document vector
    similarities = []
    for token, vec in zip(all_tokens, all_vectors):
        # Skip special tokens, punctuation, and common words
        if token in ["[CLS]", "[SEP]", ".", ",", "!", "?", "the", "a", "an", "is", "are", "was", "were"]:
            continue
        # compute similarity, then remember it with the token
        sim = cosine_similarity(vec, doc_vector)
        similarities.append((sim, token))

    # Sort the similarity and get the top N
    top_similarities = sorted(similarities, reverse=True)[:top_n]
    return top_similarities

# Example document
document = """
Artificial intelligence is transforming industries around the world.
Machine learning algorithms can analyze vast amounts of data to identify patterns and make predictions.
Natural language processing enables computers to understand and generate human language.
Computer vision systems can recognize objects and interpret visual information.
These technologies are driving innovation in healthcare, finance, transportation, and many other sectors.
"""

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Extract contextual keywords and print the result
top_keywords = extract_contextual_keywords(document, model, tokenizer, top_n=10)
print("Top contextual keywords:")
for similarity, token in top_keywords:
    print(f"{token}: {similarity:.4f}")

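In this example, the BERT model generates a context vector for each token in the document. The document vector is the average of all token vectors. Alternatively, you could obtain a document vector by feeding the entire document to the model and extracting the vector of the `[CLS]` prefix token. That approach is not used here because the input document may be too long for the model to process in one pass. Instead, the document is split into sentences, and each sentence is processed separately.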

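With a vector for each word and a vector for the document, you can compute the cosine similarity between each word and the document. The `extract_contextual_keywords()` function returns the N words with the highest similarity scores, which are then printed.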

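Cosine similarity measures how closely two vectors point in the same direction. Here, a word vector that lies close to the document vector is taken to represent the document well. This works because the word vectors are context-aware, having been generated by a transformer model. Unlike traditional keyword extraction methods that rely on frequency (such as TF-IDF) or predefined rules (such as RAKE), this approach draws on the semantic understanding captured by the transformer model.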
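To make the similarity score concrete, here is a minimal sketch that applies the same formula to hand-made vectors. The three-dimensional vectors and their names are hypothetical stand-ins for real BERT outputs, and the function is written in plain Python (rather than NumPy) only so it runs without extra packages.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

# Hypothetical toy vectors, not real model outputs
doc_vector = [1.0, 2.0, 0.0]
token_close = [2.0, 4.0, 0.0]   # same direction as the document vector
token_far = [0.0, 0.0, 3.0]     # orthogonal to the document vector

sim_close = cosine_similarity(token_close, doc_vector)
sim_far = cosine_similarity(token_far, doc_vector)
print(f"close: {sim_close:.4f}")
print(f"far:   {sim_far:.4f}")
```

The vector that points in the same direction as the document vector scores close to 1 regardless of its length, while the orthogonal one scores 0. The score depends only on direction, not magnitude.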

Running the keyword extraction script above prints:

Top contextual keywords:
to: 0.7961
can: 0.7909
can: 0.7804
of: 0.7551
human: 0.7365
analyze: 0.7354
enables: 0.7345
computers: 0.7310
in: 0.7282
systems: 0.7153

To improve the results, you could consider implementing stop word removal to exclude common words such as "to" from the output.
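One way to do this is to post-process the `(similarity, token)` pairs that `extract_contextual_keywords()` returns. The sketch below is an illustration rather than part of the original script: the `filter_keywords()` helper and the short `STOP_WORDS` set are assumptions, and a real application would likely use a fuller stop word list. It also drops WordPiece fragments (tokens starting with `##`) and repeated tokens.

```python
# Illustrative stop word list (an assumption); extend it or swap in a fuller one
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were",
              "to", "of", "in", "can", "and", "by"}

def filter_keywords(scored_tokens, stop_words=STOP_WORDS, top_n=5):
    """Drop stop words, subword fragments, and repeats from (score, token) pairs."""
    seen = set()
    kept = []
    for score, token in sorted(scored_tokens, reverse=True):
        # Skip stop words, WordPiece continuation pieces, and punctuation
        if token in stop_words or token.startswith("##") or not token.isalpha():
            continue
        # Keep only the highest-scoring occurrence of each token
        if token in seen:
            continue
        seen.add(token)
        kept.append((score, token))
    return kept[:top_n]

# The scores printed by the example above
scored = [(0.7961, "to"), (0.7909, "can"), (0.7804, "can"), (0.7551, "of"),
          (0.7365, "human"), (0.7354, "analyze"), (0.7345, "enables"),
          (0.7310, "computers"), (0.7282, "in"), (0.7153, "systems")]

for score, token in filter_keywords(scored):
    print(f"{token}: {score:.4f}")
```

Applied to the ten results shown above, only "human", "analyze", "enables", "computers", and "systems" remain. In practice you would filter the full list of scored tokens before taking the top N, so that removed stop words do not use up slots.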

Contextual Text Summarization

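Document summarization can be done in several ways. One of the most common is to select the most representative sentences from the document, an approach known as extractive summarization.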

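One way to perform extractive summarization is to generate a vector for each sentence and a vector for the whole document, then select the sentences most similar to the document. With context vectors, this is easy to implement. Let's try it: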

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def get_sentence_embedding(sentence, model, tokenizer):
    """Sentence embedding extracted from the [CLS] prefix token"""
    # Tokenize the input
    inputs = tokenizer(sentence, return_tensors="pt",
                       add_special_tokens=True, truncation=True, max_length=512)

    # Forward pass, get hidden states
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the [CLS] token embedding at position 0 from the last layer
    cls_embedding = outputs.last_hidden_state[0, 0].numpy()
    return cls_embedding

def extractive_summarize(document, model, tokenizer, num_sentences=3):
    # Split the document into sentences
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    if len(sentences) <= num_sentences:
        return document

    # Get embeddings for all sentences
    sentence_embeddings = []
    for sentence in sentences:
        embedding = get_sentence_embedding(sentence, model, tokenizer)
        sentence_embeddings.append(embedding)

    # Calculate the document embedding (average of all sentence embeddings)
    # then find the most similar sentences
    document_embedding = np.mean(sentence_embeddings, axis=0)
    similarities = []
    for idx, embedding in enumerate(sentence_embeddings):
        sim = cosine_similarity(embedding, document_embedding)
        similarities.append((sim, idx))
    top_sentences = sorted(similarities, reverse=True)[:num_sentences]

    # Extract the sentences, preserve the original order
    top_indices = sorted([x[1] for x in top_sentences])
    summary_sentences = [sentences[i] for i in top_indices]

    # Join the sentences to form the summary
    summary = ". ".join(summary_sentences) + "."
    return summary

# Example document
document = """
Transformer models have revolutionized natural language processing by
introducing mechanisms that can effectively capture contextual relationships in
text. One of the most powerful aspects of transformers is their ability to
generate context-aware vector representations, often referred to as context
vectors. Unlike traditional word embeddings that assign a fixed vector to each
word regardless of context, transformer models generate dynamic representations
that depend on the surrounding words. This allows them to capture the nuanced
meanings of words in different contexts. For example, in the sentences "I'm
going to the bank to deposit money" and "I'm going to sit by the river bank,"
the word "bank" has different meanings. A traditional word embedding would
assign the same vector to "bank" in both sentences, but a transformer model
generates different context vectors that capture the distinct meanings based on
the surrounding words. This contextual understanding enables transformers to
excel at a wide range of NLP tasks, from question answering and sentiment
analysis to machine translation and text summarization.
"""

# Generate a summary
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
summary = extractive_summarize(document, model, tokenizer, num_sentences=3)

# Print the original document and the summary
print("Original Document:")
print(document)
print("Summary:")
print(summary)

If you run this code, you will get:

Original Document:

Transformer models have revolutionized natural language processing by
introducing mechanisms that can effectively capture contextual relationships in
text. One of the most powerful aspects of transformers is their ability to
generate context-aware vector representations, often referred to as context
vectors. Unlike traditional word embeddings that assign a fixed vector to each
word regardless of context, transformer models generate dynamic representations
that depend on the surrounding words. This allows them to capture the nuanced
meanings of words in different contexts. For example, in the sentences "I'm
going to the bank to deposit money" and "I'm going to sit by the river bank,"
the word "bank" has different meanings. A traditional word embedding would
assign the same vector to "bank" in both sentences, but a transformer model
generates different context vectors that capture the distinct meanings based on
the surrounding words. This contextual understanding enables transformers to
excel at a wide range of NLP tasks, from question answering and sentiment
analysis to machine translation and text summarization.

Summary:
One of the most powerful aspects of transformers is their ability to
generate context-aware vector representations, often referred to as context
vectors. Unlike traditional word embeddings that assign a fixed vector to each
word regardless of context, transformer models generate dynamic representations
that depend on the surrounding words. A traditional word embedding would
assign the same vector to "bank" in both sentences, but a transformer model
generates different context vectors that capture the distinct meanings based on
the surrounding words.

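In this example, the `get_sentence_embedding()` function generates an embedding for a whole sentence by taking the embedding of the `[CLS]` token from the last layer of the transformer. The `[CLS]` token is a special token prepended to the sentence, and the transformer is trained to produce at that position an embedding that represents the entire input.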

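In the `extractive_summarize()` function, you generate a sentence embedding for each sentence in the document and compute the document embedding as the average of all sentence embeddings. You then compute the cosine similarity between the document embedding and each sentence embedding and select the N sentences with the highest similarity scores.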

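The summary is formed by joining these top N sentences in the order in which they appear in the document. This assumes that the sentences most semantically similar to the document as a whole are the ones that best represent it.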
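The selection step can be examined in isolation, without loading a model. The following is a minimal sketch using made-up two-dimensional "sentence embeddings". The `select_top_sentences()` helper and the toy vectors are hypothetical, and the code uses plain Python so it runs without extra packages. It mirrors the logic described above: average the embeddings, rank by cosine similarity, then restore document order.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

def select_top_sentences(sentence_embeddings, num_sentences):
    """Return indices of the sentences closest to the mean embedding, in document order."""
    count = len(sentence_embeddings)
    dim = len(sentence_embeddings[0])
    # Document embedding: element-wise average of all sentence embeddings
    document_embedding = [
        sum(vec[i] for vec in sentence_embeddings) / count for i in range(dim)
    ]
    # Score every sentence against the document embedding
    scored = [
        (cosine_similarity(vec, document_embedding), idx)
        for idx, vec in enumerate(sentence_embeddings)
    ]
    # Keep the best-scoring sentences, then sort their indices to preserve order
    top = sorted(scored, reverse=True)[:num_sentences]
    return sorted(idx for _, idx in top)

# Hypothetical toy embeddings, one per sentence
toy_embeddings = [
    [1.0, 0.0],    # sentence 0
    [1.0, 1.0],    # sentence 1
    [0.0, 1.0],    # sentence 2
    [-1.0, 0.5],   # sentence 3
]
print(select_top_sentences(toy_embeddings, num_sentences=2))  # prints [1, 2]
```

Sentence 2 scores highest and sentence 1 second, but the function returns `[1, 2]` because the indices are sorted back into document order before the summary is assembled.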