Text embeddings have revolutionized natural language processing by providing dense vector representations that capture semantic meaning. In the previous tutorial, you learned how to generate these embeddings with a transformer model. In this post, you will learn about advanced applications of text embeddings that go beyond basic tasks such as semantic search and document clustering.
Specifically, you will learn:
- How to build a recommendation system using text embeddings
- How to implement cross-lingual applications with multilingual embeddings
- How to create a text classification system with embedding-based features
- How to develop zero-shot learning applications
- How to visualize and analyze text embeddings
Kick-start your project with my book NLP in Hugging Face Transformers. It provides self-study tutorials with working code.
Let's get started.

Applications of text embeddings
Photo by Christina Winter. Some rights reserved.
Overview

This post is divided into five parts; they are:
- Recommendation Systems
- Cross-Lingual Applications
- Text Classification
- Zero-Shot Classification
- Visualizing Text Embeddings
Recommendation Systems
A simple recommendation system can be built by finding the few items most similar to a target item. In an NLP setting, for example, when a user reads an article, you can surface a few similar articles as "you may also like" suggestions.
There are many ways to implement this, but the simplest is to check how similar two articles are. You can convert all articles into contextual embeddings; the two articles whose embeddings have the highest similarity are similar in content. This may not be exactly the kind of recommendation you expect, but it is often useful and a good starting point.
Let's implement it as follows:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Define the article corpus (titles and contents)
articles = [
    {
        "title": "Understanding Deep Learning",
        "content": ("Deep learning is a subset of machine learning where artificial neural "
                    "networks, algorithms inspired by the human brain, learn from large "
                    "amounts of data.")
    },
    {
        "title": "Introduction to Natural Language Processing",
        "content": ("Natural Language Processing (NLP) is a field of artificial intelligence "
                    "that enables machines to read, understand, and derive meaning from "
                    "human languages.")
    },
    {
        "title": "The Future of Computer Vision",
        "content": ("Computer vision is an interdisciplinary field that deals with how "
                    "computers can gain high-level understanding from digital images or videos.")
    },
    {
        "title": "Reinforcement Learning Explained",
        "content": ("Reinforcement learning is an area of machine learning concerned with how "
                    "software agents ought to take actions in an environment in order to "
                    "maximize some notion of cumulative reward.")
    },
    {
        "title": "Neural Networks and Their Applications",
        "content": ("Neural networks are a set of algorithms, modeled loosely after the human "
                    "brain, that are designed to recognize patterns in data.")
    }
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def create_article_embeddings(articles, model):
    """Create embeddings for the articles"""
    texts = [f"{article['title']}. {article['content']}" for article in articles]
    embeddings = model.encode(texts)
    return embeddings

def get_recommendations(article_id, articles, embeddings, top_n=2):
    """Get recommendations for a given article ID based on cosine similarity"""
    similarities = cosine_similarity([embeddings[article_id]], embeddings)[0]
    similar_indices = np.argsort(similarities)[::-1][1:top_n+1]
    return [articles[idx] for idx in similar_indices]

# Create embeddings for all articles and get recommendations for the first one
embeddings = create_article_embeddings(articles, model)
recommendations = get_recommendations(0, articles, embeddings)

# Print the recommendations
print(f'Recommendations for "{articles[0]["title"]}":')
for i, rec in enumerate(recommendations):
    print(f"{i+1}. {rec['title']}")
```
You set up the corpus at the beginning of the code because this is a toy example. In a real application, you would probably retrieve the corpus from a database or file system.
In this program, you used the all-MiniLM-L6-v2 model, instantiated with SentenceTransformer. This is a pre-trained model that can encode text into contextual embeddings. You took all the articles defined in the corpus and converted each one into a contextual embedding in the create_article_embeddings() function. The output is a vector of vectors, i.e., a matrix. In this particular implementation, there are 5 items in the corpus and each embedding vector has 384 dimensions, so the output embeddings is a matrix of shape (5, 384).
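If you want to confirm these numbers yourself, a quick check after running the listing above:

```python
# After running the listing above, verify the shape of the embedding matrix
print(embeddings.shape)    # (5, 384): 5 articles, 384 dimensions each
print(type(embeddings))    # <class 'numpy.ndarray'>
```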
In get_recommendations(), you computed the cosine similarity between one embedding and all the others. The cosine_similarity() function from scikit-learn expects two lists of vectors and returns a matrix of the similarity of each pair. Because you are comparing one item against all the rest, the output matrix has only one row. Then, with np.argsort(similarities), you obtained the indices of the similarity scores in ascending order. Since cosine similarity is 1 when the vectors are identical and 0 when they are orthogonal (i.e., completely different), you reverse the result to sort the similarity scores in descending order. The most similar items are then at the beginning of this list, except for the first one, which is the article itself.
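To make the sorting step concrete, here is a tiny standalone demonstration (with made-up similarity scores) of how np.argsort() combined with [::-1] yields indices in descending order of similarity:

```python
import numpy as np

similarities = np.array([1.0, 0.31, 0.72, 0.05, 0.54])  # index 0 is the article itself
ascending = np.argsort(similarities)   # [3, 1, 4, 2, 0]: indices from least to most similar
descending = ascending[::-1]           # [0, 2, 4, 1, 3]: most similar first
top_2 = descending[1:3]                # skip index 0, which is the article itself
print(top_2)                           # [2 4]
```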
Once you have the indices of the most similar items, you print the recommendations with a for loop.
When you run this code, you will get:
```
Recommendations for "Understanding Deep Learning":
1. Neural Networks and Their Applications
2. Reinforcement Learning Explained
```
These recommendations are based on semantic similarity rather than mere keyword matching, so even if an article does not contain the exact phrase "deep learning," you will still get articles about neural networks or machine learning. This approach can be extended to more sophisticated recommendation systems by incorporating user preferences, collaborative filtering, or hybrid approaches.
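As a minimal sketch of one such extension, you could represent a user by the average embedding of the articles they have read and rank articles against that profile instead of a single article. The read_indices list below is hypothetical, and the sketch reuses the articles, embeddings, and imports from the listing above:

```python
# Hypothetical reading history: indices of articles the user has already read
read_indices = [0, 4]

# A simple user profile: the mean of the embeddings of the articles read
user_profile = np.mean(embeddings[read_indices], axis=0)

# Rank all articles against the profile, skipping the ones already read
similarities = cosine_similarity([user_profile], embeddings)[0]
ranked = [idx for idx in np.argsort(similarities)[::-1] if idx not in read_indices]
print("You may also like:", articles[ranked[0]]["title"])
```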
Cross-Lingual Applications
A powerful feature of modern transformer models is their ability to generate embeddings for text in multiple languages. This enables cross-lingual applications in which you can compare or process text across different languages.
Let's implement a simple cross-lingual semantic search system:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    {
        "language": "English",
        "text": ("Machine learning is a field of study that gives computers the ability "
                 "to learn without being explicitly programmed.")
    },
    {
        "language": "Spanish",
        "text": ("El aprendizaje automático es un campo de estudio que da a las computadoras la "
                 "capacidad de aprender sin ser programadas explícitamente.")
    },
    {
        "language": "French",
        "text": ("L'apprentissage automatique est un domaine d'étude qui donne aux ordinateurs "
                 "la capacité d'apprendre sans être explicitement programmés.")
    },
    {
        "language": "German",
        "text": ("Maschinelles Lernen ist ein Studienbereich, der Computern die Fähigkeit gibt, "
                 "zu lernen, ohne explizit programmiert zu werden.")
    },
    {
        "language": "Italian",
        "text": ("Il machine learning è un campo di studio che conferisce ai computer la capacità "
                 "di apprendere senza essere esplicitamente programmati.")
    },
    {
        "language": "English",
        "text": ("Natural language processing is a subfield of linguistics, computer science, "
                 "and artificial intelligence.")
    },
    {
        "language": "English",
        "text": ("Computer vision is an interdisciplinary field that deals with how computers can "
                 "gain high-level understanding from digital images or videos.")
    }
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Generate embeddings for the corpus
texts = [doc["text"] for doc in corpus]
embeddings = model.encode(texts)

# Define an English query and generate its embedding
query = "What is machine learning?"
query_embedding = model.encode(query)

# Sort the corpus embeddings by descending similarity to the query
similarities = cosine_similarity([query_embedding], embeddings)[0]
ranked_indices = np.argsort(similarities)[::-1]

# Print the ranked results
print(f"Query: {query}\n")
for i, idx in enumerate(ranked_indices[:3]):  # show the top 3 results
    print(f"{i+1}. [{corpus[idx]['language']}] {corpus[idx]['text']} "
          f"(Similarity: {similarities[idx]:.4f})")
```
In this example, we used a multilingual Sentence Transformer model (paraphrase-multilingual-MiniLM-L12-v2) to create embeddings for documents in different languages. The corpus covers several languages and several topics. The program above implements a question-answering system, but the answer to a question may be found in a different language.
This example is very similar to the one in the previous section. The corpus is first converted into embeddings. The query, in embedding form, is then compared against the corpus using cosine similarity, and the top 3 results are printed. Running this code gives you:
```
Query: What is machine learning?

1. [Italian] Il machine learning è un campo di studio che conferisce ai computer la capacità di apprendere senza essere esplicitamente programmati. (Similarity: 0.8129)
2. [English] Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. (Similarity: 0.7788)
3. [French] L'apprentissage automatique est un domaine d'étude qui donne aux ordinateurs la capacité d'apprendre sans être explicitement programmés. (Similarity: 0.7470)
```
The top answer is in Italian, while the question "What is machine learning?" is in English. This works because the embedding vectors represent the semantic meaning of the text, regardless of its language. This cross-lingual capability is particularly useful for applications such as multilingual search engines.
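Because the query and the corpus live in the same embedding space, nothing in the code above is specific to English. As a quick sketch, reusing the model and corpus embeddings from the listing, you could issue the same query in German (the exact similarity scores are not claimed here):

```python
# The same search with a German query; corpus embeddings are reused
query_de = "Was ist maschinelles Lernen?"
query_de_embedding = model.encode(query_de)

similarities_de = cosine_similarity([query_de_embedding], embeddings)[0]
best = int(np.argmax(similarities_de))
print(f"[{corpus[best]['language']}] {corpus[best]['text']} "
      f"(Similarity: {similarities_de[best]:.4f})")
```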
Text Classification
Imagine you have a large collection of text data that grows every day, perhaps because you are collecting new articles or emails. You want to sort the items into different categories. This can be done using text embeddings.
This task resembles "topic modeling." Topic modeling is an unsupervised learning task that groups text documents into different topics, using algorithms such as Latent Dirichlet Allocation (LDA) to find the signature keywords of each class. Here, in contrast, is a supervised approach: you have a set of predefined categories and some examples (perhaps classified manually). You then add new text to the collection and classify it automatically.
Text embeddings can help by distilling the semantic meaning of the text into a vector. You can then train a machine learning model to classify these vectors into the different categories. This works well because the vectors represent the meaning of the text rather than the text itself, making it better than using bag-of-words or TF-IDF features.
There are many ways to implement a machine learning classifier. A simple one is logistic regression from scikit-learn. Let's implement this in code:
```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

articles = [
    # Business articles
    {"text": "The stock market reached a new high today, with technology stocks leading the gains.", "category": "Business"},
    {"text": "The government announced a new tax policy that will affect small businesses.", "category": "Business"},
    {"text": "The central bank has decided to keep interest rates unchanged.", "category": "Business"},
    {"text": "The unemployment rate has fallen to a five-year low, according to new data.", "category": "Business"},
    {"text": "The inflation rate has dropped for the third consecutive month.", "category": "Business"},
    {"text": "The merger of the two major companies has been approved by regulators.", "category": "Business"},
    {"text": "The unemployment rate has fallen to a five-year low, according to new data.", "category": "Business"},
    {"text": "The cryptocurrency market experienced significant volatility this week.", "category": "Business"},

    # Health articles
    {"text": "A new study shows that regular exercise can reduce the risk of heart disease.", "category": "Health"},
    {"text": "A clinical trial of a new cancer treatment has shown promising results.", "category": "Health"},
    {"text": "A balanced diet and regular sleep are essential for maintaining good health.", "category": "Health"},
    {"text": "Medical researchers have identified a new gene linked to Alzheimer's disease.", "category": "Health"},
    {"text": "The World Health Organization has released new guidelines for managing diabetes in the elderly.", "category": "Health"},
    {"text": "A new technique for early detection of breast cancer has been developed.", "category": "Health"},
    {"text": "Studies show that mindfulness meditation can help reduce stress and anxiety.", "category": "Health"},
    {"text": "Public health officials warn of a potential flu outbreak this winter.", "category": "Health"},

    # Technology articles
    {"text": "Apple's latest smartphone features a better camera and longer battery life.", "category": "Technology"},
    {"text": "Tesla's new electric car has a range of over 400 miles.", "category": "Technology"},
    {"text": "The latest operating system update includes new security features.", "category": "Technology"},
    {"text": "A new artificial intelligence system can detect diseases from medical images.", "category": "Technology"},
    {"text": "The tech company unveiled a new virtual reality headset at its annual conference.", "category": "Technology"},
    {"text": "Researchers have developed a quantum computer that can solve complex problems.", "category": "Technology"},
    {"text": "The new social media platform gained millions of users in just a few months.", "category": "Technology"},
    {"text": "Cybersecurity experts warn of a new type of malware targeting smart home devices.", "category": "Technology"},

    # Science articles
    {"text": "Scientists have discovered a new species of frog in the Amazon rainforest.", "category": "Science"},
    {"text": "Astronomers have observed a supernova in a distant galaxy.", "category": "Science"},
    {"text": "Researchers have developed a new method for measuring ocean temperatures.", "category": "Science"},
    {"text": "A fossil discovery suggests that dinosaurs may have been warm-blooded.", "category": "Science"},
    {"text": "Climate scientists report that Arctic ice is melting at an unprecedented rate.", "category": "Science"},
    {"text": "Physicists have confirmed the existence of a new subatomic particle.", "category": "Science"},
    {"text": "A study of coral reefs shows signs of recovery in protected marine areas.", "category": "Science"},
    {"text": "Biologists have sequenced the genome of an endangered tiger species.", "category": "Science"}
]

# Prepare data for classification training
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [article["text"] for article in articles]
X = model.encode(texts)
y = [article["category"] for article in articles]

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Train a logistic regression classifier with regularization
classifier = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
classifier.fit(X_train, y_train)

# Evaluate the classifier
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))

# Classify new articles
new_articles = [
    "The company reported a 20% increase in quarterly profits.",
    "A new vaccine has been approved for use against the flu.",
    "The new laptop features a faster processor and more memory.",
    "The Mars rover has sent back new images of the planet's surface."
]
new_embeddings = model.encode(new_articles)
new_embeddings_scaled = scaler.transform(new_embeddings)
new_predictions = classifier.predict(new_embeddings_scaled)
for article, prediction in zip(new_articles, new_predictions):
    print(f"Article: {article}\nPredicted Category: {prediction}\n")
```
When you run this, you will get:
```
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00         2
      Health       0.50      1.00      0.67         1
     Science       1.00      1.00      1.00         2
  Technology       1.00      0.50      0.67         2

    accuracy                           0.86         7
   macro avg       0.88      0.88      0.83         7
weighted avg       0.93      0.86      0.86         7

Article: The company reported a 20% increase in quarterly profits.
Predicted Category: Business

Article: A new vaccine has been approved for use against the flu.
Predicted Category: Health

Article: The new laptop features a faster processor and more memory.
Predicted Category: Technology

Article: The Mars rover has sent back new images of the planet's surface.
Predicted Category: Science
```
In this example, the corpus is annotated with one of the four categories: business, health, technology, or science. The text is converted into embeddings, which, together with the category label, are used to train a logistic regression classifier.
The classifier is trained with 80% of the corpus and then evaluated with the remaining 20%. The results are printed in the form of a classification report. You can see that Business and Science are classified accurately, but Health and Technology are not so good. When you finish the training, you can use the trained classifier on the new articles. The workflow is the same as in training: Encode the text into embeddings, then scale the embeddings using the trained scaler, and finally, use the trained classifier to predict the category.
Note that you can use other classifiers like random forest or K-Nearest Neighbors. You can try them and see which one works better.
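Since the embedding features stay the same, swapping in a different scikit-learn classifier is a small change. Here is a sketch reusing the X_train/X_test split from the listing above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Drop-in replacements for LogisticRegression; the features are unchanged
for clf in [KNeighborsClassifier(n_neighbors=3),
            RandomForestClassifier(n_estimators=100, random_state=42)]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, "accuracy:", clf.score(X_test, y_test))
```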
Zero-Shot Classification
In the previous example, you trained a classifier to classify the text into one of the predefined categories. If the category labels are meaningful text, why can’t you use the meaning of the label for classification? In this way, you can simply convert the text into embeddings and then compare it with the category labels’ embeddings. The text is then tagged with the most similar category label.
This is the idea of zero-shot learning. It is not a supervised learning task. Indeed, you never train a new model, but the classification and information retrieval tasks can still be done.
Let's implement a zero-shot text classifier using text embeddings:
```python
import torch
from sentence_transformers import SentenceTransformer, util

texts = [
    "The stock market reached a new high today, with technology stocks leading the gains.",
    "A new study shows that regular exercise can reduce the risk of heart disease.",
    "The latest smartphone from Apple features a better camera and longer battery life.",
    "Scientists have discovered a new species of frog in the Amazon rainforest."
]
categories = ["Business", "Health", "Technology", "Science"]

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
text_embeddings = model.encode(texts, convert_to_tensor=True)
category_embeddings = model.encode(categories, convert_to_tensor=True)

# Calculate cosine similarity between texts and categories
similarities = util.cos_sim(text_embeddings, category_embeddings)

# Get the most similar category for each text
best_categories = torch.argmax(similarities, dim=1)
for i, text in enumerate(texts):
    category = categories[best_categories[i]]
    similarity = similarities[i][best_categories[i]].item()
    print(f"Text: {text}")
    print(f"Category: {category} (Similarity: {similarity:.4f})\n")
```
The output is as follows:
```
Text: The stock market reached a new high today, with technology stocks leading the gains.
Category: Technology (Similarity: 0.2624)

Text: A new study shows that regular exercise can reduce the risk of heart disease.
Category: Health (Similarity: 0.3297)

Text: The latest smartphone from Apple features a better camera and longer battery life.
Category: Technology (Similarity: 0.1623)

Text: Scientists have discovered a new species of frog in the Amazon rainforest.
Category: Science (Similarity: 0.1940)
```
The result may not be as good as the previous example because the category labels are sometimes ambiguous, and you do not have a model trained for this task. Nevertheless, it produces meaningful results.
Zero-shot learning is particularly useful for tasks where labeled training data is scarce or unavailable. It can be applied to a wide range of NLP tasks, including classification, entity recognition, and question-answering.
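One way to mitigate the ambiguity of single-word labels is to encode short descriptive sentences instead of the bare category names. The following is a sketch reusing the model and variables from the listing above; the exact prompt wordings are assumptions you should tune for your data:

```python
# Richer label descriptions often separate categories better than single words
category_prompts = [
    "This text is about business, finance, and the economy.",
    "This text is about health, medicine, and wellness.",
    "This text is about technology, gadgets, and software.",
    "This text is about science and scientific research."
]
category_embeddings = model.encode(category_prompts, convert_to_tensor=True)
similarities = util.cos_sim(text_embeddings, category_embeddings)
best_categories = torch.argmax(similarities, dim=1)
for text, idx in zip(texts, best_categories):
    print(f"{categories[idx]}: {text}")
```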
Visualizing Text Embeddings
Not a particular application, but visualizing text embeddings can sometimes provide insights into the semantic relationships between texts. Since embeddings typically have hundreds of dimensions, you need dimensionality reduction techniques to visualize them in 2D or 3D.
PCA is probably the most popular dimensionality reduction technique. However, for visualization, t-SNE (t-Distributed Stochastic Neighbor Embedding) usually works better. Let's implement a visualization of text embeddings using t-SNE:
```python
import matplotlib.pyplot as plt
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

texts_with_categories = [
    {"text": "The stock market reached a new high today.", "category": "Business"},
    {"text": "Investors are optimistic about the economy.", "category": "Business"},
    {"text": "The company reported strong quarterly earnings.", "category": "Business"},
    {"text": "The central bank has decided to keep interest rates unchanged.", "category": "Business"},
    {"text": "A new study shows that regular exercise can reduce the risk of heart disease.", "category": "Health"},
    {"text": "A balanced diet is essential for maintaining good health.", "category": "Health"},
    {"text": "The new vaccine has been approved for use against the flu.", "category": "Health"},
    {"text": "Sleep is important for physical and mental health.", "category": "Health"},
    {"text": "The latest smartphone features a better camera and longer battery life.", "category": "Technology"},
    {"text": "The new laptop has a faster processor and more memory.", "category": "Technology"},
    {"text": "The software update includes new security features.", "category": "Technology"},
    {"text": "5G networks promise faster internet speeds for mobile devices.", "category": "Technology"},
    {"text": "Scientists have discovered a new species in the Amazon rainforest.", "category": "Science"},
    {"text": "Astronomers have observed a supernova in a distant galaxy.", "category": "Science"},
    {"text": "The Mars rover has sent back new images of the planet's surface.", "category": "Science"},
    {"text": "Researchers have developed a new method for measuring ocean temperatures.", "category": "Science"}
]

# Extract texts and categories
texts = [item["text"] for item in texts_with_categories]
categories = [item["category"] for item in texts_with_categories]

# Generate embeddings, then reduce dimension with t-SNE
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings)

# Define colors for categories
unique_categories = list(set(categories))
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_categories)))
category_to_color = {category: color for category, color in zip(unique_categories, colors)}

# Create a scatter plot
plt.figure(figsize=(10, 8))
for i, (x, y) in enumerate(reduced_embeddings):
    category = categories[i]
    color = category_to_color[category]
    plt.scatter(x, y, color=color, alpha=0.7)
    plt.annotate(texts[i][:20] + "...", (x, y), fontsize=8)

# Add legend, mark the axes
for category, color in category_to_color.items():
    plt.scatter([], [], color=color, label=category)
plt.legend()
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.title("t-SNE Visualization of Text Embeddings")
plt.tight_layout()
plt.show()
```
You used the t-SNE implementation from scikit-learn. It is easy to use: you only need to pass the rows of embedding vectors to the tsne.fit_transform() method. The output reduced_embeddings is an $N \times 2$ array, i.e., coordinates in a two-dimensional space.
You then use a for loop to plot each transformed embedding as a point in a scatter plot. Each point is colored according to the category annotated in the original data. To avoid cluttering the chart, the legend is created separately in a second for loop. The resulting chart looks like this:
The visualization places texts with similar meanings close together, which shows that the embeddings are useful for representing the semantic meaning of the text. You can inspect the chart and check whether points of the same category cluster together tightly enough to judge whether your embeddings are good.
Other dimensionality reduction techniques exist, such as PCA (Principal Component Analysis) and UMAP (Uniform Manifold Approximation and Projection). You can try them and see whether the visualization still makes sense.
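As a sketch, either one can be dropped in place of the t-SNE step in the script above (UMAP requires the separate umap-learn package):

```python
from sklearn.decomposition import PCA
# import umap  # from the umap-learn package

# PCA: a linear, deterministic alternative to t-SNE
reduced_embeddings = PCA(n_components=2).fit_transform(embeddings)

# UMAP: nonlinear like t-SNE, usually faster on larger corpora
# reduced_embeddings = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
```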
Further Reading

Below are some resources that you may find useful:
- Pretrained models for text embeddings
- t-SNE in Scikit-Learn
- PCA in Scikit-Learn
- UMAP
Summary

In this tutorial, you learned about several applications of text embeddings. In particular, you learned how to:
- Build a recommendation system using similarity in the embedding space
- Implement cross-lingual applications with multilingual embeddings
- Train a text classification system using embeddings as features
- Develop zero-shot text labeling applications using similarity metrics in the embedding space
- Visualize and analyze text embeddings

Text embeddings are simple yet powerful tools for a variety of NLP tasks. They enable machines to understand and process text in a way that captures semantic meaning.