Text embeddings have revolutionized natural language processing by providing dense vector representations that capture semantic meaning. In the previous tutorial, you learned how to generate these embeddings with a transformer model. In this post, you will learn about advanced applications of text embeddings that go beyond basic tasks such as semantic search and document clustering.
Specifically, you will learn:
- How to build a recommendation system using text embeddings
- How to implement cross-lingual applications with multilingual embeddings
- How to create a text classification system with embedding-based features
- How to develop zero-shot learning applications
- How to visualize and analyze text embeddings
Kick-start your project with my book NLP in Hugging Face Transformers. It provides self-study tutorials with working code.
Let's get started.

Applications of text embeddings
Photo by Christina Winter. Some rights reserved.
Overview

This post is divided into five parts; they are:
- Recommendation Systems
- Cross-Lingual Applications
- Text Classification
- Zero-Shot Classification
- Visualizing Text Embeddings
Recommendation Systems
A simple recommendation system can be built by finding the few items most similar to a target item. In an NLP setting, for example, when a user reads an article, you can surface a few similar articles as "you may also like" suggestions.
There are many ways to implement this, but the simplest is to check how similar two articles are. You can convert all articles into contextual embeddings; the two articles whose embeddings have the highest similarity are similar in content. This may not be exactly the kind of recommendation you expect, but it is often useful and a good starting point.
Let's implement it as follows:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Define the article corpus (titles and contents)
articles = [
    {
        "title": "Understanding Deep Learning",
        "content": ("Deep learning is a subset of machine learning where artificial neural "
                    "networks, algorithms inspired by the human brain, learn from large "
                    "amounts of data.")
    },
    {
        "title": "Introduction to Natural Language Processing",
        "content": ("Natural Language Processing (NLP) is a field of artificial intelligence "
                    "that enables machines to read, understand, and derive meaning from "
                    "human languages.")
    },
    {
        "title": "The Future of Computer Vision",
        "content": ("Computer vision is an interdisciplinary field that deals with how "
                    "computers can gain high-level understanding from digital images or videos.")
    },
    {
        "title": "Reinforcement Learning Explained",
        "content": ("Reinforcement learning is an area of machine learning concerned with how "
                    "software agents ought to take actions in an environment in order to "
                    "maximize some notion of cumulative reward.")
    },
    {
        "title": "Neural Networks and Their Applications",
        "content": ("Neural networks are a set of algorithms, modeled loosely after the human "
                    "brain, that are designed to recognize patterns in data.")
    }
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def create_article_embeddings(articles, model):
    """Create embeddings for the articles"""
    texts = [f"{article['title']}. {article['content']}" for article in articles]
    embeddings = model.encode(texts)
    return embeddings

def get_recommendations(article_id, articles, embeddings, top_n=2):
    """Get recommendations for a given article ID based on cosine similarity"""
    similarities = cosine_similarity([embeddings[article_id]], embeddings)[0]
    similar_indices = np.argsort(similarities)[::-1][1:top_n+1]
    return [articles[idx] for idx in similar_indices]

# Create embeddings for all articles and get recommendations for the first one
embeddings = create_article_embeddings(articles, model)
recommendations = get_recommendations(0, articles, embeddings)

# Print the recommendations
print(f'Recommendations for "{articles[0]["title"]}":')
for i, rec in enumerate(recommendations):
    print(f"{i+1}. {rec['title']}")
```
You set up the corpus at the beginning of the code because this is a toy example. In a real application, you would probably retrieve the corpus from a database or file system.
In this program, you used the all-MiniLM-L6-v2 model, instantiated with SentenceTransformer. This is a pre-trained model that can encode text into contextual embeddings. You took all the articles defined in the corpus and converted each one into a contextual embedding in the create_article_embeddings() function. The output is a vector of vectors, i.e., a matrix. In this particular implementation, there are 5 items in the corpus and each embedding vector has 384 dimensions, so the output embeddings is a matrix of shape (5, 384).
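If you want to confirm these numbers yourself, a quick check after running the listing above:

```python
# After running the listing above, verify the shape of the embedding matrix
print(embeddings.shape)    # (5, 384): 5 articles, 384 dimensions each
print(type(embeddings))    # <class 'numpy.ndarray'>
```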
In get_recommendations(), you computed the cosine similarity between one embedding and all the others. The cosine_similarity() function from scikit-learn expects two lists of vectors and returns a matrix of the similarity of each pair. Because you are comparing one item against all the rest, the output matrix has only one row. Then, with np.argsort(similarities), you obtained the indices of the similarity scores in ascending order. Since cosine similarity is 1 when the vectors are identical and 0 when they are orthogonal (i.e., completely different), you reverse the result to sort the similarity scores in descending order. The most similar items are then at the beginning of this list, except for the first one, which is the article itself.
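To make the sorting step concrete, here is a tiny standalone demonstration (with made-up similarity scores) of how np.argsort() combined with [::-1] yields indices in descending order of similarity:

```python
import numpy as np

similarities = np.array([1.0, 0.31, 0.72, 0.05, 0.54])  # index 0 is the article itself
ascending = np.argsort(similarities)   # [3, 1, 4, 2, 0]: indices from least to most similar
descending = ascending[::-1]           # [0, 2, 4, 1, 3]: most similar first
top_2 = descending[1:3]                # skip index 0, which is the article itself
print(top_2)                           # [2 4]
```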
Once you have the indices of the most similar items, you print the recommendations with a for loop.
When you run this code, you will get:
```
Recommendations for "Understanding Deep Learning":
1. Neural Networks and Their Applications
2. Reinforcement Learning Explained
```
These recommendations are based on semantic similarity rather than mere keyword matching, so even if an article does not contain the exact phrase "deep learning," you will still get articles about neural networks or machine learning. This approach can be extended to more sophisticated recommendation systems by incorporating user preferences, collaborative filtering, or hybrid approaches.
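As a minimal sketch of one such extension, you could represent a user by the average embedding of the articles they have read and rank articles against that profile instead of a single article. The read_indices list below is hypothetical, and the sketch reuses the articles, embeddings, and imports from the listing above:

```python
# Hypothetical reading history: indices of articles the user has already read
read_indices = [0, 4]

# A simple user profile: the mean of the embeddings of the articles read
user_profile = np.mean(embeddings[read_indices], axis=0)

# Rank all articles against the profile, skipping the ones already read
similarities = cosine_similarity([user_profile], embeddings)[0]
ranked = [idx for idx in np.argsort(similarities)[::-1] if idx not in read_indices]
print("You may also like:", articles[ranked[0]]["title"])
```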
Cross-Lingual Applications
A powerful feature of modern transformer models is their ability to generate embeddings for text in multiple languages. This enables cross-lingual applications in which you can compare or process text across different languages.
Let's implement a simple cross-lingual semantic search system:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    {
        "language": "English",
        "text": ("Machine learning is a field of study that gives computers the ability "
                 "to learn without being explicitly programmed.")
    },
    {
        "language": "Spanish",
        "text": ("El aprendizaje automático es un campo de estudio que da a las computadoras la "
                 "capacidad de aprender sin ser programadas explícitamente.")
    },
    {
        "language": "French",
        "text": ("L'apprentissage automatique est un domaine d'étude qui donne aux ordinateurs "
                 "la capacité d'apprendre sans être explicitement programmés.")
    },
    {
        "language": "German",
        "text": ("Maschinelles Lernen ist ein Studienbereich, der Computern die Fähigkeit gibt, "
                 "zu lernen, ohne explizit programmiert zu werden.")
    },
    {
        "language": "Italian",
        "text": ("Il machine learning è un campo di studio che conferisce ai computer la capacità "
                 "di apprendere senza essere esplicitamente programmati.")
    },
    {
        "language": "English",
        "text": ("Natural language processing is a subfield of linguistics, computer science, "
                 "and artificial intelligence.")
    },
    {
        "language": "English",
        "text": ("Computer vision is an interdisciplinary field that deals with how computers can "
                 "gain high-level understanding from digital images or videos.")
    }
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Generate embeddings for the corpus
texts = [doc["text"] for doc in corpus]
embeddings = model.encode(texts)

# Define an English query and generate its embedding
query = "What is machine learning?"
query_embedding = model.encode(query)

# Sort the corpus embeddings by descending similarity to the query
similarities = cosine_similarity([query_embedding], embeddings)[0]
ranked_indices = np.argsort(similarities)[::-1]

# Print the ranked results
print(f"Query: {query}\n")
for i, idx in enumerate(ranked_indices[:3]):  # show the top 3 results
    print(f"{i+1}. [{corpus[idx]['language']}] {corpus[idx]['text']} "
          f"(Similarity: {similarities[idx]:.4f})")
```
In this example, we used a multilingual Sentence Transformer model (paraphrase-multilingual-MiniLM-L12-v2) to create embeddings for documents in different languages. The corpus covers several languages and several topics. The program above implements a question-answering system, but the answer to a question may be found in a different language.
This example is very similar to the one in the previous section. The corpus is first converted into embeddings. The query, in embedding form, is then compared against the corpus using cosine similarity, and the top 3 results are printed. Running this code gives you:
```
Query: What is machine learning?

1. [Italian] Il machine learning è un campo di studio che conferisce ai computer la capacità di apprendere senza essere esplicitamente programmati. (Similarity: 0.8129)
2. [English] Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. (Similarity: 0.7788)
3. [French] L'apprentissage automatique est un domaine d'étude qui donne aux ordinateurs la capacité d'apprendre sans être explicitement programmés. (Similarity: 0.7470)
```
The top answer is in Italian, while the question "What is machine learning?" is in English. This works because the embedding vectors represent the semantic meaning of the text, regardless of its language. This cross-lingual capability is particularly useful for applications such as multilingual search engines.
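Because the query and the corpus live in the same embedding space, nothing in the code above is specific to English. As a quick sketch, reusing the model and corpus embeddings from the listing, you could issue the same query in German (the exact similarity scores are not claimed here):

```python
# The same search with a German query; corpus embeddings are reused
query_de = "Was ist maschinelles Lernen?"
query_de_embedding = model.encode(query_de)

similarities_de = cosine_similarity([query_de_embedding], embeddings)[0]
best = int(np.argmax(similarities_de))
print(f"[{corpus[best]['language']}] {corpus[best]['text']} "
      f"(Similarity: {similarities_de[best]:.4f})")
```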
Text Classification
Imagine you have a large collection of text data that grows every day, perhaps because you are collecting new articles or emails. You want to sort the items into different categories. This can be done using text embeddings.
This task resembles "topic modeling." Topic modeling is an unsupervised learning task that groups text documents into different topics, using algorithms such as Latent Dirichlet Allocation (LDA) to find the signature keywords of each class. Here, in contrast, is a supervised approach: you have a set of predefined categories and some examples (perhaps classified manually). You then add new text to the collection and classify it automatically.
Text embeddings can help by distilling the semantic meaning of the text into a vector. You can then train a machine learning model to classify these vectors into the different categories. This works well because the vectors represent the meaning of the text rather than the text itself, making it better than using bag-of-words or TF-IDF features.
There are many ways to implement a machine learning classifier. A simple one is logistic regression from scikit-learn. Let's implement this in code:
```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

articles = [
    # Business articles
    {"text": "The stock market reached a new high today, with technology stocks leading the gains.", "category": "Business"},
    {"text": "The government announced a new tax policy that will affect small businesses.", "category": "Business"},
    {"text": "The central bank has decided to keep interest rates unchanged.", "category": "Business"},
    {"text": "The unemployment rate has fallen to a five-year low, according to new data.", "category": "Business"},
    {"text": "The inflation rate has dropped for the third consecutive month.", "category": "Business"},
    {"text": "The merger of the two major companies has been approved by regulators.", "category": "Business"},
    {"text": "The unemployment rate has fallen to a five-year low, according to new data.", "category": "Business"},
    {"text": "The cryptocurrency market experienced significant volatility this week.", "category": "Business"},

    # Health articles
    {"text": "A new study shows that regular exercise can reduce the risk of heart disease.", "category": "Health"},
    {"text": "A clinical trial of a new cancer treatment has shown promising results.", "category": "Health"},
    {"text": "A balanced diet and regular sleep are essential for maintaining good health.", "category": "Health"},
    {"text": "Medical researchers have identified a new gene linked to Alzheimer's disease.", "category": "Health"},
    {"text": "The World Health Organization has released new guidelines for managing diabetes in the elderly.", "category": "Health"},
    {"text": "A new technique for early detection of breast cancer has been developed.", "category": "Health"},
    {"text": "Studies show that mindfulness meditation can help reduce stress and anxiety.", "category": "Health"},
    {"text": "Public health officials warn of a potential flu outbreak this winter.", "category": "Health"},

    # Technology articles
    {"text": "Apple's latest smartphone features a better camera and longer battery life.", "category": "Technology"},
    {"text": "Tesla's new electric car has a range of over 400 miles.", "category": "Technology"},
    {"text": "The latest operating system update includes new security features.", "category": "Technology"},
    {"text": "A new artificial intelligence system can detect diseases from medical images.", "category": "Technology"},
    {"text": "The tech company unveiled a new virtual reality headset at its annual conference.", "category": "Technology"},
    {"text": "Researchers have developed a quantum computer that can solve complex problems.", "category": "Technology"},
    {"text": "The new social media platform gained millions of users in just a few months.", "category": "Technology"},
    {"text": "Cybersecurity experts warn of a new type of malware targeting smart home devices.", "category": "Technology"},

    # Science articles
    {"text": "Scientists have discovered a new species of frog in the Amazon rainforest.", "category": "Science"},
    {"text": "Astronomers have observed a supernova in a distant galaxy.", "category": "Science"},
    {"text": "Researchers have developed a new method for measuring ocean temperatures.", "category": "Science"},
    {"text": "A fossil discovery suggests that dinosaurs may have been warm-blooded.", "category": "Science"},
    {"text": "Climate scientists report that Arctic ice is melting at an unprecedented rate.", "category": "Science"},
    {"text": "Physicists have confirmed the existence of a new subatomic particle.", "category": "Science"},
    {"text": "A study of coral reefs shows signs of recovery in protected marine areas.", "category": "Science"},
    {"text": "Biologists have sequenced the genome of an endangered tiger species.", "category": "Science"}
]

# Prepare data for classification training
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [article["text"] for article in articles]
X = model.encode(texts)
y = [article["category"] for article in articles]

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Train a logistic regression classifier with regularization
classifier = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
classifier.fit(X_train, y_train)

# Evaluate the classifier
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))

# Classify new articles
new_articles = [
    "The company reported a 20% increase in quarterly profits.",
    "A new vaccine has been approved for use against the flu.",
    "The new laptop features a faster processor and more memory.",
    "The Mars rover has sent back new images of the planet's surface."
]
new_embeddings = model.encode(new_articles)
new_embeddings_scaled = scaler.transform(new_embeddings)
new_predictions = classifier.predict(new_embeddings_scaled)
for article, prediction in zip(new_articles, new_predictions):
    print(f"Article: {article}\nPredicted Category: {prediction}\n")
```
When you run this, you will get:
```
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00         2
      Health       0.50      1.00      0.67         1
     Science       1.00      1.00      1.00         2
  Technology       1.00      0.50      0.67         2

    accuracy                           0.86         7
   macro avg       0.88      0.88      0.83         7
weighted avg       0.93      0.86      0.86         7

Article: The company reported a 20% increase in quarterly profits.
Predicted Category: Business

Article: A new vaccine has been approved for use against the flu.
Predicted Category: Health

Article: The new laptop features a faster processor and more memory.
Predicted Category: Technology

Article: The Mars rover has sent back new images of the planet's surface.
Predicted Category: Science
```
In this example, the corpus is annotated with one of the four categories: business, health, technology, or science. The text is converted into embeddings, which, together with the category label, are used to train a logistic regression classifier.
The classifier is trained with 80% of the corpus and then evaluated with the remaining 20%. The results are printed in the form of a classification report. You can see that Business and Science are classified accurately, but Health and Technology are not so good. When you finish the training, you can use the trained classifier on the new articles. The workflow is the same as in training: Encode the text into embeddings, then scale the embeddings using the trained scaler, and finally, use the trained classifier to predict the category.
Note that you can use other classifiers like random forest or K-Nearest Neighbors. You can try them and see which one works better.
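Since the embedding features stay the same, swapping in a different scikit-learn classifier is a small change. Here is a sketch reusing the X_train/X_test split from the listing above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Drop-in replacements for LogisticRegression; the features are unchanged
for clf in [KNeighborsClassifier(n_neighbors=3),
            RandomForestClassifier(n_estimators=100, random_state=42)]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, "accuracy:", clf.score(X_test, y_test))
```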
Zero-Shot Classification
In the previous example, you trained a classifier to classify the text into one of the predefined categories. If the category labels are meaningful text, why can’t you use the meaning of the label for classification? In this way, you can simply convert the text into embeddings and then compare it with the category labels’ embeddings. The text is then tagged with the most similar category label.
This is the idea of zero-shot learning. It is not a supervised learning task. Indeed, you never train a new model, but the classification and information retrieval tasks can still be done.
Let's implement a zero-shot text classifier using text embeddings:
```python
import torch
from sentence_transformers import SentenceTransformer, util

texts = [
    "The stock market reached a new high today, with technology stocks leading the gains.",
    "A new study shows that regular exercise can reduce the risk of heart disease.",
    "The latest smartphone from Apple features a better camera and longer battery life.",
    "Scientists have discovered a new species of frog in the Amazon rainforest."
]
categories = ["Business", "Health", "Technology", "Science"]

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
text_embeddings = model.encode(texts, convert_to_tensor=True)
category_embeddings = model.encode(categories, convert_to_tensor=True)

# Calculate cosine similarity between texts and categories
similarities = util.cos_sim(text_embeddings, category_embeddings)

# Get the most similar category for each text
best_categories = torch.argmax(similarities, dim=1)
for i, text in enumerate(texts):
    category = categories[best_categories[i]]
    similarity = similarities[i][best_categories[i]].item()
    print(f"Text: {text}")
    print(f"Category: {category} (Similarity: {similarity:.4f})\n")
```
The output is as follows:
```
Text: The stock market reached a new high today, with technology stocks leading the gains.
Category: Technology (Similarity: 0.2624)

Text: A new study shows that regular exercise can reduce the risk of heart disease.
Category: Health (Similarity: 0.3297)

Text: The latest smartphone from Apple features a better camera and longer battery life.
Category: Technology (Similarity: 0.1623)

Text: Scientists have discovered a new species of frog in the Amazon rainforest.
Category: Science (Similarity: 0.1940)
```
The result may not be as good as the previous example because the category labels are sometimes ambiguous, and you do not have a model trained for this task. Nevertheless, it produces meaningful results.
Zero-shot learning is particularly useful for tasks where labeled training data is scarce or unavailable. It can be applied to a wide range of NLP tasks, including classification, entity recognition, and question-answering.
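One way to mitigate the ambiguity of single-word labels is to encode short descriptive sentences instead of the bare category names. The following is a sketch reusing the model and variables from the listing above; the exact prompt wordings are assumptions you should tune for your data:

```python
# Richer label descriptions often separate categories better than single words
category_prompts = [
    "This text is about business, finance, and the economy.",
    "This text is about health, medicine, and wellness.",
    "This text is about technology, gadgets, and software.",
    "This text is about science and scientific research."
]
category_embeddings = model.encode(category_prompts, convert_to_tensor=True)
similarities = util.cos_sim(text_embeddings, category_embeddings)
best_categories = torch.argmax(similarities, dim=1)
for text, idx in zip(texts, best_categories):
    print(f"{categories[idx]}: {text}")
```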
Visualizing Text Embeddings
Not a particular application, but visualizing text embeddings can sometimes provide insights into the semantic relationships between texts. Since embeddings typically have hundreds of dimensions, you need dimensionality reduction techniques to visualize them in 2D or 3D.
PCA is probably the most popular dimensionality reduction technique. However, for visualization, t-SNE (t-Distributed Stochastic Neighbor Embedding) usually works better. Let's implement a visualization of text embeddings using t-SNE:
```python
import matplotlib.pyplot as plt
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

texts_with_categories = [
    {"text": "The stock market reached a new high today.", "category": "Business"},
    {"text": "Investors are optimistic about the economy.", "category": "Business"},
    {"text": "The company reported strong quarterly earnings.", "category": "Business"},
    {"text": "The central bank has decided to keep interest rates unchanged.", "category": "Business"},
    {"text": "A new study shows that regular exercise can reduce the risk of heart disease.", "category": "Health"},
    {"text": "A balanced diet is essential for maintaining good health.", "category": "Health"},
    {"text": "The new vaccine has been approved for use against the flu.", "category": "Health"},
    {"text": "Sleep is important for physical and mental health.", "category": "Health"},
    {"text": "The latest smartphone features a better camera and longer battery life.", "category": "Technology"},
    {"text": "The new laptop has a faster processor and more memory.", "category": "Technology"},
    {"text": "The software update includes new security features.", "category": "Technology"},
    {"text": "5G networks promise faster internet speeds for mobile devices.", "category": "Technology"},
    {"text": "Scientists have discovered a new species in the Amazon rainforest.", "category": "Science"},
    {"text": "Astronomers have observed a supernova in a distant galaxy.", "category": "Science"},
    {"text": "The Mars rover has sent back new images of the planet's surface.", "category": "Science"},
    {"text": "Researchers have developed a new method for measuring ocean temperatures.", "category": "Science"}
]

# Extract texts and categories
texts = [item["text"] for item in texts_with_categories]
categories = [item["category"] for item in texts_with_categories]

# Generate embeddings, then reduce dimension with t-SNE
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings)

# Define colors for categories
unique_categories = list(set(categories))
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_categories)))
category_to_color = {category: color for category, color in zip(unique_categories, colors)}

# Create a scatter plot
plt.figure(figsize=(10, 8))
for i, (x, y) in enumerate(reduced_embeddings):
    category = categories[i]
    color = category_to_color[category]
    plt.scatter(x, y, color=color, alpha=0.7)
    plt.annotate(texts[i][:20] + "...", (x, y), fontsize=8)

# Add legend, mark the axes
for category, color in category_to_color.items():
    plt.scatter([], [], color=color, label=category)
plt.legend()
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.title("t-SNE Visualization of Text Embeddings")
plt.tight_layout()
plt.show()
```
You used the t-SNE implementation from scikit-learn. It is easy to use: you only need to pass the rows of embedding vectors to the tsne.fit_transform() method. The output reduced_embeddings is an $N \times 2$ array, i.e., coordinates in a two-dimensional space.
You then use a for loop to plot each transformed embedding as a point in a scatter plot. Each point is colored according to the category annotated in the original data. To avoid cluttering the chart, the legend is created separately in a second for loop. The resulting chart looks like this:
The visualization places texts with similar meanings close together, which shows that the embeddings are useful for representing the semantic meaning of the text. You can inspect the chart and check whether points of the same category cluster together tightly enough to judge whether your embeddings are good.
Other dimensionality reduction techniques exist, such as PCA (Principal Component Analysis) and UMAP (Uniform Manifold Approximation and Projection). You can try them and see whether the visualization still makes sense.
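As a sketch, either one can be dropped in place of the t-SNE step in the script above (UMAP requires the separate umap-learn package):

```python
from sklearn.decomposition import PCA
# import umap  # from the umap-learn package

# PCA: a linear, deterministic alternative to t-SNE
reduced_embeddings = PCA(n_components=2).fit_transform(embeddings)

# UMAP: nonlinear like t-SNE, usually faster on larger corpora
# reduced_embeddings = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
```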
Further Reading

Below are some resources that you may find useful:
- Pretrained models for text embeddings
- t-SNE in Scikit-Learn
- PCA in Scikit-Learn
- UMAP
Summary

In this tutorial, you learned about several applications of text embeddings. In particular, you learned how to:
- Build a recommendation system using similarity in the embedding space
- Implement cross-lingual applications with multilingual embeddings
- Train a text classification system using embeddings as features
- Develop zero-shot text labeling applications using similarity metrics in the embedding space
- Visualize and analyze text embeddings

Text embeddings are simple yet powerful tools for a variety of NLP tasks. They enable machines to understand and process text in a way that captures semantic meaning.