Making Sense of Text with Decision Trees

By Iván Palomares Carrascosa on August 12, 2025 in Practical Machine Learning

Image by Editor | ChatGPT

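In this article, you will learn to:

Build a decision tree classifier for spam detection that analyzes text data.
Incorporate text data modeling techniques such as TF-IDF and embeddings to train your decision tree.
Evaluate and compare classification results against other text classifiers, such as Naive Bayes, using Scikit-learn.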

Introduction

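It's no secret that decision tree-based models excel at a wide range of classification and regression tasks, usually on structured, tabular data. However, when combined with the right tools, decision trees also become powerful predictive tools for unstructured data such as text or images, and even for time series data.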

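This article demonstrates how to build decision trees for text data. Specifically, we will feed text representations such as TF-IDF and embeddings into decision trees trained for spam classification, evaluate their performance, and compare the results with another text classification model, all with the help of Python's Scikit-learn library.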

Building Decision Trees for Text Classification

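The following hands-on tutorial uses a publicly available UCI dataset for spam classification (the SMS Spam Collection): a set of text-label pairs, each consisting of a message and its label, either spam or "ham" (a colloquial term for non-spam messages).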

The code below requests the dataset from its public repository URL, decompresses it, and loads it into a Pandas DataFrame object named df.

import pandas as pd
import requests
import zipfile

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
r = requests.get(url)
r.raise_for_status()  # fail early if the download did not succeed
with open("smsspamcollection.zip", "wb") as out:
    out.write(r.content)

with zipfile.ZipFile("smsspamcollection.zip", "r") as z:
    with z.open("SMSSpamCollection") as f:
        df = pd.read_csv(f, sep='\t', names=["label", "text"])

df.head()

As a quick initial check, let's look at the counts of spam and ham messages.

df["label"].value_counts()

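There are 4,825 ham messages (about 87%) and 747 spam messages (about 13%). This means we are dealing with a class-imbalanced dataset. Keep this in mind, because a simple metric like accuracy will not be the best standalone measure for evaluation.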
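
To see why accuracy alone can mislead here, the short sketch below works only from the two class counts reported above. The "always ham" classifier is a hypothetical baseline introduced for illustration; it is not one of the models trained in this article.

```python
# Class counts reported above for the spam dataset.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# A hypothetical classifier that labels every message as ham is right on
# every ham message and wrong on every spam message.
baseline_accuracy = counts["ham"] / total
baseline_spam_recall = 0 / counts["spam"]

print(f"ham share:  {shares['ham']:.1%}")
print(f"spam share: {shares['spam']:.1%}")
print(f"always-ham accuracy:    {baseline_accuracy:.3f}")
print(f"always-ham spam recall: {baseline_spam_recall:.3f}")
```

Such a baseline would look accurate while catching no spam at all, which is why the per-class precision and recall matter more than overall accuracy on this dataset.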

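Next, we split the dataset (both input texts and labels) into training and test sets. Because of the class imbalance, we use stratified sampling to preserve the same class proportions in both subsets, which helps train a model that generalizes better and keeps the test set representative.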

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)
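
As a small illustration of what the stratify argument does, the sketch below splits a hypothetical toy dataset (80 "ham" and 20 "spam" placeholder messages, not the real data) and counts the labels on each side. The toy_ variable names are chosen so that they do not overwrite the variables used in the tutorial.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 "ham" and 20 "spam" placeholder messages.
toy_texts = [f"message {i}" for i in range(100)]
toy_labels = ["ham"] * 80 + ["spam"] * 20

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# With stratification, both subsets keep the original 80/20 class ratio.
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without stratify, the number of spam messages in a small test set could vary from one random split to another, making the spam metrics noisier.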

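Now we are ready to train the first decision tree model. The key here is to encode the text data in a structured format that decision trees can handle. A common approach is TF-IDF vectorization, which maps each text to a sparse numerical vector in which each dimension (feature) represents a term in the vocabulary, weighted by its TF-IDF score.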

Scikit-learn's Pipeline class provides an elegant way to chain these steps. We will create a pipeline that first applies TF-IDF vectorization using TfidfVectorizer and then trains a DecisionTreeClassifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

tfidf_tree = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", DecisionTreeClassifier(random_state=42))
])

tfidf_tree.fit(X_train, y_train)
y_pred = tfidf_tree.predict(X_test)

print("MODEL 1. Decision Tree + TF-IDF:")
print(classification_report(y_test, y_pred))

Results:

MODEL 1. Decision Tree + TF-IDF:
              precision    recall  f1-score   support

         ham       0.97      0.99      0.98       966
        spam       0.91      0.83      0.87       149

    accuracy                           0.97      1115
   macro avg       0.94      0.91      0.92      1115
weighted avg       0.97      0.97      0.97      1115

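The results aren't too bad, but they are slightly inflated by the dominant ham class. If catching all spam is critical, we should pay particular attention to the recall for the spam class, which is only 0.83 here. Precision for spam is higher (0.91), meaning that few ham messages are wrongly flagged as spam. That is the priority if we want to avoid important messages ending up in the spam folder.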
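
For readers who want to see how the spam precision and recall in such a report are derived, the sketch below computes them by hand. The ten labels are hypothetical and are not taken from the dataset or from the model's predictions.

```python
# Hypothetical true labels and predictions for ten messages.
toy_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
toy_pred = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]

pairs = list(zip(toy_true, toy_pred))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam caught
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham wrongly flagged
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam missed

# Precision: of everything flagged as spam, how much really was spam?
precision = tp / (tp + fp)
# Recall: of all real spam, how much was flagged?
recall = tp / (tp + fn)

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low recall means spam slips through to the inbox, while low precision means legitimate messages get flagged, so the metric to prioritize depends on which mistake is more costly.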

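Our second decision tree will use an alternative way to represent text: embeddings. Embeddings are vector representations of words or sentences in which similar texts lie close together in the vector space, capturing semantic meaning and contextual relationships beyond simple word counts.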

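A simple way to generate text embeddings is to use a pre-trained model such as GloVe. We can map each word in a message to its corresponding dense GloVe vector and then represent the whole message by averaging these word vectors. This yields a compact, dense numerical representation of each message.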
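
Before working with real GloVe vectors, the averaging idea can be sketched on a tiny scale. The three-dimensional word vectors and the average_embedding() helper below are hypothetical and exist only for this illustration; real GloVe vectors have 50 or more dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only.
toy_embeddings = {
    "free": np.array([1.0, 0.0, 0.0]),
    "prize": np.array([0.0, 1.0, 0.0]),
    "now": np.array([0.0, 0.0, 1.0]),
}


def average_embedding(text, embeddings, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


v1 = average_embedding("Free prize now", toy_embeddings)
v2 = average_embedding("now prize free", toy_embeddings)
v3 = average_embedding("zzz qqq", toy_embeddings)  # no known words

print(v1, v2, v3)
```

The first two messages contain the same words in a different order and therefore receive the same vector, which shows that averaging discards word order. A message with no known words falls back to a zero vector.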

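The following code implements this process. It downloads the GloVe vectors, defines a text_to_embedding() function, applies it to the training and test sets, and then trains and evaluates a new decision tree.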

import numpy as np

# Downloading GloVe embeddings (the "!" lines are notebook shell commands;
# in a terminal, run them without the leading "!")
!wget -q http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip -d glove.6B

# Load embeddings into a dictionary
embeddings_index = {}
with open("glove.6B/glove.6B.50d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs


def text_to_embedding(texts):
    vectors = []
    for text in texts:
        words = text.lower().split()
        word_vecs = [embeddings_index[w] for w in words if w in embeddings_index]
        if word_vecs:
            vectors.append(np.mean(word_vecs, axis=0))
        else:
            vectors.append(np.zeros(50))
    return np.array(vectors)

X_train_emb = text_to_embedding(X_train)
X_test_emb = text_to_embedding(X_test)

tree_emb = DecisionTreeClassifier(random_state=42)
tree_emb.fit(X_train_emb, y_train)
y_pred_emb = tree_emb.predict(X_test_emb)

print("MODEL 2. Decision Tree + Embeddings")
print(classification_report(y_test, y_pred_emb))

Results:

MODEL 2. Decision Tree + Embeddings
              precision    recall  f1-score   support

         ham       0.95      0.95      0.95       966
        spam       0.66      0.69      0.68       149

    accuracy                           0.91      1115
   macro avg       0.81      0.82      0.81      1115
weighted avg       0.91      0.91      0.91      1115

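Unfortunately, this simple averaging approach can cause significant information loss, sometimes called representation loss, which explains the drop in overall performance compared to the TF-IDF model. Decision trees often work better with sparse, high-signal features like those produced by TF-IDF. Such word-level features can act as strong discriminators (for example, classifying a message as spam based on the presence of words like "free" or "million"), and this largely accounts for the performance gap between the two models.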

Comparison with a Naive Bayes Text Classifier

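Finally, let's compare our results with another popular model for text classification: the Naive Bayes classifier. Although it is not tree-based, it works well with TF-IDF features. The process is very similar to that of our first model.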

from sklearn.naive_bayes import MultinomialNB

nb_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)

print("BASELINE. Naive Bayes + TF-IDF")
print(classification_report(y_test, y_pred_nb))

Results:

BASELINE. Naive Bayes + TF-IDF
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.70      0.83       149

    accuracy                           0.96      1115
   macro avg       0.98      0.85      0.90      1115
weighted avg       0.96      0.96      0.96      1115

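Comparing our first decision tree model (Model 1) with this Naive Bayes model, we see little difference in how they classify ham messages. For the spam class, the Naive Bayes model achieves perfect precision (1.00), meaning that every message it flagged as spam in the test set really was spam. Its recall is weaker, however (0.70): it missed about 30% of the actual spam in the test data. If recall is our most important performance metric, we would favor the first decision tree model combined with TF-IDF. We could then try to optimize it further, for example through hyperparameter tuning or by using more training data.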
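
As one possible starting point for that tuning, the sketch below wraps a grid search over a TF-IDF decision tree pipeline and scores candidates by spam recall. The helper name tune_tree_for_spam_recall(), the parameter grid, and the twelve-message demo corpus are all assumptions made for illustration; the article does not prescribe a particular search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def tune_tree_for_spam_recall(texts, labels, cv=3):
    """Grid-search a TF-IDF + decision tree pipeline, scored by spam recall."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", DecisionTreeClassifier(random_state=42)),
    ])
    # Illustrative grid only; sensible ranges depend on the data.
    param_grid = {
        "clf__max_depth": [None, 5, 10],
        "clf__class_weight": [None, "balanced"],
    }
    spam_recall = make_scorer(recall_score, pos_label="spam")
    search = GridSearchCV(pipeline, param_grid, scoring=spam_recall, cv=cv)
    search.fit(texts, labels)
    return search


# Hypothetical demo corpus so the sketch runs on its own.
demo_texts = [
    "free prize waiting", "win free cash now", "free entry today",
    "claim your free reward", "urgent free offer", "free tickets win",
    "see you today", "call me later", "lunch at noon",
    "meeting moved to monday", "thanks for the update", "are you home yet",
]
demo_labels = ["spam"] * 6 + ["ham"] * 6

search = tune_tree_for_spam_recall(demo_texts, demo_labels)
print(search.best_params_)
print(round(search.best_score_, 2))
```

On the real data, the same function could be called with X_train and y_train. Because optimizing recall alone can reduce precision, the selected model should still be checked with the full classification report on the test set.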

Wrapping Up

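This article demonstrated how to train decision tree models on text data, tackling spam classification with common text representations such as TF-IDF and vector embeddings.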
