5 Tips for Avoiding Common Rookie Mistakes in Machine Learning Projects

By Matthew Mayo on November 15, 2024 in Start Machine Learning

Image by Editor | Ideogram & Canva

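As a beginner, it is easy to make poor decisions in a machine learning project that undermine your efforts and jeopardize your results. While you will undoubtedly improve with practice over time, here are five tips for avoiding common rookie mistakes. Keep them in mind as you find your way, and they will help set your projects up for success.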

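1. Properly Preprocess Your Data

Proper data preprocessing is vital for building reliable machine learning models. You have probably heard the saying: garbage in, garbage out. It is true, but there is more to it than that. Here are two key aspects to focus on:

Data cleaning: Make sure your data is clean by handling missing values, removing duplicates, and correcting inconsistencies; this is essential because dirty data can lead to inaccurate models.
Normalization and scaling: Apply normalization or scaling techniques so that your features are on a similar scale, which helps many machine learning algorithms perform better.

Here is example code for performing these tasks, along with a few extra points you can pick up: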

import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

try:
   df = pd.read_csv('data.csv')
   
   # Check missing values pattern
   missing_pattern = df.isnull().sum()

   # Only show columns with missing values
   print("\nMissing values per column:")
   print(missing_pattern[missing_pattern > 0])
   
   # Calculate percentage of missing values
   missing_percentage = (df.isnull().sum() / len(df)) * 100
   print("\nPercentage missing per column:")
   print(missing_percentage[missing_percentage > 0])
   
   # Consider dropping columns with high missing percentages
   high_missing_cols = missing_percentage[missing_percentage > 50].index
   if len(high_missing_cols) > 0:
       print(f"\nColumns with >50% missing values (consider dropping):")
       print(high_missing_cols.tolist())
       # Optional: df = df.drop(columns=high_missing_cols)
   
   # Identify data types and handle missing values
   numeric_columns = df.select_dtypes(include=[np.number]).columns
   categorical_columns = df.select_dtypes(include=['object']).columns
   
   # Handle numeric and categorical separately (skip a group if it is empty)
   if len(numeric_columns) > 0:
       df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
   if len(categorical_columns) > 0:
       df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])
   
   # Scale only numeric features
   scaler = StandardScaler()
   df[numeric_columns] = scaler.fit_transform(df[numeric_columns])
   
except FileNotFoundError:
   print("Data file not found")
except Exception as e:
   print(f"Error processing data: {e}")

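Here is a bullet-point explanation of what is going on in the above excerpt:

Data profiling: Shows how many missing values each column has, and converts the counts to percentages for easier interpretation.
File loading and safety: Reads a CSV file with error protection: if the file is not found or something else goes wrong, the code tells you what happened.
Data type detection: Automatically identifies which columns contain numbers (ages, prices) and which contain categories (colors, names).
Missing data handling: For numeric columns, fills gaps with the middle value (median); for categorical columns, fills them with the most common value (mode).
Data scaling: Standardizes all numeric values so they are comparable (like converting different units to a common scale), while leaving categorical columns untouched.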
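To try the same steps without a CSV file on disk, the following is a minimal sketch on a small in-memory DataFrame. The column names and values are made up for illustration, and the sketch also shows duplicate removal, which the cleaning tip mentions but the excerpt above does not perform. Treat it as one possible arrangement, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one duplicated row plus a few missing values
df_toy = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "price": [10.0, 12.5, 9.0, np.nan, 12.5],
    "color": ["red", "blue", None, "blue", "blue"],
})

# Remove exact duplicate rows before computing any statistics
df_toy = df_toy.drop_duplicates().reset_index(drop=True)

# Treat every non-numeric column as categorical in this sketch
numeric_cols = df_toy.select_dtypes(include=[np.number]).columns
categorical_cols = df_toy.select_dtypes(exclude=[np.number]).columns

# Median for numeric columns, most frequent value for categorical ones
df_toy[numeric_cols] = df_toy[numeric_cols].fillna(df_toy[numeric_cols].median())
df_toy[categorical_cols] = df_toy[categorical_cols].fillna(
    df_toy[categorical_cols].mode().iloc[0]
)

# Standardize numeric columns to zero mean and unit variance
toy_scaler = StandardScaler()
df_toy[numeric_cols] = toy_scaler.fit_transform(df_toy[numeric_cols])

print(df_toy)
```

After running it, the frame has four rows, no missing values, and numeric columns centered on zero.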

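2. Avoid Overfitting with Cross-Validation

Overfitting occurs when your model performs well on the training data but poorly on new data. It is a common struggle for new practitioners, and cross-validation is a powerful weapon for combating it.

Cross-validation: Implement k-fold cross-validation to check how well your model generalizes; this technique divides your data into k subsets and trains your model k times, each time using a different subset as the validation set and the remaining subsets as the training set.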
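To make the splitting itself concrete, here is a small sketch that prints which samples land in the training and validation sets for each fold. It assumes ten hypothetical samples and uses scikit-learn's KFold without shuffling so the pattern is easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified only by their index
X_idx = np.arange(10).reshape(-1, 1)

# k = 5 folds, no shuffling, so each validation fold is a contiguous block
kf = KFold(n_splits=5, shuffle=False)
folds = list(kf.split(X_idx))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Every sample is used for validation exactly once and for training in the other four folds.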

Here is an example of implementing cross-validation:

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# X (features) and y (labels) are assumed to be loaded already

# Initialize model with key parameters
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

# Create stratified folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scale features and perform cross-validation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
scores = cross_val_score(model, X_scaled, y, cv=skf, scoring='accuracy')

print(f"CV scores: {scores}")
print(f"Mean: {scores.mean():.3f} (±{scores.std() * 2:.3f})")

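Here is what is going on:

Data preparation: Scales the features before modeling so that each one contributes on a comparable scale.
Model configuration: Sets a random seed for reproducibility and defines the basic hyperparameters up front.
Validation strategy: Uses stratified k-fold cross-validation to preserve the class distribution in each fold, which is especially important for imbalanced datasets.
Results reporting: Shows the individual fold scores along with the mean and an approximate spread (±2 standard deviations).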
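One refinement worth knowing about, which is not part of the original example: when the scaler is fit on the full dataset before cross-validation, each validation fold has already influenced the scaling statistics. A common way to avoid this is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is refit on the training portion of every fold. The sketch below assumes synthetic data from make_classification as a stand-in for your own X and y.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with your own X and y)
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=42)

# The scaler is fit inside each training fold, never on validation data
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

skf_pipe = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipeline, X_syn, y_syn, cv=skf_pipe, scoring="accuracy")

print(f"CV scores: {pipe_scores}")
print(f"Mean: {pipe_scores.mean():.3f}")
```

For tree-based models such as random forests the scaling step has little effect on the scores, but the same pattern matters for scale-sensitive models like logistic regression or k-nearest neighbors.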

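3. Feature Engineering and Selection

Good features can significantly boost the performance of your model (and poor ones can do the opposite). Focus on creating and selecting the right features with the following:

Feature engineering: Create new features from existing data to improve model performance; this may involve combining or transforming features to better capture the underlying patterns.
Feature selection: Use techniques such as Recursive Feature Elimination (RFE) or Recursive Feature Elimination with Cross-Validation (RFECV) to select the most important features, which helps reduce overfitting and improves model interpretability.

Here is an example: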

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold

# X is assumed to be a pandas DataFrame of features and y the labels

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize model
model = LogisticRegression(max_iter=1000, random_state=42)

# Use cross-validation to find the optimal number of features
rfecv = RFECV(
    estimator=model,
    step=1,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='accuracy',
    min_features_to_select=3
)

# Fit and get results
fit = rfecv.fit(X_scaled, y)
selected_features = X.columns[fit.support_]

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {selected_features}")
# grid_scores_ was removed in scikit-learn 1.2; cv_results_ replaces it
print(f"Cross-validation scores: {rfecv.cv_results_['mean_test_score']}")

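Here is what the above code does (some of this should look familiar by now):

Feature scaling: Standardizes the features before selection to avoid scale bias.
Cross-validation: Uses RFECV to find the optimal number of features automatically.
Model settings: Includes max_iter and random_state for stability and reproducibility.
Results clarity: Returns the actual feature names, which makes the results easier to interpret.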
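The example above covers selection only. For the feature engineering side, here is a minimal sketch of the two moves described earlier, combining features and transforming one. The housing-style column names and values are hypothetical and chosen purely for illustration.

```python
import pandas as pd

# Hypothetical housing-style data; column names are illustrative only
df_feat = pd.DataFrame({
    "total_price": [300000.0, 450000.0, 250000.0],
    "area_sqft": [1500.0, 2000.0, 1000.0],
    "sale_date": pd.to_datetime(["2024-01-15", "2024-06-30", "2024-11-05"]),
})

# Combine two existing features into a ratio
df_feat["price_per_sqft"] = df_feat["total_price"] / df_feat["area_sqft"]

# Transform a timestamp into a numeric part a model can use directly
df_feat["sale_month"] = df_feat["sale_date"].dt.month

print(df_feat[["price_per_sqft", "sale_month"]])
```

Whether a derived feature like this actually helps is an empirical question, so it is worth checking its effect with the cross-validation setup from the previous tip.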

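4. Monitor and Tune Hyperparameters

Hyperparameters are crucial to your model's performance, whether you are a beginner or a seasoned expert. Tuning them properly can yield significant improvements.

Hyperparameter tuning: Start with grid search or random search to find the best hyperparameters for your model; grid search exhaustively evaluates every combination in the specified parameter grid, while random search samples a fixed number of parameter settings from it.

Here is an example implementation of grid search: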

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# X (features) and y (labels) are assumed to be loaded already

# Define parameter grid with ranges
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up model and cross-validation
model = RandomForestClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize search with evaluation metrics
# ('f1' with default settings assumes a binary target)
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=cv,
    scoring=['accuracy', 'f1'],
    refit='f1',
    n_jobs=-1,
    verbose=1
)

# Fit
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
grid_search.fit(X_scaled, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

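Here is a summary of what the code is doing:

Parameter space: Defines the hyperparameter space with reasonable ranges for a thorough search.
Multi-metric evaluation: Tracks both accuracy and F1 score, which matters for imbalanced datasets, and refits the best model according to F1 (refit='f1').
Performance: Enables parallel processing (n_jobs=-1) and progress tracking (verbose=1).
Preprocessing: Includes feature scaling and stratified cross-validation for robust evaluation.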
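The tip also mentions random search, which the example above does not show. The following is a rough sketch using scikit-learn's RandomizedSearchCV over a similar parameter space. The synthetic data, the sampling ranges, and the small n_iter value are assumptions chosen to keep the sketch quick to run, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption; replace with your own X and y)
X_rs, y_rs = make_classification(n_samples=200, n_features=10, random_state=42)

# Distributions (or lists) to sample from, instead of a fixed grid
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [5, 10, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

# n_iter controls how many parameter settings are sampled and evaluated
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="f1",
    random_state=42,
)
random_search.fit(X_rs, y_rs)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.3f}")
```

Because the number of evaluated settings is fixed by n_iter rather than by the size of the grid, the cost stays predictable as more hyperparameters are added.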

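5. Evaluate Model Performance with Appropriate Metrics

Choosing the right metrics is essential for evaluating your model accurately.

Choosing the right metrics: Select metrics that align with your project goals; if you are dealing with imbalanced classes, accuracy might not be the best metric, and you should consider precision, recall, or F1 score instead.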

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

def evaluate_model(y_true, y_pred, model_name="Model"):
    report = classification_report(y_true, y_pred, output_dict=True)
    print(f"\n{model_name} Performance Metrics:")

    # Calculate and display metrics for each class (sorted for stable output)
    for label in sorted(set(y_true)):
        print(f"\nClass {label}:")
        print(f"Precision: {report[str(label)]['precision']:.3f}")
        print(f"Recall: {report[str(label)]['recall']:.3f}")
        print(f"F1-score: {report[str(label)]['f1-score']:.3f}")

    # Plot confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{model_name} Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

# Usage (model, X_test, and y_test are assumed to exist already)
y_pred = model.predict(X_test)
evaluate_model(y_test, y_pred, "Random Forest")

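Here is what the code is doing:

Comprehensive metrics: Shows per-class performance, which is crucial for imbalanced datasets.
Code organization: Wraps the evaluation in a reusable function that accepts a model name.
Results formatting: Rounds the metrics to three decimal places and labels them clearly.
Visual aid: Includes a confusion matrix heatmap for analyzing error patterns.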
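To see why accuracy alone can mislead on imbalanced classes, consider a small constructed case: 95 negative samples, 5 positive samples, and a baseline that always predicts the majority class. The labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Constructed imbalanced labels: 95 negatives and 5 positives
y_true_demo = [0] * 95 + [1] * 5

# A baseline that always predicts the majority class
y_pred_demo = [0] * 100

acc = accuracy_score(y_true_demo, y_pred_demo)
# zero_division=0 avoids a warning when no positives are predicted
prec = precision_score(y_true_demo, y_pred_demo, zero_division=0)
rec = recall_score(y_true_demo, y_pred_demo, zero_division=0)
f1 = f1_score(y_true_demo, y_pred_demo, zero_division=0)

print(f"Accuracy:  {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall:    {rec:.3f}")
print(f"F1-score:  {f1:.3f}")
```

The baseline scores 0.950 on accuracy while its precision, recall, and F1 for the positive class are all 0.000, because it never finds a single positive sample.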

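By following these tips, you can avoid common rookie mistakes and go a long way toward improving the quality and performance of your machine learning projects.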
