7 Pandas Tricks to Improve Your Machine Learning Model Development

By Matthew Mayo on August 21, 2025 in Practical Machine Learning

Image by Author | ChatGPT

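Introduction

If you're reading this, you probably already know that a machine learning model's performance depends not only on the chosen algorithm but also, to a large degree, on the quality and representation of the data it is trained on.

Data preprocessing and feature engineering are among the most important steps in a machine learning workflow. In the Python ecosystem, **Pandas** is the go-to library for these data manipulation tasks, as you probably also know. Mastering a few select Pandas data transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better-performing models.

This tutorial walks you through seven practical Pandas scenarios, and the tricks that go with them, that can enhance your data preparation and feature engineering and set you up for success in your next machine learning project.

Preparing Our Data

To demonstrate these tricks, we'll use the classic Titanic dataset. It is a useful example because it contains a mix of numerical and categorical data, as well as missing values, all of which are challenges you will frequently encounter in real-world machine learning tasks.

We can easily load the dataset into a Pandas DataFrame directly from a URL.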

import pandas as pd
import numpy as np

# Load the Titanic dataset from URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Output shape and first 5 rows
print("Dataset shape:", df.shape)
print(df.head())

Output:

Dataset shape: (891, 12)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

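This gives us a DataFrame with columns such as Survived (our target variable), Pclass (passenger class), Sex, and Age.

Now, let's get out our bag of tricks.

1. Using query() for Cleaner Data Filtering

Filtering data is a never-ending task, whether you are creating subsets for training or exploring specific segments. The standard approach of boolean indexing becomes clumsy and hard to read once several conditions are involved. The query() method offers a more readable and intuitive alternative by letting you filter with a string expression.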

Standard filtering

# Filter for first-class passengers over 30 who survived
filtered_df = df[(df['Pclass'] == 1) & (df['Age'] > 30) & (df['Survived'] == 1)]
print(filtered_df.head())

Filtering with query()

# Same filter, but using the query() method
query_df = df.query('Pclass == 1 and Age > 30 and Survived == 1')
print(query_df.head())

Both versions produce the same output:

    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch    Ticket     Fare Cabin Embarked
1             2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599  71.2833   C85        C
3             4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0    113803  53.1000  C123        S
11           12         1       1                           Bonnell, Miss. Elizabeth  female  58.0      0      0    113783  26.5500  C103        S
52           53         1       1           Harper, Mrs. Henry Sleeper (Myna Haxtun)  female  49.0      1      0  PC 17572  76.7292   D33        C
61           62         1       1                                Icard, Miss. Amelie  female  38.0      0      0    113572  80.0000   B28      NaN

I'm sure you'll agree that the query() version is cleaner and easier to read, especially as the number of conditions grows.
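Thresholds are often held in Python variables rather than hard-coded. The following is a minimal sketch, not part of the original walkthrough, of how query() can reference local variables with the `@` prefix. The small `sample` frame and the names `min_age` and `wanted_classes` are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the Titanic columns used above (values are made up)
sample = pd.DataFrame({
    "Pclass": [1, 1, 3, 2, 1],
    "Age": [38.0, 25.0, 40.0, None, 58.0],
    "Survived": [1, 1, 0, 1, 1],
})

min_age = 30          # hypothetical threshold held in a normal Python variable
wanted_classes = [1]  # hypothetical list of passenger classes to keep

# '@name' pulls a local Python variable into the query expression
result = sample.query(
    "Pclass in @wanted_classes and Age > @min_age and Survived == 1"
)

# The equivalent boolean-indexing version, for comparison
expected = sample[
    sample["Pclass"].isin(wanted_classes)
    & (sample["Age"] > min_age)
    & (sample["Survived"] == 1)
]

print(result)
```

Rows with a missing Age are dropped by either version, because a comparison against NaN evaluates to False.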

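2. Creating Bins for Continuous Variables with cut()

Some models, such as linear models and decision trees, can benefit from discretizing continuous variables, which can help the model capture non-linear relationships. The pd.cut() function bins data into custom ranges. To demonstrate, let's create age groups.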

# Define the bins and labels for age groups
bins = [0, 12, 18, 60, np.inf]
labels = ['Child', 'Teenager', 'Adult', 'Senior']

# Create the new 'AgeGroup' feature
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Display the counts of each age group
print(df['AgeGroup'].value_counts())

Output:

AgeGroup
Adult       575
Child        68
Teenager     45
Senior       26
Name: count, dtype: int64

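This new AgeGroup feature is a powerful categorical variable that your model can now use. Note that the counts sum to 714 rather than 891, because passengers with a missing Age are not assigned to any group.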
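The right=False argument decides which bin a boundary value such as 12 or 18 falls into. As a small sketch on made-up ages (not the Titanic data), here is how the same bins behave with and without it:

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on or near the bin edges, plus one missing value
ages = pd.Series([5, 12, 17, 18, 59, 60, np.nan])

bins = [0, 12, 18, 60, np.inf]
labels = ["Child", "Teenager", "Adult", "Senior"]

# right=False -> intervals are [0, 12), [12, 18), [18, 60), [60, inf)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# default right=True -> intervals are (0, 12], (12, 18], (18, 60], (60, inf]
right_closed = pd.cut(ages, bins=bins, labels=labels)

comparison = pd.DataFrame({
    "age": ages,
    "right=False": left_closed,
    "right=True": right_closed,
})
print(comparison)
```

With right=False a 12-year-old is a Teenager and a 60-year-old is a Senior. With the default they would be a Child and an Adult respectively. Missing ages stay missing in both cases.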

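3. Extracting Features from Text with the .str Accessor

Text columns often contain valuable structured information. The .str accessor in Pandas provides a full set of string methods that operate on an entire Series at once. We can use the .str accessor with a regular expression to extract passengers' titles (e.g. "Mr.", "Miss.", "Dr.") from the Name column.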

# Use a regular expression to extract titles from the Name column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Display the value counts of the new Title feature
print(df['Title'].value_counts())

Output:

Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64

This Title feature often proves to be a strong predictor of survival in Titanic models.
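Many of the extracted titles occur only once or twice, which gives a model very little to learn from. One common follow-up, sketched here as an assumption rather than something the walkthrough itself does, is to fold infrequent titles into a single bucket. The `common` list and the "Rare" label are illustrative choices, and the last name in the sample is invented.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
    "Doe, Jonkheer. John",  # invented name carrying an infrequent title
])

# Same pattern as above: a space, some letters, then a literal dot
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Assumption: treat these four titles as common and everything else as rare
common = ["Mr", "Miss", "Mrs", "Master"]
grouped = titles.where(titles.isin(common), "Rare")

print(pd.DataFrame({"Title": titles, "TitleGrouped": grouped}))
```

Series.where() keeps values where the condition holds and substitutes the second argument elsewhere, so names with no recognizable title would also end up in the "Rare" bucket.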

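4. Performing Advanced Imputation with transform()

Simply dropping rows with missing data is often not an option, because it throws away information. In many cases a better strategy is imputation. Filling with a global mean or median is common, but a more sophisticated approach is to impute based on a related group. For example, we can fill missing Age values with the median age of passengers in the same Pclass. The groupby() and transform() methods make this simple and elegant.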

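# Count missing Age values before imputation
print("Missing Age values before imputation:", df['Age'].isnull().sum())

# Calculate the median age for each passenger class, aligned to every row
median_age_by_pclass = df.groupby('Pclass')['Age'].transform('median')

# Fill missing Age values with the class median (assign the result back rather
# than using inplace=True on a column selection, which newer pandas warns about)
df['Age'] = df['Age'].fillna(median_age_by_pclass)

# Verify that there are no more missing Age values
print("Missing Age values after imputation:", df['Age'].isnull().sum())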

Output:

Missing Age values before imputation: 177
Missing Age values after imputation: 0

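And there we have it: no more missing ages. Because groups such as passenger classes can have quite different age distributions, this group-based imputation is often more accurate than using a single global value.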
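The reason transform() fits here, rather than a plain aggregation, is the shape of its result. A minimal sketch on made-up data shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up data: two classes, one missing age in each
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# A plain aggregation returns one value per group (2 rows here) ...
per_group = sample.groupby("Pclass")["Age"].median()

# ... while transform() broadcasts each group's value back onto every
# original row, so the result lines up with the DataFrame's index
per_row = sample.groupby("Pclass")["Age"].transform("median")

# Because per_row shares the index, fillna() can use it directly
sample["Age_filled"] = sample["Age"].fillna(per_row)

print(per_group)
print(sample)
```

The missing first-class age becomes 45.0 and the missing third-class age becomes 25.0, each taken from its own group.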

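5. Streamlining Workflows with Method Chaining and pipe()

Machine learning preprocessing pipelines usually involve multiple steps. Chaining these operations together can make the code more readable and helps avoid creating unnecessary intermediate DataFrames. The pipe() method takes this further by letting you integrate your own custom functions into the chain.

First, we define one custom function to drop columns and another to encode the Sex column as 0 for male and 1 for female. Then we use pipe() to build a pipeline that incorporates both custom functions into our chain.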

# A custom function to drop columns
def drop_cols(df, cols_to_drop):
    return df.drop(columns=cols_to_drop)

# A custom function to encode 'Sex' (returns a new DataFrame instead of
# modifying the one passed in)
def encode_sex(df):
    return df.assign(Sex=df['Sex'].map({'male': 0, 'female': 1}))
    
# Create a chained pipeline
processed_df = (df.copy()
                  .pipe(drop_cols, cols_to_drop=['Ticket', 'Cabin', 'Name'])
                  .pipe(encode_sex)
                 )

print(processed_df.head())

And our output:

   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare Embarked AgeGroup Title
0            1         0       3    0  22.0      1      0   7.2500        S    Adult    Mr
1            2         1       1    1  38.0      1      0  71.2833        C    Adult   Mrs
2            3         1       3    1  26.0      0      0   7.9250        S    Adult  Miss
3            4         1       1    1  35.0      1      0  53.1000        S    Adult   Mrs
4            5         0       3    0  35.0      0      0   8.0500        S    Adult    Mr

This approach is very effective for building clean, reproducible machine learning pipelines.
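Chains stay predictable when each step returns a new DataFrame instead of modifying its input. The sketch below mixes assign() with pipe() on a made-up frame. The `add_family_size` and `keep_columns` helpers and the FamilySize feature are hypothetical, introduced only to illustrate the pattern.

```python
import pandas as pd

def add_family_size(frame):
    # Hypothetical feature: the passenger plus siblings/spouses and parents/children
    return frame.assign(FamilySize=frame["SibSp"] + frame["Parch"] + 1)

def keep_columns(frame, cols):
    # Hypothetical helper: select a subset of columns
    return frame[cols]

# Made-up rows using Titanic-style column names
raw = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 2],
})

processed = (
    raw
    # assign() with a lambda works on the frame as it exists at this step
    .assign(Sex=lambda d: d["Sex"].map({"male": 0, "female": 1}))
    .pipe(add_family_size)
    # extra keyword arguments are forwarded to the piped function
    .pipe(keep_columns, cols=["Sex", "FamilySize"])
)

print(processed)
print(raw)  # the input frame is left untouched
```

Since none of the steps mutate their input, the chain can be re-run on `raw` any number of times with the same result.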

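6. Mapping Ordinal Categories Efficiently with map()

While one-hot encoding is the standard approach for nominal categorical data, ordinal data (where categories have a natural order) is better handled by mapping to integers. A dictionary and the map() method are a great fit for this. For the sake of the demo, let's assume the port of embarkation (Embarked) has an order.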

# Let's assume Embarked has an order: S > C > Q
embarked_mapping = {'S': 2, 'C': 1, 'Q': 0}
df['Embarked_mapped'] = df['Embarked'].map(embarked_mapping)

print(df[['Embarked', 'Embarked_mapped']].head())

Here is our output:

  Embarked  Embarked_mapped
0        S              2.0
1        C              1.0
2        S              2.0
3        S              2.0
4        S              2.0

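This is a quick and explicit way to encode ordinal relationships for your model to learn. The mapped values appear as floats (2.0 rather than 2) because Embarked has two missing entries, which map() turns into NaN.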
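Two details of map() are worth seeing on a small example: values absent from the dictionary become NaN, and that NaN forces a float dtype. The sketch below uses a made-up Series and the same arbitrary S > C > Q ordering. It also shows an ordered Categorical as one possible alternative.

```python
import pandas as pd

# Made-up port codes, including one missing value
embarked = pd.Series(["S", "C", None, "Q", "S"])

mapping = {"S": 2, "C": 1, "Q": 0}

# Missing (or unmapped) values become NaN, so the result is float64
mapped = embarked.map(mapping)

# A nullable integer dtype keeps whole numbers while preserving the gap
mapped_int = mapped.astype("Int64")

# Alternative: an ordered categorical; codes follow the category order,
# and missing values get the code -1
ordered = pd.Categorical(embarked, categories=["Q", "C", "S"], ordered=True)
codes = pd.Series(ordered.codes)

print(pd.DataFrame({
    "Embarked": embarked,
    "mapped": mapped,
    "mapped_int": mapped_int,
    "codes": codes,
}))
```

Whichever form is used, the missing entries still need to be imputed or otherwise handled before being passed to most models.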

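7. Optimizing Memory with astype()

When working with large datasets, memory usage can become a bottleneck. Pandas defaults to larger data types (such as int64 and float64), but you can often use smaller ones without losing information. Converting object columns to the category dtype is an effective optimization, particularly for columns with few distinct values.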

# Check original memory usage
# (info() prints its report directly, so it is not wrapped in print())
print("Original memory usage:")
df.info(memory_usage='deep')

# Optimize data types
df_optimized = df.copy()
df_optimized['Pclass'] = df_optimized['Pclass'].astype('int8')
df_optimized['Sex'] = df_optimized['Sex'].astype('category')
df_optimized['Age'] = df_optimized['Age'].astype('float32')
df_optimized['Embarked'] = df_optimized['Embarked'].astype('category')

# Check new memory usage
print("\nOptimized memory usage:")
df_optimized.info(memory_usage='deep')

Output:

Original memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   PassengerId      891 non-null    int64   
 1   Survived         891 non-null    int64   
 2   Pclass           891 non-null    int64   
 3   Name             891 non-null    object  
 4   Sex              891 non-null    object  
 5   Age              891 non-null    float64 
 6   SibSp            891 non-null    int64   
 7   Parch            891 non-null    int64   
 8   Ticket           891 non-null    object  
 9   Fare             891 non-null    float64 
 10  Cabin            204 non-null    object  
 11  Embarked         889 non-null    object  
 12  AgeGroup         714 non-null    category
 13  Title            891 non-null    object  
 14  Embarked_mapped  889 non-null    float64 
dtypes: category(1), float64(3), int64(5), object(6)
memory usage: 338.9 KB

Optimized memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   PassengerId      891 non-null    int64   
 1   Survived         891 non-null    int64   
 2   Pclass           891 non-null    int8    
 3   Name             891 non-null    object  
 4   Sex              891 non-null    category
 5   Age              891 non-null    float32 
 6   SibSp            891 non-null    int64   
 7   Parch            891 non-null    int64   
 8   Ticket           891 non-null    object  
 9   Fare             891 non-null    float64 
 10  Cabin            204 non-null    object  
 11  Embarked         889 non-null    category
 12  AgeGroup         714 non-null    category
 13  Title            891 non-null    object  
 14  Embarked_mapped  889 non-null    float64 
dtypes: category(3), float32(1), float64(2), int64(4), int8(1), object(4)
memory usage: 241.3 KB

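Here the footprint drops from 338.9 KB to 241.3 KB. You will often see a significant reduction like this, which becomes important when training models on large datasets without crashing your machine.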
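To compare memory programmatically rather than reading the info() report, memory_usage(deep=True) returns per-column byte counts. The following sketch uses synthetic data (not the Titanic set) and also shows pd.to_numeric with downcast as one way to let Pandas pick the smallest integer type that fits. The column names simply mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Synthetic data, large enough for the difference to be visible
n = 10_000
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),            # int64 by default
    "Sex": rng.choice(["male", "female"], size=n),   # strings
    "Fare": rng.uniform(0, 500, size=n),             # float64
})

optimized = frame.assign(
    # downcast picks the smallest integer type that holds all values (int8 here)
    Pclass=pd.to_numeric(frame["Pclass"], downcast="integer"),
    # two distinct strings are stored once, plus small integer codes per row
    Sex=frame["Sex"].astype("category"),
    # float32 halves the storage but keeps only about 7 significant digits
    Fare=frame["Fare"].astype("float32"),
)

before = frame.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(f"before: {before / 1024:.1f} KB, after: {after / 1024:.1f} KB")
```

The integer and category conversions here are lossless. The float32 conversion is not, so it is worth checking that the reduced precision is acceptable for the feature in question.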

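Wrapping Up

Machine learning always begins with well-prepared data. While algorithmic complexity, hyperparameters, and the model-building process often take the spotlight, efficient data manipulation is where the real leverage lies.

The seven Pandas tricks covered here are more than coding shortcuts; they represent powerful strategies for cleaning data, engineering insightful features, and building robust, reproducible models.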
