5 DIY Python Functions for Data Cleaning

By Matthew Mayo on April 21, 2025 in Machine Learning Resources

Image by Author | Midjourney

Data cleaning: whether you love it or hate it, you will likely spend a lot of time doing it.

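It is what we signed up for. There is no understanding, analyzing, or modeling data without first cleaning it. Making sure we have reusable tools for data cleaning is essential. To that end, here are 5 DIY functions to give you some examples and starting points for building up your own data cleaning toolbox.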

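The functions are well documented, with explicit descriptions of their parameters and return types. Type hints are also used, both to signal how each function is meant to be called and to help you, the reader, follow along.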

Before we get started, let's take care of the imports.

import re
from datetime import datetime
import pandas as pd
import numpy as np
from typing import List, Union, Optional

Alright, on to the functions.

1. Remove Multiple Spaces

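Our first DIY function removes excessive whitespace from text. If we want neither multiple spaces within a string nor excessive leading or trailing spaces, this one-line function takes care of both. We use a regular expression for the internal spaces and strip() for the leading and trailing whitespace.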

def clean_spaces(text: str) -> str:
    """
    Remove multiple spaces from a string and trim leading/trailing spaces.

    :param text: The input string to clean
    :returns: A string with multiple spaces removed and trimmed
    """
    return re.sub(' +', ' ', str(text).strip())

Testing

messy_text = "This   has   too    many    spaces"
clean_text = clean_spaces(messy_text)
print(clean_text)

Output

This has too many spaces
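
As a possible extension, here is a minimal sketch of a variant that also collapses tabs and newlines, which the `' +'` pattern above leaves untouched. The name `clean_whitespace` is hypothetical and not part of the original toolbox; the sketch assumes every whitespace character should be treated like a space.

```python
import re


def clean_whitespace(text: str) -> str:
    """
    Collapse any run of whitespace (spaces, tabs, newlines) into one space
    and trim the ends. Hypothetical variant of clean_spaces, for illustration.

    :param text: The input string to clean
    :returns: A string with whitespace runs collapsed and ends trimmed
    """
    # \s matches spaces, tabs, and newlines, not just the space character
    return re.sub(r'\s+', ' ', str(text)).strip()


print(clean_whitespace("  Tabs\tand\n\nnewlines   too  "))
# Tabs and newlines too
```

Whether collapsing newlines is desirable depends on the data; for multi-line free text the space-only version may be the safer choice.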

2. Standardize Date Formats

Does your dataset hold dates in a variety of internationally accepted formats? This function standardizes them all to the format we specify (YYYY-MM-DD).

def standardize_date(date_string: str) -> Optional[str]:
    """
    Convert various date formats to YYYY-MM-DD.

    :param date_string: The input date string to standardize
    :returns: A standardized date string in YYYY-MM-DD format, or None if parsing fails
    """
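    # Formats are tried in order, so an ambiguous date such as 04/01/2023
    # is read as month/day/year because that format is listed first
    date_formats = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]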
    for fmt in date_formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    # Return None if no format matches
    return None

Testing

dates = ["2023-04-01", "01-04-2023", "04/01/2023", "April 1, 2023"]
standardized_dates = [standardize_date(date) for date in dates]
print(standardized_dates)

Output

['2023-04-01', '2023-04-01', '2023-04-01', '2023-04-01']
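
Because the formats are tried in order, a slash-separated date like 04/01/2023 is always read month first. The sketch below shows one way this could be made configurable. The function name `standardize_date_ordered` and its `day_first` parameter are assumptions introduced for illustration and are not part of the original function.

```python
from datetime import datetime
from typing import Optional


def standardize_date_ordered(date_string: str, day_first: bool = False) -> Optional[str]:
    """
    Convert a date string to YYYY-MM-DD, letting the caller decide how
    ambiguous slash-separated dates are read (hypothetical variant).

    :param date_string: The input date string to standardize
    :param day_first: If True, try day/month/year before month/day/year
    :returns: A date string in YYYY-MM-DD format, or None if parsing fails
    """
    if day_first:
        slash_formats = ["%d/%m/%Y", "%m/%d/%Y"]
    else:
        slash_formats = ["%m/%d/%Y", "%d/%m/%Y"]
    date_formats = ["%Y-%m-%d", "%d-%m-%Y", *slash_formats, "%B %d, %Y"]
    for fmt in date_formats:
        try:
            return datetime.strptime(date_string.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next format
    return None


print(standardize_date_ordered("04/01/2023"))                  # 2023-04-01
print(standardize_date_ordered("04/01/2023", day_first=True))  # 2023-01-04
print(standardize_date_ordered("13/01/2023"))                  # 2023-01-13
print(standardize_date_ordered("not a date"))                  # None
```

The third call shows the fallback at work: 13 is not a valid month, so the month-first format fails and the day-first format picks the date up.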

3. Handle Missing Values

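Let's deal with those pesky missing values. We can specify a strategy for numeric data ('mean', 'median', or 'mode') as well as a strategy for categorical data ('mode' or 'dummy').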

def handle_missing(df: pd.DataFrame, numeric_strategy: str = 'mean', categorical_strategy: str = 'mode') -> pd.DataFrame:
    """
    Fill missing values in a DataFrame.

    :param df: The input DataFrame
    :param numeric_strategy: Strategy for handling missing numeric values ('mean', 'median', or 'mode')
    :param categorical_strategy: Strategy for handling missing categorical values ('mode' or 'dummy')
    :returns: A DataFrame with missing values filled
    """
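    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    for column in df.columns:
        # Treat any numeric dtype (not only int64/float64) as numeric
        if pd.api.types.is_numeric_dtype(df[column]):
            # Assign the result back instead of calling fillna(inplace=True)
            # on a column selection, which newer pandas versions warn about
            if numeric_strategy == 'mean':
                df[column] = df[column].fillna(df[column].mean())
            elif numeric_strategy == 'median':
                df[column] = df[column].fillna(df[column].median())
            elif numeric_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
        else:
            if categorical_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
            elif categorical_strategy == 'dummy':
                df[column] = df[column].fillna('Unknown')
    return df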

Testing

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': ['x', 'y', np.nan, 'z']})
cleaned_df = handle_missing(df)
print(cleaned_df)

Output

          A  B
0  1.000000  x
1  2.000000  y
2  2.333333  x
3  4.000000  z
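
The choice of strategy matters most when a column contains extreme values. The sketch below is a rough illustration using plain pandas calls rather than the function above; the sample data is made up. It compares the mean and the median as fill values and shows the 'dummy' idea of a placeholder label for a categorical column.

```python
import numpy as np
import pandas as pd

# Made-up sample: column A has one large value that pulls the mean upwards
df = pd.DataFrame({"A": [1, 2, np.nan, 40], "B": ["x", "y", np.nan, "z"]})

mean_value = df["A"].mean()      # skips the NaN: (1 + 2 + 40) / 3
median_value = df["A"].median()  # middle of 1, 2, 40

filled = df.copy()
# Median fill for the numeric column
filled["A"] = filled["A"].fillna(median_value)
# Placeholder label for the categorical column (the 'dummy' strategy)
filled["B"] = filled["B"].fillna("Unknown")

print(f"mean={mean_value:.2f}, median={median_value}")
print(filled)
```

Here the mean would fill the gap with roughly 14.33, while the median fills it with 2.0, which sits closer to the bulk of the values.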

4. Remove Outliers

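Outliers giving you trouble? Not any more. This DIY function uses the IQR method to remove outliers from our data. Just pass in the data and specify the columns to check for outliers, and it returns an outlier-free dataset.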

import pandas as pd
import numpy as np
from typing import List

def remove_outliers_iqr(df: pd.DataFrame, columns: List[str], factor: float = 1.5) -> pd.DataFrame:
    """
    Remove outliers from specified columns using the Interquartile Range (IQR) method.

    :param df: The input DataFrame
    :param columns: List of column names to check for outliers
    :param factor: The IQR factor to use (default is 1.5)
    :returns: A DataFrame with outliers removed
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        mask &= (df[col] >= lower_bound) & (df[col] <= upper_bound)
    
    cleaned_df = df[mask]
    
    return cleaned_df

Testing

df = pd.DataFrame({'A': [1, 2, 3, 100, 4, 5], 'B': [10, 20, 30, 40, 50, 1000]})
print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
cleaned_df = remove_outliers_iqr(df, ['A', 'B'])
print(cleaned_df)

Output

Original DataFrame:
     A     B
0    1    10
1    2    20
2    3    30
3  100    40
4    4    50
5    5  1000

Cleaned DataFrame:
   A   B
0  1  10
1  2  20
2  3  30
4  4  50
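
To see why 100 is dropped from column A, it can help to work through the arithmetic by hand. The following is a minimal pure-Python sketch, not the pandas implementation: the helper names `quantile_linear` and `iqr_bounds` are hypothetical, and the quantile helper assumes linear interpolation between neighbouring values, which is the default in pandas' `quantile`.

```python
from typing import List, Tuple


def quantile_linear(values: List[float], q: float) -> float:
    """Quantile using linear interpolation between the two nearest values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q          # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def iqr_bounds(values: List[float], factor: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) bounds outside of which values count as outliers."""
    q1 = quantile_linear(values, 0.25)
    q3 = quantile_linear(values, 0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


column_a = [1, 2, 3, 100, 4, 5]
low, high = iqr_bounds(column_a)
print(low, high)                                   # -1.5 8.5
print([v for v in column_a if low <= v <= high])   # [1, 2, 3, 4, 5]
```

For column A, Q1 is 2.25 and Q3 is 4.75, so the IQR is 2.5 and the accepted range is -1.5 to 8.5. The value 100 falls outside that range, which is why row 3 disappears from the cleaned DataFrame.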

5. Normalize Text Data

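Let's get normal! When you want to convert all text to lowercase, strip extra whitespace, and remove special characters, this DIY function will do the trick.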

def normalize_text(text: str) -> str:
    """
    Normalize text data by converting to lowercase, removing special characters, and extra spaces.

    :param text: The input text to normalize
    :returns: Normalized text
    """
    # Convert to lowercase
    text = str(text).lower()

    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

Testing

messy_text = "This is MESSY!!! Text   with $pecial ch@racters."
clean_text = normalize_text(messy_text)
print(clean_text)

Output

this is messy text with pecial chracters
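
One detail worth knowing: in Python 3 the `\w` class matches underscores and non-ASCII letters as well, so the function above keeps both. If a stricter result is wanted, a sketch along the following lines might work. The name `normalize_text_ascii` is hypothetical, and the assumption that only a-z and 0-9 should survive is ours, not the original article's.

```python
import re


def normalize_text_ascii(text: str) -> str:
    """
    Lowercase the text and keep only ASCII letters, digits, and single spaces.
    Hypothetical stricter variant of normalize_text, for illustration.

    :param text: The input text to normalize
    :returns: Normalized text restricted to a-z, 0-9, and spaces
    """
    text = str(text).lower()
    # Drop everything except a-z, 0-9, and whitespace (underscores included)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()


print(normalize_text_ascii("Snake_Case & Café  2024!"))
# snakecase caf 2024
```

Note that accented letters are simply dropped here rather than transliterated, which may or may not be acceptable for your data.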

Summary

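Well, that's that. We have presented 5 different DIY functions, each of which performs a specific data cleaning task. We test drove them all and checked their results. You should now have some idea of where to go from here, and don't forget to save these functions for later use.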
