7 Pandas Tricks for Time-Series Feature Engineering

By Matthew Mayo on August 11, 2025 in Data Science


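Introduction

Feature engineering is one of the most important steps in building effective machine learning models, and this is especially true when working with time-series data. By creating meaningful features from temporal data, you can unlock predictive power that raw timestamps alone cannot provide.

Fortunately, Pandas offers a powerful and flexible set of operations for manipulating time-series data and creating features from it.

This article explores 7 practical Pandas tricks that can help transform your time-series data, which can lead to better models and more reliable predictions. We will use a simple synthetic dataset to demonstrate each technique, so you can quickly grasp the concepts and apply them to your own projects.

Setting Up Our Data

First, let's create a sample time-series DataFrame. This dataset represents daily sales data over a period of time, and we will use it for all subsequent examples.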

import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Create a date range
date_range = pd.date_range(start='2025-07-01', end='2025-07-30', freq='D')

# Create a sample DataFrame
df = pd.DataFrame(date_range, columns=['date'])
df['sales'] = np.random.randint(50, 100, size=(len(date_range)))
df = df.set_index('date')

print(f"Dataset size: {df.size}")
print(df.head())

Output

Dataset size: 30
            sales
date             
2025-07-01     88
2025-07-02     78
2025-07-03     64
2025-07-04     92
2025-07-05     57

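We have created a small dataset with one entry per day from July 1 to July 30, 2025, each assigned a random sales value. Note that if you use np.random.seed(42), your data should match mine above.

With our data ready, we can now explore several techniques for creating insightful features.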

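1. Extracting Datetime Components

One of the simplest yet most useful time-series feature engineering techniques is breaking a datetime object into its component parts. These components can capture seasonality and trends at different granularities (day of the week, month of the year, and so on). Pandas makes this easy: a DatetimeIndex exposes these components directly as attributes, and a datetime column exposes the same attributes through the .dt accessor.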

df['day_of_week'] = df.index.dayofweek
df['day_of_year'] = df.index.dayofyear
df['month'] = df.index.month
df['quarter'] = df.index.quarter
df['week_of_year'] = df.index.isocalendar().week

print(df.head())

Output

            sales  day_of_week  day_of_year  month  quarter  week_of_year
date                                                                     
2025-07-01     88            1          182      7        3            27
2025-07-02     78            2          183      7        3            27
2025-07-03     64            3          184      7        3            27
2025-07-04     92            4          185      7        3            27
2025-07-05     57            5          186      7        3            27

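We now have, for each entry, the day of the week, day of the year, month, quarter, and week of the year. Note that dayofweek counts from Monday (0) to Sunday (6). These new features can help a model learn patterns tied to weekly cycles (for example, higher sales on weekends) or annual seasonality. This is a good starting point.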
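
If your dates live in a regular column rather than the index, the same components are available through the .dt accessor. The following is a minimal sketch under that assumption; the is_weekend column is a hypothetical extra feature added here for illustration, not part of the walkthrough above.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data, this time keeping the dates as a regular column
np.random.seed(42)
date_range = pd.date_range(start='2025-07-01', end='2025-07-30', freq='D')
df = pd.DataFrame({
    'date': date_range,
    'sales': np.random.randint(50, 100, size=len(date_range)),
})

# A datetime column needs the .dt accessor; a DatetimeIndex does not
df['day_of_week'] = df['date'].dt.dayofweek

# Monday is 0 and Sunday is 6, so values 5 and 6 mark the weekend
# (is_weekend is a hypothetical feature name)
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

print(df.head(7))
```

A binary flag like this is one simple way to hand a model the "weekend" signal directly instead of leaving it to be inferred from the raw day number.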

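2. Creating Lag Features

Lag features are values from previous time steps. They are essential in time-series forecasting because they represent past states of the system, which are often highly predictive of the future. The shift() method is well suited to this.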

# Create a lag feature for sales from the previous day
df['sales_lag_1'] = df['sales'].shift(1)

# Create a lag feature for sales from 3 days ago
df['sales_lag_3'] = df['sales'].shift(3)

# Show only the columns relevant to this section
print(df[['sales', 'sales_lag_1', 'sales_lag_3']].head())

Output

            sales  sales_lag_1  sales_lag_3
date                                       
2025-07-01     88          NaN          NaN
2025-07-02     78         88.0          NaN
2025-07-03     64         78.0          NaN
2025-07-04     92         64.0         88.0
2025-07-05     57         92.0         78.0

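Note that the shift creates NaN values at the start of the series, which is expected because those rows have no earlier observations to draw from. You will need to handle these before modeling, for example by filling or dropping them.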
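
Here is one way that cleanup might look. This is a sketch, not a recommendation: dropping rows is the conservative choice, while filling with the column mean is an assumption made purely for illustration, and the right strategy depends on your model and data.

```python
import numpy as np
import pandas as pd

# Rebuild the article's sample data so this block runs on its own
np.random.seed(42)
date_range = pd.date_range(start='2025-07-01', end='2025-07-30', freq='D')
df = pd.DataFrame({'sales': np.random.randint(50, 100, size=len(date_range))},
                  index=date_range)
df.index.name = 'date'

df['sales_lag_1'] = df['sales'].shift(1)
df['sales_lag_3'] = df['sales'].shift(3)

# How many rows are missing each lag?
print(df[['sales_lag_1', 'sales_lag_3']].isna().sum())

# Option 1: drop rows without a complete lag history
df_dropped = df.dropna(subset=['sales_lag_1', 'sales_lag_3'])

# Option 2 (illustrative assumption): fill each gap with that column's mean
df_filled = df.fillna(df.mean())

print(len(df), len(df_dropped))
```

With a 3-day lag, dropping costs the first three rows, so a longer lag on a short series can discard a meaningful share of the data.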

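3. Calculating Rolling Window Statistics

Rolling window calculations (the moving average is the most familiar example) help smooth out short-term fluctuations and highlight longer-term trends. You can use the rolling() method to compute statistics such as the mean, median, or standard deviation over a fixed-size window.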

# Calculate the 3-day rolling mean of sales
df['rolling_mean_3'] = df['sales'].rolling(window=3).mean()

# Calculate the 3-day rolling standard deviation
df['rolling_std_3'] = df['sales'].rolling(window=3).std()

# Show only the columns relevant to this section
print(df[['sales', 'rolling_mean_3', 'rolling_std_3']].head())

Output

            sales  rolling_mean_3  rolling_std_3
date                                            
2025-07-01     88             NaN            NaN
2025-07-02     78             NaN            NaN
2025-07-03     64       76.666667      12.055428
2025-07-04     92       78.000000      14.000000
2025-07-05     57       71.000000      18.520259

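These new features provide insight into the recent trend and volatility of the series. The first two rows are NaN because a 3-day window needs three observations before it can produce a value.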
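
Two variations are worth knowing. The sketch below reuses the first five sales values from the output above; the min_periods setting and the rolling median are additions for illustration, and the column names are hypothetical.

```python
import pandas as pd

# First five sales values from the output above
sales = pd.Series(
    [88, 78, 64, 92, 57],
    index=pd.date_range('2025-07-01', periods=5, freq='D'),
    name='sales',
)

# min_periods=1 lets the window produce a value before it is full
partial_mean = sales.rolling(window=3, min_periods=1).mean()

# The median is less sensitive than the mean to a single unusual day
rolling_median = sales.rolling(window=3).median()

print(pd.DataFrame({
    'sales': sales,
    'partial_mean_3': partial_mean,
    'rolling_median_3': rolling_median,
}))
```

With min_periods=1, the first two rows hold the mean of one and two observations (88.0 and 83.0) instead of NaN. That avoids losing rows, at the cost of early values being averaged over fewer days than later ones.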

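4. Generating Expanding Window Statistics

In contrast to a rolling window, an expanding window includes all data from the start of the time series up to the current point in time. This is useful for capturing statistics that accumulate over time, such as running totals and overall averages. You can achieve this with the expanding() method.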

# Calculate the expanding sum of sales
df['expanding_sum'] = df['sales'].expanding().sum()

# Calculate the expanding average of sales
df['expanding_avg'] = df['sales'].expanding().mean()

# Show only the columns relevant to this section
print(df[['sales', 'expanding_sum', 'expanding_avg']].head())

Output

            sales  expanding_sum  expanding_avg
date                                           
2025-07-01     88           88.0      88.000000
2025-07-02     78          166.0      83.000000
2025-07-03     64          230.0      76.666667
2025-07-04     92          322.0      80.500000
2025-07-05     57          379.0      75.800000
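
To build intuition for what expanding() computes, it can help to compare it with a plain cumulative sum. The sketch below, using the five sales values shown above, checks that the expanding sum matches cumsum() and that the expanding mean matches the running total divided by the number of observations so far.

```python
import numpy as np
import pandas as pd

# First five sales values from the output above
sales = pd.Series(
    [88, 78, 64, 92, 57],
    index=pd.date_range('2025-07-01', periods=5, freq='D'),
    name='sales',
)

expanding_sum = sales.expanding().sum()
expanding_avg = sales.expanding().mean()

# The same quantities built by hand
running_total = sales.cumsum()
running_avg = running_total / np.arange(1, len(sales) + 1)

print(np.allclose(expanding_sum, running_total))
print(np.allclose(expanding_avg, running_avg))
```

The advantage of expanding() over cumsum() is that the same call pattern extends to other statistics, such as expanding().std() or expanding().max().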

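5. Calculating Time Between Events

Often, the time since the last significant event, or the time between consecutive data points, can be a valuable feature. You can call diff() on the index (converted to a Series) to compute the difference between consecutive timestamps.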

# Our index is daily, so the difference is constant, but this shows the principle
df['time_since_last'] = df.index.to_series().diff().dt.days

# Show only the columns relevant to this section
print(df[['sales', 'time_since_last']].head())

Output

            sales  time_since_last
date                              
2025-07-01     88              NaN
2025-07-02     78              1.0
2025-07-03     64              1.0
2025-07-04     92              1.0
2025-07-05     57              1.0

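While not very useful for our simple, regular series, this becomes powerful for irregular time-series data where the interval between observations varies.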
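
To see the feature carry real information, here is a small sketch on an irregular series. The timestamps and the events name are made up for illustration; only the diff() pattern comes from the example above.

```python
import pandas as pd

# Hypothetical, irregularly spaced observations (dates invented for illustration)
events = pd.DataFrame(
    {'sales': [88, 78, 64, 92]},
    index=pd.to_datetime(['2025-07-01', '2025-07-02', '2025-07-05', '2025-07-12']),
)
events.index.name = 'date'

# Same pattern as above: days elapsed since the previous observation
events['time_since_last'] = events.index.to_series().diff().dt.days

print(events)
```

The gaps now come out as 1, 3, and 7 days. Note that .dt.days keeps only whole days; for data recorded at finer resolution, .dt.total_seconds() preserves the sub-day part of each gap.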

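6. Encoding Cyclical Features with Sine/Cosine

Cyclical features such as the day of the week or the month of the year pose a problem for machine learning models. The end of a cycle (Sunday, day 6) is numerically far from its beginning (Monday, day 0) even though the two days are adjacent in time, which can mislead a model. To handle this better, we can use sine and cosine transformations to map each value onto two dimensions, which preserves the cyclical nature of the relationship.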

# From our earlier section "Extracting Datetime Components"
df['day_of_week'] = df.index.dayofweek
df['month'] = df.index.month

# Day of week has a cycle of 7 days
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['day_of_week_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

# Month has a cycle of 12 months
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

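# Show only the columns relevant to this section
cols = ['sales', 'day_of_week', 'month', 'day_of_week_sin',
        'day_of_week_cos', 'month_sin', 'month_cos']
print(df[cols].head())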

Output

            sales  day_of_week  month  day_of_week_sin  day_of_week_cos  month_sin  month_cos
date                                                                                         
2025-07-01     88            1      7         0.781831         0.623490       -0.5  -0.866025
2025-07-02     78            2      7         0.974928        -0.222521       -0.5  -0.866025
2025-07-03     64            3      7         0.433884        -0.900969       -0.5  -0.866025
2025-07-04     92            4      7        -0.433884        -0.900969       -0.5  -0.866025
2025-07-05     57            5      7        -0.974928        -0.222521       -0.5  -0.866025

This transformation helps a model understand that December (month 12) is just as close to January (month 1) as February (month 2) is.
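
We can check this claim numerically. The sketch below maps months onto the unit circle with the same formula used above and compares Euclidean distances; month_to_xy is a hypothetical helper written for this check.

```python
import numpy as np

def month_to_xy(month):
    """Map a month number (1-12) to its (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

jan, feb, dec = month_to_xy(1), month_to_xy(2), month_to_xy(12)

# Distances in the encoded space
dist_dec_jan = np.linalg.norm(dec - jan)
dist_feb_jan = np.linalg.norm(feb - jan)
print(round(dist_dec_jan, 4), round(dist_feb_jan, 4))

# Distances between the raw month numbers, for comparison
print(abs(12 - 1), abs(2 - 1))
```

In the encoded space both distances are equal (about 0.5176), whereas the raw month numbers put December 11 steps away from January and February only 1.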

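7. Creating Interaction Features

Finally, let's look at how to create interaction features by combining two or more existing features, which can help capture more complex relationships. For example, a model might benefit from knowing whether it is a "weekday morning" or a "weekend morning". Our daily dataset has no time-of-day information, so the example below instead combines each day's sales with its 3-day rolling average.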

# From our earlier section "Calculating Rolling Window Statistics"
df['rolling_mean_3'] = df['sales'].rolling(window=3).mean()

# A feature for the difference between a day's sales and the 3-day rolling average
df['sales_vs_rolling_mean'] = df['sales'] - df['rolling_mean_3']

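# Show only the columns relevant to this section
print(df[['sales', 'rolling_mean_3', 'sales_vs_rolling_mean']].head())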

Output

            sales  rolling_mean_3  sales_vs_rolling_mean
date                                                    
2025-07-01     88             NaN                    NaN
2025-07-02     78             NaN                    NaN
2025-07-03     64       76.666667             -12.666667
2025-07-04     92       78.000000              14.000000
2025-07-05     57       71.000000             -14.000000

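The possibilities for interaction features like this are practically endless. The more domain knowledge and creativity you bring, the more insightful these features can be.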
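
Another common pattern is a product of a calendar flag and a numeric feature. The sketch below is one possible example under assumed names: is_weekend and weekend_x_lag_1 are hypothetical features, chosen only to show how a categorical signal and a lag can be combined.

```python
import numpy as np
import pandas as pd

# Rebuild the article's sample data so this block runs on its own
np.random.seed(42)
date_range = pd.date_range(start='2025-07-01', end='2025-07-30', freq='D')
df = pd.DataFrame({'sales': np.random.randint(50, 100, size=len(date_range))},
                  index=date_range)
df.index.name = 'date'

# Calendar flag: 1 on Saturday and Sunday, 0 otherwise
df['is_weekend'] = (df.index.dayofweek >= 5).astype(int)

# Previous day's sales
df['sales_lag_1'] = df['sales'].shift(1)

# Product interaction: yesterday's sales on weekends, 0 on weekdays
df['weekend_x_lag_1'] = df['is_weekend'] * df['sales_lag_1']

print(df.head(7))
```

A feature like this lets a linear model apply a different weight to yesterday's sales on weekends than on weekdays. The first row stays NaN because it has no previous day.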

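Wrapping Up

Time-series feature engineering is a blend of art and science. Domain expertise is undoubtedly valuable, but so is fluency with tools like Pandas, which provide the foundation for creating features that improve model performance and ultimately help solve the problem at hand.

The seven tricks covered in this article, from extracting datetime components to creating interaction features, are strong building blocks for any time-series analysis or forecasting task. By making use of Pandas and its time-series functionality, you can more effectively uncover the patterns hidden in your time-series data.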
