Exploring Dictionaries, Classifying Variables, and Imputing Data in the Ames Dataset

By Vinod Chugani on November 5, 2024 in Foundations of Data Science

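The real estate market is a complex ecosystem driven by numerous variables such as location, property features, market trends, and economic indicators. The Ames Housing dataset is one dataset that offers a deep look into this complexity. It comes from Ames, Iowa, and covers a variety of properties and their characteristics, from the type of alley access to the overall condition of the property.

In this post, you will take a closer look at this dataset using data science techniques. Specifically, you will focus on how to identify categorical and numerical variables, since understanding these variables is crucial for any data-driven decision-making process.

Let's get started.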

Photo by Brigitte Tohm. Some rights reserved.

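Overview

This post is divided into three parts; they are:

The Importance of a Data Dictionary
Identifying Categorical and Numerical Variables
Missing Data Imputation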

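The Importance of a Data Dictionary

A crucial first step in analyzing the Ames Housing dataset is to use its data dictionary. This version does more than list and define the features; it also classifies them as nominal, ordinal, discrete, or continuous, which guides our analytical approach.

Nominal variables are categories without an order, such as “Neighborhood”. They help identify segments for grouped analysis.
Ordinal variables have a clear order (e.g., “KitchenQual”). They allow ranking and order-based analysis but do not imply equal spacing between categories.
Discrete variables are countable numbers, such as “Bedroom”. They are integral to analyses that sum or compare quantities.
Continuous variables are measured on a continuous scale, such as “Lot Area”. They support a wide range of statistical analyses that depend on fine-grained detail.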
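To make the nominal/ordinal distinction concrete, here is a minimal sketch of how an ordinal variable could be represented in pandas as an ordered categorical. The sample values are invented, and the level order (Po < Fa < TA < Gd < Ex) is typed in by hand on the assumption that it matches the quality scale in the data dictionary; check it against your copy before relying on it.

```python
import pandas as pd

# Toy sample, invented for illustration (not read from Ames.csv)
kitchen = pd.Series(['TA', 'Gd', 'Ex', 'TA', 'Fa'], name='KitchenQual')

# Assumed quality scale, worst to best; verify against the data dictionary
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
ordered_dtype = pd.CategoricalDtype(categories=quality_order, ordered=True)
kitchen_ordered = kitchen.astype(ordered_dtype)

# Ordered categoricals support rank-style operations such as min, max and >=
print(kitchen_ordered.min(), kitchen_ordered.max())
print((kitchen_ordered >= 'Gd').sum())
```

The comparison works only because the dtype is declared as ordered; a plain `object` column would compare the strings alphabetically, which has nothing to do with quality.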

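Understanding these variable types also helps in choosing appropriate visualization techniques. Nominal and ordinal variables are well suited to bar charts, which effectively highlight category differences and rankings. In contrast, discrete and continuous variables are best represented by histograms, scatter plots, and line charts, which illustrate distributions, relationships, and trends in the data.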
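As a rough illustration of why the chart type follows the variable type, the sketch below computes the numbers that a bar chart and a histogram would each draw. The values are invented for the example, and no plotting library is required.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration (not taken from Ames.csv)
homes = pd.DataFrame({
    'Neighborhood': ['NAmes', 'Gilbert', 'NAmes', 'OldTown', 'NAmes', 'Gilbert'],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 10084],
})

# A bar chart of a nominal variable draws one bar per category count
category_counts = homes['Neighborhood'].value_counts()
print(category_counts)

# A histogram of a continuous variable draws one bar per numeric bin
bin_counts, bin_edges = np.histogram(homes['LotArea'], bins=3)
print(bin_counts)
print(bin_edges)
```

Categories have no natural bin edges, so they are counted as they are; continuous values have to be grouped into intervals first, which is the step a histogram adds.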

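Identifying Categorical and Numerical Variables

Building on our understanding of the data dictionary, let's look at how to distinguish categorical from numerical variables in the Ames dataset in practice, using Python's pandas library. This step is crucial for guiding our subsequent data processing and analysis strategies.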

# Load and obtain the data types from the Ames dataset
import pandas as pd
Ames = pd.read_csv('Ames.csv')

print(Ames.dtypes)
print(Ames.dtypes.value_counts())

Running the code above produces the following output, which lists each feature with its data type:

PID                int64
GrLivArea          int64
SalePrice          int64
MSSubClass         int64
MSZoning          object
                  ...   
SaleCondition     object
GeoRefNo         float64
Prop_Addr         object
Latitude         float64
Longitude        float64
Length: 85, dtype: object

object     44
int64      27
float64    14
dtype: int64

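This output shows that the dataset contains `object` (44 variables), `int64` (27 variables), and `float64` (14 variables) data types. Here, `object` typically indicates categorical data stored as text, which may be nominal or ordinal. Meanwhile, `int64` and `float64` indicate numerical data, which is usually discrete (`int64` for countable numbers) or continuous (`float64` for measurable quantities on a continuous scale), although the storage type alone does not settle that question.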
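One reason the storage type is only a hint: pandas stores an integer column as `float64` as soon as it contains a missing value. The sketch below shows this on a hypothetical three-row CSV (the values are invented; only the column names are borrowed from the dataset).

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the second row has no value for GarageCars
csv_text = "FullBath,GarageCars,LotFrontage\n2,2,65.0\n1,,80.5\n2,1,70.0\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.dtypes)
# FullBath is complete, so it is read as int64.
# GarageCars is also a count, but its missing entry forces float64 storage.
```

So a `float64` column may still be a discrete count in the data dictionary's sense; the dtype reflects how the values are stored, not what they measure.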

Now we can use pandas' `select_dtypes()` method to explicitly separate the numerical and categorical features in the Ames dataset.

# Build on the above block of code
# Separating numerical and categorical features
numerical_features = Ames.select_dtypes(include=['int64', 'float64']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns

# Displaying the separated lists
print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)

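`numerical_features` captures the variables stored as `int64` and `float64`, which typically represent countable and measurable quantities, respectively. In contrast, `categorical_features` contains the variables of type `object`, which typically represent nominal or ordinal data without a quantitative value.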

Numerical Features: Index(['PID', 'GrLivArea', 'SalePrice', 'MSSubClass', 'LotFrontage', 'LotArea',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
       '2ndFlrSF', 'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'GeoRefNo', 'Latitude', 'Longitude'],
      dtype='object')
Categorical Features: Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition', 'Prop_Addr'],
      dtype='object')

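Notably, although some variables such as “MSSubClass” are numerically encoded, they are actually categorical data, which underscores the importance of consulting the data dictionary for accurate classification. Similarly, features such as “MoSold” (month sold) and “YrSold” (year sold) are numerical in nature, but they can often be treated as categorical variables, especially when no mathematical operations need to be performed on them. We can use the `astype()` method in pandas to convert these variables into categorical features.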

# Building on the above 2 blocks of code
Ames['MSSubClass'] = Ames['MSSubClass'].astype('object')
Ames['YrSold'] = Ames['YrSold'].astype('object')
Ames['MoSold'] = Ames['MoSold'].astype('object')
print(Ames.dtypes.value_counts())

After this conversion, the count of columns with the `object` data type has increased to 47 (from 44), while `int64` has dropped to 24 (from 27).

object     47
int64      24
float64    14
dtype: int64

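Carefully evaluating the data dictionary, the nature of the dataset, and domain expertise all contribute to reclassifying data types correctly.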
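When several columns need reclassifying, one option is to record the decisions in a single mapping and apply them in one call, since `DataFrame.astype()` also accepts a dictionary of column names to types. The sketch below uses an invented toy frame; `dtype_overrides` is a hypothetical name for a table you would fill in by hand after reading the data dictionary.

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    'MSSubClass': [20, 60, 20],
    'MoSold': [5, 7, 5],
    'SalePrice': [208500, 181500, 223500],
})

# Hypothetical override table, filled in by hand from the data dictionary
dtype_overrides = {'MSSubClass': 'object', 'MoSold': 'object'}

# Columns not named in the mapping keep their current dtype
toy = toy.astype(dtype_overrides)
print(toy.dtypes)
```

Keeping the overrides in one place makes the reclassification easy to review and to reapply if the raw file is reloaded.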

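Missing Data Imputation

Dealing with missing data is a challenge every data scientist faces. Ignoring missing values or handling them inadequately can lead to skewed analysis and incorrect conclusions. The choice of imputation technique often depends on the nature of the data: categorical or numerical. In addition, information in the data dictionary is useful in cases (such as pool quality) where a missing value (“NA”) has a meaning, namely that the property lacks that feature.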
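The meaningful “NA” matters at load time, because `pd.read_csv` treats the string `NA` as a missing value by default. The sketch below shows the difference on a hypothetical two-row CSV; whether your copy of the file actually stores the literal text `NA` is an assumption to check.

```python
import io
import pandas as pd

# Hypothetical two-row CSV in which "NA" is a real category meaning "no pool"
csv_text = "PID,PoolQC\n1,NA\n2,Ex\n"

# Default behaviour: "NA" is on pandas' built-in list of missing-value markers
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False turns that built-in list off
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read['PoolQC'].isnull().sum())
print(literal_read['PoolQC'].tolist())
```

With the default settings the category arrives as a missing value, which is why this post later fills those gaps with the string `"None"` rather than treating them as unknown.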

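Data Imputation for Categorical Features with Missing Values

You can identify the categorical data types and rank them by how heavily they are affected by missing data.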

# Calculating the percentage of missing values for each column
missing_data = Ames.isnull().sum()
missing_percentage = (missing_data / len(Ames)) * 100
data_type = Ames.dtypes

# Combining the counts and percentages into a DataFrame for better visualization
missing_info = pd.DataFrame({'Missing Values': missing_data, 'Percentage': missing_percentage,
                             'Data Type':data_type})

# Sorting the DataFrame by the percentage of missing values in descending order
missing_info = missing_info.sort_values(by='Percentage', ascending=False)

# Display columns with missing values of 'object' data type
print(missing_info[(missing_info['Missing Values'] > 0) & (missing_info['Data Type'] == 'object')])

              Missing Values  Percentage Data Type
PoolQC                  2570   99.651028    object
MiscFeature             2482   96.238852    object
Alley                   2411   93.485847    object
Fence                   2054   79.643273    object
FireplaceQu             1241   48.119426    object
GarageCond               129    5.001939    object
GarageQual               129    5.001939    object
GarageFinish             129    5.001939    object
GarageType               127    4.924389    object
BsmtExposure              71    2.753005    object
BsmtFinType2              70    2.714230    object
BsmtFinType1              69    2.675456    object
BsmtQual                  69    2.675456    object
BsmtCond                  69    2.675456    object
Prop_Addr                 20    0.775494    object
MasVnrType                14    0.542846    object
Electrical                 1    0.038775    object

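The data dictionary indicates that missing values in the categorical features listed above mean the property lacks that feature, except for “Electrical”. With this insight, we can impute the single missing data point for the electrical system with the mode, and impute all other missing values with `"None"` (in quotes, to make it a Python string).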

# Building on the above block of code
# Imputing Missing Categorical Data

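mode_value = Ames['Electrical'].mode()[0]
# Assign the result back: fillna(inplace=True) on a selected column is
# deprecated in pandas 2.x and may leave Ames itself unchanged
Ames['Electrical'] = Ames['Electrical'].fillna(mode_value)

missing_categorical = missing_info[(missing_info['Missing Values'] > 0)
                           & (missing_info['Data Type'] == 'object')]

# 'Electrical' is already complete, so filling it with "None" changes nothing
for item in missing_categorical.index.tolist():
    Ames[item] = Ames[item].fillna("None")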

print(Ames[missing_categorical.index].isnull().sum())

This confirms that the categorical features now have no missing values:

PoolQC          0
MiscFeature     0
Alley           0
Fence           0
FireplaceQu     0
GarageCond      0
GarageQual      0
GarageFinish    0
GarageType      0
BsmtExposure    0
BsmtFinType2    0
BsmtFinType1    0
BsmtQual        0
BsmtCond        0
Prop_Addr       0
MasVnrType      0
Electrical      0
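The two-step imputation above relies on order: “Electrical” is filled with its mode first, so the later `"None"` fill has nothing left to change in that column. A sketch of an order-independent variant passes one fill value per column to `DataFrame.fillna()`. The toy frame and its values are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values; None marks a missing entry
toy = pd.DataFrame({
    'Electrical': ['SBrkr', 'SBrkr', None, 'FuseA'],
    'PoolQC': [None, 'Ex', None, None],
    'Fence': ['MnPrv', None, None, 'GdWo'],
})

# "None" where a gap means the feature is absent,
# the mode where a gap means the value is simply unknown
fill_values = {col: 'None' for col in ['PoolQC', 'Fence']}
fill_values['Electrical'] = toy['Electrical'].mode()[0]

toy = toy.fillna(fill_values)
print(toy)
```

Because every column has its own entry in `fill_values`, the rule applied to each feature is visible in one place and cannot be overwritten by a later step.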

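Data Imputation for Numerical Features with Missing Values

We can apply the same technique demonstrated above to identify the numerical data types and rank them by how heavily they are affected by missing data.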

# Build on the above blocks of code
# Import Numpy
import numpy as np

# Calculating the percentage of missing values for each column
missing_data = Ames.isnull().sum()
missing_percentage = (missing_data / len(Ames)) * 100
data_type = Ames.dtypes

# Combining the counts and percentages into a DataFrame for better visualization
missing_info = pd.DataFrame({'Missing Values': missing_data, 'Percentage': missing_percentage,
                             'Data Type':data_type})

# Sorting the DataFrame by the percentage of missing values in descending order
missing_info = missing_info.sort_values(by='Percentage', ascending=False)

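# Display columns with missing values of numeric data type
# (is_numeric_dtype does not rely on NumPy converting the abstract
# np.number type to a concrete dtype, which is deprecated)
numeric_mask = missing_info['Data Type'].apply(pd.api.types.is_numeric_dtype)
print(missing_info[(missing_info['Missing Values'] > 0) & numeric_mask])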

              Missing Values  Percentage Data Type
LotFrontage              462   17.913920   float64
GarageYrBlt              129    5.001939   float64
Longitude                 97    3.761148   float64
Latitude                  97    3.761148   float64
GeoRefNo                  20    0.775494   float64
MasVnrArea                14    0.542846   float64
BsmtFullBath               2    0.077549   float64
BsmtHalfBath               2    0.077549   float64
BsmtFinSF2                 1    0.038775   float64
GarageArea                 1    0.038775   float64
BsmtFinSF1                 1    0.038775   float64
BsmtUnfSF                  1    0.038775   float64
TotalBsmtSF                1    0.038775   float64
GarageCars                 1    0.038775   float64
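The missing-value summary has now been computed twice with the same steps. One way to avoid the repetition is to wrap it in a small helper; `missing_summary` below is a hypothetical name, not a pandas function, and the toy frame is invented for the example.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing counts, percentages and dtypes per column, most-missing first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'Missing Values': counts,
        'Percentage': counts / len(df) * 100,
        'Data Type': df.dtypes,
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary['Missing Values'] > 0]
    return summary.sort_values(by='Percentage', ascending=False)

# Toy frame with invented values
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, np.nan, 70.0],
    'GarageCars': [2.0, 1.0, np.nan, 2.0],
    'SalePrice': [208500, 181500, 223500, 140000],
})
summary = missing_summary(toy)
print(summary)
```

The result can then be filtered by data type exactly as in the code above, without rebuilding the table each time.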

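The table above shows that there are fewer instances of missing numerical data than of missing categorical data. However, the data dictionary is less helpful for direct imputation here. Whether or not to impute missing data in data science largely depends on the goal of the analysis. Often, a data scientist may generate multiple imputations to account for the uncertainty in the imputation process. Common imputation methods include (but are not limited to) mean, median, and regression imputation. As a baseline, we illustrate mean imputation here, though other techniques may be preferable depending on the task at hand.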

# Build on the above blocks of code
# Initialize a DataFrame to store the concise information
concise_info = pd.DataFrame(columns=['Feature', 'Missing Values After Imputation', 
                                     'Mean Value Used to Impute'])

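# Identify and impute missing numerical values, and store the related concise information
# (is_numeric_dtype avoids comparing dtypes against the abstract np.number type)
numeric_mask = missing_info['Data Type'].apply(pd.api.types.is_numeric_dtype)
missing_numeric_df = missing_info[(missing_info['Missing Values'] > 0) & numeric_mask]

for item in missing_numeric_df.index.tolist():
    mean_value = Ames[item].mean(skipna=True)
    # Assign back instead of using inplace=True on a selected column
    Ames[item] = Ames[item].fillna(mean_value)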

    # Append the concise information to the concise_info DataFrame
    concise_info.loc[len(concise_info)] = pd.Series({
        'Feature': item,
        'Missing Values After Imputation': Ames[item].isnull().sum(),
        # This should be 0 as we are imputing all missing values
        'Mean Value Used to Impute': mean_value
    })

# Display the concise_info DataFrame
print(concise_info)

The output is:

         Feature Missing Values After Imputation  Mean Value Used to Impute
0    LotFrontage                               0               6.851063e+01
1    GarageYrBlt                               0               1.976997e+03
2      Longitude                               0              -9.364254e+01
3       Latitude                               0               4.203456e+01
4       GeoRefNo                               0               7.136762e+08
5     MasVnrArea                               0               9.934698e+01
6   BsmtFullBath                               0               4.353900e-01
7   BsmtHalfBath                               0               6.208770e-02
8     BsmtFinSF2                               0               5.325950e+01
9     GarageArea                               0               4.668646e+02
10    BsmtFinSF1                               0               4.442851e+02
11     BsmtUnfSF                               0               5.391947e+02
12   TotalBsmtSF                               0               1.036739e+03
13    GarageCars                               0               1.747867e+00
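Median imputation, mentioned earlier as an alternative, follows the same pattern and is often preferred when a feature is skewed. The sketch below uses invented values with one extreme entry to show how differently the two fill values behave; it is not a claim about the actual distribution of any Ames column.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed toy values: one very large lot frontage
frontage = pd.Series([60.0, 65.0, 70.0, 75.0, 313.0, np.nan], name='LotFrontage')

mean_value = frontage.mean()      # pulled upward by the extreme value
median_value = frontage.median()  # depends only on the middle of the sorted values

filled_with_median = frontage.fillna(median_value)
print(mean_value, median_value)
print(filled_with_median)
```

Here the mean lands well above four of the five observed values, while the median stays at a typical one, so the choice of fill value can noticeably change what the imputed rows look like.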

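At times, we may also opt to leave the missing values without any imputation in order to retain the authenticity of the original dataset, and remove the observations that lack complete and accurate data if required. Alternatively, you may try to build a machine learning model to **guess** the missing value based on other data in the same row, which is the principle behind regression imputation. As a final step of the baseline imputation above, let us cross-check whether any missing values remain.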

# Build on the above blocks of code
missing_values_count = Ames.isnull().sum().sum()
print(f'The DataFrame has a total of {missing_values_count} missing values.')

You should see:

The DataFrame has a total of 0 missing values.

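Congratulations! We have successfully imputed every missing value in the Ames dataset using baseline operations. It is important to note that many other techniques exist for imputing missing data. As a data scientist, exploring the options and determining the most appropriate method for the given context is crucial for producing reliable and meaningful results.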
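As one example of such an alternative, regression imputation can be sketched with a straight-line fit. The data below is invented, and the choice of “LotArea” as the predictor for “LotFrontage” is an assumption made only for illustration; a real analysis would need to justify the predictors and check the fit.

```python
import numpy as np
import pandas as pd

# Invented toy data: LotFrontage tracks LotArea; two frontages are missing
toy = pd.DataFrame({
    'LotArea': [6000.0, 8000.0, 10000.0, 12000.0, 9000.0, 11000.0],
    'LotFrontage': [50.0, 60.0, 70.0, 80.0, np.nan, np.nan],
})

known = toy['LotFrontage'].notnull()

# Fit LotFrontage = slope * LotArea + intercept on the complete rows only
slope, intercept = np.polyfit(toy.loc[known, 'LotArea'],
                              toy.loc[known, 'LotFrontage'], deg=1)

# Predict only where LotFrontage is missing; observed values stay untouched
toy.loc[~known, 'LotFrontage'] = slope * toy.loc[~known, 'LotArea'] + intercept
print(toy)
```

Unlike mean imputation, each filled value depends on the rest of its row, so two houses with different lot areas receive different estimates.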

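Summary

In this tutorial, we explored the Ames Housing dataset through the lens of data science techniques. We discussed the importance of a data dictionary in understanding the dataset's variables and worked through Python code snippets that help identify and handle these variables effectively.

Understanding the nature of the variables you are working with is crucial for any data-driven decision-making process. As we have seen, the Ames data dictionary provides valuable guidance here. Combined with Python's powerful data manipulation libraries, working with a dataset as complex as the Ames Housing dataset becomes much more manageable.

Specifically, you learned:

The importance of a data dictionary when assessing data types and imputation strategies.
Identification and reclassification methods for numerical and categorical features.
How to impute missing categorical and numerical features using the pandas library.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.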

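About Vinod Chugani

Born in India and raised in Japan, I am a Third Culture Kid with a global perspective. I studied Economics at Duke University and was honored to be inducted into Phi Beta Kappa in my junior year. Over the years I have gained diverse professional experience, spending a decade navigating Wall Street's intricate fixed income sector, followed by leading a global distribution venture on Main Street. Currently, I channel my passion for data science, machine learning, and AI as a mentor at the New York City Data Science Academy. I value the opportunity to ignite curiosity and share knowledge, whether through live learning sessions or in-depth one-on-one interactions. With a foundation in finance and entrepreneurship and my current immersion in the data field, I approach the future with a sense of purpose and assurance. I look forward to further exploration, continuous learning, and the opportunity to contribute meaningfully to the ever-evolving fields of data science and machine learning, especially here at MLM.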
