Building a Recommender System From Scratch with Matrix Factorization in Python

By Iván Palomares Carrascosa on March 29, 2025 in Practical Machine Learning


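Introduction

In this article, we will build a movie recommender system step by step in Python using matrix factorization. Among the many techniques for building recommender systems, which suggest products, services, or content to users based on their preferences and past interactions, matrix factorization stands out as a powerful collaborative filtering technique: it effectively captures hidden patterns in user-item interactions from large databases of users and items.

Specifically, this tutorial introduces a Python library called **surprise**, which includes convenient implementations of matrix factorization algorithms for building recommender systems. We will also use the MovieLens 100K dataset: a popular movie recommendation dataset that is well suited to getting familiar with recommender systems from a practical perspective.

Note: it is recommended that you have some familiarity with the concepts and basics of recommender systems before starting this tutorial.

Step-by-Step Process

The first step is to import the necessary libraries and packages. You may need to install the `surprise` library manually before importing it, for example with `pip install scikit-surprise`.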

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy
import requests
import zipfile
import io
import os
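
If it is unclear whether the library is available in your environment, a quick check like the following sketch can help before running the imports. It assumes the package is published on PyPI as `scikit-surprise` while being imported as `surprise`; the helper name `is_installed` is purely illustrative and not part of any library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("surprise"):
    # Assumption: the PyPI distribution name is scikit-surprise
    print("surprise not found. Try: pip install scikit-surprise")
```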

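We start coding by defining a function that downloads the **MovieLens 100K** dataset from the dataset's official website. The process includes extracting the downloaded `.zip` file.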

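def download_and_extract_movielens():
    # Skip the download if the dataset folder is already present
    if not os.path.exists('ml-100k'):
        print("Downloading MovieLens 100K dataset...")
        url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
        r = requests.get(url, timeout=60)
        r.raise_for_status()  # Fail early on an HTTP error instead of unzipping an error page
        # Unzip the downloaded archive directly from memory
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall()
        print("MovieLens 100K dataset downloaded and extracted successfully.")
    else:
        print("The dataset already exists. Download skipped.")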
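
The key trick in this function is wrapping the downloaded bytes in `io.BytesIO` so that `zipfile` can read them without saving the archive to disk first. The sketch below reproduces that step offline: it builds a tiny archive in memory to stand in for the HTTP response body and extracts it into a temporary folder. The helper `extract_zip_bytes` and the `demo/u.data` file are made up for illustration.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(zip_bytes, target_dir):
    """Extract an in-memory zip archive into target_dir and list its members."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Build a small archive in memory (a stand-in for r.content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/u.data", "1\t10\t5\t881250949\n")

with tempfile.TemporaryDirectory() as tmp_dir:
    names = extract_zip_bytes(buffer.getvalue(), tmp_dir)
    file_exists = os.path.exists(os.path.join(tmp_dir, "demo", "u.data"))

print(names, file_exists)
```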

Next, we load the data by calling the newly defined function, read it into a Pandas DataFrame, and print some basic information about it.

download_and_extract_movielens()

ratings_df = pd.read_csv('ml-100k/u.data', sep='\t', 
                       names=['user_id', 'item_id', 'rating', 'timestamp'])

print(f"Dataset shape: {ratings_df.shape}")
print(f"Number of unique users: {ratings_df['user_id'].nunique()}")
print(f"Number of unique movies: {ratings_df['item_id'].nunique()}")
print(f"Range of ratings: {ratings_df['rating'].min()} to {ratings_df['rating'].max()}")

The printed output describes key aspects of the dataset:

Dataset shape: (100000, 4)
Number of unique users: 943
Number of unique movies: 1682
Range of ratings: 1 to 5

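As we can see, the dataset's size is quite manageable for the illustrative purposes of this tutorial, although real-world applications of matrix factorization typically involve much larger sets of users and items (e.g. movies).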
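
The figures above also tell us how sparse the user-item rating matrix is. As a quick back-of-the-envelope check, the sketch below plugs in the printed counts directly (rather than recomputing them from `ratings_df`) and shows that only roughly 6.3% of all possible user-movie pairs carry a rating.

```python
n_ratings = 100_000  # number of rows reported for the dataset
n_users = 943        # unique users
n_movies = 1682      # unique movies

# Fraction of cells in the full user-item matrix that hold a rating
density = n_ratings / (n_users * n_movies)
sparsity = 1 - density

print(f"Density: {density:.2%}, sparsity: {sparsity:.2%}")
```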

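Now, with the help of two classes imported from the surprise library, `Dataset` and `Reader`, we wrap the dataset in a form that the library's matrix factorization implementations can handle easily. We do this as follows, and we also split the data into training and test sets for model evaluation. Note the importance of specifying the dataset's correct numerical rating range when initializing the `Reader` object.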

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[['user_id', 'item_id', 'rating']], reader)

trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
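
Conceptually, the split shuffles the individual ratings (not the users or the movies) and holds out 20% of them. The plain-Python sketch below illustrates that idea on made-up toy data; the function `split_ratings` is hypothetical, and the library's own implementation may differ in its details.

```python
import random


def split_ratings(ratings, test_size=0.2, seed=42):
    """Shuffle (user, item, rating) tuples and hold out a fraction for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: 5 users x 4 items, ratings between 1 and 5
toy_ratings = [(u, i, 1 + (u + i) % 5) for u in range(1, 6) for i in range(1, 5)]
train, test = split_ratings(toy_ratings)

print(len(train), len(test))
```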

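Now we get to the hands-on part: initializing, training, and evaluating the matrix factorization model. Specifically, we will use Singular Value Decomposition (SVD), a popular matrix factorization method, whose implementation is provided by surprise's `SVD` class. If you are familiar with training machine learning models in scikit-learn, you will find the process very similar.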

model = SVD(n_factors=20, lr_all=0.01, reg_all=0.01, n_epochs=20, random_state=42)
model.fit(trainset)

predictions = model.test(testset)
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")

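In the SVD model instantiation above, `n_factors` is an important hyperparameter: it sets the dimensionality (20 in this example) of the latent feature space in which compact vector representations of users and items are built from the raw data, which comes as a large but sparse user-item rating matrix. For a better understanding of this key process in matrix factorization, be sure to check out this article. The other parameters used are the learning rate (`lr_all`, 0.01), the regularization parameter that helps prevent overfitting (`reg_all`, 0.01), and the number of training epochs (`n_epochs`), set to 20.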
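
To make the role of these hyperparameters more concrete, here is a minimal plain-Python sketch of a biased matrix factorization trained with stochastic gradient descent. It assumes the commonly documented prediction rule for this family of models: global mean plus user bias plus item bias plus the dot product of the user and item latent vectors. It is not the library's actual code, and the names `train_mf`, `predict`, and `rmse`, as well as the toy ratings, are made up for illustration.

```python
import math
import random


def train_mf(ratings, n_factors=2, lr=0.01, reg=0.01, n_epochs=200, seed=42):
    """Fit biases and latent factors on (user, item, rating) tuples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    # Small random latent vectors, one per user and one per item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Move each parameter against the gradient; reg shrinks it toward 0
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q


def predict(model, user, item):
    """Estimate a rating for a (user, item) pair seen during training."""
    mu, bu, bi, p, q = model
    return mu + bu[user] + bi[item] + sum(a * b for a, b in zip(p[user], q[item]))


def rmse(model, ratings):
    squared = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


toy_ratings = [
    ("ann", "A", 5), ("ann", "B", 4), ("ann", "C", 1),
    ("bob", "A", 4), ("bob", "C", 2), ("bob", "D", 1),
    ("cat", "B", 5), ("cat", "C", 1), ("cat", "D", 2),
    ("dan", "A", 5), ("dan", "D", 1),
]
toy_model = train_mf(toy_ratings)

# Baseline: always predict the global mean rating
global_mean = toy_model[0]
baseline_rmse = math.sqrt(
    sum((r - global_mean) ** 2 for _, _, r in toy_ratings) / len(toy_ratings)
)
fitted_rmse = rmse(toy_model, toy_ratings)
print(f"Baseline RMSE: {baseline_rmse:.3f}, fitted RMSE: {fitted_rmse:.3f}")
```

Here `n_factors` sets the length of each latent vector, `lr` scales every update step, `reg` pulls parameters toward zero, and `n_epochs` is the number of full passes over the ratings. Both errors are measured on the training ratings themselves, so they only show that the model fits the toy data, not how well it generalizes.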

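Changing the value of any of these parameters may affect the model's final performance on the test data, measured with prediction error metrics such as RMSE and MAE. In our particular setting, we get: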

Test RMSE: 0.9576
Test MAE: 0.7455
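
For reference, both metrics are simple to compute by hand. The sketch below defines them on a few made-up rating pairs; the function names are illustrative and independent of the library's `accuracy` module.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual_ratings = [4, 3, 5]
predicted_ratings = [3.5, 3.0, 4.0]

print(f"RMSE: {rmse(actual_ratings, predicted_ratings):.4f}")
print(f"MAE: {mae(actual_ratings, predicted_ratings):.4f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions, which matches the pattern in the results above.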

For a more robust evaluation, we can optionally apply cross-validation:

cv_results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

print(f"Average RMSE: {cv_results['test_rmse'].mean():.4f}")
print(f"Average MAE: {cv_results['test_mae'].mean():.4f}")
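
The idea behind 5-fold cross-validation is that every rating is used for testing exactly once, while the model is retrained on the remaining four folds each time. The following sketch shows only the index bookkeeping, in plain Python; `k_fold_indices` is a hypothetical helper, not the library's splitter.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=20, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(train_idx)} train, {len(test_idx)} test")
```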

Trying It Out

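Now, let's put our recommender system to work with some example recommendations. To do this, we first define two more custom functions: one loads the set of movie titles, and the other, given a user ID and the desired number of recommendations N, uses the trained model to obtain a list of recommended movies for that user, based on their preferences as reflected in the original rating data. The latter function is perhaps the most insightful part of the whole code, so we added some inline comments to help follow the process involved.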

def get_movie_names():
    movies_df = pd.read_csv('ml-100k/u.item', sep='|', encoding='latin-1', 
                          header=None, usecols=[0, 1], 
                          names=['item_id', 'title'])
    return movies_df

movies_df = get_movie_names()

def recommend_movies(user_id, n=10):
    # List of all movies
    all_movies = movies_df['item_id'].unique()
    
    # Movies already rated by the user
    rated_movies = ratings_df[ratings_df['user_id'] == user_id]['item_id'].values
    
    # Movies not yet rated by the user
    unrated_movies = np.setdiff1d(all_movies, rated_movies)
    
    # Predicting ratings on unseen movies, by using the trained SVD model
    predictions = []
    for item_id in unrated_movies:
        predicted_rating = model.predict(user_id, item_id).est
        predictions.append((item_id, predicted_rating))
    
    # Rank predictions by estimated rating
    predictions.sort(key=lambda x: x[1], reverse=True)
    
    # Get top N recommendations
    top_recommendations = predictions[:n]
    
    # Fetch movie titles associated with top N recommendations
    recommendations = pd.DataFrame(top_recommendations, columns=['item_id', 'predicted_rating'])
    recommendations = recommendations.merge(movies_df, on='item_id')
    
    return recommendations
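
The ranking step sorts every candidate movie and then keeps the first N. When only a handful of recommendations is needed out of many candidates, one possible alternative is `heapq.nlargest` from the standard library, which avoids a full sort. The sketch below uses made-up `(item_id, predicted_rating)` pairs, and `top_n` is an illustrative helper rather than part of the tutorial's code.

```python
import heapq


def top_n(predictions, n=10):
    """Return the n (item_id, predicted_rating) pairs with the highest rating."""
    return heapq.nlargest(n, predictions, key=lambda pair: pair[1])


toy_predictions = [(101, 3.2), (102, 4.8), (103, 4.1), (104, 2.5), (105, 4.5)]
best_three = top_n(toy_predictions, n=3)
print(best_three)
```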

All that is left is to try these functions out and get some actual recommendations!

user_id = 42
recommendations = recommend_movies(user_id, n=10)

print(f"\nTop 10 recommended movies for user {user_id}:")
print(recommendations[['title', 'predicted_rating']])

Output:

Top 10 recommended movies for user 42:
                        title  predicted_rating
0           Braveheart (1995)          4.946602
1  Singin' in the Rain (1952)          4.835148
2              Henry V (1989)          4.811671
3    Great Escape, The (1963)          4.754385
4                 Babe (1995)          4.702876
5  Wrong Trousers, The (1993)          4.646727
6         My Fair Lady (1964)          4.631982
7        Air Force One (1997)          4.617786
8              Sabrina (1954)          4.541566
9               Patton (1970)          4.530220

Wrapping Up

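That's it! Through these steps, we have built our first matrix factorization-based movie recommender system and seen it in action. To dig deeper into the intricacies and wonders of this kind of recommender model, possible next steps include visualizing interesting data patterns, such as the rating distribution per user or per movie, finding similar movies based on their latent factor representations, or visualizing the latent factors themselves.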
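
As a starting point for the similar-movies idea, one common approach is to compare the latent vectors of two movies with cosine similarity. The sketch below works on made-up three-dimensional vectors and hypothetical movie names. With a real model, the vectors would come from the fitted object: as far as the library's documentation indicates, a fitted `SVD` model stores item factors in its `qi` attribute, indexed by inner item ids.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(target, item_factors, k=2):
    """Rank all other items by cosine similarity to the target item."""
    scores = [
        (other, cosine_similarity(item_factors[target], vector))
        for other, vector in item_factors.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical latent vectors for four movies
item_factors = {
    "Movie A": [0.9, 0.1, 0.0],
    "Movie B": [0.8, 0.2, 0.1],
    "Movie C": [-0.7, 0.3, 0.5],
    "Movie D": [0.1, 0.9, 0.2],
}

for name, score in most_similar("Movie A", item_factors):
    print(f"{name}: {score:.3f}")
```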
