奇异值分解(Singular Value Decomposition)是一种非常流行的线性代数技术,用于将一个矩阵分解为几个更小矩阵的乘积。事实上,这项技术有很多用途。一个例子是,我们可以利用SVD来发现物品之间的关系。基于此,可以轻松地构建一个推荐系统。
在本教程中,我们将了解如何仅使用线性代数技术构建推荐系统。
完成本教程后,您将了解:
- 奇异值分解对矩阵做了什么?
- 如何解释奇异值分解的结果?
- 单个推荐系统需要什么数据,以及我们如何利用SVD对其进行分析?
- 我们如何利用SVD的结果进行推荐?
让我们开始吧。

使用奇异值分解构建推荐系统
图片由Roberto Arias拍摄,部分权利保留。
教程概述
本教程分为3个部分;它们是
- 奇异值分解回顾
- 奇异值分解在推荐系统中的含义
- 实现推荐系统
奇异值分解回顾
就像数字24可以分解为因数24=2×3×4一样,矩阵也可以表示为其他矩阵的乘积。因为矩阵是数字的数组,它们有自己的乘法规则。因此,它们有不同的因式分解方式,也称为**分解**。QR分解或LU分解是常见的例子。另一个例子是**奇异值分解**,它对被分解的矩阵的形状或属性没有限制。
奇异值分解假设一个矩阵 $M$(例如,一个 $m \times n$ 矩阵)可以分解为:
$$
M = U\cdot \Sigma \cdot V^T
$$
其中 $U$ 是一个 $m \times m$ 矩阵,$\Sigma$ 是一个 $m \times n$ 的对角矩阵,而 $V^T$ 是一个 $n \times n$ 矩阵。对角矩阵 $\Sigma$ 很有趣,它可以是非方阵的,但只有对角线上的元素可能非零。矩阵 $U$ 和 $V^T$ 是**标准正交**矩阵。这意味着 $U$ 的列或 $V$ 的行(1)彼此正交,并且(2)是单位向量。向量彼此正交是指任意两个向量的点积为零。向量是单位向量是指其L2范数为1。标准正交矩阵的性质是其转置是其逆。换句话说,由于 $U$ 是一个标准正交矩阵,$U^T = U^{-1}$ 或 $U\cdot U^T=U^T\cdot U=I$,其中 $I$ 是单位矩阵。
奇异值分解的名字来源于 $\Sigma$ 上的对角线元素,它们被称为矩阵 $M$ 的奇异值。实际上,它们是矩阵 $M\cdot M^T$ 的特征值的平方根。就像数字可以分解为素数一样,矩阵的奇异值分解揭示了该矩阵的结构。
但实际上,上面描述的是**满秩SVD**。还有另一种版本称为**降秩SVD**或**紧凑SVD**。我们仍然写成 $M = U\cdot\Sigma\cdot V^T$,但 $\Sigma$ 是一个 $r \times r$ 的方对角矩阵,其中 $r$ 是矩阵 $M$ 的**秩**,通常小于或等于 $m$ 和 $n$ 中的较小者。此时矩阵 $U$ 是一个 $m \times r$ 矩阵,而 $V^T$ 是一个 $r \times n$ 矩阵。因为矩阵 $U$ 和 $V^T$ 不是方阵,所以它们被称为**半标准正交**,意味着 $U^T\cdot U=I$ 和 $V^T\cdot V=I$,在这两种情况下,$I$ 都是一个 $r \times r$ 的单位矩阵。
奇异值分解在推荐系统中的含义
如果矩阵 $M$ 的秩是 $r$,那么我们可以证明矩阵 $M\cdot M^T$ 和 $M^T\cdot M$ 的秩都是 $r$。在奇异值分解(降秩SVD)中,$U$ 矩阵的列是 $M\cdot M^T$ 的特征向量,$V^T$ 矩阵的行是 $M^T\cdot M$ 的特征向量。有趣的是,$M\cdot M^T$ 和 $M^T\cdot M$ 的尺寸可能不同(因为矩阵 $M$ 可以是非方阵),但它们具有相同的特征值集合,这些特征值是 $\Sigma$ 对角线数值的平方。
这就是为什么奇异值分解的结果可以揭示矩阵 $M$ 的很多信息。
想象一下我们收集了一些书籍评论,其中书籍是列,人们是行,条目是某人对某本书的评分。在这种情况下,$M\cdot M^T$ 将是一个“人到人”的表格,条目表示一个人的评分总和与另一个人的匹配程度。类似地,$M^T\cdot M$ 将是一个“书到书”的表格,条目表示获得的评分总和与另一本书获得的评分的匹配程度。人与书之间可能存在的隐藏联系是什么?可能是类型、作者,或者类似性质的东西。
实现推荐系统
让我们看看如何利用SVD的结果来构建一个推荐系统。首先,让我们从以下链接下载数据集(注意:它有600MB大)
这个数据集来自“推荐系统与个性化数据集”的“社交推荐数据”。它包含用户在Librarything上对书籍的评论。我们感兴趣的是用户给一本书的“星级”数量。
如果我们打开这个tar文件,我们会看到一个名为“reviews.json”的大文件。我们可以解压它,或者实时读取包含的文件。reviews.json的前三行如下所示:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
import tarfile # 从以下链接读取下载的文件 # http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz with tarfile.open("lthing_data.tar.gz") as tar: print("tar归档中的文件:") tar.list() with tar.extractfile("lthing_data/reviews.json") as file: count = 0 for line in file: print(line) count += 1 if count > 3: break |
上面的代码会打印出
1 2 3 4 5 6 7 8 |
tar归档中的文件 ?rwxr-xr-x julian/julian 0 2016-09-30 17:58:55 lthing_data/ ?rw-r--r-- julian/julian 4824989 2014-01-02 13:55:12 lthing_data/edges.txt ?rw-rw-r-- julian/julian 1604368260 2016-09-30 17:58:25 lthing_data/reviews.json b"{'work': '3206242', 'flags': [], 'unixtime': 1194393600, 'stars': 5.0, 'nhelpful': 0, 'time': 'Nov 7, 2007', 'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ', 'user': 'van_stef'}\n" b"{'work': '12198649', 'flags': [], 'unixtime': 1333756800, 'stars': 5.0, 'nhelpful': 0, 'time': 'Apr 7, 2012', 'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well until the photocopier starts wanting the lawyer to kill for it. The thing I liked about this story was how the author makes you feel sorry for Jack. His two co-workers are happily married and love their jobs while Jack is married to a paranoid alcoholic and he hates and works at a job he cant stand. You completely understand how he can fall in love with a copier because he is a lonely soul that no one understands except the copier of course.\\nAnother story in Help Wanted is Work Life Balance by Jeff Strand. In this story a man works for a company that starts to let their employees do what they want at work. It starts with letting them come to work a little later than usual, then the employees are allowed to hug and kiss on the job. Things get really out of hand though when the company starts letting employees carry knives and stab each other, as long as it doesnt interfere with their job. This story is meant to be more funny then scary but still has its scary moments. Jeff Strand does a great job mixing humor and horror in this story.\\nAnother good story in Help Wanted: On The Job Terror is The Chapel Of Unrest by Stephen Volk. This is a gothic horror story that takes place in the 1800s and has to deal with an undertaker who has the duty of capturing and embalming a ghoul who has been eating dead bodies in a graveyard. Stephen Volk through his use of imagery in describing the graveyard, the chapel and the clothes of the time, transports you into an 1800s gothic setting that reminded me of Bram Stokers Dracula.\\nOne more story in this anthology that I have to mention is Expulsion by Eric Shapiro which tells the tale of a mad man going into a office to kill his fellow employees. This is a very short but very powerful story that gets you into the mind of a disgruntled employee but manages to end on a positive note. Though there were stories I didnt like in Help Wanted, all in all its a very good anthology. I highly recommend this book ', 'user': 'dwatson2'}\n" b"{'work': '12533765', 'flags': [], 'unixtime': 1352937600, 'nhelpful': 0, 'time': 'Nov 15, 2012', 'comment': 'Magoon, K. (2012). Fire in the streets. New York: Simon and Schuster/Aladdin. 336 pp. ISBN: 978-1-4424-2230-8. (Hardcover); $16.99.\\nKekla Magoon is an author to watch (http://www.spicyreads.org/Author_Videos.html- scroll down). One of my favorite books from 2007 is Magoons The Rock and the River. At the time, I mentioned in reviews that we have very few books that even mention the Black Panther Party, let alone deal with them in a careful, thorough way. Fire in the Streets continues the story Magoon began in her debut book. While her familys financial fortunes drip away, not helped by her mothers drinking and assortment of boyfriends, the Panthers provide a very real respite for Maxie. Sam is still dealing with the death of his brother. Maxies relationship with Sam only serves to confuse and upset them both. Her friends, Emmalee and Patrice, are slowly drifting away. The Panther Party is the only thing that seems to make sense and she basks in its routine and consistency. She longs to become a full member of the Panthers and constantly battles with her Panther brother Raheem over her maturity and ability to do more than office tasks. Maxie wants to have her own gun. When Maxie discovers that there is someone working with the Panthers that is leaking information to the government about Panther activity, Maxie investigates. Someone is attempting to destroy the only place that offers her shelter. Maxie is determined to discover the identity of the traitor, thinking that this will prove her worth to the organization. However, the truth is not simple and it is filled with pain. Unfortunately we still do not have many teen books that deal substantially with the Democratic National Convention in Chicago, the Black Panther Party, and the social problems in Chicago that lead to the civil unrest. Thankfully, Fire in the Streets lives up to the standard Magoon set with The Rock and the River. Readers will feel like they have stepped back in time. Magoons factual tidbits add journalistic realism to the story and only improves the atmosphere. Maxie has spunk. Readers will empathize with her Atlas-task of trying to hold onto her world. Fire in the Streets belongs in all middle school and high school libraries. While readers are able to read this story independently of The Rock and the River, I strongly urge readers to read both and in order. Magoons recognition by the Coretta Scott King committee and the NAACP Image awards are NOT mistakes!', 'user': 'edspicer'}\n" b'{\'work\': \'12981302\', \'flags\': [], \'unixtime\': 1364515200, \'stars\': 4.0, \'nhelpful\': 0, \'time\': \'Mar 29, 2013\', \'comment\': "Well, I definitely liked this book better than the last in the series. There was less fighting and more story. I liked both Toni and Ricky Lee and thought they were pretty good together. The banter between the two was sweet and often times funny. I enjoyed seeing some of the past characters and of course it\'s always nice to be introduced to new ones. I just wonder how many more of these books there will be. At least two hopefully, one each for Rory and Reece. ", \'user\': \'amdrane2\'}\n' |
reviews.json中的每一行都是一个记录。我们将提取每个记录的“user”、“work”和“stars”字段,只要这三个字段中没有缺失数据。尽管有这个名字,但记录并不是格式良好的JSON字符串(最明显的是它使用了单引号而不是双引号)。因此,我们不能使用Python的json
包,而必须使用ast
来解码这样的字符串。
1 2 3 4 5 6 7 8 9 10 11 12 |
... import ast reviews = [] with tarfile.open("lthing_data.tar.gz") as tar: with tar.extractfile("lthing_data/reviews.json") as file: for line in file: record = ast.literal_eval(line.decode("utf8")) if any(x not in record for x in ['user', 'work', 'stars']): continue reviews.append([record['user'], record['work'], record['stars']]) print(len(reviews), "条记录已检索") |
1 |
1387209 条记录已检索 |
现在,我们需要创建一个关于不同用户如何评价每本书的矩阵。我们使用pandas库将收集到的数据转换为表格。
1 2 3 4 |
... import pandas as pd reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"]) print(reviews.head()) |
1 2 3 4 5 6 |
user work stars 0 van_stef 3206242 5.0 1 dwatson2 12198649 5.0 2 amdrane2 12981302 4.0 3 Lila_Gustavus 5231009 3.0 4 skinglist 184318 2.0 |
例如,为了节省时间和内存,我们尝试不使用所有数据。这里我们只考虑评论了超过50本书的用户,以及被超过50个用户评论过的书籍。这样,我们将数据集缩减到原始大小的15%以下。
1 2 3 4 5 |
... # 查找评论超过50本书的用户 usercount = reviews[["work","user"]].groupby("user").count() usercount = usercount[usercount["work"] >= 50] print(usercount.head()) |
1 2 3 4 5 6 7 |
work user 84 -Eva- 602 06nwingert 370 1983mk 63 1dragones 194 |
1 2 3 4 5 |
... # 查找被超过50个用户评论过的书籍 workcount = reviews[["work","user"]].groupby("work").count() workcount = workcount[workcount["user"] >= 50] print(workcount.head()) |
1 2 3 4 5 6 7 |
user work 10000 106 10001 53 1000167 186 10001797 53 10005525 134 |
1 2 3 4 |
... # 保留热门书籍和活跃用户 reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)] print(reviews) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
user work stars 0 van_stef 3206242 5.0 6 justine 3067 4.5 18 stephmo 1594925 4.0 19 Eyejaybee 2849559 5.0 35 LisaMaria_C 452949 4.5 ... ... ... ... 1387161 connie53 1653 4.0 1387177 BruderBane 24623 4.5 1387192 StuartAston 8282225 4.0 1387202 danielx 9759186 4.0 1387206 jclark88 8253945 3.0 [205110 行 x 3 列] |
然后,我们可以使用pandas的“pivot table”函数将其转换为矩阵。
1 2 |
... reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0) |
结果是一个5593行、2898列的矩阵。
在这里,我们在矩阵中表示了5593个用户和2898本书。然后我们应用SVD(这需要一段时间)。
1 2 3 4 |
... from numpy.linalg import svd matrix = reviewmatrix.values u, s, vh = svd(matrix, full_matrices=False) |
默认情况下,svd()
返回一个完整的奇异值分解。我们选择降秩版本,以便可以使用较小的矩阵来节省内存。vh
的列对应于书籍。我们可以基于向量空间模型来查找哪本书与我们正在查看的书最相似。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
... import numpy as np def cosine_similarity(v,u): return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u)) highest_similarity = -np.inf highest_sim_col = -1 for col in range(1,vh.shape[1]): similarity = cosine_similarity(vh[:,0], vh[:,col]) if similarity > highest_similarity: highest_similarity = similarity highest_sim_col = col print("列 %d 与列 0 最相似" % highest_sim_col) |
在上面的例子中,我们尝试找到与第一列最匹配的书籍。结果是:
1 |
列 906 与列 0 最相似 |
在推荐系统中,当用户选择了一本书,我们可以根据上面计算的余弦距离向她展示一些与她选择的书相似的其他书籍。
根据数据集的不同,我们可能会使用截断SVD来降低vh
矩阵的维度。本质上,这意味着在用于计算相似度之前,我们删除了vh
中对应奇异值较小的几行。这可能会使预测更准确,因为书籍中那些不那么重要的特征被排除在考虑之外。
请注意,在分解 $M=U\cdot\Sigma\cdot V^T$ 中,我们知道 $U$ 的行是用户,$V^T$ 的列是书籍,我们无法确定 $U$ 的列或 $V^T$ 的行(同样,$\Sigma$ 的列/行)的确切含义。我们知道它们可能是类型,例如,提供了用户和书籍之间的一些潜在联系,但我们不能确定它们到底是什么。然而,这并不能阻止我们将它们用作推荐系统中的**特征**。
总而言之,完整的代码如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
import tarfile import ast import pandas as pd import numpy as np # 从以下链接读取下载的文件 # http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz with tarfile.open("lthing_data.tar.gz") as tar: print("tar归档中的文件:") tar.list() print("\n示例记录:") with tar.extractfile("lthing_data/reviews.json") as file: count = 0 for line in file: print(line) count += 1 if count > 3: break # 收集记录 reviews = [] with tarfile.open("lthing_data.tar.gz") as tar: with tar.extractfile("lthing_data/reviews.json") as file: for line in file: try: record = ast.literal_eval(line.decode("utf8")) except: print(line.decode("utf8")) raise if any(x not in record for x in ['user', 'work', 'stars']): continue reviews.append([record['user'], record['work'], record['stars']]) print(len(reviews), "条记录已检索") # 打印一些我们收集到的样本 reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"]) print(reviews.head()) # 查找评论超过50本书的用户 usercount = reviews[["work","user"]].groupby("user").count() usercount = usercount[usercount["work"] >= 50] # 查找被超过50个用户评论过的书籍 workcount = reviews[["work","user"]].groupby("work").count() workcount = workcount[workcount["user"] >= 50] # 保留热门书籍和活跃用户 reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)] print("\n数据子集:") print(reviews) # 将记录转换为用户-图书评分矩阵 reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0) matrix = reviewmatrix.values # 奇异值分解 u, s, vh = np.linalg.svd(matrix, full_matrices=False) # 找出最高相似度 def cosine_similarity(v,u): return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u)) highest_similarity = -np.inf highest_sim_col = -1 for col in range(1,vh.shape[1]): similarity = cosine_similarity(vh[:,0], vh[:,col]) if similarity > highest_similarity: highest_similarity = similarity highest_sim_col = col print("列 %d (图书 ID %s) 与列 0 (图书 ID %s) 最相似" % (highest_sim_col, reviewmatrix.columns[col], reviewmatrix.columns[0]) ) |
延伸阅读
如果您想深入了解,本节提供了更多关于该主题的资源。
书籍
- 线性代数导论,第五版,2016。
API
文章
总结
在本教程中,您了解了如何使用奇异值分解构建推荐系统。
具体来说,你学到了:
- 奇异值分解对矩阵的意义
- 如何解释奇异值分解的结果
- 从奇异值分解得到的矩阵 $V^T$ 的列中查找相似度,并基于相似度进行推荐
import tarfile
# 从以下链接读取下载的文件
# http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz
with tarfile.open(“lthing_data.tar.gz”) as tar
print(“tar 存档中的文件:”)
tar.list()
with tar.extractfile(“lthing_data/reviews.json”) as file
count = 0
for line in file
print(line)
count += 1
if count > 3
break
尊敬的 Adrian Tam 博士
这些代码无法成功运行
我能问一下如何修复它吗?
您遇到了什么错误?
文件“”, 第 18 行
if count > 3
^
SyntaxError: invalid syntax
谢谢,那里有一些 HTML 格式问题,> 应该是 >。现在已经修复了。
感谢您的帖子。您能否再读一遍以求清晰?例如,“一个人对另一个人的评分总和匹配的表格”根本不清楚。
您的意思是“一张人与人[向量]的表格,其中的条目[]表示一个人给出的评分与另一个人[]的评分匹配的总和”吗?
清晰度对我来说真的很重要。否则,我无法跟上。
再次感谢。
这篇帖子看起来很棒????。更好的方法是自己制作一个数据集子集并将其托管在 GitHub 上,以便其他人可以轻松跟进。
为什么你不发布如何评估你的系统性能?
很棒的推荐!
嗨 Jason – 我不得不修改你的一些脚本才能让它工作。首先,当我从 ucsd.edu 下载 tar 文件时,json 文件不存在,但存在一个文本文件 – ‘reviews.txt’。
而且至少有一个评论包含“stars”一词,但没有“‘stars’:”所以我修改了语句
if any(x not in record for x in [‘user’, ‘work’, ‘stars’])
推广到
if any(x not in record for x in [“‘user’:”, “‘work’:”, “‘stars’:”])
然后一切正常
谢谢
嗨,Jason,
在存在大量 NAN 的矩阵上使用 SVD(即特定用户未阅读许多书籍),这是一个真正的问题,用 0.0 替换所有这些 NAN 将意味着用户实际上阅读了这些书籍并给了它们 0.0 评分……你能进一步解释一下吗?
非常感谢Jason
嗨 Antoine…以下资源可能与您有关
https://www.naukri.com/learning/articles/handling-missing-values-beginners-tutorial/
你说“vh 的列对应于书籍。”这个结论有点快,为什么,你能详细说明一下吗?
嗨 Qian…这可以从执行代码以建立矩阵中看出。您是否能够执行示例代码?
等等。最后一行是不是错了?
print(“列 %d (图书 ID %s) 与列 0 (图书 ID %s) 最相似” %
(highest_sim_col, reviewmatrix.columns[col], reviewmatrix.columns[0])
)
col 只给出最后一个 for 循环的迭代次数。
它不应该包含
review_matrix.columns[highest_sim_col]
?
嗨 Patrik…是的,您对提供的代码片段中可能存在的问题的说法是正确的。似乎在引用循环或计算过程中找到的最高相似度的正确列索引时可能存在不匹配。
让我们分解一下问题并澄清正确的引用方式
1. **变量定义和目的**
– `highest_sim_col` 应该保存与列 0 最相似的列的索引,这取决于您的相似度计算。
– `col` 似乎是在循环中使用的变量,可能遍历 `reviewmatrix` 中的列。
– `reviewmatrix.columns[col]` 目前用于在打印消息中引用列名或 ID,但根据您的描述,这里的 `col` 只会反映循环最后一次迭代的值,不一定是具有最高相似度的列。
2. **更正引用**
– 如果 `highest_sim_col` 的目的是存储与列 0 最相似的列的索引,那么在打印时您应该使用 `reviewmatrix.columns[highest_sim_col]` 来获取该列的正确图书 ID。
3. **更新的代码片段**
– 以下是更正后的代码行应该是什么样子
python
print("列 %d (图书 ID %s) 与列 0 (图书 ID %s) 最相似" %
(highest_sim_col, reviewmatrix.columns[highest_sim_col], reviewmatrix.columns[0]))
– 此更改可确保输出正确反映与列 0 最相似的列(及关联的图书 ID),该列是根据 `highest_sim_col` 中存储的值确定的。
通过进行此更正,输出语句将准确报告哪一列与列 0 最相似,并使用 `highest_sim_col` 中存储的索引从您的评论矩阵中获取相应的图书 ID。这对于确保输出的准确性至关重要,尤其是在进行相似度比较的数据分析任务中,例如推荐系统或聚类分析。
为了正确估计相似度,不应该是 cosine_similarity(vh[0,:], vh[col,:]) 吗?
嗨 Makis…请继续您的想法,并告诉我们您的发现!