library(tidyverse)
library(splitstackshape)
library(recommenderlab)
This submission is a continuation of P1 and P2 and uses the MovieLens movie ratings data linked from grouplens.org. This third project of the summer course explores the matrix factorization method of Singular Value Decomposition (SVD) and how it can be applied to reduce the amount of stored data needed to make rating predictions.
For reference, a preview of the MovieLens data:
ratings <- read.csv("ratings.csv", header = T)
links <- read.csv("links.csv", header = T, stringsAsFactors = F)
movies <- read.csv("movies.csv", header = T, stringsAsFactors = F) %>%
left_join(links, by = "movieId") %>% select(-tmdbId)
head(ratings)
## userId movieId rating timestamp
## 1 1 2 3.5 1112486027
## 2 1 29 3.5 1112484676
## 3 1 32 3.5 1112484819
## 4 1 47 3.5 1112484727
## 5 1 50 3.5 1112484580
## 6 1 112 3.5 1094785740
Below I create the vectors for filtering. I include only users with more than 1,000 reviews and movies with more than 2,500 reviews.
users <- ratings %>% select(userId, rating) %>% group_by(userId) %>%
summarise(Qty_Rated = n()) %>%
arrange(desc(Qty_Rated)) %>%
filter(Qty_Rated > 1000)
items <- ratings %>% select(movieId, rating) %>%
filter(rating > 0) %>%
group_by(movieId) %>%
summarise(Qty_Rated = n()) %>%
arrange(desc(Qty_Rated)) %>%
filter(Qty_Rated > 2500)
#nrow(users);nrow(items)
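To make the commented-out check above explicit, the counts of surviving users and movies can be printed directly (a quick sketch; the exact numbers depend on the MovieLens snapshot used):
# users and movies that survive the filters above
nrow(users)
nrow(items)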
Even after the filtering, the resulting matrix will be large: roughly 1800 users x 1700 movies.
Now I convert the ratings table into a sparse matrix of ratings. Unlike the previous assignment, I’m using a larger proportion of the total data (1,397,024 more ratings, to be precise).
set.seed(4)
# "mm" is my matrix for creating the realRatingsMatrix
mm <- ratings %>%
filter(rating > 0,
userId %in% users$userId,
movieId %in% items$movieId) %>%
select(-timestamp) %>%
mutate(userId = as.factor(userId),
movieId = as.factor(movieId))
mM <- evaluationScheme(as(mm, "realRatingMatrix"),
method = "split",
train = 0.7,
given = 5,
goodRating = 5)
mM
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.700
## Good ratings: >=5.000000
## Data set: 1884 x 1765 rating matrix of class 'realRatingMatrix' with 1538188 ratings.
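The evaluation scheme provides three partitions that are used throughout the rest of this write-up. A quick sketch of the accessors (output omitted):
# partitions created by evaluationScheme():
# "train"   - ratings used to fit a model
# "known"   - the 5 given ratings per test user
# "unknown" - held-out ratings used to score predictions
dim(getData(mM, "train"))
dim(getData(mM, "known"))
dim(getData(mM, "unknown"))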
Below, I preview the similarity between users using cosine distance and find that, among a small sample of 25, most users differ noticeably in their ratings.
#slotNames(mM)
s_users <- similarity(mM@data[1:25,],
method = "cosine",
which = "users")
image(as.matrix(s_users), main = "User Similarity")
Before proceeding, I’ll review the data using the image function in recommenderlab - just a corner showing 25 users and 25 movies, as the original is too large to display properly.
image(getData(mM, "train")[c(1:25),c(1:25)],
main = "50x50 corner of the rating matrix")
One of the options for the Recommender object from the recommenderlab library is SVD. The resulting model contains the components of the SVD form:
\[A = U \Sigma V^{T} \]
One of the results of the above equation is that one can store the ratings matrix as the product of \(U\), \(\Sigma\), and \(V^T\).
The diagonal of \(\Sigma\) holds the singular values sorted from largest to smallest; they act as a sort of “genre weight” that projects a user’s ratings onto the movie factors to create a prediction.
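As a minimal illustration of that identity on a toy matrix (not part of the MovieLens data):
# toy example: factor a small matrix and rebuild it from U, Sigma, and V^T
A <- matrix(c(5, 4, 0,
              4, 5, 0,
              0, 0, 3,
              0, 1, 4), nrow = 4, byrow = TRUE)
s <- svd(A)
A_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(A, A_rebuilt) # TRUE up to floating-point error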
Below, I create a ratings matrix from the original dataset that mirrors the training set held inside the recommenderlab evaluation scheme. The reason I’m doing this is so that I can compare the results of base R’s svd() with the results of the recommenderlab method SVD.
# obtain the same matrix as in the training set
# of the realRatingMatrix object:
rl_train <- getData(mM, "train")
train_users <- as.numeric(rl_train@data@Dimnames[[1]])
train_movie <- as.numeric(rl_train@data@Dimnames[[2]])
tr_ratings <- ratings %>%
filter(userId %in% train_users,
movieId %in% train_movie) %>%
select(-timestamp) %>%
spread(movieId, rating)
tr_ratings$userId <- rownames(tr_ratings)
tr_ratings <- tr_ratings %>%
select(-userId) %>%
as.matrix()
tr_ratings[is.na(tr_ratings)] <- 0 # impute NA's to zero for SVD (mean imputation is an alternative: mean(tr_ratings, na.rm = T))
dim(rl_train);dim(tr_ratings) # confirms same size
## [1] 1318 1765
## [1] 1318 1765
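As an aside, the same dense matrix could probably be obtained more directly by coercing the realRatingMatrix itself (an untested sketch; unrated cells come back as NA):
# alternative: coerce the training realRatingMatrix straight to a base matrix
tr_alt <- as(rl_train, "matrix")
tr_alt[is.na(tr_alt)] <- 0
dim(tr_alt)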
Now, I calculate the SVD decomposition using base R’s svd() function and then build a recommender using the method “SVD”.
t_svd_base <- proc.time()
svd_base <- svd(tr_ratings)
t_svd_base <- proc.time() - t_svd_base
The values in svd_base$d represent the diagonal entries of \(\Sigma\). A general rule of thumb is that if 90% of the prediction “energy” is retained in the reduced \(\Sigma\), the recommendation power will largely be preserved.
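To make that rule concrete, the energy retained by the first k components can be read off the cumulative sum of the squared singular values. A quick sketch (it keeps the smallest k that reaches at least 90%, so it can differ by one from the trimming loop used further below):
# cumulative share of squared singular values ("energy") retained by the first k components
energy_share <- cumsum(svd_base$d^2) / sum(svd_base$d^2)
k <- which(energy_share >= 0.9)[1]
k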
Next, I’ll use recommenderlab’s SVD method as the SVD decomposition to “aspire to.”
# build model with "train", predict with "known",
# test quality of prediction with "unknown"
t_svd <- proc.time()
m_svd <- Recommender(rl_train, method = "SVD")
#p_svd <- predict(m_svd, getData(mM, "known"), type = "ratings")
t_svd <- proc.time() - t_svd
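The commented-out prediction step could be completed and scored against the held-out ratings as follows (a sketch; not run here):
# predict ratings for the "known" users and score them against the "unknown" ratings
p_svd <- predict(m_svd, getData(mM, "known"), type = "ratings")
calcPredictionAccuracy(p_svd, getData(mM, "unknown"))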
The function below trims the values of \(\Sigma\) so that roughly 90% of the “energy” (the sum of squared singular values) of the original is retained, which by the rule of thumb above should provide very similar predicting power.
# Trim trailing (smallest) singular values until the retained squared
# "energy" falls to roughly 90% of the total.
reduce_dims <- function(an_svd){
  energy_total = sum((an_svd)**2)
  e_ratio = 1
  indx = 0
  # drop one more trailing singular value each pass while more than 90%
  # of the energy is still retained (and guard against an empty vector)
  while(e_ratio > .9 & indx < length(an_svd)){
    c = an_svd[1:(length(an_svd)-indx)]
    e_ratio = sum((c)**2)/energy_total
    indx = indx+1
  }
  return(c)
}
revised_d <- reduce_dims(svd_base$d)
length(revised_d)
## [1] 360
sum((revised_d)**2)/sum((svd_base$d)**2)
## [1] 0.899941
Above, after imputing the missing ratings to zero (probably a bad idea), I reduce the length of d from 1318 down to 360.
What’s more, because the \(\Sigma\) matrix is reduced, fewer dimensions need to be processed to produce an accurate recommender.
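A sketch of what that reduction looks like in base R: keeping only the first k = 360 singular values and the corresponding columns of \(U\) and \(V\) still rebuilds a full-size (approximate) ratings matrix from far fewer stored numbers.
# rank-k approximation using only the retained singular values
k <- length(revised_d)
A_approx <- svd_base$u[, 1:k] %*% diag(svd_base$d[1:k]) %*% t(svd_base$v[, 1:k])
dim(A_approx) # still 1318 x 1765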
Below, I compare the dimensions of the constituent parts of \(A = U \Sigma V^{T}\) produced by the recommenderlab model and by base svd():
# recommender model version is very streamlined
dim(m_svd@model$svd$u) # 1318 10
## [1] 1318 10
length(m_svd@model$svd$d)# 10
## [1] 10
dim(m_svd@model$svd$v) # 1765 10
## [1] 1765 10
t_svd
## user system elapsed
## 0.51 0.12 0.67
# the base svd version is MUCH larger - this represents
# the original decomposition.
dim(svd_base$u) # 1318 1318
## [1] 1318 1318
length(svd_base$d) # 1318
## [1] 1318
dim(svd_base$v) # 1765 1318
## [1] 1765 1318
t_svd_base
## user system elapsed
## 14.29 0.00 14.65
As shown above, the recommenderlab SVD ran in a shorter period of time and produced constituent components that are much smaller in dimension.
Finally, I’ll rebuild each original rating matrix (one from recommenderlab and the other from base R’s svd() function):
# A = U Sigma V^T
dim(round(svd_base$u %*% diag(svd_base$d) %*% t(svd_base$v))) # X = U D V'
## [1] 1318 1765
dim(round(m_svd@model$svd$u %*% diag(m_svd@model$svd$d) %*% t(m_svd@model$svd$v)))
## [1] 1318 1765
The recommenderlab SVD method provides a stark reduction in data dimensions and reveals the power of singular value decomposition.
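One way to quantify that reduction is simply to count the values stored by each decomposition (a rough sketch using the objects above):
# rough count of stored values: full base-R decomposition vs. recommenderlab's k = 10 factors
full_vals <- length(svd_base$u) + length(svd_base$d) + length(svd_base$v)
rl_vals <- length(m_svd@model$svd$u) + length(m_svd@model$svd$d) + length(m_svd@model$svd$v)
c(full_svd = full_vals, recommenderlab_svd = rl_vals)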