The goal of this assignment is to implement a matrix factorization method (SVD or ALS) in the context of a recommender system.

Data input

The dataset used was the MovieLens dataset, previously used in another assignment for content-based and collaborative filtering. It was retrieved from the https://grouplens.org/datasets/movielens/ website, using the small version with 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. In the past assignment, the challenge of creating a matrix of a size that R could handle was solved by filtering users and movies based on traffic. This assignment takes a different approach, sampling rather than filtering, in order to ensure more accurate results (ones that are not skewed by the popularity of certain items).

Loading libraries/dataset

library(dplyr)          # data manipulation
library(tidyr)          # data tidying
library(ggplot2)        # plotting
library(recommenderlab) # recommender models and evaluation
library(reshape2)       # acast() for reshaping dataframes into matrices
library(knitr)          # kable() tables

The MovieLens matrix was created by combining two datasets (movies and ratings). The movies dataset included the movieId, title and genres, while the ratings dataset included the userId, movieId, rating and a timestamp.

movies = read.csv("https://raw.githubusercontent.com/Galanopoulog/DATA643/master/Project%202/movies.csv", 
                   header = TRUE, sep = ",", stringsAsFactors = FALSE)
ratings = read.csv("https://raw.githubusercontent.com/Galanopoulog/DATA643/master/Project%202/ratings.csv", 
                    header = TRUE, sep =",", stringsAsFactors = FALSE)
kable(head(ratings))
userId  movieId  rating   timestamp
     1       31     2.5  1260759144
     1     1029     3.0  1260759179
     1     1061     3.0  1260759182
     1     1129     2.0  1260759185
     1     1172     4.0  1260759205
     1     1263     2.0  1260759151
kable(head(movies))
movieId  title                               genres
      1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
      2  Jumanji (1995)                      Adventure|Children|Fantasy
      3  Grumpier Old Men (1995)             Comedy|Romance
      4  Waiting to Exhale (1995)            Comedy|Drama|Romance
      5  Father of the Bride Part II (1995)  Comedy
      6  Heat (1995)                         Action|Crime|Thriller

Combining the ratings and movies dataframes produced the new dataframe from which the matrix was formed. This dataframe contained only the title, the userId and the rating.

movRate = merge(movies, ratings, by = "movieId")
new = subset(movRate, select = c("title", "userId", "rating"))
new = unique(new)
kable(head(new))
title             userId  rating
Toy Story (1995)      23     3.0
Toy Story (1995)     623     4.5
Toy Story (1995)     559     4.0
Toy Story (1995)     306     3.0
Toy Story (1995)     361     3.0
Toy Story (1995)     357     5.0
# Cast into a user-by-movie matrix; empty user/movie pairs become 0
matrix = acast(new, userId ~ title, value.var = "rating", fun.aggregate = sum)

One thing to note: because the acast() function (which turns the dataframe into a matrix) aggregates with sum, all missing values were automatically turned into zeroes. The final matrix was particularly large, with 671 users and 9,064 movies. To avoid overloading R while also avoiding the filtering method, a random sample of both users and movies was taken, creating a downsized matrix of 200 users and 3,000 movies.

premat = as(matrix, "realRatingMatrix")
# Randomly sample 200 of the 671 users and 3000 of the 9064 movies
data = premat[sample(671, 200), sample(9064, 3000)]
data
## 200 x 3000 rating matrix of class 'realRatingMatrix' with 600000 ratings.
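
One side effect of the zero filling is visible above: the 200 x 3000 sample reports 600,000 ratings, i.e. every single cell, because the zeroes count as stored ratings. A minimal sketch of an alternative (not applied in this assignment) that converts the fills back to missing values so that recommenderlab would only store the ratings users actually gave; the matrix_na and premat_na names are purely illustrative.

# Alternative (not used here): convert the zero fills back to NA; MovieLens
# scores start at 0.5, so a stored 0 can only mean "missing"
matrix_na = matrix
matrix_na[matrix_na == 0] = NA
premat_na = as(matrix_na, "realRatingMatrix")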

Data Exploration

Creating plots of the rating distributions can help in understanding a lot of things, among them what constitutes a good rating. Looking at the overall ratings, the majority of them were 4 stars, with most of the votes falling in the 3-5 star range.

# Ratings
qplot(new$rating, geom = "histogram", main = "Histogram of Ratings",
      xlab = "Rating Scores", binwidth = 0.5, fill = I("cornflowerblue"))

A slightly different story emerged when looking at the average score per movie. The most common average was 3.5 stars, with far less weight around the 4.5-5 star range than previously shown. The majority of the movies were rated between 3 and 4 stars.

# Average rating per movie
new2 = new %>% group_by(title) %>%
  summarise(avg = mean(rating))
qplot(new2$avg, geom = "histogram", main = "Histogram of Movie Ratings",
      xlab = "Average Rating Scores Per Movie", binwidth = 0.5, fill = I("firebrick"))

When looking at the average rating of each user, very few users persistently gave low scores (2.5 stars and below) and very few gave high scores (5 stars). The majority lay in the 3.5-4 star range, with more users averaging 3 stars than 4.5 stars.

# Average rating per user
new3 = new %>% group_by(userId) %>%
  summarise(avg = mean(rating))
qplot(new3$avg, geom = "histogram", main = "Histogram of User Ratings",
      xlab = "Average Rating Scores Per User", binwidth = 0.5, fill = I("plum"))

Because the downsizing approach taken this time did not filter for the movies/users with the most activity, it was worth comparing the results to the past assignment's. The plots showed very little difference across the ratings, ratings-per-movie and ratings-per-user histograms. With this sampling method, though, the results seemed more well-rounded, including more of the extreme (1 and 5 star) ratings.

After looking at the plots, the data was split into training and test sets (sizes of 80% and 20% respectively).

# 80/20 split by user; for each test user, 10 ratings are "known" to the
# recommender and the rest are held out as "unknown";
# ratings of 3.5 and above count as good
evaluation = evaluationScheme(data, method="split", train=0.8, given=10, goodRating=3.5)

# Evaluation datasets
ev_train = getData(evaluation, "train")
ev_known = getData(evaluation, "known")
ev_unknown = getData(evaluation, "unknown")

SVD

The Singular Value Decomposition (SVD) of a matrix \(A\) is the factorization of \(A\) into the product of three matrices so that \(A=UDV^T\). Here, the matrices \(U\) and \(V\) are orthonormal and \(D\) is a diagonal matrix with non-negative real values (the singular values). To give an idea of the mentality behind this approach, a small sparse matrix was created.

example = as.matrix(data.frame(c(1,3,4,0), c(1,2,4,0), c(0,0,0,5)))
example
##      c.1..3..4..0. c.1..2..4..0. c.0..0..0..5.
## [1,]             1             1             0
## [2,]             3             2             0
## [3,]             4             4             0
## [4,]             0             0             5

Performing a Singular Value Decomposition produces the three matrices \(U\), \(D\) and \(V\). The \(D\) matrix shows that the third singular value carries much less weight than the first two, so it can be set to zero, effectively removing the corresponding columns from \(U\) and \(V\). This in turn reduces the original matrix's size without significantly affecting the results.

svd(example)
## $d
## [1] 6.8290168 5.0000000 0.6037627
## 
## $u
##            [,1] [,2]       [,3]
## [1,] -0.2067855    0  0.1267410
## [2,] -0.5225665    0 -0.8525985
## [3,] -0.8271421    0  0.5069639
## [4,]  0.0000000   -1  0.0000000
## 
## $v
##            [,1] [,2]       [,3]
## [1,] -0.7443316    0 -0.6678102
## [2,] -0.6678102    0  0.7443316
## [3,]  0.0000000   -1  0.0000000

With this approach in mind, the SVD method in the recommenderlab package was used on the training set to produce predicted ratings for users and movies. A sample of the predictions is below.

# Train an SVD recommender and predict ratings for the test users' known data
svd_train = Recommender(ev_train, "SVD")
svd_preds = predict(svd_train, ev_known, type = "ratings")
getRatingMatrix(svd_preds[c(1,9,17,25,33), 1:5])
## 5 x 5 sparse Matrix of class "dgCMatrix"
##     Osmosis Jones (2001) American Dream (1990) Titicut Follies (1967)
## 229             0.000000              0.000000               0.000000
## 5               0.000000              0.000000               0.000000
## 128             0.000000              0.000000               0.000000
## 75              0.000000              0.000000               0.000000
## 243             1.197233              1.197233               1.197233
##     City of God (Cidade de Deus) (2002) King Solomon's Mines (1985)
## 229                            0.000000                     0.00000
## 5                              0.000000                     0.00000
## 128                            0.000000                     0.00000
## 75                             0.000000                     0.00000
## 243                            1.227104                     1.19226

Comparison

In the previous assignment, which used the same dataset, the results showed that the highest-performing techniques were User-to-User Collaborative Filtering with Pearson correlation (items recommended based on similar users) and the Popular method (the most popular items recommended). Under the assumption, reinforced by the plots above, that the change in downsizing method did not skew the results, the UBCF and Popular methods were compared against the SVD results.

The table below shows that the SVD method performed far better than the UBCF approach but was slightly less accurate than the Popular method on RMSE and MSE; interestingly, the SVD achieved the lowest MAE of the three.

# User-User
ubcf_train = Recommender(ev_train, "UBCF")
ubcf_preds = predict(ubcf_train, ev_known, type = "ratings")

# Popular
pop_train = Recommender(ev_train, "POPULAR")
pop_preds = predict(pop_train, ev_known, type = "ratings")


# Accuracy of each model's predictions against the held-out ratings
accuracy = rbind(
  SVD = calcPredictionAccuracy(svd_preds, ev_unknown),
  UBCF = calcPredictionAccuracy(ubcf_preds, ev_unknown),
  POPULAR = calcPredictionAccuracy(pop_preds, ev_unknown)
)

kable(as.data.frame(accuracy))
             RMSE        MSE        MAE
SVD     0.5099021  0.2600001  0.1351681
UBCF    0.9603690  0.9223086  0.7512329
POPULAR 0.4934812  0.2435237  0.1959183
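
For reference, all three measures are simple functions of the errors between predicted and held-out ratings. The helper below is a hypothetical illustration of what calcPredictionAccuracy() reports, not a function from recommenderlab.

# Hypothetical helper: the three accuracy measures from the errors e = pred - actual
rating_errors = function(pred, actual) {
  e = pred - actual
  c(RMSE = sqrt(mean(e^2, na.rm = TRUE)),
    MSE  = mean(e^2, na.rm = TRUE),
    MAE  = mean(abs(e), na.rm = TRUE))
}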

The ROC and Precision/Recall plots below show the performance of each of the models.

# 4-fold cross-validation scheme for the top-N evaluation
eval_sets = evaluationScheme(data = data, method = "cross-validation", k = 4, given = 10, goodRating = 3.5)

mult_models = list(
  UBCF = list(name = "UBCF", param = list(method = "pearson")),
  Popular = list(name = "POPULAR", param = NULL),
  SVD = list(name = "SVD", param = NULL)
)

# Testing models
models = evaluate(eval_sets, mult_models, n= c(1, 5, seq(10, 100, 10)))
## UBCF run fold/sample [model time/prediction time]
##   1  [0.06sec/0.31sec] 
##   2  [0.02sec/0.29sec] 
##   3  [0.03sec/0.24sec] 
##   4  [0.05sec/0.28sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.06sec/0.15sec] 
##   2  [0.02sec/0.14sec] 
##   3  [0.04sec/0.13sec] 
##   4  [0.03sec/0.13sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.42sec/0.35sec] 
##   2  [0.08sec/0.36sec] 
##   3  [0.08sec/0.31sec] 
##   4  [0.04sec/0.37sec]
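
The averaged confusion-matrix counts behind the curves can also be inspected directly. Assuming the model names from mult_models carry through to the result list, something like:

# Average TP/FP/FN/TN counts (plus precision and recall) across the
# 4 folds for the SVD model, one row per value of n
avg(models[["SVD"]])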
# Plotting models
plot(models, annotate = TRUE, legend = "topleft")

plot(models, "prec/rec", annotate = FALSE, main = "Precision/Recall", legend = "topright")

Interestingly, and unlike what the accuracy table above suggested, the SVD model did not perform as well as expected relative to the UBCF and Popular models. Once again, the Popular model performed the best; in terms of Precision/Recall (and looking at the ROC curves), the SVD model fell significantly below the Popular model.

Conclusion

The best approach for this downsized MovieLens dataset was, once again, recommending Popular movies. That said, the Singular Value Decomposition approach did not perform poorly. Comparing RMSE, MSE and MAE values, the SVD method came in just behind the Popular one on RMSE and MSE (and ahead of it on MAE), with the UBCF lagging further behind. Though both the ROC and Precision/Recall plots showed the SVD underperforming (even relative to the UBCF), the SVD method would be my second choice after the Popular approach based on the results of the accuracy table.