The goal of this assignment is give you practice working with Matrix Factorization techniques. Your task is implement a matrix factorization method—such as singular value decomposition (SVD) or Alternating Least Squares (ALS)—in the context of a recommender system. You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found. SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
options(warn=-1)
if(!"recommenderlab" %in% rownames(installed.packages())){
install.packages("recommenderlab")}
suppressMessages(library("ggplot2"))
suppressMessages(library("recommenderlab"))
suppressMessages(library("dplyr"))
suppressMessages(library(kableExtra))
I am going to use MovieLense Dataset available as part of the recommenderlab package. First, let’s take a look at our data - it is a sparse ratings matrix.
data_package <- data(package = "recommenderlab")
data("MovieLense")
y<-as.matrix(MovieLense@data[1:10,1:10])
y %>% kable(caption = "Data") %>% kable_styling("striped", full_width = TRUE)
Toy Story (1995) | GoldenEye (1995) | Four Rooms (1995) | Get Shorty (1995) | Copycat (1995) | Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) | Twelve Monkeys (1995) | Babe (1995) | Dead Man Walking (1995) | Richard III (1995) |
---|---|---|---|---|---|---|---|---|---|
5 | 3 | 4 | 3 | 3 | 5 | 4 | 1 | 5 | 3 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 4 | 0 |
0 | 0 | 0 | 5 | 0 | 0 | 5 | 5 | 5 | 4 |
0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 5 | 4 | 0 | 0 | 0 |
4 | 0 | 0 | 4 | 0 | 0 | 4 | 0 | 4 | 0 |
I will take a look at the number of ratings per user
ratings_movies <- MovieLense #[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
rating_cnt_per_member <- rowCounts(ratings_movies)
qplot(rating_cnt_per_member) + stat_bin(binwidth = 0.1) +
ggtitle("Number of ratings per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, I am going to select the data from high count of ratings per user and movie only. I am going to drop any user who rated less than 50 movies and any movie with less than 100 ratings.
ratings_movies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
ratings_movies
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
average_ratings_per_user <- rowMeans(ratings_movies)
qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average rating per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It looks like the average rating is between 3.5 and 4.
Let’s split our data into train and test set using 80% to 20% split.
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies),
replace = TRUE, prob = c(0.8, 0.2))
data_train <- ratings_movies[which_train, ]
data_test <- ratings_movies[!which_train, ]
I am going to use recommenderlab package and k=20 which appears to be sufficient after a few experimantal values.
set.seed(444)
recom <- Recommender(data_train, method = "SVD")
recom
## Recommender of type 'SVD' for 'realRatingMatrix'
## learned using 457 users.
pred <- predict(object = recom, newdata = data_test, type = "ratings")
es <- evaluationScheme(pred, method="cross-validation",
k=20, given = 5)
ev <- evaluate(es, "SVD", type="ratings")
## SVD run fold/sample [model time/prediction time]
## 1 [0.008sec/0.003sec]
## 2 [0.01sec/0.004sec]
## 3 [0.02sec/0.004sec]
## 4 [0.008sec/0.003sec]
## 5 [0.009sec/0.004sec]
## 6 [0.019sec/0.004sec]
## 7 [0.009sec/0.006sec]
## 8 [0.013sec/0.006sec]
## 9 [0.014sec/0.006sec]
## 10 [0.011sec/0.009sec]
## 11 [0.073sec/0.007sec]
## 12 [0.009sec/0.007sec]
## 13 [0.012sec/0.004sec]
## 14 [0.009sec/0.007sec]
## 15 [0.008sec/0.004sec]
## 16 [0.008sec/0.007sec]
## 17 [0.01sec/0.005sec]
## 18 [0.021sec/0.005sec]
## 19 [0.009sec/0.004sec]
## 20 [0.012sec/0.004sec]
avg(ev)
## RMSE MSE MAE
## res 0.1668493 0.02824454 0.1092995
Here is a sample 10 by 10 prediction matrix
set.seed(444)
pred_ratings <- predict(recom, data_test, type = "ratingMatrix")
x<-as.matrix(getRatingMatrix(pred_ratings)[1:10,1:10])
x %>% kable(caption = "Predictions") %>% kable_styling("striped", full_width = TRUE)
Toy Story (1995) | GoldenEye (1995) | Get Shorty (1995) | Twelve Monkeys (1995) | Babe (1995) | Dead Man Walking (1995) | Seven (Se7en) (1995) | Usual Suspects, The (1995) | Mighty Aphrodite (1995) | Postino, Il (1994) | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 5.000000 | 3.000000 | 3.000000 | 4.000000 | 1.000000 | 5.000000 | 2.000000 | 5.000000 | 5.000000 | 5.000000 |
3 | 2.857563 | 2.867756 | 2.929589 | 2.825019 | 2.922900 | 2.927678 | 2.770381 | 2.858785 | 2.950713 | 2.923956 |
10 | 4.000000 | 4.158593 | 4.000000 | 4.000000 | 4.372776 | 4.000000 | 4.000000 | 5.000000 | 3.000000 | 4.243699 |
13 | 3.000000 | 3.000000 | 5.000000 | 2.000000 | 4.000000 | 3.000000 | 1.000000 | 5.000000 | 5.000000 | 4.000000 |
14 | 3.940837 | 4.073937 | 4.245194 | 5.000000 | 4.253971 | 4.000000 | 3.867751 | 5.000000 | 4.000000 | 3.000000 |
16 | 5.000000 | 4.549872 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 4.494116 | 4.381629 |
48 | 4.014233 | 3.912830 | 3.863638 | 3.933558 | 4.046132 | 3.791640 | 3.843449 | 4.022143 | 3.852081 | 3.870097 |
52 | 4.230366 | 4.187372 | 4.290835 | 5.000000 | 4.300532 | 4.398292 | 4.171558 | 4.302074 | 5.000000 | 4.357064 |
81 | 4.000000 | 3.331114 | 3.478745 | 4.000000 | 3.454990 | 3.587976 | 3.490547 | 3.546811 | 3.579865 | 3.539091 |
85 | 3.935785 | 3.463522 | 3.691018 | 3.311085 | 4.000000 | 4.000000 | 3.341172 | 3.695577 | 3.000000 | 4.000000 |
Here are the top 10 recomendations for User 1
set.seed(444)
pred2 <- predict(object = recom, newdata = data_test, type = "topNList")
recc_user_1 <- pred2@items[[14]]
movies_user_1 <- pred2@itemLabels[recc_user_1]
movies_user_1 %>% kable(caption = "User1 Predictions") %>% kable_styling("striped", full_width = TRUE)
x |
---|
True Lies (1994) |
Braveheart (1995) |
Top Gun (1986) |
Sneakers (1992) |
Ghost and the Darkness, The (1996) |
Sleepless in Seattle (1993) |
Stargate (1994) |
Indiana Jones and the Last Crusade (1989) |
Blues Brothers, The (1980) |
Crimson Tide (1995) |
In conclusion, SVD appears to be an effecting technique with a pretty good RMSE result and the list of film recommendations appears to be consistent with overall film ratings - a lot of the film recommendations are top rated, critically acclaimed films which is impressive considering a range amount of movies in the MovieLense Dataset.