Elina Azrilyan - Data 612

Project Description

The goal of this assignment is give you practice working with Matrix Factorization techniques. Your task is implement a matrix factorization method—such as singular value decomposition (SVD) or Alternating Least Squares (ALS)—in the context of a recommender system. You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found. SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).

Installing packages:

options(warn=-1)
if(!"recommenderlab" %in% rownames(installed.packages())){
install.packages("recommenderlab")}
suppressMessages(library("ggplot2"))
suppressMessages(library("recommenderlab"))
suppressMessages(library("dplyr"))
suppressMessages(library(kableExtra))

Loading dataset

I am going to use MovieLense Dataset available as part of the recommenderlab package. First, let’s take a look at our data - it is a sparse ratings matrix.

data_package <- data(package = "recommenderlab")
data("MovieLense")

y<-as.matrix(MovieLense@data[1:10,1:10])
y  %>% kable(caption = "Data") %>% kable_styling("striped", full_width = TRUE)

Data
Toy Story (1995)	GoldenEye (1995)	Four Rooms (1995)	Get Shorty (1995)	Copycat (1995)	Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	Twelve Monkeys (1995)	Babe (1995)	Dead Man Walking (1995)	Richard III (1995)
5	3	4	3	3	5	4	1	5	3
4	0	0	0	0	0	0	0	0	2
0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0
4	3	0	0	0	0	0	0	0	0
4	0	0	0	0	0	2	4	4	0
0	0	0	5	0	0	5	5	5	4
0	0	0	0	0	0	3	0	0	0
0	0	0	0	0	5	4	0	0	0
4	0	0	4	0	0	4	0	4	0

Inspecting the data

I will take a look at the number of ratings per user

ratings_movies <- MovieLense #[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100] 
rating_cnt_per_member <- rowCounts(ratings_movies)
qplot(rating_cnt_per_member) + stat_bin(binwidth = 0.1) +
ggtitle("Number of ratings per user")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, I am going to select the data from high count of ratings per user and movie only. I am going to drop any user who rated less than 50 movies and any movie with less than 100 ratings.

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100] 

ratings_movies

## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.

average_ratings_per_user <- rowMeans(ratings_movies)

qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average rating per user")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It looks like the average rating is between 3.5 and 4.

Splitting the data into train/test sets

Let’s split our data into train and test set using 80% to 20% split.

which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies),
replace = TRUE, prob = c(0.8, 0.2))

data_train <- ratings_movies[which_train, ]
data_test <- ratings_movies[!which_train, ]

Using SVD for recommendations

I am going to use recommenderlab package and k=20 which appears to be sufficient after a few experimantal values.

set.seed(444)
recom <- Recommender(data_train, method = "SVD")
recom

## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 457 users.

pred <- predict(object = recom, newdata = data_test, type = "ratings")

es <- evaluationScheme(pred, method="cross-validation",
        k=20, given = 5)
        
ev <- evaluate(es, "SVD", type="ratings")

## SVD run fold/sample [model time/prediction time]
##   1  [0.008sec/0.003sec] 
##   2  [0.01sec/0.004sec] 
##   3  [0.02sec/0.004sec] 
##   4  [0.008sec/0.003sec] 
##   5  [0.009sec/0.004sec] 
##   6  [0.019sec/0.004sec] 
##   7  [0.009sec/0.006sec] 
##   8  [0.013sec/0.006sec] 
##   9  [0.014sec/0.006sec] 
##   10  [0.011sec/0.009sec] 
##   11  [0.073sec/0.007sec] 
##   12  [0.009sec/0.007sec] 
##   13  [0.012sec/0.004sec] 
##   14  [0.009sec/0.007sec] 
##   15  [0.008sec/0.004sec] 
##   16  [0.008sec/0.007sec] 
##   17  [0.01sec/0.005sec] 
##   18  [0.021sec/0.005sec] 
##   19  [0.009sec/0.004sec] 
##   20  [0.012sec/0.004sec]

avg(ev)

##          RMSE        MSE       MAE
## res 0.1668493 0.02824454 0.1092995

Here is a sample 10 by 10 prediction matrix

set.seed(444)
pred_ratings <- predict(recom, data_test, type = "ratingMatrix")

x<-as.matrix(getRatingMatrix(pred_ratings)[1:10,1:10])
x %>% kable(caption = "Predictions") %>% kable_styling("striped", full_width = TRUE)

Predictions
	Toy Story (1995)	GoldenEye (1995)	Get Shorty (1995)	Twelve Monkeys (1995)	Babe (1995)	Dead Man Walking (1995)	Seven (Se7en) (1995)	Usual Suspects, The (1995)	Mighty Aphrodite (1995)	Postino, Il (1994)
1	5.000000	3.000000	3.000000	4.000000	1.000000	5.000000	2.000000	5.000000	5.000000	5.000000
3	2.857563	2.867756	2.929589	2.825019	2.922900	2.927678	2.770381	2.858785	2.950713	2.923956
10	4.000000	4.158593	4.000000	4.000000	4.372776	4.000000	4.000000	5.000000	3.000000	4.243699
13	3.000000	3.000000	5.000000	2.000000	4.000000	3.000000	1.000000	5.000000	5.000000	4.000000
14	3.940837	4.073937	4.245194	5.000000	4.253971	4.000000	3.867751	5.000000	4.000000	3.000000
16	5.000000	4.549872	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	4.494116	4.381629
48	4.014233	3.912830	3.863638	3.933558	4.046132	3.791640	3.843449	4.022143	3.852081	3.870097
52	4.230366	4.187372	4.290835	5.000000	4.300532	4.398292	4.171558	4.302074	5.000000	4.357064
81	4.000000	3.331114	3.478745	4.000000	3.454990	3.587976	3.490547	3.546811	3.579865	3.539091
85	3.935785	3.463522	3.691018	3.311085	4.000000	4.000000	3.341172	3.695577	3.000000	4.000000

Here are the top 10 recomendations for User 1

set.seed(444)
pred2 <- predict(object = recom, newdata = data_test, type = "topNList")

recc_user_1 <- pred2@items[[14]]
movies_user_1 <- pred2@itemLabels[recc_user_1]
movies_user_1 %>% kable(caption = "User1 Predictions") %>% kable_styling("striped", full_width = TRUE)

User1 Predictions
x
True Lies (1994)
Braveheart (1995)
Top Gun (1986)
Sneakers (1992)
Ghost and the Darkness, The (1996)
Sleepless in Seattle (1993)
Stargate (1994)
Indiana Jones and the Last Crusade (1989)
Blues Brothers, The (1980)
Crimson Tide (1995)

Summary

In conclusion, SVD appears to be an effecting technique with a pretty good RMSE result and the list of film recommendations appears to be consistent with overall film ratings - a lot of the film recommendations are top rated, critically acclaimed films which is impressive considering a range amount of movies in the MovieLense Dataset.

Toy Story (1995)	GoldenEye (1995)	Four Rooms (1995)	Get Shorty (1995)	Copycat (1995)	Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	Twelve Monkeys (1995)	Babe (1995)	Dead Man Walking (1995)	Richard III (1995)
5	3	4	3	3	5	4	1	5	3
4	0	0	0	0	0	0	0	0	2
0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0
4	3	0	0	0	0	0	0	0	0
4	0	0	0	0	0	2	4	4	0
0	0	0	5	0	0	5	5	5	4
0	0	0	0	0	0	3	0	0	0
0	0	0	0	0	5	4	0	0	0
4	0	0	4	0	0	4	0	4	0

Toy Story (1995)	GoldenEye (1995)	Four Rooms (1995)	Get Shorty (1995)	Copycat (1995)	Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	Twelve Monkeys (1995)	Babe (1995)	Dead Man Walking (1995)	Richard III (1995)
5	3	4	3	3	5	4	1	5	3
4	0	0	0	0	0	0	0	0	2
0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0
4	3	0	0	0	0	0	0	0	0
4	0	0	0	0	0	2	4	4	0
0	0	0	5	0	0	5	5	5	4
0	0	0	0	0	0	3	0	0	0
0	0	0	0	0	5	4	0	0	0
4	0	0	4	0	0	4	0	4	0

Elina Azrilyan - Data 612 - Project 3