MovieLense Recommendation Overview

Background

Figure

Research Question:

How can I recommend the top 10 movies to particular users, based on their own rating history, on the experience of similar users, or on some other mechanism?

Research Approach:

Several R packages are available for building recommendation systems; the most commonly used is recommenderlab, which I use throughout. In this project I apply three recommendation methods to the MovieLense data set, item-based collaborative filtering (IBCF), user-based collaborative filtering (UBCF), and singular value decomposition (SVD), and compare their performance.

The main task of a recommender system is to predict the user's response to different options. GroupLens Research has collected and made available rating data sets from the MovieLens web site. The version bundled with recommenderlab holds about 100,000 ratings (943 users and 1,664 movies); it is split into a training set and an 'unseen' test set for evaluation.

The three models are evaluated on the following measures: root-mean-squared error (RMSE) and its related measures (MSE and MAE), model run time, the receiver operating characteristic (ROC) curve, which summarizes the trade-off between the true positive rate and the false positive rate derived from the confusion matrix (true positives, true negatives, false positives, and false negatives), and the balance of precision versus recall.

Summary of What we have found

The singular value decomposition method is a winner!

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(hexbin)

Collaborative Filtering (IBCF, UBCF)

UserBased (UBCF) vs ItemBased (IBCF)

This gif illustrates the most commonly used recommendation system model: collaborative filtering. Collaborative filtering answers the question "What items do users with interests similar to yours like?"

Figure
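As a rough illustration of the user-based idea, the sketch below computes cosine similarity between users on a tiny made-up rating matrix and predicts a missing rating as a similarity-weighted average. The matrix, the helper `cosine_sim`, and the choice to compare only co-rated movies are assumptions for illustration; recommenderlab's internal handling of missing values and neighbourhood size may differ.

```r
# Toy user-by-movie rating matrix; NA means "not rated" (values are made up).
ratings <- matrix(
  c(5, 4, NA, 1,
    4, 5,  2, NA,
    1, NA, 5, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3", "m4"))
)

# Hypothetical helper: cosine similarity over the movies both users have rated.
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

sim_u1_u2 <- cosine_sim(ratings["u1", ], ratings["u2", ])
sim_u1_u3 <- cosine_sim(ratings["u1", ], ratings["u3", ])

# Predict u1's rating of m3 as a similarity-weighted average of the
# ratings that the two neighbours gave to m3.
weights <- c(u2 = sim_u1_u2, u3 = sim_u1_u3)
pred_u1_m3 <- sum(weights * ratings[c("u2", "u3"), "m3"]) / sum(weights)
pred_u1_m3
```

Because u1 agrees much more with u2 than with u3, the prediction is pulled toward u2's low rating of m3.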

Singular Value Decomposition (SVD, Matrix Factorization)

Figure

SVD w Parameters

Mathematical Models

IBCF

IBCF Explained

UBCF

User-Based Collaborative Filtering (Cosine)

Matrix Factorization

library(hexbin)
library (knitr)

Singular Value Decomposition (SVD)

Singular value decomposition begins by breaking an \(M\) by \(N\) matrix \(A\) (in this case \(M\) users and \(N\) movies) into the product of three matrices: \(U\), which is \(M\) by \(M\), \(\Sigma\), which is \(M\) by \(N\), and \(V^T\), which is \(N\) by \(N\):

\[A = U \ \Sigma \ V^T\]
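The factorization can be checked numerically with base R's `svd()`. This is a minimal sketch on a small made-up matrix, not the recommenderlab implementation. Note that `svd()` returns the thin form by default (here `u` is 4 by 3 rather than 4 by 4), which reconstructs \(A\) equally well. Keeping only the largest singular values gives the low-rank approximation that matrix-factorization recommenders build on.

```r
set.seed(1)
# 4 "users" by 3 "movies" with made-up ratings from 1 to 5.
A <- matrix(sample(1:5, 12, replace = TRUE), nrow = 4)

s <- svd(A)  # s$u is 4x3, s$d holds 3 singular values, s$v is 3x3

# Reconstruct A from the three factors: U %*% Sigma %*% t(V).
A_full <- s$u %*% diag(s$d) %*% t(s$v)

# Rank-2 approximation: keep only the two largest singular values.
k <- 2
A_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

max(abs(A - A_full))      # numerically zero
sqrt(sum((A - A_k)^2))    # equals the dropped singular value s$d[3]
```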

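RMSE, Root Mean Square Error

The root mean square error (RMSE) is the error function used to measure accuracy: it quantifies the typical error we make when predicting a movie rating. RMSE is defined below, where \(\hat{y}_{u,i}\) is the predicted rating of user \(u\) for movie \(i\), \(y_{u,i}\) is the observed rating, and \(N\) is the number of user-movie pairs in the sum: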

\[ RMSE = \sqrt{\frac{1}{N}\displaystyle\sum_{u,i} (\hat{y}_{u,i}-y_{u,i})^{2}} \]
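The formula translates directly into a few lines of base R. The function name `rmse` and the example vectors are made up for illustration; pairs where either value is missing are skipped, which is an assumption about how one would want to treat unrated items.

```r
# Hypothetical helper: RMSE over the pairs where both values are present.
rmse <- function(predicted, actual) {
  observed <- !is.na(predicted) & !is.na(actual)
  sqrt(mean((predicted[observed] - actual[observed])^2))
}

predicted <- c(3.5, 4.0, 2.0, NA)   # NA: no prediction available
actual    <- c(4.0, 4.0, 1.0, 5.0)

# Squared errors are 0.25, 0 and 1, so the result is sqrt(1.25 / 3).
rmse(predicted, actual)
```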

Dataset

The data set I chose for this project is MovieLense. It ships with the recommenderlab package, so we load it from there and explore it before applying SVD (singular value decomposition).

library(recommenderlab)

## Loading required package: Matrix

## Loading required package: arules

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## Loading required package: proxy

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loading required package: registry

## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

## 
## Attaching package: 'recommenderlab'

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

library(ggplot2)
library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
## x tidyr::pack()   masks Matrix::pack()
## x dplyr::recode() masks arules::recode()
## x tidyr::unpack() masks Matrix::unpack()

library(pander)

Data Exploration

Let's load the data set first.

data(MovieLense)
movielense <- MovieLense # Loading the movie datset
movielense

## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.

class (movielense)

## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"

slotNames(movielense)

## [1] "data"      "normalize"

print(paste0("The dimensions of dataset (Users x Movies): ", nrow(movielense), " x ",ncol(movielense)))

## [1] "The dimensions of dataset (Users x Movies): 943 x 1664"

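# movielense@data@i holds the zero-based row (user) indices of the sparse matrix,
# so its maximum is the number of users minus one, not a per-movie rating count
# (max(colCounts(movielense)) would give the latter).
print('Largest zero-based user (row) index in the sparse matrix is:')

## [1] "Largest zero-based user (row) index in the sparse matrix is:"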

max(movielense@data@i)

## [1] 942

print('The first 6 movies in this dataset  is:')

## [1] "The first 6 movies in this dataset  is:"

head(names(colCounts(movielense)))

## [1] "Toy Story (1995)"                                    
## [2] "GoldenEye (1995)"                                    
## [3] "Four Rooms (1995)"                                   
## [4] "Get Shorty (1995)"                                   
## [5] "Copycat (1995)"                                      
## [6] "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"

movMat<-as(movielense,'matrix')
class(movMat)

## [1] "matrix"

print ('Number of observed (non-missing) ratings is:')

## [1] "Number of observed (non-missing) ratings is:"

prod(dim(movMat)) -sum(is.na(movMat))

## [1] 99392
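The counts printed above are enough to work out how sparse the rating matrix is. The short calculation below uses only the numbers already shown (943 users, 1,664 movies, 99,392 ratings); the variable names are illustrative.

```r
n_users   <- 943
n_movies  <- 1664
n_ratings <- 99392

# Share of user-movie cells that actually hold a rating.
density  <- n_ratings / (n_users * n_movies)
sparsity <- 1 - density

round(100 * density, 1)   # percentage of cells with a rating
round(100 * sparsity, 1)  # percentage of empty cells
```

Only about 6% of the possible user-movie pairs are rated, which is why the data are stored as a sparse matrix and why the filtering step later on is useful.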

We now take a peek into the data. Below are the ratings for the first few users and movies in the MovieLense data, followed by a look at which movies the first user has rated.

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

## View a 10 by 10 sample of the data
y<-as.matrix(movielense@data[1:10,1:10])
y %>% kable (caption ="DataExample") %>% kable_styling ("striped", full_width=TRUE)

DataExample
Toy Story (1995)	GoldenEye (1995)	Four Rooms (1995)	Get Shorty (1995)	Copycat (1995)	Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	Twelve Monkeys (1995)	Babe (1995)	Dead Man Walking (1995)	Richard III (1995)
5	3	4	3	3	5	4	1	5	3
4	0	0	0	0	0	0	0	0	2
0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0
4	3	0	0	0	0	0	0	0	0
4	0	0	0	0	0	2	4	4	0
0	0	0	5	0	0	5	5	5	4
0	0	0	0	0	0	3	0	0	0
0	0	0	0	0	5	4	0	0	0
4	0	0	4	0	0	4	0	4	0

# # look at the first 3 ratings of the first user
head(as(movielense[1,], "list")[[1]], 3)

##  Toy Story (1995)  GoldenEye (1995) Four Rooms (1995) 
##                 5                 3                 4

 ## look at the last 4 ratings of the no 16th user 
# tail(as(movielense[1,], "list")[[16]], 4)

## let's look at user number 168
mov_rated168 <- as.data.frame(movielense@data[c("168"),])
# print(mov_rated168)
dim(mov_rated168)

## [1] 1664    1

tail(mov_rated168)  ## last 6 movies in the matrix for user 168 (0 = not rated)

##                                           movielense@data[c("168"), ]
## War at Home, The (1996)                                             0
## Sweet Nothing (1995)                                                0
## Mat' i syn (1997)                                                   0
## B. Monkey (1998)                                                    0
## You So Crazy (1994)                                                 0
## Scream of Stone (Schrei aus Stein) (1991)                           0

# # Loading the metadata that gets loaded with main dataset

moviemeta <- MovieLenseMeta
class(moviemeta)

## [1] "data.frame"

colnames(moviemeta)

##  [1] "title"       "year"        "url"         "unknown"     "Action"     
##  [6] "Adventure"   "Animation"   "Children's"  "Comedy"      "Crime"      
## [11] "Documentary" "Drama"       "Fantasy"     "Film-Noir"   "Horror"     
## [16] "Musical"     "Mystery"     "Romance"     "Sci-Fi"      "Thriller"   
## [21] "War"         "Western"

# rownames(moviemeta)
dim(moviemeta)

## [1] 1664   22

 pander(head(moviemeta,2),caption = "First few Rows within Movie Meta Data ")

First few Rows within Movie Meta Data (continued below)
title	year
Toy Story (1995)	1995
GoldenEye (1995)	1995

Table continues below
url	unknown	Action
http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)	0	0
http://us.imdb.com/M/title-exact?GoldenEye%20(1995)	0	1

Table continues below
Adventure	Animation	Children’s	Comedy	Crime	Documentary	Drama
0	1	1	1	0	0	0
1	0	0	0	0	0	0

Table continues below
Fantasy	Film-Noir	Horror	Musical	Mystery	Romance	Sci-Fi	Thriller
0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	1

War	Western
0	0
0	0

 # pander(tail(moviemeta), caption = "Last few Rows within Movie Meta Data")

Data Visualization

Top Ten Movies

movie_watched <- data.frame(
    movie_name = names(colCounts(movielense)),
    watched_times = colCounts(movielense)
      )
top_ten_movies <- movie_watched[order(movie_watched$watched_times, decreasing = TRUE), ][1:10, ] 

ggplot(top_ten_movies) + aes(x=movie_name, y=watched_times) + 
  geom_bar(stat = "identity",fill = "firebrick4", color = "dodgerblue2") + xlab("Movie Title") + ylab("Count") +
  theme(axis.text = element_text(angle = 40, hjust = 1))

Let's look at the different ratings given by users.

Movie Ratings Histogram

In the sparse rating matrix a value of 0 means the user has not rated the movie; it is not an actual rating. We therefore exclude zeros from the analysis so that they do not skew the distribution.

Average Movie Rating Histogram

qplot(colMeans(movielense)) + stat_bin(bins=20, fill=I("blue"), col=I("red")) +
  xlim(0,5)+
  xlab("AVERAGE RATING") + ylab("COUNTS") + 
ggtitle("AVERAGE RATINGS COUNT")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

mvector_raw<-as.vector(movielense@data)
mvector_raw<-factor(mvector_raw)
qplot(mvector_raw,fill=I("blue"), col=I("red") ) + ggtitle("RATINGS RAW COUNT") +
  xlab("RATINGS") + ylab("COUNT")

mvector <- as.vector(movielense@data)
mvector <- mvector[mvector != 0] 

unique(mvector)

## [1] 5 4 3 1 2

mvector <- factor(mvector)

qplot(mvector,fill=I("blue"), col=I("red") ) + ggtitle("RATINGS RAW COUNT EXCLUDING 0") +
  xlab("RATINGS") + ylab("COUNT")

Data preparation

The data set is large and sparse, so we reduce it by keeping only users who have rated more than 30 movies and movies that have been rated by more than 60 users.

movielense <- movielense [rowCounts(movielense) > 30, colCounts(movielense) > 60]

print(paste0("Number of Rows, users, after filtering : ", nrow(movielense)))

## [1] "Number of Rows, users, after filtering : 726"

print(paste0("Number of Columns, items, after filtering : ", ncol(movielense)))

## [1] "Number of Columns, items, after filtering : 529"
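One detail of filtering rows and columns in a single subsetting step is that both counts are taken on the original matrix. After unpopular movies are dropped, a retained user may therefore end up with fewer ratings than the user threshold. The toy example below (base R only, made-up data and thresholds) sketches this effect; whether it matters for the filtered MovieLense matrix has not been checked here.

```r
# 1 = rated, 0 = not rated (toy data).
rated <- matrix(
  c(1, 1, 1, 0,
    1, 1, 0, 0,
    1, 0, 0, 1,
    1, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Both conditions are evaluated on the unfiltered matrix.
keep_users  <- rowSums(rated) > 1   # users with more than 1 rating
keep_movies <- colSums(rated) > 2   # movies with more than 2 ratings
filtered <- rated[keep_users, keep_movies]

# u3 passed the user filter but has only one rating left after
# movies m3 and m4 were removed.
rowSums(filtered)
```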

Training and Testing Data

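We evaluate with 10-fold cross-validation. With method = "cross-validation" the split is determined by k, so each run trains on roughly 90% of the users (648 of 726) and tests on the remaining 78; the train = 0.8 argument appears to have no effect here. For each test user, 15 ratings are given to the recommender and the rest are held out. Ratings of 3 and above count as good ratings.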

set.seed(2020)#seed as year
n_folds <- 10  ## 10 cross-validation folds (runs)
to_keep <- 15  ## 15 ratings given per test user
threshold <- 3 ## ratings of 3 and above count as good

e <- evaluationScheme(movielense, method="cross-validation",k = n_folds, train=0.8, given=to_keep,  goodRating=threshold)

print(e)

## Evaluation scheme with 15 items given
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=3.000000
## Data set: 726 x 529 rating matrix of class 'realRatingMatrix' with 74956 ratings.

training <- getData(e, "train")
known <- getData(e, "known")
unknown <- getData(e, "unknown")

print(paste0("Training data has ", nrow(training)," rows, users"))

## [1] "Training data has 648 rows, users"

print(paste0("Known Testing data has ", nrow(known)," rows, users"))

## [1] "Known Testing data has 78 rows, users"

# print(paste0("Unknown Testing data has ", nrow(unknown)," rows, users"))
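To make the "given 15" protocol concrete, the sketch below splits one test user's ratings into a known part (shown to the recommender) and an unknown part (held out for scoring). It uses base R with made-up ratings and a smaller `given` of 3; it mirrors the idea behind the "known" and "unknown" data sets, not recommenderlab's actual code.

```r
set.seed(2020)

# One hypothetical test user's ratings for eight movies.
user_ratings <- c(m1 = 5, m2 = 3, m3 = 4, m4 = 2, m5 = 5, m6 = 1, m7 = 4, m8 = 3)
given <- 3  # number of ratings revealed to the recommender

known_items  <- sample(names(user_ratings), given)
known_part   <- user_ratings[known_items]
unknown_part <- user_ratings[setdiff(names(user_ratings), known_items)]

# Held-out movies that count as "good" under a threshold of 3.
good_unknown <- names(unknown_part)[unknown_part >= 3]

known_part
unknown_part
good_unknown
```

Predictions are made from `known_part` only and then compared with `unknown_part`, so the model is never scored on ratings it has already seen.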

Singular Value Decomposition

We choose singular value decomposition as the first model to train, so that we can use it to recommend movies.

training_time <- system.time({
    model_svd <- Recommender(data = training, method = "SVD") })

print("Model training time : ")

## [1] "Model training time : "

print(training_time)

##    user  system elapsed 
##    0.04    0.01    0.06

print(model_svd)

## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 648 users.

SVD Prediction

predicted_top_ten_movies_svd <- predict(object = model_svd, newdata = known, n = 10)  ## top10 movie recommendations

predicted_top_ten_movies_df_svd <- data.frame(users = sort(rep(1:length(predicted_top_ten_movies_svd@items), 
                                                          predicted_top_ten_movies_svd@n)), 
                                         ratings = unlist(predicted_top_ten_movies_svd@ratings),
                                         index = unlist(predicted_top_ten_movies_svd@items))

predicted_top_ten_movies_df_svd$title <- predicted_top_ten_movies_svd@itemLabels[predicted_top_ten_movies_df_svd$index]
predicted_top_ten_movies_df_svd$year <- MovieLenseMeta$year[predicted_top_ten_movies_df_svd$index]

predicted_top_ten_movies_df_svd <- predicted_top_ten_movies_df_svd %>% group_by(users) %>% top_n(4,ratings)  ## keep the 4 highest-rated of the 10 recommended movies per user for display

predicted_top_ten_movies_df_svd[predicted_top_ten_movies_df_svd$users %in% (1:2), ]  ## first 2 users

## # A tibble: 8 x 5
## # Groups:   users [2]
##   users ratings index title                         year
##   <int>   <dbl> <int> <chr>                        <dbl>
## 1     1    3.51    43 Pulp Fiction (1994)           1994
## 2     1    3.44    79 Fargo (1996)                  1993
## 3     1    3.42   104 2001: A Space Odyssey (1968)  1996
## 4     1    3.40    37 Star Wars (1977)              1994
## 5     2    4.25    43 Pulp Fiction (1994)           1994
## 6     2    4.23   166 Back to the Future (1985)     1986
## 7     2    4.22   176 Field of Dreams (1989)        1986
## 8     2    4.22    19 Braveheart (1995)             1995
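In the table above, the year column disagrees with the year in several titles (for example, Fargo (1996) is listed with 1993). A likely cause is that `index` refers to column positions in the filtered 529-movie matrix, while `MovieLenseMeta` still has one row for each of the original 1,664 movies, so a positional lookup picks the wrong row. The sketch below reproduces the effect on made-up data and shows a title-based lookup with `match()` as one possible fix; the data frame and vectors are hypothetical.

```r
# Hypothetical metadata for four movies, in original order.
meta <- data.frame(
  title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  year  = c(1990, 1991, 1992, 1993),
  stringsAsFactors = FALSE
)

# Suppose filtering kept only B and D, so position 1 now means "Movie B".
filtered_titles   <- c("Movie B", "Movie D")
recommended_index <- c(2, 1)  # positions in the filtered matrix

titles <- filtered_titles[recommended_index]  # "Movie D" "Movie B"

# Positional lookup into the unfiltered metadata picks the wrong rows.
year_by_position <- meta$year[recommended_index]

# Looking the rows up by title keeps title and year aligned.
year_by_title <- meta$year[match(titles, meta$title)]

data.frame(titles, year_by_position, year_by_title)
```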

Accuracy Metrics SVD

svd_prediction <- predict(object = model_svd, newdata = known, n = 10, type = "ratings")

print("Accuracy Metrics SVD :")

## [1] "Accuracy Metrics SVD :"

print(calcPredictionAccuracy(x = svd_prediction, data = unknown, byUser = FALSE))

##      RMSE       MSE       MAE 
## 0.9986768 0.9973553 0.7918828

Item Based Collaborative Filtering (Cosine)

training_time <- system.time({
    model_ibcf_cosine <- Recommender(data = training, method = "IBCF", parameter = list(method = "Cosine"))
})

print("Model training time : ")

## [1] "Model training time : "

print(training_time)

##    user  system elapsed 
##    0.84    0.08    0.97

print(model_ibcf_cosine)

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 648 users.

IBCF Prediction

predicted_top_ten_movies_ibcf_cosine <- predict(object = model_ibcf_cosine, newdata = known, n = 10)## top 10 movies

predicted_top_ten_movies_df_ibcf_cosine <- data.frame(users = sort(rep(1:length(predicted_top_ten_movies_ibcf_cosine@items), 
                                                          predicted_top_ten_movies_ibcf_cosine@n)), 
                                         ratings = unlist(predicted_top_ten_movies_ibcf_cosine@ratings),
                                         index = unlist(predicted_top_ten_movies_ibcf_cosine@items))

predicted_top_ten_movies_df_ibcf_cosine$title <- predicted_top_ten_movies_ibcf_cosine@itemLabels[predicted_top_ten_movies_df_ibcf_cosine$index]
predicted_top_ten_movies_df_ibcf_cosine$year <- MovieLenseMeta$year[predicted_top_ten_movies_df_ibcf_cosine$index]

predicted_top_ten_movies_df_ibcf_cosine <- predicted_top_ten_movies_df_ibcf_cosine %>% group_by(users) %>% top_n(10,ratings)  # keep all 10 recommended movies per user; many IBCF predictions tie at the maximum rating of 5

predicted_top_ten_movies_df_ibcf_cosine[predicted_top_ten_movies_df_ibcf_cosine$users %in% (1:2), ]  ## first 2 users

## # A tibble: 20 x 5
## # Groups:   users [2]
##    users ratings index title                                      year
##    <int>   <dbl> <int> <chr>                                     <dbl>
##  1     1    5        3 Four Rooms (1995)                          1995
##  2     1    5       14 Mr. Holland's Opus (1995)                  1994
##  3     1    5       29 Net, The (1995)                            1995
##  4     1    5       38 Legends of the Fall (1994)                 1995
##  5     1    5       51 While You Were Sleeping (1995)             1994
##  6     1    5       59 Firm, The (1993)                           1994
##  7     1    5       67 Sleepless in Seattle (1993)                1994
##  8     1    5       80 Heavy Metal (1981)                         1993
##  9     1    5       85 Truth About Cats & Dogs, The (1996)        1994
## 10     1    5       88 Rock, The (1996)                           1993
## 11     2    5        1 Toy Story (1995)                           1995
## 12     2    5       24 Apollo 13 (1995)                           1996
## 13     2    5       37 Star Wars (1977)                           1994
## 14     2    5      134 Empire Strikes Back, The (1980)            1941
## 15     2    5      171 Indiana Jones and the Last Crusade (1989)  1991
## 16     2    5      176 Field of Dreams (1989)                     1986
## 17     2    5      202 Jungle2Jungle (1997)                       1993
## 18     2    4.50   166 Back to the Future (1985)                  1986
## 19     2    4.49   490 Space Jam (1996)                           1934
## 20     2    4.02    19 Braveheart (1995)                          1995

Accuracy Metrics IBCF

ibcf_prediction <- predict(object = model_ibcf_cosine, newdata = known, n = 10, type = "ratings")

print("Accuracy Metrics IBCF :")

## [1] "Accuracy Metrics IBCF :"

print(calcPredictionAccuracy(x = ibcf_prediction, data = unknown, byUser = FALSE))

##     RMSE      MSE      MAE 
## 1.444221 2.085774 1.092323

User Based Collaborative Filtering (Cosine)

training_time <- system.time({
    model_ubcf_cosine <- Recommender(data = training, method = "UBCF", parameter = list(method = "Cosine"))
})

print("Model training time : ")

## [1] "Model training time : "

print(training_time)

##    user  system elapsed 
##    0.01    0.00    0.01

print(model_ubcf_cosine)

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 648 users.

UBCF Prediction

predicted_top_ten_movies_ubcf_cosine <- predict(object = model_ubcf_cosine, newdata = known, n = 10) #top10

predicted_top_ten_movies_df_ubcf_cosine <- data.frame(users = sort(rep(1:length(predicted_top_ten_movies_ubcf_cosine@items), 
                                                          predicted_top_ten_movies_ubcf_cosine@n)), 
                                         ratings = unlist(predicted_top_ten_movies_ubcf_cosine@ratings),
                                         index = unlist(predicted_top_ten_movies_ubcf_cosine@items))

predicted_top_ten_movies_df_ubcf_cosine$title <- predicted_top_ten_movies_ubcf_cosine@itemLabels[predicted_top_ten_movies_df_ubcf_cosine$index]
predicted_top_ten_movies_df_ubcf_cosine$year <- MovieLenseMeta$year[predicted_top_ten_movies_df_ubcf_cosine$index]

predicted_top_ten_movies_df_ubcf_cosine <- predicted_top_ten_movies_df_ubcf_cosine %>% group_by(users) %>% top_n(4,ratings)  ## display 4 of the top10 ratings to save space

predicted_top_ten_movies_df_ubcf_cosine[predicted_top_ten_movies_df_ubcf_cosine$users %in% (1:2), ]  ## first 2 users

## # A tibble: 8 x 5
## # Groups:   users [2]
##   users ratings index title                             year
##   <int>   <dbl> <int> <chr>                            <dbl>
## 1     1    3.78    79 Fargo (1996)                      1993
## 2     1    3.76    37 Star Wars (1977)                  1994
## 3     1    3.63    43 Pulp Fiction (1994)               1994
## 4     1    3.62   143 Return of the Jedi (1983)         1965
## 5     2    4.57    37 Star Wars (1977)                  1994
## 6     2    4.54    79 Fargo (1996)                      1993
## 7     2    4.48   143 Return of the Jedi (1983)         1965
## 8     2    4.46    77 Silence of the Lambs, The (1991)  1993

Accuracy Metrics UBCF

ubcf_prediction <- predict(object = model_ubcf_cosine, newdata = known, n = 10, type = "ratings")

print("Accuracy Metrics UBCF :")

## [1] "Accuracy Metrics UBCF :"

print(calcPredictionAccuracy(x = ubcf_prediction, data = unknown, byUser = FALSE))

##      RMSE       MSE       MAE 
## 0.9970938 0.9941960 0.7911661

Algorithm Model Evaluation Comparison

models_evaluation <- list( 
        SVD = list(name = "SVD"),
        IBCF = list(name = "IBCF", param = list(method = "cosine")),  
        UBCF = list(name = "UBCF", param = list(method = "cosine"))
      )

lerror <- evaluate(x = e, method = models_evaluation, type = "ratings")

## SVD run fold/sample [model time/prediction time]
##   1  [0.03sec/0.02sec] 
##   2  [0.05sec/0.01sec] 
##   3  [0.05sec/0.01sec] 
##   4  [0.05sec/0.03sec] 
##   5  [0.07sec/0sec] 
##   6  [0.06sec/0.02sec] 
##   7  [0.06sec/0.02sec] 
##   8  [0.06sec/0sec] 
##   9  [0.27sec/0.01sec] 
##   10  [0.06sec/0sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [1.05sec/0.01sec] 
##   2  [0.87sec/0.03sec] 
##   3  [0.91sec/0sec] 
##   4  [0.83sec/0.02sec] 
##   5  [0.85sec/0.01sec] 
##   6  [0.92sec/0.03sec] 
##   7  [1.36sec/0.01sec] 
##   8  [0.87sec/0.02sec] 
##   9  [0.83sec/0.01sec] 
##   10  [1sec/0.02sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.14sec] 
##   2  [0sec/0.17sec] 
##   3  [0sec/0.16sec] 
##   4  [0.01sec/0.13sec] 
##   5  [0sec/0.16sec] 
##   6  [0.01sec/0.13sec] 
##   7  [0sec/0.36sec] 
##   8  [0sec/0.18sec] 
##   9  [0sec/0.25sec] 
##   10  [0sec/0.26sec]

mdlcmp <- as.data.frame(sapply(avg(lerror), rbind))

Although run times differ, all three models run within a reasonable time frame, so run time is not a concern in this particular situation and is not a factor in our judgment. The final model selection should be based on accuracy and the other evaluation measures.

cmpMdl <- as.data.frame(t(as.matrix(mdlcmp)))

colnames(cmpMdl) <- c("RMSE", "MSE", "MAE")

pander(cmpMdl, caption = "Model Comparison")

Model Comparison
	RMSE	MSE	MAE
SVD	1.037	1.077	0.8232
IBCF	1.469	2.162	1.116
UBCF	1.034	1.07	0.8197

rmse_ubcf<- calcPredictionAccuracy(x = ubcf_prediction, data = unknown, byUser = FALSE)
rmse_ibcf <- calcPredictionAccuracy(x = ibcf_prediction, data = unknown, byUser = FALSE)
rmse_svd <- calcPredictionAccuracy(x = svd_prediction, data = unknown, byUser = FALSE)

comparison = rbind(rmse_ibcf, rmse_ubcf, rmse_svd)
comparison = data.frame(comparison, row.names = NULL)
comparison = cbind(model =c('IBCF','UBCF','SVD'), comparison)

comparison %>% gather ('measure', 'value',-1) %>% 
  ggplot (aes (x=measure, y=value, fill=model)) +
  geom_bar (stat='identity', position=position_dodge())

Item-based collaborative filtering performs the worst, with the largest RMSE (root mean square error). Singular value decomposition and user-based collaborative filtering perform similarly.

n_recommendations = c(1,3,5,8,10,15,20, 25)
results = evaluate (x=e, method = models_evaluation, n = n_recommendations)

## SVD run fold/sample [model time/prediction time]
##   1  [0.06sec/0.03sec] 
##   2  [0.07sec/0.01sec] 
##   3  [0.06sec/0.01sec] 
##   4  [0.05sec/0.01sec] 
##   5  [0.27sec/0.01sec] 
##   6  [0.05sec/0.03sec] 
##   7  [0.05sec/0.03sec] 
##   8  [0.05sec/0.03sec] 
##   9  [0.05sec/0.03sec] 
##   10  [0.05sec/0.05sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.87sec/0.02sec] 
##   2  [0.84sec/0.02sec] 
##   3  [0.79sec/0.02sec] 
##   4  [0.9sec/0.03sec] 
##   5  [0.9sec/0.02sec] 
##   6  [0.88sec/0.01sec] 
##   7  [0.89sec/0.02sec] 
##   8  [0.93sec/0.02sec] 
##   9  [0.95sec/0.05sec] 
##   10  [1.08sec/0.03sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.15sec] 
##   2  [0.02sec/0.16sec] 
##   3  [0sec/0.15sec] 
##   4  [0.02sec/0.15sec] 
##   5  [0sec/0.16sec] 
##   6  [0sec/0.16sec] 
##   7  [0sec/0.14sec] 
##   8  [0sec/0.16sec] 
##   9  [0sec/0.14sec] 
##   10  [0sec/0.16sec]

plot(results, y="ROC", annotate = 1, legend ="topleft")
title ("ROC Curve")

plot (results, y ='prec/rec', annotate=1)
title ("Precision-Recall")

The ROC (receiver operating characteristic) curve reveals that singular value decomposition has the largest area under the curve, followed by user-based collaborative filtering, while item-based collaborative filtering has the smallest. The same holds for the precision-recall figure: SVD ranks best and IBCF ranks worst.
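Each point on these curves comes from a confusion matrix computed for a top-N list. The sketch below works through one such point by hand in base R. The item names and the lists are made up; "relevant" stands for held-out movies rated at or above the good-rating threshold.

```r
all_items   <- paste0("m", 1:10)            # candidate movies for one user
relevant    <- c("m1", "m2", "m3", "m4")    # held-out movies with a good rating
recommended <- c("m1", "m3", "m7")          # a top-3 recommendation list

tp <- length(intersect(recommended, relevant))  # recommended and relevant
fp <- length(setdiff(recommended, relevant))    # recommended but not relevant
fn <- length(setdiff(relevant, recommended))    # relevant but not recommended
tn <- length(all_items) - tp - fp - fn          # neither

precision <- tp / (tp + fp)  # share of recommendations that are relevant
recall    <- tp / (tp + fn)  # true positive rate (y axis of the ROC curve)
fpr       <- fp / (fp + tn)  # false positive rate (x axis of the ROC curve)

c(precision = precision, recall = recall, fpr = fpr)
```

Repeating this for each list length N (1, 3, 5, ...) and averaging over users and folds gives the points that the ROC and precision-recall plots connect.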

Conclusion

Singular value decomposition performs better than the collaborative filtering family (UBCF and IBCF) in this movie setting: its RMSE is on par with UBCF, and it has the best ROC and precision-recall curves. It is not surprising that the well-known big tech companies below all used singular value decomposition in their recommendation systems until very recently.

Figure


Gracie Hui Han
