The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you’re asked to do at least one or (if you like) both of the following:
• Work in a small group, and/or
• Choose a different dataset to work with from your previous projects.
library(recommenderlab)
## Warning: package 'recommenderlab' was built under R version 3.5.3
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.5.3
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.5.3
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
## Warning: package 'registry' was built under R version 3.5.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.2
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.2.5
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::recode() masks arules::recode()
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(knitr)
library("ggplot2")
For this project I’m using data I haven’t used before: the MovieLense dataset that ships with recommenderlab.
data("MovieLense")
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
This indicates that the rating matrix has 943 users (rows) and 1,664 movies (columns), with 99,392 ratings in total.
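With only 99,392 ratings out of 943 x 1,664 possible user-movie pairs, the matrix is quite sparse. A quick, hedged sketch of that calculation (using recommenderlab’s nratings()):
# Fraction of user-movie cells that actually contain a rating
nratings(MovieLense) / (nrow(MovieLense) * ncol(MovieLense))   # roughly 0.063, i.e. ~6% of cells are rated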
methods(class= class(MovieLense))
## [1] [ [<- binarize
## [4] calcPredictionAccuracy coerce colCounts
## [7] colMeans colSds colSums
## [10] denormalize dim dimnames
## [13] dimnames<- dissimilarity evaluationScheme
## [16] getData.frame getList getNormalize
## [19] getRatingMatrix getRatings getTopNLists
## [22] image normalize nratings
## [25] Recommender removeKnownRatings rowCounts
## [28] rowMeans rowSds rowSums
## [31] sample show similarity
## see '?methods' for accessing help and source code
# Taking a look at the dataset as a data frame
df<-getData.frame(MovieLense)
summary(df)
## user item rating
## 405 : 735 Star Wars (1977) : 583 Min. :1.00
## 655 : 677 Contact (1997) : 509 1st Qu.:3.00
## 13 : 630 Fargo (1996) : 508 Median :4.00
## 450 : 538 Return of the Jedi (1983) : 507 Mean :3.53
## 276 : 515 Liar Liar (1997) : 485 3rd Qu.:4.00
## 416 : 487 English Patient, The (1996): 481 Max. :5.00
## (Other):95810 (Other) :96319
We can see that the ratings range from 1 to 5. Now let’s look at the distribution of the ratings.
hist(getRatings(MovieLense),main="Distribution of Movie Ratings")
We can see that the most common rating is 4.
# Looking to see if there's any missing data
sum(is.na(df))
## [1] 0
There is no missing data that could distort the user-user or item-item similarity calculations.
# Exploring the average ratings
avg_ratings <- colMeans(MovieLense)
qplot(avg_ratings, binwidth = 0.1) +
ggtitle("Distribution of the Average Movie Rating")
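Some of the extreme averages come from movies that were only rated a handful of times. As a small, hedged sketch (the threshold of 50 ratings is my own assumption, not something the assignment requires), the same plot can be restricted to reasonably popular movies:
popular <- colCounts(MovieLense) >= 50   # keep movies with at least 50 ratings (arbitrary cutoff)
qplot(avg_ratings[popular], binwidth = 0.1) +
ggtitle("Average Rating, Movies with at Least 50 Ratings")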
Splitting the data into train/test sets: I’m going to use an 80/20 train/test split, give each test user 15 “known” ratings for the recommender to work from, and consider a rating of 3 and above to be a good rating.
set.seed(11)
eval <- evaluationScheme(MovieLense, method = "split", train = 0.8, given=15, goodRating = 3)
train <- getData(eval, "train")      # training users' ratings, used to fit each model
known <- getData(eval, "known")      # 15 ratings per test user, fed to the fitted model
unknown <- getData(eval, "unknown")  # the test users' held-out ratings, used to score predictions
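A quick, hedged sanity check that the scheme behaves as described: the training set should hold roughly 80% of the 943 users, and each test user should contribute exactly 15 known ratings.
dim(train)               # roughly 80% of the users, all 1664 movies
dim(known)               # the remaining test users
table(rowCounts(known))  # every test user keeps exactly 15 known ratings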
I am going to compare 3 recommender system algorithms: SVD, Item Based Collaborative Filtering (IBCF), and User Based Collaborative Filtering (UBCF).
set.seed(44)
# SVD
recom <- Recommender(train, method = "SVD")
pred <- predict(object = recom, newdata = known, type = "ratings")
eval_accuracy_SVD <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
# Item-based collaborative filtering
recom <- Recommender(train, method = "IBCF")
pred <- predict(object = recom, newdata = known, type = "ratings")
eval_accuracy_IBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
# User-based collaborative filtering (recom keeps this last model for the top-N step below)
recom <- Recommender(train, method = "UBCF")
pred <- predict(object = recom, newdata = known, type = "ratings")
eval_accuracy_UBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
rbind(eval_accuracy_SVD, eval_accuracy_IBCF, eval_accuracy_UBCF)
## RMSE MSE MAE
## eval_accuracy_SVD 1.024543 1.049689 0.8229837
## eval_accuracy_IBCF 1.274609 1.624629 0.9454343
## eval_accuracy_UBCF 1.019977 1.040353 0.8177085
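As a quick consistency check on these metrics, RMSE is simply the square root of MSE, so the RMSE column can be reproduced from the MSE column:
# Square root of each model's MSE reproduces the RMSE column above
sqrt(c(SVD = 1.049689, IBCF = 1.624629, UBCF = 1.040353))
# ~ 1.024543  1.274609  1.019977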
The lowest RMSE (Root Mean Squared Error) belongs to the UBCF (user-based collaborative filtering) algorithm, which makes it the most effective of the three on this split.
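RMSE only measures rating-prediction error. recommenderlab’s evaluate() can also score top-N recommendation lists (precision/recall, TPR/FPR) against the same evaluation scheme, which is where the goodRating = 3 threshold chosen earlier comes into play. This is only a sketch; the list sizes in n are my own choice:
algos <- list(
  "SVD"  = list(name = "SVD",  param = NULL),
  "IBCF" = list(name = "IBCF", param = NULL),
  "UBCF" = list(name = "UBCF", param = NULL)
)
topn_results <- evaluate(eval, method = algos, type = "topNList", n = c(1, 3, 5, 10))
avg(topn_results[["UBCF"]])         # averaged confusion-matrix counts per list size
plot(topn_results, annotate = TRUE) # ROC-style comparison of the three algorithms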
# Now let's look at the top 3 movies recommended for a single user (using the UBCF model still stored in recom)
set.seed(444)
pred <- predict(object = recom, newdata = unknown, type = "topNList", n = 3)
user_1 <- pred@items[[1]]
movies_user_1 <- pred@itemLabels[user_1]
movies_user_1 %>% kable(caption = "User1 Predictions") %>% kable_styling("striped", full_width = TRUE)
| User 1 Predictions |
|---|
| Sting, The (1973) |
| Casablanca (1942) |
| Graduate, The (1967) |
The best RMSE was observed from the UBCF system, and the top 3 movies recommended for this user are popular, highly rated titles: The Sting (94% on Rotten Tomatoes), Casablanca (98%), and The Graduate (86%). My issue with these recommendations is that, although the algorithm appears to be working, the movies are quite old. It would be interesting to see the results with a more up-to-date dataset.
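To sanity-check that these really are popular, highly rated titles within the MovieLense data itself (rather than relying on Rotten Tomatoes scores), a small sketch:
# How often each recommended movie was rated, and its average rating, in this dataset
colCounts(MovieLense)[movies_user_1]
colMeans(MovieLense)[movies_user_1]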