The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you’re asked to do at least one or (if you like) both of the following:
• Work in a small group, and/or
• Choose a different dataset to work with from your previous projects.
library(recommenderlab)
## Warning: package 'recommenderlab' was built under R version 3.5.3
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.5.3
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.5.3
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
## Warning: package 'registry' was built under R version 3.5.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.2
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.2.5
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::recode() masks arules::recode()
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(knitr)
library("ggplot2")
For this project I’m using data I haven’t used before: the MovieLense dataset that ships with recommenderlab.
data("MovieLense")
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
This indicates that the rating matrix has 943 users (rows) and 1,664 movies (columns), with 99,392 ratings in total.
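With only 99,392 ratings out of 943 x 1,664 possible user-movie pairs, the matrix is quite sparse. A quick, hedged sketch of that calculation (using recommenderlab’s nratings()):
# Fraction of user-movie cells that actually contain a rating
nratings(MovieLense) / (nrow(MovieLense) * ncol(MovieLense))   # roughly 0.063, i.e. ~6% of cells are rated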
methods(class= class(MovieLense))
## [1] [ [<- binarize
## [4] calcPredictionAccuracy coerce colCounts
## [7] colMeans colSds colSums
## [10] denormalize dim dimnames
## [13] dimnames<- dissimilarity evaluationScheme
## [16] getData.frame getList getNormalize
## [19] getRatingMatrix getRatings getTopNLists
## [22] image normalize nratings
## [25] Recommender removeKnownRatings rowCounts
## [28] rowMeans rowSds rowSums
## [31] sample show similarity
## see '?methods' for accessing help and source code
# Taking a look at the dataset as a data frame
df<-getData.frame(MovieLense)
summary(df)
## user item rating
## 405 : 735 Star Wars (1977) : 583 Min. :1.00
## 655 : 677 Contact (1997) : 509 1st Qu.:3.00
## 13 : 630 Fargo (1996) : 508 Median :4.00
## 450 : 538 Return of the Jedi (1983) : 507 Mean :3.53
## 276 : 515 Liar Liar (1997) : 485 3rd Qu.:4.00
## 416 : 487 English Patient, The (1996): 481 Max. :5.00
## (Other):95810 (Other) :96319
We can see that the ratings range from 1 to 5. Now let’s look at the distribution of the ratings.
hist(getRatings(MovieLense),main="Distribution of Movie Ratings")
We can see that the most common rating is 4.
# Looking to see if there's any missing data
sum(is.na(df))
## [1] 0
There is no missing data that could distort the user-user or item-item similarity calculations.
# Exploring the average ratings
avg_ratings <- colMeans(MovieLense)
qplot(avg_ratings, binwidth = 0.1) +
ggtitle("Distribution of the Average Movie Rating")
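Some of the extreme averages come from movies that were only rated a handful of times. As a small, hedged sketch (the threshold of 50 ratings is my own assumption, not something the assignment requires), the same plot can be restricted to reasonably popular movies:
popular <- colCounts(MovieLense) >= 50   # keep movies with at least 50 ratings (arbitrary cutoff)
qplot(avg_ratings[popular], binwidth = 0.1) +
ggtitle("Average Rating, Movies with at Least 50 Ratings")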
Splitting the data into train/test sets: I’m going to use an 80/20 train/test split, give each test user 15 “known” ratings for the recommender to work from, and consider a rating of 3 and above to be a good rating.
set.seed(11)
eval <- evaluationScheme(MovieLense, method = "split", train = 0.8, given=15, goodRating = 3)
train <- getData(eval, "train")      # training users' ratings, used to fit each model
known <- getData(eval, "known")      # 15 ratings per test user, fed to the fitted model
unknown <- getData(eval, "unknown")  # the test users' held-out ratings, used to score predictions
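A quick, hedged sanity check that the scheme behaves as described: the training set should hold roughly 80% of the 943 users, and each test user should contribute exactly 15 known ratings.
dim(train)               # roughly 80% of the users, all 1664 movies
dim(known)               # the remaining test users
table(rowCounts(known))  # every test user keeps exactly 15 known ratings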
I am going to compare 3 recommender system algorithms: SVD, Item Based Collaborative Filtering (IBCF), and User Based Collaborative Filtering (UBCF).
set.seed(44)
# SVD
recom <- Recommender(train, method = "SVD")
pred <- predict(object = recom, newdata = known, type = "ratings")
eval_accuracy_SVD <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
# Item-based collaborative filtering
recom <- Recommender(train, method = "IBCF")
pred <- predict(object = recom, newdata = known, type = "ratings")
eval_accuracy_IBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
# User-based collaborative filtering (recom keeps this last model for the top-N step below)
recom <- Recommender(train, method = "UBCF")
pred <- predict(object = recom, newdata = known, type = "ratings")
eval_accuracy_UBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
rbind(eval_accuracy_SVD, eval_accuracy_IBCF, eval_accuracy_UBCF)
## RMSE MSE MAE
## eval_accuracy_SVD 1.024543 1.049689 0.8229837
## eval_accuracy_IBCF 1.274609 1.624629 0.9454343
## eval_accuracy_UBCF 1.019977 1.040353 0.8177085
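As a quick consistency check on these metrics, RMSE is simply the square root of MSE, so the RMSE column can be reproduced from the MSE column:
# Square root of each model's MSE reproduces the RMSE column above
sqrt(c(SVD = 1.049689, IBCF = 1.624629, UBCF = 1.040353))
# ~ 1.024543  1.274609  1.019977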
The lowest RMSE (Root Mean Squared Error) belongs to the UBCF (user-based collaborative filtering) algorithm, which makes it the most effective of the three on this split.
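RMSE only measures rating-prediction error. recommenderlab’s evaluate() can also score top-N recommendation lists (precision/recall, TPR/FPR) against the same evaluation scheme, which is where the goodRating = 3 threshold chosen earlier comes into play. This is only a sketch; the list sizes in n are my own choice:
algos <- list(
  "SVD"  = list(name = "SVD",  param = NULL),
  "IBCF" = list(name = "IBCF", param = NULL),
  "UBCF" = list(name = "UBCF", param = NULL)
)
topn_results <- evaluate(eval, method = algos, type = "topNList", n = c(1, 3, 5, 10))
avg(topn_results[["UBCF"]])         # averaged confusion-matrix counts per list size
plot(topn_results, annotate = TRUE) # ROC-style comparison of the three algorithms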
# Now let's look at the top 3 movies recommended for a single user (using the UBCF model still stored in recom)
set.seed(444)
pred <- predict(object = recom, newdata = unknown, type = "topNList", n = 3)
user_1 <- pred@items[[1]]
movies_user_1 <- pred@itemLabels[user_1]
movies_user_1 %>% kable(caption = "User1 Predictions") %>% kable_styling("striped", full_width = TRUE)
| User 1 Predictions |
|---|
| Sting, The (1973) |
| Casablanca (1942) |
| Graduate, The (1967) |
The best RMSE was observed from the UBCF system, and the top 3 movies recommended for this user are popular, highly rated titles: The Sting (94% on Rotten Tomatoes), Casablanca (98%), and The Graduate (86%). My issue with these recommendations is that, although the algorithm appears to be working, the movies are quite old. It would be interesting to see the results with a more up-to-date dataset.
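To sanity-check that these really are popular, highly rated titles within the MovieLense data itself (rather than relying on Rotten Tomatoes scores), a small sketch:
# How often each recommended movie was rated, and its average rating, in this dataset
colCounts(MovieLense)[movies_user_1]
colMeans(MovieLense)[movies_user_1]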