1 Acknowledgements

This paper and the research behind it would not have happened without the support of my supervisor, Dr. Mengshi Zhou, and Lois Schindler, who introduced me to Dr. Zhou. Without Dr. Zhou's enthusiasm, knowledge, and willingness to work with me, this research paper would never have been written; and without Lois's support, the opportunity would never have crossed my desk. Both have been great pillars of support, and I am grateful to them.

2 Introduction

2.1 Recommendation engines

The rapid development of internet technology has led to explosive growth in available data over the past few years. However, this huge amount of data is often an obstacle to finding relevant information, due to the presence of spam or erroneous information.

Recommendation systems address this information overload by guiding people toward information that matches their preferences. The goal of a recommendation engine is to help a user, such as a consumer, discover information that is relevant to them. For example, http://google.com uses a recommendation engine: after you search for something, the returned links are the algorithm's top recommendations for you. Recommendations are not just for Google, either; they appear in online marketing, online dating, book and music suggestions, and a plethora of other areas. In this article we focus on movie recommendations. During the COVID-19 pandemic, consumers relied heavily on video-streaming services such as Netflix and Disney+. Consider Netflix: how does it recommend movies from its large movie database to its even larger customer base? This is the question we investigate throughout this article.

In this study, we investigate the performance of different recommendation algorithms on a commonly used movie recommendation dataset. We then build an artificial intelligence (AI)-based recommendation engine using the algorithm with the best performance. Finally, we develop a Shiny app for our AI-based recommendation engine.

2.2 Machine learning algorithms for recommendation engines

For this study we will look at 9 different recommendation algorithms, briefly described below (a short sketch of the two similarity measures follows the list):

  • User-Based Collaborative Filtering Cosine (UBCF_C): Items are recommended under the assumption that similar users rate items similarly. Based on a cosine similarity matrix computed from a matrix with movies as columns and users as rows.

  • Item-Based Collaborative Filtering Cosine (IBCF_C): Items are recommended under the assumption that users will prefer items similar to the items they already like. Based on a cosine similarity matrix computed between items (columns) rather than users.

  • User-Based Collaborative Filtering Pearson (UBCF_P): The idea is the same as UBCF_C, but instead of the cosine similarity matrix we use the Pearson similarity matrix.

  • Item-Based Collaborative Filtering Pearson (IBCF_P): The idea is the same as IBCF_C, but instead of the cosine similarity matrix we use the Pearson similarity matrix.

  • Singular Value Decomposition (SVD): SVD is a factorization of a real matrix. It extracts the latent factors and uses the latent factor matrices to make recommendations.

  • Popular: Recommends based simply on item/movie popularity.

  • Alternating Least Squares Explicit (ALS): Recommends for explicit ratings based on latent factors, calculated by the alternating least squares algorithm.

  • Alternating Least Squares Implicit (ALS_implicit): Recommends for implicit data based on latent factors, calculated by the alternating least squares algorithm.

  • Random: Produces random recommendations, giving us a baseline against which to judge the other models.
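To make the difference between cosine and Pearson similarity concrete, here is a minimal sketch (not part of the original analysis) that computes both measures for two hypothetical users who rated the same four movies:

# Two hypothetical users' ratings of the same four movies
u1 <- c(5, 3, 4, 4)
u2 <- c(3, 1, 2, 3)

# Cosine similarity: cosine of the angle between the raw rating vectors
cosine_sim <- sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))

# Pearson similarity: correlation, i.e. cosine of the mean-centered vectors
pearson_sim <- cor(u1, u2)

c(cosine = cosine_sim, pearson = pearson_sim)

Pearson similarity centers each user's ratings around their own mean first, which is why it can behave better when users rate on different personal scales.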

3 Data description

We obtained our data from IMDb, which is an online database of information related to television programs, home videos, video games, and, most importantly for us, films. The data can be obtained from https://drive.google.com/file/d/1Dn1BZD3YxgBQJSIjbfNnmCFlDW2jdQGD/view. There are two data frames. One is a key comprising a movie ID, the title of the unique movie assigned to that ID, and the genres associated with the movie. The other data frame holds the data we will analyze, comprising a user ID, a movie ID, and the rating the user gave to the movie.

4 Data preprocessing

In our analysis we will make use of 5 packages: tidyverse, data.table, reshape2, lobstr, and recommenderlab.

knitr::opts_chunk$set(echo = TRUE,
                      warning = F,
                      message = F)
library(tidyverse)                   
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(recommenderlab)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: arules
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy
library(lobstr)

4.1 Data importing

You could save the data from https://drive.google.com/file/d/1Dn1BZD3YxgBQJSIjbfNnmCFlDW2jdQGD/view and then load it from the save location. But to ease reproducibility, we placed the data on the GitHub page https://github.com/SeanCranston/2021_Summer_Research under the Data folder. This way we don't have to download and save the file manually; we can simply read the data from the URL (it is downloaded to a temporary file, not stored permanently).

movie_data <- read_csv("https://raw.githubusercontent.com/SeanCranston/2021_Summer_Research/master/Data/movies.csv",
              col_types = cols(
                         movieId = col_double(),
                         title = col_character(),
                         genres = col_character()
                       ))
head(movie_data)
## # A tibble: 6 x 3
##   movieId title                              genres                             
##     <dbl> <chr>                              <chr>                              
## 1       1 Toy Story (1995)                   Adventure|Animation|Children|Comed~
## 2       2 Jumanji (1995)                     Adventure|Children|Fantasy         
## 3       3 Grumpier Old Men (1995)            Comedy|Romance                     
## 4       4 Waiting to Exhale (1995)           Comedy|Drama|Romance               
## 5       5 Father of the Bride Part II (1995) Comedy                             
## 6       6 Heat (1995)                        Action|Crime|Thriller
rating_data <- read_csv("https://raw.githubusercontent.com/SeanCranston/2021_Summer_Research/master/Data/ratings.csv",
                        col_types = cols(
                          userId = col_double(),
                          movieId = col_double(),
                          rating = col_double(),
                          timestamp = col_double()
                        ))
head(rating_data)
## # A tibble: 6 x 4
##   userId movieId rating  timestamp
##    <dbl>   <dbl>  <dbl>      <dbl>
## 1      1      16    4   1217897793
## 2      1      24    1.5 1217895807
## 3      1      32    4   1217896246
## 4      1      47    4   1217896556
## 5      1      50    4   1217896523
## 6      1     110    4   1217896150

4.2 Cleaning genre matrix

In this code chunk we clean the movie_data data frame. First, we collect all the distinct genres in our data set into a character vector. We then split the genre string of each movie into its individual genres. From there we create a new matrix called genre whose columns are named after the genre vector, then fill it in a for loop, marking with a 1 whether a movie has a specific genre. Lastly, we bind the movies to the genre matrix for easy reference.

# different genres
names_genre <- c("Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
           "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", 
           "Mystery","Romance", "Sci-Fi", "Thriller", "War", "Western") 

# split different genres for the same movie
list_genre <- str_split(movie_data$genres, "[|]")
head(list_genre)
## [[1]]
## [1] "Adventure" "Animation" "Children"  "Comedy"    "Fantasy"  
## 
## [[2]]
## [1] "Adventure" "Children"  "Fantasy"  
## 
## [[3]]
## [1] "Comedy"  "Romance"
## 
## [[4]]
## [1] "Comedy"  "Drama"   "Romance"
## 
## [[5]]
## [1] "Comedy"
## 
## [[6]]
## [1] "Action"   "Crime"    "Thriller"
# create genre matrix
genre = matrix(0, nrow(movie_data), ncol = length(names_genre)) %>% 
    `colnames<-`(names_genre) 
head(as_tibble(genre))
## # A tibble: 6 x 18
##   Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy
##    <dbl>     <dbl>     <dbl>    <dbl>  <dbl> <dbl>       <dbl> <dbl>   <dbl>
## 1      0         0         0        0      0     0           0     0       0
## 2      0         0         0        0      0     0           0     0       0
## 3      0         0         0        0      0     0           0     0       0
## 4      0         0         0        0      0     0           0     0       0
## 5      0         0         0        0      0     0           0     0       0
## 6      0         0         0        0      0     0           0     0       0
## # ... with 9 more variables: Film-Noir <dbl>, Horror <dbl>, Musical <dbl>,
## #   Mystery <dbl>, Romance <dbl>, Sci-Fi <dbl>, Thriller <dbl>, War <dbl>,
## #   Western <dbl>
# fill genre matrix with 1 if the movie has that genre
for (row in 1:nrow(genre)) {
    for (col in 1:ncol(genre)) {
        genre[row,col] = ifelse(any(names_genre[col] == list_genre[[row]]), 1, 0)
    }
}

genre = as.data.frame(genre)
str(genre)
## 'data.frame':    10329 obs. of  18 variables:
##  $ Action     : num  0 0 0 0 0 1 0 0 1 1 ...
##  $ Adventure  : num  1 1 0 0 0 0 0 1 0 1 ...
##  $ Animation  : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ Children   : num  1 1 0 0 0 0 0 1 0 0 ...
##  $ Comedy     : num  1 0 1 1 1 0 1 0 0 0 ...
##  $ Crime      : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ Documentary: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Fantasy    : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ Film-Noir  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance    : num  0 0 1 1 0 0 1 0 0 0 ...
##  $ Sci-Fi     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Thriller   : num  0 0 0 0 0 1 0 0 0 1 ...
##  $ War        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Western    : num  0 0 0 0 0 0 0 0 0 0 ...
SearchMatrix <- cbind(movie_data[,1:2], genre[])
head(SearchMatrix, 10)
##    movieId                              title Action Adventure Animation
## 1        1                   Toy Story (1995)      0         1         1
## 2        2                     Jumanji (1995)      0         1         0
## 3        3            Grumpier Old Men (1995)      0         0         0
## 4        4           Waiting to Exhale (1995)      0         0         0
## 5        5 Father of the Bride Part II (1995)      0         0         0
## 6        6                        Heat (1995)      1         0         0
## 7        7                     Sabrina (1995)      0         0         0
## 8        8                Tom and Huck (1995)      0         1         0
## 9        9                Sudden Death (1995)      1         0         0
## 10      10                   GoldenEye (1995)      1         1         0
##    Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1         1      1     0           0     0       1         0      0       0
## 2         1      0     0           0     0       1         0      0       0
## 3         0      1     0           0     0       0         0      0       0
## 4         0      1     0           0     1       0         0      0       0
## 5         0      1     0           0     0       0         0      0       0
## 6         0      0     1           0     0       0         0      0       0
## 7         0      1     0           0     0       0         0      0       0
## 8         1      0     0           0     0       0         0      0       0
## 9         0      0     0           0     0       0         0      0       0
## 10        0      0     0           0     0       0         0      0       0
##    Mystery Romance Sci-Fi Thriller War Western
## 1        0       0      0        0   0       0
## 2        0       0      0        0   0       0
## 3        0       1      0        0   0       0
## 4        0       1      0        0   0       0
## 5        0       0      0        0   0       0
## 6        0       0      0        1   0       0
## 7        0       1      0        0   0       0
## 8        0       0      0        0   0       0
## 9        0       0      0        0   0       0
## 10       0       0      0        1   0       0

From here we have a matrix with the movie and each genre as variables, a format much better suited for computations.
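As an aside, the nested for loop above can be written without loops; here is a sketch that assumes the names_genre and list_genre objects defined earlier:

# Loop-free construction of the same genre indicator matrix
genre_alt <- sapply(names_genre, function(g) {
  as.numeric(vapply(list_genre, function(movie) g %in% movie, logical(1)))
})
identical(unname(genre_alt), unname(as.matrix(genre))) # should be TRUE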

4.3 Cleaning rating matrix

Right now our rating_data is in a long format where each row comprises a user, a movie, and a rating. rating_data also contains a timestamp column, but we will ignore it in this analysis. What we want is a wide format where each row is a user, each column is a movie, and each user-movie entry is a rating. To accomplish this we use reshape2's dcast() function. Next, we delete the userId column, since the row index carries the same number. From here we have a huge matrix with 668 rows (users) and 10325 columns (movies). Note that most of the rating entries are NA (not rated); we will show this in our data visualization section, though it can already be seen from the sizes of the different matrix representations.

# change to user vs movie matrix with rating as the entry
ratingMatrix <- reshape2::dcast(rating_data, 
                                userId~movieId, 
                                value.var = "rating", 
                                na.rm=FALSE) %>% 
    select(-c('userId')) %>% # same as row index, just taking up space
    as.matrix() %>% #needs to be matrix to convert to sparse matrix
    as(.,"realRatingMatrix") 

# size of rating_data (can't really use this to make a prediction model)
obj_size(rating_data)
## 3,374,136 B
# size of the rating matrix when not in sparse format
obj_size(rating_data %>% 
           dcast(userId~movieId,value.var = "rating",na.rm=F) %>% 
           select(-c('userId')) %>% 
           as.matrix())
## 55,838,040 B
#size of the sparse matrix
obj_size(ratingMatrix)
## 1,968,424 B
#structure of our sparse matrix
str(ratingMatrix)
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
##   ..@ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:105339] 1 4 7 10 13 16 27 28 29 30 ...
##   .. .. ..@ p       : int [1:10326] 0 232 324 382 393 455 570 625 628 651 ...
##   .. .. ..@ Dim     : int [1:2] 668 10325
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : chr [1:10325] "1" "2" "3" "4" ...
##   .. .. ..@ x       : num [1:105339] 5 4 5 4 4 5 3 4 4.5 4 ...
##   .. .. ..@ factors : list()
##   ..@ normalize: NULL
#other representations of ratingMatrix
dim(ratingMatrix)
## [1]   668 10325
#summary(ratingMatrix@data) %>% head() # one way you could inspect the sparse matrix

Note that our ratingMatrix is of class S4, which means it has slots (the slots are what the @ symbols refer to in str(ratingMatrix)). For more information I encourage you to read https://adv-r.hadley.nz/s4.html?q=s4#s4, which does a very good job of describing S4's purpose in object-oriented programming. Note that putting our ratingMatrix into sparse format reduces the size of the object by (55,838,040 - 1,968,424)/55,838,040, or about 96.5%, which makes storing and using our data much more efficient.
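As a quick illustration of slot access with @, here is a small sketch using the ratingMatrix defined above:

# Inspect the S4 slots of the sparse rating matrix
slotNames(ratingMatrix)   # "data" and "normalize"
dim(ratingMatrix@data)    # the underlying dgCMatrix: 668 x 10325
head(ratingMatrix@data@x) # the stored (non-missing) ratings themselves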

5 Data Visualization

5.1 How many unique users/movies

unique users: 668

unique movies: 10325
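These counts can be computed directly from rating_data; a quick sketch:

# Count the distinct users and movies in the ratings data
n_distinct(rating_data$userId)  # 668
n_distinct(rating_data$movieId) # 10325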

5.2 Similarity matrix of 10 users

In the matrix below, each row and column represents a user. We took ten users and measured how similar they are using cosine similarity. For visualization we also added a colored plot.

similarity_mat <- similarity(ratingMatrix[1:10, ], #get 10 users 
                             method = "cosine",
                             which = "users") %>%  #similar based on user
  as.matrix()
similarity_mat
##            1         2         3         4         5         6         7
## 1  0.0000000 0.9760860 0.9641723 0.9914398 0.8359111 0.9918166 0.9677533
## 2  0.9760860 0.0000000 0.9925732 0.9374253 1.0000000 1.0000000 0.9732709
## 3  0.9641723 0.9925732 0.0000000 0.9888968 0.8414317 1.0000000 0.9171903
## 4  0.9914398 0.9374253 0.9888968 0.0000000 0.7957734 1.0000000 0.9180299
## 5  0.8359111 1.0000000 0.8414317 0.7957734 0.0000000 0.8915476 0.8687099
## 6  0.9918166 1.0000000 1.0000000 1.0000000 0.8915476 0.0000000 0.9606265
## 7  0.9677533 0.9732709 0.9171903 0.9180299 0.8687099 0.9606265 0.0000000
## 8  0.9829068 0.9719133 0.9940390 0.9374253 0.9938837 1.0000000 0.9687383
## 9  0.9637003 0.9610288 0.9544297 0.9493423 0.7340177 1.0000000 0.9564847
## 10 0.9216354        NA 0.9312404        NA 1.0000000 0.9736163 1.0000000
##            8         9        10
## 1  0.9829068 0.9637003 0.9216354
## 2  0.9719133 0.9610288        NA
## 3  0.9940390 0.9544297 0.9312404
## 4  0.9374253 0.9493423        NA
## 5  0.9938837 0.7340177 1.0000000
## 6  1.0000000 1.0000000 0.9736163
## 7  0.9687383 0.9564847 1.0000000
## 8  0.0000000 0.9619281        NA
## 9  0.9619281 0.0000000 0.9622504
## 10        NA 0.9622504 0.0000000
image(similarity_mat, main = "User's Similarities")

5.3 Similarity matrix of 10 movies

Similar to the matrix above, each row and column in the matrix below represents an item. We took ten items and measured how similar they are using cosine similarity. For visualization we also added a colored plot.

movie_similarity <- similarity(ratingMatrix[,1:10],
                               method = "cosine",
                               which = "items") %>%  #similar based on movie
    as.matrix()
movie_similarity
##            1         2         3         4         5         6         7
## 1  0.0000000 0.9669732 0.9559341 0.9101276 0.9531838 0.9611291 0.9609033
## 2  0.9669732 0.0000000 0.9658757 0.9412416 0.9603386 0.9531442 0.9456375
## 3  0.9559341 0.9658757 0.0000000 0.9864877 0.9658946 0.9502471 0.9614828
## 4  0.9101276 0.9412416 0.9864877 0.0000000 0.9345030 0.9772545 0.9897595
## 5  0.9531838 0.9603386 0.9658946 0.9345030 0.0000000 0.9491592 0.9743898
## 6  0.9611291 0.9531442 0.9502471 0.9772545 0.9491592 0.0000000 0.9697378
## 7  0.9609033 0.9456375 0.9614828 0.9897595 0.9743898 0.9697378 0.0000000
## 8  0.9938837 1.0000000 0.9899495 1.0000000 0.9995121 1.0000000        NA
## 9  0.8427010 0.9781386 0.8827704 1.0000000 0.9658814 0.9279429 0.8957936
## 10 0.9786042 0.9675372 0.9726710 0.9365858 0.9649127 0.9799504 0.9713049
##            8         9        10
## 1  0.9938837 0.8427010 0.9786042
## 2  1.0000000 0.9781386 0.9675372
## 3  0.9899495 0.8827704 0.9726710
## 4  1.0000000 1.0000000 0.9365858
## 5  0.9995121 0.9658814 0.9649127
## 6  1.0000000 0.9279429 0.9799504
## 7         NA 0.8957936 0.9713049
## 8  0.0000000        NA 1.0000000
## 9         NA 0.0000000 0.9745159
## 10 1.0000000 0.9745159 0.0000000
image(movie_similarity, main = "Movies similarity")

5.4 Distribution of ratings

Below is a table and two graphs. The main thing to notice is that most user-movie pairs have no rating; that is, we are working with very sparse data.

table_of_ratings <- as.vector(ratingMatrix@data) %>% 
    table()
table_of_ratings
## .
##       0     0.5       1     1.5       2     2.5       3     3.5       4     4.5 
## 6791761    1198    3258    1567    7943    5484   21729   12237   28880    8187 
##       5 
##   14856
ratings <- table_of_ratings %>% as.data.frame()
# with non rating movies
ggplot(ratings,
       aes(x = ., y = Freq)) +
  geom_col()

#only with rated movies
ggplot(ratings[-c(1),],
       aes(x = ., y = Freq)) +
  geom_col()

5.5 Heatmap of first 25 rows and columns

The heat map below is another visual showing how sparse our data is. Each row is a user and each column is a movie, with the rating shown as a colored box.

# Heatmap of movie ratings
image(ratingMatrix[1:25,1:25], 
      axes = FALSE, 
      main = "Heatmap of the first 25 rows and 25 columns")

6 Data Preparation

As we saw in section 5.4, many movies are unrated by users. To make computations easier, it is best practice to keep only the movies and users that meet a certain threshold. For instance, if a movie has only one rating, it is hard to recommend that movie to others; likewise, if a user rates only one movie, it is hard to build a profile for that user. For those reasons we require that each user has rated more than 50 movies and that each movie has been rated by more than 50 people.

6.1 Reducing ratingMatrix

We reduce our matrix by removing users that rated fewer than 50 movies and movies that were rated by fewer than 50 users.

movie_ratings <- ratingMatrix[rowCounts(ratingMatrix) > 50,
                              colCounts(ratingMatrix) > 50]
dim(movie_ratings)
## [1] 420 447
dim(ratingMatrix) #got rid of a lot of movies
## [1]   668 10325

6.2 Visualizing reduced matrix

From the above output of movie_ratings, we observe that there are 420 users and 447 films as opposed to the previous 668 users and 10325 films. We can now observe our matrix of relevant users as follows:

minimum_movies <- quantile(rowCounts(movie_ratings), 0.98)
# 98th percentile of how many movies a user rated
minimum_users <- quantile(colCounts(movie_ratings), 0.98)
image(movie_ratings[rowCounts(movie_ratings) > minimum_movies,
                    # keep only users who rated more movies than 98% of users
                    colCounts(movie_ratings) > minimum_users],
      main = "Heatmap of the top users and movies")

Here we are looking at the users who rated the most movies versus the movies that were rated the most.

Next, we can look at the distribution of the average rating per user.

average_ratings <- rowMeans(movie_ratings)
qplot(average_ratings, fill=I("steelblue"), col=I("white")) +
    ggtitle("Distribution of the average rating per user")

6.3 Data Normalization

We want to normalize the data because not everyone rates the same way. For example, a happy user might rate all movies between 4 and 5 stars, while a harsher user might never rate a movie above a 3. Normalizing the data lets us compare the harsh and friendly raters without bias.
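By default, normalize() centers each user's ratings by subtracting that user's own mean; a toy sketch of the idea:

# Centering puts a generous and a harsh rater on a comparable scale
happy <- c(4, 5, 5, 4) # always rates high
harsh <- c(1, 2, 3, 2) # never rates above 3
happy - mean(happy)
harsh - mean(harsh)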

normalized_ratings <- normalize(movie_ratings)
sum(rowMeans(normalized_ratings) > 0.1)
## [1] 0
image(normalized_ratings[rowCounts(normalized_ratings) > minimum_movies,
                         colCounts(normalized_ratings) > minimum_users],
      main = "Normalized Ratings of the Top Users")

6.4 Data Binarization

Instead of normalizing the data we could pick a threshold that determines which movies get a 0 or a 1. For instance, with a threshold of 3, ratings below 3 become 0 and ratings of 3 or higher become 1.

binary_minimum_movies <- quantile(rowCounts(movie_ratings), 0.98)
# I think they changed this to .98 so there would be more white space
binary_minimum_users <- quantile(colCounts(movie_ratings), 0.98)
#movies_watched <- binarize(movie_ratings, minRating = 1)
good_rated_films <- binarize(movie_ratings, minRating = 3)
image(good_rated_films[rowCounts(movie_ratings) > binary_minimum_movies,
                       colCounts(movie_ratings) > binary_minimum_users],
      main = "Heatmap of the top users and movies")

6.5 Quick Note

It is best to use normalization and binarization together: normalize each row first, then binarize the data by row. However, the recommenderlab package does this for us, so we don't need to write that code ourselves.
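For intuition, here is a toy sketch of that normalize-then-binarize pipeline on a single hypothetical user (recommenderlab handles the real matrix internally):

# Normalize, then binarize, one user's ratings
r <- c(5, 3, 4, 2)
centered <- r - mean(r)  # normalize: center on the user's own mean
as.numeric(centered > 0) # binarize: 1 if above the user's own mean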

7 A list of machine learning models in recommenderlab

# Model options
recommendation_model <- recommenderRegistry$get_entries(
  dataType = "realRatingMatrix")
names(recommendation_model)
##  [1] "HYBRID_realRatingMatrix"       "ALS_realRatingMatrix"         
##  [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"        
##  [5] "LIBMF_realRatingMatrix"        "POPULAR_realRatingMatrix"     
##  [7] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
##  [9] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
## [11] "UBCF_realRatingMatrix"
#list of model types and description
lapply(recommendation_model, "[[", "description") 
## $HYBRID_realRatingMatrix
## [1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."
## 
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
## 
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
## 
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
## 
## $LIBMF_realRatingMatrix
## [1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
## 
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
## 
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
## 
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
## 
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
## 
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
## 
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
#different parameters for each model
# recommendation_model$IBCF_realRatingMatrix$parameters
# recommenderRegistry$get_entries(dataType = "realRatingMatrix")

8 Setup scheme

We will use 3-fold cross-validation: the data is split into 3 folds, each fold is held out once as the test set while a model is fit on the remaining folds, and the evaluation results are averaged across the 3 runs.

e <-  evaluationScheme(movie_ratings, # evaluationScheme will normalize for us
                       method="cross", 
                       k = 3,          # 3-fold cross validation
                       given=3,        # 3 items given for each test user
                       goodRating=3)   # ratings >= 3 count as good
e
## Evaluation scheme with 3 items given
## Method: 'cross-validation' with 3 run(s).
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.

9 Model Parameter Optimization

In order to provide a complete picture of the performance of each algorithm, we performed the 3-fold cross-validation for different parameter combinations. The parameters were then tuned for optimal performance using Precision-Recall (PR) curves and receiver operating characteristic (ROC) curves.

For each model the following parameters will be optimized:

Model     # of Recommendations   Parameter                                  Method
IBCF      1,5,10,15,20,25        Similar Items (k): 1,5,10,20,30,40         Cosine, Pearson
UBCF      1,5,10,15,20,25        Nearest Neighbors (nn): 1,5,10,20,30,40    Cosine
SVD       1,5,10,15,20,25        Item Subset (k): 1,5,10,20,30,40,100       NA
ALS       1,5,10,15,20,25        Latent Factors (k): 1,5,10,20,30,40        Implicit, Explicit
Popular   1,5,10,15,20,25        NA                                         NA

9.1 Item-Based Collaborative Filtering (Cosine and Pearson)

For the cosine method we will use k = 40, since it has the greatest area under the curve. For the Pearson method we will use k = 30, since it is consistently among the best models in both the ROC curve and the precision-recall curve.

#IBCF Cosine
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
  list(name = "IBCF", param = list(method = "Cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)

n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = e, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.19sec/0.03sec] 
##   2  [0.16sec/0.03sec] 
##   3  [0.15sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.16sec/0.01sec] 
##   2  [0.15sec/0.02sec] 
##   3  [0.15sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.16sec/0.01sec] 
##   2  [0.16sec/0.02sec] 
##   3  [0.15sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.16sec/0.03sec] 
##   2  [0.15sec/0.02sec] 
##   3  [0.16sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.14sec/0.03sec] 
##   2  [0.16sec/0.01sec] 
##   3  [0.16sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.14sec/0.02sec] 
##   2  [0.14sec/0.04sec] 
##   3  [0.15sec/0.02sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve IBCF-Cosine")

plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Cosine")

#IBCF Pearson
models_to_evaluate <- lapply(vector_k, function(k){
  list(name = "IBCF", param = list(method = "Pearson", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)

n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = e, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.18sec/0sec] 
##   2  [0.17sec/0sec] 
##   3  [0.16sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.18sec/0.01sec] 
##   2  [0.16sec/0.01sec] 
##   3  [0.17sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.17sec/0sec] 
##   2  [0.17sec/0.01sec] 
##   3  [0.19sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.17sec/0.02sec] 
##   2  [0.17sec/0.01sec] 
##   3  [0.18sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.17sec/0.02sec] 
##   2  [0.17sec/0.02sec] 
##   3  [0.17sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.17sec/0.01sec] 
##   2  [0.17sec/0.01sec] 
##   3  [0.18sec/0.01sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve IBCF-Pearson")

plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Pearson")

9.2 User-Based Collaborative Filtering (Cosine only)

UBCF with the cosine similarity matrix performs best on our data when nn = 1.

Note that the commented-out code below does not generate the UBCF Pearson model. The problem may be that the UBCF Pearson method cannot be fit from a matrix this sparse. A small diagnostic sketch follows the commented-out chunk.

#UBCF Cosine
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
  list(name = "UBCF", param = list(method = "Cosine", nn = k))
})
names(models_to_evaluate) <- paste0("UBCF_nn_", vector_k)

n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = e, method = models_to_evaluate, n = n_recommendations)
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0.14sec] 
##   2  [0sec/0.13sec] 
##   3  [0sec/0.12sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.14sec] 
##   2  [0sec/0.14sec] 
##   3  [0.01sec/0.14sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.17sec] 
##   2  [0sec/0.16sec] 
##   3  [0sec/0.15sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.18sec] 
##   2  [0.02sec/0.16sec] 
##   3  [0.01sec/0.16sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.18sec] 
##   2  [0sec/0.19sec] 
##   3  [0sec/0.17sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.36sec] 
##   2  [0sec/0.19sec] 
##   3  [0sec/0.21sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve UBCF-Cosine")

plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Cosine")

#UBCF Pearson
# models_to_evaluate <- lapply(vector_k, function(k){
#   list(name = "UBCF", param = list(method = "pearson", nn = k))
# })
# names(models_to_evaluate) <- paste0("UBCF_nn_", vector_k)
# 
# n_recommendations <- c(1, 5, 10, 15, 20, 25)
# list_results <- evaluate(x = e, method = models_to_evaluate, n = n_recommendations)
# 
# plot(list_results, annotate = 1, legend = "topleft")
# title("ROC curve UBCF-Pearson")
# plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
# title("Precision-recall UBCF-Pearson")

9.3 Singular Value Decomposition

It appears that SVD is at its best when k = 10, that is, with 10 item subsets. Note in the output below that the k = 1 run fails with a non-conformable arguments error, so only the remaining values of k produce curves.

#SVD
vector_k <- c(1, 5, 10, 20, 30, 40, 100)
models_to_evaluate <- lapply(vector_k, function(k){
  list(name = "SVD", param = list(k = k))
})
names(models_to_evaluate) <- paste0("SVD_k_", vector_k)

n_recommendations <- c(1, 5, 10, 15, 20, 25)

list_results <- evaluate(x = e, method = models_to_evaluate, n = n_recommendations)
## SVD run fold/sample [model time/prediction time]
##   1  Error in r_a %*% model$svd$v %*% diag(1/model$svd$d) : 
##   non-conformable arguments
## SVD run fold/sample [model time/prediction time]
##   1  [0.01sec/0.04sec] 
##   2  [0sec/0.05sec] 
##   3  [0.02sec/0.04sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.02sec/0.04sec] 
##   2  [0.02sec/0.05sec] 
##   3  [0.01sec/0.05sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.03sec/0.05sec] 
##   2  [0.03sec/0.05sec] 
##   3  [0.03sec/0.05sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.05sec/0.05sec] 
##   2  [0.05sec/0.04sec] 
##   3  [0.05sec/0.03sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.07sec/0.04sec] 
##   2  [0.07sec/0.04sec] 
##   3  [0.06sec/0.03sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.32sec/0.05sec] 
##   2  [0.33sec/0.06sec] 
##   3  [0.32sec/0.06sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve SVD")

plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall SVD")

9.4 Alternating Least Squares

ALS performs best when n_factors = 40.

#ALS
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
  list(name = "ALS", param = list(n_factors = k))
})
names(models_to_evaluate) <- paste0("ALS_n_", vector_k)

n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = e, method = models_to_evaluate, n = n_recommendations)
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/6.14sec] 
##   2  [0sec/5.84sec] 
##   3  [0sec/5.93sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/5.81sec] 
##   2  [0sec/5.88sec] 
##   3  [0sec/5.67sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/5.88sec] 
##   2  [0sec/6.09sec] 
##   3  [0sec/6sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/5.95sec] 
##   2  [0sec/6.16sec] 
##   3  [0sec/6.18sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/6.39sec] 
##   2  [0sec/6.36sec] 
##   3  [0sec/6.36sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/7.35sec] 
##   2  [0sec/7.06sec] 
##   3  [0sec/6.94sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve ALS")

plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall ALS")

10 Model Comparison

10.1 Models with parameters

When comparing different models in recommenderlab we can put their specifications in a named list. This lets us evaluate and compare the models with ease.

Models_to_be <- list("UBCF_C" = list(name = "UBCF", 
                                     param = list(method = "cosine",
                                                  nn = 1)),
                     "IBCF_C" = list(name = "IBCF",
                                     param = list(method = "cosine",
                                                  k = 40)),
                     # "UBCF_P" = list(name = "UBCF",#not working :(
                     #                 param = list(method = "Pearson")),
                     "IBCF_P" = list(name = "IBCF",
                                     param = list(method = "Pearson",
                                                  k = 30)),
                     "SVD" = list(name = "SVD",
                                  param = list(k = 10)),
                     "rand" = list(name = "RANDOM"),
                     "Popular" = list(name = "POPULAR"),
                     "ALS" = list(name = "ALS",
                                  param = list(n_factors = 40))
                     #"ALS_implicit" = list(name = "ALS_implicit")
                     )

10.2 RMSE

As we can see from the graph below, the model with the best RMSE is the Popular algorithm, with the ALS method a close second, followed by the SVD algorithm.

r <- evaluate(e, 
              Models_to_be, 
              type = "ratings")# this gives estimated errors
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.13sec] 
##   2  [0sec/0.11sec] 
##   3  [0sec/0.12sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.16sec/0.01sec] 
##   2  [0.16sec/0.01sec] 
##   3  [0.33sec/0sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.15sec/0.02sec] 
##   2  [0.17sec/0.02sec] 
##   3  [0.17sec/0.02sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.02sec/0.03sec] 
##   2  [0.01sec/0.03sec] 
##   3  [0.02sec/0.03sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.02sec] 
##   3  [0sec/0.02sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0.01sec/0sec] 
##   3  [0.02sec/0sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/7sec] 
##   2  [0sec/6.97sec] 
##   3  [0sec/6.89sec]
rec <- Recommender(data = getData(e,"train"),
                              method = "IBCF",
                              parameter = list(k = 30)) # k = 30 is the default
rec
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 280 users.
class(rec)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
plot(r, ylim = c(0,3.25)) # RMSE

10.3 ROC and precision-recall curve

Both graphs below show that the Popular algorithm outperforms the others.

Models_results <- evaluate(e,
                      Models_to_be,
                      #type = "topNList",
                      n = c(1,3,5,10*1:10) # different values of n for the top-N list
                      )
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.14sec] 
##   2  [0sec/0.13sec] 
##   3  [0sec/0.15sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.16sec/0.02sec] 
##   2  [0.14sec/0.03sec] 
##   3  [0.14sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.17sec/0.02sec] 
##   2  [0.18sec/0.01sec] 
##   3  [0.17sec/0.03sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.02sec/0.04sec] 
##   2  [0.01sec/0.05sec] 
##   3  [0.01sec/0.05sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.04sec] 
##   2  [0sec/0.04sec] 
##   3  [0sec/0.03sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.17sec] 
##   2  [0sec/0.19sec] 
##   3  [0.02sec/0.16sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/6.97sec] 
##   2  [0sec/6.92sec] 
##   3  [0sec/6.95sec]
plot(Models_results, annotate=c(1,3), legend = "bottomright") #ROC curve

plot(Models_results, "prec/rec", annotate = 3, legend = "bottomright") #precision recall curve

10.4 Other Comparisons

Note from the last graph that as the number of recommendations increases, the Popular algorithm accrues fewer false negatives and a higher accuracy.

The graph below shows that as accuracy, computed here as (TPR + (1 - FPR))/2, goes up, FP also increases. But the Popular model's accuracy rises far faster relative to its FP growth than the other models'. A numerical summary sketch follows the plot.

# average the fold-level confusion matrices returned by evaluate()
avg_conf <- function(results) {
  tmp <- results %>%
    getConfusionMatrix()     # one confusion matrix per cross-validation fold
  as.data.frame(Reduce("+", tmp) / length(tmp)) # element-wise mean
}

Models_avg <- Models_results %>%
  map(avg_conf) %>% 
  enframe() %>% 
  unnest(cols = value) %>% 
  mutate(n = as.factor(n)) 

Models_avg
## # A tibble: 91 x 11
##    name      TP     FP    FN    TN     N precision  recall     TPR     FPR n    
##    <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl>   <dbl>   <dbl> <fct>
##  1 UBCF~  0.352  0.610  77.0  366.   444     0.366 0.00498 0.00498 0.00161 1    
##  2 UBCF~  0.943  1.94   76.4  365.   444     0.327 0.0133  0.0133  0.00522 3    
##  3 UBCF~  1.58   3.23   75.8  363.   444     0.328 0.0223  0.0223  0.00870 5    
##  4 UBCF~  3.01   6.61   74.3  360.   444     0.313 0.0429  0.0429  0.0178  10   
##  5 UBCF~  5.81  13.4    71.5  353.   444     0.302 0.0819  0.0819  0.0362  20   
##  6 UBCF~  7.94  20.3    69.4  346.   444     0.281 0.111   0.111   0.0550  30   
##  7 UBCF~ 10.2   26.2    67.1  340.   444     0.279 0.143   0.143   0.0711  40   
##  8 UBCF~ 12.1   31.7    65.2  335.   444     0.275 0.168   0.168   0.0860  50   
##  9 UBCF~ 13.6   36.2    63.7  330.   444     0.272 0.190   0.190   0.0981  60   
## 10 UBCF~ 14.8   40.5    62.5  326.   444     0.268 0.207   0.207   0.110   70   
## # ... with 81 more rows
ggplot(Models_avg, aes(x = FP, y = (TPR + (1-FPR))/2, shape = name, color = n))+
  geom_point()
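To compare the models numerically at a single list length, here is a sketch that ranks them by the same accuracy measure at n = 10, using the Models_avg tibble from above:

# Rank models by (TPR + (1 - FPR)) / 2 at n = 10 recommendations
Models_avg %>%
  filter(n == "10") %>% # n is a factor here
  mutate(accuracy = (TPR + (1 - FPR)) / 2) %>%
  select(name, TP, FP, accuracy) %>%
  arrange(desc(accuracy))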

10.5 Best Model

Looking at all our tests, it is clear that on our data the Popular algorithm easily beats the others, so that is the algorithm we will use in the next section on predictions.

12 Conclusion

Our exploration of recommender systems used User-Based Collaborative Filtering (UBCF), Item-Based Collaborative Filtering (IBCF), Singular Value Decomposition (SVD), Alternating Least Squares (ALS), and the Popular algorithm to select movies for the user to watch next. We determined that the Popular algorithm was the most accurate model, producing higher specificity across all ranges of n.

The recommenderlab package we relied on provides many more options for refining models, such as Funk SVD (SVDF), LIBMF matrix factorization, and hybrid recommenders. We explored many of the package's functions as a means of deepening our understanding of recommendation techniques and concepts. All the same, this small-scale study barely scratched the surface of this domain.

13 References

Funding: NSF grant managed by Lois Schindler

Advisor: Dr. Mengshi Zhou

Data Flair: https://data-flair.training/blogs/data-science-r-movie-recommendation/

Netflix top ten: https://sifter.org/~simon/journal/20061211.html

Intro to sparse matrix formats: https://www.gormanalysis.com/blog/sparse-matrix-storage-formats/ and https://www.gormanalysis.com/blog/sparse-matrix-construction-and-use-in-r/

Maruti Techlabs: https://marutitech.com/recommendation-engine-benefits/#:~:text=There%20are%20basically%20three%20important%20types%20of%20recommendation,filtering%202%20Content-Based%20Filtering%203%20Hybrid%20Recommendation%20Systems

https://rstudio-pubs-static.s3.amazonaws.com/276939_e571a5c526274fc688551066c058841e.html#comparing-multiple-models

https://rpubs.com/dhairavc/639597

The recommenderlab manual, attached to the GitHub repository: https://github.com/SeanCranston/2021_Summer_Research