The goal of this project is to implement and configure a recommender system using two types of recommendation algorithms, item-based and user-based collaborative filtering, and then to evaluate and compare the different approaches, algorithms, and similarity methods.
# Load required packages
library(tidyverse)
library(recommenderlab)
library(psych)
library(reshape2)
library(ggpubr)
library(purrr)
Both the movies and ratings datasets are taken from https://grouplens.org/datasets/movielens/latest/. Two versions are available; the small datasets were chosen due to the limited computing power of my laptop.
# Load movies and ratings datasets
movies <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-612/master/movies.csv")
ratings <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-612/master/ratings.csv")
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
head(ratings)
## userId movieId rating timestamp
## 1 1 1 4 964982703
## 2 1 3 4 964981247
## 3 1 6 4 964982224
## 4 1 47 5 964983815
## 5 1 50 5 964982931
## 6 1 70 3 964982400
The movies dataset contains 3 columns and 9,742 observations. The ratings dataset contains 4 columns and 100,836 observations.
From the statistical summary below, we can see that the mean of the rating variable is 3.5, the standard deviation is 1.04, and the distribution is slightly left-skewed (skew = -0.64).
# Summary of movies and ratings datasets
str(movies)
## 'data.frame': 9742 obs. of 3 variables:
## $ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : Factor w/ 9737 levels "'71 (2014)","'burbs, The (1989)",..: 8895 4662 3676 9250 2979 3859 7348 8834 8159 3544 ...
## $ genres : Factor w/ 951 levels "(no genres listed)",..: 352 418 733 688 635 261 733 400 2 134 ...
str(ratings)
## 'data.frame': 100836 obs. of 4 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : int 1 3 6 47 50 70 101 110 151 157 ...
## $ rating : num 4 4 4 5 5 3 5 4 5 5 ...
## $ timestamp: int 964982703 964981247 964982224 964983815 964982931 964982400 964980868 964982176 964984041 964984100 ...
# Statistical summary of rating variable
describe(ratings$rating)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100836 3.5 1.04 3.5 3.57 0.74 0.5 5 4.5 -0.64 0.12 0
# Plot a histogram to show the distribution of ratings
hist(ratings$rating, main = "Ratings Distribution", xlab = "Ratings", ylab = "Frequency", col = "hotpink", ylim = c(0,30000), breaks = 15)
First, we have to convert the raw ratings data into a matrix format that the recommenderlab package can use for building recommendation systems.
# Convert to rating matrix
ratings_matrix <- dcast(ratings, userId~movieId, value.var = "rating", na.rm = FALSE)
# Remove the userId column
ratings_matrix <- as.matrix(ratings_matrix[,-1])
# Convert rating matrix into a recommenderlab sparse matrix
ratings_matrix <- as(ratings_matrix, "realRatingMatrix")
ratings_matrix
## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
Each row of ratings_matrix corresponds to a user, and each column corresponds to a movie id. There are 610 x 9724 = 5,931,640 user-movie combinations, so the full matrix requires 5,931,640 cells. Since not every user has watched every movie, there are only 100,836 actual ratings, and the matrix is very sparse.
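As a quick check, the sparsity can be computed directly from the matrix; a small sketch using recommenderlab's rowCounts():
# Sparsity check: share of the user-movie grid with no rating
n_ratings <- sum(rowCounts(ratings_matrix)) # 100,836 actual ratings
n_cells <- nrow(ratings_matrix) * ncol(ratings_matrix) # 5,931,640 possible cells
1 - n_ratings / n_cells # roughly 0.98, i.e. about 98% of cells are empty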
# Convert the ratings matrix into a vector
vec_ratings <- as.vector(ratings_matrix@data)
# Unique ratings
unique(vec_ratings)
## [1] 4.0 0.0 4.5 2.5 3.5 3.0 5.0 0.5 2.0 1.5 1.0
# Count the occurrences for each rating
table_ratings <- table(vec_ratings)
table_ratings
## vec_ratings
## 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5
## 5830804 1370 2811 1791 7551 5550 20047 13136 26818 8551
## 5
## 13211
A rating equal to 0 represents a missing value in the matrix, so we remove these before building a frequency plot of the actual ratings to visualize their distribution.
# Remove zero rating and convert the vector to factor
vec_ratings <- vec_ratings[vec_ratings != 0] %>% factor()
# Visualize through qplot
qplot(vec_ratings, fill = I("steelblue")) +
ggtitle("Distribution of the Ratings") +
labs(x = "Ratings")
# Search for the top 5 most viewed movies
most_views <- colCounts(ratings_matrix) %>% melt()
# The row names of the melted data frame are the movie ids taken from the matrix
# column names; sequential row ids would not match the real (non-contiguous) movie ids
most_views <- tibble::rownames_to_column(most_views, "movieId") %>%
  mutate(movieId = as.integer(movieId)) %>%
  rename(count = value) %>%
  top_n(count, n = 5) %>%
  merge(movies, by = "movieId")
# Visualize the top 5 most viewed movies
ggplot(most_views, aes(x = reorder(title, count), y = count)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank()) +
  ggtitle("Top 5 Most Viewed Movies")
# Average rating for each movie
avg_ratings_mv <- colMeans(ratings_matrix)
# Average rating for each user
avg_ratings_us <- rowMeans(ratings_matrix)
# Visualize the distribution of the average movie rating
avg1 <- qplot(avg_ratings_mv, binwidth = 0.1) +
  ggtitle("Average Movie Rating Distribution") +
  labs(x = 'Average Rating', y = 'Frequency')
# Visualize the distribution of the average rating per user
avg2 <- qplot(avg_ratings_us, binwidth = 0.1) +
  ggtitle("Average Rating Per User Distribution") +
  labs(x = 'Average Rating', y = 'Frequency')
figure <- ggarrange(avg1, avg2, ncol = 1, nrow = 2)
figure
From both plots above, we can see that some movies have only a few ratings and some users have rated only a few movies. For building the recommendation system, we don’t want to take these movies and users into account, as their ratings might be biased. To remove the least-watched movies and least-active users, we can set a minimum threshold, for example 50.
# Keep only users with more than 50 ratings and movies rated more than 50 times
ratings_matrix <- ratings_matrix[rowCounts(ratings_matrix) > 50, colCounts(ratings_matrix) > 50]
# Average rating for each movie
avg_ratings_mv2 <- colMeans(ratings_matrix)
# Average rating for each user
avg_ratings_us2 <- rowMeans(ratings_matrix)
# Visualize the distribution of the average movie rating
avg3 <- qplot(avg_ratings_mv2, binwidth = 0.1) +
  ggtitle("Average Movie Rating Distribution") +
  labs(x = 'Average Rating', y = 'Frequency')
# Visualize the distribution of the average rating per user
avg4 <- qplot(avg_ratings_us2, binwidth = 0.1) +
  ggtitle("Average Rating Per User Distribution") +
  labs(x = 'Average Rating', y = 'Frequency')
figure2 <- ggarrange(avg1, avg2, avg3, avg4,
labels = c("A", "B", "C", "D"),
ncol = 2, nrow = 2)
figure2
The effect of removing those potentially biased ratings on the distribution is obvious. From the figure above, we can see that the curves are much narrower and have less variance than before.
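A quick look at the matrix dimensions confirms the reduction; a small sanity check (the evaluation scheme output further below shows the filtered size of 378 x 436):
# Dimensions after filtering (the original matrix was 610 x 9724)
dim(ratings_matrix)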
Let’s see which recommender options the recommenderlab package provides for realRatingMatrix objects when building recommendation systems.
# Display the list of options for real rating matrix
rec <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(rec)
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "LIBMF_realRatingMatrix"
## [5] "POPULAR_realRatingMatrix" "RANDOM_realRatingMatrix"
## [7] "RERECOMMEND_realRatingMatrix" "SVD_realRatingMatrix"
## [9] "SVDF_realRatingMatrix" "UBCF_realRatingMatrix"
# Description for the IBCF method
lapply(rec, `[[`, 'description') %>% `[[`('IBCF_realRatingMatrix')
## [1] "Recommender based on item-based collaborative filtering."
# Description for the UBCF method
lapply(rec, `[[`, 'description') %>% `[[`('UBCF_realRatingMatrix')
## [1] "Recommender based on user-based collaborative filtering."
# Default parameter values for the IBCF method
rec$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
# Default parameter values for the UBCF method
rec$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
“IBCF_realRatingMatrix” and “UBCF_realRatingMatrix” are the two models demonstrated in this project: one is item-based and the other user-based collaborative filtering. Different parameters will be used to optimize the performance of these two recommendation models.
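Both models depend on a similarity measure between users or items. As a small illustration, recommenderlab's similarity() computes these directly; a sketch on an arbitrary slice of the matrix:
# Pairwise cosine similarity between the first four users (rows)
similarity(ratings_matrix[1:4, ], method = "cosine", which = "users")
# Pairwise cosine similarity between the first four movies (columns)
similarity(ratings_matrix[, 1:4], method = "cosine", which = "items")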
Since both the user-based and item-based CF algorithms automatically normalize the data, we can directly use the ratings matrix from the last step without having to normalize it manually.
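For reference, the same normalization can be applied explicitly with recommenderlab's normalize(); a minimal sketch:
# "center" subtracts each user's mean rating (the default used by IBCF/UBCF);
# "Z-score" additionally divides by each user's standard deviation
ratings_centered <- normalize(ratings_matrix, method = "center")
ratings_zscore <- normalize(ratings_matrix, method = "Z-score")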
We will build this filtering system by splitting the dataset into an 80% training set and a 20% test set. 10 ratings per user will be given to the recommender to make predictions, and the remaining ratings are held out for computing prediction accuracy.
evaluation <- evaluationScheme(ratings_matrix, method = "split", train = 0.8, given = 10)
evaluation
## Evaluation scheme with 10 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: NA
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
train <- getData(evaluation, "train")
train
## 302 x 436 rating matrix of class 'realRatingMatrix' with 28961 ratings.
test_known <- getData(evaluation, "known")
test_known
## 76 x 436 rating matrix of class 'realRatingMatrix' with 760 ratings.
test_unknown <- getData(evaluation, "unknown")
test_unknown
## 76 x 436 rating matrix of class 'realRatingMatrix' with 6493 ratings.
Create an IBCF recommender using the “Pearson” similarity measure and the 50 most similar items (k = 50).
# Create an item-based CF recommender using training data
rec_ib <- Recommender(data = train, method = "IBCF",
parameter = list(method = "pearson", k = 50))
# Create predictions for the test items using known ratings with type as ratings
pred_ib_acr <- predict(object = rec_ib, newdata = test_known, type = "ratings")
# Create predictions for the test items using known ratings with type as top n recommendation list
pred_ib_n <- predict(object = rec_ib, newdata = test_known, n = 5)
Top 5 recommendations for the first 5 users.
# Recommendations for the first 5 users.
# @items holds column indices into the rating matrix, so map them back to
# the real movie ids stored in the matrix column names before merging
first_5_users <- pred_ib_n@items[1:5] %>%
  lapply(function(x) as.integer(colnames(ratings_matrix)[x])) %>%
  data.frame()
colnames(first_5_users) <- c("user1", "user2", "user3", "user4", "user5")
first_5_users <- first_5_users %>% melt() %>%
  rename(movieId = value) %>%
  merge(movies, by = "movieId") %>%
  rename(users = variable) %>%
  select(users:title)
first_5_users <- first_5_users[order(first_5_users$users),]
first_5_users
## users title
## 1 user1 Toy Story (1995)
## 2 user1 Jumanji (1995)
## 4 user1 Nixon (1995)
## 7 user1 Powder (1995)
## 10 user1 Babe (1995)
## 5 user2 Nixon (1995)
## 12 user2 Usual Suspects, The (1995)
## 15 user2 Lamerica (1994)
## 18 user2 Misérables, Les (1995)
## 13 user3 Usual Suspects, The (1995)
## 14 user3 Mighty Aphrodite (1995)
## 17 user3 Fair Game (1995)
## 21 user3 Up Close and Personal (1996)
## 23 user3 Amazing Panda Adventure, The (1995)
## 3 user4 Heat (1995)
## 6 user4 Ace Ventura: When Nature Calls (1995)
## 9 user4 Dangerous Minds (1995)
## 11 user4 Seven (a.k.a. Se7en) (1995)
## 16 user4 Bio-Dome (1996)
## 8 user5 Powder (1995)
## 19 user5 Black Sheep (1996)
## 20 user5 Pie in the Sky (1996)
## 22 user5 Bad Boys (1995)
## 24 user5 Amazing Panda Adventure, The (1995)
Number of times each movie was recommended
# Define a matrix with the recommendations to the test set users
rec_matrix <- sapply(pred_ib_n@items, function(x){
colnames(ratings_matrix)[x]
})
# Count how many times each movie was recommended
num_of_items <- factor(table(rec_matrix))
# Visualize the distribution of the number of items
qplot(num_of_items) + ggtitle("Distribution of the Number of Items")
Top 5 most recommended movies
# Top 5 most recommended movies
top5_rec_mv <- num_of_items %>% data.frame()
top5_rec_mv <- cbind(movieId = rownames(top5_rec_mv), top5_rec_mv)
rownames(top5_rec_mv) <- 1:nrow(top5_rec_mv)
colnames(top5_rec_mv)[2] <- "count"
top5_rec_mv <- top5_rec_mv %>%
mutate_if( is.factor, ~ as.integer(levels(.x))[.x]) %>%
top_n(count, n = 5) %>%
merge(movies, by = "movieId")
top5_rec_mv <- top5_rec_mv[order(top5_rec_mv$count, decreasing = TRUE),] %>%
select(title)
top5_rec_mv
## title
## 1 Toy Story (1995)
## 5 Braveheart (1995)
## 2 Jumanji (1995)
## 3 Grumpier Old Men (1995)
## 4 Sense and Sensibility (1995)
Create a UBCF recommender using the “Pearson” similarity measure and 50 nearest neighbors (nn = 50).
# Create a user-based CF recommender using training data
rec_ub <- Recommender(data = train, method = "UBCF",
parameter = list(method = "pearson", nn = 50))
# Create predictions for the test users using known ratings with type as ratings
pred_ub_acr <- predict(rec_ub, test_known, type = "ratings")
# Create predictions for the test users using known ratings with type as top n recommendation list
pred_ub_n <- predict(object = rec_ub, newdata = test_known, n = 5)
Top 5 recommendations for the first 5 users.
# Recommendations for the first 5 users
# As before, convert the column indices in @items to the real movie ids
first_5_users <- pred_ub_n@items[1:5] %>%
  lapply(function(x) as.integer(colnames(ratings_matrix)[x])) %>%
  data.frame()
colnames(first_5_users) <- c("user1", "user2", "user3", "user4", "user5")
first_5_users <- first_5_users %>% melt() %>%
  rename(movieId = value) %>%
  merge(movies, by = "movieId") %>%
  rename(users = variable) %>%
  select(users:title)
first_5_users <- first_5_users[order(first_5_users$users),]
first_5_users
## users title
## 3 user1 To Die For (1995)
## 4 user1 Usual Suspects, The (1995)
## 8 user1 Big Green, The (1995)
## 18 user1 Birdcage, The (1996)
## 20 user1 Immortal Beloved (1994)
## 5 user2 Usual Suspects, The (1995)
## 9 user2 Big Green, The (1995)
## 14 user2 Dunston Checks In (1996)
## 21 user2 Immortal Beloved (1994)
## 24 user2 Shawshank Redemption, The (1994)
## 2 user3 To Die For (1995)
## 10 user3 Big Green, The (1995)
## 13 user3 Mr. Holland's Opus (1995)
## 19 user3 Birdcage, The (1996)
## 25 user3 Little Buddha (1993)
## 1 user4 Sense and Sensibility (1995)
## 7 user4 Usual Suspects, The (1995)
## 11 user4 Big Green, The (1995)
## 16 user4 Nick of Time (1995)
## 17 user4 If Lucy Fell (1996)
## 6 user5 Usual Suspects, The (1995)
## 12 user5 Big Green, The (1995)
## 15 user5 Dunston Checks In (1996)
## 22 user5 Love Affair (1994)
## 23 user5 Man of the House (1995)
Visualize the distribution of the number of items
# Define a matrix with the recommendations to the test set users
rec_matrix <- sapply(pred_ub_n@items, function(x){
colnames(ratings_matrix)[x]
})
# Count how many times each movie was recommended
num_of_items <- factor(table(rec_matrix))
# Visualize the distribution of the number of items
qplot(num_of_items) + ggtitle("Distribution of the Number of Items")
Top 5 most recommended movies
# Top 5 most recommended movies
top5_rec_mv <- num_of_items %>% data.frame(stringsAsFactors = FALSE)
top5_rec_mv <- cbind(movieId = rownames(top5_rec_mv), top5_rec_mv)
rownames(top5_rec_mv) <- 1:nrow(top5_rec_mv)
colnames(top5_rec_mv)[2] <- "count"
top5_rec_mv <- top5_rec_mv %>%
mutate_if( is.factor, ~ as.integer(levels(.x))[.x]) %>%
top_n(count, n = 5) %>%
merge(movies, by = "movieId")
top5_rec_mv <- top5_rec_mv[order(top5_rec_mv$count, decreasing = TRUE),] %>%
select(title)
top5_rec_mv
## title
## 3 Shawshank Redemption, The (1994)
## 2 Pulp Fiction (1994)
## 1 Star Wars: Episode IV - A New Hope (1977)
## 4 Forrest Gump (1994)
## 5 Schindler's List (1993)
Compare predictions with true “unknown” ratings
# Compare predictions with true "unknown" ratings
as(test_unknown, "matrix")[1:8,1:5]
## 1 2 3 6 7
## [1,] 4.5 NA NA NA NA
## [2,] NA NA NA NA NA
## [3,] NA 3 NA NA NA
## [4,] NA NA NA NA NA
## [5,] 3.0 NA NA NA 1
## [6,] 5.0 NA NA NA NA
## [7,] 4.0 NA 3.5 4.5 NA
## [8,] NA NA NA NA NA
as(pred_ib_acr, "matrix")[1:8,1:5]
## 1 2 3 6 7
## [1,] 5.000000 5.000000 4.000000 4.000000 4.000000
## [2,] 3.339197 3.500000 3.000000 NA 3.502433
## [3,] 4.500000 NA NA 1.500000 4.500000
## [4,] NA 3.500000 NA 3.746954 2.473600
## [5,] NA NA NA NA 2.960024
## [6,] 4.678354 2.544514 4.672803 NA NA
## [7,] 3.206836 NA NA NA NA
## [8,] 4.500000 NA 4.000000 4.751435 4.000000
as(pred_ub_acr, "matrix")[1:8,1:5]
## 1 2 3 6 7
## [1,] 4.574934 4.487671 4.450626 4.501525 4.512176
## [2,] 3.288340 3.065994 3.096236 3.199075 3.135870
## [3,] 3.942737 3.720959 3.755558 3.822132 3.808681
## [4,] 2.984634 2.853417 2.857997 2.920167 2.877304
## [5,] 3.947475 3.923321 3.890224 4.015133 3.880585
## [6,] 4.268158 4.005265 4.047457 4.200309 4.023832
## [7,] 3.726804 3.635771 3.665291 3.865003 3.679749
## [8,] 4.271804 4.128269 4.189861 4.324460 4.202695
Evaluate the accuracy of the user-based CF and item-based CF recommenders on the unknown ratings.
# Evaluate Item-Based recommendations on unknown ratings
acr_ib <- calcPredictionAccuracy(pred_ib_acr, test_unknown)
# Evaluate User-Based recommendations on unknown ratings
acr_ub <- calcPredictionAccuracy(pred_ub_acr, test_unknown)
acr <- rbind(IBCF = acr_ib, UBCF = acr_ub)
acr
## RMSE MSE MAE
## IBCF 1.0853645 1.1780161 0.7898596
## UBCF 0.9026854 0.8148408 0.6828060
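As a sanity check, the RMSE can be reproduced by hand from the prediction and hold-out matrices; a small sketch whose result should agree with acr_ub above:
# RMSE by hand: root mean squared error over hold-out cells that received a prediction
pred_mat <- as(pred_ub_acr, "matrix")
true_mat <- as(test_unknown, "matrix")
sqrt(mean((pred_mat - true_mat)^2, na.rm = TRUE)) # should match acr_ub["RMSE"]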
Let’s try another evaluation scheme with 5-fold cross-validation and the “Cosine” similarity measure.
# Setup the evaluation scheme using 5-fold cross-validation
# (each fold trains on 4/5 of the users, so no separate train proportion is needed)
evaluation_2 <- evaluationScheme(ratings_matrix,
                                 method = "cross",
                                 k = 5,
                                 given = 10,
                                 goodRating = 5
)
evaluation_2
## Evaluation scheme with 10 items given
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=5.000000
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
# Set up list of algorithms
algorithms <- list(
"item-based CF" = list(name = "IBCF", parameter = list(method = "Cosine", k = 50)),
"user-based CF" = list(name = "UBCF", parameter = list(method = "Cosine", nn = 50))
)
# Estimate the models
results <- evaluate(evaluation_2,
algorithms,
type = "topNList",
n = c(1, 3, 5, 10, 15, 20)
)
## IBCF run fold/sample [model time/prediction time]
## 1 [1.41sec/0.11sec]
## 2 [1.39sec/0.08sec]
## 3 [2.16sec/0.06sec]
## 4 [1.51sec/0.11sec]
## 5 [1.44sec/0.09sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.35sec]
## 2 [0.02sec/0.29sec]
## 3 [0.01sec/0.35sec]
## 4 [0.02sec/0.31sec]
## 5 [0.02sec/0.33sec]
results
## List of evaluation results for 2 recommenders:
## Evaluation results for 5 folds/samples using method 'IBCF'.
## Evaluation results for 5 folds/samples using method 'UBCF'.
# Create a function to get average of precision, recall, TPR, FPR
avg_cf_matrix <- function(results) {
avg <- results %>%
getConfusionMatrix() %>%
as.list()
as.data.frame( Reduce("+", avg) / length(avg)) %>%
mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
select('n', 'precision', 'recall', 'TPR', 'FPR')
}
# Using map() to iterate the avg function across both models
results_tbl <- results %>% map(avg_cf_matrix) %>% enframe() %>% unnest(cols = c(value))
results_tbl
## # A tibble: 12 x 6
## name n precision recall TPR FPR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 item-based CF 1 0.0256 0.00179 0.00179 0.00238
## 2 item-based CF 3 0.0282 0.00571 0.00571 0.00711
## 3 item-based CF 5 0.0364 0.0154 0.0154 0.0117
## 4 item-based CF 10 0.0382 0.0280 0.0280 0.0234
## 5 item-based CF 15 0.0395 0.0445 0.0445 0.0351
## 6 item-based CF 20 0.0387 0.0571 0.0571 0.0468
## 7 user-based CF 1 0.236 0.0235 0.0235 0.00185
## 8 user-based CF 3 0.178 0.0531 0.0531 0.00598
## 9 user-based CF 5 0.152 0.0690 0.0690 0.0103
## 10 user-based CF 10 0.127 0.103 0.103 0.0212
## 11 user-based CF 15 0.113 0.131 0.131 0.0323
## 12 user-based CF 20 0.103 0.148 0.148 0.0436
# Plot ROC curves for each model
results_tbl %>%
ggplot(aes(FPR, TPR, color = fct_reorder2(as.factor(name), FPR, TPR))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "ROC Curves", color = "Model") +
theme_grey(base_size = 14)
# Plot Precision-Recall curves for each model
results_tbl %>%
ggplot(aes(recall, precision, color = fct_reorder2(as.factor(name), recall, precision))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "Precision-Recall Curves", colour = "Model") +
theme_grey(base_size = 14)
From the evaluation results, the user-based CF model is the clear winner under both evaluation methods. Its RMSE is lower than that of the item-based CF model. The ROC curves also show clearly that the user-based CF model achieves a higher TPR for any given level of FPR, meaning it produces more relevant recommendations (true positives) for the same level of non-relevant recommendations (false positives). The same holds in the Precision-Recall curves, where the user-based CF model achieves higher recall for any given level of precision, meaning it misses fewer relevant items (false negatives) at the same rate of false positives. Furthermore, each method has a number of tuning parameters, such as the type of similarity, the number of neighbors, the number of latent factors, and regularization parameters. We can compare further by experimenting with these parameters, as sketched below.
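For example, a hypothetical sweep over the UBCF neighborhood size could reuse the same evaluation scheme (the nn values here are illustrative, not tuned):
# Hypothetical sweep over the number of nearest neighbors for UBCF
algorithms_nn <- list(
  "UBCF nn=25" = list(name = "UBCF", parameter = list(method = "Cosine", nn = 25)),
  "UBCF nn=50" = list(name = "UBCF", parameter = list(method = "Cosine", nn = 50)),
  "UBCF nn=100" = list(name = "UBCF", parameter = list(method = "Cosine", nn = 100))
)
results_nn <- evaluate(evaluation_2, algorithms_nn, type = "topNList", n = c(1, 3, 5, 10, 15, 20))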
Although collaborative filtering is the most popular branch of recommendation, it does have limitations when dealing with new users or items. If a new user hasn’t rated any movie yet, neither of the two models is able to recommend any item; likewise, if a new item hasn’t been rated by anyone, it will never be recommended. To handle this cold-start problem, as recommended in the book “Building a Recommendation System with R”, we should incorporate other information, such as user profiles and item descriptions, into our recommendation systems. This leads to a hybrid recommender system, a combination of item-based and/or user-based with content-based filtering models, which usually gives better results; a minimal sketch follows.
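recommenderlab also ships a simple weighted HybridRecommender that can serve as a starting point; a minimal sketch (the weights are illustrative, and a true content-based component would additionally need the item metadata from the movies dataset):
# Hypothetical weighted hybrid: a popularity baseline blended with UBCF
rec_hybrid <- HybridRecommender(
  Recommender(train, method = "POPULAR"),
  Recommender(train, method = "UBCF"),
  weights = c(0.3, 0.7) # illustrative weights
)
pred_hybrid <- predict(rec_hybrid, test_known, n = 5)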
Gorakala, S.K. & Usuelli, M. (2015, Sept). Building a Recommendation System with R (pp. 50-92). Packt Publishing Ltd.
Hahsler, M. & Vereet, B. (2019, Aug 27). Package ‘recommenderlab’. CRAN. Retrieved from https://cran.r-project.org/web/packages/recommenderlab/recommenderlab.pdf.