# loading libraries (note: tidyverse already attaches tidyr, readr, dplyr and ggplot2)
library(tidyverse)
library(tidyr)
library(readr)
library(data.table)
library(dplyr)
library(recommenderlab)
library(knitr)
library(kableExtra)
library(ggplot2)

Introduction

In this project we use collaborative filtering, in which users' preferences for previously rated movies (or other relevant items) drive the recommendations, either item-based or user-based. We will measure the accuracy of the models and then use the best one to build a Shiny App that provides recommendations based on a user's preferences. We decided to continue working with the MovieLens dataset, which contains both movies and ratings. Once the data is trained and a model is selected, we plan to deploy the results in a Shiny App to demonstrate the applicability of the model.

MovieLens Dataset

The MovieLens data was originally collected through the MovieLens web site (movielens.umn.edu); recommenderlab even ships a built-in sample of it (MovieLense, with metadata in MovieLenseMeta) covering the seven-month period from September 19th, 1997 through April 22nd, 1998, with ratings (1-5) from 943 users. In this project, however, we work with the "ml-latest-small" release downloaded from "https://grouplens.org/datasets/movielens/latest/", which contains 100,836 ratings (0.5-5) from 610 users on 9,742 movies. That page offers two versions; we took the smaller one because the full version has about 58,000 movies reviewed by roughly 280,000 users and took too long to process. The download includes several files, but we only use movies.csv and ratings.csv.

Let’s get the data now

ratings <- read.csv("C:/Users/hukha/Desktop/MS - Data Science/Data 612 - Recommender Systems/Data 612 - Recommender Systems/ml-latest-small/ratings.csv", stringsAsFactors = FALSE) %>% select(-c(timestamp))
movies <- read.csv("C:/Users/hukha/Desktop/MS - Data Science/Data 612 - Recommender Systems/Data 612 - Recommender Systems/ml-latest-small/movies.csv")
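The absolute Windows path above ties the script to one machine. A minimal, hedged alternative (assuming the ml-latest-small folder sits next to the script; the relative path is illustrative) would be:

# hypothetical relative paths; adjust to wherever ml-latest-small is unpacked
ratings <- read.csv(file.path("ml-latest-small", "ratings.csv"), stringsAsFactors = FALSE) %>%
  select(-timestamp)
movies <- read.csv(file.path("ml-latest-small", "movies.csv"), stringsAsFactors = FALSE)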

Data distribution

There are 100,836 ratings in the ratings dataset, submitted by 610 users for 9,724 distinct movies. The movies dataset lists 9,742 movies along with their ids, which will be useful for joining with the ratings dataset later on. It also contains each movie's genres, which will help users select their preferences and let us categorize movies accordingly. Later on, we will split these genres into separate columns.

# Ratings
head(ratings)
##   userId movieId rating
## 1      1       1      4
## 2      1       3      4
## 3      1       6      4
## 4      1      47      5
## 5      1      50      5
## 6      1      70      3
summary(ratings$rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   3.000   3.500   3.502   4.000   5.000
# Movies
head(movies)
##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

Data Preparation

As mentioned earlier, the movies data lists each movie together with its genres. The problem is that all of a movie's genres are stored together in one column; because genres will be useful for capturing user preferences, we need to split them apart.

Creating a matrix of movies with their genres separated

#head(movies)
search_movies <- movies %>% separate(col= genres, into= c(as.character(1:10)), sep="[|]")
## Warning: Expected 10 pieces. Missing pieces filled with `NA` in 9741
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
head(search_movies) %>% kable(caption="List of Movies with Genres") %>% kable_styling()
List of Movies with Genres
movieId title 1 2 3 4 5 6 7 8 9 10
1 Toy Story (1995) Adventure Animation Children Comedy Fantasy NA NA NA NA NA
2 Jumanji (1995) Adventure Children Fantasy NA NA NA NA NA NA NA
3 Grumpier Old Men (1995) Comedy Romance NA NA NA NA NA NA NA NA
4 Waiting to Exhale (1995) Comedy Drama Romance NA NA NA NA NA NA NA
5 Father of the Bride Part II (1995) Comedy NA NA NA NA NA NA NA NA NA
6 Heat (1995) Action Crime Thriller NA NA NA NA NA NA NA
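An alternative worth noting, sketched here under the assumption that the same movies data frame is in memory: tidyr::separate_rows produces one row per movie-genre pair instead of fixed columns, which can be handy for filtering by genre.

# long format: one row per (movie, genre) pair
movies_long <- movies %>% separate_rows(genres, sep = "[|]")
head(movies_long)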

Creating a realRatingMatrix

rm <- dcast(ratings, userId~movieId, value.var="rating", na.rm=FALSE)
## Warning in dcast(ratings, userId ~ movieId, value.var = "rating", na.rm =
## FALSE): The dcast generic in data.table has been passed a data.frame and
## will attempt to redirect to the reshape2::dcast; please note that reshape2
## is deprecated, and this redirection is now deprecated as well. Please
## do this redirection yourself like reshape2::dcast(ratings). In the next
## version, this warning will become an error.
rm <- as.matrix(rm[,-1])

rm <- as(rm, "realRatingMatrix")
rm
## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
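The deprecation warning above suggests calling reshape2::dcast directly. A minimal sketch of that variant (same result, assuming reshape2 is installed):

rm2 <- reshape2::dcast(ratings, userId ~ movieId, value.var = "rating")
rm2 <- as(as.matrix(rm2[, -1]), "realRatingMatrix")
rm2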

Recommendation Models

We are going to use built-in functions from “recommenderlab” to create recommendation systems. Let’s explore which functions exist in recommenderlab that can be useful later.

recommender_models <- recommenderRegistry$get_entries(dataType= "realRatingMatrix")
names(recommender_models)
##  [1] "ALS_realRatingMatrix"          "ALS_implicit_realRatingMatrix"
##  [3] "IBCF_realRatingMatrix"         "LIBMF_realRatingMatrix"       
##  [5] "POPULAR_realRatingMatrix"      "RANDOM_realRatingMatrix"      
##  [7] "RERECOMMEND_realRatingMatrix"  "SVD_realRatingMatrix"         
##  [9] "SVDF_realRatingMatrix"         "UBCF_realRatingMatrix"

Above are the names of the algorithms available in recommenderlab. We will use the IBCF and UBCF models and compare their performance to see which one recommends movies better. Both are collaborative filtering methods based on the similarity between either users or items, where similarity can be measured with cosine, Pearson, or Jaccard measures.
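For intuition, cosine similarity is the dot product of two rating vectors divided by the product of their norms; a toy example with made-up vectors (not the actual data):

u <- c(5, 3, 4)
v <- c(4, 2, 5)
cosine_uv <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
cosine_uv  # about 0.97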

Data Exploration

Now we will compute the similarity among the first few users and among the first few movies using the cosine method. Let's see what that looks like.

Similarity

# Similarity among the users
similarity_users <- as.matrix(similarity(rm[1:5, ], method="cosine", which="users"))
image(similarity_users, main="User similarity")

# Similarity among the movies
similarity_movies <- as.matrix(similarity(rm[, 1:5], method="cosine", which="items"))
image(similarity_movies, main="Movies similarity")

Ratings distribution

Let's visualize the overall ratings that users gave to the movies. Movies that were not rated are excluded from the graph.

rating_values <- as.vector(rm@data)
rating_values <- factor(rating_values[rating_values != 0])

# Now let's visualize it
plot(rating_values, main= " Ratings Distribution")

We can see that most ratings fall between 3 and 4, a fair number are around 5, and very few are below 2.

Most Watched Movies

view_per_movie <- colCounts(rm) # Total views for each movie

count_views <- data.frame(movie = names(view_per_movie), views= view_per_movie)
count_views <- count_views[order(count_views$views, decreasing=TRUE), ]
count_views$title <- NA
# Adding movie titles
for (i in 1:9724){
  count_views[i,3] <- as.character(subset(movies, 
                                         movies$movieId == count_views[i,1])$title)
}

head(count_views, 15) %>% kable() %>% kable_styling(full_width = FALSE)
movie views title
356 356 329 Forrest Gump (1994)
318 318 317 Shawshank Redemption, The (1994)
296 296 307 Pulp Fiction (1994)
593 593 279 Silence of the Lambs, The (1991)
2571 2571 278 Matrix, The (1999)
260 260 251 Star Wars: Episode IV - A New Hope (1977)
480 480 238 Jurassic Park (1993)
110 110 237 Braveheart (1995)
589 589 224 Terminator 2: Judgment Day (1991)
527 527 220 Schindler’s List (1993)
2959 2959 218 Fight Club (1999)
1 1 215 Toy Story (1995)
1196 1196 211 Star Wars: Episode V - The Empire Strikes Back (1980)
50 50 204 Usual Suspects, The (1995)
2858 2858 204 American Beauty (1999)
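The title lookup above works but loops over all 9,724 movies; a hedged one-step alternative using a dplyr join (same objects as above, the new data frame name is illustrative) could look like this:

count_views_joined <- data.frame(movieId = as.integer(names(view_per_movie)),
                                 views = view_per_movie) %>%
  arrange(desc(views)) %>%
  left_join(select(movies, movieId, title), by = "movieId")
head(count_views_joined, 15)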

Now let's visualize these most watched movies.

ggplot(count_views[1:15,], aes(reorder(x=title, views),y= views))+geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=45, hjust=1))+labs(title="Top Watching Movies", x="Movies", y="Views")

Forrest Gump and The Shawshank Redemption are the most watched movies in the dataset (last updated in 2018).

Relevant Data

There are many missing values in the data, representing movies that particular users did not review; realistically, it is almost impossible for every user to rate every movie in the database. That being said, we will keep only the relevant data: users who rated more than 50 movies and movies rated by more than 50 users.

movie_ratings <- rm[rowCounts(rm) > 50, colCounts(rm) > 50]
movie_ratings
## 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.

Now let’s visualize the reviews for the relevant data.

qplot(rowMeans(movie_ratings), binwidth=0.1, main="Distribution of Ratings")

Data Normalization

Data normalization is an important step before building a recommender because it removes the bias introduced by users who systematically rate high or low, leading to a more accurate model. We will use the normalize function from the recommenderlab library.

movie_ratings_norm <- normalize(movie_ratings)
sum(rowMeans(movie_ratings_norm) > 0.0001)
## [1] 0
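The zero count confirms the centering: "center" normalization subtracts each user's mean rating, so every row mean of the normalized matrix should be numerically zero. A quick hedged check:

head(round(rowMeans(movie_ratings_norm), 4))  # expected to be ~0 for every user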

Now let's create a heatmap to see how the normalize function from recommenderlab helped remove the bias.

image(movie_ratings_norm[rowCounts(movie_ratings) > quantile(rowCounts(movie_ratings),0.95),
                         colCounts(movie_ratings) > quantile(colCounts(movie_ratings), 0.95)], main="Normalized Heatmap for top data")

As we can see, most of the data is now centered around zero. The few points that lie far from zero appear because the heatmap shows only the top users and movies rather than the entire dataset. The data is normalized, so we can go ahead and build recommender systems on it.

The data is now ready, but one last transformation can be helpful: converting the ratings to binary values. In one version, 1 marks a movie that has been rated and 0 marks a missing rating; in another, 1 marks a rating above a chosen threshold (a good rating) and 0 marks everything else. Let's create a heatmap for each and see what they look like.

movie_ratings_watched <- binarize(movie_ratings, minRating = 1)
boolean_min_movies <- quantile(rowCounts(movie_ratings), 0.95)
boolean_min_users <- quantile(colCounts(movie_ratings), 0.95)
image(movie_ratings_watched[rowCounts(movie_ratings) > boolean_min_movies,
                             colCounts(movie_ratings) > boolean_min_users], 
main = "Heatmap - Top Users & Movies")

movie_ratings_good <- binarize(movie_ratings, minRating = 3)
image(movie_ratings_good[rowCounts(movie_ratings) > boolean_min_movies, 
colCounts(movie_ratings) > boolean_min_users], 
main = "Heatmap - Top Users & Movies")

The second heatmap has fewer dark cells than the first, because movies rated below 3 are now treated the same as movies that were never viewed.

Item-based Collaborative Filtering

This type of collaborative filtering is based on the similarity between two items as rated by the same users. For each item, the k most similar items are identified, and for each user the algorithm recommends the items most similar to the ones that user has rated highly.

Splitting train and test sets

Before building a model, let's split the data, with 80% in the training set and 20% in the test set.

# Criteria
which_train <- sample(x= c(TRUE, FALSE), size= nrow(movie_ratings), replace= TRUE, prob= c(0.8, 0.2))


# Creating train and test
movie_train <- movie_ratings[which_train, ]
movie_test <- movie_ratings[!which_train, ]
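One caveat: sample() draws a different split on every run, so the counts reported below may vary slightly. A minimal sketch of making the split reproducible (the seed value 1 is arbitrary):

set.seed(1)  # any fixed seed makes the split reproducible
which_train <- sample(c(TRUE, FALSE), size = nrow(movie_ratings), replace = TRUE, prob = c(0.8, 0.2))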

Now that we have split movie_ratings into train and test sets, we can go ahead and build the IBCF model.

Build a model

For the sake of learning, let's first see which parameters the IBCF algorithm accepts; we will discuss them further below.

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE

By default, k is 30 and similarity is calculated with the cosine method. Now let's build the model.

# Building a model
movie_model <- Recommender(data= movie_train, method="IBCF", parameter= list(k=30))
movie_model
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 309 users.

A model has been built on train dataset.

Implementation of IBCF Model

# Prediction
model_predictions1 <- predict(object= movie_model, newdata= movie_test, n=10)
model_predictions1
## Recommendations as 'topNList' with n = 10 for 69 users.
# Let's check recommended movies for user1

user1_recommendation <- model_predictions1@items[[1]]
movies_user1 <- model_predictions1@itemLabels[user1_recommendation]
#movies_user1

# Note: only the first 8 of the 10 recommendations are converted to titles here,
# which is why the last two entries in the output below remain raw movie ids.
for (i in 1:8){
  movies_user1[i] <- as.character(subset(movies, movies$movieId == movies_user1[i])$title)
}
movies_user1
##  [1] "First Knight (1995)"                                
##  [2] "Judge Dredd (1995)"                                 
##  [3] "Lion King, The (1994)"                              
##  [4] "City Slickers II: The Legend of Curly's Gold (1994)"
##  [5] "Cliffhanger (1993)"                                 
##  [6] "In the Line of Fire (1993)"                         
##  [7] "Aladdin (1992)"                                     
##  [8] "Beauty and the Beast (1991)"                        
##  [9] "597"                                                
## [10] "708"

Above are the recommended movies for User 1; the last two entries remain raw movie ids because only the first eight were converted to titles.

users <- paste0("User ", seq(1:8))
matrix_recommendation <- sapply(model_predictions1@items, 
                      function(x){ as.integer(colnames(movie_ratings)[x]) }) 
m <- matrix_recommendation[,1:8]
colnames(m) <- users
m 
##       User 1 User 2 User 3 User 4 User 5 User 6 User 7 User 8
##  [1,]    168   2080    161      3    349   3033     70   3147
##  [2,]    173    903   1220      7    357    480    110    587
##  [3,]    364  49530   4027     36    919    485    111  54503
##  [4,]    432  52973   6942    231   1097    858    145  63082
##  [5,]    434   1036  68954    293   2078   1196    160   3408
##  [6,]    474   1923  53125    329  48385   1200    185    317
##  [7,]    588     21    595    356 122904   1247    223   2011
##  [8,]    595    300  32587    434   1645   2502    231   2355
##  [9,]    597    357   5989    435   2115   2571    500   3489
## [10,]    708    508   1625    509   2617   2617    555  30707

The matrix above shows the 10 recommended movie ids for the first 8 users in the test set.

num_items <- factor(table(matrix_recommendation))

num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 10)
table_top <- data.frame(as.integer(names(num_items_top)),
                       num_items_top)

for (i in 1:10){
  table_top[i,1] <- as.character(subset(movies, 
                                         movies$movieId == table_top[i,1])$title)
}


colnames(table_top) <- c("Movie Title", "# Items")
head(table_top)
##                    Movie Title # Items
## 3      Grumpier Old Men (1995)      13
## 70  From Dusk Till Dawn (1996)       8
## 172     Johnny Mnemonic (1995)       8
## 16               Casino (1995)       7
## 292            Outbreak (1995)       7
## 587               Ghost (1990)       7

Most of the movies have been recommended only a few times and a few movies have been recommended many times. IBCF recommends items on the basis of the similarity matrix. It’s an eager-learning model, that is, once it is built, it doesn’t need to access the initial data. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data. In addition, this algorithm is efficient and scalable, so it works well with big rating matrices. Its accuracy is rather good, compared with other recommendation models.

User-based Collaborative Filtering

This type of collaborative filtering focuses on similarity between users rather than between items. Similarity is calculated among users under the assumption that similar users like the same movies. Similar users are identified first, and then the top-rated items among those users are recommended.

For each new user, these are the steps:

1- Measure how similar each user is to the new one. As with IBCF, popular similarity measures are correlation and cosine.
2- Identify the most similar users. The options are to take the top k users (k-nearest neighbours) or to take the users whose similarity is above a defined threshold.
3- Rate the items purchased by the most similar users. The rating is the average rating among those users, either a plain average or a weighted average (see the sketch after this list).
4- Pick the top-rated items.
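A toy illustration of the weighted-average step (step 3), with made-up similarities and ratings rather than recommenderlab's internal code:

sims <- c(0.9, 0.7, 0.4)             # similarities of the three nearest neighbours
neighbour_ratings <- c(5, 4, 3)      # their ratings for one candidate movie
sum(sims * neighbour_ratings) / sum(sims)  # weighted average prediction, 4.25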

Create a model

Let’s go ahead and create a model.

recommender_models2 <- Recommender(movie_train, method="UBCF")
recommender_models2
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 309 users.

The function above went through the training set and learned from 309 users before making predictions.

# Details of model
results_model <- getModel(recommender_models2)
results_model
## $description
## [1] "UBCF-Real data: contains full or sample of data set"
## 
## $data
## 309 x 436 rating matrix of class 'realRatingMatrix' with 30252 ratings.
## Normalized using center on rows.
## 
## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $normalize
## [1] "center"
## 
## $verbose
## [1] FALSE
results_model$data
## 309 x 436 rating matrix of class 'realRatingMatrix' with 30252 ratings.
## Normalized using center on rows.

Implement the model

As with IBCF, we now apply the model to the test data and see what the results look like.

model_predictions2 <- predict(object= recommender_models2, newdata= movie_test, n= 10)
model_predictions2
## Recommendations as 'topNList' with n = 10 for 69 users.

The function above produced the top 10 recommended movies for each of 69 users.

Let's check the first 8 users and see what the recommendations look like.

users <- paste0("User ", seq(1:8))

matrix_recommendation2 <- sapply(model_predictions2@items, function(x){
  colnames(movie_ratings)[x]
} )

m <- matrix_recommendation2[, 1:8]
colnames(m) <- users
m 
##       User 1  User 2  User 3 User 4  User 5  User 6  User 7 User 8
##  [1,] "8961"  "589"   "1387" "318"   "49272" "589"   "110"  "593" 
##  [2,] "1193"  "2028"  "1200" "58559" "541"   "1198"  "337"  "223" 
##  [3,] "904"   "223"   "1035" "5952"  "2571"  "260"   "47"   "36"  
##  [4,] "4995"  "49272" "919"  "4993"  "356"   "1214"  "457"  "17"  
##  [5,] "60069" "3996"  "2804" "68954" "4896"  "908"   "36"   "337" 
##  [6,] "293"   "457"   "1097" "4973"  "8961"  "1291"  "1617" "2700"
##  [7,] "4226"  "2355"  "1704" "7153"  "1198"  "480"   "16"   "1197"
##  [8,] "48780" "1584"  "2791" "356"   "5418"  "750"   "904"  "457" 
##  [9,] "5618"  "1485"  "1203" "7361"  "6711"  "588"   "597"  "235" 
## [10,] "91529" "588"   "1028" "49272" "1270"  "60069" "349"  "594"

The above matrix shows the top 10 recommended movies for first 8 users. These numbers show the movie ids.

Top Movie Titles

# Note: num_items below still comes from the IBCF recommendations (matrix_recommendation);
# to summarise the UBCF output it should be recomputed, e.g. from matrix_recommendation2.
num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(num_items_top)), num_items_top)

for (i in 1:4){
  table_top[i,1] <- as.character(subset(movies, 
                                         movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie Titles", "# Items")
head(table_top)
##                   Movie Titles # Items
## 3      Grumpier Old Men (1995)      13
## 70  From Dusk Till Dawn (1996)       8
## 172     Johnny Mnemonic (1995)       8
## 16               Casino (1995)       7

Comparing the results of UBCF with IBCF helps in understanding the algorithms better. UBCF needs to access the initial data, so it is a lazy-learning model. Since it needs to keep the entire database in memory, it doesn't work well with a big rating matrix, and building the similarity matrix requires a lot of computing power and time. However, UBCF has been shown to be slightly more accurate than IBCF, so it's a good option if the dataset is not too big.

Evaluation of Recommender Systems

In order to evaluate the recommender systems we have to go through the following steps:

1- Prepare the data to evaluate performance
2- Evaluate the performance of some models
3- Choose the best performing models
4- Optimize model parameters

Prepare the data to evaluate performance

To evaluate models, we need to build them with some data and test them on other data, i.e. a train set and a test set; this is known as data splitting. We can also use the bootstrapping or k-fold approaches.

Splitting data

We are going to split the data into train and test datasets with proportion of 80% and 20% respectively. For each user in the test set, we need to define how many items to use to generate recommendations. The remaining items will be used to test the model accuracy. It is better that this parameter is lower than the minimum number of items purchased by any user so that we don’t have users without items to test the models:

train_percent <- 0.80
min(rowCounts(movie_ratings))
## [1] 11

Since the minimum number of ratings per user is 11, we choose a number of given items below that; here we keep 8 items per test user.

# Parameters
keep <- 8
rating_threshold <- 3
n_eval <- 1


# Let's split the data through evaluationScheme function from recommenderlab
eval_sets <- evaluationScheme(data= movie_ratings,
                              method="split",
                              train= train_percent,
                              given= keep,
                              goodRating = rating_threshold, k=n_eval)
eval_sets
## Evaluation scheme with 8 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.

Now, to extract the sets, we use the getData function from recommenderlab. Three sets will be extracted: train, known, and unknown. "known" is the part of the test set whose items are used to build the recommendations, "unknown" is the part used to test the recommendations, and "train" is the training set.

getData(eval_sets, "train")
## 302 x 436 rating matrix of class 'realRatingMatrix' with 28057 ratings.
getData(eval_sets, "known")
## 76 x 436 rating matrix of class 'realRatingMatrix' with 608 ratings.
getData(eval_sets, "unknown")
## 76 x 436 rating matrix of class 'realRatingMatrix' with 7549 ratings.

As we can see, movie_ratings has been split into three datasets, each of class realRatingMatrix.

Bootstrapping approach

In this approach, rather than splitting the data into two parts, we sample the rows with replacement. The same user can be sampled more than once, so if the training set has the same size as before, there will be more users in the test set. This is called bootstrapping, and recommenderlab supports it directly.

# Bootstrapping approach
eval_sets2 <- evaluationScheme(data= movie_ratings,
                               method="bootstrap",
                               train= train_percent,
                               given= keep,
                               goodRating= rating_threshold,
                               k=n_eval)

table_train <- table(eval_sets2@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
qplot(n_repetitions) + 
  ggtitle("Repetitions in the Training Set")

All of the users have been sampled fewer than four times.

k-fold Approach

The two previous approaches tested the recommender on only part of the users. If, instead, we test the recommendations on each user, we can measure performance much more accurately. We split the data into chunks, take one chunk out as the test set, evaluate the accuracy, and then repeat with the other chunks. This is known as the k-fold approach.

n_total <- 4
eval_sets3 <- evaluationScheme(data=movie_ratings,
                               method="cross-validation",
                               k= n_total,
                               given= keep,
                               goodRating= rating_threshold)
size_sets <- sapply(eval_sets3@runsTrain, length)
size_sets
## [1] 282 282 282 282

With the k-fold approach we get four training sets of 282 users each.

Evaluating recommender techniques

In order to recommend items to new users, collaborative filtering estimates the ratings of items that are not yet purchased. Then, it recommends the top-rated items. At the moment, let’s forget about the last step. We can evaluate the model by comparing the estimated ratings with the real ones.

Evaluate Ratings

Let's prepare the data for validation using the k-fold approach, since it seems to be the most accurate approach for evaluation.

evaluate_accuracy <- evaluationScheme(data= movie_ratings,
                                      method="cross-validation",
                                      k= n_total,
                                      given= keep,
                                      goodRating= rating_threshold)
evaluate_accuracy
## Evaluation scheme with 8 items given
## Method: 'cross-validation' with 4 run(s).
## Good ratings: >=3.000000
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.

Now we have to define the model that needs to be evaluated.

evaluation_for_model <- "IBCF"
model_parameters <- NULL

# model for evaluation
evaluate_model <- Recommender(data= getData(evaluate_accuracy, "train"),
                              method= evaluation_for_model, parameter=model_parameters)
evaluate_model
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 282 users.
items_to_recommend <- 10
evaluate_prediction <- predict(object= evaluate_model, newdata= getData(evaluate_accuracy, "known"),
                               n= items_to_recommend,
                               type="ratings")
evaluate_prediction
## 96 x 436 rating matrix of class 'realRatingMatrix' with 15490 ratings.
qplot(rowCounts(evaluate_prediction)) + 
  geom_histogram(binwidth = 10) +
  ggtitle("Distribution of Movies, per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The number of predicted ratings per user is roughly between 100 and 220.

Next we measure the accuracy: RMSE, MSE, and MAE can all be computed with calcPredictionAccuracy from recommenderlab. Let's calculate the accuracy per user.

evaluate_accuracy2 <- calcPredictionAccuracy(x= evaluate_prediction, 
                                             data = getData(evaluate_accuracy, "unknown"), 
                                             byUser=TRUE)
head(evaluate_accuracy2, 10)
##            RMSE       MSE       MAE
##  [1,] 1.0741723 1.1538462 0.8461538
##  [2,] 1.3660964 1.8662194 0.9317085
##  [3,] 0.9149431 0.8371209 0.6887890
##  [4,] 0.9326441 0.8698250 0.7492508
##  [5,] 0.5679543 0.3225721 0.4626137
##  [6,] 2.1634577 4.6805491 1.7139765
##  [7,] 1.2817399 1.6428571 1.0000000
##  [8,] 1.3416408 1.8000000 1.0000000
##  [9,] 1.1180044 1.2499338 0.8951357
## [10,] 0.9443153 0.8917314 0.6846772
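For reference, the three metrics reduce to simple functions of the prediction errors; a toy example with made-up vectors:

actual <- c(4, 3, 5, 2)
predicted <- c(3.5, 3, 4, 3)
errors <- actual - predicted
c(RMSE = sqrt(mean(errors^2)), MSE = mean(errors^2), MAE = mean(abs(errors)))  # 0.75, 0.5625, 0.625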

Now let’s visualize the RMSE for each user.

qplot(evaluate_accuracy2[, "RMSE"]) + 
  geom_histogram(binwidth = 0.1) +
  ggtitle("Distribution of  RMSE, by user")

Most per-user RMSE values fall roughly between 0.5 and 2. Now let's calculate a performance index for the whole model by setting byUser = FALSE.

evaluate_accuracy3 <- calcPredictionAccuracy(x= evaluate_prediction,
                                             data= getData(evaluate_accuracy, "unknown"),
                                             byUser= FALSE)
evaluate_accuracy3
##      RMSE       MSE       MAE 
## 1.2668545 1.6049204 0.9511248

RMSE for the entire model is about 1.27.

Evaluating Recommendations

Another way to measure accuracy is to compare the recommendations with the movies that received a positive rating. Let's go ahead and see what that looks like.

results <- evaluate(x= evaluate_accuracy,
                    method= evaluation_for_model,
                    n = seq(10,100,10))
## IBCF run fold/sample [model time/prediction time]
##   1  [0.3sec/0.03sec] 
##   2  [0.29sec/0.02sec] 
##   3  [0.32sec/0.01sec] 
##   4  [0.3sec/0.01sec]
results
## Evaluation results for 4 folds/samples using method 'IBCF'.

Now, using getConfusionMatrix, we can extract a list of confusion matrices. Each element of the list corresponds to a different split of the k-fold. Let's take a look at the first element.

head(getConfusionMatrix(results))[[1]]
##            TP        FP       FN       TN precision     recall        TPR
## 10   1.510417  8.489583 73.67708 344.3229 0.1510417 0.02563164 0.02563164
## 20   3.281250 16.718750 71.90625 336.0938 0.1640625 0.05133733 0.05133733
## 30   5.250000 24.750000 69.93750 328.0625 0.1750000 0.07835795 0.07835795
## 40   7.020833 32.979167 68.16667 319.8333 0.1755208 0.10393649 0.10393649
## 50   8.510417 41.458333 66.67708 311.3542 0.1703014 0.12135278 0.12135278
## 60  10.135417 49.729167 65.05208 303.0833 0.1692598 0.14481997 0.14481997
## 70  11.614583 58.031250 63.57292 294.7812 0.1668231 0.16417442 0.16417442
## 80  13.270833 66.031250 61.91667 286.7812 0.1673626 0.18504456 0.18504456
## 90  14.645833 74.125000 60.54167 278.6875 0.1651420 0.20000847 0.20000847
## 100 16.093750 81.916667 59.09375 270.8958 0.1646384 0.21597834 0.21597834
##            FPR
## 10  0.02442051
## 20  0.04785770
## 30  0.07046754
## 40  0.09383172
## 50  0.11789477
## 60  0.14147531
## 70  0.16505899
## 80  0.18772660
## 90  0.21074200
## 100 0.23295844

We can sum TP, FP, FN, and TN across the folds and look at them in table form; the remaining columns are easier to judge visually.

sum_col <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, sum_col]
head(indices_summed)
##           TP        FP       FN       TN
## 10  6.833333  33.06250 307.6979 1364.406
## 20 14.114583  65.67708 300.4167 1331.792
## 30 21.291667  98.37500 293.2396 1299.094
## 40 28.427083 130.69792 286.1042 1266.771
## 50 34.885417 163.25000 279.6458 1234.219
## 60 41.770833 195.05208 272.7604 1202.417

Now let’s visualize ROC

plot(results, annotate=TRUE, main ="ROC Curve")

Two accuracy metrics are precision and recall. Precision is the percentage of recommended items that were actually purchased: the number of True Positives (TP) divided by the total number of recommended items (TP + FP). Recall is the percentage of purchased items that were recommended: the number of True Positives (TP) divided by the total number of purchases (TP + FN).
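As a quick illustration using the summed counts from the n = 10 row of the table above (approximate values):

TP <- 6.83; FP <- 33.06; FN <- 307.70
TP / (TP + FP)  # precision, about 0.17
TP / (TP + FN)  # recall, about 0.02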

Recommending only a few items typically yields higher precision but lower recall, while recommending more items raises recall at the cost of precision. Let's plot precision versus recall to see this trade-off.

plot(results, "prec/rec", annotate=TRUE, main = "Precision / Recall")

The plot shows the tradeoff between precision and recall. Even if the curve is not perfectly monotonic, the trends are as expected.

Identify the best model

In order to compare different models, we evaluate the following baseline models:

  • Item-Based Collaborative Filtering - using the Cosine as the distance function
  • Item-based Collaborative Filtering - using the Pearson correlation as the distance function
  • User-based Collaborative Filtering - using the Cosine as the distance function
  • User-based Collaborative Filtering - using the Pearson correlation as the distance function
  • Random recommendations

baseline <- list(
IBCF_cosine = list(name = "IBCF", 
                param = list(method = "cosine")),
IBCF_pearson = list(name = "IBCF", 
                param = list(method = "pearson")),
UBCF_cosine = list(name = "UBCF", 
                param = list(method = "cosine")),
UBCF_pearson = list(name = "UBCF", 
                param = list(method = "pearson")),
Random = list(name = "RANDOM", param=NULL)
)
baseline
## $IBCF_cosine
## $IBCF_cosine$name
## [1] "IBCF"
## 
## $IBCF_cosine$param
## $IBCF_cosine$param$method
## [1] "cosine"
## 
## 
## 
## $IBCF_pearson
## $IBCF_pearson$name
## [1] "IBCF"
## 
## $IBCF_pearson$param
## $IBCF_pearson$param$method
## [1] "pearson"
## 
## 
## 
## $UBCF_cosine
## $UBCF_cosine$name
## [1] "UBCF"
## 
## $UBCF_cosine$param
## $UBCF_cosine$param$method
## [1] "cosine"
## 
## 
## 
## $UBCF_pearson
## $UBCF_pearson$name
## [1] "UBCF"
## 
## $UBCF_pearson$param
## $UBCF_pearson$param$method
## [1] "pearson"
## 
## 
## 
## $Random
## $Random$name
## [1] "RANDOM"
## 
## $Random$param
## NULL

In order to evaluate the models properly, we need to test them, varying the number of items. For instance, we might want to recommend up to 100 movies to each user. Since 100 is already a big number of recommendations, we don’t need to include higher values.

n_recommendations <- c(1, 5, seq(10, 100, 10))
results_list <- evaluate(x = eval_sets, 
                         method = baseline, 
                         n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.33sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.33sec/0.02sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.06sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.06sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.03sec]
sapply(results_list, class) == "evaluationResults"
##  IBCF_cosine IBCF_pearson  UBCF_cosine UBCF_pearson       Random 
##         TRUE         TRUE         TRUE         TRUE         TRUE
avg_matrices <- lapply(results_list, avg)
head(avg_matrices$IBCF_cosine[, 5:8])
##    precision      recall         TPR        FPR
## 1  0.1842105 0.002068477 0.002068477 0.00240568
## 5  0.1921053 0.011795969 0.011795969 0.01195598
## 10 0.1986842 0.023051880 0.023051880 0.02350827
## 20 0.2092105 0.048745419 0.048745419 0.04607351
## 30 0.2057018 0.073326526 0.073326526 0.07009694
## 40 0.1927632 0.089933032 0.089933032 0.09524585

It's easier to inspect the results visually to identify the best model. We can compare the models by plotting their ROC curves.

plot(results_list, annotate=1, legend="topleft", main="ROC Curve")

A good performance index is the area under the ROC curve. The highest curve is UBCF_pearson, so it is the best performing technique. Let's also look at the precision-recall plot.

plot(results_list, "prec/rec", annotate = 1, legend = "bottomright", main="Precision - Recall")

We can see that UBCF_pearson is still at the top and hence the best performing technique out of all the other models.

Optimizing a numeric parameter

Recommendation models often have numeric parameters. For instance, IBCF takes into account the k closest items. How can we optimize k? As with the other parameters, we can test several values and check which performs best. So far, k was left at its default value of 30; now let's explore values ranging between 5 and 40. We start from UBCF_pearson, since it was the best technique above (note, as the warnings below show, that k is actually an IBCF parameter; UBCF controls its neighbourhood size through nn).

k <- c(5, 10, 20, 30, 40)
baseline <- lapply(k, function(k){
  # UBCF has no `k` parameter (its neighbourhood size is `nn`), so recommenderlab
  # warns "Unknown parameters: k" below and every run falls back to the defaults.
  list(name = "UBCF",
       param = list(method = "pearson", k = k))
})
names(baseline) <- paste0("UBCF_k_", k)
names(baseline)
## [1] "UBCF_k_5"  "UBCF_k_10" "UBCF_k_20" "UBCF_k_30" "UBCF_k_40"
n_recommendations <- c(1,5, seq(10,100,10))
results_list <- evaluate(x= evaluate_accuracy, method= baseline, n= n_recommendations)
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.09sec] 
##   2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.08sec] 
##   3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.08sec] 
##   4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.11sec] 
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0.01sec/0.08sec] 
##   2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.09sec] 
##   3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.1sec] 
##   4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.09sec] 
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.08sec] 
##   2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.09sec] 
##   3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.1sec] 
##   4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.1sec] 
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.08sec] 
##   2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.08sec] 
##   3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.09sec] 
##   4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.09sec] 
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0.02sec/0.08sec] 
##   2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.08sec] 
##   3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0.02sec/0.08sec] 
##   4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
## [0sec/0.1sec]

Let’s plot ROC curve to see which parameter is the best.

plot(results_list, annotate=1, legend="topleft", main= "ROC Curve")

plot(results_list, "prec/rec", annotate = 1, legend = "bottomright", main="Precision - Recall")

All five UBCF_pearson curves coincide. The reason is visible in the warnings above: UBCF has no k parameter (its neighbourhood size is set through nn), so every run ignored the supplied k and fell back to the default settings. We chose UBCF_pearson because it was the best performing technique in the previous graphs.
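A hedged sketch of how the sweep could be rewritten so the parameter actually takes effect (nn is UBCF's neighbourhood size; the object names are illustrative):

nn_values <- c(5, 10, 20, 30, 40)
baseline_nn <- lapply(nn_values, function(nn) {
  list(name = "UBCF", param = list(method = "pearson", nn = nn))
})
names(baseline_nn) <- paste0("UBCF_nn_", nn_values)
results_nn <- evaluate(x = evaluate_accuracy, method = baseline_nn, n = n_recommendations)
plot(results_nn, annotate = 1, legend = "topleft", main = "ROC Curve (varying nn)")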

Conclusion

The idea was to create a recommendation system that helps users find movies they might like. We started with collaborative filtering (both UBCF and IBCF) to check which works better, and the results showed that user-based collaborative filtering with Pearson correlation is the best of the models tested. We also tried varying the neighbourhood size, but because UBCF takes this parameter as nn rather than k, every run fell back to the default settings and produced identical curves, so we kept the defaults. Unfortunately, due to time constraints, we could not implement the model in a Shiny App for users to explore the recommendations.

Reference(s)

Building a Recommendation System with R (Suresh Gorakala and Michele Usuelli, 2015)