In this project, I develop a collaborative filtering recommender (CFR) system for recommending movies.
The basic idea of CFR systems is that, if two users shared the same interests in the past (for example, they liked the same book or the same movie), they will also have similar tastes in the future. If, for example, user A and user B have a similar purchase history and user A recently bought a book that user B has not yet seen, the idea is to propose this book to user B.
The collaborative filtering approach considers only user preferences and does not take into account the features or contents of the items (books or movies) being recommended. In this project, in order to recommend movies, I will use a large set of user preferences for movies from a publicly available movie-rating dataset.
For the full R code of this project please visit https://github.com/jeknov/movieRec .
The following libraries were used in this project:
library(recommenderlab)
library(ggplot2)
library(data.table)
library(reshape2)
The dataset used was from MovieLens, and is publicly available at http://grouplens.org/datasets/movielens/latest. In order to keep the recommender simple, I used the smallest dataset available (ml-latest-small.zip), which at the time of download contained 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users between April 03, 1996 and January 09, 2016. This dataset was generated on January 11, 2016.
The data are contained in four files: links.csv, movies.csv, ratings.csv and tags.csv. I only use the files movies.csv and ratings.csv to build a recommendation system.
A summary of movies is given below, together with the first few rows of the data frame:
## movieId title genres
## Min. : 1 Length:10329 Length:10329
## 1st Qu.: 3240 Class :character Class :character
## Median : 7088 Mode :character Mode :character
## Mean : 31924
## 3rd Qu.: 59900
## Max. :149532
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
And here are a summary and the first rows of ratings:
## userId movieId rating timestamp
## Min. : 1.0 Min. : 1 Min. :0.500 Min. :8.286e+08
## 1st Qu.:192.0 1st Qu.: 1073 1st Qu.:3.000 1st Qu.:9.711e+08
## Median :383.0 Median : 2497 Median :3.500 Median :1.115e+09
## Mean :364.9 Mean : 13381 Mean :3.517 Mean :1.130e+09
## 3rd Qu.:557.0 3rd Qu.: 5991 3rd Qu.:4.000 3rd Qu.:1.275e+09
## Max. :668.0 Max. :149532 Max. :5.000 Max. :1.452e+09
## userId movieId rating timestamp
## 1 1 16 4.0 1217897793
## 2 1 24 1.5 1217895807
## 3 1 32 4.0 1217896246
## 4 1 47 4.0 1217896556
## 5 1 50 4.0 1217896523
## 6 1 110 4.0 1217896150
Both userId and movieId are stored as integers and should be changed to factors. The movie genres are not easily usable because of their format; I will deal with this in the next step.
Some pre-processing of the data available is required before creating the recommendation system.
First of all, I will re-organize the information of movie genres in such a way that allows future users to search for the movies they like within specific genres. From the design perspective, this is much easier for the user compared to selecting a movie from a single very long list of all the available movies.
I use a one-hot encoding to create a matrix of corresponding genres for each movie.
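A minimal sketch of this encoding, assuming movies is the data frame read from movies.csv and that its genres column is pipe-separated (object names assumed):
genres_list <- strsplit(as.character(movies$genres), split = "\\|")
all_genres <- sort(unique(unlist(genres_list))) # all genre labels appearing in the data
genre_matrix <- t(sapply(genres_list, function(g) as.integer(all_genres %in% g)))
colnames(genre_matrix) <- all_genres
head(genre_matrix)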
## Action Adventure Animation Children Comedy Crime Documentary Drama
## 1 0 1 1 1 1 0 0 0
## 2 0 1 0 1 0 0 0 0
## 3 0 0 0 0 1 0 0 0
## 4 0 0 0 0 1 0 0 1
## 5 0 0 0 0 1 0 0 0
## 6 1 0 0 0 0 1 0 0
## Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War
## 1 1 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 0 0 1 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
## Western
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Now, I create a search matrix that allows an easy search of a movie by any of its genres.
## movieId title Action Adventure Animation
## 1 1 Toy Story (1995) 0 1 1
## 2 2 Jumanji (1995) 0 1 0
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 1 0 0
## Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1 1 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 1 0 0 0
## 3 0 1 0 0 0 0 0 0 0
## 4 0 1 0 0 1 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 0 0 0
## Mystery Romance Sci-Fi Thriller War Western
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 0
We can see that each movie corresponds to one or more genres.
In order to use the ratings data to build a recommendation engine with recommenderlab, I convert the rating matrix into a sparse matrix of class realRatingMatrix.
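A minimal sketch of the conversion, assuming ratings is the data frame read from ratings.csv:
ratingmat <- dcast(ratings, userId ~ movieId, value.var = "rating") # users in rows, movies in columns
ratingmat <- as.matrix(ratingmat[, -1]) # drop the userId column
ratingmat <- as(ratingmat, "realRatingMatrix") # coerce to recommenderlab's sparse format
ratingmat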
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
The recommenderlab package contains some options for the recommendation algorithm:
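These can be listed from the recommenderlab registry; a short sketch (recommender_models is the object name used further below):
recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models) # available methods
lapply(recommender_models, "[[", "description") # short description of each method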
## [1] "IBCF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [3] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [5] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [7] "UBCF_realRatingMatrix"
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering (real data)."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity (real data)."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation (real data)."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (real data)."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering (real data)."
I will use the IBCF and UBCF models. Let's check the default parameters of these two models.
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
Collaborative filtering algorithms are based on measuring the similarity between users or between items. For this purpose, recommenderlab contains the similarity function. The supported methods to compute similarities are cosine, pearson, and jaccard.
Next, I determine how similar the first four users are to each other by creating and visualizing a similarity matrix based on the cosine distance:
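A sketch of how this can be done with recommenderlab's similarity function (object names assumed):
similarity_users <- similarity(ratingmat[1:4, ], method = "cosine", which = "users")
as.matrix(similarity_users)
image(as.matrix(similarity_users), main = "User similarity")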
## 1 2 3 4
## 1 0.0000000 0.1011133 0.21004361 0.12876575
## 2 0.1011133 0.0000000 0.11555911 0.03461020
## 3 0.2100436 0.1155591 0.00000000 0.05820771
## 4 0.1287658 0.0346102 0.05820771 0.00000000
In the given matrix, each row and each column corresponds to a user, and each cell corresponds to the similarity between two users. The more red the cell is, the more similar two users are. Note that the diagonal is red, since it’s comparing each user with itself.
Using the same approach, I compute similarity between the first four movies.
## 1 2 3 4
## 1 0.0000000 0.3830684 0.3374528 0.1347243
## 2 0.3830684 0.0000000 0.1992068 0.1233765
## 3 0.3374528 0.1992068 0.0000000 0.1733663
## 4 0.1347243 0.1233765 0.1733663 0.0000000
Now, I explore values of ratings.
vector_ratings <- as.vector(ratingmat@data)
unique(vector_ratings) # what are unique values of ratings
## [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5
table_ratings <- table(vector_ratings) # what is the count of each rating value
table_ratings
## vector_ratings
## 0 0.5 1 1.5 2 2.5 3 3.5 4
## 6791761 1198 3258 1567 7943 5484 21729 12237 28880
## 4.5 5
## 8187 14856
There are 11 unique rating values: 0 and the half-point steps from 0.5 to 5.
According to the documentation, a rating equal to 0 represents a missing value, so I remove them from the dataset before visualizing the results.
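A sketch of this step, continuing from vector_ratings defined above:
vector_ratings <- vector_ratings[vector_ratings != 0] # drop the missing values coded as 0
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) + ggtitle("Distribution of the ratings")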
As we can see, there are relatively few low ratings (below 3); the majority of movies are rated 3 or higher. The most common rating is 4.
Now, let's see which movies are viewed the most.
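A sketch of how the view counts can be computed; the title lookup against the movies table is an assumption:
views_per_movie <- colCounts(ratingmat) # number of ratings per movie
table_views <- data.frame(movie = names(views_per_movie), views = views_per_movie)
table_views <- table_views[order(table_views$views, decreasing = TRUE), ]
table_views$title <- movies$title[match(table_views$movie, movies$movieId)]
head(table_views)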
## movie views title
## 296 296 325 Pulp Fiction (1994)
## 356 356 311 Forrest Gump (1994)
## 318 318 308 Shawshank Redemption, The (1994)
## 480 480 294 Jurassic Park (1993)
## 593 593 290 Silence of the Lambs, The (1991)
## 260 260 273 Star Wars: Episode IV - A New Hope (1977)
We see that “Pulp Fiction (1994)” is the most viewed movie, exceeding the second-most-viewed “Forrest Gump (1994)” by 14 views.
Now I identify the top-rated movies by computing the average rating of each of them.
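A sketch of this computation; the 50-view threshold used for the second chart is described below:
average_ratings <- colMeans(ratingmat) # mean of the non-missing ratings per movie
qplot(average_ratings) + ggtitle("Distribution of the average movie rating")
average_ratings_relevant <- average_ratings[views_per_movie > 50]
qplot(average_ratings_relevant) + ggtitle("Average rating of movies viewed more than 50 times")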
The first chart above shows the distribution of the average movie rating. The peak is around 3, and there are a few movies whose average rating is exactly 1 or 5. Most likely, these movies received ratings from only a few people, so we shouldn't take them into account.
I remove the movies whose number of views is below a threshold of 50, creating a subset of only relevant movies. The second chart above shows the distribution of their average ratings. All the ratings are between 2.16 and 4.45. As expected, the extremes were removed, and the peak moves to around 4.
I visualize the whole matrix of ratings by building a heat map whose colors represent the ratings. Each row of the matrix corresponds to a user, each column to a movie, and each cell to its rating.
Since there are too many users and items, the first chart is hard to read. The second chart is built zooming in on the first rows and columns.
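A sketch of the two heatmaps, using recommenderlab's image method for rating matrices (the zoomed range is an arbitrary choice):
image(ratingmat, main = "Heatmap of the rating matrix")
image(ratingmat[1:20, 1:25], main = "Heatmap of the first 20 rows and 25 columns")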
Some users have seen more movies than others. So, instead of displaying some random users and movies, I select only the users who have seen many movies and the movies that have been seen by many users. To identify them, I compute the 99th percentiles of the number of movies per user and of the number of users per movie:
## [1] "Minimum number of movies per user:"
## 99%
## 1198.17
## [1] "Minimum number of users per movie:"
## 99%
## 115
Let's focus on the users who have watched the most movies. Most of them have seen all the top movies, which is not surprising. Some columns of the heatmap are darker than others, meaning that these columns represent the highest-rated movies. Conversely, darker rows represent users who give higher ratings. Because of this, it might be useful to normalize the data, which I will do below.
The data preparation process consists of the following steps: selecting the most relevant data, normalizing the data, and binarizing the data.
In order to select the most relevant data, I keep only the users who have rated more than 50 movies and the movies that have been rated by more than 50 users:
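A sketch of the selection, using the thresholds described above (ratings_movies is the object name used in the rest of the report):
ratings_movies <- ratingmat[rowCounts(ratingmat) > 50, colCounts(ratingmat) > 50]
ratings_movies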
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
This selection of the most relevant data contains 420 users and 447 movies, compared to 668 users and 10325 movies in the full dataset.
Using the same approach as previously, I visualize the top 2 percent of users and movies in the new matrix of the most relevant data:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the heatmap, some rows are darker than the others. This might mean that some users give higher ratings to all the movies. The distribution of the average rating per user across all the users varies a lot, as the second chart above shows.
Having users who give high (or low) ratings to all their movies might bias the results. In order to remove this effect, I normalize the data in such a way that the average rating of each user is 0. As a quick check, I calculate the average rating by users, and it is equal to 0, as expected:
ratings_movies_norm <- normalize(ratings_movies)
sum(rowMeans(ratings_movies_norm) > 0.00001)
## [1] 0
Now, I visualize the normalized matrix for the top movies. It is colored now because the data is continuous:
There are still some lines that seem to be more blue or more red. The reason is that I am visualizing only the top movies. I have already checked that the average rating is 0 for each user.
Some recommendation models work on binary data, so it might be useful to binarize the data, that is, define a table containing only 0s and 1s. The 0s will be either treated as missing values or as bad ratings.
In our case, I can either: 1) use 1 to mark the movies that a user has rated and 0 otherwise, or 2) use 1 to mark ratings above a certain threshold (say, 3) and 0 otherwise, so that 0 stands for a bad or missing rating.
Depending on the context, one choice may be more appropriate than the other.
As a next step, I define two matrices following the two different approaches and visualize a 5 percent portion of each of binarized matrices.
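A sketch of the two binarizations with recommenderlab's binarize function, visualizing the top 5 percent of users and movies of the second matrix:
ratings_movies_watched <- binarize(ratings_movies, minRating = 1) # 1 = the movie was rated
ratings_movies_good <- binarize(ratings_movies, minRating = 3) # 1 = the rating is 3 or above
top_users <- rowCounts(ratings_movies) > quantile(rowCounts(ratings_movies), 0.95)
top_movies <- colCounts(ratings_movies) > quantile(colCounts(ratings_movies), 0.95)
image(ratings_movies_good[top_users, top_movies], main = "Heatmap of the top users and movies (good ratings only)")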
There are more white cells in the second heatmap, which shows that there are more movies with missing or bad ratings than movies that were simply not watched.
Collaborative filtering is a branch of recommendation that takes account of the information about different users. The word “collaborative” refers to the fact that users collaborate with each other to recommend items. In fact, the algorithms take account of user ratings and preferences.
The starting point is a rating matrix in which rows correspond to users and columns correspond to items. The core item-based algorithm follows these steps: 1) for every pair of items, measure how similarly they are rated by the users who rated both; 2) for each item, keep its k most similar items; 3) for each user, score the items most similar to the user's rated movies and recommend the top ones.
I build the model using 80% of the whole dataset as a training set and 20% as a test set.
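A sketch of the split (users are randomly assigned; object names assumed):
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies), replace = TRUE, prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]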
Let's have a look at the default parameters of the IBCF model. Here, k is the number of most similar items to store for each item: in the first step the algorithm computes the similarities among items, then, for each item, it keeps only its k most similar items. method is the similarity function, which is Cosine by default (pearson is also available). I create the model using the default parameters of method = Cosine and k = 30.
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
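A minimal sketch of building the model with these defaults (object names assumed):
recc_model <- Recommender(data = recc_data_train, method = "IBCF", parameter = list(k = 30, method = "Cosine"))
recc_model
class(recc_model)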
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 334 users.
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
Exploring the recommender model:
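A sketch of this exploration, extracting the learned item-to-item similarity matrix from the model:
model_details <- getModel(recc_model)
class(model_details$sim) # class of the similarity matrix
dim(model_details$sim) # one row and one column per item
table(rowSums(model_details$sim > 0)) # each row keeps only the k = 30 most similar items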
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
## [1] 447 447
## row_sums
## 30
## 447
The similarity matrix created by the model is stored as a dgCMatrix. Its dimensions are 447 x 447, equal to the number of items. A heatmap of the first 20 items shows that many values are equal to 0; the reason is that each row contains only k (30) elements greater than 0. The number of non-zero elements in each column depends on how many times the corresponding movie was included in the top k of another movie, so the matrix is not necessarily symmetric, and indeed it is not in our model.
The chart of the distribution of the number of elements by column shows there are a few movies that are similar to many others.
Now, it is possible to recommend movies to the users in the test set. I set n_recommended to 10, the number of movies to recommend to each user.
For each user, the algorithm extracts the movies that the user has rated. For each of these movies, it identifies the similar items using the similarity matrix. The algorithm then scores each candidate item by weighting its similarities with the user's ratings and summing them up.
Then, the algorithm identifies the top 10 recommendations:
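A sketch of this prediction step:
n_recommended <- 10
recc_predicted <- predict(object = recc_model, newdata = recc_data_test, n = n_recommended)
recc_predicted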
## Recommendations as 'topNList' with n = 10 for 86 users.
Let’s explore the results of the recommendations for the first user:
## [1] "Grumpier Old Men (1995)"
## [2] "Heat (1995)"
## [3] "Seven (a.k.a. Se7en) (1995)"
## [4] "Happy Gilmore (1996)"
## [5] "Rumble in the Bronx (Hont faan kui) (1995)"
## [6] "Birdcage, The (1996)"
## [7] "Casper (1995)"
## [8] "Congo (1995)"
## [9] "Star Wars: Episode IV - A New Hope (1977)"
## [10] "Natural Born Killers (1994)"
It’s also possible to define a matrix with the recommendations for each user. I visualize the recommendations for the first four users:
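A sketch of how such a matrix can be built from the prediction object; the items slot holds column indices, which are mapped back to movieId labels (this assumes every test user receives the full 10 recommendations):
recc_matrix <- sapply(recc_predicted@items, function(x) colnames(ratings_movies)[x])
recc_matrix[, 1:4]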
## [,1] [,2] [,3] [,4]
## [1,] 3 58559 253 5060
## [2,] 6 6016 318 1090
## [3,] 47 54286 1729 3052
## [4,] 104 3147 2329 1183
## [5,] 112 49272 2571 2542
## [6,] 141 72998 2761 4011
## [7,] 158 2542 2959 2019
## [8,] 160 55820 3081 1408
## [9,] 260 3578 3147 2947
## [10,] 288 5418 3578 4720
Here, the columns represent the first 4 users, and the rows contain the movieId values of the 10 recommended movies.
Now, let’s identify the most recommended movies. The following image shows the distribution of the number of items for IBCF:
## Movie title No of items
## 903 Vertigo (1958) 9
## 908 North by Northwest (1959) 9
## 1203 12 Angry Men (1957) 9
## 36 Dead Man Walking (1995) 8
Most of the movies have been recommended only a few times, and a few movies have been recommended more than 5 times.
IBCF recommends items on the basis of the similarity matrix. It's an eager-learning model, that is, once it's built, it doesn't need to access the initial data. For each item, the model stores only the k most similar items, so the amount of stored information is small once the model is built. This is an advantage in the presence of large amounts of data.
In addition, this algorithm is efficient and scalable, so it works well with big rating matrices.
Now, I will use the user-based approach. According to this approach, given a new user, its similar users are first identified. Then, the top-rated items rated by similar users are recommended.
For each new user, these are the steps: 1) measure how similar each existing user is to the new one; 2) identify the most similar users, either the top nn nearest neighbors or those above a similarity threshold; 3) rate the movies rated by these similar users, using the (possibly similarity-weighted) average rating; 4) pick the top-rated movies.
Again, let's first check the default parameters of the UBCF model. Here, nn is the number of similar users to use, and method is the similarity function, which is cosine by default. I build a recommender model leaving the parameters at their defaults and using the training set.
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 334 users.
## 334 x 447 rating matrix of class 'realRatingMatrix' with 30135 ratings.
## Normalized using center on rows.
In the same way as the IBCF, I now determine the top ten recommendations for each new user in the test set.
## Recommendations as 'topNList' with n = 10 for 86 users.
Let’s take a look at the first four users:
## [,1] [,2] [,3] [,4]
## [1,] 50 318 318 2329
## [2,] 296 2959 50 2959
## [3,] 318 593 593 2858
## [4,] 356 110 858 5995
## [5,] 858 551 260 1136
## [6,] 47 2858 110 4995
## [7,] 293 2571 1198 4878
## [8,] 1221 4993 4993 1200
## [9,] 4993 1258 527 5989
## [10,] 2571 58559 2858 44191
The above matrix contains the movieId of each recommended movie (rows) for the first four users (columns) in our test dataset.
I also compute how many times each movie got recommended and build the related frequency histogram:
Compared with the IBCF, the distribution has a longer tail. This means that there are some movies that are recommended much more often than the others. The maximum is more than 30, compared to 10-ish for IBCF.
Let’s take a look at the top titles:
## Movie title No of items
## 318 Shawshank Redemption, The (1994) 34
## 50 Usual Suspects, The (1995) 30
## 858 Godfather, The (1972) 27
## 527 Schindler's List (1993) 26
Comparing the results of UBCF with IBCF provides some useful insight into the two algorithms. UBCF needs to access the initial data: since it keeps the entire database in memory, it doesn't work well with a big rating matrix. Also, computing the user similarities requires a lot of computing power and time.
However, UBCF is often slightly more accurate than IBCF (I will also discuss this in the next section), so it's a good option if the dataset is not too big.
There are a few models to choose from when creating a recommendation engine. In order to compare their performance and choose the most appropriate model, I follow these steps: prepare the data for evaluation, evaluate the performance of several models, choose the best-performing model, and optimize its parameters.
We need training and testing data to evaluate the model. There are several methods to create them: 1) splitting the data into training and test sets, 2) bootstrapping, 3) using k-fold cross-validation.
Splitting the data into training and test sets is often done using an 80/20 proportion.
For each user in the test set, we need to define how many items to use to generate recommendations. For this, I first check the minimum number of items rated by users to be sure there will be no users with no items to test.
min(rowCounts(ratings_movies))
## [1] 8
percentage_training <- 0.8 # proportion of users in the training set
items_to_keep <- 5 #number of items given to generate recommendations
rating_threshold <- 3 # threshold with the minimum rating that is considered good
n_eval <- 1 #number of times to run evaluation
eval_sets <- evaluationScheme(data = ratings_movies,
method = "split",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
eval_sets
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
getData(eval_sets, "train") # training set
## 336 x 447 rating matrix of class 'realRatingMatrix' with 31157 ratings.
getData(eval_sets, "known") # set with the items used to build the recommendations
## 84 x 447 rating matrix of class 'realRatingMatrix' with 420 ratings.
getData(eval_sets, "unknown") # set with the items used to test the recommendations
## 84 x 447 rating matrix of class 'realRatingMatrix' with 6764 ratings.
qplot(rowCounts(getData(eval_sets, "unknown"))) +
geom_histogram(binwidth = 10) +
ggtitle("unknown items by the users")
The above chart displays the number of unknown items per user, which varies a lot.
Bootstrapping is another approach to splitting the data. The same user can be sampled more than once and, if the training set has the same size as before, there will be more users in the test set.
eval_sets <- evaluationScheme(data = ratings_movies,
method = "bootstrap",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
table_train <- table(eval_sets@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
qplot(n_repetitions) +
ggtitle("Number of repetitions in the training set")
The above chart shows that most of the users have been sampled fewer than four times.
The k-fold cross-validation approach is the most accurate one, although it’s computationally heavier.
Using this approach, we split the data into some chunks, take a chunk out as the test set, and evaluate the accuracy. Then, we can do the same with each other chunk and compute the average accuracy.
n_fold <- 4
eval_sets <- evaluationScheme(data = ratings_movies,
method = "cross-validation",
k = n_fold,
given = items_to_keep,
goodRating = rating_threshold)
size_sets <- sapply(eval_sets@runsTrain, length)
size_sets
## [1] 315 315 315 315
Using the 4-fold approach, we get four training sets of the same size, 315 users each.
I use the k-fold approach for evaluation.
First, I re-define the evaluation sets, build IBCF model and create a matrix with predicted ratings.
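A sketch of these steps, reusing the parameters defined earlier (the IBCF model is built on the training part, and ratings are predicted from the known part of the test users):
eval_sets <- evaluationScheme(data = ratings_movies, method = "cross-validation", k = n_fold, given = items_to_keep, goodRating = rating_threshold)
eval_recommender <- Recommender(data = getData(eval_sets, "train"), method = "IBCF", parameter = NULL)
eval_prediction <- predict(object = eval_recommender, newdata = getData(eval_sets, "known"), n = 10, type = "ratings")
qplot(rowCounts(eval_prediction)) + ggtitle("Distribution of movies per user")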
The above image displays the distribution of movies per user in the matrix of predicted ratings.
Now, I compute the accuracy measures for each user. Most of the RMSEs (Root mean square errors) are in the range of 0.5 to 1.8:
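A sketch of the per-user accuracy computation:
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy)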
## RMSE MSE MAE
## [1,] 0.6002487 0.3602984 0.3777928
## [2,] 0.7694427 0.5920421 0.5683099
## [3,] 1.1223170 1.2595954 0.9067505
## [4,] 1.1058170 1.2228312 0.7860035
## [5,] 0.7942651 0.6308570 0.6723061
## [6,] 0.9013878 0.8125000 0.8750000
In order to have a performance index for the whole model, I specify byUser as FALSE and compute the average indices:
## RMSE MSE MAE
## 1.1026305 1.2157940 0.7928826
The measures of accuracy are useful to compare the performance of different models on the same data.
Another way to measure accuracy is to compare the recommendations with the held-out views that have a positive rating. For this, I can make use of the prebuilt evaluate function in the recommenderlab library. The function evaluates the recommender performance depending on the number n of items to recommend to each user. I use the sequence n = seq(10, 100, 10). The first rows of the resulting performance matrix are presented below:
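A sketch of this evaluation (results is the object plotted further below):
results <- evaluate(x = eval_sets, method = "IBCF", n = seq(10, 100, 10))
head(getConfusionMatrix(results)[[1]]) # confusion-matrix rows of the first fold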
## IBCF run fold/sample [model time/prediction time]
## 1 [4.49sec/0.06sec]
## 2 [4.47sec/0.06sec]
## 3 [4.53sec/0.06sec]
## 4 [4.46sec/0.07sec]
## TP FP FN TN precision recall TPR
## 10 2.438095 7.180952 66.63810 365.7429 0.2534653 0.03684398 0.03684398
## 20 4.857143 14.380952 64.21905 358.5429 0.2524752 0.08070848 0.08070848
## 30 6.952381 21.904762 62.12381 351.0190 0.2409241 0.11743856 0.11743856
## 40 9.104762 29.371429 59.97143 343.5524 0.2366337 0.15409146 0.15409146
## 50 11.152381 36.942857 57.92381 335.9810 0.2318812 0.19071582 0.19071582
## 60 13.695238 43.923810 55.38095 329.0000 0.2374257 0.23003707 0.23003707
## FPR
## 10 0.01909268
## 20 0.03845362
## 30 0.05865530
## 40 0.07869669
## 50 0.09897217
## 60 0.11752674
In order to have a look at all the splits at the same time, I sum up the indices of columns TP, FP, FN and TN:
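A sketch, summing the confusion matrices of the four folds element-wise:
columns_to_sum <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, columns_to_sum]
head(indices_summed)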
## TP FP FN TN
## 10 10.98095 28.16190 291.2571 1437.600
## 20 22.20000 56.08571 280.0381 1409.676
## 30 32.47619 84.94286 269.7619 1380.819
## 40 41.93333 114.53333 260.3048 1351.229
## 50 51.44762 144.00952 250.7905 1321.752
## 60 61.05714 173.01905 241.1810 1292.743
Finally, I plot the ROC and the precision/recall curves:
plot(results, annotate = TRUE, main = "ROC curve")
plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")
The curves show the usual trade-off: the larger the percentage of rated movies that is recommended, the higher the recall but the lower the precision.
In order to compare different models, I define them as the following list:
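A sketch of the list, with names matching the output shown below (pearson is assumed for the "cor" variants):
models_to_evaluate <- list(
IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
IBCF_cor = list(name = "IBCF", param = list(method = "pearson")),
UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
random = list(name = "RANDOM", param = NULL)
)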
Then, I define a different set of numbers for recommended movies (n_recommendations <- c(1, 5, seq(10, 100, 10))), run and evaluate the models:
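A sketch of running the comparison and checking that every model was evaluated:
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets, method = models_to_evaluate, n = n_recommendations)
sapply(list_results, class) == "evaluationResults"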
## IBCF run fold/sample [model time/prediction time]
## 1 [4.5sec/0.06sec]
## 2 [4.5sec/0.06sec]
## 3 [4.4sec/0.07sec]
## 4 [4.57sec/0.06sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [8.3sec/0.06sec]
## 2 [8.52sec/0.04sec]
## 3 [8.38sec/0.06sec]
## 4 [8.33sec/0.07sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/2.25sec]
## 2 [0sec/2.29sec]
## 3 [0sec/2.27sec]
## 4 [0.02sec/2.26sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.01sec/2.87sec]
## 2 [0sec/2.93sec]
## 3 [0.02sec/2.89sec]
## 4 [0sec/2.84sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.1sec]
## 2 [0sec/0.09sec]
## 3 [0sec/0.1sec]
## 4 [0sec/0.1sec]
## IBCF_cos IBCF_cor UBCF_cos UBCF_cor random
## TRUE TRUE TRUE TRUE TRUE
As an example, the following table presents the first rows of the performance evaluation matrix for IBCF with cosine distance:
## precision recall TPR FPR
## 1 0.2794085 0.004054432 0.004054432 0.001892934
## 5 0.2724670 0.020509942 0.020509942 0.009632347
## 10 0.2802074 0.041911963 0.041911963 0.019024055
## 20 0.2832357 0.086426698 0.086426698 0.038007303
## 30 0.2762620 0.126139933 0.126139933 0.057626777
## 40 0.2678761 0.163710847 0.163710847 0.077791926
I compare the models by building a chart displaying their ROC curves and Precision/recall curves.
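A sketch of the two comparison charts (annotate = 1 labels the points of the first curve):
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")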
A good performance index is the area under the ROC curve (AUC). Even without computing it, the chart shows that the highest curve belongs to UBCF with cosine distance, so it's the best-performing technique.
The UBCF with cosine distance is still the top model. Depending on what is the main purpose of the system, an appropriate number of items to recommend should be defined.
IBCF takes account of the k-closest items. I will explore more values, ranging between 5 and 40, in order to tune this parameter:
vector_k <- c(5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF",
param = list(method = "cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
Now I build and evaluate the same IBCF/cosine models with different values of the k-closest items:
## IBCF run fold/sample [model time/prediction time]
## 1 [4.63sec/0.04sec]
## 2 [4.47sec/0.04sec]
## 3 [4.52sec/0.04sec]
## 4 [4.57sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [4.43sec/0.05sec]
## 2 [4.54sec/0.05sec]
## 3 [4.62sec/0.05sec]
## 4 [4.44sec/0.05sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [4.57sec/0.05sec]
## 2 [4.94sec/0.08sec]
## 3 [4.83sec/0.05sec]
## 4 [4.64sec/0.04sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [4.42sec/0.05sec]
## 2 [4.42sec/0.05sec]
## 3 [4.55sec/0.04sec]
## 4 [4.56sec/0.06sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [4.48sec/0.07sec]
## 2 [4.53sec/0.06sec]
## 3 [4.64sec/0.06sec]
## 4 [4.54sec/0.06sec]
Based on the ROC curve’s plot, the k having the biggest AUC is 10. Another good candidate is 5, but it can never have a high TPR. This means that, even if we set a very high n value, the algorithm won’t be able to recommend a big percentage of items that the user liked. The IBCF with k = 5 recommends only a few items similar to the purchases. Therefore, it can’t be used to recommend many items.
Based on the precision/recall plot, k should be set to 10 to achieve the highest recall. If we are more interested in the precision, we set k to 5.
I have created a web application for the recommender system using the Shiny package in R, as shown in Figure 1. The web application is hosted on shinyapps.io and can be found at https://jeknov.shinyapps.io/movieRec.
In this web app, I present a simple recommender system built with the user-based collaborative filtering approach. This approach was chosen mainly because it was the best-performing method in the evaluation performed for this project.
Note: the app hosted at shinyapps.io is running on a free account, which means it is very restricted in computational resources and may be slow or even unresponsive when many connections are made or when large files are used. The best option may be to copy the files from https://github.com/jeknov/movieRec and try the app locally using RStudio.
In this project, I have developed and evaluated a collaborative filtering recommender (CFR) system for recommending movies. An online app was created to demonstrate the user-based collaborative filtering approach.
Let’s discuss the strengths and weaknesses of the User-based Collaborative Filtering approach in general.
Strengths: User-based collaborative filtering gives recommendations that can be complements to the item the user was interacting with. This might be a stronger recommendation than what an item-based recommender can provide, as users might not be looking for direct substitutes for a movie they have just viewed or previously watched.
Weaknesses: User-based collaborative filtering is a type of memory-based collaborative filtering that uses all user data in the database to create recommendations. Comparing the pairwise similarity of every user in the dataset is not scalable: with millions of users, this computation would be very time consuming. Possible ways around this are to apply some form of dimensionality reduction, such as Principal Component Analysis, or to use a model-based algorithm instead. Also, user-based collaborative filtering relies on past user choices to make future recommendations. The implication is that it assumes a user's tastes and preferences remain more or less constant over time, which might not be true; it also makes it difficult to pre-compute user similarities offline.