Collaborative filtering is a technique used by recommender systems to predict the interests of one user based on the preference information collected from many other users [1]. This project is an implementation of a Movie Recommender System that uses two collaborative filtering techniques: user-based collaborative filtering (UBCF) and item-based collaborative filtering (IBCF).
This implementation uses the recommenderlab package in R. The MovieLense dataset [2], which is included with this package, is used here to train the models, generate predictions, and evaluate the results.
The following R libraries are used in this project.
library(dplyr)
library(ggplot2)
library(knitr)
library(recommenderlab)
The starting point in collaborative filtering is a rating matrix in which rows correspond to users and columns correspond to items [3]. This matrix is implemented in the MovieLense data object.
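As a minimal illustration (using toy data, not MovieLense), a plain R matrix with NA marking missing ratings can be coerced to recommenderlab's realRatingMatrix class; the names and values below are made up for the example.
# toy example: three users, three movies, NA marks a missing rating
m = matrix(c(5, NA, 3,
             NA, 4, NA,
             2,  1, NA),
           nrow = 3, byrow = TRUE,
           dimnames = list(paste0("u", 1:3), paste0("m", 1:3)))
r = as(m, "realRatingMatrix")   # NA entries are stored as missing ratings
r                               # should report a 3 x 3 rating matrix with 5 ratings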
We now load and inspect the MovieLense data object.
data(MovieLense)
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
class(MovieLense)
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
slotNames(MovieLense)
## [1] "data" "normalize"
class(MovieLense@data)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
We now show the names of some of the movies present in the data set.
head(names(colCounts(MovieLense)))
## [1] "Toy Story (1995)"
## [2] "GoldenEye (1995)"
## [3] "Four Rooms (1995)"
## [4] "Get Shorty (1995)"
## [5] "Copycat (1995)"
## [6] "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"
The number of ratings corresponding to each rating value is shown in a table below.
vector_ratings <- as.vector(MovieLense@data)
kable(table(vector_ratings), caption="Rating frequency")
|Rating value |    Count|
|:------------|--------:|
|0            |  1469760|
|1            |     6059|
|2            |    11307|
|3            |    27002|
|4            |    33947|
|5            |    21077|
Since a rating with a value of 0 represents the absence of a rating, we remove such ratings from the ratings vector.
# a rating of 0 denotes a missing rating in this data set, so drop those entries
vector_ratings = vector_ratings[vector_ratings != 0]
hist(vector_ratings, main="Histogram of Ratings", xlab="Rating Value")
We see above that a rating of 4 (indicating high preference) is the most common rating, and that the distribution of ratings is left-skewed: most ratings are 3 or above.
For building a collaborative filtering model we can limit the input data using minimum thresholds: for example, we may ignore users who have provided too few ratings, as well as movies that have received too few ratings.
Here we restrict model training to users who have rated more than 50 movies and to movies that have been rated by more than 100 users.
ratings = MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
dim(ratings)
## [1] 560 332
We normalize the data so that the average rating given by each user is 0. This handles users who consistently assign higher or lower ratings to all movies than the average user does. In other words, normalization removes the bias in each user's ratings.
ratings.n = normalize(ratings)
ratings.n.vec = as.vector(ratings.n@data)
ratings.n.vec = ratings.n.vec[ratings.n.vec != 0]
hist(ratings.n.vec, main="Histogram of Normalized Ratings", xlab="Rating")
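As a quick sanity check (not part of the original analysis), the mean of each user's remaining ratings should now be approximately zero:
# after centering, every user's average rating is (approximately) zero
summary(rowMeans(ratings.n))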
To split the data into training and test sets, we can use the evaluationScheme() function in recommenderlab. It extends generic data-splitting methods with parameters that are specific to recommender systems. As shown in the code below, one parameter specifies how many items are given (treated as known) for each test user, and another specifies the minimum value that counts as a good rating.
percent_train = 0.8
# min(rowCounts(ratings)) can be checked to confirm that every user has at
# least items_to_keep ratings available
items_to_keep = 15       # items given (known) for each test user
rating_threshold = 3     # ratings >= 3 count as good ratings
n_eval = 1               # number of evaluation runs
eval_sets = evaluationScheme(data = ratings, method = "split",
train = percent_train, given = items_to_keep,
goodRating = rating_threshold, k = n_eval)
eval_sets
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
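The scheme object exposes three views of the data via getData(); the calls below are shown for illustration. The "train" portion is used to fit a model, the "known" portion contains the 15 given items per test user that are passed to predict(), and the "unknown" portion contains the remaining held-out ratings used to measure accuracy.
getData(eval_sets, "train")     # ratings used to fit the recommender
getData(eval_sets, "known")     # the 15 given items per test user (input to predict)
getData(eval_sets, "unknown")   # held-out test ratings (used for accuracy)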
In user-based collaborative filtering (UBCF) the procedure is to first find other users that are similar to a given user, and then find the top-rated items among the items those users have rated (or purchased). Those items are then recommended to the given user [3].
We now build a UBCF model using the default parameters of the Recommender() function and use it to predict ratings for the test portion of the data set. We then use library functions to evaluate the accuracy of the predictions against the held-out ratings. Per-user performance metrics for the UBCF model are displayed below.
eval_recommender = Recommender(data = getData(eval_sets, "train"),
method = "UBCF", parameter = NULL)
items_to_recommend = 10
eval_prediction = predict(object = eval_recommender,
newdata = getData(eval_sets, "known"),
n = items_to_recommend,
type = "ratings")
eval_accuracy = calcPredictionAccuracy(x = eval_prediction,
data = getData(eval_sets, "unknown"),
byUser = TRUE)
head(eval_accuracy)
## RMSE MSE MAE
## 8 0.9761524 0.9528735 0.8421709
## 13 1.3449749 1.8089574 1.1217356
## 41 0.7868309 0.6191029 0.6445177
## 43 0.9080320 0.8245220 0.7408471
## 52 0.7007719 0.4910813 0.5280814
## 59 1.1527556 1.3288455 0.8730173
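To obtain a single summary of the UBCF model's accuracy rather than per-user rows, the same function can be called with byUser = FALSE; this extra step is a sketch and was not part of the original output.
# average the error metrics over all test users
calcPredictionAccuracy(x = eval_prediction,
                       data = getData(eval_sets, "unknown"),
                       byUser = FALSE)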
Item-based collaborative filtering (IBCF) attempts to find, for a given user, items that are similar to the items the user has already rated (or purchased).
The core algorithm is based on these steps [3]:

1. For every pair of items, measure how similar they are based on the ratings they have received from common users.
2. For each item, keep only the k most similar items.
3. For each user, recommend the items that are most similar to the items the user has already rated highly.
We now build an IBCF model using the default parameters of the Recommender() function.
eval_recommender = Recommender(data = getData(eval_sets, "train"),
method = "IBCF", parameter = NULL)
items_to_recommend = 10
eval_prediction = predict(object = eval_recommender,
newdata = getData(eval_sets, "known"),
n = items_to_recommend,
type = "ratings")
eval_accuracy = calcPredictionAccuracy(x = eval_prediction,
data = getData(eval_sets, "unknown"),
byUser = TRUE)
head(eval_accuracy)
## RMSE MSE MAE
## 8 1.820972 3.315940 1.5089295
## 13 1.840546 3.387611 1.4523764
## 41 1.177189 1.385774 0.9802121
## 43 1.351875 1.827567 0.9793580
## 52 1.081872 1.170447 0.7779512
## 59 1.348306 1.817928 0.9045915
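The fitted IBCF recommender stores an item-to-item similarity matrix that can be inspected with getModel(); the snippet below is a sketch assuming recommenderlab's default IBCF parameters.
model_details = getModel(eval_recommender)
model_details$k          # number of most-similar items retained per item
dim(model_details$sim)   # item-by-item similarity matrix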
We find from the accuracy tables above that the RMSE values are noticeably lower for the UBCF model than for the IBCF model.
We now build models using different similarity measures for computing the similarity between users (UBCF) and between items (IBCF). Cosine similarity and the Pearson correlation are commonly used measures and are compared here, along with a random-recommendation baseline.
models_to_evaluate = list(IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
IBCF_cor = list(name = "IBCF", param = list(method = "pearson")),
UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
random = list(name = "RANDOM", param=NULL))
n_recommendations = c(1, 3, 5, 10, 15, 20)
results = evaluate(x = eval_sets, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.219sec/0.021sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.263sec/0.042sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.002sec/0.099sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.002sec/0.116sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.001sec/0.022sec]
# Draw ROC curve
plot(results, y = "ROC", annotate = 1, legend="topleft")
title("ROC Curve")
# Draw precision / recall curve
plot(results, y = "prec/rec", annotate=1)
title("Precision-Recall")
We see that UBCF’s accuracy is higher than that of IBCF, and that UBCF with Pearson correlation outperforms all the other models. On the other hand, UBCF has a greater computational cost and requires more resources, since it must keep the entire rating matrix available to compute user similarities at prediction time. There also exist hybrid systems that integrate both the UBCF and IBCF approaches [6]. It is also worth noting that both UBCF and IBCF have limitations: neither can handle users who have not rated any items or items that have never been rated (the cold-start problem).