The purpose of this exercise is to build multiple recommender systems for the same dataset and evaluate their performance. By applying each system to a held-out testing subset of the data, the accuracy of the various models can be compared.
The recommenderlab package is used throughout the exercise. The Jester5k dataset included in the package is the rating matrix on which the various models are evaluated. The dataset includes “5000 users from the anonymous ratings data from the Jester Online Joke Recommender System.” The ratings in the matrix range from -10.00 to 10.00.
The dataset is not very sparse – each user included has rated at least 36 jokes, with some users having rated all 100 jokes sampled in the dataset. To create some sparsity and better allow for recommendations, a subset of the dataset containing only those users with more than 20 unrated jokes is considered.
Jester5k_rec <- Jester5k[100 - rowCounts(Jester5k) > 20]

Naturally, users have different behaviors when rating items; this will be reflected in their average rating, as displayed below.
In order to remove any bias from different users’ average ratings, the ratings are normalized so that each user has an average rating of 0.
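As a minimal sketch of what this centering does, the base R snippet below subtracts each user's mean from that user's ratings on a small made-up matrix. The matrix and its names (`ratings`, `u1`, `j1`, and so on) are hypothetical and are not taken from Jester5k; the snippet only illustrates the idea and does not reproduce the package's internals.

```r
# Toy ratings: rows are users, columns are jokes, NA marks an unrated joke.
# All values here are invented for illustration.
ratings <- matrix(
  c( 4,  8, NA,  0,
    -6, NA, -2, -4,
     1,  3,  5, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("j", 1:4))
)

# Mean rating per user, computed only over the jokes that user rated.
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each user's mean from that user's row.
centered <- sweep(ratings, 1, user_means, FUN = "-")

# After centering, every user's mean rating is 0 (up to rounding error).
centered_means <- rowMeans(centered, na.rm = TRUE)
print(centered)
```

A positive centered value then means "liked more than this user's average", which is what makes a single threshold comparable across users.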
To evaluate the accuracy of the models created, the dataset is split into training and testing sets. This is accomplished using the built-in evaluationScheme function. 80% of the data is included in the training set, with the remaining 20% in the testing dataset. With the sparsity introduced in the previous section, 20 items per test user are given to the recommender, and that user's remaining ratings are withheld for evaluation. For accuracy evaluation purposes, a threshold defining a “good” rating must be created. Given the normalization above, the threshold is set at 0.1, indicating jokes that users like more than their average rating.
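The "given 20" protocol can be sketched for a single test user in base R. This is an illustration under stated assumptions, not the package's implementation: the user and the 30 ratings are randomly generated, and the variable names (`known`, `unknown`, `relevant`) are chosen here for clarity.

```r
set.seed(1)

# One hypothetical test user with 30 centered ratings (invented values).
user_ratings <- setNames(round(runif(30, min = -5, max = 5), 2),
                         paste0("j", 1:30))

given <- 20         # number of ratings revealed to the recommender
good_rating <- 0.1  # threshold on the centered scale

# Reveal `given` randomly chosen ratings and hold out the rest.
known_idx <- sample(seq_along(user_ratings), given)
known <- user_ratings[known_idx]
unknown <- user_ratings[-known_idx]

# Held-out jokes that count as relevant when a top-N list is scored.
relevant <- names(unknown)[unknown >= good_rating]
print(relevant)
```

A recommendation is then scored as a hit when it appears in `relevant`, and as a miss when it is a held-out joke rated below the threshold.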
Jester5k_rec <- normalize(Jester5k_rec)
train_test <- evaluationScheme(data = Jester5k_rec, method = "split", train = 0.8, given = 20, goodRating = 0.1)

Adapting the example in Building a Recommendation System with R, several models are evaluated. Two algorithms are used:
- User-based collaborative filtering (UBCF)
- Item-based collaborative filtering (IBCF)
Two similarity methods are used for each algorithm:
- Cosine similarity
- Pearson correlation similarity
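To make the difference between the two similarity methods concrete, the sketch below computes both for two hypothetical users in base R, using only the jokes both users rated. The rating vectors are invented, and restricting to co-rated jokes is an assumption made for this illustration rather than a statement about how the package handles missing values.

```r
# Two hypothetical users; NA marks a joke the user did not rate.
u <- c(j1 = 2, j2 = -1, j3 = NA, j4 = 3,  j5 = 0)
v <- c(j1 = 1, j2 = -2, j3 = 4,  j4 = NA, j5 = 1)

# Keep only the jokes rated by both users.
both <- !is.na(u) & !is.na(v)
x <- u[both]
y <- v[both]

# Cosine similarity: angle between the two rating vectors.
cosine_sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Pearson correlation: cosine similarity after centering each vector.
pearson_sim <- cor(x, y)

print(c(cosine = cosine_sim, pearson = pearson_sim))
```

Pearson correlation is the cosine similarity of the mean-centered vectors, so the two measures differ only when the co-rated ratings are not already centered.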
models <- list(
  UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
  IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
  UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
  IBCF_cor = list(name = "IBCF", param = list(method = "pearson"))
)

Each of the four models is evaluated on the test dataset, with each model providing 1 to 20 recommendations per user.
eval_results <- evaluate(x = train_test, method = models, n = 1:20)

UBCF run fold/sample [model time/prediction time]
1 [0sec/23.4sec]
IBCF run fold/sample [model time/prediction time]
1 [1.32sec/0.14sec]
UBCF run fold/sample [model time/prediction time]
1 [0sec/42.54sec]
IBCF run fold/sample [model time/prediction time]
1 [2.52sec/0.16sec]
In addition to accuracy, the models are compared based on time. The time (in seconds) to create the model and execute the prediction for each of the models is presented in the table below:
| | Cosine | Pearson |
|---|---|---|
| UBCF | 22.50 | 40.69 |
| IBCF | 1.44 | 2.59 |
These times show that the execution of the user-based collaborative filtering models takes significantly longer than the evaluation of the item-based collaborative filtering models. This makes sense given the dimensions of the dataset used – 3261 users vs. 100 items. It is necessary to view the accuracy of the generated models to determine if the computational expense of the user-based models is worthwhile.
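A rough count of similarity computations helps explain the gap. The sketch below assumes that the user-based models compare every test user with every training user at prediction time (which is consistent with the 0-second model time in the log above) and that the item-based models compute one similarity per pair of items when the model is built. The split sizes are approximated from the 80/20 ratio; the exact sizes used by the evaluation scheme are not shown in the source.

```r
n_users <- 3261
n_items <- 100
train_share <- 0.8

# Approximate split sizes (assumption: simple rounding of the 80% share).
n_train <- round(n_users * train_share)
n_test <- n_users - n_train

# UBCF: one similarity per (test user, training user) pair.
ubcf_similarities <- n_test * n_train

# IBCF: one similarity per unordered pair of items.
ibcf_similarities <- choose(n_items, 2)

ratio <- ubcf_similarities / ibcf_similarities
print(c(ubcf = ubcf_similarities, ibcf = ibcf_similarities, ratio = ratio))
```

This ignores constant factors and the neighbor-selection step, so it indicates only the order of magnitude, not the measured times.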
The average confusion matrix for each model is extracted at each number of recommendations.
model_performance <- lapply(eval_results, avg)

The results are presented below:
UBCF with cosine similarity (UBCF_cos):

| n | precision | recall | TPR | FPR |
|---|---|---|---|---|
| 1 | 0.77 | 0.04 | 0.04 | 0.00 |
| 2 | 0.76 | 0.08 | 0.08 | 0.01 |
| 3 | 0.76 | 0.13 | 0.13 | 0.01 |
| 4 | 0.76 | 0.17 | 0.17 | 0.02 |
| 5 | 0.75 | 0.21 | 0.21 | 0.02 |
| 6 | 0.74 | 0.24 | 0.24 | 0.03 |
| 7 | 0.73 | 0.28 | 0.28 | 0.03 |
| 8 | 0.72 | 0.31 | 0.31 | 0.04 |
| 9 | 0.71 | 0.34 | 0.34 | 0.04 |
| 10 | 0.69 | 0.37 | 0.37 | 0.05 |
| 11 | 0.68 | 0.40 | 0.40 | 0.06 |
| 12 | 0.67 | 0.43 | 0.43 | 0.06 |
| 13 | 0.66 | 0.45 | 0.45 | 0.07 |
| 14 | 0.65 | 0.48 | 0.48 | 0.08 |
| 15 | 0.64 | 0.50 | 0.50 | 0.09 |
| 16 | 0.63 | 0.52 | 0.52 | 0.10 |
| 17 | 0.61 | 0.53 | 0.53 | 0.11 |
| 18 | 0.60 | 0.55 | 0.55 | 0.12 |
| 19 | 0.59 | 0.56 | 0.56 | 0.13 |
| 20 | 0.57 | 0.58 | 0.58 | 0.14 |

IBCF with cosine similarity (IBCF_cos):

| n | precision | recall | TPR | FPR |
|---|---|---|---|---|
| 1 | 0.24 | 0.01 | 0.01 | 0.01 |
| 2 | 0.29 | 0.03 | 0.03 | 0.02 |
| 3 | 0.32 | 0.05 | 0.05 | 0.03 |
| 4 | 0.34 | 0.07 | 0.07 | 0.04 |
| 5 | 0.36 | 0.09 | 0.09 | 0.05 |
| 6 | 0.37 | 0.11 | 0.11 | 0.06 |
| 7 | 0.38 | 0.13 | 0.13 | 0.07 |
| 8 | 0.39 | 0.15 | 0.15 | 0.08 |
| 9 | 0.40 | 0.18 | 0.18 | 0.09 |
| 10 | 0.40 | 0.20 | 0.20 | 0.10 |
| 11 | 0.41 | 0.22 | 0.22 | 0.11 |
| 12 | 0.41 | 0.25 | 0.25 | 0.12 |
| 13 | 0.41 | 0.27 | 0.27 | 0.13 |
| 14 | 0.42 | 0.30 | 0.30 | 0.14 |
| 15 | 0.42 | 0.32 | 0.32 | 0.15 |
| 16 | 0.42 | 0.34 | 0.34 | 0.16 |
| 17 | 0.42 | 0.37 | 0.37 | 0.16 |
| 18 | 0.42 | 0.39 | 0.39 | 0.17 |
| 19 | 0.42 | 0.41 | 0.41 | 0.18 |
| 20 | 0.42 | 0.43 | 0.43 | 0.19 |

UBCF with Pearson correlation similarity (UBCF_cor):

| n | precision | recall | TPR | FPR |
|---|---|---|---|---|
| 1 | 0.78 | 0.04 | 0.04 | 0.00 |
| 2 | 0.77 | 0.09 | 0.09 | 0.01 |
| 3 | 0.76 | 0.13 | 0.13 | 0.01 |
| 4 | 0.75 | 0.17 | 0.17 | 0.02 |
| 5 | 0.74 | 0.20 | 0.20 | 0.02 |
| 6 | 0.73 | 0.24 | 0.24 | 0.03 |
| 7 | 0.71 | 0.27 | 0.27 | 0.03 |
| 8 | 0.71 | 0.31 | 0.31 | 0.04 |
| 9 | 0.70 | 0.34 | 0.34 | 0.04 |
| 10 | 0.69 | 0.37 | 0.37 | 0.05 |
| 11 | 0.68 | 0.40 | 0.40 | 0.06 |
| 12 | 0.67 | 0.42 | 0.42 | 0.07 |
| 13 | 0.65 | 0.45 | 0.45 | 0.07 |
| 14 | 0.64 | 0.47 | 0.47 | 0.08 |
| 15 | 0.63 | 0.49 | 0.49 | 0.09 |
| 16 | 0.61 | 0.51 | 0.51 | 0.10 |
| 17 | 0.60 | 0.53 | 0.53 | 0.11 |
| 18 | 0.58 | 0.54 | 0.54 | 0.12 |
| 19 | 0.57 | 0.55 | 0.55 | 0.13 |
| 20 | 0.55 | 0.56 | 0.56 | 0.15 |

IBCF with Pearson correlation similarity (IBCF_cor):

| n | precision | recall | TPR | FPR |
|---|---|---|---|---|
| 1 | 0.32 | 0.01 | 0.01 | 0.01 |
| 2 | 0.34 | 0.03 | 0.03 | 0.02 |
| 3 | 0.36 | 0.05 | 0.05 | 0.03 |
| 4 | 0.37 | 0.07 | 0.07 | 0.04 |
| 5 | 0.38 | 0.09 | 0.09 | 0.05 |
| 6 | 0.38 | 0.11 | 0.11 | 0.06 |
| 7 | 0.39 | 0.13 | 0.13 | 0.07 |
| 8 | 0.39 | 0.15 | 0.15 | 0.08 |
| 9 | 0.38 | 0.17 | 0.17 | 0.09 |
| 10 | 0.38 | 0.19 | 0.19 | 0.10 |
| 11 | 0.39 | 0.21 | 0.21 | 0.11 |
| 12 | 0.38 | 0.23 | 0.23 | 0.12 |
| 13 | 0.39 | 0.25 | 0.25 | 0.13 |
| 14 | 0.39 | 0.27 | 0.27 | 0.14 |
| 15 | 0.38 | 0.28 | 0.28 | 0.15 |
| 16 | 0.39 | 0.31 | 0.31 | 0.16 |
| 17 | 0.38 | 0.32 | 0.32 | 0.18 |
| 18 | 0.39 | 0.34 | 0.34 | 0.19 |
| 19 | 0.38 | 0.36 | 0.36 | 0.20 |
| 20 | 0.38 | 0.38 | 0.38 | 0.21 |
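
The four columns in the tables above are standard functions of confusion-matrix counts. The sketch below shows the definitions in base R; the helper name `confusion_metrics` and the counts passed to it are hypothetical and chosen only for illustration. Because recall and TPR are the same quantity, those two columns match in every table.

```r
# Precision, recall, TPR and FPR from confusion-matrix counts.
# tp: recommended and relevant      fp: recommended but not relevant
# fn: relevant but not recommended  tn: neither recommended nor relevant
confusion_metrics <- function(tp, fp, fn, tn) {
  c(
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn),
    TPR       = tp / (tp + fn),  # identical to recall by definition
    FPR       = fp / (fp + tn)
  )
}

# Hypothetical counts for one user and a top-4 list (not from the source).
m <- confusion_metrics(tp = 3, fp = 1, fn = 9, tn = 47)
print(round(m, 2))
```

As the list length grows, recall and FPR can only rise, while precision usually falls once the most confident recommendations have been used up, which matches the trend in the user-based tables.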
ROC and Precision-Recall charts are provided for the four models:
The performance charts above both clearly indicate that the two user-based collaborative filtering models perform better than the item-based collaborative filtering models based on the area under the curves. Based on the clear improvement in performance, the computational expense is likely worthwhile. To investigate the difference in performance between similarity methods, zoomed-in performance charts are created:
The areas under the curves for the two user-based models are nearly indistinguishable. The added computational expense of the model using Pearson correlation similarity does not provide any significant increase in performance; therefore the user-based collaborative filtering model with cosine similarity provides the best performance for the computational expense.
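The area-under-the-curve comparison between user-based and item-based models can be roughly checked from the tabulated values. The sketch below computes a partial area under the ROC curve with the trapezoidal rule, using the rounded (FPR, TPR) pairs at n = 5, 10, 15 and 20. It assumes that the first two tables correspond to the cosine UBCF and IBCF models, in the order of the `models` list; the helper `partial_auc` and the cutoff of 0.14 are choices made for this illustration.

```r
# Trapezoidal area under an ROC curve from FPR = 0 up to max_fpr.
partial_auc <- function(fpr, tpr, max_fpr) {
  # Start the curve at the origin (no recommendations made).
  x <- c(0, fpr)
  y <- c(0, tpr)
  # Use the curve's own points below max_fpr, plus max_fpr itself.
  grid <- sort(unique(c(x[x < max_fpr], max_fpr)))
  # Linear interpolation gives the TPR at each grid point.
  y_grid <- approx(x, y, xout = grid)$y
  sum(diff(grid) * (head(y_grid, -1) + tail(y_grid, -1)) / 2)
}

# Rounded (FPR, TPR) pairs at n = 5, 10, 15, 20 from the first two tables.
ubcf_cos <- list(fpr = c(0.02, 0.05, 0.09, 0.14),
                 tpr = c(0.21, 0.37, 0.50, 0.58))
ibcf_cos <- list(fpr = c(0.05, 0.10, 0.15, 0.19),
                 tpr = c(0.09, 0.20, 0.32, 0.43))

# Compare both curves over the same FPR range.
max_fpr <- 0.14
auc_ubcf <- partial_auc(ubcf_cos$fpr, ubcf_cos$tpr, max_fpr)
auc_ibcf <- partial_auc(ibcf_cos$fpr, ibcf_cos$tpr, max_fpr)
print(c(UBCF_cos = auc_ubcf, IBCF_cos = auc_ibcf))
```

Because the inputs are rounded to two decimals and only four points per curve are used, the result is an approximation of what the charts show, not a replacement for them.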