DATA 643 Project 2: System Evaluation

Motivation

The purpose of this exercise is to build multiple recommender systems for the same dataset and evaluate their performance. By applying each system to a held-out testing subset of the data, the accuracy of the various models can be compared.

Data Utilized

The recommenderlab package is used throughout the exercise. The Jester5k dataset included in the package is the rating matrix on which the various models are evaluated. The dataset includes “5000 users from the anonymous ratings data from the Jester Online Joke Recommender System.” Ratings in the matrix range between -10.00 and 10.00.

The dataset is not very sparse: each user has rated at least 36 jokes, and some users have rated all 100 jokes in the dataset. To create some sparsity and leave room for recommendations, only users with more than 20 unrated jokes are retained.

# Keep only the users (rows) with more than 20 of the 100 jokes unrated
Jester5k_rec <- Jester5k[100 - rowCounts(Jester5k) > 20, ]

Creating the Systems

Data Normalization

Naturally, users have different behaviors when rating items; this will be reflected in their average rating, as displayed below.

In order to remove any bias from different users’ average ratings, the ratings are normalized so that each user has an average rating of 0.
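As a minimal sketch of what this centering does, the base R snippet below subtracts each user's mean from that user's ratings. The three users and four jokes are made-up values for illustration only, and the snippet mirrors the idea rather than the internals of recommenderlab's normalize function.

```r
# Hypothetical ratings for three users and four jokes (NA = not rated)
ratings <- matrix(c( 8,  6, NA,  4,
                    -2, NA, -6, -4,
                     1,  3,  5, NA),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("u", 1:3), paste0("j", 1:4)))

# Each user's average over the jokes they actually rated
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each user's mean from that user's row
centered <- sweep(ratings, 1, user_means)

# After centering, every user's average rating is 0
rowMeans(centered, na.rm = TRUE)
```

Unrated entries stay NA, so centering changes only the scale of the ratings a user gave, not which jokes were rated.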

Data Splitting

To evaluate the accuracy of the models, the dataset is split into training and testing sets using the built-in evaluationScheme function. 80% of the users are placed in the training set, with the remaining 20% in the testing set. For each test user, 20 ratings are given to the recommender and the remaining ratings are withheld for evaluation; because every user has rated at least 36 jokes, each test user has withheld ratings to score against. For accuracy evaluation purposes, a threshold defining a “good” rating must be chosen. Given the normalization above, the threshold is set at 0.1, indicating jokes that users like more than their average rating.

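# Center each user's ratings on that user's mean
Jester5k_rec <- normalize(Jester5k_rec)
# 80/20 split of users; 20 ratings per test user are given, the rest withheld
train_test <- evaluationScheme(data = Jester5k_rec, method = "split", train = 0.8, given = 20, goodRating = 0.1)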
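The given and goodRating settings can be illustrated for a single test user with the base R sketch below. The user, the 40 simulated ratings, and the seed are hypothetical; the snippet shows the idea of the protocol, not recommenderlab's implementation.

```r
set.seed(643)  # arbitrary seed so the sketch is reproducible

# Hypothetical test user: 40 rated jokes with mean-centered ratings
user_ratings <- round(rnorm(40, mean = 0, sd = 4), 2)
names(user_ratings) <- paste0("j", seq_along(user_ratings))

given <- 20         # number of ratings shown to the recommender
good_rating <- 0.1  # threshold for a "good" rating

# Randomly choose which ratings the recommender is allowed to see
known_idx <- sample(seq_along(user_ratings), given)
known   <- user_ratings[known_idx]   # input to the recommender
unknown <- user_ratings[-known_idx]  # withheld for scoring

# Withheld jokes the user actually liked; recommendations are scored against these
good_items <- names(unknown)[unknown >= good_rating]
```

A recommendation counts as a hit only when it lands in good_items, which is why each test user needs rated jokes beyond the 20 that are given.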

Models Considered

Adapting the example in Building a Recommendation System with R, four models are evaluated. Two algorithms are used:
- User-based collaborative filtering (UBCF)
- Item-based collaborative filtering (IBCF)

Two similarity methods are used for each algorithm:
- Cosine similarity
- Pearson correlation similarity

models <- list(
  UBCF_cos = list(name = "UBCF", param = list(method = "cosine")), 
  IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
  UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
  IBCF_cor = list(name = "IBCF", param = list(method = "pearson")))
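The two similarity measures named above can be compared on a small example. The sketch below uses two hypothetical users who rated the same five jokes; the ratings and the helper cosine_sim are illustrative assumptions, and recommenderlab's own handling of missing ratings may differ.

```r
# Hypothetical ratings from two users on the same five jokes
a <- c(4.2, -1.5, 7.8, 0.3, -6.1)
b <- c(3.9, 0.4, 5.5, -2.2, -4.8)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

cos_ab  <- cosine_sim(a, b)
pear_ab <- cor(a, b, method = "pearson")

# Pearson correlation is the cosine similarity of the mean-centered vectors
pear_via_cosine <- cosine_sim(a - mean(a), b - mean(b))

c(cosine = cos_ab, pearson = pear_ab, pearson_via_cosine = pear_via_cosine)
```

Because Pearson correlation is cosine similarity applied to centered vectors, the two measures may behave similarly on ratings that have already been centered per user.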

Model Evaluation

Each of the four models is evaluated on the test dataset, with each model providing 1 to 20 recommendations per user.

eval_results <- evaluate(x = train_test, method = models, n = 1:20)

UBCF run fold/sample [model time/prediction time]
     1  [0sec/23.4sec] 
IBCF run fold/sample [model time/prediction time]
     1  [1.32sec/0.14sec] 
UBCF run fold/sample [model time/prediction time]
     1  [0sec/42.54sec] 
IBCF run fold/sample [model time/prediction time]
     1  [2.52sec/0.16sec]

Model Comparison

In addition to accuracy, the models are compared on run time. The total time (in seconds) to build each model and generate its predictions is presented in the table below:

Algorithm	Cosine	Pearson
UBCF	22.50	40.69
IBCF	1.44	2.59

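These times show that the user-based collaborative filtering models take far longer to run than the item-based models: roughly 16 times longer for both similarity methods. The log above shows where the time goes. UBCF does no work when the model is built and instead compares each test user against the training users at prediction time, whereas IBCF precomputes similarities among only 100 items and then predicts almost instantly. This makes sense given the dimensions of the dataset used: 3261 users vs. 100 items. It is necessary to view the accuracy of the generated models to determine if the computational expense of the user-based models is worthwhile.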

Accuracy

The average confusion matrix for each model is extracted at each number of recommendations.

model_performance <- lapply(eval_results, avg)
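The four metrics reported for each model follow directly from the confusion counts. The sketch below computes them from hypothetical counts for one user's recommendation list; the numbers are made up for illustration and do not come from the evaluation.

```r
# Hypothetical confusion counts for one user's top-N list
TP <- 3   # recommended and rated good
FP <- 1   # recommended but not rated good
FN <- 9   # rated good but not recommended
TN <- 67  # neither recommended nor rated good

precision <- TP / (TP + FP)  # share of recommendations that were good
recall    <- TP / (TP + FN)  # share of good items that were recommended
TPR       <- TP / (TP + FN)  # true positive rate: same formula as recall
FPR       <- FP / (FP + TN)  # share of not-good items that were recommended

round(c(precision = precision, recall = recall, TPR = TPR, FPR = FPR), 2)
```

Recall and TPR share a formula, which is why those two columns are identical in every results table.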

The results are presented below:

User-Based, Cosine

n	precision	recall	TPR	FPR
1	0.77	0.04	0.04	0.00
2	0.76	0.08	0.08	0.01
3	0.76	0.13	0.13	0.01
4	0.76	0.17	0.17	0.02
5	0.75	0.21	0.21	0.02
6	0.74	0.24	0.24	0.03
7	0.73	0.28	0.28	0.03
8	0.72	0.31	0.31	0.04
9	0.71	0.34	0.34	0.04
10	0.69	0.37	0.37	0.05
11	0.68	0.40	0.40	0.06
12	0.67	0.43	0.43	0.06
13	0.66	0.45	0.45	0.07
14	0.65	0.48	0.48	0.08
15	0.64	0.50	0.50	0.09
16	0.63	0.52	0.52	0.10
17	0.61	0.53	0.53	0.11
18	0.60	0.55	0.55	0.12
19	0.59	0.56	0.56	0.13
20	0.57	0.58	0.58	0.14

Item-Based, Cosine

n	precision	recall	TPR	FPR
1	0.24	0.01	0.01	0.01
2	0.29	0.03	0.03	0.02
3	0.32	0.05	0.05	0.03
4	0.34	0.07	0.07	0.04
5	0.36	0.09	0.09	0.05
6	0.37	0.11	0.11	0.06
7	0.38	0.13	0.13	0.07
8	0.39	0.15	0.15	0.08
9	0.40	0.18	0.18	0.09
10	0.40	0.20	0.20	0.10
11	0.41	0.22	0.22	0.11
12	0.41	0.25	0.25	0.12
13	0.41	0.27	0.27	0.13
14	0.42	0.30	0.30	0.14
15	0.42	0.32	0.32	0.15
16	0.42	0.34	0.34	0.16
17	0.42	0.37	0.37	0.16
18	0.42	0.39	0.39	0.17
19	0.42	0.41	0.41	0.18
20	0.42	0.43	0.43	0.19

User-Based, Pearson

n	precision	recall	TPR	FPR
1	0.78	0.04	0.04	0.00
2	0.77	0.09	0.09	0.01
3	0.76	0.13	0.13	0.01
4	0.75	0.17	0.17	0.02
5	0.74	0.20	0.20	0.02
6	0.73	0.24	0.24	0.03
7	0.71	0.27	0.27	0.03
8	0.71	0.31	0.31	0.04
9	0.70	0.34	0.34	0.04
10	0.69	0.37	0.37	0.05
11	0.68	0.40	0.40	0.06
12	0.67	0.42	0.42	0.07
13	0.65	0.45	0.45	0.07
14	0.64	0.47	0.47	0.08
15	0.63	0.49	0.49	0.09
16	0.61	0.51	0.51	0.10
17	0.60	0.53	0.53	0.11
18	0.58	0.54	0.54	0.12
19	0.57	0.55	0.55	0.13
20	0.55	0.56	0.56	0.15

Item-Based, Pearson

n	precision	recall	TPR	FPR
1	0.32	0.01	0.01	0.01
2	0.34	0.03	0.03	0.02
3	0.36	0.05	0.05	0.03
4	0.37	0.07	0.07	0.04
5	0.38	0.09	0.09	0.05
6	0.38	0.11	0.11	0.06
7	0.39	0.13	0.13	0.07
8	0.39	0.15	0.15	0.08
9	0.38	0.17	0.17	0.09
10	0.38	0.19	0.19	0.10
11	0.39	0.21	0.21	0.11
12	0.38	0.23	0.23	0.12
13	0.39	0.25	0.25	0.13
14	0.39	0.27	0.27	0.14
15	0.38	0.28	0.28	0.15
16	0.39	0.31	0.31	0.16
17	0.38	0.32	0.32	0.18
18	0.39	0.34	0.34	0.19
19	0.38	0.36	0.36	0.20
20	0.38	0.38	0.38	0.21

Performance

ROC and Precision-Recall charts are provided for the four models:

Conclusions

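The performance charts above both clearly indicate that the two user-based collaborative filtering models perform better than the two item-based collaborative filtering models, based on the area under the curves. The tables tell the same story: at 10 recommendations, both user-based models reach a precision of 0.69 and a recall of 0.37, against 0.40 and 0.20 (cosine) and 0.38 and 0.19 (Pearson) for the item-based models. Given the clear improvement in performance, the additional computational expense is likely worthwhile. To investigate the difference in performance between similarity methods, zoomed-in performance charts are created: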

The areas under the curves for the two user-based models are nearly indistinguishable. The model using Pearson correlation similarity takes almost twice as long to run (40.69 vs. 22.50 seconds) without providing any significant increase in performance; therefore, the user-based collaborative filtering model with cosine similarity provides the best performance for the computational expense.
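The area-under-the-curve comparison can be roughly approximated from the tabulated values. The sketch below copies the FPR and TPR columns from the four results tables and applies the trapezoid rule up to a shared FPR cutoff of 0.13, the largest value that appears in all four tables. The helper partial_auc and the cutoff are assumptions made for this illustration, and the inputs are rounded to two decimals, so the result is a coarse check rather than a replacement for the charts.

```r
# FPR and TPR columns copied from the four results tables (n = 1 to 20)
roc_points <- list(
  UBCF_cos = list(
    fpr = c(0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.04, 0.05,
            0.06, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14),
    tpr = c(0.04, 0.08, 0.13, 0.17, 0.21, 0.24, 0.28, 0.31, 0.34, 0.37,
            0.40, 0.43, 0.45, 0.48, 0.50, 0.52, 0.53, 0.55, 0.56, 0.58)),
  IBCF_cos = list(
    fpr = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10,
            0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.16, 0.17, 0.18, 0.19),
    tpr = c(0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.18, 0.20,
            0.22, 0.25, 0.27, 0.30, 0.32, 0.34, 0.37, 0.39, 0.41, 0.43)),
  UBCF_cor = list(
    fpr = c(0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.04, 0.05,
            0.06, 0.07, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.15),
    tpr = c(0.04, 0.09, 0.13, 0.17, 0.20, 0.24, 0.27, 0.31, 0.34, 0.37,
            0.40, 0.42, 0.45, 0.47, 0.49, 0.51, 0.53, 0.54, 0.55, 0.56)),
  IBCF_cor = list(
    fpr = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10,
            0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.18, 0.19, 0.20, 0.21),
    tpr = c(0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19,
            0.21, 0.23, 0.25, 0.27, 0.28, 0.31, 0.32, 0.34, 0.36, 0.38))
)

# Trapezoid-rule area under the ROC points from (0, 0) up to max_fpr
partial_auc <- function(fpr, tpr, max_fpr) {
  keep <- fpr <= max_fpr
  x <- c(0, fpr[keep])
  y <- c(0, tpr[keep])
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

max_fpr <- 0.13  # largest FPR value present in all four tables
auc <- sapply(roc_points, function(p) partial_auc(p$fpr, p$tpr, max_fpr))
round(auc, 3)
```

On these rounded inputs, the two user-based values fall close together and well above the two item-based values, which is consistent with the conclusion drawn from the charts.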