Assignment

The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you’re asked to do at least one or (if you like) both of the following:

Deliverables

  1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.

  2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

  3. Compare and report on any change in accuracy before and after you’ve made the change in #2.

  4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.

Introduction and Motivation

In this assignment, we will create a system to provide joke recommendations to a user. The goal will be to intersperse the recommendations with serendipitous choices - in other words, to swap out some of the top \(n\) recommendations for recommendations that fall outside the top \(n\) results.

Technique

First, we will need to identify which recommender system to use, so we will compare the output of 5 algorithms:

  • UBCF - user-based collaborative filtering with (1) cosine and (2) Pearson similarity
  • IBCF - item-based collaborative filtering with (3) cosine and (4) Pearson similarity
  • RANDOM - recommender system that provides random results (5)

For each recommender system, we will use 3-fold cross-validation to generate predictions and measure the precision. Once we’ve identified the most suitable algorithm, we will introduce novel results into the recommendations by switching out a portion of the top \(n\) recommendations.

Data Source

We will be using the Jester dataset for this assignment. This dataset contains data from 24,983 users who have each rated 36 or more jokes. Ratings are real values ranging from -10.00 to +10.00 (the value “99” corresponds to “null” = “not rated”). Each row in the dataset represents a different user, and the first column gives the total number of jokes that user has rated. The remaining 100 columns give the ratings for each joke.

Data Cleansing

First, we will load the Jester data set. We will remove the column containing the number of rated jokes, since it is not used by the recommendation system. Additionally, the raw data represents non-rated jokes with the value 99, so we will replace these values with nulls (NA). Finally, we will subset the data to 5,000 users to speed up computation time.
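
A minimal sketch of this cleansing step, assuming the raw ratings live in a CSV file; the file name, the use of read.csv(), and the set.seed() value are assumptions, not taken from the original write-up:

library(recommenderlab)

# Load the raw Jester ratings (hypothetical file name).
raw <- read.csv("jester-data-1.csv", header = FALSE)

# Drop the first column (the count of jokes each user rated);
# it is not needed by the recommender.
ratings <- raw[, -1]

# The raw data encodes "not rated" as 99 -- replace those with NA.
ratings[ratings == 99] <- NA

# Subset to 5,000 users to keep computation time manageable, then convert
# to recommenderlab's realRatingMatrix class.
set.seed(42)
ratings <- ratings[sample(nrow(ratings), 5000), ]
jester  <- as(as.matrix(ratings), "realRatingMatrix")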

Data Exploration

Let’s dive a little deeper into our data.

First, let’s take a look at the number of jokes that each user has rated. The dataset only includes users who have rated at least 36 jokes, and it appears that most individuals have rated either around 70 or all 100 jokes.

Next, we can look at the number of ratings that each joke has. We can see that many of the jokes were rated by all 5000 users.

Now we can take a look at the average rating across all jokes. The median average rating is a little above 0 (neutral), which means that jokes tend to be rated slightly positively on average.

##         0%        25%        50%        75%       100% 
## -3.8296705 -0.3050142  0.8847812  1.8412310  4.0056984

We can also plot the average ratings. A look at the distribution shows that most jokes have an average rating between -2 and 3. We have a few outliers that are rated more positively (+4) and more negatively (-4).
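
A sketch of how these exploration summaries might be generated with recommenderlab’s accessor functions and ggplot2, reusing the jester object from the cleansing sketch above:

library(ggplot2)

# Jokes rated per user, ratings received per joke, and per-joke average rating.
ratings_per_user <- rowCounts(jester)
ratings_per_joke <- colCounts(jester)
avg_joke_rating  <- colMeans(jester)

# Quantiles of the average joke rating (as shown above).
quantile(avg_joke_rating)

# Distribution of average joke ratings.
ggplot(data.frame(avg = avg_joke_rating), aes(x = avg)) +
  geom_histogram(bins = 30) +
  labs(x = "Average joke rating", y = "Number of jokes")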


From this data exploration, we can see that most users rated a large share of the jokes, and many jokes were rated by every user. The median average rating is a little above 0, so we will use 1 as our threshold for a good joke.

Initial Recommender Systems

Parameters

We’ll define the following:

  • Training Percent: The percent of the data that should be used in training. The remaining data will be used for testing.
  • Items To Keep: The number of items that will be given to the recommender to generate recommendations. The remaining items will be used to test the model’s accuracy. We’ll identify the minimum number of jokes that any individual has rated and use a few fewer than that.
  • Rating Threshold: The threshold to be used for positive ratings. Since our data is on a scale of -10 to 10, we will use 1 as the threshold for a good joke.
  • Number of Folds: This is the number of folds that will be used for k-fold validation.

Finally, we’ll define our evaluation scheme for the models.
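
A sketch of how these parameters and the evaluation scheme might be defined; apart from the rating threshold and the number of folds stated above, the exact values (in particular the offset of 5 below the minimum rating count) are assumptions:

# Evaluation parameters.
items_keep  <- min(rowCounts(jester)) - 5  # jokes given to the recommender per test user
good_rating <- 1                           # threshold for a "good" joke on the -10..10 scale
n_folds     <- 3                           # folds for k-fold cross-validation

# 3-fold cross-validation scheme: for each test user, `items_keep` ratings are
# given to the recommender and the rest are held out. With cross-validation the
# training percentage is implied by the folds, so no explicit train proportion
# is passed.
scheme <- evaluationScheme(jester,
                           method     = "cross-validation",
                           k          = n_folds,
                           given      = items_keep,
                           goodRating = good_rating)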

Creation of Systems

Now that we’ve set up the evaluation scheme for our recommender systems, we can compare different models. We will evaluate the output of 2 IBCF models (using cosine and Pearson similarities), 2 UBCF models (again using cosine and Pearson similarities), and 1 random model as a baseline. We will also vary the number of recommendations, using \(n\) = 1, 5, 10, 15, and 20.
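
A sketch of the model comparison, with algorithm labels chosen to match the result tables further below:

# Five candidate algorithms: IBCF and UBCF with cosine and Pearson
# similarity, plus a RANDOM baseline.
algorithms <- list(
  IBCF_cos = list(name = "IBCF",   param = list(method = "cosine")),
  IBCF_cor = list(name = "IBCF",   param = list(method = "pearson")),
  UBCF_cos = list(name = "UBCF",   param = list(method = "cosine")),
  UBCF_cor = list(name = "UBCF",   param = list(method = "pearson")),
  RANDOM   = list(name = "RANDOM", param = NULL)
)

# Evaluate top-N recommendation lists for n = 1, 5, 10, 15, 20.
results <- evaluate(scheme, algorithms,
                    type = "topNList", n = c(1, 5, 10, 15, 20))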

## IBCF run fold/sample [model time/prediction time]
##   1  [0.44sec/0.57sec] 
##   2  [0.34sec/0.56sec] 
##   3  [0.27sec/0.28sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.38sec/0.57sec] 
##   2  [0.37sec/0.45sec] 
##   3  [0.51sec/0.31sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/8.06sec] 
##   2  [0.03sec/12.43sec] 
##   3  [0.05sec/10.08sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.04sec/9.87sec] 
##   2  [0.03sec/8.23sec] 
##   3  [0.03sec/8.55sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.62sec] 
##   2  [0sec/0.49sec] 
##   3  [0sec/0.5sec]

Comparisons

We can look at the average results across all folds for each algorithm. Each row represents a different number of recommendations. We can see that as the number of recommendations increases, recall increases for every algorithm, while precision improves slightly for the IBCF models but drops noticeably for the UBCF models.
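
A minimal sketch of how these averages might be obtained, assuming the results object from the comparison sketch above:

# Average confusion-matrix counts and derived metrics across the three folds,
# one table per algorithm and one row per list length n.
avg(results)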

## $IBCF_cos
##           TP         FP       FN       TN precision      recall
## 1  0.2547962  0.7452038 21.60172 46.39828 0.2547962 0.009711868
## 5  1.3083533  3.6916467 20.54816 43.45184 0.2616707 0.052932956
## 10 2.7064349  7.2935651 19.15008 39.84992 0.2706435 0.113040744
## 15 4.2038369 10.7961631 17.65268 36.34732 0.2802558 0.181786860
## 20 5.7793765 14.2206235 16.07714 32.92286 0.2889688 0.254224427
##            TPR        FPR
## 1  0.009711868 0.01563933
## 5  0.052932956 0.07805979
## 10 0.113040744 0.15461808
## 15 0.181786860 0.22870896
## 20 0.254224427 0.30133031
## 
## $IBCF_cor
##           TP         FP       FN       TN precision     recall        TPR
## 1  0.3419265  0.6580735 21.51459 46.48541 0.3419265 0.01400987 0.01400987
## 5  1.9578337  3.0421663 19.89868 44.10132 0.3915667 0.09118989 0.09118989
## 10 4.1968425  5.8031575 17.65967 41.34033 0.4196843 0.20676132 0.20676132
## 15 6.3960831  8.6039169 15.46043 38.53957 0.4264055 0.32079364 0.32079364
## 20 8.4720224 11.5279776 13.38449 35.61551 0.4236011 0.42312147 0.42312147
##           FPR
## 1  0.01313838
## 5  0.06081768
## 10 0.11622541
## 15 0.17271128
## 20 0.23217592
## 
## $UBCF_cos
##           TP         FP       FN       TN precision     recall        TPR
## 1  0.7342126  0.2657874 21.12230 46.87770 0.7342126 0.05463168 0.05463168
## 5  3.3647082  1.6352918 18.49181 45.50819 0.6729416 0.22357549 0.22357549
## 10 6.0449640  3.9550360 15.81155 43.18845 0.6044964 0.36148458 0.36148458
## 15 8.0655476  6.9344524 13.79097 40.20903 0.5377032 0.44976838 0.44976838
## 20 9.6486811 10.3513189 12.20783 36.79217 0.4824341 0.51206237 0.51206237
##            FPR
## 1  0.005365546
## 5  0.032441140
## 10 0.077819982
## 15 0.137342446
## 20 0.206851832
## 
## $UBCF_cor
##          TP        FP       FN       TN precision     recall        TPR
## 1  0.741207  0.258793 21.11531 46.88469 0.7412070 0.05610683 0.05610683
## 5  3.412270  1.587730 18.44424 45.55576 0.6824540 0.22847722 0.22847722
## 10 6.129696  3.870304 15.72682 43.27318 0.6129696 0.36894983 0.36894983
## 15 8.188849  6.811151 13.66767 40.33233 0.5459233 0.45850204 0.45850204
## 20 9.817346 10.182654 12.03917 36.96083 0.4908673 0.52301642 0.52301642
##            FPR
## 1  0.005239679
## 5  0.031555908
## 10 0.076086481
## 15 0.134401485
## 20 0.202758536
## 
## $RANDOM
##          TP        FP       FN       TN precision     recall        TPR
## 1  0.347522  0.652478 21.50899 46.49101 0.3475220 0.01692541 0.01692541
## 5  1.662070  3.337930 20.19444 43.80556 0.3324141 0.07734025 0.07734025
## 10 3.209832  6.790168 18.64668 40.35332 0.3209832 0.14790227 0.14790227
## 15 4.770983 10.229017 17.08553 36.91447 0.3180655 0.21929545 0.21929545
## 20 6.324740 13.675260 15.53177 33.46823 0.3162370 0.29063756 0.29063756
##           FPR
## 1  0.01392500
## 5  0.07108681
## 10 0.14395558
## 15 0.21709316
## 20 0.29040571

We can also visualize the ROC curves for each of the algorithms we’ve run. Each marker on the graph represents the true positive rate and false positive rate for a given number of recommendations \(n\). The plot shows higher performance from the UBCF models.

The goal of our recommender system is to provide jokes that are funny to a user, so we want to minimize the number of false positives (recommended jokes the user does not like). We will therefore choose the algorithm with the highest precision, i.e. the proportion of recommended jokes that the user actually rated as good. Once again, the UBCF models outperform the others.
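
A sketch of the ROC and precision-recall plots discussed above, using recommenderlab’s plot methods for evaluation results and the results object from the earlier sketch:

# ROC curves: TPR vs FPR, one curve per algorithm, annotated with n.
plot(results, annotate = TRUE, legend = "topleft")

# Precision-recall curves for the same runs.
plot(results, "prec/rec", annotate = TRUE, legend = "bottomright")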

Model Tuning

Based on this analysis, we will choose the UBCF model with Pearson similarity and 5 recommendations. We can further tune the nn parameter (the neighborhood size) of the model, comparing 100, 150, and 200 neighbors.
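
A sketch of the neighborhood-size tuning, with nn values taken from the UBCF_100 / UBCF_150 / UBCF_200 labels in the output below:

# UBCF with Pearson similarity and three candidate neighborhood sizes.
nn_values <- c(100, 150, 200)
ubcf_candidates <- lapply(nn_values, function(nn) {
  list(name = "UBCF", param = list(method = "pearson", nn = nn))
})
names(ubcf_candidates) <- paste0("UBCF_", nn_values)

# Evaluate top-5 lists only, since 5 recommendations were chosen above;
# avg(tuning_results) gives the per-model averages shown further below.
tuning_results <- evaluate(scheme, ubcf_candidates, type = "topNList", n = 5)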

## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/8.36sec] 
##   2  [0.03sec/7.54sec] 
##   3  [0.02sec/6.93sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/8.39sec] 
##   2  [0.02sec/7.73sec] 
##   3  [0.01sec/7.41sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/8.33sec] 
##   2  [0.03sec/9.34sec] 
##   3  [0.03sec/9.14sec]

We’ll pick the model with the best precision. The 150- and 200-neighbor models are nearly tied (150 is marginally ahead), and we’ll proceed with 200 neighbors:

## $UBCF_100
##         TP       FP       FN       TN precision    recall       TPR
## 5 3.511591 1.488409 18.34492 45.65508 0.7023181 0.2422375 0.2422375
##          FPR
## 5 0.02955098
## 
## $UBCF_150
##         TP       FP       FN       TN precision    recall       TPR
## 5 3.531974 1.468026 18.32454 45.67546 0.7063949 0.2447784 0.2447784
##          FPR
## 5 0.02916969
## 
## $UBCF_200
##         TP       FP       FN       TN precision    recall       TPR
## 5 3.528177 1.471823 18.32834 45.67166 0.7056355 0.2444135 0.2444135
##          FPR
## 5 0.02923025

Introduce Novelty

To introduce novelty into the recommendations, we’ll take a percentage of the top recommendations from the UBCF model and swap them out for randomly selected jokes. To do this, we’ll define a recommender using the RANDOM method and then create a hybrid recommender that combines it with the UBCF model.
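
A sketch of the hybrid construction with recommenderlab’s HybridRecommender(); the 80/20 weighting is an assumption, chosen only to illustrate swapping out a portion of the UBCF recommendations, and the scheme, items_keep, and good_rating objects carry over from the earlier sketches:

# Final UBCF model (Pearson similarity, 200 neighbors) and a RANDOM
# recommender used to inject novel jokes.
ubcf_rec   <- Recommender(getData(scheme, "train"),
                          method = "UBCF",
                          parameter = list(method = "pearson", nn = 200))
random_rec <- Recommender(getData(scheme, "train"), method = "RANDOM")

# Weighted hybrid: most recommendations come from UBCF, a portion from RANDOM.
hybrid_rec <- HybridRecommender(ubcf_rec, random_rec, weights = c(0.8, 0.2))

# Top-5 lists -> precision/recall/TPR/FPR for the hybrid model.
hybrid_top5     <- predict(hybrid_rec, getData(scheme, "known"),
                           type = "topNList", n = 5)
hybrid_topn_acc <- calcPredictionAccuracy(hybrid_top5, getData(scheme, "unknown"),
                                          given = items_keep, goodRating = good_rating)

# Rating predictions -> RMSE/MSE/MAE for the hybrid model.
hybrid_ratings    <- predict(hybrid_rec, getData(scheme, "known"), type = "ratings")
hybrid_rating_acc <- calcPredictionAccuracy(hybrid_ratings, getData(scheme, "unknown"))

hybrid_topn_acc
hybrid_rating_acc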

##          TP          FP          FN          TN   precision      recall 
##  3.08393285  1.91606715 18.68405276 45.31594724  0.61678657  0.19829771 
##         TPR         FPR 
##  0.19829771  0.03765615
##      RMSE       MSE       MAE 
##  4.369204 19.089941  3.454902

Compare Results

We can compare the results of the UBCF-only model and the hybrid model. The hybrid model has lower precision and a higher RMSE.
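
A sketch of how this comparison table might be assembled, computing the same metrics for the UBCF-only model and reusing the hybrid results from the sketch above (the variable names are illustrative):

# Metrics for the UBCF-only model, for comparison with the hybrid model.
ubcf_top5     <- predict(ubcf_rec, getData(scheme, "known"), type = "topNList", n = 5)
ubcf_topn_acc <- calcPredictionAccuracy(ubcf_top5, getData(scheme, "unknown"),
                                        given = items_keep, goodRating = good_rating)

ubcf_ratings    <- predict(ubcf_rec, getData(scheme, "known"), type = "ratings")
ubcf_rating_acc <- calcPredictionAccuracy(ubcf_ratings, getData(scheme, "unknown"))

data.frame(METHOD    = c("UBCF", "HYBRID"),
           PRECISION = unname(c(ubcf_topn_acc["precision"], hybrid_topn_acc["precision"])),
           RMSE      = unname(c(ubcf_rating_acc["RMSE"], hybrid_rating_acc["RMSE"])))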

##   METHOD PRECISION     RMSE
## 1   UBCF 0.7043165 4.265066
## 2 HYBRID 0.6167866 4.369204

We can also take a look at a comparison of the top 5 suggestions for one of the users. From this, we can see two things: the two lists share most of the same jokes (only the order changes), and one recommendation has been swapped out for a joke that did not appear in the original top 5.
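
A sketch of how such a per-user comparison might be produced; the choice of user and the assumption that the UBCF list is printed first are both illustrative:

# Top-5 joke lists for a single test user from each model
# (assumed order: UBCF first, then the hybrid).
user <- getData(scheme, "known")[2, ]
as(predict(ubcf_rec,   user, type = "topNList", n = 5), "list")
as(predict(hybrid_rec, user, type = "topNList", n = 5), "list")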

## $`2`
## [1] 53 29 35 66  9
## $`2`
## [1] 35 53 29 66 31

Discussion

Recommender systems can be evaluated offline or online. The purpose of evaluating a recommender system is to select algorithms for use in a production setting. Offline evaluations test the effectiveness of recommender system algorithms on a given dataset. Online evaluation attempts to evaluate recommender systems through a method called A/B testing, where one group of users is served by recommender system A and another group by recommender system B. The recommender system that achieves a higher score according to a chosen metric (for example, Click-Through Rate) is chosen as the better system, provided other factors such as latency and complexity are comparable (Gebremeskel & De Vries, n.d.).

In this project, an offline evaluation was conducted. Precision and RMSE are two evaluation metrics that were used to assess the recommender algorithms. Novelty was also introduced as a business goal. There is no guarantee that the absolute performance of these algorithms will hold in an online evaluation.

In direct response to question four, one metric that could be evaluated only if online evaluation were possible is Click-Through Rate (CTR). CTR is the ratio of clicked recommendations to displayed recommendations and reflects the rate of user acceptance and assumed satisfaction. In this case, the assumption is that when a user clicks, downloads, or buys a recommended item, the user liked the recommendation. This could reveal the effect of the introduced serendipity and help identify the level of serendipity that would be most useful for favorable future outcomes.

Reference

Gebremeskel, G., & De Vries, A. (n.d.). Recommender Systems Evaluations: Offline, Online, Time and A/A Test. Retrieved June 28, 2020, from http://ceur-ws.org/Vol-1609/16090642.pdf