PROJECT 4 - Accuracy and Beyond

The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.

Deliverables

  1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.

  2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

  3. Compare and report on any change in accuracy before and after you’ve made the change in #2.

  4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.

Libraries used

  • library(kableExtra)
  • library(recommenderlab)
  • library(ggplot2)
  • library(dplyr)

Load Data

For this project, I'll be using the Jester5k dataset that ships with the recommenderlab package.

data(Jester5k)
df <- as(Jester5k, "data.frame")

kable(head(df,10)) %>% kable_styling("striped", full_width = F)
        user   item  rating
1       u2841  j1      7.91
3315    u2841  j2      9.17
6963    u2841  j3      5.34
10301   u2841  j4      8.16
13443   u2841  j5     -8.74
18441   u2841  j6      7.14
22514   u2841  j7      8.88
27513   u2841  j8     -8.25
32513   u2841  j9      5.87
35686   u2841  j10     6.21
# Summary of ratings per user
summary(rowCounts(Jester5k))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.00   53.00   72.00   72.42  100.00  100.00
# Matrix size
dim(Jester5k)
## [1] 5000  100
# Number of ratings
nratings(Jester5k)
## [1] 362106
# 'best' joke with highest average rating
best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."
# 'worst' joke
worst <- which.min(colMeans(Jester5k))
cat(JesterJokes[worst])
## How many teddybears does it take to change a lightbulb? It takes only one teddybear, but it takes a whole lot of lightbulbs.

Distribution of Ratings
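
The overall rating distribution can be inspected with the ggplot2 package loaded above. The following is a minimal sketch; the bin width and labels are my own choices.

# Sketch: distribution of all Jester5k ratings (ratings range from -10 to 10)
ggplot(df, aes(x = rating)) +
  geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Jester5k ratings",
       x = "Rating (-10 to 10)", y = "Number of ratings")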

Define the training and test sets

set.seed(100)

# 80/20 train/test split; each test user "gives" 30 ratings to the recommender
# and the remaining ratings are held out for scoring. Ratings >= 1 count as "good".
eval_sets <- evaluationScheme(data = Jester5k, method = "split", train = 0.8, given = 30, goodRating = 1)

# Evaluation datasets
eval_train   <- getData(eval_sets, "train")    # training users
eval_known   <- getData(eval_sets, "known")    # the 30 given ratings per test user
eval_unknown <- getData(eval_sets, "unknown")  # held-out ratings used for scoring

Build several Recommender Models and run some Predictions

Let us build several recommender models and compare their accuracy.    
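
The chunk that produced the figures below is a plausible sketch of running recommenderlab's evaluate() on the scheme defined above; treat it as an illustration rather than the exact original code.

# Sketch: evaluate several rating-prediction algorithms on the same scheme
algorithms <- list(
  IBCF    = list(name = "IBCF",    param = NULL),
  UBCF    = list(name = "UBCF",    param = NULL),
  POPULAR = list(name = "POPULAR", param = NULL),
  SVD     = list(name = "SVD",     param = NULL),
  RANDOM  = list(name = "RANDOM",  param = NULL)
)

results <- evaluate(eval_sets, algorithms, type = "ratings")

# Average RMSE/MSE/MAE per algorithm, sorted from most to least accurate
acc <- do.call(rbind, avg(results))
rownames(acc) <- names(algorithms)
kable(acc[order(acc[, "RMSE"]), ]) %>% kable_styling("striped", full_width = F)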

          RMSE     MSE    MAE
SVD      4.342  18.854  3.431
POPULAR  4.352  18.937  3.441
UBCF     4.396  19.329  3.469
IBCF     4.855  23.567  3.854
RANDOM   6.190  38.319  4.781
## IBCF run fold/sample [model time/prediction time]
##   1  [0.255sec/0.187sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.119sec/3.88sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.057sec/1.908sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.12sec/0.185sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.003sec/0.183sec]

Since the SVD recommender has the lowest error, and therefore the highest prediction accuracy, among the five algorithms tested, let's extract that model for further use.
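
A minimal sketch of the extraction step, assuming the standard Recommender / predict / calcPredictionAccuracy workflow:

# Sketch: fit the best-performing model (SVD) and check its error on the test split
svd_rec  <- Recommender(eval_train, method = "SVD")
svd_pred <- predict(svd_rec, eval_known, type = "ratings")
calcPredictionAccuracy(svd_pred, eval_unknown)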


Increase Serendipity


Let's see the effect of adding serendipity to the recommendations, using the accuracy results above (with SVD as the best baseline) for comparison.
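
One plausible way to inject serendipity (an assumption, not necessarily the method that produced the numbers below) is to blend each base model with a RANDOM recommender via recommenderlab's HybridRecommender, shown here for SVD:

# Sketch: blend SVD with a RANDOM recommender to add serendipity;
# the 80/20 weighting is an illustrative choice
svd_serendip <- HybridRecommender(
  Recommender(eval_train, method = "SVD"),
  Recommender(eval_train, method = "RANDOM"),
  weights = c(0.8, 0.2)
)

serendip_pred <- predict(svd_serendip, eval_known, type = "ratings")
calcPredictionAccuracy(serendip_pred, eval_unknown)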


           RMSE     MSE    MAE
POPULAR2  4.452  19.824  3.514
UBCF2     4.501  20.256  3.537
SVD2      4.521  20.442  3.571
IBCF2     5.007  25.070  3.972
RANDOM    6.352  40.346  4.925
## UBCF run fold/sample [model time/prediction time]
##   1  [0.031sec/3.838sec] 
##   2  [0.056sec/3.49sec] 
##   3  [0.032sec/3.283sec] 
##   4  [0.034sec/3.23sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.209sec/0.204sec] 
##   2  [0.175sec/0.224sec] 
##   3  [0.181sec/0.292sec] 
##   4  [0.177sec/0.226sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.033sec/2.26sec] 
##   2  [0.037sec/2.321sec] 
##   3  [0.037sec/2.35sec] 
##   4  [0.035sec/2.612sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.112sec/0.318sec] 
##   2  [0.111sec/0.244sec] 
##   3  [0.103sec/0.362sec] 
##   4  [0.115sec/0.246sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.002sec/0.295sec] 
##   2  [0.004sec/0.34sec] 
##   3  [0.004sec/0.291sec] 
##   4  [0.003sec/0.315sec]


Conclusion

Of the algorithms tested (UBCF, IBCF, SVD, POPULAR, and RANDOM), SVD had the highest prediction accuracy (lowest RMSE, MSE, and MAE) in the baseline comparison. After the serendipity change, every algorithm's error increased and POPULAR came out on top.

Recommender systems can be evaluated offline or online. The purpose of recommender system evaluation is to select algorithms for use in a production setting.

Offline evaluation tests the effectiveness of recommender system algorithms on a fixed dataset. To measure accuracy, precision at position n (P@n) is often used: it expresses how many ground-truth items appear within the top n recommendations. The purpose of offline evaluation is to select recommender systems for deployment online; offline evaluations are cheaper to run and reproducible.
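
For example, precision and recall at several list lengths can be obtained from the same evaluation scheme by switching to a top-N evaluation. A sketch for the SVD model, with list lengths of my own choosing:

# Sketch: precision/recall at n = 1, 5, 10 for SVD, using goodRating = 1
# from the evaluation scheme as the relevance threshold
topn_results <- evaluate(eval_sets, method = "SVD", type = "topNList", n = c(1, 5, 10))
avg(topn_results)[, c("precision", "recall")]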

In the research paper 'Recommender Systems Evaluations: Offline, Online, Time and A/A Test', online evaluation is described as A/B testing: one group of users is served by recommender system A and another group by recommender system B. The system that achieves the higher score on a chosen metric (for example, click-through rate) is judged the better recommender, provided other factors such as latency and complexity are comparable.

As mentioned, A/B testing (or multivariate testing) is today's most prominent online evaluation approach for recommender systems: several candidate systems are integrated, users are divided into groups, and the systems are compared head to head. It is comparatively costly because it consumes development resources.
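
As an illustration of how such a comparison could be analyzed, the sketch below uses hypothetical, simulated click logs; the user counts and click-through rates are invented for illustration only.

# Sketch: compare click-through rates of two systems in a hypothetical A/B test
set.seed(42)
arm    <- sample(c("A", "B"), size = 2000, replace = TRUE)               # random user assignment
clicks <- rbinom(2000, size = 1, prob = ifelse(arm == "A", 0.12, 0.14))  # assumed CTRs

clicks_per_arm <- tapply(clicks, arm, sum)
users_per_arm  <- tapply(clicks, arm, length)
prop.test(clicks_per_arm, users_per_arm)   # two-sample test of equal CTR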


References:

Hijikata, Y. Offline Evaluation for Recommender Systems. http://soc-research.org/wp-content/uploads/2014/11/OfflineTest4RS.pdf

Gebrekirstos, G., et al. Recommender Systems Evaluations: Offline, Online, Time and A/A Test. http://ceur-ws.org/Vol-1609/16090642.pdf