The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.
Deliverables
1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
2. Implement support for at least one business or user experience goal, such as increased serendipity, novelty, or diversity.
3. Compare and report on any change in accuracy before and after making the change in #2.
4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.
For this project, I'll be using the Jester5k dataset available in the recommenderlab package.
# Load recommenderlab (Jester5k data and algorithms), knitr (kable) and kableExtra (kable_styling, %>%)
library(recommenderlab); library(knitr); library(kableExtra)
data(Jester5k)
df <- as(Jester5k, "data.frame")   # long format: user, item, rating
kable(head(df, 10)) %>% kable_styling("striped", full_width = F)
| | user | item | rating |
|---|---|---|---|
| 1 | u2841 | j1 | 7.91 |
| 3315 | u2841 | j2 | 9.17 |
| 6963 | u2841 | j3 | 5.34 |
| 10301 | u2841 | j4 | 8.16 |
| 13443 | u2841 | j5 | -8.74 |
| 18441 | u2841 | j6 | 7.14 |
| 22514 | u2841 | j7 | 8.88 |
| 27513 | u2841 | j8 | -8.25 |
| 32513 | u2841 | j9 | 5.87 |
| 35686 | u2841 | j10 | 6.21 |
# Summary of ratings per user
summary(rowCounts(Jester5k))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 53.00 72.00 72.42 100.00 100.00
# Matrix size
dim(Jester5k)
## [1] 5000 100
# Number of ratings
nratings(Jester5k)
## [1] 362106
# 'best' joke with highest average rating
best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."
# 'worst' joke
worst <- which.min(colMeans(Jester5k))
cat(JesterJokes[worst])
## How many teddybears does it take to change a lightbulb? It takes only one teddybear, but it takes a whole lot of lightbulbs.
set.seed(100)
eval_sets <- evaluationScheme(data = Jester5k, method = "split", train = 0.8, given = 30, goodRating = 1)
# Evaluation datasets: 80% of users for training; for each test user, 30 ratings
# are 'known' to the recommender and the remainder are held out as 'unknown'
eval_train   <- getData(eval_sets, "train")
eval_known   <- getData(eval_sets, "known")
eval_unknown <- getData(eval_sets, "unknown")
Let us build several recommender models and compare their accuracy.
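The code for this step isn't shown in the output; a minimal sketch that reproduces the same RMSE/MSE/MAE comparison, assuming each algorithm is trained on eval_train and scored with calcPredictionAccuracy() (the per-fold timings below suggest the original run used evaluate(), which yields equivalent error metrics; the object names models and errors are mine), is:

# Train each candidate algorithm, predict ratings for the test users' held-out
# items, and collect RMSE / MSE / MAE per algorithm
models <- c("SVD", "POPULAR", "UBCF", "IBCF", "RANDOM")
errors <- t(sapply(models, function(m) {
  rec  <- Recommender(eval_train, method = m)
  pred <- predict(rec, eval_known, type = "ratings")
  calcPredictionAccuracy(pred, eval_unknown)
}))
kable(round(errors[order(errors[, "RMSE"]), ], 3)) %>% kable_styling("striped", full_width = F)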
| | RMSE | MSE | MAE |
|---|---|---|---|
| SVD | 4.342 | 18.854 | 3.431 |
| POPULAR | 4.352 | 18.937 | 3.441 |
| UBCF | 4.396 | 19.329 | 3.469 |
| IBCF | 4.855 | 23.567 | 3.854 |
| RANDOM | 6.190 | 38.319 | 4.781 |
## IBCF run fold/sample [model time/prediction time]
## 1 [0.255sec/0.187sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.119sec/3.88sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.057sec/1.908sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.12sec/0.185sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.183sec]
Since the SVD recommender has the lowest error (RMSE, MSE, and MAE) of the recommenders tested, we'll use it as the base model going forward.
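A short sketch of how the chosen SVD model can be trained and used for top-N recommendations (the n = 5 preview and object names are mine):

# Train the final SVD recommender on the training users and preview top-5 jokes
svd_rec  <- Recommender(eval_train, method = "SVD")
svd_top5 <- predict(svd_rec, eval_known, type = "topNList", n = 5)
as(svd_top5, "list")[1:3]   # recommended joke IDs for the first three test users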
Let's see the effect of adding serendipity, with SVD as the base.
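The exact serendipity mechanism isn't shown in the results below (the timings indicate a 4-fold cross-validation run); a minimal sketch, assuming each base algorithm is blended with a small RANDOM component through recommenderlab's HybridRecommender (the 0.8/0.2 weights are purely illustrative), looks like this for the SVD-based variant:

# Blend the SVD model with a RANDOM recommender so some unexpected jokes surface
svd_part  <- Recommender(eval_train, method = "SVD")
rand_part <- Recommender(eval_train, method = "RANDOM")
svd2      <- HybridRecommender(svd_part, rand_part, weights = c(0.8, 0.2))
# Accuracy of the serendipity-flavoured model on the held-out ratings
pred2 <- predict(svd2, eval_known, type = "ratings")
calcPredictionAccuracy(pred2, eval_unknown)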
| | RMSE | MSE | MAE |
|---|---|---|---|
| POPULAR2 | 4.452 | 19.824 | 3.514 |
| UBCF2 | 4.501 | 20.256 | 3.537 |
| SVD2 | 4.521 | 20.442 | 3.571 |
| IBCF2 | 5.007 | 25.070 | 3.972 |
| RANDOM | 6.352 | 40.346 | 4.925 |
## UBCF run fold/sample [model time/prediction time]
## 1 [0.031sec/3.838sec]
## 2 [0.056sec/3.49sec]
## 3 [0.032sec/3.283sec]
## 4 [0.034sec/3.23sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.209sec/0.204sec]
## 2 [0.175sec/0.224sec]
## 3 [0.181sec/0.292sec]
## 4 [0.177sec/0.226sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.033sec/2.26sec]
## 2 [0.037sec/2.321sec]
## 3 [0.037sec/2.35sec]
## 4 [0.035sec/2.612sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.112sec/0.318sec]
## 2 [0.111sec/0.244sec]
## 3 [0.103sec/0.362sec]
## 4 [0.115sec/0.246sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.002sec/0.295sec]
## 2 [0.004sec/0.34sec]
## 3 [0.004sec/0.291sec]
## 4 [0.003sec/0.315sec]
With the serendipity component added, POPULAR has the highest prediction accuracy (lowest RMSE, MSE, and MAE) of the algorithms tested (UBCF, IBCF, SVD, POPULAR, and RANDOM). Compared with the original models, accuracy dropped slightly across the board (the best RMSE rose from 4.342 for SVD to 4.452 for POPULAR), which is the expected cost of favouring serendipity over pure accuracy.
Recommender systems can be evaluated offline or online. The purpose of recommender system evaluation is to select algorithms for use in a production setting.
Offline evaluation tests the effectiveness of recommender system algorithms on a fixed dataset. To measure accuracy, precision at position n (P@n) is often used, expressing how many of the ground-truth items are recommended within the top n recommendations. The purpose of offline evaluation is to select recommender systems for online deployment. Offline evaluations are easier to run and are reproducible.
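For reference, P@n can be computed offline in recommenderlab by evaluating top-N lists against the goodRating = 1 relevance threshold already set in the evaluation scheme; a brief sketch (the choice of n values is arbitrary):

# Precision and recall at n = 1, 5 and 10 recommendations for the SVD model
topn_results <- evaluate(eval_sets, method = "SVD", type = "topNList", n = c(1, 5, 10))
avg(topn_results)   # averaged confusion matrix, including precision and recall columns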
In the research paper 'Recommender Systems Evaluations: Offline, Online, Time and A/A Test', online evaluation is described as evaluating recommender systems through A/B testing: one group of users is served by recommender system A and another group by recommender system B. The system that achieves the higher score on a chosen metric (for example, click-through rate) is selected as the better recommender, provided other factors such as latency and complexity are comparable.
As mentioned, A/B testing (or multivariate testing) is today's most prominent approach to online evaluation of recommender systems: a few different systems are integrated, users are divided into groups, and the systems are compared head to head. It is somewhat costly because it consumes development resources.
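As a sketch of a reasonable online evaluation environment (the bucketing rule and arm names are hypothetical): assign each user deterministically to one arm so repeat visits always hit the same recommender, serve arm A the plain SVD model and arm B the serendipity variant, and log a metric such as click-through rate on the recommended jokes for each arm over the test period.

# Hypothetical A/B assignment: a cheap deterministic hash of the user ID picks the arm,
# so the same user always sees the same recommender during the experiment
assign_arm <- function(user_id) {
  if (sum(utf8ToInt(user_id)) %% 2 == 0) "A: SVD" else "B: SVD + serendipity"
}
sapply(head(rownames(Jester5k), 3), assign_arm)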
Hijikata, Y. Offline Evaluation for Recommender Systems. http://soc-research.org/wp-content/uploads/2014/11/OfflineTest4RS.pdf
Gebrekirstos, G., et al. Recommender Systems Evaluations: Offline, Online, Time and A/A Test. http://ceur-ws.org/Vol-1609/16090642.pdf