The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.
Deliverables
1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
2. Implement support for at least one business or user experience goal, such as increased serendipity, novelty, or diversity.
3. Compare and report on any change in accuracy before and after making the change in #2.
4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.
For this project, I'll be using the Jester5k dataset available in the recommenderlab package.
# Load recommenderlab (Jester5k data and algorithms), knitr (kable) and kableExtra (kable_styling, %>%)
library(recommenderlab); library(knitr); library(kableExtra)
data(Jester5k)
df <- as(Jester5k, "data.frame")   # long format: user, item, rating
kable(head(df, 10)) %>% kable_styling("striped", full_width = F)
| | user | item | rating |
|---|---|---|---|
| 1 | u2841 | j1 | 7.91 |
| 3315 | u2841 | j2 | 9.17 |
| 6963 | u2841 | j3 | 5.34 |
| 10301 | u2841 | j4 | 8.16 |
| 13443 | u2841 | j5 | -8.74 |
| 18441 | u2841 | j6 | 7.14 |
| 22514 | u2841 | j7 | 8.88 |
| 27513 | u2841 | j8 | -8.25 |
| 32513 | u2841 | j9 | 5.87 |
| 35686 | u2841 | j10 | 6.21 |
# Summary of ratings per user
summary(rowCounts(Jester5k))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 53.00 72.00 72.42 100.00 100.00
# Matrix size
dim(Jester5k)
## [1] 5000 100
# Number of ratings
nratings(Jester5k)
## [1] 362106
# 'best' joke with highest average rating
best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."
# 'worst' joke
worst <- which.min(colMeans(Jester5k))
cat(JesterJokes[worst])
## How many teddybears does it take to change a lightbulb? It takes only one teddybear, but it takes a whole lot of lightbulbs.
set.seed(100)
eval_sets <- evaluationScheme(data = Jester5k, method = "split", train = 0.8, given = 30, goodRating = 1)
# Evaluation datasets: 80% of users for training; for each test user, 30 ratings
# are 'known' to the recommender and the remainder are held out as 'unknown'
eval_train   <- getData(eval_sets, "train")
eval_known   <- getData(eval_sets, "known")
eval_unknown <- getData(eval_sets, "unknown")
Let us build several recommender models and compare their accuracy.
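The code for this step isn't shown in the output; a minimal sketch that reproduces the same RMSE/MSE/MAE comparison, assuming each algorithm is trained on eval_train and scored with calcPredictionAccuracy() (the per-fold timings below suggest the original run used evaluate(), which yields equivalent error metrics; the object names models and errors are mine), is:

# Train each candidate algorithm, predict ratings for the test users' held-out
# items, and collect RMSE / MSE / MAE per algorithm
models <- c("SVD", "POPULAR", "UBCF", "IBCF", "RANDOM")
errors <- t(sapply(models, function(m) {
  rec  <- Recommender(eval_train, method = m)
  pred <- predict(rec, eval_known, type = "ratings")
  calcPredictionAccuracy(pred, eval_unknown)
}))
kable(round(errors[order(errors[, "RMSE"]), ], 3)) %>% kable_styling("striped", full_width = F)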
| | RMSE | MSE | MAE |
|---|---|---|---|
| SVD | 4.342 | 18.854 | 3.431 |
| POPULAR | 4.352 | 18.937 | 3.441 |
| UBCF | 4.396 | 19.329 | 3.469 |
| IBCF | 4.855 | 23.567 | 3.854 |
| RANDOM | 6.190 | 38.319 | 4.781 |
## IBCF run fold/sample [model time/prediction time]
## 1 [0.255sec/0.187sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.119sec/3.88sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.057sec/1.908sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.12sec/0.185sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.183sec]
Since the SVD recommender has the lowest error (RMSE, MSE, and MAE) of the recommenders tested, we'll use it as the base model going forward.
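A short sketch of how the chosen SVD model can be trained and used for top-N recommendations (the n = 5 preview and object names are mine):

# Train the final SVD recommender on the training users and preview top-5 jokes
svd_rec  <- Recommender(eval_train, method = "SVD")
svd_top5 <- predict(svd_rec, eval_known, type = "topNList", n = 5)
as(svd_top5, "list")[1:3]   # recommended joke IDs for the first three test users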
Let's see the effect of adding serendipity, with SVD as the base.
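The exact serendipity mechanism isn't shown in the results below (the timings indicate a 4-fold cross-validation run); a minimal sketch, assuming each base algorithm is blended with a small RANDOM component through recommenderlab's HybridRecommender (the 0.8/0.2 weights are purely illustrative), looks like this for the SVD-based variant:

# Blend the SVD model with a RANDOM recommender so some unexpected jokes surface
svd_part  <- Recommender(eval_train, method = "SVD")
rand_part <- Recommender(eval_train, method = "RANDOM")
svd2      <- HybridRecommender(svd_part, rand_part, weights = c(0.8, 0.2))
# Accuracy of the serendipity-flavoured model on the held-out ratings
pred2 <- predict(svd2, eval_known, type = "ratings")
calcPredictionAccuracy(pred2, eval_unknown)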
| | RMSE | MSE | MAE |
|---|---|---|---|
| POPULAR2 | 4.452 | 19.824 | 3.514 |
| UBCF2 | 4.501 | 20.256 | 3.537 |
| SVD2 | 4.521 | 20.442 | 3.571 |
| IBCF2 | 5.007 | 25.070 | 3.972 |
| RANDOM | 6.352 | 40.346 | 4.925 |
## UBCF run fold/sample [model time/prediction time]
## 1 [0.031sec/3.838sec]
## 2 [0.056sec/3.49sec]
## 3 [0.032sec/3.283sec]
## 4 [0.034sec/3.23sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.209sec/0.204sec]
## 2 [0.175sec/0.224sec]
## 3 [0.181sec/0.292sec]
## 4 [0.177sec/0.226sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.033sec/2.26sec]
## 2 [0.037sec/2.321sec]
## 3 [0.037sec/2.35sec]
## 4 [0.035sec/2.612sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.112sec/0.318sec]
## 2 [0.111sec/0.244sec]
## 3 [0.103sec/0.362sec]
## 4 [0.115sec/0.246sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.002sec/0.295sec]
## 2 [0.004sec/0.34sec]
## 3 [0.004sec/0.291sec]
## 4 [0.003sec/0.315sec]
With the serendipity component added, POPULAR has the highest prediction accuracy (lowest RMSE, MSE, and MAE) of the algorithms tested (UBCF, IBCF, SVD, POPULAR, and RANDOM). Compared with the original models, accuracy dropped slightly across the board (the best RMSE rose from 4.342 for SVD to 4.452 for POPULAR), which is the expected cost of favouring serendipity over pure accuracy.
Recommender systems can be evaluated offline or online. The purpose of recommender system evaluation is to select algorithms for use in a production setting.
Offline evaluation tests the effectiveness of recommender system algorithms on a fixed dataset. To measure accuracy, precision at position n (P@n) is often used, expressing how many of the ground-truth items are recommended within the top n recommendations. The purpose of offline evaluation is to select recommender systems for online deployment. Offline evaluations are easier to run and are reproducible.
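For reference, P@n can be computed offline in recommenderlab by evaluating top-N lists against the goodRating = 1 relevance threshold already set in the evaluation scheme; a brief sketch (the choice of n values is arbitrary):

# Precision and recall at n = 1, 5 and 10 recommendations for the SVD model
topn_results <- evaluate(eval_sets, method = "SVD", type = "topNList", n = c(1, 5, 10))
avg(topn_results)   # averaged confusion matrix, including precision and recall columns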
In the research paper 'Recommender Systems Evaluations: Offline, Online, Time and A/A Test', online evaluation is described as evaluating recommender systems through A/B testing: one group of users is served by recommender system A and another group by recommender system B. The system that achieves the higher score on a chosen metric (for example, click-through rate) is selected as the better recommender, provided other factors such as latency and complexity are comparable.
As mentioned, A/B testing (or multivariate testing) is today's most prominent approach to online evaluation of recommender systems: a few different systems are integrated, users are divided into groups, and the systems are compared head to head. It is somewhat costly because it consumes development resources.
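As a sketch of a reasonable online evaluation environment (the bucketing rule and arm names are hypothetical): assign each user deterministically to one arm so repeat visits always hit the same recommender, serve arm A the plain SVD model and arm B the serendipity variant, and log a metric such as click-through rate on the recommended jokes for each arm over the test period.

# Hypothetical A/B assignment: a cheap deterministic hash of the user ID picks the arm,
# so the same user always sees the same recommender during the experiment
assign_arm <- function(user_id) {
  if (sum(utf8ToInt(user_id)) %% 2 == 0) "A: SVD" else "B: SVD + serendipity"
}
sapply(head(rownames(Jester5k), 3), assign_arm)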
Hijikata, Y. Offline Evaluation for Recommender Systems. http://soc-research.org/wp-content/uploads/2014/11/OfflineTest4RS.pdf
Gebrekirstos, G., et al. Recommender Systems Evaluations: Offline, Online, Time and A/A Test. http://ceur-ws.org/Vol-1609/16090642.pdf