Introduction

The purpose of this project is to compare the accuracy of at least two recommender systems and to implement support for at least one business or user-experience goal, such as increased serendipity, novelty, or diversity.

The data for the project comes from the Jester Joke Recommender System and was downloaded from http://www.ieor.berkeley.edu/~goldberg/jester-data/. The dataset contains about 24,938 users and 100 jokes, rated on a scale from -10 to 10.

The “recommenderlab” package will be used as the core package for building the recommender algorithms.

#  loading the smaller version of the Jester jokes data that ships with the recommenderlab package
library(recommenderlab)
data(Jester5k)
dim(Jester5k)
## [1] 5000  100
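
As a quick orientation (output omitted here), the recommendation algorithms that recommenderlab registers for real-valued rating matrices can be listed from its registry; the UBCF, RANDOM and POPULAR methods used below all come from this list.

#  listing the algorithms recommenderlab provides for realRatingMatrix data
names(recommenderRegistry$get_entries(dataType = "realRatingMatrix"))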

Data Exploration

#  looking at the ratings provided by users 1 and 100 for the first 5 jokes
Jester5k@data[1,1:5]
##    j1    j2    j3    j4    j5 
##  7.91  9.17  5.34  8.16 -8.74
Jester5k@data[100, 1:5]
##    j1    j2    j3    j4    j5 
## -2.48  3.93  2.72 -2.67  1.75
#  checking the total number of ratings provided by all users
nratings(Jester5k)
## [1] 362106
#  overall rating distribution
hist(getRatings(Jester5k), main = "Distribution Of Ratings", xlim=c(-10, 10), breaks="FD")

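Before moving on, it is worth checking how many jokes each user actually rated, since the evaluation scheme used later withholds part of each test user's ratings. This is a small complementary sketch using rowCounts() from recommenderlab (output not shown).

#  distribution of the number of jokes rated per user
hist(rowCounts(Jester5k), main = "Jokes Rated Per User", xlab = "number of rated jokes")
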
#  finding the most/least popular jokes
ratings_binary<-binarize(Jester5k, minRating = 1)
ratings_binary
## 5000 x 100 rating matrix of class 'binaryRatingMatrix' with 192462 ratings.
ratings_sum<-colSums(ratings_binary)
ratings_sum_df<- data.frame(joke = names(ratings_sum), pratings = ratings_sum)

# the most popular jokes
head(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)
##     joke pratings
## j50  j50     3862
## j36  j36     3758
## j32  j32     3632
## j27  j27     3601
## j53  j53     3601
## j29  j29     3576
## j35  j35     3576
## j62  j62     3521
## j49  j49     3473
## j69  j69     3396
# the least popular jokes
tail(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)
##     joke pratings
## j90  j90      930
## j44  j44      925
## j73  j73      913
## j77  j77      880
## j86  j86      875
## j79  j79      813
## j75  j75      785
## j71  j71      709
## j58  j58      566
## j74  j74      549
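
The binarized counts above measure popularity as the number of positive ratings per joke. A complementary view (a minimal sketch; output not shown) is the average raw rating per joke, which reflects how well a joke is rated rather than how often it is rated positively.

#  average raw rating per joke
avg_joke_rating <- colMeans(Jester5k)
head(sort(avg_joke_rating, decreasing = TRUE), 10)
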
# two normalization techniques will be applied: centering and Z-score
data.norm.c<-normalize(Jester5k, method="center")
data.norm.z<-normalize(Jester5k, method="Z-score")

#  plotting rating distributions for the raw, centered and Z-score normalized ratings
par(mfrow = c(3,1))
plot(density(getRatings(Jester5k)),main = 'Raw')
plot(density(getRatings(data.norm.c)),main = 'Normalized')
plot(density(getRatings(data.norm.z)),main = 'Z-score')

par(mfrow = c(1,1))
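
As a quick sanity check (a minimal sketch, using the objects created above), centering should be equivalent to subtracting each user's mean rating from that user's raw ratings; Z-score normalization additionally divides by the user's standard deviation.

#  verifying that centering subtracts the per-user mean
u1_raw      <- as(Jester5k[1, ], "matrix")[1, ]
u1_centered <- as(data.norm.c[1, ], "matrix")[1, ]
all.equal(u1_centered, u1_raw - mean(u1_raw, na.rm = TRUE))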

Accuracy Assessment

The following models will be built and evaluated: user-based collaborative filtering (UBCF), random, and popular. The UBCF models will use either cosine or Pearson similarity, with and without normalization (centering and Z-score).
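
For reference, the error metrics used to compare the models are computed over the (predicted, withheld) rating pairs. The sketch below gives minimal reference implementations; calcPredictionAccuracy() in recommenderlab reports the same quantities.

#  reference definitions of the error metrics (computed over predicted vs. withheld ratings)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2, na.rm = TRUE))
mse  <- function(actual, predicted) mean((actual - predicted)^2, na.rm = TRUE)
mae  <- function(actual, predicted) mean(abs(actual - predicted), na.rm = TRUE)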

#  creating the evaluation scheme: 3-fold cross-validation; ratings of 1 and above are considered good; 5 ratings per test user are given to the recommender and the remaining ratings are withheld for error computation

set.seed(123)
es<- evaluationScheme(Jester5k, method = "cross", train = 0.9, given = 5, goodRating = 1, k = 3)

#  creating a list of models
models <- list(
  "ubcf_cosine" = list(name = "UBCF", param = list(method = "cosine", normalize = NULL)),
  "ubcf_pearson" = list(name = "UBCF", param = list(method = "pearson", normalize = NULL)),
  "ubcf_cosine_center" = list(name = "UBCF", param = list(method = "cosine", normalize = "center")),
  "ubcf_pearson_center" = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
  "ubcf_cosine_z" = list(name = "UBCF", param = list(method = "cosine", normalize = "Z-score")),
  "ubcf_pearson_z" = list(name = "UBCF", param = list(method = "pearson", normalize = "Z-score")),
  "random" = list(name = "RANDOM"),
  "popular" = list(name = "POPULAR")
)

#  calculating RMSE, MSE, MAE of the models
results_1<- evaluate(es, models, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
##   1  [0.008sec/3.735sec] 
##   2  [0sec/3.274sec] 
##   3  [0sec/3.28sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.001sec/4.01sec] 
##   2  [0sec/3.976sec] 
##   3  [0.001sec/3.821sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.024sec/3.286sec] 
##   2  [0.024sec/3.258sec] 
##   3  [0.025sec/3.146sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.024sec/3.607sec] 
##   2  [0.023sec/3.823sec] 
##   3  [0.025sec/3.769sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.162sec/3.22sec] 
##   2  [0.188sec/3.162sec] 
##   3  [0.183sec/3.105sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.161sec/3.796sec] 
##   2  [0.181sec/3.755sec] 
##   3  [0.16sec/4.031sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.002sec/0.132sec] 
##   2  [0.002sec/0.114sec] 
##   3  [0.002sec/0.133sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.029sec/0.031sec] 
##   2  [0.029sec/0.045sec] 
##   3  [0.028sec/0.029sec]
avg(results_1)
## $ubcf_cosine
##         RMSE      MSE      MAE
## res 5.076941 25.77629 4.198034
## 
## $ubcf_pearson
##         RMSE      MSE      MAE
## res 5.105416 26.06602 4.238901
## 
## $ubcf_cosine_center
##         RMSE      MSE      MAE
## res 4.896828 23.97942 3.829172
## 
## $ubcf_pearson_center
##         RMSE      MSE      MAE
## res 4.889742 23.90996 3.824092
## 
## $ubcf_cosine_z
##        RMSE      MSE      MAE
## res 4.91189 24.12707 3.808591
## 
## $ubcf_pearson_z
##         RMSE      MSE      MAE
## res 4.903886 24.04842 3.802786
## 
## $random
##         RMSE      MSE      MAE
## res 6.467079 41.82386 4.966529
## 
## $popular
##         RMSE      MSE      MAE
## res 4.777126 22.82142 3.728968

The best performing model is the “popular” model, as it has the lowest average RMSE across the 3 cross-validation folds. The worst performing model is “random”, which has the largest RMSE.
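
The averaged numbers above can be reproduced for a single fold by running the train/predict/evaluate cycle manually; a minimal sketch (output not shown), using getData() to pull the train, known and unknown parts of the first fold:

#  reproducing the error metrics for one fold of one model
rec_ubcf  <- Recommender(getData(es, "train"), method = "UBCF",
                         param = list(method = "pearson", normalize = "center"))
pred_ubcf <- predict(rec_ubcf, getData(es, "known"), type = "ratings")
calcPredictionAccuracy(pred_ubcf, getData(es, "unknown"))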

Let’s look at the confusion matrix and ROC curve of the models for 5, 10, 15 and 20 recommendations.

#  creating a list of models
models <- list(
  "ubcf_cosine" = list(name = "UBCF", param = list(method = "cosine", normalize = NULL)),
  "ubcf_pearson" = list(name = "UBCF", param = list(method = "pearson", normalize = NULL)),
  "ubcf_cosine_center" = list(name = "UBCF", param = list(method = "cosine", normalize = "center")),
  "ubcf_pearson_center" = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
  "ubcf_cosine_z" = list(name = "UBCF", param = list(method = "cosine", normalize = "Z-score")),
  "ubcf_pearson_z" = list(name = "UBCF", param = list(method = "pearson", normalize = "Z-score")),
  "random" = list(name = "RANDOM"),
  "popular" = list(name = "POPULAR")
)

#  calculating confusion matrix for 5, 10, 15 and 20 recommendations.
results_2<- evaluate(es, models, n=c(5, 10, 15, 20))
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/3.32sec] 
##   2  [0sec/3.447sec] 
##   3  [0.001sec/3.332sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.001sec/4.044sec] 
##   2  [0.001sec/3.969sec] 
##   3  [0sec/3.93sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.022sec/3.221sec] 
##   2  [0.023sec/3.232sec] 
##   3  [0.022sec/3.218sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.023sec/3.777sec] 
##   2  [0.023sec/3.909sec] 
##   3  [0.024sec/3.898sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.159sec/3.304sec] 
##   2  [0.157sec/3.311sec] 
##   3  [0.157sec/3.252sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.161sec/3.814sec] 
##   2  [0.161sec/3.98sec] 
##   3  [0.158sec/3.978sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.002sec/0.314sec] 
##   2  [0.002sec/0.319sec] 
##   3  [0.002sec/0.319sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.025sec/2.586sec] 
##   2  [0.028sec/2.586sec] 
##   3  [0.026sec/2.617sec]
#  plotting ROC curve
plot(results_2, annotate = 1:4, main = "ROC curve")

# plotting the precision-recall chart
plot(results_2, "prec/rec", annotate = 1:4, main = "Precision-Recall")

From 10 recommendations and up, the “popular” method outperforms the other methods and has the biggest AUC. The “random” model performed significantly worse than the others and has the smallest AUC.
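
The numbers behind the curves can be inspected directly; a minimal sketch (output not shown) that averages the per-fold confusion matrices of one model, giving TP/FP/FN/TN counts along with precision, recall, TPR and FPR for each list length:

#  averaged confusion matrix entries for the "popular" model
avg(results_2[["popular"]])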

Increasing Diversity

The recommendations that are most accurate according to the standard metrics are sometimes not the recommendations that are most useful to users. Some studies argue that one of the goals of a recommender system is to provide a user with personalized items, and more diverse recommendations create more opportunities for users to receive such items and to exploit the “long-tail” of the catalog. Diverse recommendations are also important because they help to avoid popularity bias. Higher diversity, however, can come at the expense of accuracy: high accuracy can often be obtained by safely recommending the most popular items, which clearly reduces diversity, i.e., leads to less personalized recommendations. Technically, we could increase diversity simply by recommending less popular or random items, but the loss of accuracy in that case can be substantial and lead to bad recommendations.
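
One common way to quantify diversity is intra-list diversity: one minus the average pairwise similarity of the items in a recommendation list. The helper below is a hypothetical sketch (the function name and the choice of Jaccard similarity on the binarized ratings are assumptions, not part of the original analysis).

#  item-item similarities on the binarized ratings
item_sim <- as.matrix(similarity(ratings_binary, method = "jaccard", which = "items"))

#  intra-list diversity: 1 minus the mean pairwise similarity of the recommended items
#  (items is a character vector of joke ids, e.g. c("j50", "j36", "j32"))
intra_list_diversity <- function(items, sim) {
  s <- sim[items, items]
  1 - mean(s[upper.tri(s)])
}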

In order to increase the diversity of the recommender system without significantly compromising accuracy, a hybrid model will be built. The hybrid model combines a UBCF model with a RANDOM model: UBCF is assigned a weight of 0.9 (keeping it as the core model) and the random model is assigned a weight of 0.1 to increase the system’s diversity.

# splitting the data into train and test sets
esf<- evaluationScheme(Jester5k, method = "split", train = 0.9, given = 5, goodRating = 1)
train <-getData(esf, "train")
test <-getData(esf, "unknown")
test_known <- getData(esf, "known")
#  building the hybrid model: a user-based recommender with 10 nearest neighbours combined with a random model
param_f<- list(method = "Pearson", normalize = "center", nn=10)
recommendations <- HybridRecommender(
Recommender(train, method = "UBCF", param = param_f),
Recommender(train, method = "RANDOM"),
weights = c(.9, .1)
)
#  calculating RMSE of the Hybrid Model
final_prediction<- predict(recommendations, test_known, type = "ratings")
acc_h<- calcPredictionAccuracy(final_prediction, test)
acc_h
##      RMSE       MSE       MAE 
##  4.991526 24.915331  3.898768

As we can see, the accuracy of the hybrid model is slightly worse than the accuracy of UBCF (centered, with the Pearson coefficient as the similarity measure), because adding the random model increased the RMSE. Since the weight of the random model is small, the RMSE increased only slightly.

# getting recommendations (top 10)
final_prediction<- predict(recommendations, test, n = 10, type = "topNList")

# top 10 jokes for the first user
final_prediction@items[1]
## $u2841
##  [1] 48 60 98 43 78 89 29 90 81 76
# ratings of the top 10 jokes for the first user
final_prediction@ratings[1]
## $u2841
##  [1] 5.654237 4.967253 4.740247 4.636713 4.524932 4.446355 4.144873
##  [8] 4.046763 3.958949 3.930029
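
A simple way to check whether the random component actually broadens the recommendations is catalog coverage: how many distinct jokes ever appear in the users' top-10 lists. The sketch below (object names are new here, output not shown) compares a pure UBCF model with the hybrid, both predicting from the same known ratings; a higher count for the hybrid indicates increased aggregate diversity.

#  pure UBCF model on the same training data, for comparison
ubcf_only <- Recommender(train, method = "UBCF", param = param_f)

top10_ubcf   <- predict(ubcf_only, test_known, n = 10, type = "topNList")
top10_hybrid <- predict(recommendations, test_known, n = 10, type = "topNList")

#  catalog coverage: number of distinct jokes (out of 100) appearing in any top-10 list
length(unique(unlist(getList(top10_ubcf))))
length(unique(unlist(getList(top10_hybrid))))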

Online recommender system evaluation

In an offline setting, recommending the most popular items is the best strategy in terms of accuracy, while in a live environment this strategy is most likely the poorest. Several studies show that offline and online evaluations often contradict each other; this is known as the “surrogate problem”. It means that we cannot declare victory after assessing a recommender system only with RMSE, diversity and novelty; we also have to measure its impact on real users. The main reason offline evaluation lacks predictive power is that it ignores human factors, which may strongly influence whether users are satisfied with recommendations, regardless of the recommendations’ relevance.

Several methods and metrics are used for online recommender system evaluation: responsiveness, churn, A/B testing, and explicit perceived-quality tests.

Responsiveness is how quickly new user behaviour influences the recommendations. A system with instantaneous responsiveness is complex, difficult to maintain, and expensive to build, so a balance between responsiveness and simplicity is required.

Churn measures how often users’ recommendations change, i.e., how sensitive the recommender system is to new user behaviour. For example, if a user rates a new item, does that substantially change his or her recommendations? If yes, the churn score will be high.
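
A simple way to score churn for a single user is the complement of the Jaccard overlap between the top-N lists produced before and after the new behaviour: 0 means the list did not change at all, 1 means it changed completely. This is a hypothetical helper for illustration, not part of the original analysis.

# churn for one user: list_before / list_after are character vectors of recommended item ids
churn <- function(list_before, list_after) {
  1 - length(intersect(list_before, list_after)) / length(union(list_before, list_after))
}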

Explicit perceived-quality test. This test explicitly asks users to rate the recommender system. In real life users may be confused about whether they are assessing the items or the recommendations, the resulting data is hard to interpret, and the test requires extra work from customers without a clear pay-off for them.

One of the most widely used and effective online evaluation methods is A/B testing. Users are randomly split into groups, and each group is offered a slightly different experience. This way different models can be tested, and the best model can be selected using metrics such as actual purchases, number of views, or other signals of interest in the recommendations being presented.
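
For example, if the metric of interest is click-through on the recommendations, a two-sample proportion test can tell whether the difference between the two groups is statistically significant. The counts below are purely hypothetical, for illustration only.

# hypothetical click-through counts: group A sees the current model, group B sees the hybrid
clicks <- c(A = 320, B = 358)
users  <- c(A = 5000, B = 5000)
prop.test(clicks, users)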

In general, online evaluation methods help to avoid complexity that adds no value to the recommender system, and unlike offline evaluation they use actual user behaviour as the ultimate test of how well the recommender works.