The purpose of the project is to compare the accuracy of at least two recommender systems and implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.
The data for the project is taken from the Jester Joke Recommender System and was downloaded from http://www.ieor.berkeley.edu/~goldberg/jester-data/. The dataset contains about 24,938 users and 100 jokes, rated by users on a scale from -10 to +10.
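The analysis below uses the smaller Jester5k sample that ships with the recommenderlab package. For reference, the full downloaded dataset could be loaded along the following lines; this is only a sketch, assuming the Excel file jester-data-1.xls from the site above, in which the first column holds the number of jokes rated by each user and the value 99 marks a missing rating.
library(readxl)
library(recommenderlab)
raw <- read_excel("jester-data-1.xls", col_names = FALSE)
ratings <- as.matrix(raw[, -1])                    # drop the "number of jokes rated" column
ratings[ratings == 99] <- NA                       # 99 encodes "not rated"
colnames(ratings) <- paste0("j", seq_len(ncol(ratings)))
jester_full <- as(ratings, "realRatingMatrix")     # convert to recommenderlab's rating matrix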
The “recommenderlab” package will be used as the core package for building the recommender algorithms.
# loading the smaller version of the Jester jokes data shipped with the recommenderlab package
library(recommenderlab)
data(Jester5k)
dim(Jester5k)
## [1] 5000 100
# looking at the ratings provided by users 1 and 100 for the first 5 jokes
Jester5k@data[1,1:5]
## j1 j2 j3 j4 j5
## 7.91 9.17 5.34 8.16 -8.74
Jester5k@data[100, 1:5]
## j1 j2 j3 j4 j5
## -2.48 3.93 2.72 -2.67 1.75
# checking total number of ratings given by the users
nratings(Jester5k)
## [1] 362106
# overall rating distribution
hist(getRatings(Jester5k), main = "Distribution Of Ratings", xlim=c(-10, 10), breaks="FD")
# finding the most/least popular jokes
ratings_binary<-binarize(Jester5k, minRating = 1)
ratings_binary
## 5000 x 100 rating matrix of class 'binaryRatingMatrix' with 192462 ratings.
ratings_sum<-colSums(ratings_binary)
ratings_sum_df<- data.frame(joke = names(ratings_sum), pratings = ratings_sum)
# the most popular jokes
head(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)
## joke pratings
## j50 j50 3862
## j36 j36 3758
## j32 j32 3632
## j27 j27 3601
## j53 j53 3601
## j29 j29 3576
## j35 j35 3576
## j62 j62 3521
## j49 j49 3473
## j69 j69 3396
# the least popular jokes
tail(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)
## joke pratings
## j90 j90 930
## j44 j44 925
## j73 j73 913
## j77 j77 880
## j86 j86 875
## j79 j79 813
## j75 j75 785
## j71 j71 709
## j58 j58 566
## j74 j74 549
# two normalization techniques are applied: centering and Z-score
data.norm.c<-normalize(Jester5k, method="center")
data.norm.z<-normalize(Jester5k, method="Z-score")
# plotting the rating distribution for raw, centered and Z-score normalized ratings
par(mfrow = c(3,1))
plot(density(getRatings(Jester5k)),main = 'Raw')
plot(density(getRatings(data.norm.c)),main = 'Normalized')
plot(density(getRatings(data.norm.z)),main = 'Z-score')
par(mfrow = c(1,1))
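As a quick sanity check (illustrative only), “center” normalization should simply subtract each user's mean rating from that user's raw ratings:
raw_u1  <- as(Jester5k[1, ], "matrix")[1, ]        # user 1's raw ratings (NA = unrated)
norm_u1 <- as(data.norm.c[1, ], "matrix")[1, ]     # user 1's centered ratings
all.equal(norm_u1["j1"], raw_u1["j1"] - mean(raw_u1, na.rm = TRUE))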
The following models will be built and evaluated: user-based collaborative filtering (UBCF), random and popular. The UBCF models will use either cosine or Pearson similarity, with and without normalization (centering and Z-score).
# creating the evaluation scheme (3-fold CV; ratings above 1 count as good; for each test user 5 ratings are given to the recommender and the rest are withheld for evaluation)
set.seed(123)
es<- evaluationScheme(Jester5k, method = "cross", train = 0.9, given = 5, goodRating = 1, k = 3)
# creating a list of models
models <- list(
"ubcf_cosine" = list(name = "UBCF", param = list(method = "cosine", normalize = NULL)),
"ubcf_pearson" = list(name = "UBCF", param = list(method = "pearson", normalize = NULL)),
"ubcf_cosine_center" = list(name = "UBCF", param = list(method = "cosine", normalize = "center")),
"ubcf_pearson_center" = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
"ubcf_cosine_z" = list(name = "UBCF", param = list(method = "cosine", normalize = "Z-score")),
"ubcf_pearson_z" = list(name = "UBCF", param = list(method = "pearson", normalize = "Z-score")),
"random" = list(name = "RANDOM"),
"popular" = list(name = "POPULAR")
)
# calculating RMSE, MSE, MAE of the models
results_1<- evaluate(es, models, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.008sec/3.735sec]
## 2 [0sec/3.274sec]
## 3 [0sec/3.28sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.001sec/4.01sec]
## 2 [0sec/3.976sec]
## 3 [0.001sec/3.821sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.024sec/3.286sec]
## 2 [0.024sec/3.258sec]
## 3 [0.025sec/3.146sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.024sec/3.607sec]
## 2 [0.023sec/3.823sec]
## 3 [0.025sec/3.769sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.162sec/3.22sec]
## 2 [0.188sec/3.162sec]
## 3 [0.183sec/3.105sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.161sec/3.796sec]
## 2 [0.181sec/3.755sec]
## 3 [0.16sec/4.031sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.002sec/0.132sec]
## 2 [0.002sec/0.114sec]
## 3 [0.002sec/0.133sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.029sec/0.031sec]
## 2 [0.029sec/0.045sec]
## 3 [0.028sec/0.029sec]
avg(results_1)
## $ubcf_cosine
## RMSE MSE MAE
## res 5.076941 25.77629 4.198034
##
## $ubcf_pearson
## RMSE MSE MAE
## res 5.105416 26.06602 4.238901
##
## $ubcf_cosine_center
## RMSE MSE MAE
## res 4.896828 23.97942 3.829172
##
## $ubcf_pearson_center
## RMSE MSE MAE
## res 4.889742 23.90996 3.824092
##
## $ubcf_cosine_z
## RMSE MSE MAE
## res 4.91189 24.12707 3.808591
##
## $ubcf_pearson_z
## RMSE MSE MAE
## res 4.903886 24.04842 3.802786
##
## $random
## RMSE MSE MAE
## res 6.467079 41.82386 4.966529
##
## $popular
## RMSE MSE MAE
## res 4.777126 22.82142 3.728968
The best performing model is the “popular” model, as it has the lowest average RMSE across the 3-fold cross-validation. The worst performing model is “random”, which has the largest RMSE.
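To rank the models programmatically rather than eyeballing the list above, the RMSE column can be pulled out of the averaged results (a small sketch using the objects already created):
acc <- avg(results_1)                          # list of 1 x 3 matrices (RMSE, MSE, MAE)
rmse <- sapply(acc, function(m) m[, "RMSE"])   # extract RMSE for each model
sort(rmse)                                     # lowest (best) first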
Let’s look at the confusion matrix and ROC curve of the models for 5, 10, 15 and 20 recommendations.
# creating a list of models
models <- list(
"ubcf_cosine" = list(name = "UBCF", param = list(method = "cosine", normalize = NULL)),
"ubcf_pearson" = list(name = "UBCF", param = list(method = "pearson", normalize = NULL)),
"ubcf_cosine_center" = list(name = "UBCF", param = list(method = "cosine", normalize = "center")),
"ubcf_pearson_center" = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
"ubcf_cosine_z" = list(name = "UBCF", param = list(method = "cosine", normalize = "Z-score")),
"ubcf_pearson_z" = list(name = "UBCF", param = list(method = "pearson", normalize = "Z-score")),
"random" = list(name = "RANDOM"),
"popular" = list(name = "POPULAR")
)
# calculating confusion matrix for 5, 10, 15 and 20 recommendations.
results_2<- evaluate(es, models, n=c(5, 10, 15, 20))
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/3.32sec]
## 2 [0sec/3.447sec]
## 3 [0.001sec/3.332sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.001sec/4.044sec]
## 2 [0.001sec/3.969sec]
## 3 [0sec/3.93sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.022sec/3.221sec]
## 2 [0.023sec/3.232sec]
## 3 [0.022sec/3.218sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.023sec/3.777sec]
## 2 [0.023sec/3.909sec]
## 3 [0.024sec/3.898sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.159sec/3.304sec]
## 2 [0.157sec/3.311sec]
## 3 [0.157sec/3.252sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.161sec/3.814sec]
## 2 [0.161sec/3.98sec]
## 3 [0.158sec/3.978sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.002sec/0.314sec]
## 2 [0.002sec/0.319sec]
## 3 [0.002sec/0.319sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.025sec/2.586sec]
## 2 [0.028sec/2.586sec]
## 3 [0.026sec/2.617sec]
# plotting ROC curve
plot(results_2, annotate = 1:4, main = "ROC curve")
# plotting precision-recall chart
plot(results_2, "prec/rec", annotate = 1:4, main = "Precision-Recall")
From 10 recommendations and up, the “popular” method outperforms the other methods and has the biggest AUC. The “random” model performed significantly worse than the others and has the smallest AUC.
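The AUC comparison can be made more concrete by approximating the area under each ROC curve from the four operating points (n = 5, 10, 15, 20) with the trapezoidal rule, anchoring each curve at (0, 0). With so few points this is only a rough comparison, not a true AUC (a sketch):
trap_auc <- function(cm) {
  fpr <- c(0, cm[, "FPR"])                     # false positive rate at each n
  tpr <- c(0, cm[, "TPR"])                     # true positive rate at each n
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
sort(sapply(avg(results_2), trap_auc), decreasing = TRUE)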
The recommendations that are most accurate according to the standard metrics are sometimes not the recommendations that are most useful to users. Some studies argue that one of the goals of a recommender system is to provide a user with personalized items, and more diverse recommendations give users more opportunities to be recommended such items from the “long-tail” area. Diverse recommendations are also important because they help to avoid popularity bias. Higher diversity, however, can come at the expense of accuracy. There is a trade-off between accuracy and diversity because high accuracy can often be obtained by safely recommending the most popular items, which clearly reduces diversity, i.e., produces less personalized recommendations. Technically, we can increase diversity simply by recommending less popular or random items, but the loss of accuracy in this case can be substantial and lead to bad recommendations.
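One simple way to put a number on diversity is catalogue coverage: the share of all jokes that appear in at least one user's top-N list. The sketch below illustrates this with a POPULAR model trained on the full Jester5k matrix purely for demonstration; a popularity-driven recommender is expected to cover only a small part of the catalogue, which is exactly the popularity bias described above.
rec_pop <- Recommender(Jester5k, method = "POPULAR")
top10 <- predict(rec_pop, Jester5k[1:1000], n = 10)    # top-10 lists for 1000 users
recommended_items <- unique(unlist(getList(top10)))    # distinct jokes that get recommended
length(recommended_items) / ncol(Jester5k)             # catalogue coverage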
In order to increase the recommender system's diversity without significantly compromising accuracy, a hybrid model will be built. The hybrid model combines a UBCF model with a RANDOM model: UBCF is assigned a weight of 0.9 (keeping it as the core model) and the RANDOM model a weight of 0.1 to increase the system's diversity.
# splitting the data into train and test sets
esf<- evaluationScheme(Jester5k, method = "split", train = 0.9, given = 5, goodRating = 1)
train <-getData(esf, "train")
test <-getData(esf, "unknown")
test_known <- getData(esf, "known")
# building the hybrid model: a user-based CF model with 10 nearest neighbours combined with a random model
param_f<- list(method = "Pearson", normalize = "center", nn=10)
recommendations <- HybridRecommender(
Recommender(train, method = "UBCF", param = param_f),
Recommender(train, method = "RANDOM"),
weights = c(.9, .1)
)
# calculating RMSE of the Hybrid Model
final_prediction<- predict(recommendations, test_known, type = "ratings")
acc_h<- calcPredictionAccuracy(final_prediction, test)
acc_h
## RMSE MSE MAE
## 4.991526 24.915331 3.898768
As we can see, the accuracy of the hybrid model is slightly worse than that of UBCF (centered, with the Pearson coefficient as the similarity measure), because adding the random model increases the RMSE. Since the weight of the random model is small, the RMSE increases only slightly.
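For reference, the corresponding pure UBCF model can be evaluated on the same split so the hybrid's RMSE has a direct baseline (a sketch; the numbers will differ slightly from the 3-fold cross-validation results above because a different split is used):
rec_ubcf <- Recommender(train, method = "UBCF", param = param_f)
pred_ubcf <- predict(rec_ubcf, test_known, type = "ratings")
calcPredictionAccuracy(pred_ubcf, test)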
# getting recommendations (top 10)
final_prediction <- predict(recommendations, test, n = 10, type = "topNList")
# top 10 jokes for the first user
final_prediction@items[1]
## $u2841
## [1] 48 60 98 43 78 89 29 90 81 76
# predicted ratings of the top 10 jokes for the first user
final_prediction@ratings[1]
## $u2841
## [1] 5.654237 4.967253 4.740247 4.636713 4.524932 4.446355 4.144873
## [8] 4.046763 3.958949 3.930029
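To see what is actually being recommended, the item indices can be mapped back to the joke texts via the JesterJokes character vector that is loaded together with Jester5k (a sketch):
idx <- final_prediction@items[[1]]     # column indices of the first user's top-10 jokes
substr(JesterJokes[idx[1:3]], 1, 60)   # first 60 characters of the top three joke texts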
In an offline setting, recommending the most popular items is the best strategy in terms of accuracy, while in a live environment this strategy is most likely the poorest. Several studies show that offline and online evaluations often contradict each other; this is known as the “surrogate problem”. It means that we cannot declare victory by assessing a recommender system only with RMSE, diversity and novelty until we measure its impact on real users. An obvious reason for the lack of predictive power of offline evaluation is that it ignores human factors, which may strongly influence whether users are satisfied with recommendations regardless of the recommendations' relevance.
Several metrics are used for online recommender system evaluation: responsiveness, churn, A/B tests and explicit perceived-quality tests.
Responsiveness is how quickly new user behaviour influences the recommendations. A system with instantaneous responsiveness is complex, difficult to maintain and expensive to build, so a balance between responsiveness and simplicity is required.
Churn measures how often users' recommendations change, i.e., how sensitive the recommender system is to new user behaviour. For example, if a user rates a new item, does that substantially change his or her recommendations? If yes, the churn score is high.
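Churn itself can only be measured on a live system, but the idea can be sketched offline (purely illustrative, reusing the train set and UBCF parameters defined above): compare a user's top-10 list before and after one additional rating and measure how much the list changes.
rec_u <- Recommender(train, method = "UBCF", param = param_f)
u <- as(Jester5k[1, ], "matrix")               # user 1 as a plain matrix (NA = unrated)
before <- getList(predict(rec_u, as(u, "realRatingMatrix"), n = 10))[[1]]
u[1, "j100"] <- 9                              # suppose the user now rates joke 100 highly
after <- getList(predict(rec_u, as(u, "realRatingMatrix"), n = 10))[[1]]
1 - length(intersect(before, after)) / 10      # fraction of the top-10 list that changed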
Explicit perceived-quality tests explicitly ask users to rate the recommender system itself. In real life, users will probably be confused about whether they are assessing the items or the recommendations, the resulting data is hard to interpret, and the test requires extra work from the customer without a clear pay-off for them.
One of the most widely used and effective online evaluation methods is A/B testing. Users are randomly split into groups and each group is offered a slightly different experience. This way different models can be tested and the best model selected using metrics such as actual purchases, number of views, or other indicators of interest in the recommendations being presented.
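A minimal sketch of how the outcome of such an A/B test could be analysed (the counts below are made up purely for illustration): compare the click-through rates of the control recommender and the new one with a two-proportion test.
clicks <- c(A = 410, B = 465)      # hypothetical users who clicked a recommendation
shown  <- c(A = 5000, B = 5000)    # hypothetical users shown recommendations in each group
prop.test(clicks, shown)           # tests whether the difference in click-through rate is significant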
In general, online evaluation methods help to avoid complexity that adds no value to the recommender system and, unlike offline evaluation, they assess real user behaviour as the ultimate test of how well the recommender system works.