Load packages
data(MovieLense)
#distribution of ratings
vector_ratings <- as.vector(MovieLense@data)
(rating_frq <- table(vector_ratings))
## vector_ratings
## 0 1 2 3 4 5
## 1469760 6059 11307 27002 33947 21077
qplot(vector_ratings) + ggtitle("Distribution of the ratings") + labs(x = "Score")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are a lot of missing values(0). The most common rating is 4.
#selece only the users who have rated at least 50 movies and movies that had been rated more than 100 times
(ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
colCounts(MovieLense) > 100])
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
#convert the data to realRatingMatrix
ml <- as(MovieLense@data, "realRatingMatrix")
#distribution of user rating means
user_summary <- as.data.frame(cbind('mean'=rowMeans(ratings_movies),'number'=rowCounts(ml)))
## Warning in cbind(mean = rowMeans(ratings_movies), number = rowCounts(ml)):
## number of rows of result is not a multiple of vector length (arg 1)
user_summary <-as.data.frame(sapply(user_summary, function(x) as.numeric(as.character(x))))
#distribution of movie rating means
item_summary <- as.data.frame(cbind('mean'=colMeans(ml), 'number'=colCounts(ratings_movies)))
## Warning in cbind(mean = colMeans(ml), number = colCounts(ratings_movies)):
## number of rows of result is not a multiple of vector length (arg 2)
item_summary <-as.data.frame(sapply(item_summary, function(x) as.numeric(as.character(x))))
par(mfrow=c(1,2))
#distribution 1
ggplot(user_summary,aes(mean)) +
geom_histogram(binwidth = 0.05,col='white',fill="orchid3") + labs(x = "Average User Rating", y = "Count of Movies", title = "Distribution 1") + geom_vline(xintercept = mean(user_summary$mean),col='grey',size=2)
#distribution 2
ggplot(item_summary,aes(mean)) +
geom_histogram(binwidth = 0.05,col='white',fill="seagreen3") + labs(x = "Average Movie Rating", y = "Count of Movies", title = "Distribution 2") + geom_vline(xintercept = mean(item_summary$mean),col='grey',size=2)
#distribution 3
ggplot(user_summary,aes(number)) +
geom_histogram(binwidth = 0.8,fill="violetred2") + labs(x = "Count of Rated Items", y = "Count of Users", title = "Distribution 3")
#distribution 4
ggplot(item_summary,aes(number)) +
geom_histogram(binwidth = 0.8,fill="turquoise2") + labs(x = "Count of Ratings per Movie", y = "Count of Movies", title = "Distribution 4")
From the figures, we can see:
1) Distribution 1 contains the rating means of users and distribution 2 contains the rating means of movies. We can see that the distributions are normal with a grey line indicating the mean.
2) Distribution 3 contains the number of movies rated by each user. Distribution 4 contains the number of times each movie was rated. We can see that both distributions are right skewed, suggesting that many movies have been rated by only a few users.
The movies which have been rated less than 100 times will be removed for our analysis going forward.
image(ratings_movies, main = "Heatmap of the rating matrix")
# select the top users and movies
top_movies <- quantile(rowCounts(ratings_movies), 0.98)
top_users <- quantile(colCounts(ratings_movies), 0.98)
#heatmap of the top users and movies
image(ratings_movies[rowCounts(ratings_movies) > top_movies,
colCounts(ratings_movies) > top_users], main = "Heatmap of the top
users and movies")
ratings_movies_nor <- normalize(ratings_movies)
getRatingMatrix(ratings_movies_nor)[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## Toy Story (1995) GoldenEye (1995) Get Shorty (1995)
## 1 1.17341040 -0.8265896 -0.8265896
## 2 0.07317073 . .
## 3 . . .
## 5 0.71739130 -0.2826087 .
## 6 0.38194444 . .
## Twelve Monkeys (1995) Babe (1995)
## 1 0.1734104 -2.8265896
## 2 . .
## 3 . .
## 5 . .
## 6 -1.6180556 0.3819444
(mean(getRatingMatrix(ratings_movies),na.rm=TRUE))
## [1] 1.102705
ratings_movies_bi <- binarize(ratings_movies, minRating = 1)
getRatingMatrix(ratings_movies_bi)
## itemMatrix in sparse format with
## 560 rows (elements/transactions) and
## 332 columns (items)
# use the minimum number of items purchased by any user tp decide item number to keep
(min(rowCounts(ratings_movies)))
## [1] 18
n_fold <- 4
items_to_keep <- 15
rating_threshold <- 3
# Use k-fold to validate models
eval_sets <- evaluationScheme(data = ratings_movies, method = "cross-validation",k = n_fold, given = items_to_keep, goodRating = rating_threshold)
models <- list(
RANDOM = list(name = "RANDOM", param = NULL),
POPULAR = list(name = "POPULAR", param = NULL),
IBCF=list(name="IBCF",param=list(method = "cosine")),
UBCF=list(name="UBCF", param=list(method = "cosine")),
SVD=list(name="SVD", param=list(k =100))
)
# varying the number of items we want to recommend to users
n_rec <- c(1, 5, seq(10, 100, 10))
# evaluating the recommendations
results <- evaluate(x = eval_sets, method = models, n= n_rec)
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.02sec/0.06sec]
## 2 [0sec/0.04sec]
## 3 [0sec/0.07sec]
## 4 [0sec/0.06sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.01sec/0.17sec]
## 2 [0sec/0.18sec]
## 3 [0sec/0.2sec]
## 4 [0sec/0.19sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.25sec/0.06sec]
## 2 [0.23sec/0.05sec]
## 3 [0.22sec/0.06sec]
## 4 [0.23sec/0.05sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.2sec]
## 2 [0sec/0.19sec]
## 3 [0.02sec/0.19sec]
## 4 [0sec/0.21sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.28sec/0.05sec]
## 2 [0.31sec/0.05sec]
## 3 [0.28sec/0.05sec]
## 4 [0.29sec/0.05sec]
# extract the related average confusion matrices
(avg_matrices <- lapply(results, avg))
## $RANDOM
## TP FP FN TN precision recall
## 1 0.3285714 0.6714286 72.70536 243.2946 0.3285714 0.004751494
## 5 1.3910714 3.6089286 71.64286 240.3571 0.2782143 0.018800523
## 10 2.6339286 7.3660714 70.40000 236.6000 0.2633929 0.034771013
## 20 5.0107143 14.9892857 68.02321 228.9768 0.2505357 0.067270668
## 30 7.2678571 22.7321429 65.76607 221.2339 0.2422619 0.099457029
## 40 9.4553571 30.5446429 63.57857 213.4214 0.2363839 0.129299255
## 50 11.6428571 38.3571429 61.39107 205.6089 0.2328571 0.159635206
## 60 13.8642857 46.1357143 59.16964 197.8304 0.2310714 0.189704063
## 70 16.1660714 53.8339286 56.86786 190.1321 0.2309439 0.222247748
## 80 18.4089286 61.5910714 54.62500 182.3750 0.2301116 0.252360573
## 90 20.8160714 69.1839286 52.21786 174.7821 0.2312897 0.285477541
## 100 23.0553571 76.9446429 49.97857 167.0214 0.2305536 0.315768747
## TPR FPR
## 1 0.004751494 0.002705416
## 5 0.018800523 0.014611863
## 10 0.034771013 0.029935347
## 20 0.067270668 0.061085528
## 30 0.099457029 0.092975963
## 40 0.129299255 0.125099069
## 50 0.159635206 0.157232959
## 60 0.189704063 0.189164756
## 70 0.222247748 0.220800022
## 80 0.252360573 0.252616528
## 90 0.285477541 0.283744918
## 100 0.315768747 0.315513966
##
## $POPULAR
## TP FP FN TN precision recall
## 1 0.7196429 0.2803571 72.31429 243.6857 0.7196429 0.01197822
## 5 2.7767857 2.2232143 70.25714 241.7429 0.5553571 0.04195815
## 10 5.0928571 4.9071429 67.94107 239.0589 0.5092857 0.07457286
## 20 9.1732143 10.8267857 63.86071 233.1393 0.4586607 0.13214219
## 30 12.0910714 17.9089286 60.94286 226.0571 0.4030357 0.17255224
## 40 15.0964286 24.9035714 57.93750 219.0625 0.3774107 0.21423751
## 50 18.0196429 31.9803571 55.01429 211.9857 0.3603929 0.25188087
## 60 21.0285714 38.9714286 52.00536 204.9946 0.3504762 0.29300278
## 70 23.5625000 46.4375000 49.47143 197.5286 0.3366071 0.32675936
## 80 26.0839286 53.9160714 46.95000 190.0500 0.3260491 0.36092015
## 90 28.5267857 61.4732143 44.50714 182.4929 0.3169643 0.39424370
## 100 31.1071429 68.8928571 41.92679 175.0732 0.3110714 0.42835442
## TPR FPR
## 1 0.01197822 0.001038530
## 5 0.04195815 0.008491175
## 10 0.07457286 0.018876969
## 20 0.13214219 0.042216412
## 30 0.17255224 0.070867267
## 40 0.21423751 0.099180570
## 50 0.25188087 0.127731178
## 60 0.29300278 0.155893341
## 70 0.32675936 0.186294041
## 80 0.36092015 0.216771486
## 90 0.39424370 0.247690674
## 100 0.42835442 0.277831869
##
## $IBCF
## TP FP FN TN precision recall
## 1 0.3428571 0.6571429 72.69107 243.3089 0.3428571 0.004784149
## 5 1.4589286 3.5410714 71.57500 240.4250 0.2917857 0.019239660
## 10 2.8267857 7.1732143 70.20714 236.7929 0.2826786 0.038530232
## 20 5.4267857 14.5732143 67.60714 229.3929 0.2713393 0.073704328
## 30 7.9517857 22.0482143 65.08214 221.9179 0.2650595 0.110055096
## 40 10.4982143 29.5017857 62.53571 214.4643 0.2624554 0.146730724
## 50 13.0250000 36.9750000 60.00893 206.9911 0.2605000 0.182895654
## 60 15.5732143 44.4267857 57.46071 199.5393 0.2595536 0.221087871
## 70 18.1071429 51.8928571 54.92679 192.0732 0.2586735 0.256525401
## 80 20.5339286 59.4660714 52.50000 184.5000 0.2566741 0.291873604
## 90 23.0625000 66.9375000 49.97143 177.0286 0.2562500 0.328242675
## 100 25.6142857 74.3750000 47.41964 169.5911 0.2561668 0.366631295
## TPR FPR
## 1 0.004784149 0.002654013
## 5 0.019239660 0.014292539
## 10 0.038530232 0.029053575
## 20 0.073704328 0.059207686
## 30 0.110055096 0.089775002
## 40 0.146730724 0.120306965
## 50 0.182895654 0.150919378
## 60 0.221087871 0.181383035
## 70 0.256525401 0.211913728
## 80 0.291873604 0.242983796
## 90 0.328242675 0.273575665
## 100 0.366631295 0.304125911
##
## $UBCF
## TP FP FN TN precision recall
## 1 0.6714286 0.3285714 72.36250 243.6375 0.6714286 0.01192375
## 5 2.8696429 2.1303571 70.16429 241.8357 0.5739286 0.04747620
## 10 5.2035714 4.7964286 67.83036 239.1696 0.5203571 0.08332505
## 20 9.0892857 10.9107143 63.94464 233.0554 0.4544643 0.13942781
## 30 12.5857143 17.4142857 60.44821 226.5518 0.4195238 0.18952017
## 40 15.8232143 24.1767857 57.21071 219.7893 0.3955804 0.23425245
## 50 18.6178571 31.3821429 54.41607 212.5839 0.3723571 0.27270109
## 60 21.3892857 38.6107143 51.64464 205.3554 0.3564881 0.31059864
## 70 23.9803571 46.0196429 49.05357 197.9464 0.3425765 0.34651964
## 80 26.3571429 53.6428571 46.67679 190.3232 0.3294643 0.37714342
## 90 28.7464286 61.2535714 44.28750 182.7125 0.3194048 0.40802008
## 100 31.0160714 68.9839286 42.01786 174.9821 0.3101607 0.43747336
## TPR FPR
## 1 0.01192375 0.001256794
## 5 0.04747620 0.008251025
## 10 0.08332505 0.018689591
## 20 0.13942781 0.042947002
## 30 0.18952017 0.068995540
## 40 0.23425245 0.096211850
## 50 0.27270109 0.125431796
## 60 0.31059864 0.154775270
## 70 0.34651964 0.184848790
## 80 0.37714342 0.215934396
## 90 0.40802008 0.246919899
## 100 0.43747336 0.278566676
##
## $SVD
## TP FP FN TN precision recall
## 1 0.3839286 0.6160714 72.65000 243.3500 0.3839286 0.006832558
## 5 1.7178571 3.2821429 71.31607 240.6839 0.3435714 0.026314440
## 10 3.2803571 6.7196429 69.75357 237.2464 0.3280357 0.049360758
## 20 6.2142857 13.7857143 66.81964 230.1804 0.3107143 0.092362805
## 30 8.9053571 21.0946429 64.12857 222.8714 0.2968452 0.129903637
## 40 11.4928571 28.5071429 61.54107 215.4589 0.2873214 0.166361204
## 50 14.0250000 35.9750000 59.00893 207.9911 0.2805000 0.201502311
## 60 16.3946429 43.6053571 56.63929 200.3607 0.2732440 0.233678400
## 70 18.7678571 51.2321429 54.26607 192.7339 0.2681122 0.267187336
## 80 21.0750000 58.9250000 51.95893 185.0411 0.2634375 0.297843591
## 90 23.4053571 66.5946429 49.62857 177.3714 0.2600595 0.329361873
## 100 25.7375000 74.2625000 47.29643 169.7036 0.2573750 0.361183758
## TPR FPR
## 1 0.006832558 0.002488234
## 5 0.026314440 0.013262865
## 10 0.049360758 0.027167188
## 20 0.092362805 0.055937053
## 30 0.129903637 0.085732737
## 40 0.166361204 0.116040518
## 50 0.201502311 0.146505653
## 60 0.233678400 0.177633129
## 70 0.267187336 0.208826591
## 80 0.297843591 0.240256589
## 90 0.329361873 0.271656799
## 100 0.361183758 0.303008329
. True Positives (TP): These are recommended items that have been rated
. False Positives (FP): These are recommended items that haven’t been rated . False Negatives(FN): These are not recommended items that have been rated
. True Negatives (TN): These are not recommended items that haven’t been rated
. True Positive Rate (TPR): This is the percentage of purchased items that have been recommended. It’s the number of TP divided by the number of purchased items (TP + FN).
. False Positive Rate (FPR): This is the percentage of not purchased items that have been recommended. It’s the number of FP divided by the number of not purchased items (FP + TN).
recommender_random <- Recommender(data = getData(eval_sets, "train"),
method = "RANDOM",parameter = NULL)
recommender_popular <- Recommender(data = getData(eval_sets, "train"),
method = "POPULAR",parameter = NULL)
recommender_ibcf <- Recommender(data = getData(eval_sets, "train"),
method = "IBCF",parameter = list(method = "cosine"))
recommender_ubcf <- Recommender(data = getData(eval_sets, "train"),
method = "UBCF",parameter = list(method = "cosine"))
recommender_svd <- Recommender(data = getData(eval_sets, "train"),
method = "SVD",parameter = list(k =100))
items_to_recommend <- 10
eval_prediction_random <- predict(object = recommender_random, newdata =
getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_popular <- predict(object = recommender_popular, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_ibcf <- predict(object = recommender_ibcf, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_ubcf <- predict(object = recommender_ubcf, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_svd <- predict(object = recommender_svd, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
# compare RMSE for different models
######################RANDOM######################
eval_accuracy_random <- calcPredictionAccuracy(
x = eval_prediction_random, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_random_user <- calcPredictionAccuracy(
x = eval_prediction_random, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_random_user)
## RMSE MSE MAE
## [1,] 1.451276 2.106202 1.1151512
## [2,] 1.337349 1.788503 1.0308103
## [3,] 1.093439 1.195610 0.7633556
## [4,] 1.098184 1.206007 0.8396755
## [5,] 1.561598 2.438587 1.2467081
## [6,] 1.472916 2.169481 1.1639751
qplot(eval_accuracy_random_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the Random RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################POPULAR######################
eval_accuracy_popular <- calcPredictionAccuracy(
x = eval_prediction_popular, data = getData(eval_sets, "unknown"), byUser =F)
eval_accuracy_popular_user <- calcPredictionAccuracy(
x = eval_prediction_popular, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_popular_user)
## RMSE MSE MAE
## 2 0.6275100 0.3937688 0.5499331
## 14 1.1574183 1.3396171 0.8707250
## 16 0.7310052 0.5343686 0.5265590
## 18 0.8019874 0.6431838 0.6976061
## 21 1.0927126 1.1940207 0.9193394
## 23 0.8127401 0.6605464 0.6502611
qplot(eval_accuracy_popular_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the Popular RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################IBCF######################
eval_accuracy_ibcf <- calcPredictionAccuracy(
x = eval_prediction_ibcf, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_ibcf_user <- calcPredictionAccuracy(
x = eval_prediction_ibcf, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_ibcf_user)
## RMSE MSE MAE
## 2 1.177366 1.386191 1.0237338
## 14 1.287356 1.657287 0.9344477
## 16 1.065587 1.135475 0.6739815
## 18 1.050587 1.103733 0.8178706
## 21 1.377252 1.896823 1.0701019
## 23 1.105258 1.221595 0.7758693
qplot(eval_accuracy_ibcf_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the IBCF RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################UBCF######################
eval_accuracy_ubcf <- calcPredictionAccuracy(
x = eval_prediction_ubcf, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_ubcf_user <- calcPredictionAccuracy(
x = eval_prediction_ubcf, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_ubcf_user)
## RMSE MSE MAE
## 2 0.6675509 0.4456242 0.5819348
## 14 1.1266948 1.2694411 0.8646379
## 16 0.7847265 0.6157957 0.6119964
## 18 0.8327871 0.6935343 0.7192054
## 21 1.2129192 1.4711730 1.0343360
## 23 0.9014106 0.8125410 0.7577947
qplot(eval_accuracy_ubcf_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the IBCF RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################SVD######################
eval_accuracy_svd <- calcPredictionAccuracy(
x = eval_prediction_svd, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_svd_user <- calcPredictionAccuracy(
x = eval_prediction_svd, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_svd_user)
## RMSE MSE MAE
## 2 0.7653761 0.5858006 0.6283317
## 14 1.1314783 1.2802431 0.8630623
## 16 0.8101806 0.6563926 0.6521109
## 18 0.8446688 0.7134654 0.7260224
## 21 1.2770420 1.6308361 1.0812567
## 23 0.9561779 0.9142761 0.8117162
qplot(eval_accuracy_svd_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the IBCF RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
kable(RMSE_COMPARE <- rbind(eval_accuracy_random,eval_accuracy_popular,eval_accuracy_ibcf,eval_accuracy_ubcf,eval_accuracy_svd))
| RMSE | MSE | MAE | |
|---|---|---|---|
| eval_accuracy_random | 1.3191730 | 1.7402174 | 1.0385862 |
| eval_accuracy_popular | 0.9334270 | 0.8712860 | 0.7312318 |
| eval_accuracy_ibcf | 1.2042800 | 1.4502903 | 0.8944834 |
| eval_accuracy_ubcf | 0.9679624 | 0.9369512 | 0.7600586 |
| eval_accuracy_svd | 1.0023309 | 1.0046673 | 0.7925106 |
RMSE_COMPARE <- as.data.frame(RMSE_COMPARE)
RMSE_COMPARE$model <- c('Random','Popular','IBCF','UBCF','SVD')
ggplot(RMSE_COMPARE, aes(model,RMSE))+ geom_point(colour = "blue", size = 3)
# convert data to realRatingMatrix format
mls <- as(ratings_movies,"realRatingMatrix")
items_genre_all <- MovieLenseMeta[,-c(2,3,4)]
items_genre_movie <- semi_join(data.frame('title'=colnames(ratings_movies)), items_genre_all, by = "title")
items_genre <- semi_join(items_genre_all, items_genre_movie, by = "title")
rating_m <- as.matrix(getRatingMatrix(ratings_movies))
# number of user, genre, and movie
nuser <- length(rating_m[,1])
ngenre <- length(items_genre[1,])-1
nmovie <- length(items_genre[,1])
# Compute average rating of each genre for a given user
user_genre <- data.frame(matrix(0, nrow = nuser, ncol = ngenre))
for (i in 1:nuser){
for(j in 1:ngenre){
v1 <- as.matrix(rating_m[i,])
v2 <- as.matrix(items_genre[,(j+1)])
user_genre[i,j] <- (t(v1) %*% v2)/sum(v2)
}
}
# Compute average rating of each item based on average rating for each genre
rating_genre_based <- data.frame(matrix(0, nrow = nuser, ncol =nmovie))
for (i in 1:nuser){
for(j in 1:nmovie){
v1 <- as.matrix(user_genre[i,])
v2 <- as.matrix(items_genre[j,2:(ngenre+1)])
rating_genre_based[i,j] <- (v1 %*% t(v2))/sum(v2)
}
}
rownames(rating_genre_based) <- rownames(rating_m)
colnames(rating_genre_based) <- colnames(rating_m)
rownames(user_genre) <- rownames(rating_m)
colnames(user_genre) <- colnames(items_genre[,2:19])
# Select a random set of users
set.seed(1)
users <- sample(rownames(getRatingMatrix(ratings_movies)),size=20)
# This rating profile based on genre shows the aggregated inclination of each user towards movie genres. Assumming that users like similar items, movies that are most similar to a user's preference for an movie's feature could be recommended.
#convert the predicted ratings based on genre to binary
avg_user_genre <- as.matrix(rowMeans(user_genre))
user_genre_bi <- user_genre - t(avg_user_genre)
user_genre_bi[user_genre_bi < 0] <- 0
user_genre_bi[user_genre_bi > 0] <- 1
#find the most similar movies for a selected user
profile <- user_genre_bi[users[1],]
colnames(profile) <- colnames(items_genre[,2:19])
sim_matrix <- rbind.data.frame(profile, items_genre[,2:19])
sim_matrix <- data.frame(lapply(sim_matrix,function(x) as.integer(x))) #convert data to type integer
#Calculate Jaccard distance between user profile and all movies
similarity <- dist(sim_matrix, method = "Jaccard")
similarity <- as.data.frame(as.matrix(similarity[1:333]))
rows <- which(similarity == min(similarity))
#Recommended movies
print(paste('Recommend use with userID', users[1],'the following movies:', items_genre[rows-1,1]))
## [1] "Recommend use with userID 255 the following movies: GoodFellas (1990)"
## [2] "Recommend use with userID 255 the following movies: George of the Jungle (1997)"
## [3] "Recommend use with userID 255 the following movies: Fly Away Home (1996)"
# Build a User-based recommendation model
rec <- Recommender(mls,method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
## Warning: Unknown parameters: minRating
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
# Creat top-10 recommendation lists for five users
recom_movie <- predict(rec, ratings_movies[users], n=10)
movie_rec <- as(recom_movie, "list")
movie_ubcf <- data.frame(movie_rec[1:20])
colnames(movie_ubcf) <- users
kable(movie_ubcf[,1:5], caption = "Recommend 10 movies to users")
| 255 | 342 | 514 | 854 | 194 |
|---|---|---|---|---|
| Dead Man Walking (1995) | Shawshank Redemption, The (1994) | Trainspotting (1996) | Wrong Trousers, The (1993) | Reservoir Dogs (1992) |
| Godfather, The (1972) | Schindler’s List (1993) | Babe (1995) | GoodFellas (1990) | Shining, The (1980) |
| Face/Off (1997) | Godfather, The (1972) | Lion King, The (1994) | Close Shave, A (1995) | Willy Wonka and the Chocolate Factory (1971) |
| Close Shave, A (1995) | Cinema Paradiso (1988) | Sound of Music, The (1965) | Platoon (1986) | Rear Window (1954) |
| Raiders of the Lost Ark (1981) | Toy Story (1995) | Primal Fear (1996) | Godfather: Part II, The (1974) | Terminator 2: Judgment Day (1991) |
| Aliens (1986) | Much Ado About Nothing (1993) | American President, The (1995) | Raging Bull (1980) | People vs. Larry Flynt, The (1996) |
| Sling Blade (1996) | Vertigo (1958) | Pulp Fiction (1994) | Clockwork Orange, A (1971) | Wrong Trousers, The (1993) |
| Abyss, The (1989) | Full Monty, The (1997) | Sling Blade (1996) | To Kill a Mockingbird (1962) | Contact (1997) |
| Mr. Holland’s Opus (1995) | Strictly Ballroom (1992) | Raising Arizona (1987) | Face/Off (1997) | Good Will Hunting (1997) |
| Star Trek: The Wrath of Khan (1982) | Piano, The (1993) | Fried Green Tomatoes (1991) | Back to the Future (1985) | Face/Off (1997) |
# Creat the recommendation list RS.
d <- gather(movie_ubcf,user,movie,1:20)
## Warning: attributes are not identical across measure variables; they will
## be dropped
rs <- unique(as.character(d[,2]))
rs_df<- data.frame('Expected_list_UBCF'=rs)
kable(head(rs_df))
Dead Man Walking (1995)
Godfather, The (1972)
Face/Off (1997)
Close Shave, A (1995)
Raiders of the Lost Ark (1981) Aliens (1986)
# Primitive recommendation is based on picking top movies that has got the ratings from largest number of users and movies that got the highest average ratings.
# number of ratings for each item
number_of_users_for_items <- as.data.frame(colCounts(mls))
number_of_users_for_items[,1] <-number_of_users_for_items[order(-number_of_users_for_items[,1]),]
colnames(number_of_users_for_items) <- c('movie')
# Compute the movie IDs with the largest number of users.
avg_rating_for_items <- as.data.frame(colMeans(mls))
avg_rating_for_items[,1] <-avg_rating_for_items[order(-avg_rating_for_items[,1]),]
colnames(avg_rating_for_items) <- c('movie')
topN_PPM <- unique(c(rownames(number_of_users_for_items),rownames(avg_rating_for_items)))
# Get the PPM list based on a primitive recommendation method. Here, top 119 movies are selected based on the highest average rating and the largest number of users for each movie.
number_of_rankings_requested = 119
top119_PPM <- topN_PPM[1:119]
# Unexpected list of Recommendations is the list of items that are in RS list but not in PPM.
top119_PPM_df <- data.frame('movie'= top119_PPM)
top119_PPM_df <- data.frame(lapply(top119_PPM_df, as.character), stringsAsFactors=FALSE)
rs_df <- data.frame('movie'= rs)
rs_df <- data.frame(lapply(rs_df, as.character), stringsAsFactors=FALSE)
unexp <- anti_join(rs_df,top119_PPM_df,by.x= 'movie', by.y='movie')
## Joining, by = "movie"
unexp_df <- data.frame('Unexpected_list'=unexp)
kable(head(unexp_df,n=10))
Face/Off (1997)
Close Shave, A (1995)
Sling Blade (1996)
Star Trek: The Wrath of Khan (1982) Schindler’s List (1993)
Vertigo (1958)
Full Monty, The (1997)
Strictly Ballroom (1992)
Piano, The (1993)
Trainspotting (1996)
# retrive the predict ratings generated by UBCF model
recom_rating <- predict(rec, ratings_movies[users], type="ratings")
movie_predict <- as(recom_rating, "list")
predict_rating <- as(recom_rating, "matrix")
predict_rating <- as.data.frame(predict_rating)
head(predict_rating[,1:4])
## Toy Story (1995) GoldenEye (1995) Get Shorty (1995)
## 255 3.115311 2.854167 2.831789
## 342 4.140593 3.293230 NA
## 514 NA 3.900019 NA
## 854 NA 3.079813 NA
## 194 NA 3.017854 NA
## 843 NA 2.813357 2.560468
## Twelve Monkeys (1995)
## 255 NA
## 342 NA
## 514 NA
## 854 NA
## 194 NA
## 843 NA
nameList <- unexp[,1]
predict_rating <- predict_rating[,colnames(predict_rating) %in% nameList]
# Subset the rating matrix to get the predictions for randomly selected users.
predict_rating_sub <- as.data.frame(predict_rating[,rownames(predict_rating) %in% users])
predict_rating_sub$user <- rownames(predict_rating_sub)
predict_rating_sub <- predict_rating_sub[,c(62,1:61)]
predict_rating_sub$user <- sapply(predict_rating_sub$user, as.numeric)
# Compute the list of useful items from the unexpected set of movies.
# Usefulness is when the ratings of movies > 2.5
predict_rating_sub_long <- gather(predict_rating_sub,movie,rating,2:62)
usefulness <- predict_rating_sub_long[which(predict_rating_sub_long$rating>2.5),]
usefulness <- usefulness[order(usefulness$user),]
kable(head(usefulness),caption="Useful_List")
| user | movie | rating | |
|---|---|---|---|
| 50 | 58 | Patton (1970) | 3.902803 |
| 70 | 58 | Young Frankenstein (1974) | 4.308861 |
| 170 | 58 | Last of the Mohicans, The (1992) | 4.062107 |
| 230 | 58 | Fifth Element, The (1997) | 3.653026 |
| 250 | 58 | Contact (1997) | 4.316963 |
| 310 | 58 | Leaving Las Vegas (1995) | 4.258151 |
# subset item_genre matrix to small data set which contains only the movies in serendipity list
items_genre_imt <- semi_join(items_genre_all,data.frame('title'=unique(usefulness$movie)), by = "title")
# Compute average rating of each genre for a given user
v1 <- as.matrix(predict_rating_sub[,2:62])
v1[is.na(v1)] <- 0
user_genre_imt <- v1 %*% as.matrix(items_genre_imt[,2:19])
v3 <- colSums(items_genre_imt[,2:19])
v4 <- matrix(rep((1/v3),18),18,20)
## Warning in matrix(rep((1/v3), 18), 18, 20): data length [324] is not a sub-
## multiple or multiple of the number of columns [20]
v4 <- t(v4)
user_genre_imt <- user_genre_imt[,-7] # Genre Documentary has no rating, so we remove this genre
v4 <- v4[,-7] # remove genre Documentary
user_genre_sp <- user_genre_imt * v4 # average rating of each genre
# Compute average rating of each item based on average rating for each genre
items_genre_sp <- items_genre_imt[,-c(1,8)] # remove genre Documentary
rating_genre_based_sp <- user_genre_sp %*% t(items_genre_sp)#average rating of each item based on genre
colnames(rating_genre_based_sp) <-items_genre_imt$title
kable(head(rating_genre_based_sp[,1:6]),caption="Improved_Serendipity_List")
| Patton (1970) | Young Frankenstein (1974) | Last of the Mohicans, The (1992) | Fifth Element, The (1997) | Contact (1997) | Leaving Las Vegas (1995) | |
|---|---|---|---|---|---|---|
| 255 | 5.364220 | 4.789906 | 8.030055 | 5.480612 | 5.688661 | 5.662999 |
| 342 | 5.444992 | 3.014967 | 6.515864 | 1.748378 | 3.015978 | 4.802111 |
| 514 | 4.717684 | 4.357967 | 8.114585 | 5.545373 | 4.926453 | 4.834792 |
| 854 | 3.844094 | 5.769871 | 6.211448 | 5.116786 | 4.981673 | 4.318114 |
| 194 | 2.326376 | 2.027171 | 2.993236 | 2.316702 | 2.149810 | 1.740571 |
| 843 | 2.731897 | 1.566843 | 4.475721 | 4.101452 | 3.434087 | 2.513222 |
predict_rating_sub_1 <- predict_rating_sub[,-1]
# rerange the item sequence of serendipity list
predict_rating_sub_1 <- predict_rating_sub_1 [colnames(rating_genre_based_sp)]
serendipity <- predict_rating_sub_1
#calculate the difference between serendipity list and the imporved serendipity list and normalize by rating sacle 5.
delta <- ((rating_genre_based_sp - serendipity)/5)*100
#Movies whose delta are below the threshold can be removed from the serendipity list. Here set the threshold as -5%
less_serendipitous <- delta
less_serendipitous$less <- rowSums(less_serendipitous < -5, na.rm=T)
less_serendipitous <- data.frame('UserID'=rownames(less_serendipitous), 'Number of less serendipitous movies'=less_serendipitous[,62])
kable(less_serendipitous )
| UserID | Number.of.less.serendipitous.movies |
|---|---|
| 255 | 1 |
| 342 | 16 |
| 514 | 16 |
| 854 | 9 |
| 194 | 17 |
| 843 | 14 |
| 885 | 15 |
| 591 | 12 |
| 555 | 10 |
| 58 | 9 |
| 195 | 11 |
| 167 | 13 |
| 617 | 7 |
| 344 | 10 |
| 693 | 11 |
| 447 | 14 |
| 637 | 10 |
| 910 | 10 |
| 336 | 6 |
| 921 | 11 |
From the above table, we can see that a significant number of movies are considered less serendipitous based on the above threshold,
delta$UserID <- rownames(delta)
delta_long <- gather(delta,movie,delta_rating,1:61)
delta_long$UserID <- sapply(delta_long$UserID, as.numeric)
ceiling <- max(delta_long$delta_rating,na.rm=T)
bottom <- min(delta_long$delta_rating,na.rm=T)
ggplot(delta_long,aes(x=UserID,y=delta_rating))+geom_point(fill="blue", shape=21, size = 2) +ylim(bottom,ceiling)+geom_hline(yintercept = -5,linetype="dashed",size=1.5)+ggtitle("Serendipity List of Movies")
## Warning: Removed 338 rows containing missing values (geom_point).
The figure above shows the serendipity list of movies generated by a standard collaborative filtering method for 20 randomly selected users. The differences in useful ratings and genre-based ratings below 5% are considered less serendipitous. A significant fraction of movies has been identified as less serendipitous items based on the algorithm used here.
actual <- as.data.frame(as.matrix(getRatingMatrix(ratings_movies)))
actual$userid<-rownames(actual)
actual_long <- gather(actual, title, rating,1:332)
actual_rating <- semi_join(actual_long,data.frame('title'=unique(usefulness$movie)), by = "title")
actual_rating <- actual_rating[which(actual_rating$userid %in% c(users)),]
actual_rating$userid <- sapply(actual_rating$userid, as.numeric)
actual_rating <- actual_rating[match(delta_long$UserID, actual_rating$userid),]
rating_genre_based_sp_1 <- data.frame(rating_genre_based_sp)
rating_genre_based_sp_1$userid <- rownames(rating_genre_based_sp)
rating_genre_based_sp_long <- gather(rating_genre_based_sp_1, movie,rating,1:61)
rmse(actual_rating$rating,rating_genre_based_sp_long$rating)
## [1] 5.516745
The accuracy decrease after introduce the serendipity list into the UBCF model.
#####Diversity
ml.model <- Recommender(ml[1:900], method = "UBCF")
#top 5 recommendations for user
ml.523.top10 <- as(predict(ml.model, ml["523"], n = 10), "list")
length(ml.523.top10)
sample(ml.523.top10, 5)
r <- Recommender(MovieLense[1:800], method = "POPULAR")
r
getModel(r)$topN
#Recommend 3 movies to 2 users
recom <- predict(r, MovieLense[809:810], n = 10)
as(recom, "list")
length(recom)
rnorm(10)
After introducing serendipity to our model, we noticed an increase in our RMSE. Though the model is less accurate in one measure, it is potentially more valuable in the eyes of user experience. Trade-offs like this must be made when designing a recommender system, and the designer has to choose what is more valuable.
These are what we’ve done up to this point, using offline data and simulating user behavior. It is the least risky since there is not threat of driving away users. Therefore, they are the easiest to conduct, due to the fact that interaction with the user is not required.
Where real user populations interact with the recommender system. These experiments evaluate the performance of the recommender system. The user is oblivious to the conducted experiment in these scenarios. Online experiments are known to be the most trustworthy experiment. Here, we want to look at the change in the behavior of users when interacting with different recommendation systems. In this type of setup, small subsets of the user population are directed toward different systems and then their interactions with the different systems are recorded. A user’s assignment to a system must be random. Being that an online experiment can have a negative impact on the “real” system, you want to make sure that your online experiment is last, after extensive research offline.
A great way to examine the effectiveness of an online experiment is to measure the significance level (p-value) between two algorithms, A and B. This will help us determine whether there was a statistically significant difference in our results or they were due to chance. ANOVA or the Friedman test for ranking can be used for the comparison of multiple algorithms at once.
In the case of our Movielens data set, we could implement online experiments that are rolled out to users of movielens.org. In these online experiments, we want to manipulate variables that will enhance the user experience, leading the user to use the site more, rate more movies, etc.
We could design an online experiment where we increase diversity and record the user behavior in response to the change. This might be something like preventing the website from showing a user a movie too many times. If movielens suggests a movie to you and you don’t watch or rate it within a certain period of time, then that movie will be removed from your list of recommendations. That way, we are keeping the selection fresh for the user, always providing them with something new to watch.
We could also implement changes to our system surrounding novelty and roll them out to a small group of users. A user is most likely aware of the incredibly popular movies, and if they haven’t rated them yet, chances are they don’t want to see them. Therefore, we don’t want to suggest these movies to the users. We will put less weighting on the popular movies, and more weighting on the less popular movies, so that these appear more frequently in the recommendations. This will enhance the user experience by providing potentially more useful recommendations. We can measure how useful those recommendations are by observing their behavior on the site. If they rate the less popular movies we suggest, then we may be able to conclude that this approach is effective in meeting our business need.
Reference: 1. Suboojitha Sridharan,Introducing Serendipity In Recommender Systems Though Collaborative Methods,(2014). Open Access Master’s Theses. Paper 453.http://digitalcommons.uri.edu/theses/453