Kelly Shaffer, Yun Mai
June 29, 2017
Load packages
#convert to data to spase matrix
suppressWarnings(suppressMessages(library(Matrix)))
#recommedation model building
suppressWarnings(suppressMessages(library(recommenderlab)))
#data manipulation
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(knitr)))
# similarity calculation
suppressWarnings(suppressMessages(library(proxy)))
#linear algebra here used for rmse calculation
suppressWarnings(suppressMessages(library(Metrics)))
#visualizations
suppressWarnings(suppressMessages(library(ggplot2)))data(MovieLense)#distribution of ratings
vector_ratings <- as.vector(MovieLense@data)
(rating_frq <- table(vector_ratings))## vector_ratings
## 0 1 2 3 4 5
## 1469760 6059 11307 27002 33947 21077
qplot(vector_ratings) + ggtitle("Distribution of the ratings") + labs(x = "Score")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are a lot of missing values(0). The most common rating is 4.
#selece only the users who have rated at least 50 movies and movies that had been rated more than 100 times
(ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
colCounts(MovieLense) > 100])## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
#convert the data to realRatingMatrix
ml <- as(MovieLense@data, "realRatingMatrix")
#distribution of user rating means
user_summary <- as.data.frame(cbind('mean'=rowMeans(ratings_movies),'number'=rowCounts(ml)))## Warning in cbind(mean = rowMeans(ratings_movies), number = rowCounts(ml)):
## number of rows of result is not a multiple of vector length (arg 1)
user_summary <-as.data.frame(sapply(user_summary, function(x) as.numeric(as.character(x))))
#distribution of movie rating means
item_summary <- as.data.frame(cbind('mean'=colMeans(ml), 'number'=colCounts(ratings_movies)))## Warning in cbind(mean = colMeans(ml), number = colCounts(ratings_movies)):
## number of rows of result is not a multiple of vector length (arg 2)
item_summary <-as.data.frame(sapply(item_summary, function(x) as.numeric(as.character(x))))
par(mfrow=c(1,2))
#distribution 1
ggplot(user_summary,aes(mean)) +
geom_histogram(binwidth = 0.05,col='white',fill="orchid3") + labs(x = "Average User Rating", y = "Count of Movies", title = "Distribution 1") + geom_vline(xintercept = mean(user_summary$mean),col='grey',size=2)#distribution 2
ggplot(item_summary,aes(mean)) +
geom_histogram(binwidth = 0.05,col='white',fill="seagreen3") + labs(x = "Average Movie Rating", y = "Count of Movies", title = "Distribution 2") + geom_vline(xintercept = mean(item_summary$mean),col='grey',size=2)#distribution 3
ggplot(user_summary,aes(number)) +
geom_histogram(binwidth = 0.8,fill="violetred2") + labs(x = "Count of Rated Items", y = "Count of Users", title = "Distribution 3")#distribution 4
ggplot(item_summary,aes(number)) +
geom_histogram(binwidth = 0.8,fill="turquoise2") + labs(x = "Count of Ratings per Movie", y = "Count of Movies", title = "Distribution 4")From the figures, we can see:
1) Distribution 1 contains the rating means of users and distribution 2 contains the rating means of movies. We can see that the distributions are normal with a grey line indicating the mean.
2) Distribution 3 contains the number of movies rated by each user. Distribution 4 contains the number of times each movie was rated. We can see that both distributions are right skewed, suggesting that many movies have been rated by only a few users.
The movies which have been rated less than 100 times will be removed for our analysis going forward.
image(ratings_movies, main = "Heatmap of the rating matrix")# select the top users and movies
top_movies <- quantile(rowCounts(ratings_movies), 0.98)
top_users <- quantile(colCounts(ratings_movies), 0.98)
#heatmap of the top users and movies
image(ratings_movies[rowCounts(ratings_movies) > top_movies,
colCounts(ratings_movies) > top_users], main = "Heatmap of the top
users and movies")ratings_movies_nor <- normalize(ratings_movies)
getRatingMatrix(ratings_movies_nor)[1:5,1:5]## 5 x 5 sparse Matrix of class "dgCMatrix"
## Toy Story (1995) GoldenEye (1995) Get Shorty (1995)
## 1 1.17341040 -0.8265896 -0.8265896
## 2 0.07317073 . .
## 3 . . .
## 5 0.71739130 -0.2826087 .
## 6 0.38194444 . .
## Twelve Monkeys (1995) Babe (1995)
## 1 0.1734104 -2.8265896
## 2 . .
## 3 . .
## 5 . .
## 6 -1.6180556 0.3819444
(mean(getRatingMatrix(ratings_movies),na.rm=TRUE))## [1] 1.102705
ratings_movies_bi <- binarize(ratings_movies, minRating = 1)
getRatingMatrix(ratings_movies_bi)## itemMatrix in sparse format with
## 560 rows (elements/transactions) and
## 332 columns (items)
# use the minimum number of items purchased by any user tp decide item number to keep
(min(rowCounts(ratings_movies)))## [1] 18
n_fold <- 4
items_to_keep <- 15
rating_threshold <- 3
# Use k-fold to validate models
eval_sets <- evaluationScheme(data = ratings_movies, method = "cross-validation",k = n_fold, given = items_to_keep, goodRating = rating_threshold)
models <- list(
RANDOM = list(name = "RANDOM", param = NULL),
POPULAR = list(name = "POPULAR", param = NULL),
IBCF=list(name="IBCF",param=list(method = "cosine")),
UBCF=list(name="UBCF", param=list(method = "cosine")),
SVD=list(name="SVD", param=list(k =100))
)
# varying the number of items we want to recommend to users
n_rec <- c(1, 5, seq(10, 100, 10))
# evaluating the recommendations
results <- evaluate(x = eval_sets, method = models, n= n_rec)## RANDOM run fold/sample [model time/prediction time]
## 1 [0.02sec/0.09sec]
## 2 [0sec/0.09sec]
## 3 [0sec/0.1sec]
## 4 [0sec/0.1sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.36sec]
## 2 [0.01sec/0.38sec]
## 3 [0.02sec/0.33sec]
## 4 [0sec/0.39sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.41sec/0.11sec]
## 2 [0.4sec/0.11sec]
## 3 [0.36sec/0.11sec]
## 4 [0.38sec/0.11sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/0.37sec]
## 2 [0sec/0.55sec]
## 3 [0sec/0.33sec]
## 4 [0sec/0.32sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.5sec/0.11sec]
## 2 [0.68sec/0.12sec]
## 3 [0.63sec/0.13sec]
## 4 [0.58sec/0.08sec]
# extract the related average confusion matrices
(avg_matrices <- lapply(results, avg))## $RANDOM
## TP FP FN TN precision recall
## 1 0.300000 0.700000 72.86964 243.1304 0.3000000 0.003710834
## 5 1.362500 3.637500 71.80714 240.1929 0.2725000 0.017694191
## 10 2.655357 7.344643 70.51429 236.4857 0.2655357 0.034585415
## 20 5.119643 14.880357 68.05000 228.9500 0.2559821 0.067484632
## 30 7.323214 22.676786 65.84643 221.1536 0.2441071 0.098416096
## 40 9.542857 30.457143 63.62679 213.3732 0.2385714 0.128622660
## 50 11.730357 38.269643 61.43929 205.5607 0.2346071 0.158258313
## 60 13.921429 46.078571 59.24821 197.7518 0.2320238 0.188344250
## 70 16.212500 53.787500 56.95714 190.0429 0.2316071 0.219148970
## 80 18.483929 61.516071 54.68571 182.3143 0.2310491 0.250191267
## 90 20.783929 69.216071 52.38571 174.6143 0.2309325 0.281634800
## 100 23.062500 76.937500 50.10714 166.8929 0.2306250 0.312489665
## TPR FPR
## 1 0.003710834 0.002790151
## 5 0.017694191 0.014728273
## 10 0.034585415 0.029791959
## 20 0.067484632 0.060596092
## 30 0.098416096 0.092585917
## 40 0.128622660 0.124538168
## 50 0.158258313 0.156715584
## 60 0.188344250 0.188818111
## 70 0.219148970 0.220396152
## 80 0.250191267 0.252096283
## 90 0.281634800 0.283772893
## 100 0.312489665 0.315503957
##
## $POPULAR
## TP FP FN TN precision recall
## 1 0.7214286 0.2785714 72.44821 243.5518 0.7214286 0.01197377
## 5 2.7625000 2.2375000 70.40714 241.5929 0.5525000 0.04205479
## 10 5.0785714 4.9214286 68.09107 238.9089 0.5078571 0.07488194
## 20 9.1053571 10.8946429 64.06429 232.9357 0.4552679 0.13135044
## 30 12.2589286 17.7410714 60.91071 226.0893 0.4086310 0.17523814
## 40 14.9196429 25.0803571 58.25000 218.7500 0.3729911 0.21192055
## 50 18.0642857 31.9357143 55.10536 211.8946 0.3612857 0.25277782
## 60 20.8500000 39.1500000 52.31964 204.6804 0.3475000 0.29027315
## 70 23.4589286 46.5410714 49.71071 197.2893 0.3351276 0.32690892
## 80 26.1267857 53.8732143 47.04286 189.9571 0.3265848 0.36295416
## 90 28.9214286 61.0785714 44.24821 182.7518 0.3213492 0.39953657
## 100 31.1160714 68.8839286 42.05357 174.9464 0.3111607 0.42868182
## TPR FPR
## 1 0.01197377 0.001032292
## 5 0.04205479 0.008570093
## 10 0.07488194 0.019001687
## 20 0.13135044 0.042545224
## 30 0.17523814 0.070125655
## 40 0.21192055 0.100045874
## 50 0.25277782 0.127542655
## 60 0.29027315 0.156794182
## 70 0.32690892 0.186922897
## 80 0.36295416 0.216775694
## 90 0.39953657 0.245875236
## 100 0.42868182 0.277839959
##
## $IBCF
## TP FP FN TN precision recall
## 1 0.3517857 0.6482143 72.81786 243.1821 0.3517857 0.004704604
## 5 1.5071429 3.4928571 71.66250 240.3375 0.3014286 0.020421888
## 10 2.8375000 7.1625000 70.33214 236.6679 0.2837500 0.038260790
## 20 5.5589286 14.4410714 67.61071 229.3893 0.2779464 0.075186480
## 30 8.1500000 21.8500000 65.01964 221.9804 0.2716667 0.111546479
## 40 10.7107143 29.2892857 62.45893 214.5411 0.2677679 0.147078492
## 50 13.1946429 36.8053571 59.97500 207.0250 0.2638929 0.184117461
## 60 15.6553571 44.3446429 57.51429 199.4857 0.2609226 0.218873772
## 70 18.1875000 51.8017857 54.98214 192.0286 0.2598382 0.255920967
## 80 20.8767857 59.0946429 52.29286 184.7357 0.2609989 0.295753385
## 90 23.3946429 66.5589286 49.77500 177.2714 0.2599969 0.332184465
## 100 25.8071429 74.1214286 47.36250 169.7089 0.2581691 0.367284151
## TPR FPR
## 1 0.004704604 0.002585776
## 5 0.020421888 0.014064897
## 10 0.038260790 0.028969708
## 20 0.075186480 0.058513435
## 30 0.111546479 0.088679090
## 40 0.147078492 0.119099210
## 50 0.184117461 0.149967775
## 60 0.218873772 0.180853028
## 70 0.255920967 0.211373666
## 80 0.295753385 0.241237780
## 90 0.332184465 0.271860369
## 100 0.367284151 0.302996193
##
## $UBCF
## TP FP FN TN precision recall TPR
## 1 0.687500 0.312500 72.48214 243.5179 0.6875000 0.01213981 0.01213981
## 5 2.867857 2.132143 70.30179 241.6982 0.5735714 0.04757398 0.04757398
## 10 5.225000 4.775000 67.94464 239.0554 0.5225000 0.08346185 0.08346185
## 20 9.253571 10.746429 63.91607 233.0839 0.4626786 0.14199344 0.14199344
## 30 12.667857 17.332143 60.50179 226.4982 0.4222619 0.19051825 0.19051825
## 40 15.848214 24.151786 57.32143 219.6786 0.3962054 0.23568372 0.23568372
## 50 18.819643 31.180357 54.35000 212.6500 0.3763929 0.27681225 0.27681225
## 60 21.580357 38.419643 51.58929 205.4107 0.3596726 0.31402751 0.31402751
## 70 24.125000 45.875000 49.04464 197.9554 0.3446429 0.34802450 0.34802450
## 80 26.558929 53.441071 46.61071 190.3893 0.3319866 0.37957817 0.37957817
## 90 28.891071 61.108929 44.27857 182.7214 0.3210119 0.41020202 0.41020202
## 100 31.117857 68.882143 42.05179 174.9482 0.3111786 0.43854524 0.43854524
## FPR
## 1 0.001190574
## 5 0.008285847
## 10 0.018608597
## 20 0.042323293
## 30 0.068694515
## 40 0.096221087
## 50 0.124584753
## 60 0.153956768
## 70 0.184347296
## 80 0.215264680
## 90 0.246684614
## 100 0.278480965
##
## $SVD
## TP FP FN TN precision recall
## 1 0.375000 0.625000 72.79464 243.2054 0.3750000 0.005697137
## 5 1.887500 3.112500 71.28214 240.7179 0.3775000 0.028998859
## 10 3.467857 6.532143 69.70179 237.2982 0.3467857 0.051861848
## 20 6.275000 13.725000 66.89464 230.1054 0.3137500 0.091864973
## 30 9.030357 20.969643 64.13929 222.8607 0.3010119 0.131267811
## 40 11.616071 28.383929 61.55357 215.4464 0.2904018 0.167688858
## 50 14.116071 35.883929 59.05357 207.9464 0.2823214 0.202292078
## 60 16.566071 43.433929 56.60357 200.3964 0.2761012 0.236276017
## 70 18.885714 51.114286 54.28393 192.7161 0.2697959 0.268850232
## 80 21.194643 58.805357 51.97500 185.0250 0.2649330 0.299672173
## 90 23.514286 66.485714 49.65536 177.3446 0.2612698 0.330601939
## 100 25.746429 74.253571 47.42321 169.5768 0.2574643 0.360884564
## TPR FPR
## 1 0.005697137 0.002517459
## 5 0.028998859 0.012577072
## 10 0.051861848 0.026449141
## 20 0.091864973 0.055758008
## 30 0.131267811 0.085297391
## 40 0.167688858 0.115575127
## 50 0.202292078 0.146175698
## 60 0.236276017 0.177105827
## 70 0.268850232 0.208557614
## 80 0.299672173 0.240072793
## 90 0.330601939 0.271404157
## 100 0.360884564 0.303318332
. True Positives (TP): These are recommended items that have been rated
. False Positives (FP): These are recommended items that haven't been rated . False Negatives(FN): These are not recommended items that have been rated
. True Negatives (TN): These are not recommended items that haven't been rated
. True Positive Rate (TPR): This is the percentage of purchased items that have been recommended. It's the number of TP divided by the number of purchased items (TP + FN).
. False Positive Rate (FPR): This is the percentage of not purchased items that have been recommended. It's the number of FP divided by the number of not purchased items (FP + TN).
recommender_random <- Recommender(data = getData(eval_sets, "train"),
method = "RANDOM",parameter = NULL)
recommender_popular <- Recommender(data = getData(eval_sets, "train"),
method = "POPULAR",parameter = NULL)
recommender_ibcf <- Recommender(data = getData(eval_sets, "train"),
method = "IBCF",parameter = list(method = "cosine"))
recommender_ubcf <- Recommender(data = getData(eval_sets, "train"),
method = "UBCF",parameter = list(method = "cosine"))
recommender_svd <- Recommender(data = getData(eval_sets, "train"),
method = "SVD",parameter = list(k =100))
items_to_recommend <- 10
eval_prediction_random <- predict(object = recommender_random, newdata =
getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_popular <- predict(object = recommender_popular, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_ibcf <- predict(object = recommender_ibcf, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_ubcf <- predict(object = recommender_ubcf, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
eval_prediction_svd <- predict(object = recommender_svd, newdata = getData(eval_sets, "known"), n = items_to_recommend, type = "ratings")
# compare RMSE for different models
######################RANDOM######################
eval_accuracy_random <- calcPredictionAccuracy(
x = eval_prediction_random, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_random_user <- calcPredictionAccuracy(
x = eval_prediction_random, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_random_user)## RMSE MSE MAE
## [1,] 1.3597243 1.8488502 1.101434
## [2,] 0.9268175 0.8589907 0.728238
## [3,] 1.3961119 1.9491284 1.088635
## [4,] 1.8334279 3.3614578 1.503241
## [5,] 1.5094240 2.2783609 1.169797
## [6,] 1.4094622 1.9865838 1.079278
qplot(eval_accuracy_random_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the Random RMSE by user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################POPULAR######################
eval_accuracy_popular <- calcPredictionAccuracy(
x = eval_prediction_popular, data = getData(eval_sets, "unknown"), byUser =F)
eval_accuracy_popular_user <- calcPredictionAccuracy(
x = eval_prediction_popular, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_popular_user)## RMSE MSE MAE
## 2 0.8252278 0.6810009 0.5617565
## 10 0.4714709 0.2222848 0.3910219
## 42 0.9589071 0.9195028 0.7670577
## 49 1.3703266 1.8777950 1.1194657
## 56 0.8764282 0.7681264 0.6900810
## 59 0.9892119 0.9785402 0.7399601
qplot(eval_accuracy_popular_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the Popular RMSE by user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################IBCF######################
eval_accuracy_ibcf <- calcPredictionAccuracy(
x = eval_prediction_ibcf, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_ibcf_user <- calcPredictionAccuracy(
x = eval_prediction_ibcf, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_ibcf_user)## RMSE MSE MAE
## 2 1.3116035 1.7203037 0.9834384
## 10 0.6672847 0.4452689 0.5211795
## 42 1.2864529 1.6549612 0.9421041
## 49 1.7794485 3.1664371 1.3509088
## 56 1.3133327 1.7248427 0.9765165
## 59 1.3559660 1.8386439 0.9244176
qplot(eval_accuracy_ibcf_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the IBCF RMSE by user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################UBCF######################
eval_accuracy_ubcf <- calcPredictionAccuracy(
x = eval_prediction_ubcf, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_ubcf_user <- calcPredictionAccuracy(
x = eval_prediction_ubcf, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_ubcf_user)## RMSE MSE MAE
## 2 0.8609476 0.7412308 0.6314881
## 10 0.4991643 0.2491650 0.3975030
## 42 0.9342446 0.8728129 0.7246420
## 49 1.3601553 1.8500224 1.1325268
## 56 0.8884840 0.7894038 0.6565416
## 59 1.0730427 1.1514207 0.8145044
qplot(eval_accuracy_ubcf_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the IBCF RMSE by user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
######################SVD######################
eval_accuracy_svd <- calcPredictionAccuracy(
x = eval_prediction_svd, data = getData(eval_sets, "unknown"), byUser = F)
eval_accuracy_svd_user <- calcPredictionAccuracy(
x = eval_prediction_svd, data = getData(eval_sets, "unknown"), byUser = TRUE)
head(eval_accuracy_svd_user)## RMSE MSE MAE
## 2 0.9400020 0.8836037 0.7102671
## 10 0.5833815 0.3403340 0.4447397
## 42 1.0026193 1.0052454 0.7776307
## 49 1.3821925 1.9104560 1.1574958
## 56 0.9367730 0.8775437 0.6853434
## 59 1.1105352 1.2332885 0.8429122
qplot(eval_accuracy_svd_user[, "RMSE"]) + geom_histogram(binwidth = 0.05) + ggtitle("Distribution of the IBCF RMSE by user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
kable(RMSE_COMPARE <- rbind(eval_accuracy_random,eval_accuracy_popular,eval_accuracy_ibcf,eval_accuracy_ubcf,eval_accuracy_svd))| RMSE | MSE | MAE | |
|---|---|---|---|
| eval_accuracy_random | 1.3400123 | 1.7956330 | 1.0451535 |
| eval_accuracy_popular | 0.9486848 | 0.9000029 | 0.7402682 |
| eval_accuracy_ibcf | 1.2423668 | 1.5434753 | 0.9204428 |
| eval_accuracy_ubcf | 0.9729319 | 0.9465964 | 0.7618365 |
| eval_accuracy_svd | 1.0051898 | 1.0104065 | 0.7925634 |
RMSE_COMPARE <- as.data.frame(RMSE_COMPARE)
RMSE_COMPARE$model <- c('Random','Popular','IBCF','UBCF','SVD')
ggplot(RMSE_COMPARE, aes(model,RMSE))+ geom_point(colour = "blue", size = 3)# convert data to realRatingMatrix format
mls <- as(ratings_movies,"realRatingMatrix")
items_genre_all <- MovieLenseMeta[,-c(2,3,4)]
items_genre_movie <- semi_join(data.frame('title'=colnames(ratings_movies)), items_genre_all, by = "title")
items_genre <- semi_join(items_genre_all, items_genre_movie, by = "title")
rating_m <- as.matrix(getRatingMatrix(ratings_movies))
# number of user, genre, and movie
nuser <- length(rating_m[,1])
ngenre <- length(items_genre[1,])-1
nmovie <- length(items_genre[,1])
# Compute average rating of each genre for a given user
user_genre <- data.frame(matrix(0, nrow = nuser, ncol = ngenre))
for (i in 1:nuser){
for(j in 1:ngenre){
v1 <- as.matrix(rating_m[i,])
v2 <- as.matrix(items_genre[,(j+1)])
user_genre[i,j] <- (t(v1) %*% v2)/sum(v2)
}
}
# Compute average rating of each item based on average rating for each genre
rating_genre_based <- data.frame(matrix(0, nrow = nuser, ncol =nmovie))
for (i in 1:nuser){
for(j in 1:nmovie){
v1 <- as.matrix(user_genre[i,])
v2 <- as.matrix(items_genre[j,2:(ngenre+1)])
rating_genre_based[i,j] <- (v1 %*% t(v2))/sum(v2)
}
}
rownames(rating_genre_based) <- rownames(rating_m)
colnames(rating_genre_based) <- colnames(rating_m)
rownames(user_genre) <- rownames(rating_m)
colnames(user_genre) <- colnames(items_genre[,2:19])
# Select a random set of users
set.seed(1)
users <- sample(rownames(getRatingMatrix(ratings_movies)),size=20)
# This rating profile based on genre shows the aggregated inclination of each user towards movie genres. Assumming that users like similar items, movies that are most similar to a user's preference for an movie's feature could be recommended.
#convert the predicted ratings based on genre to binary
avg_user_genre <- as.matrix(rowMeans(user_genre))
user_genre_bi <- user_genre - t(avg_user_genre)
user_genre_bi[user_genre_bi < 0] <- 0
user_genre_bi[user_genre_bi > 0] <- 1
#find the most similar movies for a selected user
profile <- user_genre_bi[users[1],]
colnames(profile) <- colnames(items_genre[,2:19])
sim_matrix <- rbind.data.frame(profile, items_genre[,2:19])
sim_matrix <- data.frame(lapply(sim_matrix,function(x) as.integer(x))) #convert data to type integer
#Calculate Jaccard distance between user profile and all movies
similarity <- dist(sim_matrix, method = "Jaccard")
similarity <- as.data.frame(as.matrix(similarity[1:333]))
rows <- which(similarity == min(similarity))
#Recommended movies
print(paste('Recommend use with userID', users[1],'the following movies:', items_genre[rows-1,1]))## [1] "Recommend use with userID 255 the following movies: GoodFellas (1990)"
## [2] "Recommend use with userID 255 the following movies: George of the Jungle (1997)"
## [3] "Recommend use with userID 255 the following movies: Fly Away Home (1996)"
Recommend 10 movies to 20 randomly chose users and creat recommendation list RS.
# Build a User-based recommendation model
rec <- Recommender(mls,method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))## Warning: Unknown parameters: minRating
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
# Creat top-10 recommendation lists for five users
recom_movie <- predict(rec, ratings_movies[users], n=10)
movie_rec <- as(recom_movie, "list")
movie_ubcf <- data.frame(movie_rec[1:20])
colnames(movie_ubcf) <- users
kable(movie_ubcf[,1:5], caption = "Recommend 10 movies to users")| 255 | 342 | 514 | 854 | 194 |
|---|---|---|---|---|
| Dead Man Walking (1995) | Shawshank Redemption, The (1994) | Trainspotting (1996) | Wrong Trousers, The (1993) | Reservoir Dogs (1992) |
| Godfather, The (1972) | Schindler's List (1993) | Babe (1995) | GoodFellas (1990) | Shining, The (1980) |
| Face/Off (1997) | Godfather, The (1972) | Lion King, The (1994) | Close Shave, A (1995) | Willy Wonka and the Chocolate Factory (1971) |
| Close Shave, A (1995) | Cinema Paradiso (1988) | Sound of Music, The (1965) | Platoon (1986) | Rear Window (1954) |
| Raiders of the Lost Ark (1981) | Toy Story (1995) | Primal Fear (1996) | Godfather: Part II, The (1974) | Terminator 2: Judgment Day (1991) |
| Aliens (1986) | Much Ado About Nothing (1993) | American President, The (1995) | Raging Bull (1980) | People vs. Larry Flynt, The (1996) |
| Sling Blade (1996) | Vertigo (1958) | Pulp Fiction (1994) | Clockwork Orange, A (1971) | Wrong Trousers, The (1993) |
| Abyss, The (1989) | Full Monty, The (1997) | Sling Blade (1996) | To Kill a Mockingbird (1962) | Contact (1997) |
| Mr. Holland's Opus (1995) | Strictly Ballroom (1992) | Raising Arizona (1987) | Face/Off (1997) | Good Will Hunting (1997) |
| Star Trek: The Wrath of Khan (1982) | Piano, The (1993) | Fried Green Tomatoes (1991) | Back to the Future (1985) | Face/Off (1997) |
# Creat the recommendation list RS.
d <- gather(movie_ubcf,user,movie,1:20)## Warning: attributes are not identical across measure variables; they will
## be dropped
rs <- unique(as.character(d[,2]))
rs_df<- data.frame('Expected_list_UBCF'=rs)
kable(head(rs_df))Dead Man Walking (1995)
Godfather, The (1972)
Face/Off (1997)
Close Shave, A (1995)
Raiders of the Lost Ark (1981) Aliens (1986)
Create the PPM list for top-119 movies using a primitive recommendation method
# Primitive recommendation is based on picking top movies that has got the ratings from largest number of users and movies that got the highest average ratings.
# number of ratings for each item
number_of_users_for_items <- as.data.frame(colCounts(mls))
number_of_users_for_items[,1] <-number_of_users_for_items[order(-number_of_users_for_items[,1]),]
colnames(number_of_users_for_items) <- c('movie')
# Compute the movie IDs with the largest number of users.
avg_rating_for_items <- as.data.frame(colMeans(mls))
avg_rating_for_items[,1] <-avg_rating_for_items[order(-avg_rating_for_items[,1]),]
colnames(avg_rating_for_items) <- c('movie')
topN_PPM <- unique(c(rownames(number_of_users_for_items),rownames(avg_rating_for_items)))
# Get the PPM list based on a primitive recommendation method. Here, top 119 movies are selected based on the highest average rating and the largest number of users for each movie.
number_of_rankings_requested = 119
top119_PPM <- topN_PPM[1:119]
# Unexpected list of Recommendations is the list of items that are in RS list but not in PPM.
top119_PPM_df <- data.frame('movie'= top119_PPM)
top119_PPM_df <- data.frame(lapply(top119_PPM_df, as.character), stringsAsFactors=FALSE)
rs_df <- data.frame('movie'= rs)
rs_df <- data.frame(lapply(rs_df, as.character), stringsAsFactors=FALSE)
unexp <- anti_join(rs_df,top119_PPM_df,by.x= 'movie', by.y='movie')## Joining, by = "movie"
unexp_df <- data.frame('Unexpected_list'=unexp)
kable(head(unexp_df,n=10))Face/Off (1997)
Close Shave, A (1995)
Sling Blade (1996)
Star Trek: The Wrath of Khan (1982) Schindler's List (1993)
Vertigo (1958)
Full Monty, The (1997)
Strictly Ballroom (1992)
Piano, The (1993)
Trainspotting (1996)
# retrive the predict ratings generated by UBCF model
recom_rating <- predict(rec, ratings_movies[users], type="ratings")
movie_predict <- as(recom_rating, "list")
predict_rating <- as(recom_rating, "matrix")
predict_rating <- as.data.frame(predict_rating)
head(predict_rating[,1:4])## Toy Story (1995) GoldenEye (1995) Get Shorty (1995)
## 255 3.115311 2.854167 2.831789
## 342 4.140593 3.293230 NA
## 514 NA 3.900019 NA
## 854 NA 3.079813 NA
## 194 NA 3.017854 NA
## 843 NA 2.813357 2.560468
## Twelve Monkeys (1995)
## 255 NA
## 342 NA
## 514 NA
## 854 NA
## 194 NA
## 843 NA
nameList <- unexp[,1]
predict_rating <- predict_rating[,colnames(predict_rating) %in% nameList]
# Subset the rating matrix to get the predictions for randomly selected users.
predict_rating_sub <- as.data.frame(predict_rating[,rownames(predict_rating) %in% users])
predict_rating_sub$user <- rownames(predict_rating_sub)
predict_rating_sub <- predict_rating_sub[,c(62,1:61)]
predict_rating_sub$user <- sapply(predict_rating_sub$user, as.numeric)
# Compute the list of useful items from the unexpected set of movies.
# Usefulness is when the ratings of movies > 2.5
predict_rating_sub_long <- gather(predict_rating_sub,movie,rating,2:62)
usefulness <- predict_rating_sub_long[which(predict_rating_sub_long$rating>2.5),]
usefulness <- usefulness[order(usefulness$user),]
kable(head(usefulness),caption="Useful_List")| user | movie | rating | |
|---|---|---|---|
| 50 | 58 | Patton (1970) | 3.902803 |
| 70 | 58 | Young Frankenstein (1974) | 4.308861 |
| 170 | 58 | Last of the Mohicans, The (1992) | 4.062107 |
| 230 | 58 | Fifth Element, The (1997) | 3.653026 |
| 250 | 58 | Contact (1997) | 4.316963 |
| 310 | 58 | Leaving Las Vegas (1995) | 4.258151 |
# subset item_genre matrix to small data set which contains only the movies in serendipity list
items_genre_imt <- semi_join(items_genre_all,data.frame('title'=unique(usefulness$movie)), by = "title")
# Compute average rating of each genre for a given user
v1 <- as.matrix(predict_rating_sub[,2:62])
v1[is.na(v1)] <- 0
user_genre_imt <- v1 %*% as.matrix(items_genre_imt[,2:19])
v3 <- colSums(items_genre_imt[,2:19])
v4 <- matrix(rep((1/v3),18),18,20)## Warning in matrix(rep((1/v3), 18), 18, 20): data length [324] is not a sub-
## multiple or multiple of the number of columns [20]
v4 <- t(v4)
user_genre_imt <- user_genre_imt[,-7] # Genre Documentary has no rating, so we remove this genre
v4 <- v4[,-7] # remove genre Documentary
user_genre_sp <- user_genre_imt * v4 # average rating of each genre
# Compute average rating of each item based on average rating for each genre
items_genre_sp <- items_genre_imt[,-c(1,8)] # remove genre Documentary
rating_genre_based_sp <- user_genre_sp %*% t(items_genre_sp)#average rating of each item based on genre
colnames(rating_genre_based_sp) <-items_genre_imt$title
kable(head(rating_genre_based_sp[,1:6]),caption="Improved_Serendipity_List")| Patton (1970) | Young Frankenstein (1974) | Last of the Mohicans, The (1992) | Fifth Element, The (1997) | Contact (1997) | Leaving Las Vegas (1995) | |
|---|---|---|---|---|---|---|
| 255 | 5.364220 | 4.789906 | 8.030055 | 5.480612 | 5.688661 | 5.662999 |
| 342 | 5.444992 | 3.014967 | 6.515864 | 1.748378 | 3.015978 | 4.802111 |
| 514 | 4.717684 | 4.357967 | 8.114585 | 5.545373 | 4.926453 | 4.834792 |
| 854 | 3.844094 | 5.769871 | 6.211448 | 5.116786 | 4.981673 | 4.318114 |
| 194 | 2.326376 | 2.027171 | 2.993236 | 2.316702 | 2.149810 | 1.740571 |
| 843 | 2.731897 | 1.566843 | 4.475721 | 4.101452 | 3.434087 | 2.513222 |
predict_rating_sub_1 <- predict_rating_sub[,-1]
# rerange the item sequence of serendipity list
predict_rating_sub_1 <- predict_rating_sub_1 [colnames(rating_genre_based_sp)]
serendipity <- predict_rating_sub_1
#calculate the difference between serendipity list and the imporved serendipity list and normalize by rating sacle 5.
delta <- ((rating_genre_based_sp - serendipity)/5)*100
#Movies whose delta are below the threshold can be removed from the serendipity list. Here set the threshold as -5%
less_serendipitous <- delta
less_serendipitous$less <- rowSums(less_serendipitous < -5, na.rm=T)
less_serendipitous <- data.frame('UserID'=rownames(less_serendipitous), 'Number of less serendipitous movies'=less_serendipitous[,62])
kable(less_serendipitous )| UserID | Number.of.less.serendipitous.movies |
|---|---|
| 255 | 1 |
| 342 | 16 |
| 514 | 16 |
| 854 | 9 |
| 194 | 17 |
| 843 | 14 |
| 885 | 15 |
| 591 | 12 |
| 555 | 10 |
| 58 | 9 |
| 195 | 11 |
| 167 | 13 |
| 617 | 7 |
| 344 | 10 |
| 693 | 11 |
| 447 | 14 |
| 637 | 10 |
| 910 | 10 |
| 336 | 6 |
| 921 | 11 |
From the above table, we can see that a significant number of movies are considered less serendipitous based on the above threshold,
delta$UserID <- rownames(delta)
delta_long <- gather(delta,movie,delta_rating,1:61)
delta_long$UserID <- sapply(delta_long$UserID, as.numeric)
ceiling <- max(delta_long$delta_rating,na.rm=T)
bottom <- min(delta_long$delta_rating,na.rm=T)
ggplot(delta_long,aes(x=UserID,y=delta_rating))+geom_point(fill="blue", shape=21, size = 2) +ylim(bottom,ceiling)+geom_hline(yintercept = -5,linetype="dashed",size=1.5)+ggtitle("Serendipity List of Movies")## Warning: Removed 338 rows containing missing values (geom_point).
The figure above shows the serendipity list of movies generated by a standard collaborative filtering method for 20 randomly selected users. The differences in useful ratings and genre-based ratings below 5% are considered less serendipitous. A significant fraction of movies has been identified as less serendipitous items based on the algorithm used here.
actual <- as.data.frame(as.matrix(getRatingMatrix(ratings_movies)))
actual$userid<-rownames(actual)
actual_long <- gather(actual, title, rating,1:332)
actual_rating <- semi_join(actual_long,data.frame('title'=unique(usefulness$movie)), by = "title")
actual_rating <- actual_rating[which(actual_rating$userid %in% c(users)),]
actual_rating$userid <- sapply(actual_rating$userid, as.numeric)
actual_rating <- actual_rating[match(delta_long$UserID, actual_rating$userid),]
rating_genre_based_sp_1 <- data.frame(rating_genre_based_sp)
rating_genre_based_sp_1$userid <- rownames(rating_genre_based_sp)
rating_genre_based_sp_long <- gather(rating_genre_based_sp_1, movie,rating,1:61)
rmse(actual_rating$rating,rating_genre_based_sp_long$rating)## [1] 5.516745
The accuracy decrease after introduce the serendipity list into the UBCF model.
#####Diversity
ml.model <- Recommender(ml[1:900], method = "UBCF")
#top 5 recommendations for user
ml.523.top10 <- as(predict(ml.model, ml["523"], n = 10), "list")
length(ml.523.top10[[1]])
sample(ml.523.top10[[1]], 5)
r <- Recommender(MovieLense[1:800], method = "POPULAR")
r
getModel(r)$topN
#Recommend 3 movies to 2 users
recom <- predict(r, MovieLense[809:810], n = 10)
as(recom, "list")
length(recom)
rnorm(10)After introducing serendipity to our model, we noticed an increase in our RMSE. Though the model is less accurate in one measure, it is potentially more valuable in the eyes of user experience. Trade-offs like this must be made when designing a recommender system, and the designer has to choose what is more valuable.
Offline experiments:
These are what we've done up to this point, using offline data and simulating user behavior. It is the least risky since there is not threat of driving away users. Therefore, they are the easiest to conduct, due to the fact that interaction with the user is not required.
Online experiments:
Where real user populations interact with the recommender system. These experiments evaluate the performance of the recommender system. The user is oblivious to the conducted experiment in these scenarios. Online experiments are known to be the most trustworthy experiment. Here, we want to look at the change in the behavior of users when interacting with different recommendation systems. In this type of setup, small subsets of the user population are directed toward different systems and then their interactions with the different systems are recorded. A user's assignment to a system must be random. Being that an online experiment can have a negative impact on the "real" system, you want to make sure that your online experiment is last, after extensive research offline.
A great way to examine the effectiveness of an online experiment is to measure the significance level (p-value) between two algorithms, A and B. This will help us determine whether there was a statistically significant difference in our results or they were due to chance. ANOVA or the Friedman test for ranking can be used for the comparison of multiple algorithms at once.
In the case of our Movielens data set, we could implement online experiments that are rolled out to users of movielens.org. In these online experiments, we want to manipulate variables that will enhance the user experience, leading the user to use the site more, rate more movies, etc.
We could design an online experiment where we increase diversity and record the user behavior in response to the change. This might be something like preventing the website from showing a user a movie too many times. If movielens suggests a movie to you and you don't watch or rate it within a certain period of time, then that movie will be removed from your list of recommendations. That way, we are keeping the selection fresh for the user, always providing them with something new to watch.
We could also implement changes to our system surrounding novelty and roll them out to a small group of users. A user is most likely aware of the incredibly popular movies, and if they haven't rated them yet, chances are they don't want to see them. Therefore, we don't want to suggest these movies to the users. We will put less weighting on the popular movies, and more weighting on the less popular movies, so that these appear more frequently in the recommendations. This will enhance the user experience by providing potentially more useful recommendations. We can measure how useful those recommendations are by observing their behavior on the site. If they rate the less popular movies we suggest, then we may be able to conclude that this approach is effective in meeting our business need.
Reference: 1. Suboojitha Sridharan,Introducing Serendipity In Recommender Systems Though Collaborative Methods,(2014). Open Access Master's Theses. Paper 453.http://digitalcommons.uri.edu/theses/453