The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.
For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:
• Content-Based Filtering
• User-User Collaborative Filtering
• Item-Item Collaborative Filtering
As an example of implementing a Content-Based recommender, you could build item profiles for a subset of MovieLens movies from scraping http://www.imdb.com/ or using the API at https://www.omdbapi.com/ (which has very recently instituted a small monthly fee). A more challenging method would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.
You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.
You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.
This analysis uses the MovieLense dataset that ships with recommenderlab (the classic MovieLens 100K data): roughly 100,000 ratings of 1,664 movies by 943 users.
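The exploratory output below comes from loading the MovieLense object and inspecting its class, dimensions, and structure. A sketch of the calls that would produce it, assuming only the recommenderlab package:

```r
library(recommenderlab)
data(MovieLense)

slotNames(MovieLense)   # slots of the realRatingMatrix: "data" and "normalize"
class(MovieLense)       # "realRatingMatrix" from package recommenderlab
dim(MovieLense)         # 943 users x 1664 movies
str(MovieLense)         # underlying sparse dgCMatrix representation
```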
## [1] "data" "normalize"
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
## [1] 943 1664
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
## ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. ..@ i : int [1:99392] 0 1 4 5 9 12 14 15 16 17 ...
## .. .. ..@ p : int [1:1665] 0 452 583 673 882 968 994 1386 1605 1904 ...
## .. .. ..@ Dim : int [1:2] 943 1664
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : chr [1:943] "1" "2" "3" "4" ...
## .. .. .. ..$ : chr [1:1664] "Toy Story (1995)" "GoldenEye (1995)" "Four Rooms (1995)" "Get Shorty (1995)" ...
## .. .. ..@ x : num [1:99392] 5 4 4 4 4 3 1 5 4 5 ...
## .. .. ..@ factors : list()
## ..@ normalize: NULL
ratingvalues <- as.vector(MovieLense@data)
unique(ratingvalues) # Ratings are integers from 1 to 5; 0 marks cells with no rating

## [1] 5 4 0 3 1 2
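The frequency table below (where 0 counts the empty cells of the sparse matrix) is presumably produced by a call such as:

```r
table(ratingvalues)
```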
## ratingvalues
## 0 1 2 3 4 5
## 1469760 6059 11307 27002 33947 21077
# Plot the distribution of ratings, dropping the 0s that stand for unrated entries
hist(ratingvalues[ratingvalues != 0],
     breaks = 6,
     main = "Distribution of Ratings",
     xlab = "Ratings",
     col = "pink",
     freq = TRUE
)

Convert the raw ratings data to a realRatingMatrix
# Build a user x item sparse rating matrix from the raw ratings data frame `movies`
library(Matrix)  # provides sparseMatrix()
ratingsMat <- sparseMatrix(i = movies$user, j = movies$item, x = movies$rating,
                           dims = c(length(unique(movies$user)), length(unique(movies$item))),
                           dimnames = list(paste("u", 1:length(unique(movies$user)), sep = ""),
                                           paste("m", 1:length(unique(movies$item)), sep = "")))

# Wrap the sparse matrix in recommenderlab's realRatingMatrix class
ratingsReal <- new("realRatingMatrix", data = ratingsMat)
ratingsReal

## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
I select users who have rated at least 100 movies and movies that have been rated at least 150 times, then split the filtered matrix into a training set (80% of users) and a test set (20%); a sketch of these preparation steps follows below.
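A minimal sketch of the filtering and train/test split, assuming the filtered matrix is named `ratings` and the split sets `train` and `test` (the names used in the modeling steps below); the seed and the helper variable `in_train` are illustrative:

```r
library(recommenderlab)

# Keep users with at least 100 ratings and movies rated at least 150 times
ratings <- MovieLense[rowCounts(MovieLense) >= 100,
                      colCounts(MovieLense) >= 150]

# Randomly assign roughly 80% of the remaining users to training, 20% to testing
set.seed(1)
in_train <- sample(c(TRUE, FALSE), size = nrow(ratings),
                   replace = TRUE, prob = c(0.8, 0.2))
train <- ratings[in_train, ]
test  <- ratings[!in_train, ]
```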
User-based collaborative filtering measures the similarity between users. A “Recommender” object is then built with the “UBCF” (user-based collaborative filtering) method, using center normalization, cosine similarity, and 25 nearest neighbors. But first, I will compute a similarity matrix.
The similarity() function in recommenderlab takes a realRatingMatrix and computes, for example, cosine similarities between rows or columns, which helps in exploring the data before building a model.
similarityUsers <- similarity(MovieLense[1:4, ], method = "cosine", which = "users")
image(as.matrix(similarityUsers), main = "Users similarity")

Building the user-based model with 25 nearest neighbors: the user-based collaborative filtering system recommends movies to a user based on how similar that user is to other users. A sketch of the Recommender call follows.
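The Recommender call itself does not appear in the output above; a minimal sketch of the model as described (UBCF, center normalization, cosine similarity, 25 nearest neighbors), assuming the training set `train`:

```r
usermodel <- Recommender(data = train,
                         method = "UBCF",
                         parameter = list(normalize = "center",
                                          method = "Cosine",
                                          nn = 25))
```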
usermodelprediction <- predict(object = usermodel, newdata = test, type="ratings")
as(usermodelprediction, "matrix")[, 1:4] %>% kable() %>% kable_styling(full_width = T)

| User | Toy Story (1995) | Get Shorty (1995) | Twelve Monkeys (1995) | Babe (1995) |
|---|---|---|---|---|
| 22 | 3.781925 | NA | 4.138567 | 4.021311 |
| 44 | NA | 3.782291 | NA | 4.023630 |
| 57 | NA | 3.627189 | NA | NA |
| 58 | NA | 4.108803 | NA | NA |
| 60 | 4.221428 | 3.945063 | NA | NA |
| 70 | NA | 3.793470 | 3.952004 | NA |
| 82 | NA | 3.422929 | NA | NA |
| 85 | 3.619038 | 3.626235 | 3.499749 | NA |
| 99 | NA | NA | NA | 3.871983 |
| 125 | NA | 3.731686 | 3.886966 | NA |
| 128 | NA | 3.650993 | 3.679047 | 3.793576 |
| 141 | NA | 3.810379 | NA | 4.045046 |
| 151 | NA | NA | NA | 4.137262 |
| 158 | NA | NA | NA | NA |
| 181 | NA | 2.218023 | NA | 2.823249 |
| 201 | NA | NA | NA | NA |
| 207 | 3.455511 | NA | 3.405310 | NA |
| 221 | 3.818030 | NA | NA | 3.654129 |
| 224 | 3.736582 | 3.438769 | 3.720914 | 3.593943 |
| 239 | 4.341020 | 3.808810 | 4.136243 | NA |
| 250 | NA | 3.651374 | NA | 3.902230 |
| 264 | 4.467544 | NA | NA | 4.751885 |
| 279 | NA | NA | NA | 3.717380 |
| 280 | NA | NA | NA | NA |
| 293 | NA | NA | NA | NA |
| 296 | NA | 4.130021 | NA | 4.468650 |
| 298 | NA | 4.056455 | 4.208806 | NA |
| 299 | NA | NA | NA | 3.654436 |
| 301 | NA | NA | NA | NA |
| 305 | NA | 3.305570 | NA | 3.419887 |
| 313 | NA | 3.680840 | 3.791153 | NA |
| 314 | NA | 3.994760 | NA | NA |
| 328 | 3.818877 | NA | NA | NA |
| 334 | 3.263622 | NA | NA | NA |
| 339 | NA | NA | NA | 4.164154 |
| 345 | NA | NA | 4.048610 | 4.209362 |
| 346 | 3.777548 | NA | NA | 3.867273 |
| 379 | NA | NA | NA | NA |
| 387 | NA | NA | NA | NA |
| 393 | NA | NA | NA | NA |
| 398 | NA | NA | 3.760041 | NA |
| 399 | NA | 2.777281 | 3.128807 | NA |
| 401 | NA | 2.865687 | 3.052007 | 3.014357 |
| 407 | NA | NA | NA | NA |
| 417 | NA | NA | NA | 3.789563 |
| 442 | 3.607058 | 3.372064 | NA | 3.424742 |
| 455 | NA | NA | NA | NA |
| 493 | NA | 4.044486 | NA | 4.197289 |
| 496 | 3.019448 | 3.101432 | NA | 3.207622 |
| 505 | NA | 3.155593 | NA | 3.402737 |
| 535 | NA | NA | NA | NA |
| 536 | NA | 4.033887 | 4.053168 | NA |
| 567 | NA | 3.627711 | NA | 3.922535 |
| 592 | NA | NA | NA | NA |
| 617 | 3.172960 | 2.634722 | NA | 2.645739 |
| 630 | NA | 3.412451 | NA | 3.716795 |
| 639 | 3.015679 | 2.709074 | 2.865973 | 2.869516 |
| 650 | NA | NA | NA | 3.628631 |
| 659 | 3.990956 | NA | NA | 3.882698 |
| 694 | 4.531028 | 4.175881 | 4.349192 | 4.409131 |
| 711 | 3.882092 | 3.654377 | 3.873608 | NA |
| 715 | NA | NA | NA | 3.872303 |
| 733 | NA | 2.980552 | NA | 3.257285 |
| 748 | NA | NA | NA | NA |
| 825 | 3.953589 | 3.890185 | NA | 4.216666 |
| 870 | NA | NA | NA | 3.984096 |
| 881 | NA | NA | NA | NA |
| 883 | NA | NA | NA | NA |
| 889 | NA | NA | NA | NA |
| 890 | NA | 4.071963 | NA | 4.163947 |
| 894 | NA | 3.590505 | NA | 3.727390 |
| 899 | NA | 3.656421 | 3.693090 | NA |
| 932 | NA | 4.026150 | NA | 4.079013 |
| 938 | NA | 3.262872 | NA | 3.671855 |
scheme <- evaluationScheme(ratings, method = "cross-validation",
                           k = 4,           # 4 folds
                           given = 10,      # 10 ratings per test user are given; the rest are held out
                           goodRating = 4)  # ratings >= 4 count as relevant
results <- evaluate(x = scheme, method = "UBCF", n = c(10, 25, 50, 75, 100))

## UBCF run fold/sample [model time/prediction time]
## 1 [0.01sec/0.25sec]
## 2 [0sec/0.4sec]
## 3 [0sec/0.14sec]
## 4 [0sec/0.17sec]
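The averaged confusion matrix across the four folds, shown in the table below, can be obtained with a call such as:

```r
avg(results)
```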
| n | TP | FP | FN | TN | precision | recall | TPR | FPR |
|---|---|---|---|---|---|---|---|---|
| 10 | 4.439560 | 5.56044 | 48.96703 | 131.03297 | 0.4439560 | 0.08509718 | 0.08509718 | 0.03969298 |
| 25 | 9.769231 | 15.23077 | 43.63736 | 121.36264 | 0.3907692 | 0.18431378 | 0.18431378 | 0.10943232 |
| 50 | 17.824176 | 32.17582 | 35.58242 | 104.41758 | 0.3564835 | 0.33441705 | 0.33441705 | 0.23236638 |
| 75 | 26.186813 | 48.81319 | 27.21978 | 87.78022 | 0.3491575 | 0.49445469 | 0.49445469 | 0.35363671 |
| 100 | 33.780220 | 66.21978 | 19.62637 | 70.37363 | 0.3378022 | 0.63791011 | 0.63791011 | 0.48105204 |
The TPR is the proportion of movies a user actually liked (rating of 4 or higher) that were recommended, while the FPR is the proportion of movies the user did not like that were nevertheless recommended, with n the number of recommendations (10, 25, 50, 75, 100).
ROC Curve Plot
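A sketch of the plotting calls for the curves referenced here, using recommenderlab's plot method for evaluation results:

```r
plot(results, annotate = TRUE)               # ROC curve (TPR vs. FPR)
plot(results, "prec/rec", annotate = TRUE)   # precision vs. recall
```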
Precision is the fraction of the recommended movies that the user actually liked, and recall is the fraction of the movies the user liked that were recommended.
Item-based collaborative filtering measures the similarity between items. A “Recommender” object is then built with the “IBCF” (item-based collaborative filtering) method, using center normalization, cosine similarity, and k = 250 most similar items kept per item. But first, I will compute a similarity matrix.
As before, the similarity() function takes the realRatingMatrix and computes cosine similarities, this time between items.
similarityitems <- similarity(MovieLense[, 1:4], method = "cosine", which = "items")
image(as.matrix(similarityitems), main = "Items similarity")

The diagonal is yellow because each item is being compared with itself.
Building the item-item collaborative filtering model, which recommends movies whose rating patterns are similar to those of the movies a user has already rated highly; a sketch of the Recommender call follows.
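A minimal sketch of the item-based model as described (IBCF, center normalization, cosine similarity, k = 250), again assuming the training set `train`:

```r
itemmodel <- Recommender(data = train,
                         method = "IBCF",
                         parameter = list(normalize = "center",
                                          method = "Cosine",
                                          k = 250))
```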
itemmodelprediction <- predict(object = itemmodel, newdata = test, type="ratings")
as(itemmodelprediction, "matrix")[, 1:4] %>% kable() %>% kable_styling(full_width = T)

| User | Toy Story (1995) | Get Shorty (1995) | Twelve Monkeys (1995) | Babe (1995) |
|---|---|---|---|---|
| 22 | 4.283615 | NA | 4.147702 | 4.102685 |
| 44 | NA | 3.901811 | NA | 3.850054 |
| 57 | NA | 3.878747 | NA | NA |
| 58 | NA | 4.181955 | NA | NA |
| 60 | 3.966626 | 4.110938 | NA | NA |
| 70 | NA | 3.750311 | 3.547979 | NA |
| 82 | NA | 3.106853 | NA | NA |
| 85 | 3.644796 | 3.620886 | 3.614051 | NA |
| 99 | NA | NA | NA | 3.614084 |
| 125 | NA | 3.624266 | 3.424463 | NA |
| 128 | NA | 3.740166 | 3.722324 | 3.692069 |
| 141 | NA | 3.731692 | NA | 4.409189 |
| 151 | NA | NA | NA | 4.061031 |
| 158 | NA | NA | NA | NA |
| 181 | NA | 1.993500 | NA | 2.399475 |
| 201 | NA | NA | NA | NA |
| 207 | 3.419416 | NA | 3.197713 | NA |
| 221 | 3.827299 | NA | NA | 4.011960 |
| 224 | 3.597986 | 3.671785 | 3.440014 | 3.382952 |
| 239 | 3.518356 | 4.245348 | 4.183181 | NA |
| 250 | NA | 3.699032 | NA | 3.692559 |
| 264 | 4.284696 | NA | NA | 4.286974 |
| 279 | NA | NA | NA | 3.467185 |
| 280 | NA | NA | NA | NA |
| 293 | NA | NA | NA | NA |
| 296 | NA | 3.986218 | NA | 4.514032 |
| 298 | NA | 4.174220 | 3.995286 | NA |
| 299 | NA | NA | NA | 3.886759 |
| 301 | NA | NA | NA | NA |
| 305 | NA | 3.303163 | NA | 3.673859 |
| 313 | NA | 3.985239 | 3.751861 | NA |
| 314 | NA | 4.034791 | NA | NA |
| 328 | 3.802414 | NA | NA | NA |
| 334 | 3.410970 | NA | NA | NA |
| 339 | NA | NA | NA | 4.126004 |
| 345 | NA | NA | 3.671957 | 3.942133 |
| 346 | 3.787034 | NA | NA | 3.900648 |
| 379 | NA | NA | NA | NA |
| 387 | NA | NA | NA | NA |
| 393 | NA | NA | NA | NA |
| 398 | NA | NA | 4.077610 | NA |
| 399 | NA | 3.307224 | 3.072430 | NA |
| 401 | NA | 3.315413 | 3.040033 | 3.186588 |
| 407 | NA | NA | NA | NA |
| 417 | NA | NA | NA | 3.536884 |
| 442 | 3.297734 | 3.341717 | NA | 3.457097 |
| 455 | NA | NA | NA | NA |
| 493 | NA | 3.866104 | NA | 3.987802 |
| 496 | 3.067049 | 2.859318 | NA | 3.103866 |
| 505 | NA | 3.476801 | NA | 3.799254 |
| 535 | NA | NA | NA | NA |
| 536 | NA | 4.222383 | 3.752575 | NA |
| 567 | NA | 4.048823 | NA | 3.917836 |
| 592 | NA | NA | NA | NA |
| 617 | 2.509773 | 2.866630 | NA | 3.201489 |
| 630 | NA | 3.711334 | NA | 3.562731 |
| 639 | 3.033611 | 2.554343 | 2.593819 | 3.139381 |
| 650 | NA | NA | NA | 3.695807 |
| 659 | 3.739144 | NA | NA | 3.997772 |
| 694 | 4.320639 | 4.305813 | 4.174105 | 4.320125 |
| 711 | 4.056210 | 3.968443 | 3.703726 | NA |
| 715 | NA | NA | NA | 3.547479 |
| 733 | NA | 2.907816 | NA | 3.196125 |
| 748 | NA | NA | NA | NA |
| 825 | 4.006931 | 4.026326 | NA | 4.109223 |
| 870 | NA | NA | NA | 3.706447 |
| 881 | NA | NA | NA | NA |
| 883 | NA | NA | NA | NA |
| 889 | NA | NA | NA | NA |
| 890 | NA | 4.019789 | NA | 4.144436 |
| 894 | NA | 3.859504 | NA | 3.634725 |
| 899 | NA | 3.799083 | 3.530526 | NA |
| 932 | NA | 3.959724 | NA | 4.251144 |
| 938 | NA | 2.888139 | NA | 3.363482 |
scheme <- evaluationScheme(ratings, method="cross-validation",
k = 4,
given = 10,
goodRating = 4)
results2 <- evaluate(x = scheme, method = "IBCF", n = c(10, 25, 50, 75, 100))

## IBCF run fold/sample [model time/prediction time]
## 1 [0.19sec/0.06sec]
## 2 [0.19sec/0.04sec]
## 3 [0.38sec/0.05sec]
## 4 [0.17sec/0.05sec]
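As before, the averaged confusion matrix shown below can be obtained with a call such as:

```r
avg(results2)
```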
| n | TP | FP | FN | TN | precision | recall | TPR | FPR |
|---|---|---|---|---|---|---|---|---|
| 10 | 3.153846 | 6.846154 | 46.65934 | 133.34066 | 0.3153846 | 0.06206133 | 0.06206133 | 0.04846687 |
| 25 | 7.043956 | 17.890110 | 42.76923 | 122.29670 | 0.2824523 | 0.14217212 | 0.14217212 | 0.12718102 |
| 50 | 13.351648 | 36.307692 | 36.46154 | 103.87912 | 0.2688259 | 0.26777350 | 0.26777350 | 0.25850890 |
| 75 | 19.516484 | 54.714286 | 30.29670 | 85.47253 | 0.2634008 | 0.39374833 | 0.39374833 | 0.38986228 |
| 100 | 25.703297 | 72.252747 | 24.10989 | 67.93407 | 0.2634209 | 0.51552199 | 0.51552199 | 0.51441433 |
As before, the TPR is the proportion of liked movies that were recommended and the FPR is the proportion of disliked movies that were nevertheless recommended, with n again taking the values 10, 25, 50, 75, 100.
ROC Curve Plot
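The corresponding curves for the item-based model can be drawn the same way:

```r
plot(results2, annotate = TRUE)               # ROC curve (TPR vs. FPR)
plot(results2, "prec/rec", annotate = TRUE)   # precision vs. recall
```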
Precision and recall for the user-based recommender show the usual trade-off between list length and accuracy: at n = 10 roughly 44% of the recommended movies were relevant (precision), and by n = 100 roughly 64% of the relevant movies had been recommended (recall/TPR).
For the item-based recommender, precision at n = 10 is roughly 32% and recall at n = 100 is roughly 52%. At every list length evaluated, the user-based model outperforms the item-based model on this filtered MovieLens data.
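To put both models on a single graph for this comparison, both algorithms can be evaluated against the same scheme; a sketch, reusing the parameters from the models above (the list names are just labels):

```r
algorithms <- list(
  "UBCF (cosine, nn = 25)" = list(name = "UBCF",
                                  param = list(method = "Cosine", nn = 25)),
  "IBCF (cosine, k = 250)" = list(name = "IBCF",
                                  param = list(method = "Cosine", k = 250))
)

results_all <- evaluate(scheme, algorithms, n = c(10, 25, 50, 75, 100))

plot(results_all, annotate = c(1, 2), legend = "topleft")   # ROC curves for both models
plot(results_all, "prec/rec", annotate = c(1, 2))           # precision/recall curves
```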