The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.
This project builds two recommender systems, User Based Collaborative Filtering (UBCF) and Item Based Collaborative Filtering (IBCF), using data from the MovieLense dataset.
The project closely follows the tutorial in chapters 3 and 4 of the text, Building a Recommendation System with R by Suresh K. Gorakala and Michele Usuelli.
I began by loading the MovieLense rating matrix, which ships with the recommenderlab R package.
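The libraries and data were loaded roughly as follows (a minimal sketch; ggplot2 is an assumption on my part, since qplot and ggplot calls appear throughout):
library(recommenderlab)  # recommender algorithms plus the MovieLense data
library(ggplot2)         # plotting
data(MovieLense)
MovieLense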
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
With the imported data, I selected the most active users and most-rated movies into an object called "ratings_movies". This object consists of the users who have rated more than 100 movies and the movies that have been rated by more than 200 users.
ratings_movies <- MovieLense[rowCounts(MovieLense) > 100, colCounts(MovieLense) > 200]
ratings_movies
## 358 x 116 rating matrix of class 'realRatingMatrix' with 22112 ratings.
In the code blocks below, I computed the similarity between the first four users, and then between the first four items, of the ratings_movies object using the Pearson similarity method.
similarity_users <- similarity(ratings_movies[1:4, ], method ="pearson", which = "users")
as.matrix(similarity_users)
## 1 5 6 7
## 1 0.0000000 0.5878441 0.5833309 0.5243777
## 5 0.5878441 0.0000000 0.5234117 0.5091642
## 6 0.5833309 0.5234117 0.0000000 0.6123778
## 7 0.5243777 0.5091642 0.6123778 0.0000000
similarity_items <- similarity(ratings_movies[, 1:4], method ="pearson", which = "items")
as.matrix(similarity_items)
## Toy Story (1995) Get Shorty (1995) Twelve Monkeys (1995)
## Toy Story (1995) 0.0000000 0.5234734 0.5653624
## Get Shorty (1995) 0.5234734 0.0000000 0.5377089
## Twelve Monkeys (1995) 0.5653624 0.5377089 0.0000000
## Babe (1995) 0.5531670 0.6005361 0.5566206
## Babe (1995)
## Toy Story (1995) 0.5531670
## Get Shorty (1995) 0.6005361
## Twelve Monkeys (1995) 0.5566206
## Babe (1995) 0.0000000
The visualization below shows the frequency distribution of ratings after removing the missing (zero-valued) entries. We see that "4" is the most common rating on a scale of 1 to 5.
vector_ratings <- as.vector(ratings_movies@data)
vector_ratings <- factor(vector_ratings[vector_ratings !=0])
vector_ratings_df <- as.data.frame(vector_ratings)
qplot(vector_ratings) + ggtitle("Distribution of the ratings")
views_per_movie <- colCounts(ratings_movies)
views_per_movie_df <- as.data.frame(views_per_movie)
views_per_movie_df <- cbind(movies = rownames(views_per_movie_df), views_per_movie_df)
rownames(views_per_movie_df) <- 1:nrow(views_per_movie_df)
views_per_movie_df <- views_per_movie_df[order(views_per_movie_df$views_per_movie, decreasing = TRUE), ]
head(views_per_movie_df, 5)
## movies views_per_movie
## 12 Star Wars (1977) 325
## 43 Raiders of the Lost Ark (1981) 304
## 48 Return of the Jedi (1983) 298
## 61 Back to the Future (1985) 276
## 26 Fargo (1996) 271
ggplot(views_per_movie_df[1:5, ], aes(x=movies, y=views_per_movie)) +
geom_bar(stat = "identity", colour = "green", fill = "yellow") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
In the following sections, we see a heatmap for the top 5% of users and movies (by number of ratings), the average ratings per user, and the average movie ratings.
min_movies <- quantile(rowCounts(ratings_movies), 0.95)
min_users <- quantile(colCounts(ratings_movies), 0.95)
image(ratings_movies[rowCounts(ratings_movies) > min_movies,
colCounts(ratings_movies) > min_users], main = "Heatmap of the top users and movies")
average_ratings_per_user <- rowMeans(ratings_movies)
p1 <- ggplot() + aes(average_ratings_per_user)+ geom_histogram(binwidth=.05, colour="blue", fill="orange")
p1
## Average Movie Ratings
average_movie_ratings<- colMeans(ratings_movies)
p2 <- ggplot() + aes(average_movie_ratings)+ geom_histogram(binwidth=.05, colour="yellow", fill="blue")
p2
I used two different methods for splitting the data. First, I manually created test and train sets from ratings_movies. Second, I used k-fold splitting to create 8 folds: each fold is held out in turn for testing while the model is trained on the remaining folds, and the accuracy is averaged across the 8 runs.
# Wrapper around recommenderlab's evaluationScheme for creating k-fold cross-validation sets
k_fold_split <- function(data, method, n_fold, items_to_keep, rating_threshold){
return(evaluationScheme(data = data, method = method, k = n_fold, given = items_to_keep, goodRating = rating_threshold))
}
## [1] 308 308 308 308 308 308 308 308
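For reference, the two schemes can be created along these lines (a sketch; the values of items_to_keep and rating_threshold are assumptions, since their definitions do not appear in the output above):
# Method 1: manual 80/20 split into train and test sets
which_train <- sample(c(TRUE, FALSE), size = nrow(ratings_movies), replace = TRUE, prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
# Method 2: 8-fold cross-validation via the wrapper above
eval_sets <- k_fold_split(data = ratings_movies, method = "cross-validation",
                          n_fold = 8, items_to_keep = 15, rating_threshold = 3)  # assumed parameter values
sapply(eval_sets@runsTrain, length)  # fold sizes, as printed above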
The UBCF model was built on the training set, recc_data_train.
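A minimal sketch of the training call, assuming the package defaults for all UBCF parameters:
UBCF_recc_model <- Recommender(data = recc_data_train, method = "UBCF")
UBCF_recc_model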
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 291 users.
The model recommends 5 movies to each user.
n_recommended <- 5
recc_predicted <- predict(object = UBCF_recc_model, newdata = recc_data_test, n = n_recommended)
UBCF_recc_matrix <- sapply(recc_predicted@items, function(x){
  colnames(ratings_movies)[x]
})
We can see the slots of the recc_predicted object as well as the predicted ratings of the recommended films for each user.
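These were inspected along these lines (a sketch; slotNames comes from base R's methods package, and recc_predicted is recommenderlab's topNList object):
slotNames(recc_predicted)
recc_predicted@ratings[1:5]  # predicted ratings of the top recommendations for the first 5 users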
## [1] "items" "ratings" "itemLabels" "n"
## $`7`
## [1] 4.534113 4.459819 4.394825 4.370785 4.254145
##
## $`10`
## [1] 4.721341 4.633436 4.629731 4.610652 4.602927
##
## $`15`
## [1] 3.389964 3.358901 3.347940 3.278984 3.276877
##
## $`21`
## [1] 4.062008 3.929971 3.891427 3.821683 3.814816
##
## $`42`
## [1] 4.463283 4.403875 4.367245 4.363597 4.359057
In the table below, we see the names of the five recommended movies for the first five test users.
| User 7 | User 10 | User 15 | User 21 | User 42 |
|---|---|---|---|---|
| Titanic (1997) | Princess Bride, The (1987) | Silence of the Lambs, The (1991) | Usual Suspects, The (1995) | Braveheart (1995) |
| Trainspotting (1996) | Boot, Das (1981) | Empire Strikes Back, The (1980) | Shawshank Redemption, The (1994) | Titanic (1997) |
| L.A. Confidential (1997) | To Kill a Mockingbird (1962) | Casablanca (1942) | Schindler’s List (1993) | Conspiracy Theory (1997) |
| Leaving Las Vegas (1995) | Schindler’s List (1993) | Usual Suspects, The (1995) | Clockwork Orange, A (1971) | Contact (1997) |
| Game, The (1997) | Titanic (1997) | Braveheart (1995) | Apocalypse Now (1979) | Air Force One (1997) |
number_of_items <- factor(table(UBCF_recc_matrix))
chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) + ggtitle(chart_title)
For method two, I built the UBCF model using the k-fold split data. The custom function, build_UBCF_Model, creates the UBCF recommender model.
build_UBCF_Model <- function(data, model_to_evaluate, model_parameters ){
return( Recommender(data = data, method = model_to_evaluate, parameter = model_parameters))
}
The data used comes from the evaluation sets. We see that UBCF_eval_recommender is a Recommender object from the recommenderlab package.
data <- getData(eval_sets, "train")
model_to_evaluate <- "UBCF"
model_parameters <- NULL
UBCF_eval_recommender <- build_UBCF_Model(data, model_to_evaluate, model_parameters)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
We see that the data is normalized using the "center" method.
## [1] "center"
I calculated the RMSE, MSE, and MAE, both per user and for the model as a whole.
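The predictions being scored here were generated roughly as follows (a sketch; the value of items_to_recommend is an assumption, as its definition does not appear in the output above):
items_to_recommend <- 10  # assumed value
UBCF_eval_prediction <- predict(object = UBCF_eval_recommender, newdata = getData(eval_sets, "known"),
n = items_to_recommend, type = "ratings")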
UBCF_eval_accuracy_users <- calcPredictionAccuracy(x = UBCF_eval_prediction, data = getData(eval_sets, "unknown"), byUser =TRUE)
UBCF_eval_accuracy <- calcPredictionAccuracy(x = UBCF_eval_prediction, data = getData(eval_sets, "unknown"), byUser = FALSE)
## RMSE MSE MAE
## 7 0.9259111 0.8573114 0.7428040
## 11 0.8982434 0.8068411 0.7330236
## 85 0.8311870 0.6908718 0.6980006
## 95 0.9936766 0.9873931 0.7961966
## 125 1.6367242 2.6788661 1.1488154
## 141 1.1074558 1.2264583 0.7645077
## RMSE MSE MAE
## 0.9906267 0.9813413 0.7753522
To further evaluate the UBCF model, I used the evaluate function, which tests and validates across the 8 k-fold sets with the number of recommendations ranging from 10 to 100 movies in steps of 10.
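A sketch of the call, assuming the 8-fold eval_sets scheme from above (avg averages the per-fold confusion matrices):
results <- evaluate(x = eval_sets, method = "UBCF", n = seq(10, 100, 10))
avg(results)  # the averaged confusion matrix shown below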
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.08sec]
## 2 [0sec/0.03sec]
## 3 [0sec/0.03sec]
## 4 [0sec/0.05sec]
## 5 [0sec/0.03sec]
## 6 [0sec/0.03sec]
## 7 [0sec/0.03sec]
## 8 [0sec/0.03sec]
## TP FP FN TN precision recall TPR FPR
## 10 4.56 5.44 38.80 57.20 0.4560000 0.09774008 0.09774008 0.0850141
## 20 8.82 11.18 34.54 51.46 0.4410000 0.19333183 0.19333183 0.1750696
## 30 13.50 16.50 29.86 46.14 0.4500000 0.30022102 0.30022102 0.2583568
## 40 17.86 22.14 25.50 40.50 0.4465000 0.39704363 0.39704363 0.3461648
## 50 22.06 27.94 21.30 34.70 0.4412000 0.49225501 0.49225501 0.4365857
## 60 26.42 33.58 16.94 29.06 0.4403333 0.59318206 0.59318206 0.5264497
## 70 30.64 39.36 12.72 23.28 0.4377143 0.68613332 0.68613332 0.6169256
## 80 34.54 45.46 8.82 17.18 0.4317500 0.78187313 0.78187313 0.7167004
## 90 38.40 51.60 4.96 11.04 0.4266667 0.88092366 0.88092366 0.8169180
## 100 41.76 58.24 1.60 4.40 0.4176000 0.96194529 0.96194529 0.9260601
columns_to_sum <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, columns_to_sum]
indices_summed
## TP FP FN TN
## 10 38.54 41.46 322.32 445.68
## 20 76.04 83.96 284.82 403.18
## 30 112.80 127.20 248.06 359.94
## 40 148.78 171.22 212.08 315.92
## 50 185.44 214.56 175.42 272.58
## 60 221.76 258.24 139.10 228.90
## 70 256.96 303.04 103.90 184.10
## 80 289.00 351.00 71.86 136.14
## 90 320.40 399.60 40.46 87.54
## 100 346.60 453.40 14.26 33.74
The ROC curve shows the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the number of True Positives (TP) divided by the sum of True Positives and False Negatives, TP / (TP + FN); it measures recommended items that were rated by the user. The FPR is the number of False Positives (FP) divided by the sum of False Positives and True Negatives, FP / (FP + TN); it measures recommendations that were not rated by the user.
Precision is the percentage of recommended items that have been rated: the number of TP divided by the sum of (TP + FP). Recall is the percentage of user-rated movies that have been recommended: the number of TP divided by the sum of (TP + FN).
We can see clearly in the chart below that as the number of recommended movies increases, Precision falls but Recall increases.
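Both charts come from recommenderlab's plot method for evaluation results (a sketch):
plot(results, annotate = TRUE, main = "ROC curve")                     # TPR vs. FPR
plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")  # precision vs. recall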
For the Item Based Collaborative Filtering, I followed the exact same steps as for the User Based Collaborative Filtering (UBCF).
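The training call mirrors the UBCF one (a minimal sketch, assuming the package defaults):
IBCF_recc_model <- Recommender(data = recc_data_train, method = "IBCF")
IBCF_recc_model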
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 291 users.
n_recommended <- 5
recc_predicted <- predict(object = IBCF_recc_model, newdata = recc_data_test, n = n_recommended)
IBCF_recc_matrix <- sapply(recc_predicted@items, function(x){
colnames(ratings_movies)[x]
})
| User 7 | User 10 | User 15 | User 21 | User 42 |
|---|---|---|---|---|
| In & Out (1997) | Godfather: Part II, The (1974) | Groundhog Day (1993) | Graduate, The (1967) | Fish Called Wanda, A (1988) |
| Men in Black (1997) | Truth About Cats & Dogs, The (1996) | Rock, The (1996) | People vs. Larry Flynt, The (1996) | Get Shorty (1995) |
| Leaving Las Vegas (1995) | Star Trek: First Contact (1996) | Scream (1996) | Jerry Maguire (1996) | Murder at 1600 (1997) |
| Ransom (1996) | Birdcage, The (1996) | Leaving Las Vegas (1995) | Sound of Music, The (1965) | Blues Brothers, The (1980) |
| Evita (1996) | Mr. Holland’s Opus (1995) | Stand by Me (1986) | Indiana Jones and the Last Crusade (1989) | English Patient, The (1996) |
number_of_items <- factor(table(IBCF_recc_matrix))
chart_title <- "Distribution of the number of items for IBCF"
qplot(number_of_items) + ggtitle(chart_title)
build_IBCF_Model <- function(data, model_to_evaluate, model_parameters){
return( Recommender(data = data, method = model_to_evaluate, parameter = model_parameters))
}
data <- getData(eval_sets, "train")
model_to_evaluate <- "IBCF"
model_parameters <- NULL
IBCF_eval_recommender <- build_IBCF_Model(data, model_to_evaluate, model_parameters)
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 308 users.
IBCF_eval_prediction <- predict(object = IBCF_eval_recommender, newdata = getData(eval_sets, "known"),
n = items_to_recommend, type = "ratings")
class(IBCF_eval_prediction)
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
IBCF_eval_accuracy_users <- calcPredictionAccuracy(x = IBCF_eval_prediction, data = getData(eval_sets, "unknown"),
byUser = TRUE)
IBCF_eval_accuracy <- calcPredictionAccuracy(x = IBCF_eval_prediction, data = getData(eval_sets, "unknown"),
byUser = FALSE)
## RMSE MSE MAE
## 7 1.155787 1.335844 0.9632382
## 11 1.202448 1.445880 0.9497916
## 85 1.069236 1.143266 0.8353018
## 95 1.349284 1.820566 0.9992662
## 125 1.693138 2.866715 1.1802906
## 141 1.184097 1.402085 0.8943180
qplot(IBCF_eval_accuracy_users[, "RMSE"]) + geom_histogram(binwidth = 0.125) +
ggtitle("Distribution of the RMSE by user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## RMSE MSE MAE
## 1.2377853 1.5321124 0.9417777
## IBCF run fold/sample [model time/prediction time]
## 1 [0.03sec/0.02sec]
## 2 [0.03sec/0.02sec]
## 3 [0.03sec/0.02sec]
## 4 [0.03sec/0.02sec]
## 5 [0.03sec/0.02sec]
## 6 [0.03sec/0.02sec]
## 7 [0.05sec/0sec]
## 8 [0.03sec/0sec]
## TP FP FN TN precision recall TPR FPR
## 10 4.52 5.48 38.84 57.16 0.4520000 0.09930167 0.09930167 0.08648771
## 20 8.48 11.52 34.88 51.12 0.4240000 0.18888667 0.18888667 0.18435350
## 30 11.92 18.08 31.44 44.56 0.3973333 0.26377460 0.26377460 0.29071072
## 40 15.98 24.02 27.38 38.62 0.3995000 0.35537959 0.35537959 0.38629078
## 50 19.96 30.00 23.40 32.64 0.3994667 0.44639048 0.44639048 0.48208597
## 60 24.04 35.60 19.32 27.04 0.4026667 0.54149430 0.54149430 0.57206924
So which model, the UBCF or the IBCF, performed better? We see that the two models recommended entirely different sets of movies to the first five users.
UBCF recommendations:
| User 7 | User 10 | User 15 | User 21 | User 42 |
|---|---|---|---|---|
| Titanic (1997) | Princess Bride, The (1987) | Silence of the Lambs, The (1991) | Usual Suspects, The (1995) | Braveheart (1995) |
| Trainspotting (1996) | Boot, Das (1981) | Empire Strikes Back, The (1980) | Shawshank Redemption, The (1994) | Titanic (1997) |
| L.A. Confidential (1997) | To Kill a Mockingbird (1962) | Casablanca (1942) | Schindler’s List (1993) | Conspiracy Theory (1997) |
| Leaving Las Vegas (1995) | Schindler’s List (1993) | Usual Suspects, The (1995) | Clockwork Orange, A (1971) | Contact (1997) |
| Game, The (1997) | Titanic (1997) | Braveheart (1995) | Apocalypse Now (1979) | Air Force One (1997) |
IBCF recommendations:
| User 7 | User 10 | User 15 | User 21 | User 42 |
|---|---|---|---|---|
| In & Out (1997) | Godfather: Part II, The (1974) | Groundhog Day (1993) | Graduate, The (1967) | Fish Called Wanda, A (1988) |
| Men in Black (1997) | Truth About Cats & Dogs, The (1996) | Rock, The (1996) | People vs. Larry Flynt, The (1996) | Get Shorty (1995) |
| Leaving Las Vegas (1995) | Star Trek: First Contact (1996) | Scream (1996) | Jerry Maguire (1996) | Murder at 1600 (1997) |
| Ransom (1996) | Birdcage, The (1996) | Leaving Las Vegas (1995) | Sound of Music, The (1965) | Blues Brothers, The (1980) |
| Evita (1996) | Mr. Holland’s Opus (1995) | Stand by Me (1986) | Indiana Jones and the Last Crusade (1989) | English Patient, The (1996) |
To evaluate the two models, I created a list of UBCF and IBCF models with different similarity parameters (pearson and cosine), plus a random recommender as a baseline. This list was fed into the evaluate function, and the resulting ROC and precision-recall visualizations both show that UBCF using the "pearson" method had the largest area under the curve.
In conclusion, this project has shown that if we want our model to be more precise with its recommendations, we would choose the UBCF-pearson model and recommend five or fewer movies. Once the number of recommendations reaches 50 or more, the UBCF-pearson model no longer outperforms the other models.
models_to_evaluate <- list(IBCF_cos = list(name = "IBCF", param = list(method ="cosine")),
IBCF_cor = list(name = "IBCF", param = list(method ="pearson")),
UBCF_cos = list(name = "UBCF", param = list(method ="cosine")),
UBCF_cor = list(name = "UBCF", param = list(method ="pearson")),
random = list(name = "RANDOM", param=NULL)
)
n_recommendations <- c(1, 5, seq(10, 100, 10))
## IBCF run fold/sample [model time/prediction time]
## 1 [0.03sec/0.02sec]
## 2 [0.13sec/0.01sec]
## 3 [0.03sec/0sec]
## 4 [0.01sec/0.02sec]
## 5 [0.03sec/0sec]
## 6 [0.01sec/0.02sec]
## 7 [0.02sec/0.01sec]
## 8 [0.04sec/0.02sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.03sec/0.02sec]
## 2 [0.05sec/0.18sec]
## 3 [0.03sec/0.02sec]
## 4 [0.04sec/0.02sec]
## 5 [0.07sec/0.01sec]
## 6 [0.03sec/0sec]
## 7 [0.03sec/0sec]
## 8 [0.03sec/0.02sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.03sec]
## 2 [0sec/0.03sec]
## 3 [0sec/0.04sec]
## 4 [0sec/0.03sec]
## 5 [0sec/0.03sec]
## 6 [0.02sec/0.14sec]
## 7 [0sec/0.03sec]
## 8 [0sec/0.03sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.04sec]
## 2 [0sec/0.03sec]
## 3 [0sec/0.03sec]
## 4 [0sec/0.03sec]
## 5 [0sec/0.03sec]
## 6 [0sec/0.03sec]
## 7 [0sec/0.03sec]
## 8 [0sec/0.03sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.02sec]
## 2 [0sec/0.01sec]
## 3 [0sec/0.01sec]
## 4 [0sec/0sec]
## 5 [0sec/0sec]
## 6 [0sec/0sec]
## 7 [0sec/0.02sec]
## 8 [0sec/0sec]
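For reference, the model comparison and the charts described above can be produced along these lines (a sketch; list_results is a name I introduce here, and the plot calls use recommenderlab's plot method for evaluation result lists):
list_results <- evaluate(x = eval_sets, method = models_to_evaluate, n = n_recommendations)
plot(list_results, annotate = 1, legend = "topleft"); title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright"); title("Precision-recall")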