library(recommenderlab)
library(recommenderlabBX)
library(ggplot2)
library(dplyr)
library(tidyr)
library(data.table)
library(kableExtra)
data("BX")
For the final project for this course, I will use the material learned throughout the course to build a recommendation model on the Book-Crossing dataset. To determine the most appropriate model, several candidate algorithms will be compared on a common evaluation scheme and their key parameters tuned.
Once the most appropriate model is selected, I will run it on my own profile to judge how well the recommendations suit me personally. To do this, I rate a number of books based on my own preferences and interests and apply the recommendation model to those ratings.
The Book-Crossing dataset contains 278,858 users providing 1,149,778 ratings (explicit and implicit) of 271,379 books. This makes the dataset very sparse: most recorded ratings are 0 (implicit), and most users have rated just one book.
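As a quick check of that sparsity claim, the density of the loaded BX rating matrix can be computed directly (a minimal sketch; the packaged matrix dimensions differ somewhat from the raw CSV counts):
# Density and sparsity of the BX rating matrix
n_ratings <- length(getRatings(BX))
n_cells <- as.numeric(nrow(BX)) * ncol(BX)   # as.numeric avoids integer overflow
c(density = n_ratings / n_cells, sparsity = 1 - n_ratings / n_cells)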
data("BXBooks")
data("BX")
url <- "https://github.com/dhairavc/DATA612-RecommenderSystems/raw/master/Final%20Project/BX-CSV-Dump.zip"
temp <- tempfile()
download.file(url, temp)
books <- fread(unzip(temp, files = "BX-Books.csv"))
unlink(temp)
#Ratings Distribution
ratings <- getRatings(BX)
ggplot() + geom_histogram(aes(ratings), binwidth = 1, col="white", fill="pink") +
labs(title = "Ratings Distribution", x="Rating", y="Count") +
scale_y_continuous(labels = scales::comma)
#Distribution of books rated per user
ratings_by_user <- rowCounts(BX)
ratings_by_user <- data.frame(User = names(ratings_by_user), BooksRated = ratings_by_user)
ratings_by_user_sum <- ratings_by_user %>% group_by(BooksRated) %>% summarise(Users = n())
ggplot(ratings_by_user_sum, aes(y=BooksRated, x=Users)) + geom_point(col="purple") +
labs(title = "Books Rated per User", x = "Number of Users", y = "Books Rated")
# Users who rated the most books
tail(ratings_by_user_sum) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
BooksRated | Users |
---|---|
4785 | 1 |
5850 | 1 |
5891 | 1 |
6109 | 1 |
7550 | 1 |
13601 | 1 |
# Users who rated the fewest books
head(ratings_by_user_sum) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
BooksRated | Users |
---|---|
1 | 59166 |
2 | 12503 |
3 | 6533 |
4 | 4265 |
5 | 3099 |
6 | 2334 |
# Most rated books
rating_per_book <- colCounts(BX)
rating_per_book <- data.frame(ISBN = names(rating_per_book), Rated = rating_per_book)
most_rated <- (rating_per_book %>% arrange(desc(Rated)))[1:21,]
most_rated <- left_join(most_rated, books, by = "ISBN")
knitr::include_graphics(drop_na(most_rated)$`Image-URL-M`)
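The covers above can also be summarized as a table of titles and rating counts. A short sketch, assuming the books table contains the standard BX-Books columns (Book-Title, Book-Author):
# Titles, authors and rating counts of the most-rated books
drop_na(most_rated) %>%
  select(`Book-Title`, `Book-Author`, Rated) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)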
To allow the models to run and complete in a timely manner, the dataset was reduced to users who have rated more than 35 books and books that have more than 35 ratings. Running the models on the full dataset was limited by the processing power and memory available on my local machine, and the recommenderlab package could not handle a dataset this large.
Reduced Matrix
# Convert to Data Frame to remove books with low amount of ratings
BX_df <- getData.frame(BX)
keep_items <- BX_df %>% group_by(item) %>% summarise(Reviews = n())
keep_items <- keep_items %>% filter(Reviews > 35)
BX_df2 <- BX_df %>% filter(item %in% keep_items$item)
BX2 <- as(BX_df2, "realRatingMatrix")
# Remove Users with low amount of ratings
BX3 <- BX2[rowCounts(BX2) > 35, ]
# Original Matrix
BX
## 105283 x 340547 rating matrix of class 'realRatingMatrix' with 1149778 ratings.
# Reduced Matrix
BX3
## 1380 x 3430 rating matrix of class 'realRatingMatrix' with 145321 ratings.
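A quick sanity check on the reduced matrix (a sketch): every remaining user should have more than 35 ratings, while item counts can dip below 35 again because the user filter was applied after the item filter.
# Verify the user threshold and inspect per-book rating counts after the reduction
min(rowCounts(BX3))
summary(colCounts(BX3))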
Normalize the dataset once up front so it is not re-normalized on every model iteration, then build an evaluation scheme that splits the data into an 80/20 train/test ratio.
set.seed(12)
BX_center <- normalize(BX3, "Center")
eval_books <- evaluationScheme(data=BX_center, method="split", train=.80, given=min(rowCounts(BX_center)), goodRating = 5)
eval_books
## Evaluation scheme with 36 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 1380 x 3430 rating matrix of class 'realRatingMatrix' with 145321 ratings.
## Normalized using center on rows.
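The evaluation scheme exposes the pieces that evaluate() and predict() use internally; they can be inspected with recommenderlab's getData() accessor (shown here only as a reference sketch):
# "train"   - ratings used to fit each model
# "known"   - the `given` items per test user that are fed to predict()
# "unknown" - the held-out items used to score the recommendations
getData(eval_books, "train")
getData(eval_books, "known")
getData(eval_books, "unknown")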
For this dataset, four models will be compared.
For each model, the following parameters will be optimized:
Model | # of Recommendations | Parameter | Method |
---|---|---|---|
IBCF | 1,5,10,15,20,25 | Similar Items (k): 1,5,10,20,30,40 | Cosine, Pearson |
UBCF | 1,5,10,15,20,25 | Nearest Neighbors (nn): 1,5,10,20,30,40 | Cosine, Pearson |
SVD | 1,5,10,15,20,25 | Rank of Approximation (k): 1,5,10,20,30,40,100 | N/A |
ALS | 1,5,10,15,20,25 | Latent Factors (k): 1,5,10,20,30,40 | N/A |
For IBCF, the optimal parameters are the Pearson similarity method with k = 20 similar items.
#IBCF Cosine
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF", param = list(method = "Cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [321.64sec/0.26sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [315.85sec/0.16sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [256.98sec/0.17sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [273.65sec/0.43sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [316.98sec/0.28sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [295.94sec/0.4sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve IBCF-Cosine")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Cosine")
#IBCF Pearson
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF", param = list(method = "Pearson", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [311.13sec/0.13sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [393.81sec/0.49sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [396.39sec/0.38sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [377.71sec/0.56sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [393.42sec/0.63sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [314.27sec/1.03sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve IBCF-Pearson")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Pearson")
For UBCF, the optimal parameters are the Pearson similarity method with nn = 40 nearest neighbors.
#UBCF Cosine
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "UBCF", param = list(method = "Cosine", nn = k))
})
names(models_to_evaluate) <- paste0("UBCF_nn_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 7.21 0.05 7.47
## Error in neighbors[, x] : incorrect number of dimensions
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/11.42sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/9.22sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/8.87sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/9.19sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/9.26sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve UBCF-Cosine")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Cosine")
#UBCF Pearson
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "UBCF", param = list(method = "Pearson", nn = k))
})
names(models_to_evaluate) <- paste0("UBCF_nn_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 3.45 0.06 3.57
## Error in neighbors[, x] : incorrect number of dimensions
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.3sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.15sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.66sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/4.92sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.21sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve UBCF-Pearson")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall UBCF-Pearson")
For SVD, the optimal rank of approximation is k = 100.
#SVD
vector_k <- c(1, 5, 10, 20, 30, 40, 100)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "SVD", param = list(k = k))
})
names(models_to_evaluate) <- paste0("SVD_k_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## SVD run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0 0.01 0.02
## Error in r_a %*% model$svd$v %*% diag(1/model$svd$d) :
## non-conformable arguments
## SVD run fold/sample [model time/prediction time]
## 1 [0.58sec/0.92sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.86sec/1.13sec]
## SVD run fold/sample [model time/prediction time]
## 1 [1.21sec/1.15sec]
## SVD run fold/sample [model time/prediction time]
## 1 [2.15sec/1.01sec]
## SVD run fold/sample [model time/prediction time]
## 1 [2.72sec/1.08sec]
## SVD run fold/sample [model time/prediction time]
## 1 [7.81sec/1.14sec]
list_results$SVD_k_5@results
## [[1]]
## An object of class "confusionMatrix"
## Slot "cm":
## TP FP FN TN precision recall TPR
## 1 0.0326087 0.9673913 9.333333 3383.667 0.03260870 0.004780831 0.004780831
## 5 0.0942029 4.9057971 9.271739 3379.728 0.01884058 0.013182120 0.013182120
## 10 0.1521739 9.8478261 9.213768 3374.786 0.01521739 0.018108543 0.018108543
## 15 0.1847826 14.8152174 9.181159 3369.819 0.01231884 0.024188423 0.024188423
## 20 0.2101449 19.7898551 9.155797 3364.844 0.01050725 0.026375357 0.026375357
## 25 0.2572464 24.7427536 9.108696 3359.891 0.01028986 0.033900453 0.033900453
## FPR
## 1 0.0002858102
## 5 0.0014493981
## 10 0.0029094062
## 15 0.0043769765
## 20 0.0058467404
## 25 0.0073101174
##
## Slot "model":
## NULL
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve SVD")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall SVD")
For ALS, the optimal number of latent factors is 40.
#ALS
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "ALS", param = list(n_factors = k))
})
names(models_to_evaluate) <- paste0("ALS_n_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/122.71sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/102.27sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0.02sec/107.26sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/106.64sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/109.53sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/115.61sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve ALS")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall ALS")
The optimal parameters for each model are listed below.
Model | # of Recommendations | Parameter | Method |
---|---|---|---|
IBCF | 1,5,10,15,20,25 | Similar Items (k): 20 | Pearson |
UBCF | 1,5,10,15,20,25 | Nearest Neighbors (nn): 40 | Pearson |
SVD | 1,5,10,15,20,25 | Rank of Approximation (k): 100 | N/A |
ALS | 1,5,10,15,20,25 | Latent Factors (k): 40 | N/A |
n_recommendations <- c(1, 5, 10, 15, 20, 25)
recc_algos <- list(
IBCF = list(name = "IBCF", param = list(method = "Pearson", k=20)),
UBCF = list(name = "UBCF", param = list(method = "Pearson", nn=40)),
SVD = list(name = "SVD", param = list(k=100)),
ALS = list(name = "ALS", param = list(n_factors=40)),
random = list(name = "RANDOM", param=NULL)
)
recommendation_results <- evaluate(eval_books, method = recc_algos, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [373.15sec/0.17sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/4.27sec]
## SVD run fold/sample [model time/prediction time]
## 1 [6.37sec/1.02sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/154.08sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/1.02sec]
plot(recommendation_results, annotate = TRUE, legend = "topleft")
title("ROC curve - Model Comparisons")
plot(recommendation_results, "prec/rec", annotate = TRUE, legend = "bottomright")
title("Precision-recall - Model Comparisons")
To test the recommender models, I personally rated 30 books that I have read and inserted those ratings (as user 99999) alongside the reduced Book-Crossing dataset. My profile is placed in the test set so that recommendations can be generated for it, and the top-10 recommendations from the two best-performing models, SVD and UBCF, are shown afterwards.
Below are the 30 books I rated; these ratings are fed into the SVD and UBCF models.
url2 <- "https://github.com/dhairavc/DATA612-RecommenderSystems/raw/master/Final%20Project/titles_myrating3.csv"
my_ratings <- read.csv(url2, stringsAsFactors = FALSE)[,1:3]
head(drop_na(my_ratings))
## user item rating
## 1 99999 0099993805 6
## 2 99999 0345339711 10
## 3 99999 0345339738 10
## 4 99999 0385504209 10
## 5 99999 0553296981 4
## 6 99999 0618002227 10
# Insert my personal ratings into reduced dataset
temp_df <- getData.frame(BX3)
user <- c(as.character(temp_df$user), as.character(my_ratings$user))
item <- c(as.character(temp_df$item), as.character(my_ratings$item))
rating <- c(temp_df$rating, my_ratings$rating)
BX3_v2<- data.frame(user, item, rating) %>% drop_na()
BX3_v3 <- as(BX3_v2, "realRatingMatrix")
# Create training and test sets, keeping my personal ratings (user 99999) in the test set
split_data <- sample(x = c(TRUE, FALSE), size = nrow(BX3_v3) - 1, replace = TRUE, prob = c(0.8, 0.2))
split2 <- append(split_data, FALSE)   # the last row (assumed to be user 99999) is kept out of training
train_m <- BX3_v3[split2, ]
test_m <- BX3_v3[!split2, ]
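A quick check (sketch) that the personal profile actually landed in the test set, since the prediction code below pulls the recommendations for the last test user:
"99999" %in% rownames(test_m)          # the profile is present in the test set
tail(rownames(test_m), 1) == "99999"   # and it is the last row, as assumed below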
images_my <- left_join(my_ratings, books, by = c("item" = "ISBN"))
knitr::include_graphics(images_my$`Image-URL-M`)
The SVD model recommended some books, such as Harry Potter and The Lord of the Rings, that I have already rated, but under a different ISBN.
my_model_SVD <- Recommender(data = train_m, method = "SVD", param = list(k=100))
results_SVD <- predict(object = my_model_SVD, newdata = test_m, n=10)
my_recommendations_SVD <- results_SVD@itemLabels[results_SVD@items[[length(results_SVD@items)]]]
images_svd <- books %>% filter(ISBN %in% my_recommendations_SVD) %>% select(`Image-URL-M`)
knitr::include_graphics(images_svd$`Image-URL-M`)
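Since the same title can appear under several ISBNs, one possible refinement (a sketch, assuming the books metadata loaded earlier) is to drop recommended ISBNs whose titles I have already rated:
# Remove recommended ISBNs whose titles were already rated under another ISBN
rated_titles <- books %>% filter(ISBN %in% my_ratings$item) %>% pull(`Book-Title`)
new_recs_SVD <- books %>%
  filter(ISBN %in% my_recommendations_SVD, !(`Book-Title` %in% rated_titles)) %>%
  pull(ISBN)
new_recs_SVD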
The UBCF model gives me some recommendations that are completely outside the categories I rated, such as the Ya-Ya Sisterhood. Since this is user-based filtering, I suppose I was matched with users who had rated these books.
my_model_UBCF <- Recommender(data = train_m, method = "UBCF", param = list(method="Pearson", nn=40))
results_UBCF <- predict(object = my_model_UBCF, newdata = test_m, n=10)
my_recommendations_UBCF <- results_UBCF@itemLabels[results_UBCF@items[[length(results_UBCF@items)]]]
images_UBCF <- books %>% filter(ISBN %in% my_recommendations_UBCF) %>% select(`Image-URL-M`)
knitr::include_graphics(images_UBCF$`Image-URL-M`)
In summary, several models were compared on a reduced dataset, and the two best (SVD and UBCF) were used to generate recommendations from my personal ratings. Neither set of recommendations was particularly good: even the best model had a very low True Positive Rate, and no model achieved a TPR above 0.08. Two important criteria for comparing the models are precision and recall:
Precision is the ratio of instances where the model correctly predicted a good match to the total number of matches the model predicted.
Recall (sensitivity) is the ratio of instances where the model correctly predicted a good match to all the matches the user would actually have liked.
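The two measures can be written directly in terms of the confusion-matrix counts. Below is an illustrative sketch with hypothetical per-user counts (recommenderlab reports these values averaged over test users):
# TP = recommended items the user liked, FP = recommended items the user did not like,
# FN = liked items that were never recommended
precision <- function(TP, FP) TP / (TP + FP)
recall <- function(TP, FN) TP / (TP + FN)
precision(TP = 2, FP = 8)   # 0.2: 2 of the 10 recommendations were good
recall(TP = 2, FN = 3)      # 0.4: 2 of the 5 books the user liked were surfaced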
Looking at the confusion matrix values, there is a trade-off between precision and recall. For recommending items, a Type I error (a false positive) is more problematic than a Type II error (a missed item), because precision reflects how much a recommendation can be trusted. If the model has low precision and its recommendations are not trusted, then faith in the entire system is diminished.
In both models, the recommendations cannot be entirely trusted given the low precision. In practice, some of the recommended books are ones I would have no motivation to read, although not all of the SVD recommendations miss the mark. In conclusion, the models need further refinement or need to be trained on a larger dataset, since a significant portion of the data was discarded to keep run times manageable and to stay within recommenderlab's limitations.