library(recommenderlab)
library(recommenderlabBX)
library(ggplot2)
library(dplyr)
library(tidyr)
library(data.table)
library(kableExtra)
data("BX")
For the final project for this course, I will use the material learned throughout the course to build a recommendation model on the Book-Crossing dataset. To determine the most appropriate model, several candidate algorithms will be compared on a common evaluation scheme and their key parameters tuned.
Once the most appropriate model is selected, I will run it on my own profile to judge how well the recommendations suit me personally. To do this, I rate a number of books based on my own preferences and interests and apply the recommendation model to those ratings.
The Book-Crossing dataset contains 278,858 users providing 1,149,778 ratings (explicit and implicit) of 271,379 books. This makes the dataset very sparse: most recorded ratings are 0 (implicit), and most users have rated just one book.
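As a quick check of that sparsity claim, the density of the loaded BX rating matrix can be computed directly (a minimal sketch; the packaged matrix dimensions differ somewhat from the raw CSV counts):
# Density and sparsity of the BX rating matrix
n_ratings <- length(getRatings(BX))
n_cells <- as.numeric(nrow(BX)) * ncol(BX)   # as.numeric avoids integer overflow
c(density = n_ratings / n_cells, sparsity = 1 - n_ratings / n_cells)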
data("BXBooks")
data("BX")
url <- "https://github.com/dhairavc/DATA612-RecommenderSystems/raw/master/Final%20Project/BX-CSV-Dump.zip"
temp <- tempfile()
download.file(url, temp)
books <- fread(unzip(temp, files = "BX-Books.csv"))
unlink(temp)
#Ratings Distribution
ratings <- getRatings(BX)
ggplot() + geom_histogram(aes(ratings), binwidth = 1, col="white", fill="pink") +
labs(title = "Ratings Distribution", x="Rating", y="Count") +
scale_y_continuous(labels = scales::comma)
#Distribution of books rated per user
ratings_by_user <- rowCounts(BX)
ratings_by_user <- data.frame(User = names(ratings_by_user), BooksRated = ratings_by_user)
ratings_by_user_sum <- ratings_by_user %>% group_by(BooksRated) %>% summarise(Users = n())
ggplot(ratings_by_user_sum, aes(y=BooksRated, x=Users)) + geom_point(col="purple") +
labs(title = "Books Rated per User", x = "Number of Users", y = "Books Rated")
# Users who rated the most books
tail(ratings_by_user_sum) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
BooksRated | Users |
---|---|
4785 | 1 |
5850 | 1 |
5891 | 1 |
6109 | 1 |
7550 | 1 |
13601 | 1 |
# Users who rated the fewest books
head(ratings_by_user_sum) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
BooksRated | Users |
---|---|
1 | 59166 |
2 | 12503 |
3 | 6533 |
4 | 4265 |
5 | 3099 |
6 | 2334 |
# Most rated books
rating_per_book <- colCounts(BX)
rating_per_book <- data.frame(ISBN = names(rating_per_book), Rated = rating_per_book)
most_rated <- (rating_per_book %>% arrange(desc(Rated)))[1:21,]
most_rated <- left_join(most_rated, books, by = "ISBN")
knitr::include_graphics(drop_na(most_rated)$`Image-URL-M`)
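The covers above can also be summarized as a table of titles and rating counts. A short sketch, assuming the books table contains the standard BX-Books columns (Book-Title, Book-Author):
# Titles, authors and rating counts of the most-rated books
drop_na(most_rated) %>%
  select(`Book-Title`, `Book-Author`, Rated) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)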
To allow the models to run and complete in a timely manner, the dataset was reduced to users who have rated more than 35 books and books that have more than 35 ratings. Running the models on the full dataset was limited by the processing power and memory available on my local machine, and the recommenderlab package could not handle a dataset this large.
Reduced Matrix
# Convert to Data Frame to remove books with low amount of ratings
BX_df <- getData.frame(BX)
keep_items <- BX_df %>% group_by(item) %>% summarise(Reviews = n())
keep_items <- keep_items %>% filter(Reviews > 35)
BX_df2 <- BX_df %>% filter(item %in% keep_items$item)
BX2 <- as(BX_df2, "realRatingMatrix")
# Remove Users with low amount of ratings
BX3 <- BX2[rowCounts(BX2) > 35, ]
# Original Matrix
BX
## 105283 x 340547 rating matrix of class 'realRatingMatrix' with 1149778 ratings.
# Reduced Matrix
BX3
## 1380 x 3430 rating matrix of class 'realRatingMatrix' with 145321 ratings.
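A quick sanity check on the reduced matrix (a sketch): every remaining user should have more than 35 ratings, while item counts can dip below 35 again because the user filter was applied after the item filter.
# Verify the user threshold and inspect per-book rating counts after the reduction
min(rowCounts(BX3))
summary(colCounts(BX3))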
Normalize the dataset once up front so it is not re-normalized on every model iteration, then build an evaluation scheme that splits the data into an 80/20 train/test ratio.
set.seed(12)
BX_center <- normalize(BX3, "Center")
eval_books <- evaluationScheme(data=BX_center, method="split", train=.80, given=min(rowCounts(BX_center)), goodRating = 5)
eval_books
## Evaluation scheme with 36 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 1380 x 3430 rating matrix of class 'realRatingMatrix' with 145321 ratings.
## Normalized using center on rows.
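The evaluation scheme exposes the pieces that evaluate() and predict() use internally; they can be inspected with recommenderlab's getData() accessor (shown here only as a reference sketch):
# "train"   - ratings used to fit each model
# "known"   - the `given` items per test user that are fed to predict()
# "unknown" - the held-out items used to score the recommendations
getData(eval_books, "train")
getData(eval_books, "known")
getData(eval_books, "unknown")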
For this dataset, four models will be compared.
For each model, the following parameters will be optimized:
Model | # of Recommendations | Parameter | Method |
---|---|---|---|
IBCF | 1,5,10,15,20,25 | Similar Items (k): 1,5,10,20,30,40 | Cosine, Pearson |
UBCF | 1,5,10,15,20,25 | Nearest Neighbors (nn): 1,5,10,20,30,40 | Cosine, Pearson |
SVD | 1,5,10,15,20,25 | Rank of Approximation (k): 1,5,10,20,30,40,100 | N/A |
ALS | 1,5,10,15,20,25 | Latent Factors (k): 1,5,10,20,30,40 | N/A |
For IBCF, the optimal parameters are the Pearson similarity method with k = 20 similar items.
#IBCF Cosine
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF", param = list(method = "Cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [321.64sec/0.26sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [315.85sec/0.16sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [256.98sec/0.17sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [273.65sec/0.43sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [316.98sec/0.28sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [295.94sec/0.4sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve IBCF-Cosine")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Cosine")
#IBCF Pearson
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF", param = list(method = "Pearson", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [311.13sec/0.13sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [393.81sec/0.49sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [396.39sec/0.38sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [377.71sec/0.56sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [393.42sec/0.63sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [314.27sec/1.03sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve IBCF-Pearson")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Pearson")
For UBCF, the optimal parameters are the Pearson similarity method with nn = 40 nearest neighbors.
#UBCF Cosine
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "UBCF", param = list(method = "Cosine", nn = k))
})
names(models_to_evaluate) <- paste0("UBCF_nn_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 7.21 0.05 7.47
## Error in neighbors[, x] : incorrect number of dimensions
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/11.42sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/9.22sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/8.87sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/9.19sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/9.26sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve UBCF-Cosine")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall IBCF-Cosine")
#UBCF Pearson
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "UBCF", param = list(method = "Pearson", nn = k))
})
names(models_to_evaluate) <- paste0("UBCF_nn_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 3.45 0.06 3.57
## Error in neighbors[, x] : incorrect number of dimensions
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.3sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.15sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.66sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/4.92sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/5.21sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve UBCF-Pearson")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall UBCF-Pearson")
For SVD, the optimal rank of approximation is k = 100.
#SVD
vector_k <- c(1, 5, 10, 20, 30, 40, 100)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "SVD", param = list(k = k))
})
names(models_to_evaluate) <- paste0("SVD_k_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## SVD run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0 0.01 0.02
## Error in r_a %*% model$svd$v %*% diag(1/model$svd$d) :
## non-conformable arguments
## SVD run fold/sample [model time/prediction time]
## 1 [0.58sec/0.92sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.86sec/1.13sec]
## SVD run fold/sample [model time/prediction time]
## 1 [1.21sec/1.15sec]
## SVD run fold/sample [model time/prediction time]
## 1 [2.15sec/1.01sec]
## SVD run fold/sample [model time/prediction time]
## 1 [2.72sec/1.08sec]
## SVD run fold/sample [model time/prediction time]
## 1 [7.81sec/1.14sec]
list_results$SVD_k_5@results
## [[1]]
## An object of class "confusionMatrix"
## Slot "cm":
## TP FP FN TN precision recall TPR
## 1 0.0326087 0.9673913 9.333333 3383.667 0.03260870 0.004780831 0.004780831
## 5 0.0942029 4.9057971 9.271739 3379.728 0.01884058 0.013182120 0.013182120
## 10 0.1521739 9.8478261 9.213768 3374.786 0.01521739 0.018108543 0.018108543
## 15 0.1847826 14.8152174 9.181159 3369.819 0.01231884 0.024188423 0.024188423
## 20 0.2101449 19.7898551 9.155797 3364.844 0.01050725 0.026375357 0.026375357
## 25 0.2572464 24.7427536 9.108696 3359.891 0.01028986 0.033900453 0.033900453
## FPR
## 1 0.0002858102
## 5 0.0014493981
## 10 0.0029094062
## 15 0.0043769765
## 20 0.0058467404
## 25 0.0073101174
##
## Slot "model":
## NULL
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve SVD")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall SVD")
For ALS, the optimal number of latent factors is 40.
#ALS
vector_k <- c(1, 5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "ALS", param = list(n_factors = k))
})
names(models_to_evaluate) <- paste0("ALS_n_", vector_k)
n_recommendations <- c(1, 5, 10, 15, 20, 25)
list_results <- evaluate(x = eval_books, method = models_to_evaluate, n = n_recommendations)
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/122.71sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/102.27sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0.02sec/107.26sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/106.64sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/109.53sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/115.61sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve ALS")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall ALS")
The optimal parameters for each model are listed below.
Model | # of Recommendations | Parameter | Method |
---|---|---|---|
IBCF | 1,5,10,15,20,25 | Similar Items (k): 20 | Pearson |
UBCF | 1,5,10,15,20,25 | Nearest Neighbors (nn): 40 | Pearson |
SVD | 1,5,10,15,20,25 | Rank of Approximation (k): 100 | N/A |
ALS | 1,5,10,15,20,25 | Latent Factors (k): 40 | N/A |
n_recommendations <- c(1, 5, 10, 15, 20, 25)
recc_algos <- list(
IBCF = list(name = "IBCF", param = list(method = "Pearson", k=20)),
UBCF = list(name = "UBCF", param = list(method = "Pearson", nn=40)),
SVD = list(name = "SVD", param = list(k=100)),
ALS = list(name = "ALS", param = list(n_factors=40)),
random = list(name = "RANDOM", param=NULL)
)
recommendation_results <- evaluate(eval_books, method = recc_algos, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [373.15sec/0.17sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/4.27sec]
## SVD run fold/sample [model time/prediction time]
## 1 [6.37sec/1.02sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/154.08sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/1.02sec]
plot(recommendation_results, annotate = TRUE, legend = "topleft")
title("ROC curve - Model Comparisons")
plot(recommendation_results, "prec/rec", annotate = TRUE, legend = "bottomright")
title("Precision-recall - Model Comparisons")
To test the recommender models, I personally rated 30 books that I have read and inserted those ratings (as user 99999) alongside the reduced Book-Crossing dataset. My profile is placed in the test set so that recommendations can be generated for it, and the top-10 recommendations from the two best-performing models, SVD and UBCF, are shown afterwards.
Below are the 30 books I rated; these ratings are fed into the SVD and UBCF models.
url2 <- "https://github.com/dhairavc/DATA612-RecommenderSystems/raw/master/Final%20Project/titles_myrating3.csv"
my_ratings <- read.csv(url2, stringsAsFactors = FALSE)[,1:3]
head(drop_na(my_ratings))
## user item rating
## 1 99999 0099993805 6
## 2 99999 0345339711 10
## 3 99999 0345339738 10
## 4 99999 0385504209 10
## 5 99999 0553296981 4
## 6 99999 0618002227 10
# Insert my personal ratings into reduced dataset
temp_df <- getData.frame(BX3)
user <- c(as.character(temp_df$user), as.character(my_ratings$user))
item <- c(as.character(temp_df$item), as.character(my_ratings$item))
rating <- c(temp_df$rating, my_ratings$rating)
BX3_v2<- data.frame(user, item, rating) %>% drop_na()
BX3_v3 <- as(BX3_v2, "realRatingMatrix")
# Create training and test sets, keeping my personal ratings (user 99999) in the test set
split_data <- sample(x = c(TRUE, FALSE), size = nrow(BX3_v3) - 1, replace = TRUE, prob = c(0.8, 0.2))
split2 <- append(split_data, FALSE)   # the last row (assumed to be user 99999) is kept out of training
train_m <- BX3_v3[split2, ]
test_m <- BX3_v3[!split2, ]
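A quick check (sketch) that the personal profile actually landed in the test set, since the prediction code below pulls the recommendations for the last test user:
"99999" %in% rownames(test_m)          # the profile is present in the test set
tail(rownames(test_m), 1) == "99999"   # and it is the last row, as assumed below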
images_my <- left_join(my_ratings, books, by = c("item" = "ISBN"))
knitr::include_graphics(images_my$`Image-URL-M`)
The SVD model recommended some books, such as Harry Potter and The Lord of the Rings, that I have already rated, but under a different ISBN.
my_model_SVD <- Recommender(data = train_m, method = "SVD", param = list(k=100))
results_SVD <- predict(object = my_model_SVD, newdata = test_m, n=10)
my_recommendations_SVD <- results_SVD@itemLabels[results_SVD@items[[length(results_SVD@items)]]]
images_svd <- books %>% filter(ISBN %in% my_recommendations_SVD) %>% select(`Image-URL-M`)
knitr::include_graphics(images_svd$`Image-URL-M`)
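Since the same title can appear under several ISBNs, one possible refinement (a sketch, assuming the books metadata loaded earlier) is to drop recommended ISBNs whose titles I have already rated:
# Remove recommended ISBNs whose titles were already rated under another ISBN
rated_titles <- books %>% filter(ISBN %in% my_ratings$item) %>% pull(`Book-Title`)
new_recs_SVD <- books %>%
  filter(ISBN %in% my_recommendations_SVD, !(`Book-Title` %in% rated_titles)) %>%
  pull(ISBN)
new_recs_SVD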
The UBCF model gives me some recommendations that are completely outside the categories I rated, such as the Ya-Ya Sisterhood. Since this is user-based filtering, I suppose I was matched with users who had rated these books.
my_model_UBCF <- Recommender(data = train_m, method = "UBCF", param = list(method="Pearson", nn=40))
results_UBCF <- predict(object = my_model_UBCF, newdata = test_m, n=10)
my_recommendations_UBCF <- results_UBCF@itemLabels[results_UBCF@items[[length(results_UBCF@items)]]]
images_UBCF <- books %>% filter(ISBN %in% my_recommendations_UBCF) %>% select(`Image-URL-M`)
knitr::include_graphics(images_UBCF$`Image-URL-M`)
In summary, several models were compared on a reduced dataset, and the two best (SVD and UBCF) were used to generate recommendations from my personal ratings. Neither set of recommendations was particularly good: even the best model had a very low True Positive Rate, and no model achieved a TPR above 0.08. Two important criteria for comparing the models are precision and recall:
Precision is the ratio of instances where the model correctly predicted a good match to the total number of matches the model predicted.
Recall (sensitivity) is the ratio of instances where the model correctly predicted a good match to all the matches the user would actually have liked.
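The two measures can be written directly in terms of the confusion-matrix counts. Below is an illustrative sketch with hypothetical per-user counts (recommenderlab reports these values averaged over test users):
# TP = recommended items the user liked, FP = recommended items the user did not like,
# FN = liked items that were never recommended
precision <- function(TP, FP) TP / (TP + FP)
recall <- function(TP, FN) TP / (TP + FN)
precision(TP = 2, FP = 8)   # 0.2: 2 of the 10 recommendations were good
recall(TP = 2, FN = 3)      # 0.4: 2 of the 5 books the user liked were surfaced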
Looking at the confusion matrix values, there is a trade-off between precision and recall. For recommending items, a Type I error (a false positive) is more problematic than a Type II error (a missed item), because precision reflects how much a recommendation can be trusted. If the model has low precision and its recommendations are not trusted, then faith in the entire system is diminished.
In both models, the recommendations cannot be entirely trusted given the low precision. In practice, some of the recommended books are ones I would have no motivation to read, although not all of the SVD recommendations miss the mark. In conclusion, the models need further refinement or need to be trained on a larger dataset, since a significant portion of the data was discarded to keep run times manageable and to stay within recommenderlab's limitations.