library(recommenderlab)
library(recommenderlabBX)
library(ggplot2)
library(dplyr)
library(tidyr)
library(data.table)
library(kableExtra)
For the final project in this course, I will apply the course material to build a recommendation model on the Book-Crossing dataset. To determine the most appropriate model, I will:
Once the most appropriate model is selected, I will run it on my own profile to judge how well the recommendations suit me personally. To do this, I will rate a number of books according to my preferences and interests and apply the recommendation model to those ratings.
The Book-Crossing dataset contains 1,149,778 ratings (explicit and implicit) from 278,858 users on 271,379 books. The resulting user-item matrix is very sparse: most ratings are 0, and most users have rated only one book.
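To put the sparsity claim in numbers, the fraction of filled user-book cells can be estimated from the three counts quoted above. This is a minimal sketch in base R; the variable names are illustrative, and the result depends on which user and book counts are used.

```r
# Counts quoted in the dataset description above
n_users   <- 278858
n_books   <- 271379
n_ratings <- 1149778

# Total number of possible user-book pairs
n_cells <- n_users * n_books

# Share of cells that actually hold a rating
fill_rate <- n_ratings / n_cells

# Print the fill rate as a percentage
sprintf("%.4f%% of user-book pairs have a rating", 100 * fill_rate)
```

With these counts, far less than 0.01% of all possible user-book pairs carry a rating.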
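# Load the Book-Crossing data sets shipped with recommenderlabBX
data("BXBooks")
data("BX")
# Download the CSV dump and read the book metadata from BX-Books.csv
url <- "https://github.com/dhairavc/DATA612-RecommenderSystems/raw/master/Final%20Project/BX-CSV-Dump.zip"
temp <- tempfile()
# mode = "wb" keeps the zip archive intact on Windows
download.file(url, temp, mode = "wb")
books <- fread(unzip(temp, files = "BX-Books.csv"))
unlink(temp)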
BX
## 105283 x 340547 rating matrix of class 'realRatingMatrix' with 1149778 ratings.
# Ratings distribution
ratings <- getRatings(BX)
ggplot() +
  geom_histogram(aes(ratings), binwidth = 1, col = "white", fill = "pink") +
  labs(title = "Ratings Distribution", x = "Rating", y = "Count") +
  scale_y_continuous(labels = scales::comma)
# Distribution of books rated per user
ratings_by_user <- rowCounts(BX)
ratings_by_user <- data.frame(User = names(ratings_by_user), BooksRated = ratings_by_user)
# Count how many users rated exactly 1, 2, 3, ... books
ratings_by_user_sum <- ratings_by_user %>% group_by(BooksRated) %>% summarise(Users = n())
ggplot(ratings_by_user_sum, aes(y = BooksRated, x = Users)) +
  geom_point(col = "purple") +
  labs(title = "Books Rated per User", x = "Number of users", y = "Books rated")
# Users with the most rated books
tail(ratings_by_user_sum) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
| BooksRated | Users |
|---|---|
| 4785 | 1 |
| 5850 | 1 |
| 5891 | 1 |
| 6109 | 1 |
| 7550 | 1 |
| 13601 | 1 |
# Users with the fewest rated books
head(ratings_by_user_sum) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
| BooksRated | Users |
|---|---|
| 1 | 59166 |
| 2 | 12503 |
| 3 | 6533 |
| 4 | 4265 |
| 5 | 3099 |
| 6 | 2334 |
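The table above can be turned into shares of the user base, which shows how concentrated activity is among low-volume users. This is a minimal sketch in base R: the counts are copied from the table, and the total of 105,283 users is assumed to be the row count reported for the `BX` rating matrix.

```r
# Users who rated exactly 1..6 books, copied from the table above
users_by_count <- c(`1` = 59166, `2` = 12503, `3` = 6533,
                    `4` = 4265, `5` = 3099, `6` = 2334)

# Assumption: total users equals the number of rows in the BX matrix
total_users <- 105283

# Share of users at each count, and the running (cumulative) share
share     <- users_by_count / total_users
cum_share <- cumsum(share)

# Percentages, rounded for display
round(100 * share, 1)
round(100 * cum_share, 1)
```

Under that assumption, roughly 56% of users rated exactly one book and roughly 83% rated six or fewer.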
# Most rated books
rating_per_book <- colCounts(BX)
rating_per_book <- data.frame(ISBN = names(rating_per_book), Rated = rating_per_book)
# Keep the 21 most rated ISBNs and attach the book metadata
most_rated <- (rating_per_book %>% arrange(desc(Rated)))[1:21,]
most_rated <- left_join(most_rated, books, by = "ISBN")
# Show cover images for the books that have complete metadata
knitr::include_graphics(drop_na(most_rated)$`Image-URL-M`)