Goodreads is a free social cataloging website that allows individuals to freely search its database of books, annotations, reviews, and ratings. People can check out personalized recommendations and find out whether a book is a good fit for them. This dataset contains ratings for 10,000 books from 50,000+ users. Ratings range from 1 to 5, and each user has rated at least two books. The data can be found here.
library(tidyverse)
library(Matrix)
library(recommenderlab)
library(kableExtra)
library(gridExtra)
book_ratings <- read.csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv", sep = ",", header = T, stringsAsFactors = F)
book_titles <- read.csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv", sep = ",", header = T, stringsAsFactors = F) %>% select(book_id, title)
book_titles$book_id <- as.factor(book_titles$book_id)
# table dimensions
dim(book_ratings)
[1] 5976479 3
# first few ratings for books
head(book_ratings, 10)
The size of this dataset:
object.size(book_ratings)
71718744 bytes
Only a subset of the data will be used to build the recommender systems.
book_ratings$user_id <- as.factor(book_ratings$user_id)
book_ratings$book_id <- as.factor(book_ratings$book_id)
bmatrix <- as(book_ratings, "realRatingMatrix")
dim(bmatrix@data)
[1] 53424 10000
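With 53,424 users and 10,000 books, only a small fraction of user-book pairs actually contain a rating. As a quick sketch of the sparsity, using the bmatrix object above:
# fraction of user-book cells that actually hold a rating
sum(rowCounts(bmatrix)) / prod(dim(bmatrix))
With roughly 6 million ratings in a 53424 x 10000 matrix, this works out to on the order of 1%, which is why recommenderlab stores the data as a sparse realRatingMatrix.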
Users
sim <- similarity(bmatrix[1:10, ], method = "cosine", which = "users")
image(as.matrix(sim), main = "User Similarity")
Books
sim2 <- similarity(bmatrix[ ,1:10], method = "cosine", which = "items")
image(as.matrix(sim2), main = "Item Similarity")
Going forward, we will build the recommender systems using data consisting of users who rated more than 150 books and books that were rated more than 300 times.
# keep users who rated more than 150 books and books rated more than 300 times
bmatrix <- bmatrix[rowCounts(bmatrix) > 150, colCounts(bmatrix) > 300]
bmatrix
4011 x 4147 rating matrix of class realRatingMatrix with 582842 ratings.
tbl_ratings <- as.data.frame(table(as.vector(bmatrix@data)))
tbl_ratings
tbl_ratings <- tbl_ratings[-1,] # a rating of 0 indicates a missing value, so drop that row
ggplot(tbl_ratings, aes(x = Var1, y = Freq, fill = Var1)) + geom_bar(stat = "identity") + ggtitle("Distribution of Book Ratings")
rated_count <- colCounts(bmatrix)
read_book <- data.frame(
book_id = names(rated_count),
read = rated_count
)
top_books <-
inner_join(read_book, book_titles, by = "book_id") %>%
arrange(desc(read)) %>%
select(-book_id) %>%
head(10) %>%
ggplot(aes(x = title, y = read)) + geom_bar(stat = "identity", fill = "lightblue") + geom_text(aes(label=read), vjust=-0.3, size=3.5) + ggtitle("Top 10 Rated Books") + coord_flip()
top_books
avg_book_ratings <- data.frame("avg_rating" = colMeans(bmatrix)) %>%
ggplot(aes(x = avg_rating)) +
geom_histogram(color = "black", fill = "lightgreen") +
ggtitle("Distribution of Average Ratings for Books")
avg_book_ratings
Matrix of the first 100 users and 100 books. Darker spots represent higher ratings.
image(bmatrix[1:100, 1:100], main = "First 100 users and books")
min_readers <- quantile(rowCounts(bmatrix), 0.99)
min_books <- quantile(colCounts(bmatrix), 0.99)
a <- image(bmatrix[rowCounts(bmatrix) > min_readers, colCounts(bmatrix) > min_books], main = "Non-Normalized")
# normalize so that each user's average rating becomes 0, removing per-user bias
book_norm <- normalize(bmatrix)
b <- image(book_norm[rowCounts(book_norm) > min_readers, colCounts(book_norm) > min_books], main = "Normalized")
grid.arrange(a, b, ncol = 2)
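To see what the normalization does, the per-user average ratings can be compared before and after; a short sketch, assuming the default normalize() method, which centers each user's ratings on their own mean:
# each user's average rating before centering
summary(rowMeans(bmatrix))
# after normalize() the per-user averages are (numerically) zero
summary(rowMeans(book_norm))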
train <- sample(x = c(T, F), size = nrow(bmatrix), replace = T, prob = c(0.8, 0.2))
books_train <- bmatrix[train, ]
books_test <- bmatrix[!train, ]  # logical index, so take the complement with ! rather than -
Item-based collaborative filtering (IBCF) is a filtering method in which the similarity between items is calculated from people’s ratings of those items. In other words, the algorithm recommends items similar to the user’s previous selections. The similarities between the items in the dataset are computed using one of a number of similarity measures, and these similarity values are then used to predict ratings for user-item pairs not present in the dataset.
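As a toy illustration of the idea (the numbers below are made up, not taken from the dataset), the cosine similarity between two books can be computed directly from their columns of ratings, using only the users who rated both:
# hypothetical 4-user x 3-book rating matrix (NA = not rated)
toy <- matrix(c(5, 4, NA, 1,
                4, 5, 2, NA,
                1, 2, 5, 4),
              nrow = 4,
              dimnames = list(paste0("user", 1:4), c("bookA", "bookB", "bookC")))
# cosine similarity between two item columns, restricted to users who rated both
cosine_item <- function(a, b) {
  keep <- !is.na(a) & !is.na(b)
  sum(a[keep] * b[keep]) / (sqrt(sum(a[keep]^2)) * sqrt(sum(b[keep]^2)))
}
cosine_item(toy[, "bookA"], toy[, "bookB"])  # ~0.98: the two books are rated very similarly
cosine_item(toy[, "bookA"], toy[, "bookC"])  # ~0.57: the ratings mostly disagree
Books that are most similar to a user’s highly rated books are then the ones recommended.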
Imodel <- Recommender(data = books_train, method = "IBCF")
Imodel
Recommender of type IBCF for realRatingMatrix
learned using 3185 users.
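The fitted object can be inspected with getModel(). For IBCF, recommenderlab keeps a pruned item-to-item similarity matrix; the field names below (sim, k) are assumed from the package's IBCF implementation and its defaults (k = 30, cosine similarity):
imod <- getModel(Imodel)
imod$k        # number of most-similar items retained per item (assumed default of 30)
dim(imod$sim) # item-by-item similarity matrix, one row/column per book in the training data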
Predict with test data.
Ipredict <- predict(Imodel, newdata = books_test, n = 5) %>% list()
Ipredict
[[1]]
Recommendations as topNList with n = 5 for 4010 users.
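The topNList stores column indices internally, but recommenderlab can also return the recommendations as item labels, which here are the original book_id values since those were used as the column names of the rating matrix. A minimal sketch for the first test user:
# top-5 recommendations for the first test user, as book_id labels mapped to titles
first_user_recs <- as(Ipredict[[1]], "list")[[1]]
book_titles %>% filter(book_id %in% first_user_recs) %>% pull(title)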
Books are recommended that are similar to a specified item or an item chosen by the user.
# function to display recommended similar books
item_recc_books <- function(i){
  p <- Ipredict[[1]]@items[[i]]   # top-N recommendations from the IBCF prediction
  p <- data.frame("guess" = as.factor(p))
  p <- inner_join(p, book_titles, by = c("guess" = "book_id")) %>% select(title)
  r <- data.frame("name" = as.factor(i))   # title used to label the output
  r <- inner_join(r, book_titles, by = c("name" = "book_id")) %>% select(title)
  print(paste("Books similar to --", r))
  return(as.list(p))
}
item_recc_books(5); item_recc_books(200); item_recc_books(18)
[1] "Books similar to -- The Great Gatsby"
$`title`
[1] "Something Borrowed (Darcy & Rachel, #1)"
[2] "The Lord of the Rings (The Lord of the Rings, #1-3)"
[3] "Harry Potter and the Cursed Child - Parts One and Two (Harry Potter, #8)"
[4] "The Runaway Jury"
[5] "The Tales of Beedle the Bard"
[1] "Books similar to -- And Then There Were None"
$`title`
[1] "The Host (The Host, #1)"
[2] "Dracula"
[3] "Me Before You (Me Before You, #1)"
[4] "City of Glass (The Mortal Instruments, #3)"
[5] "Beautiful Disaster (Beautiful, #1)"
[1] "Books similar to -- Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)"
$`title`
[1] "City of Bones (The Mortal Instruments, #1)"
[2] "Holes (Holes, #1)"
[3] "Jurassic Park (Jurassic Park, #1)"
[4] "Alice's Adventures in Wonderland & Through the Looking-Glass"
[5] "Quiet: The Power of Introverts in a World That Can't Stop Talking"
ibcf <- table(unlist(Ipredict[[1]]@items)) %>% barplot(main = "Distribution of the number of items for IBCF")
Some books were recommended far more often than others, as seen in the plot above.
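To see which titles dominate, the indices in the topNList can be mapped back to the column names of the filtered rating matrix (the book_id values) and joined to the titles; a short sketch:
# five most frequently recommended books under IBCF
top_idx <- sort(table(unlist(Ipredict[[1]]@items)), decreasing = TRUE)[1:5]
data.frame(book_id = colnames(bmatrix)[as.integer(names(top_idx))],
           times_recommended = as.integer(top_idx),
           stringsAsFactors = FALSE) %>%
  inner_join(book_titles %>% mutate(book_id = as.character(book_id)), by = "book_id")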
User-based collaborative filtering (UBCF) recommends items that people with similar tastes have liked. The algorithm identifies other users whose tastes resemble the target user’s and combines their ratings to make recommendations for that user.
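As a small made-up illustration of this idea, a target user's rating for an unread book can be predicted as a similarity-weighted average of the neighbours' ratings for that book (all numbers below are hypothetical):
# similarity of the target user to two neighbours, and the neighbours' ratings of the book
neighbour_sims    <- c(user2 = 0.9, user3 = 0.4)
neighbour_ratings <- c(user2 = 5,   user3 = 3)
# similarity-weighted average: (0.9*5 + 0.4*3) / (0.9 + 0.4), about 4.4
sum(neighbour_sims * neighbour_ratings) / sum(neighbour_sims)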
Create user-based model
Umodel <- Recommender(data = books_train, method = "UBCF")
Umodel
Recommender of type UBCF for realRatingMatrix
learned using 3185 users.
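Unlike IBCF, the UBCF model precomputes very little: it essentially carries the training ratings around and searches for neighbours at prediction time. A hedged peek at the fitted object (the field names nn and data are assumed from recommenderlab's UBCF implementation):
umod <- getModel(Umodel)
umod$nn        # number of nearest-neighbour users used at prediction time (assumed default of 25)
dim(umod$data) # the (normalized) training ratings stored inside the model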
Predict with the test data.
Upredict <- predict(Umodel, newdata = books_test, n = 5) %>% list()
Upredict
[[1]]
Recommendations as topNList with n = 5 for 4010 users.
# function to display books recommended based on similar users
user_recc_books <- function(u){
  p <- Upredict[[1]]@items[[u]]   # top-N recommendations from the UBCF prediction
  p <- data.frame("guess" = as.factor(p))
  p <- inner_join(p, book_titles, by = c("guess" = "book_id")) %>% select(title)
  r <- data.frame("name" = as.factor(u))   # title used to label the output
  r <- inner_join(r, book_titles, by = c("name" = "book_id")) %>% select(title)
  print(paste("Books similar to --", r, "-- based on similar users"))
  return(as.list(p))
}
user_recc_books(5); user_recc_books(200); user_recc_books(18)
[1] "Books similar to -- The Great Gatsby -- based on similar users"
$`title`
[1] "A Clash of Kings (A Song of Ice and Fire, #2)"
[2] "A Storm of Swords (A Song of Ice and Fire, #3)"
[3] "The Lord of the Rings (The Lord of the Rings, #1-3)"
[4] "The Two Towers (The Lord of the Rings, #2)"
[5] "A Game of Thrones (A Song of Ice and Fire, #1)"
[1] "Books similar to -- And Then There Were None -- based on similar users"
$`title`
[1] "Harry Potter and the Deathly Hallows (Harry Potter, #7)"
[2] "Words of Radiance (The Stormlight Archive, #2)"
[3] "Ender's Game (Ender's Saga, #1)"
[4] "The Way of Kings (The Stormlight Archive, #1)"
[5] "The Name of the Wind (The Kingkiller Chronicle, #1)"
[1] "Books similar to -- Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) -- based on similar users"
$`title`
[1] "The Da Vinci Code (Robert Langdon, #2)"
[2] "Under the Never Sky (Under the Never Sky, #1)"
[3] "Unbroken: A World War II Story of Survival, Resilience, and Redemption"
[4] "What Alice Forgot"
[5] "From Dead to Worse (Sookie Stackhouse, #8)"
ubcf <- table(unlist(Upredict[[1]]@items)) %>% barplot(main = "Distribution of the number of items for UBCF")
Again, some books were recommended to users far more often than others.
Overall, building both recommendation systems gave a better understanding of how they work. On my end, the user-based CF took longer to compute than the IBCF. This matches what the book “Building a Recommendation System with R” says: UBCF is a lazy method that needs access to all of the data to make a prediction, which is why it does not scale well to large matrices. On the whole, item-item collaborative filtering had less error than user-user collaborative filtering.
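For the error comparison mentioned above, recommenderlab's evaluation utilities can be used to estimate the rating-prediction error (RMSE/MAE) of both methods on held-out ratings. A sketch, assuming the filtered bmatrix from earlier; the split proportion and the number of given ratings are arbitrary choices:
# hold out ratings per user and compare prediction error for IBCF vs UBCF
set.seed(123)
scheme <- evaluationScheme(bmatrix, method = "split", train = 0.8, given = 5)
errs <- sapply(c("IBCF", "UBCF"), function(m) {
  rec  <- Recommender(getData(scheme, "train"), method = m)
  pred <- predict(rec, getData(scheme, "known"), type = "ratings")
  calcPredictionAccuracy(pred, getData(scheme, "unknown"))
})
errs  # rows: RMSE, MSE, MAE; columns: IBCF, UBCF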