Problem Statement: People love books, and they enjoy discussing and sharing the books they love. Goodreads is a platform designed for such bibliophiles (and budding bibliophiles!). What I felt was missing from the website is a summary analysis of popular books. The objective of this project is to analyze data on popular books from Goodreads, identify interesting trends among them, and understand what makes them popular.
How I did this?: The data for this analysis is a contiguous dataset of the 10,000 most popular books, with user-level information on 6 million ratings for these books. We also have several generic and user-defined genres (tags/shelves) for each book. A combination of this information helped answer the above questions. The analysis was divided into three parts:
The packages used in the project (currently) are:
# Loading the packages
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(stringr)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
library(GGally)
library(corrplot)
The original dataset, goodbooks-10k, was scraped in August 2017 with the aim of providing a database for books similar to those available for movies and songs. The creator's objective was to enable a recommendation engine for books, like the ones that exist for music and movies.
The dataset was first published on Kaggle. Contributors then identified anomalies in the data, such as duplicate records and multiple ratings by the same user for the same book. These anomalies were removed, and the clean data was posted to GitHub. We will load the data from this GitHub repository.
The column information is provided for the key variables. Some assumptions are made for the other variables based on data present (and looking up the goodreads website).
The data is contiguous for 10k books and 50k users. The database has 5 tables, descriptions of which are given below
| Table | Description | Notes |
|---|---|---|
| books | Data on the 10k most popular books with metadata (author, ratings, etc.) | Popularity is determined by the total number of users who have rated the book |
| book_tags | Genres associated with each book, along with the number of times each genre was tagged | Both generic and user-defined tags are included; user-defined tags can be anything |
| tags | Description of the genres | |
| ratings | User, book, and rating level data for 10k books and 50k users | For 10k books and 50k users who have made at least 2 ratings |
| to_read | User, book pair that a user has marked to read | For 10k books and 50k users |
On initial checks, I found that the data sets had blanks, which R could not identify as NA. So I added na.strings to help R identify nulls
# Reading all the datasets from github using fread
ratings <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/ratings.csv", na.strings = c("", "NA"), showProgress = F)
book_tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/book_tags.csv", na.strings = c("", "NA"), showProgress = F)
books <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/books.csv", na.strings = c("", "NA"), showProgress = F)
tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/tags.csv", na.strings = c("", "NA"), showProgress = F)
to_read <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/to_read.csv", na.strings = c("", "NA"), showProgress = F)
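To see what na.strings changes, here is a minimal sketch on a made-up inline CSV. It uses base R's read.csv as a stand-in for fread, on the assumption that both handle blank fields the same way under this argument; the column values are hypothetical.

```r
# Made-up CSV with one blank field and one literal "NA" (illustrative only)
csv_text <- "title,language_code
Book A,eng
Book B,
Book C,NA"

# Default settings: the blank field in a character column is kept as ""
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings, blank fields are treated as missing as well
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"),
                        stringsAsFactors = FALSE)

# Count of missing values seen in each case
sum(is.na(default_read$language_code))
sum(is.na(strict_read$language_code))
```

Comparing the two counts shows why the blank entries would otherwise slip through an is.na() check.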
The attributes of the different dataframes are described below
#Creating a List of all tables
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)
# Writing a function to get dimensions
dim_func <- function(x){
dimension = str_trim(paste0(dim(tbl_list[[x]]), sep = " ", collapse = ""))
}
# Writing a function to get all column names
names_func <- function(y){
vars = str_trim(paste0(names(tbl_list[[y]]), sep = " | ", collapse = ""))
}
# Creating a table
table_attributes <- tibble(
Table = names(tbl_list),
`Rows Columns` = unlist(lapply(1:5, dim_func)),
Variables = unlist(lapply(1:5, names_func ))
)
# Removing the list
rm(tbl_list)
# Printing the table
kable(table_attributes, format = "html") %>%
kable_styling(bootstrap_options = "striped") %>%
column_spec(2, width = "12em")
| Table | Rows Columns | Variables |
|---|---|---|
| books | 10000 23 | book_id \| goodreads_book_id \| best_book_id \| work_id \| books_count \| isbn \| isbn13 \| authors \| original_publication_year \| original_title \| title \| language_code \| average_rating \| ratings_count \| work_ratings_count \| work_text_reviews_count \| ratings_1 \| ratings_2 \| ratings_3 \| ratings_4 \| ratings_5 \| image_url \| small_image_url \| |
| book_tags | 999912 3 | goodreads_book_id \| tag_id \| count \| |
| tags | 34252 2 | tag_id \| tag_name \| |
| ratings | 5976479 3 | user_id \| book_id \| rating \| |
| to_read | 912705 2 | user_id \| book_id \| |
As a first step, I removed the columns that I do not intend to use from the books table. All other tables contain only relevant columns.
# Selecting the relevant columns
books <- books %>% select(-c(best_book_id, work_id, isbn, isbn13), -(ratings_1:small_image_url))
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)
The next step in cleaning the data was to identify any missing values across datasets. A quick is.na() check reveals that only books has missing values in some columns
# Identifying missing across all tables using a loop
for (x in 1:length(tbl_list)) {
dd <- colSums(is.na(tbl_list[[x]]))
if (sum(dd[dd > 0]) > 0) {
print(paste("For the data -",names(tbl_list[x])))
print(dd[dd > 0])
}
}
## [1] "For the data - books"
## original_publication_year original_title
## 21 590
## language_code
## 1084
# Removing the list
rm(tbl_list)
I have decided to retain all the columns based on the explanation below
Among the different summary() calls that were run across all tables, only two anomalies were observed:
- original_publication_year in books had a negative year
- count in book_tags, which records how many times a book was given a particular genre, has negative values
# Display summary statistics of the important columns
summary(books$original_publication_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1750 1990 2004 1982 2011 2017 21
summary(book_tags$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0 7.0 15.0 208.9 40.0 596234.0
| original_publication_year | title | goodreads_book_id |
|---|---|---|
| -720 | The Odyssey | 1381 |
| -750 | The Iliad | 1371 |
| -500 | The Art of War | 10534 |
| -380 | The Republic | 30289 |
| -430 | Oedipus Rex (The Theban Plays, #1) | 1554 |
I noticed that these books were actually written in the B.C. era, so the negative years are valid.
The count in book_tags was mutated to 0 if negative
# Mutating negative values to 0
book_tags[which(book_tags$count < 0), "count"] = 0
datatable(head(books, 500))
datatable(head(ratings, 500))
datatable(head(to_read, 500))
| Variables | Description | Present_In |
|---|---|---|
| book_id | User-assigned book ID based on popularity, integer (1 - 10000) | books, ratings, to_read |
| goodreads_book_id | Book ID linked to Goodreads | books, book_tags |
| books_count | Number of editions of the book | books |
| authors | Author(s) of the book | books |
| original_publication_year | First year in which the book was published | books |
| original_title | First title of the book | books |
| title | Current running title of the edition selected | books |
| language_code | Language of the book | books |
| average_rating | Average Rating of the book (1 - 5) | books |
| ratings_count | Total users who have rated the book | books |
| work_ratings_count | Ratings of the current edition selected (1 - 5) | books |
| work_text_reviews_count | Total comments on the current edition selected | books |
| tag_id | ID of the tag | book_tags |
| count | Number of users who have tagged the genre for the book | book_tags |
| new_tag_name | Cleaned Tag name | book_tags |
| user_id | ID of the user (1 - 53424) | ratings, to_read |
| rating | Rating that the user has given integer(1 - 5) | ratings |
First, I wanted to understand which genres feature in the top 10,000 popular books on Goodreads. Since a book can be associated with multiple genres, I divided them into a main genre and associated genres. The main genre is the tag with the most votes for a book, while the associated genres can be thought of as its sub-genres. The associated genres are limited to 4 per book.
Most of the tags appear in both roles. That is, they are the main genre for some books and an associated genre for others. From the graph below, I treat fiction through thriller as the main genres. Genres from urban fantasy through epic fantasy are treated as associated tags.
If we ignore fiction and non-fiction, people like to read fantasy, mystery and romance books!
final_book_tags <- final_book_tags %>%
group_by(goodreads_book_id) %>%
mutate(rank = rank(desc(count), ties.method = "first")) %>%
ungroup() %>%
filter(rank <= 5) %>%
arrange(goodreads_book_id, rank)
#Associated tags for each book
top_5 <- final_book_tags %>%
filter(rank > 1) %>%
within(new_tag_name <- factor(new_tag_name, levels = names(sort(table(new_tag_name), decreasing = F))))
#Main Genre for each book
top_1 <- final_book_tags %>%
filter(rank == 1) %>%
within(new_tag_name <- factor(new_tag_name, levels = names(sort(table(new_tag_name), decreasing = F))))
#Bar graph
ggplot(NULL, aes(x = new_tag_name)) +
geom_bar(data = top_1, aes(y = (..count..)/sum(..count..), fill = "Main Genre"),alpha = 0.5) +
geom_bar(data = top_5, aes(y = (..count..)/sum(..count..), fill = "Associated Genre"),alpha = 0.5) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 0.20), labels = scales::percent) +
coord_flip() +
labs(title = "Which Genre features most in the top books?", subtitle = "Main vs. Associated Genre", x = "Genre", y = "Percentage of Books", fill = "Genre Type")
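The split into main and associated genres can be sketched on a toy table. This is a minimal illustration of the ranking step only, assuming dplyr is installed; the book IDs, tags and counts are hypothetical.

```r
library(dplyr)

# Toy tag counts for two hypothetical books (illustrative values only)
toy_tags <- tibble(
  goodreads_book_id = c(1, 1, 1, 2, 2),
  new_tag_name      = c("fantasy", "fiction", "young adult", "mystery", "thriller"),
  count             = c(500, 300, 120, 80, 200)
)

# Rank tags within each book by how often they were applied
ranked <- toy_tags %>%
  group_by(goodreads_book_id) %>%
  mutate(rank = rank(desc(count), ties.method = "first")) %>%
  ungroup() %>%
  arrange(goodreads_book_id, rank)

# Rank 1 is the main genre; ranks 2 to 5 are the associated genres
main_genre  <- filter(ranked, rank == 1)
assoc_genre <- filter(ranked, rank > 1, rank <= 5)

main_genre$new_tag_name
```

Using ties.method = "first" guarantees exactly one main genre per book even when two tags have the same count.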
As with movies, I wanted to check whether older books (written before the 1980s) make up a large chunk of the popular books.
Interestingly, 75% of the books are from 1980 or later, indicating a young audience base using the website. The reference can be found here
#Older books are more in the popular lists?:
plot_year <- books %>%
select(original_publication_year) %>%
group_by(original_publication_year) %>%
summarise(count = n()) %>%
ggplot(aes(x = original_publication_year, y = count)) +
geom_line() +
theme(legend.position = "none") +
labs(title = "When were the popular books written", x = "Year",
y = "Number of Books")
plot_zoom_year <- books %>%
select(original_publication_year) %>%
group_by(original_publication_year) %>%
summarise(count = n()) %>%
ggplot(aes(x = original_publication_year, y = count, color = (original_publication_year >= 1983 &
original_publication_year <= 2011))) +
geom_line(aes(group = 1)) +
theme(legend.position = "none") +
labs(title = "Zoomed in for 1900 and beyond", x = "Year") +
ylab(NULL) +
coord_cartesian(xlim = c(1900,2016))
grid.arrange(plot_year, plot_zoom_year, ncol = 2)
Having built the lists of main and associated genres, I wanted to check the association between genres. The metric counts how many times one genre features alongside another. invoke_map was used to run the calculation across all the tags.
The strongest associations were observed between science fiction and paranormal with fantasy, thriller with mystery (duh), and graphic novels with comics.
Interestingly, chick lit (chick literature) authors pivot their books around romance only. The young adult genre has a wider focus, involving fantasy, romance, paranormal and adventure books.
#Picking the 15/35 genres that have the top tag for 91% of books
genres_for_analysis <- top_1 %>%
group_by(new_tag_name) %>%
summarise(num_books = n()) %>%
ungroup() %>%
arrange(num_books) %>%
top_n(n = 15, wt = num_books)
#Creating an empty dataframe
assoc_table <- tibble(main = character(), assoc = character(), val = integer())
# Writing the function for association mapping that checks each genre
association_map <- function(x, df){
book_ids <- df[which(df$new_tag_name == x), "goodreads_book_id"]
all_genres <- df[which(df$goodreads_book_id %in% (book_ids$goodreads_book_id)),] %>%
group_by(new_tag_name) %>%
summarise(n = n())
temp_table <- tibble(main = rep(x, times = nrow(all_genres)), assoc = as.character(all_genres$new_tag_name), val = all_genres$n)
assoc_table <<- union(assoc_table, temp_table)
return(NULL)
}
#Calling the function on all genres
invoke_map(association_map, final_tags$new_tag_name, df = top_5)
# Filter and spread
assoc_table <- assoc_table %>%
filter(assoc_table$main != assoc_table$assoc, main %in% (genres_for_analysis$new_tag_name), !assoc %in% c('fiction')) %>%
spread(assoc, val) #Spread the table to get the lower matrix
assoc_table[is.na(assoc_table)] <- 0
assoc_title <- assoc_table[,1]
assoc_head <- assoc_table[,-1]
assoc_head <- prop.table(as.matrix(assoc_head), margin = 1)
#Binding back to the first column
assoc_table <- assoc_head %>%
cbind(assoc_title) %>% #Binding back the main column
gather(key = "assoc", value = "val", -main) #Gather the table
ggplot(assoc_table, aes(y = main, x = assoc)) +
theme_bw() +
geom_tile(aes(fill = val), color = 'white') +
scale_fill_gradient(low = 'white', high = 'darkblue', space = 'Lab') +
theme(axis.text.x = element_text(angle = 90), panel.grid.major = element_line(color = '#eeeeee')) +
labs(title = "Association between Genres", y = "Main Genre", x = "Associated genre", fill = "Association Score")
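The association score plotted above is a row-wise proportion of co-occurrence counts. A minimal base R sketch with a hypothetical count matrix shows the normalisation step:

```r
# Hypothetical co-occurrence counts: rows are main genres, columns are associated genres
co_counts <- matrix(
  c( 0, 30, 10,
    30,  0,  5,
    10,  5,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(main  = c("fantasy", "romance", "mystery"),
                  assoc = c("fantasy", "romance", "mystery"))
)

# margin = 1 divides each row by its row total, so every row sums to 1
assoc_score <- prop.table(co_counts, margin = 1)

assoc_score["fantasy", "romance"]
rowSums(assoc_score)
```

Because each row is normalised separately, the score is not symmetric: the share of fantasy books tagged romance can differ from the share of romance books tagged fantasy.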
First, I’d like to see the general distribution of ratings across all 10,000 books. A weighted average was used for the mean, as the number of votes for each book can vary. The ratings follow an approximately normal distribution with a mean rating of 4.03.
We now know that the popular books are indeed popular because they are a good read. However, we do see a wide range, from 2.47 to 4.82.
mean_rat <- books %>%
summarise(average_rating = sum(average_rating * work_ratings_count)/sum(work_ratings_count)) %>%
round(digits = 2)
ggplot(data =NULL, aes(x = average_rating)) +
geom_histogram(data = books,bins = 50) +
geom_vline(data = books, aes(xintercept = sum(average_rating*work_ratings_count)/sum(work_ratings_count)), color = "blue", linetype = "dashed", size = 0.75) +
geom_text(data = mean_rat, aes(label = paste("Mean =",average_rating) , y = 820), hjust = -0.17, color = "blue") +
labs(title = "How are ratings distributed for the popular books", x = "Average Rating", y = "Number of Books")
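To illustrate why the mean is weighted by the number of ratings, here is a small base R sketch with three hypothetical books. The values are made up; only the calculation mirrors the one used above.

```r
# Hypothetical average ratings and rating counts for three books
avg_rating <- c(4.5, 3.8, 4.1)
n_ratings  <- c(100, 10000, 500)

# Simple mean treats every book equally
simple_mean <- mean(avg_rating)

# Weighted mean gives heavily rated books more influence:
# sum(avg_rating * n_ratings) / sum(n_ratings)
weighted_mean <- weighted.mean(avg_rating, w = n_ratings)

c(simple = simple_mean, weighted = weighted_mean)
```

In this toy case the weighted mean is pulled toward 3.8 because the second book has far more ratings than the other two.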
Now that we know the top genres, a more directed way to pick your next book is to see how other people rate these genres.
For this analysis, I have considered the top genre for each book and selected the top 15 genres to analyze. The top 15 genres cover 91% of the books, as shown below.
genres_for_analysis %>%
summarise(`Books Covered` = sum(num_books)) %>%
kable(format = "html") %>%
kable_styling(bootstrap_options = "striped")
| Books Covered |
|---|
| 9059 |
The below graph explains the variation across genre ratings. From the boxplot below, we infer that:
#Getting the top 15 genres
filter_books <- top_1 %>%
filter(new_tag_name %in% genres_for_analysis$new_tag_name)
#Creating the boxplot
ratings_with_genre <- books %>%
inner_join(filter_books, by = "goodreads_book_id")
ggplot(ratings_with_genre, aes(x = new_tag_name, y = average_rating)) +
geom_boxplot() +
coord_flip() +
labs(title = "How are ratings distributed across genres", x = "Genre", y = "Average Rating")
The next analysis was to see if older books actually get better ratings than newer ones. The line graph below shows that this is not the case. In fact, we see a slight increase in ratings across the years. This is subject to the low count of books prior to the 1980s.
So sadly, old is actually not gold for books. The quality of the book does matter for bibliophiles.
ratings_with_genre %>%
group_by(original_publication_year) %>%
summarise(num_books = n(), rating = sum(average_rating*work_ratings_count)/sum(work_ratings_count)) %>%
ungroup() %>%
ggplot(aes(x = original_publication_year, y = rating)) +
geom_line() +
coord_cartesian(xlim = c(1900,2017)) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Rating vs. Year of publication", x = "Year of publication", y = "Average rating")
I ran a simple correlation to understand whether ratings are influenced by any of the available factors. I would treat a coefficient above 0.3 as a correlation.
However, ratings do not correlate with any of the factors that we have, and are probably affected by other factors.
books_numeric <- books %>%
mutate(num_authors = str_count(books$authors, ",") + 1) %>%
select(books_count, original_publication_year, average_rating, ratings_count, work_ratings_count, work_text_reviews_count, num_authors)
books_numeric <- books_numeric[complete.cases(books_numeric),]
ggcorr(books_numeric, nbreaks = 5, label = TRUE, label_size = 3, hjust = 0.7, label_round = 4, size = 2.5) +
labs(title = "Correlation of average rating with other variables")
We can identify whether a book is part of a series by checking cases where the title is not the same as the original title and the title contains a "#" followed by the number in the series. For such cases, the rating generally increases until the 5th part of the series and then stabilizes from the 6th to the 10th part.
So if you start reading a series, like Harry Potter, you tend to like the books even more. That’s how they create the fan base, don’t they?
# Getting the books that have editions
books_editions <- filter(books, original_title != title, str_detect(title, "#"))
# Getting the edition number
start <- regexpr("\\#[^\\#]*$", books_editions$title)
end <- regexpr("\\)[^\\)]*$", books_editions$title)
books_editions$edition <- as.numeric(str_sub(books_editions$title, start = start + 1, end = end - 1))
books_editions <- books_editions[which(books_editions$edition %in% c(1:10)),]
correlation <- round(cor(books_editions$edition, books_editions$average_rating ), digits = 3)
ggplot(data = books_editions, aes(x = factor(edition), y = average_rating)) +
geom_boxplot(notch = TRUE) +
geom_rect(xmin = 9.18, xmax = 10.32, ymin = 4.68, ymax = 4.82, color = "blue", fill = "white") +
annotate("text", x = 9.75, y = 4.75, label = paste("Corr:",as.character(correlation)), color = "blue") +
labs(title = "What about Book Series?", subtitle = "Ratings by Book Series", x = "Number in the Series",
y = "Average Rating")
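The series number can also be pulled out with a single capture group. The sketch below is an alternative to the regexpr and str_sub approach above, not the method used in this analysis, and the titles are illustrative examples of the "(Series, #N)" format.

```r
# Illustrative titles; the last one is not part of a series
titles <- c(
  "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",
  "Catching Fire (The Hunger Games, #2)",
  "The Great Gatsby"
)

# Capture the digits that follow '#' and sit just before the final ')'
matches <- regmatches(titles, regexec("#([0-9]+)\\)[^)]*$", titles))

# Each match is c(full_match, captured_digits); no match gives character(0)
series_no <- vapply(
  matches,
  function(m) if (length(m) == 2) as.numeric(m[2]) else NA_real_,
  numeric(1)
)

series_no
```

Titles without a series marker come back as NA, so they can be dropped with the same %in% c(1:10) filter used above.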
Bibliophiles form the most important component of goodreads. They are the people who actually rate and recommend books so that everyone can have a good read and avoid the bad ones.
It is assumed that if a user rates a book, he/she has read the book. The analysis further assumes books read as the metric.
The first analysis checks the distribution of the number of books that people read. Similar to ratings across books, the number of books that users read follows an approximately normal distribution with a mean of 111.9 (well, that’s a lot of books!). The minimum number of books that a person has read is 19, while the maximum is 200.
To reiterate, since this is a contiguous dataset, these numbers might not reflect the actual population parameters.
# Grouping the ratings and books read by user
rating_user <- ratings %>%
group_by(user_id) %>%
summarise(n = n_distinct(book_id), average_rating = mean(rating)) %>%
ungroup()
#Calculating the min, max, and mean
mean_b <- mean(rating_user$n)
min_b <- min(rating_user$n)
max_b <- max(rating_user$n)
#How many books do users read
ggplot(rating_user, aes(x = n)) +
geom_histogram(bins = 40) +
geom_vline(aes(xintercept = mean_b), color = "blue") +
geom_vline(aes(xintercept = min_b), color = "tan2") +
geom_vline(aes(xintercept = max_b), color = "tan2") +
annotate("text", x = mean_b + 15, y = 4500, label = paste("Mean =", round(mean_b, 1)), color = "blue") +
annotate("text", x = min_b + 10, y = 4500, label = paste("Min =", round(min_b, 1)), color = "tan2") +
annotate("text", x = max_b - 10, y = 4500, label = paste("Max =", round(max_b, 1)), color = "tan2") +
labs(title = "How many books do users read", x = "Number of books read", y = "Count of users")
We decided to split the users into quintiles based on the number of books that they’ve read. People are distributed into Q1-Q5, with Q1 the lowest quintile.
# Assigning the quintiles (stored in a column named Quartile)
rating_user$Quartile <- with(rating_user, factor(findInterval(n, c(-Inf,
quantile(n, probs=c(0.2, .4, .6, .8)), Inf)), labels = c("Q1", "Q2", "Q3", "Q4","Q5")))
rating_user %>%
group_by(Quartile) %>%
summarise(Mean_Books = mean(n), Min_Books = min(n), Max_Books = max(n)) %>%
kable(format = "html") %>%
kable_styling(bootstrap_options = "striped")
| Quartile | Mean_Books | Min_Books | Max_Books |
|---|---|---|---|
| Q1 | 75.87592 | 19 | 91 |
| Q2 | 98.33239 | 92 | 104 |
| Q3 | 110.39975 | 105 | 116 |
| Q4 | 123.86480 | 117 | 132 |
| Q5 | 148.92101 | 133 | 200 |
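The quintile assignment can be sketched on a small vector. This is a minimal base R illustration with hypothetical book counts, using the same findInterval and quantile combination as the code above.

```r
# Hypothetical number of books read by ten users
books_read <- c(19, 60, 92, 100, 105, 110, 117, 125, 133, 200)

# Cut points at the 20th, 40th, 60th and 80th percentiles, padded with -Inf and Inf
breaks <- c(-Inf, quantile(books_read, probs = c(0.2, 0.4, 0.6, 0.8)), Inf)

# findInterval returns the index of the interval each value falls into (1 to 5)
quintile <- factor(findInterval(books_read, breaks),
                   levels = 1:5, labels = paste0("Q", 1:5))

table(quintile)
```

Padding the breaks with -Inf and Inf ensures the minimum and maximum values are always assigned to Q1 and Q5 rather than falling outside the intervals.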
Users tend to get slightly more critical of the books they read as the number of books read increases. We observe a steady decline in average rating across the 5 quintiles.
New readers tend to rate books generously and become more critical as they read more good content.
# Quintile-level rating
rating_user %>%
group_by(Quartile) %>%
summarise(mean_avg = sum(average_rating*n)/sum(n), min = min(n), max = max(n)) %>%
ungroup() %>%
ggplot(aes(x = paste0(Quartile, " (", min," - ", max, ")"), y = mean_avg)) +
geom_bar(stat = "identity") +
coord_cartesian(ylim = c(3.5, 4.02)) +
labs(title = "How do users rate books?", x = "Number of books read", y = "Average Rating given")
Our next objective was to identify any differences between the reading patterns of our bibliophile quintiles. The percentage of books read was calculated for each genre and quintile. The top 5 genres in each quintile are selected for analysis.
Some interesting observations were
rating_w_tags <- ratings %>%
left_join(rating_user, by = "user_id") %>%
left_join(books, by = "book_id") %>%
left_join(top_1, by = "goodreads_book_id")
rating_w_tags %>%
group_by(Quartile, new_tag_name) %>%
summarise( n = n()) %>%
mutate(freq = n / sum(n)) %>%
ungroup() %>%
group_by(Quartile) %>%
mutate(rank = rank(desc(freq))) %>%
filter(rank <= 5) %>%
ungroup() %>%
arrange(Quartile, rank) %>%
ggplot(aes(x = reorder(new_tag_name, -freq), y = freq)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = scales::percent) +
facet_wrap(~ Quartile, scales = "free_x", ncol = 2) +
labs(title = "What books do the segmented users read?", x = "Top 5 genres",
y = "Percentage of books read in the genre")
After we’ve identified what different segments of users read, our last analysis is to identify what makes a book worth reading.
This analysis is carried out by counting the number of users who have marked each book as “to-read” on Goodreads. This variable is then analyzed against all other attributes of a book to see if a relation exists.
A strong correlation exists between the number of users who have marked a book as to-read and the number of text reviews that the book has received. As the number of text reviews increases, more people would like to read the book. Interestingly, the rating of a book does not matter when a user wants to read it.
Other variables that have a good correlation with to-read are the number of editions of the book (0.3) and the number of users that have rated the book (0.63).
The hypothesis is that popular books (in terms of ratings and comments) feature in the different lists on Goodreads, which motivates users to read these books.
So if you launch a new book, generate more hype for it in terms of comments and ratings.
# Get the number of users who have marked a book as to read
to_read_total <- to_read %>%
group_by(book_id) %>%
summarise(number_users = n_distinct(user_id))
# Join with the attributes of the book
book_tr <- books %>%
left_join(to_read_total, by = "book_id") %>%
select(number_users, books_count, original_publication_year, average_rating, ratings_count, work_ratings_count, work_text_reviews_count)
# Correlation
correlation <- cor(book_tr[complete.cases(book_tr),])
corrplot(correlation, type = "lower", addCoef.col = "black", diag = FALSE, number.cex = 0.8,tl.srt = 45, tl.col = "black", tl.cex = 0.6, title = "Correlation of books marked as to read", mar = c(0,0,1,0))
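The to-read count is the number of distinct users per book. A minimal base R sketch on hypothetical user IDs, book IDs and review counts shows the counting and correlation steps without the tidyverse:

```r
# Hypothetical to-read marks: one row per (user, book) pair
to_read_toy <- data.frame(
  user_id = c(1, 2, 3, 1, 2, 1),
  book_id = c(10, 10, 10, 20, 20, 30)
)

# Number of distinct users who marked each book as to-read
marks <- aggregate(user_id ~ book_id, data = to_read_toy,
                   FUN = function(u) length(unique(u)))
names(marks)[2] <- "number_users"

# Hypothetical text review counts for the same three books
reviews <- data.frame(book_id = c(10, 20, 30),
                      work_text_reviews_count = c(900, 400, 150))

# Join on book_id and correlate the two counts
book_tr_toy <- merge(marks, reviews, by = "book_id")
cor(book_tr_toy$number_users, book_tr_toy$work_text_reviews_count)
```

With only three made-up books the coefficient itself means nothing; the point is the shape of the pipeline: count, join, correlate.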
The shiny app below takes multiple genres from the user and outputs the recommended books to read based on those genres.
The books are sorted according to the combination of genres (most genres first) and then rating
Clicking on the name of the book will direct you to the goodreads page of the book.
The app is hosted here
First, the dataframes books, final_tags, and final_book_tags need to be written to CSV files and placed in the same folder as the ui and server code.
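As a rough sketch of that export step, the snippet below writes a stand-in table to a CSV file and reads it back. The folder name and the toy data are hypothetical; in practice the target is whichever folder holds ui.R and server.R, and the same call would be repeated for each of the three tables.

```r
# Stand-in for the cleaned books table (hypothetical values)
books_toy <- data.frame(
  goodreads_book_id = c(1, 2),
  title             = c("Book A", "Book B"),
  average_rating    = c(4.2, 3.9)
)

# Hypothetical app folder, created under the session's temporary directory
app_dir <- file.path(tempdir(), "book_recommender")
dir.create(app_dir, showWarnings = FALSE)

# row.names = FALSE avoids an extra index column when the app reads the file back
write.csv(books_toy, file.path(app_dir, "books.csv"), row.names = FALSE)

# Read the file back the same way the app does
books_back <- read.csv(file.path(app_dir, "books.csv"))
names(books_back)
```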
The code for ui.R is given below:
# Loading the required libraries
library(shiny)
library(DT)
# Reading the files
final_tags <- read.csv("final_tags.csv")
books <- read.csv("books.csv")
shinyUI(fluidPage(
# Adding title
titlePanel("Book Recommender"),
# Sidebar
sidebarLayout(
sidebarPanel(
# Genre Selection
selectInput(inputId = "Columns", label = "Which genres do you like?",
unique(final_tags$new_tag_name), multiple = TRUE),
verbatimTextOutput("fiction"),
# Rating Selection
sliderInput(inputId = "range",
label = "Range of ratings that you wish to read?",
min = min(books$average_rating),
max = 5,
value = c(3,5))
),
# Datatable output
mainPanel(
"The top 50 recommended books for the genre(s) selected are",
DT::dataTableOutput(outputId = "bookreco"))
)
)
)
The code for server.R is given below:
#Loading the required libraries
library(shiny)
library(DT)
library(tidyverse)
# Reading the CSV files
books <- read.csv("books.csv")
final_book_tags <- read.csv("final_book_tags.csv")
# Server function
shinyServer(function(input, output) {
datasetInput <- reactive({
# Aggregating all selected genres
if (final_book_tags %>% filter(new_tag_name %in% as.vector(input$Columns)) %>% count() == 0) {
result <<- tibble(goodreads_book_id = c(1),
Genre = c("temp"))
} else{
result <<- final_book_tags %>%
filter(new_tag_name %in% as.vector(input$Columns)) %>%
aggregate(new_tag_name ~ goodreads_book_id, data = ., paste, collapse = ", ") %>%
rename(Genre = new_tag_name)
}
# Filtering the books based on genre and rating
final_book_tags %>%
filter(new_tag_name %in% as.vector(input$Columns)) %>%
group_by(goodreads_book_id) %>%
summarise(num_tags = n()) %>%
ungroup() %>%
left_join(result, by = "goodreads_book_id") %>%
left_join(books, by = "goodreads_book_id") %>%
filter(average_rating >= as.numeric(input$range[1]), average_rating <= as.numeric(input$range[2])) %>%
arrange(desc(num_tags), desc(average_rating)) %>%
select(title, average_rating, goodreads_book_id, Genre ) %>%
mutate(Book_Name = paste0("<a href='",paste0('https://www.goodreads.com/book/show/',goodreads_book_id),"' target='_blank'>", title,"</a>")) %>%
select(Book_Name, Genre, average_rating) %>%
rename( Rating = average_rating, `Book` = Book_Name, `Genre(s)` = Genre)
})
#Rendering the table
output$bookreco <- DT::renderDataTable({
DT::datatable(head(datasetInput(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
})
})
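The ordering rule used by the app (books matching the most selected genres first, then higher rating) can be sketched outside shiny. This is a simplified illustration on hypothetical toy tables, assuming dplyr is installed; it is not the app's exact code.

```r
library(dplyr)

# Hypothetical book-to-genre pairs and book metadata
toy_book_tags <- tibble(
  goodreads_book_id = c(1, 1, 2, 3, 3),
  new_tag_name      = c("fantasy", "romance", "fantasy", "romance", "mystery")
)
toy_books <- tibble(
  goodreads_book_id = c(1, 2, 3),
  title             = c("A", "B", "C"),
  average_rating    = c(3.9, 4.0, 4.2)
)

# Genres picked by a hypothetical user
selected <- c("fantasy", "romance")

reco <- toy_book_tags %>%
  filter(new_tag_name %in% selected) %>%
  group_by(goodreads_book_id) %>%
  # Count matched genres and collapse them into one label per book
  summarise(num_tags = n(), Genre = paste(new_tag_name, collapse = ", ")) %>%
  inner_join(toy_books, by = "goodreads_book_id") %>%
  # Most matched genres first, then highest rating
  arrange(desc(num_tags), desc(average_rating))

reco$title
```

Book A ranks first despite its lower rating because it matches both selected genres; B and C each match one genre and are ordered by rating.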
We split the summary into three sections
To determine how a user decides “to-read” a book, we ran a correlation:
The hypothesis is that popular books (in terms of ratings and comments) feature in the different lists on Goodreads, which motivates users to read these books.