Analysis of the most popular books

Introduction

Problem Statement: People love books. And people enjoy discussing and sharing books that they love. Goodreads is one amazing platform designed for such bibliophiles (and budding bibliophiles!). What I felt was missing from this website is a summary analysis of popular books. The objective of this project is to analyze the data of popular books from goodreads and identify any interesting trends for these books and what makes them popular.
How I did this?: The data for this analysis is a contiguous dataset of 10,000 most popular books, with user level information on 6 million ratings for these books. We also have several generic and user defined genres (/tags/shelves) for each book. A combination of this information helped answer the above questions. The analysis was divided into three parts:

The first analysis was on genres of books and associated authors
Second, a ratings analysis was done across different variables to identify how the books are related
Last, a user-level analysis was done to segment users based on the number of books that they read and to identify common patterns

What can you do with the analysis?:

Identify factors that affect the ratings, patterns among what avid readers actually like to read
Associations between different genres, and what genres to invest in (if you’re a publisher)
Identify segments of bibliophiles based on the number of books they have rated (assuming rated as read)
As an avid reader, find genres that are popular but were off your radar
- If you love that genre, get a recommended book for it
As a newbie, see what people read and hopefully become a reader yourself!

Packages Used

The packages used in the project (currently) are:

data.table: To read the csv files in the fastest possible way
tidyverse : Collection of R packages for data manipulation, exploration and visualization. I am currently using
- dplyr: Data manipulation using filter, joins, summarise etc.
- magrittr: The pipe %>% operator!
- purrr: Apply functions iteratively over lists and vectors
stringr: String replacements and pattern matching
DT : Filtering, pagination, and sorting of data tables in html outputs
knitr : Aligned displays of table in a html doc
kableExtra : Manipulate table styles for good visualizations
gridExtra : Arrange multiple plots
wordcloud : Creating wordclouds
RColorBrewer : Provide colors to the wordcloud created
GGally & corrplot : Plot correlation results

# Loading the packages
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(stringr)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
library(GGally)
library(corrplot)

Data Preparation

Original Data and Import

Description: goodbooks-10k

The original data, goodbooks-10k, was scraped in August 2017 with the aim of providing a dataset for books similar to those available for movies and songs. The creator’s objective was to enable recommendation engines for books, as already exist for music and movies.

The dataset was first published on Kaggle. Contributors then identified anomalies in the data, such as duplicate records and multiple ratings by the same user for the same book. These anomalies were removed, and the clean data was then posted to GitHub. We will load the data from this GitHub repository.

The column information is provided for the key variables. Some assumptions are made for the other variables based on data present (and looking up the goodreads website).

The data is contiguous for 10k books and 50k users. The database has 5 tables, descriptions of which are given below

Table	Description	Notes
books	Data of the 10k most popular books with metadata (author, ratings etc)	Popularity is determined by the total number of users who have rated the book
book_tags	Genres associated with each book along with the number of tags for the genre	Both generic and user-defined tags are included; user-defined tags can be anything
tags	Description of the genres
ratings	User, book, rating level data for 10k books and 50k users	For 10k books and 50k users who have made at least 2 ratings
to_read	User, book pair that a user has marked to read	For 10k books and 50k users

Importing the data

On initial checks, I found that the data sets had blanks, which R could not identify as NA. So I added na.strings to help R identify nulls

# Reading all the datasets from github using fread
ratings   <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/ratings.csv",   na.strings = c("", "NA"), showProgress = F)
book_tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/book_tags.csv", na.strings = c("", "NA"), showProgress = F)
books     <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/books.csv",     na.strings = c("", "NA"), showProgress = F)
tags      <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/tags.csv",      na.strings = c("", "NA"), showProgress = F)
to_read   <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/to_read.csv",   na.strings = c("", "NA"), showProgress = F)

The attributes of the different dataframes are described below

#Creating a List of all tables
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)

# Writing a function to get dimensions
dim_func <- function(x){
  dimension = str_trim(paste0(dim(tbl_list[[x]]), sep = "  ", collapse = ""))
}

# Writing a function to get all column names
names_func <- function(y){  
  vars = str_trim(paste0(names(tbl_list[[y]]), sep = " | ", collapse = ""))
}

# Creating a table
table_attributes <- data_frame(
  Table = names(tbl_list),
  `Rows  Columns` = unlist(lapply(1:5, dim_func)),
  Variables = unlist(lapply(1:5, names_func ))
)

# Removing the list
rm(tbl_list)

# Printing the table
kable(table_attributes, format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
    column_spec(2, width = "12em")

Table	Rows Columns	Variables
books	10000 23	book_id \| goodreads_book_id \| best_book_id \| work_id \| books_count \| isbn \| isbn13 \| authors \| original_publication_year \| original_title \| title \| language_code \| average_rating \| ratings_count \| work_ratings_count \| work_text_reviews_count \| ratings_1 \| ratings_2 \| ratings_3 \| ratings_4 \| ratings_5 \| image_url \| small_image_url \|
book_tags	999912 3	goodreads_book_id \| tag_id \| count \|
tags	34252 2	tag_id \| tag_name \|
ratings	5976479 3	user_id \| book_id \| rating \|
to_read	912705 2	user_id \| book_id \|

Cleaning the Data

Keeping the relevant columns across all tables

As a first step, I would remove the columns that I do not intend to use from the books table. All other tables have only the relevant columns

# Selecting the relevant columns
books <- books %>% select(-c(best_book_id, work_id, isbn, isbn13), -(ratings_1:small_image_url))
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)

Are there missing values?

The next step in cleaning the data was to identify any missing values across datasets. A quick is.na() check reveals that only books has missing values in some columns

# Identifying missing across all tables using a loop
for (x in 1:length(tbl_list)) {
  dd <- colSums(is.na(tbl_list[[x]]))
  if (sum(dd[dd > 0]) > 0) {
    print(paste("For the data -",names(tbl_list[x])))
    print(dd[dd > 0])
  }
}

## [1] "For the data - books"
## original_publication_year            original_title 
##                        21                       590 
##             language_code 
##                      1084

# Removing the list
rm(tbl_list)

I have decided to retain all the columns based on the explanation below

original_publication_year: Missing for only 21 out of the 10,000 books; Retain and take care of this issue during individual analysis
original_title: We have the title column that can be used in its place if the value is missing
language_code: Retain and take care of this issue in individual analysis
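
As a minimal sketch of the fallback described for original_title, the toy data frame below fills a missing original_title from title. Only the column names come from the data; the rows and the display_title column are made up for illustration. In the tidyverse, dplyr::coalesce(original_title, title) does the same thing.

```r
# Toy stand-in for the books table; the rows are illustrative, not real data
books_toy <- data.frame(
  title            = c("Book A (Series, #1)", "Book B", "Book C"),
  original_title   = c("Book A", NA, "Book C"),
  stringsAsFactors = FALSE
)

# Use title wherever original_title is missing (display_title is a hypothetical name)
books_toy$display_title <- ifelse(
  is.na(books_toy$original_title),
  books_toy$title,
  books_toy$original_title
)

print(books_toy$display_title)
```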

Anomaly in the data recorded

Among the different summary() calls that were run across all tables, only two anomalies were observed

The original_publication_year in books has negative values
The count in book_tags which records how many times a book was given the particular genre has negative values

# Display summary statistics of the important columns
summary(books$original_publication_year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1750    1990    2004    1982    2011    2017      21

summary(book_tags$count)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     -1.0      7.0     15.0    208.9     40.0 596234.0

On checking a few records for the original publication year, like

original_publication_year	title	goodreads_book_id
-720	The Odyssey	1381
-750	The Iliad	1371
-500	The Art of War	10534
-380	The Republic	30289
-430	Oedipus Rex (The Theban Plays, #1)	1554

I noticed that these books were actually written in B.C., so the negative years are legitimate and were retained.

Negative values of count in book_tags were set to 0

# Mutating negative values to 0
book_tags[which(book_tags$count < 0), "count"] = 0

The tags dataset

Most of the issues were uncovered in the tags dataset. The tag_name column can have several generic tags and several user-defined tags. User-defined tags generally do not qualify as genres, as most of them are “to read”, “currently reading”, etc., as shown below:

# Identifying the top tags across all books
book_tags %>% 
  group_by(tag_id) %>%
  summarise(num_books = n_distinct(goodreads_book_id)) %>%
  mutate(rank = rank(desc(num_books))) %>%
  filter(rank <= 5) %>%
  left_join(tags, by = "tag_id") %>%
  arrange(rank) %>%
  select(tag_name, num_books) %>%
  ungroup() %>%
  kable()

tag_name	num_books
to-read	9983
favorites	9881
owned	9857
books-i-own	9799
currently-reading	9776

To identify 35 logical and distinct tags that could cover most books, several steps are performed:

Replace the special characters ‘-’ and ‘_’ in the tag name with spaces
Remove non-alphanumeric characters from the tags
Remove tags that contain to read, reading, own, club, favorites etc.
Remove tags that are ya, novel, series (which cannot be tagged as genres)

# Remove the '-' and '_' from the tag name

tags <- tags %>%
  mutate(new_tag_name = str_replace_all(str_replace_all(tags$tag_name, "-", " "),"\\_", " ")) %>%
  select(tag_id, new_tag_name) %>%
  distinct(tag_id, new_tag_name)

# Remove all user defined tags and non alphanumeric ones
list_remove <- "+to read+|+reading+|^[[:digit:]]*$|+i own+|+currently+|+owned+|[^[:alnum:] ]|+favorites+|+club+|+buy+|+library+|+read+|+borrowed+|+abandoned+|+audio+|+ebook+|+kindle+|+default+"
tag_remove <- tags[grepl(list_remove,tags$new_tag_name),]

# Hard code removal of certain tags
tag_remove <- rbind(tag_remove,tags[tolower(tags$new_tag_name) %in% c('ya','novels','series'),])
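
To make the filtering rules concrete, here is a small base R sketch on a handful of made-up tag names. The pattern below is a simplified stand-in for list_remove, not the exact expression used above, and names such as shelf_pattern are hypothetical.

```r
# Hypothetical raw tags, chosen to exercise each rule
raw_tags <- c("to-read", "science-fiction", "books_i_own", "fantasy",
              "2015", "ya", "currently-reading")

# Step 1: replace '-' and '_' with a space
clean <- gsub("[-_]", " ", raw_tags)

# Step 2: flag shelf-style tags, all-digit tags, and tags with other special characters
shelf_pattern <- "read|own|currently|favorites|club|^[[:digit:]]+$|[^[:alnum:] ]"
is_shelf <- grepl(shelf_pattern, clean)

# Step 3: flag tags that are not genres even though they look clean
is_non_genre <- tolower(clean) %in% c("ya", "novels", "series")

# What survives is the candidate genre list
genres <- clean[!is_shelf & !is_non_genre]
print(genres)
```

A substring pattern like this is deliberately broad, so it is worth reviewing what it removes before trusting the result.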

Next, common genres with slightly different names (for example, classics and classic) were collated into a single tag_id in both tags and book_tags. The same procedure was repeated for children’s books, non fiction, graphic novels, science fiction, and dystopian. I created a function and called it once for each group of names using invoke_map

# Writing a function to change the tags
Remove_Genre <- function(x, y) {

  keep_tags <- tags[which(tags$new_tag_name %in% x), "tag_id"]
  new_tag   <- tags[which(tags$new_tag_name %in% y), "tag_id"][1]
  
  tags[which(tags$new_tag_name %in% x), "tag_id"] <<- new_tag
  tags[which(tags$new_tag_name %in% x), "new_tag_name"] <<- y
  
  book_tags[which(book_tags$tag_id %in% keep_tags ), "tag_id"] <<- new_tag
  return(NULL)

}
  
#Creating the list of tags to change
list_change <- list(list(x =  c("children s", "children s books", "childrens", "kids", "childrens books"),
           y = c("children")),
      list(x =  c("classic"),
           y = c("classics")),  
      list(x =  c("graphic novel"),
           y = c("graphic novels")),        
      list(x =  c("non fiction"),
           y = c("nonfiction")),  
      list(x =  c("sci fi", "scifi", "sci fi fantasy"),
           y = c("science fiction")),  
      list(x =  c("dystopia"),
           y = c("dystopian")))

# Calling the invoke map 
invoke_map(Remove_Genre, list_change) 

new_tag   <- tags[which(tags$new_tag_name %in% c("science fiction")), "tag_id"][1]

book_tags[which(book_tags$tag_id == 26894), "tag_id"] <- new_tag
tags[which(tags$tag_id == 26894), "tag_id"] <- new_tag

# Removing the duplicates from tags dataset
tags <- unique(tags)

# summarising the book_tags to get common counts
book_tags <- book_tags %>%
  group_by(goodreads_book_id, tag_id) %>%
  summarise(count = sum(count)) %>%
  ungroup()
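
The same merging of near-duplicate genre names can also be sketched with a lookup vector, which avoids modifying global objects with <<-. This is an alternative illustration on toy data, not the code used to produce the results here; the counts and the canonical and toy_tags names are invented.

```r
# Map each variant spelling to its canonical genre name
canonical <- c(
  "classic"       = "classics",
  "sci fi"        = "science fiction",
  "scifi"         = "science fiction",
  "non fiction"   = "nonfiction",
  "graphic novel" = "graphic novels"
)

# Toy book-level tag counts (values are invented)
toy_tags <- data.frame(
  goodreads_book_id = c(1, 1, 1, 2, 2),
  tag_name          = c("sci fi", "science fiction", "classic", "scifi", "fantasy"),
  count             = c(10, 25, 5, 7, 40),
  stringsAsFactors  = FALSE
)

# Replace variants with the canonical name; leave other tags untouched
hit <- toy_tags$tag_name %in% names(canonical)
toy_tags$tag_name[hit] <- unname(canonical[toy_tags$tag_name[hit]])

# Add up the counts that now share a book and a tag
merged <- aggregate(count ~ goodreads_book_id + tag_name, data = toy_tags, FUN = sum)
print(merged)
```

Keeping the mapping in one named vector makes it easy to review and extend when new variants turn up.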

As a last step, I subset the top 5 genres for each book (based on the count) and the top 35 genres across all books (by the number of books that have the genre in their top 5). These tags were checked for uniqueness and logical sense

# Removing the tags that are not actual genres
book_tags2 <- book_tags %>% anti_join(tag_remove, by = "tag_id")

# Ranking based on count and selecting top 5 genres
subset_tags <- book_tags2 %>% 
  group_by(goodreads_book_id) %>%
  mutate( rr = rank(desc(count))) %>%
  filter(rr <= 5) %>%
  ungroup()
 
# Selecting the top 35 genres across all books
subset_tags %>%  
  group_by(tag_id) %>% 
  summarise(num_of_books = n()) %>% 
  mutate(rank = rank(desc(num_of_books))) %>% 
  filter(rank <= 35) %>%
  ungroup() %>%
  left_join(tags, by = "tag_id") %>% 
  arrange(rank) %>%
  select(new_tag_name) %>%
  t %>%
  paste0

##  [1] "fiction"              "fantasy"              "romance"             
##  [4] "young adult"          "nonfiction"           "mystery"             
##  [7] "classics"             "contemporary"         "science fiction"     
## [10] "historical fiction"   "children"             "thriller"            
## [13] "paranormal"           "chick lit"            "historical"          
## [16] "crime"                "horror"               "humor"               
## [19] "history"              "biography"            "literature"          
## [22] "memoir"               "graphic novels"       "dystopian"           
## [25] "vampires"             "contemporary romance" "adventure"           
## [28] "comics"               "urban fantasy"        "picture books"       
## [31] "short stories"        "suspense"             "philosophy"          
## [34] "new adult"            "epic fantasy"

I’ll join the tag names to the book_tags and discard the tags dataset

# subsetting the top 35 tags
good_tags <- subset_tags %>% 
  group_by(tag_id) %>% 
  summarise(num_of_books = n()) %>% 
  mutate(rank = rank(desc(num_of_books))) %>% 
  filter(rank <= 35) %>%
  ungroup()

# Getting only those tags for the books
temp_book_tags <- book_tags2 %>% semi_join(good_tags, by = "tag_id")
final_tags <- tags %>% semi_join(good_tags, by = "tag_id")

# Joining
final_book_tags <- temp_book_tags %>% 
  left_join(final_tags, by = "tag_id")

Snapshot of the data

Books

datatable(head(books, 500))

Book Tags

datatable(head(final_book_tags, 500))

Ratings

datatable(head(ratings, 500))

To Read

datatable(head(to_read, 500))

Data Description

Below is the snapshot of all the important columns

Variables	Description	Present_In
book_id	User assigned book id based on popularity integer(1 - 10000)	books, ratings, to_read
goodreads_book_id	Book id linked to goodreads	book_tags
books_count	Number of editions of the book	books
authors	Author(s) of the book	books
original_publication_year	First year in which the book was published	books
original_title	First title of the book	books
title	Current running title of the edition selected	books
language_code	Language of the book	books
average_rating	Average Rating of the book (1 - 5)	books
ratings_count	Number of ratings for the selected edition of the book	books
work_ratings_count	Number of ratings across all editions of the work	books
work_text_reviews_count	Number of text reviews across all editions of the work	books
tag_id	ID of the tag	book_tags
count	Number of users who have tagged the genre for the book	book_tags
new_tag_name	Cleaned Tag name	book_tags
user_id	ID of the user (1 - 53424)	ratings, to_read
rating	Rating that the user has given integer(1 - 5)	ratings

Exploratory Data Analysis

What makes up the popular list?

The popular genre list

First, I wanted to understand which genres feature in the top 10,000 popular books on goodreads. Since a book can be associated with multiple genres, I divided them into a main genre and associated genres. The main genre has the most votes for a book, while the associated genres are its sub-genres. The associated genres are limited to 4.

Most of the tags appear in both cases. That is, they are the main genre for some books and associated genre for other books. From the below graph, I assume fiction to thriller as the main genres. Genres from urban fantasy to epic fantasy are considered as the associated tags for the book

If we ignore fiction and non-fiction, people like to read fantasy, mystery and romantic books!

final_book_tags <- final_book_tags %>% 
  group_by(goodreads_book_id) %>%
  mutate(rank = rank(desc(count), ties.method = "first")) %>%
  ungroup() %>%
  filter(rank <= 5) %>%
  arrange(goodreads_book_id, rank) 

#Associated tags for each book
top_5 <- final_book_tags %>%  
  filter(rank > 1) %>%
  within(new_tag_name <- factor(new_tag_name, levels = names(sort(table(new_tag_name), decreasing = F))))

#Main Genre for each book
top_1 <- final_book_tags %>%  
  filter(rank == 1) %>%
  within(new_tag_name <- factor(new_tag_name, levels = names(sort(table(new_tag_name), decreasing = F))))

#Bar graph
ggplot(NULL, aes(x = new_tag_name)) + 
  geom_bar(data = top_1, aes(y = (..count..)/sum(..count..), fill = "Main Genre"),alpha = 0.5) +
  geom_bar(data = top_5, aes(y = (..count..)/sum(..count..), fill = "Associated Genre"),alpha = 0.5) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 0.20), labels = scales::percent) +
  coord_flip() +
  labs(title = "Which Genre features most in the top books?", subtitle = "Main vs. Associated Genre", x = "Genre", y = "Percentage of Books", fill = "Genre Type")

Old books or new books?

As is often the case with movies, I wanted to check whether older books (written before the 1980s) make up a large chunk of the popular books.

Interestingly, 75% of the books are from 1980 or later, indicating a young audience base using the website. The reference can be found here

#Older books are more in the popular lists?:
plot_year <- books %>% 
  select(original_publication_year) %>%
  group_by(original_publication_year) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = original_publication_year, y = count)) +
  geom_line() +
  theme(legend.position = "none") +
  labs(title = "When were the popular books written", x = "Year", 
       y = "Number of Books")

plot_zoom_year <- books %>% 
  select(original_publication_year) %>%
  group_by(original_publication_year) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = original_publication_year, y = count, color = (original_publication_year >= 1983 &  
                                                                original_publication_year <= 2011))) +
  geom_line(aes(group = 1)) +
  theme(legend.position = "none") +
  labs(title = "Zoomed in for 1900 and beyond", x = "Year") +
  ylab(NULL) +
  coord_cartesian(xlim = c(1900,2016)) 

grid.arrange(plot_year, plot_zoom_year, ncol = 2)
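
A claim like the share of books published from 1980 onward can be checked directly from the year column. The sketch below uses a short made-up vector of years in place of books$original_publication_year; mean() of a logical vector gives the proportion of TRUE values.

```r
# Invented publication years, including a B.C. year and a missing value
years <- c(-720, 1925, 1960, 1985, 1997, 2003, 2008, 2011, 2014, NA)

# Share of books with a known year that were first published in 1980 or later
share_recent <- mean(years >= 1980, na.rm = TRUE)

# Quartiles show where the bulk of the distribution sits
year_quartiles <- quantile(years, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

print(share_recent)
print(year_quartiles)
```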

The Goldmine Authors

After we’ve found the genres that are popular, I wanted to know which authors appear more in the popular lists. Authors were split for books that had multiple authors.

James Patterson, Stephen King, and Nora Roberts appear frequently in this list. These authors are famous for writing fiction, fantasy, & romantic novels respectively, which moves them to the top of our goldmine author list.

If you’re a publisher, you know whose books to publish. If you’re a newbie, you know which author to look for!

#Authors that feature in the famous list
authors <- books %>% 
  select(authors) %>%
  t() %>%
  as.character() %>%
  str_split(",\\s*") %>%   # split on commas and drop the whitespace after them
  unlist() %>%
  factor() %>%
  table() %>%
  as.data.frame() %>%
  rename("Author" = ".", "Freq" = "Freq") %>%
  filter(Freq >= 15) %>%
  arrange(desc(Freq)) 

wordcloud(authors$Author, authors$Freq, scale = c(2.5,0.15), random.order =  FALSE, colors = brewer.pal(8,"Dark2"), 
          rot.per = .20)
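
One detail worth checking when splitting author lists: if names are separated by a comma followed by a space, splitting on the comma alone leaves a leading space on every author after the first, so the same person can be counted under two different strings. A minimal base R sketch with placeholder names:

```r
# Illustrative author strings in the "Name, Name" style
author_strings <- c("Author One, Author Two", "Author Two")

# Splitting on the comma alone keeps the space before the second name
naive <- unlist(strsplit(author_strings, ","))

# Trimming whitespace makes identical names match again
trimmed <- trimws(naive)

print(table(naive))
print(table(trimmed))
```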

Genre within a Genre (Genre-ception!)

After we’ve got a list of genres and associated genres, I wanted to check the association between the genres. This metric counts how many times a genre features alongside another one. invoke_map was used to run across the full list of tags.

The strongest associations were observed between science fiction & paranormal with fantasy, thriller with mystery (duh), and graphic novels with comics.

Interestingly, Chick lit (or Chick Literature) authors pivot their books around romance only. The young adult genre has a wider focus involving fantasy, romance, paranormal and adventure books.

#Picking the 15/35 genres that have the top tag for 93% books
genres_for_analysis <- top_1 %>% 
  group_by(new_tag_name) %>%
  summarise(num_books = n()) %>%
  ungroup() %>%
  arrange(num_books) %>%
  top_n(n = 15, wt = num_books)

#Creating an empty dataframe
assoc_table <- data_frame(main = character(), assoc = character(), val = as.integer(character()))

# Writing the function for association mapping that checks each genre
association_map <- function(x, df){
  book_ids <- df[which(df$new_tag_name == x), "goodreads_book_id"]
  
  all_genres <- df[which(df$goodreads_book_id %in% (book_ids$goodreads_book_id)),] %>% 
    group_by(new_tag_name) %>%
    summarise(n = n())
  
  temp_table <- data_frame( main = rep(x, times = nrow(all_genres)), assoc = as.character(all_genres$new_tag_name), val = all_genres$n)

  assoc_table <<- union(assoc_table, temp_table)
  return(NULL)
}

#Calling the function on all genres
invoke_map(association_map, final_tags$new_tag_name, df = top_5)

# Filter and spread

assoc_table <- assoc_table %>%
  filter(main != assoc, main %in% genres_for_analysis$new_tag_name, !assoc %in% c('fiction')) %>%
  spread(assoc, val)                                  #Spread the table to get the lower matrix

assoc_table[is.na(assoc_table)]  <- 0

assoc_title <- assoc_table[,1]
assoc_head <- assoc_table[,-1]
assoc_head <- prop.table(as.matrix(assoc_head), margin = 1)

#Binding back to the first column
assoc_table <- assoc_head %>%
  cbind(assoc_title) %>%                       #Binding back the main column
  gather(key = "assoc", value = "val", -main) #Gather the table 

ggplot(assoc_table, aes(y = main, x = assoc)) +
  theme_bw() +
  geom_tile(aes(fill = val), color = 'white') +
  scale_fill_gradient(low = 'white', high = 'darkblue', space = 'Lab') +
  theme(axis.text.x = element_text(angle = 90), panel.grid.major = element_line(color = '#eeeeee')) +
  labs(title = "Association between Genres", y = "Main Genre", x = "Associated genre", fill = "Association Score")
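
The association score can also be sketched as a co-occurrence matrix. The toy example below is a simplified, symmetric version of the idea (unlike the analysis above, it does not distinguish main from associated genres) and uses invented book-genre pairs. crossprod() on a book-by-genre incidence matrix counts how often two genres share a book, and prop.table() turns each row into shares.

```r
# Invented book-genre pairs
toy <- data.frame(
  book  = c(1, 1, 1, 2, 2, 3, 3),
  genre = c("fantasy", "young adult", "romance",
            "fantasy", "young adult",
            "mystery", "thriller"),
  stringsAsFactors = FALSE
)

# Book-by-genre incidence matrix: 1 if the book carries the genre, else 0
incidence <- unclass(table(toy$book, toy$genre))

# Genre-by-genre counts of books shared by each pair of genres
co_occur <- crossprod(incidence)
diag(co_occur) <- 0  # a genre trivially co-occurs with itself

# Row-wise shares: for each genre, how its co-occurrences split across the others
assoc_share <- prop.table(co_occur, margin = 1)
print(round(assoc_share, 2))
```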

What are the Ratings across popular books?

What is the general distribution?

First, I’d like to see the general distribution of ratings across all 10,000 books. A weighted average was used for the mean, as the number of votes for each book varies. The ratings follow a roughly bell-shaped distribution with a weighted mean rating of 4.03.

We now know that the popular books are indeed popular because they are a good read. However, we do see a wide range from 2.47 to 4.82

mean_rat <- books %>%
  summarise(average_rating = sum(average_rating * work_ratings_count)/sum(work_ratings_count)) %>%
  round(digits = 2)

ggplot(data =NULL, aes(x = average_rating)) +
  geom_histogram(data = books,bins = 50) +
  geom_vline(data = books, aes(xintercept = sum(average_rating*work_ratings_count)/sum(work_ratings_count)), color = "blue", linetype = "dashed", size = 0.75) +
  geom_text(data = mean_rat, aes(label = paste("Mean =",average_rating) , y = 820), hjust = -0.17, color = "blue") +
  labs(title = "How are ratings distributed for the popular books", x = "Average Rating", y = "Number of Books")
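
To see why the weighted average matters, consider a minimal sketch with two invented books: one with many ratings and one with few. The weighted mean is sum(rating * count) / sum(count), which base R provides as weighted.mean().

```r
# Two hypothetical books: a widely rated one and a rarely rated one
average_rating     <- c(4.5, 3.0)
work_ratings_count <- c(900, 100)

# Simple mean treats both books equally
simple_mean <- mean(average_rating)

# Weighted mean gives each book influence proportional to its number of ratings
weighted_by_hand <- sum(average_rating * work_ratings_count) / sum(work_ratings_count)
weighted_builtin <- weighted.mean(average_rating, w = work_ratings_count)

print(c(simple = simple_mean, weighted = weighted_builtin))
```

The widely rated book pulls the weighted mean toward its own rating, which is the behaviour wanted when books differ greatly in the number of votes.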

Which are the top rated genres?

Now that we know the top genres, a more directed approach to choosing your next book would be to see how other people rate these genres.

For this analysis, I have considered the top genre for each book and selected the top 15 genres to analyze. The top 15 genres cover 91% of the books as shown below

genres_for_analysis %>%
  summarise(`Books Covered` = sum(num_books)) %>%
  kable(format = "html") %>%
  kable_styling(bootstrap_options = "striped")

Books Covered
9059

The below graph explains the variation across genre ratings. From the boxplot below, we infer that:

Paranormal and Graphic novels are the genres which have the best ratings. The average rating of these genres is higher than most other genres
Chick lit has the lowest ratings among all genres (I think there might be bias largely due to disapprovals from critics)
Fiction as a genre is subject to many outlier ratings, which decreases the overall average

#Getting the top 15 genres
filter_books <- top_1 %>%
  filter(new_tag_name %in% genres_for_analysis$new_tag_name)

#Creating the boxplot
ratings_with_genre <- books %>%
  inner_join(filter_books, by = "goodreads_book_id")

ggplot(ratings_with_genre, aes(x = new_tag_name, y = average_rating)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "How are ratings distributed across genres?", x = "Genre", y = "Average Rating")

Is old gold?

The next analysis was to see if older books actually get better ratings than newer ones. The line graph below shows that this is not the case. In fact, we see a slight increase in ratings across the years. This is subject to the low count of books prior to the 1980s.

So sadly, old is actually not gold for books. The quality of the book does matter for the bibliophiles

ratings_with_genre %>% 
  group_by(original_publication_year) %>%
  summarise(num_books = n(), rating = sum(average_rating*work_ratings_count)/sum(work_ratings_count)) %>%
  ungroup() %>%
  ggplot(aes(x = original_publication_year, y = rating)) +
  geom_line() +
  coord_cartesian(xlim = c(1900,2017)) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Rating vs. Year of publication", x = "Year of publication", y = "Average rating")

Is rating even associated with any factor?

I ran a simple correlation to understand whether ratings are influenced by any of the available factors. I treated a coefficient above 0.3 as evidence of a correlation.

However, the average rating is not correlated with any of the factors that we have, and is probably driven by factors outside this data

books_numeric <- books %>%
  mutate(num_authors = str_count(books$authors, ",") + 1) %>%
  select(books_count, original_publication_year, average_rating, ratings_count, work_ratings_count, work_text_reviews_count, num_authors)
 
books_numeric <- books_numeric[complete.cases(books_numeric),]

ggcorr(books_numeric,  nbreaks = 5, label = TRUE, label_size = 3, hjust = 0.7, label_round = 4, size = 2.5) +
  labs(title = "Correlation of average rating with other variables")
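
The 0.3 threshold can be applied programmatically instead of being read off the plot. The sketch below uses the built-in mtcars dataset purely as a stand-in for books_numeric, with mpg playing the role of average_rating; target and strong are hypothetical names.

```r
# mtcars is a built-in dataset used here only as a stand-in for books_numeric
corr_matrix <- cor(mtcars, use = "complete.obs")

# Correlations of one target column with every other column
target <- corr_matrix["mpg", colnames(corr_matrix) != "mpg"]

# Keep the variables whose absolute correlation exceeds the 0.3 threshold
strong <- target[abs(target) > 0.3]
print(round(strong, 2))
```

An empty result from the same filter on the books data would correspond to the conclusion above that no available variable clears the threshold.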

Part 1 or Part 2 or … Part n of the book?

We can identify whether a book is part of a series by checking cases where the title differs from the original title and contains a series number (for example, “#1”). For such books, the rating generally increases up to the 5th book in the series and then stabilizes from the 6th to the 10th.

So if you start reading a series, like Harry Potter, you tend to like the books even more. That’s how they create the fan base, don’t they?

# Getting the books that belong to a series
books_editions <- filter(books, original_title != title, str_detect(title, "#"))

# Getting the number of the book within the series (stored as edition)
start <- regexpr("\\#[^\\#]*$", books_editions$title)
end <-  regexpr("\\)[^\\)]*$", books_editions$title)

books_editions$edition  <- as.numeric(str_sub(books_editions$title, start = start + 1, end = end - 1))
books_editions <- books_editions[which(books_editions$edition %in% c(1:10)),]

correlation <- round(cor(books_editions$edition, books_editions$average_rating ), digits = 3)

ggplot(data = books_editions, aes(x = factor(edition), y = average_rating)) +
  geom_boxplot(notch = TRUE) +
  geom_rect(xmin = 9.18, xmax = 10.32, ymin = 4.68, ymax = 4.82, color = "blue", fill = "white") +
  annotate("text",  x = 9.75, y = 4.75, label =  paste("Corr:",as.character(correlation)), color = "blue") +
  labs(title = "What about Book Series?", subtitle = "Ratings by Book Series", x = "Number in the Series", 
       y = "Average Rating")
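
The series number can also be pulled out with a single capture group. This is a sketch in base R, assuming titles end with a pattern like "(Series Name, #N)"; the first title appears in the data shown earlier, while the second is a made-up placeholder.

```r
titles <- c(
  "Oedipus Rex (The Theban Plays, #1)",
  "Hypothetical Sequel (Some Series, #2)",  # invented title for illustration
  "The Odyssey"                             # no series marker
)

# Capture the digits that follow '#' and precede the closing parenthesis
matches <- regmatches(titles, regexec("#([0-9]+)\\)[[:space:]]*$", titles))

# Each match holds the full match and the captured group; no match gives NA
series_no <- vapply(
  matches,
  function(m) if (length(m) == 2) as.numeric(m[2]) else NA_real_,
  numeric(1)
)

print(series_no)
```

This pattern only handles whole numbers, so titles with markers such as ranges or decimals come back as NA and would need their own rule.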

What about the bibliophiles?

Bibliophiles form the most important component of goodreads. They are the people who actually rate and recommend books so that everyone can have a good read and avoid the bad ones.

How many books do these users rate/read?

It is assumed that if a user rates a book, he/she has read the book. The analysis further assumes books read as the metric.

The first analysis is to check the distribution of the number of books that people read. Similar to ratings across books, the number of books that users read follows a roughly bell-shaped distribution with a mean of 111.9 (Well that’s a lot of books!). The minimum number of books that a person has read is 19, while the maximum is 200.

To reiterate, since this is a contiguous dataset, these numbers might not reflect the actual population parameters.

# Grouping the ratings and books read by user
rating_user <- ratings %>%
  group_by(user_id) %>%
  summarise(n = n_distinct(book_id), average_rating = mean(rating)) %>%
  ungroup()

#Calculating the min, max, and mean
mean_b <- mean(rating_user$n)
min_b <-  min(rating_user$n)
max_b <-  max(rating_user$n)

#How many books do users read
ggplot(rating_user, aes(x = n)) +
  geom_histogram(bins = 40) +
  geom_vline(aes(xintercept = mean_b), color = "blue") +
  geom_vline(aes(xintercept = min_b), color = "tan2") +
  geom_vline(aes(xintercept = max_b), color = "tan2") +
  annotate("text", x = mean_b + 15, y = 4500, label = paste("Mean =", round(mean_b, 1)), color = "blue") +
  annotate("text", x = min_b + 10, y = 4500, label = paste("Min =", round(min_b, 1)), color = "tan2") +
  annotate("text", x = max_b - 10, y = 4500, label = paste("Max =", round(max_b, 1)), color = "tan2") +
  labs(title = "How many books do users read", x = "Number of books read", y = "Count of users")

Segregating the bibliophiles

We split the users into quintiles based on the number of books that they’ve read. People are distributed into Q1-Q5, with Q1 the lowest quintile

# Assigning the quintiles (stored in a column named Quartile)
rating_user$Quartile <- with(rating_user, factor(findInterval(n, c(-Inf,
                     quantile(n, probs=c(0.2, .4, .6, .8)), Inf)), labels = c("Q1", "Q2", "Q3", "Q4","Q5")))

rating_user %>% 
  group_by(Quartile) %>%
  summarise(Mean_Books = mean(n), Min_Books = min(n), Max_Books = max(n)) %>%
  kable(format = "html") %>%
  kable_styling(bootstrap_options = "striped")

Quartile	Mean_Books	Min_Books	Max_Books
Q1	75.87592	19	91
Q2	98.33239	92	104
Q3	110.39975	105	116
Q4	123.86480	117	132
Q5	148.92101	133	200
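
To show how the interval assignment works, here is a small sketch with ten invented book counts. quantile() supplies the four inner cut points, and findInterval() returns, for each user, the index of the interval the count falls into.

```r
# Invented numbers of books read by ten users
n_books <- c(19, 50, 92, 100, 105, 110, 117, 130, 133, 200)

# Inner cut points at the 20th, 40th, 60th and 80th percentiles
cut_points <- quantile(n_books, probs = c(0.2, 0.4, 0.6, 0.8))

# Interval index 1..5 for each user, labelled Q1 (lowest) to Q5 (highest)
quintile <- factor(
  findInterval(n_books, c(-Inf, cut_points, Inf)),
  levels = 1:5,
  labels = c("Q1", "Q2", "Q3", "Q4", "Q5")
)

print(table(quintile))
```

With heavily tied counts the groups will not be exactly equal in size, since users with the same count always land in the same group.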

Critical - Critical as you read

As the title reveals, users tend to get slightly more critical of the books they read as the number of books read increases. We observe a steady decline in ratings across the 5 quintiles.

New readers tend to rate the book generously and begin to get critical as they start reading good content.

# Quartile level rating
rating_user %>%
  group_by(Quartile) %>%
  summarise(mean_avg = sum(average_rating*n)/sum(n), min = min(n), max = max(n)) %>%
  ungroup() %>%
  ggplot(aes(x = paste0(Quartile, " (", min," - ", max, ")"), y = mean_avg)) +
  geom_bar(stat = "identity") +
  coord_cartesian(ylim = c(3.5, 4.02)) +
  labs(title = "How do users rate books?", x = "Number of books read", y = "Average Rating given")

Is there a difference in the reading patterns?

Our next objective was to identify any differences between the reading patterns of our bibliophile qunitles. The percentage of books read was calcuated across the genre and quintile. The top 5 genres in each quintiles are selected for analysis.

Some interesting observations were:

The percentage of fiction books read increases with the number of books a user has read
People tend to start with romance novels. However, the craze seems to fade once people read more books
We see children’s books among the top 5 genres for the last 3 quintiles
- This suggests that parents tend to rate these books a lot, leading to children’s books showing up in the top lists
Young adult books feature in the top 5 genres for quintile 2, possibly because teenagers read a little more than the users in the lowest quintile
Fiction, Non-Fiction, Classics, and Fantasy are common across all quintiles

rating_w_tags <- ratings %>% 
  left_join(rating_user, by = "user_id") %>%
  left_join(books, by = "book_id") %>%
  left_join(top_1, by = "goodreads_book_id")

rating_w_tags %>%
  group_by(Quartile, new_tag_name) %>%
  summarise( n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup() %>%
  group_by(Quartile) %>%
  mutate(rank = rank(desc(freq))) %>%
  filter(rank <= 5) %>%
  ungroup() %>%
  arrange(Quartile, rank) %>%
  ggplot(aes(x = reorder(new_tag_name, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~ Quartile, scales = "free_x", ncol = 2) +
  labs(title = "What books do the segmented users read?", x = "Top 5 genres", 
       y = "Percentage of books read in the quintile")
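
The share-and-rank step can be tried on a small example. The sketch below uses a made-up toy table and swaps the rank() and filter() combination for dplyr's slice_max(), which is one way to guarantee a fixed number of rows per group when two genres tie; this is an alternative I am assuming is acceptable here, not the method used in the chunk above.

```r
library(dplyr)

# Hypothetical ratings already joined to a quintile and a genre (toy values)
toy <- data.frame(
  Quartile     = c("Q1", "Q1", "Q1", "Q1", "Q5", "Q5", "Q5", "Q5"),
  new_tag_name = c("fiction", "fiction", "romance", "fantasy",
                   "fiction", "fiction", "fiction", "classics")
)

genre_share <- toy %>%
  count(Quartile, new_tag_name) %>%               # ratings per quintile and genre
  group_by(Quartile) %>%
  mutate(freq = n / sum(n)) %>%                   # share of the quintile's ratings
  slice_max(freq, n = 2, with_ties = FALSE) %>%   # top 2 here; the report keeps 5
  ungroup()

genre_share
```

Note that freq is a share of ratings within the quintile, not a share of users.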

What makes a book worth reading?

Having identified what the different segments of users read, our last analysis looks at what makes a book worth reading.

This analysis is carried out by counting the number of users who have marked each book as “to-read” on goodreads. This variable is then compared against the other attributes of a book to see if a relationship exists.

A strong correlation exists between the number of users who have marked a book as to-read and the number of text reviews the book has received: books with more text reviews tend to be on more to-read lists. Interestingly, a book's average rating shows little correlation with how many users want to read it.

Other variables that correlate with the to-read count are the number of editions of the book (0.3) and the number of users that have rated the book (0.63).

The hypothesis is that popular books (in terms of ratings and comments) feature in the different lists on goodreads, which motivates users to read these books.

So if you launch a new book, it may help to generate more hype for it in terms of comments and ratings.

# Get the number of users who have marked a book as to read
to_read_total <- to_read %>% 
  group_by(book_id) %>%
  summarise(number_users = n_distinct(user_id))

# Join with the attributes of the book
book_tr <- books %>%
  left_join(to_read_total, by = "book_id") %>%
  select(number_users, books_count, original_publication_year, average_rating, ratings_count, work_ratings_count, work_text_reviews_count)

# Correlation
correlation <- cor(book_tr[complete.cases(book_tr),])
corrplot(correlation, type = "lower", addCoef.col = "black", diag = FALSE, number.cex = 0.8,tl.srt = 45, tl.col = "black", tl.cex = 0.6, title = "Correlation of books marked as to read", mar = c(0,0,1,0))
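
Because the to-read counts come from a left join, books that nobody marked as to-read carry NA and are dropped by complete.cases() before the correlation is computed. The sketch below, on a made-up data frame, shows that filtering with complete.cases() gives the same matrix as passing use = "complete.obs" to cor():

```r
# Hypothetical book attributes; the NA mimics a book with no to-read entries
book_toy <- data.frame(
  number_users   = c(10, 25, NA, 40, 55),
  ratings_count  = c(100, 240, 150, 390, 600),
  average_rating = c(4.1, 3.8, 4.4, 3.9, 4.0)
)

# Drop incomplete rows by hand, as in the chunk above
cor_manual <- cor(book_toy[complete.cases(book_toy), ])

# Let cor() do the casewise deletion
cor_builtin <- cor(book_toy, use = "complete.obs")

round(cor_manual, 2)
```

Either form excludes books without to-read data from the analysis. Treating those books as having zero to-read users instead would be a different modelling choice.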

Get a book recommendation

The shiny app below takes one or more genres from the user and outputs recommended books for those genres.

The books are sorted first by the number of selected genres they match (most genres first) and then by rating.

Clicking on the name of the book will direct you to the goodreads page of the book.

The app is hosted here

Book recommender app

Code for the app

First, the data frames books, final_tags, and final_book_tags need to be written to CSV files and placed in the same folder as the ui and server code.

The code for ui.R is given below:

# Loading the required libraries
library(shiny)
library(DT)

# Reading the files
final_tags <- read.csv("final_tags.csv")
books <- read.csv("books.csv")


shinyUI(fluidPage(
  # Adding title
  titlePanel("Book Recommender"),
  
  # Sidebar 
  sidebarLayout(
    sidebarPanel(
      # Genre Selection
      
      selectInput(inputId = "Columns", label = "Which genres do you like?",
                  unique(final_tags$new_tag_name), multiple = TRUE),
      verbatimTextOutput("fiction"),
      
      # Rating Selection
      sliderInput(inputId = "range",
                  label = "Range of ratings that you wish to read?",
                  min = min(books$average_rating),
                  max = 5,
                  value = c(3,5))
      
    ),
    
    # Datatable output
    mainPanel(
      "The top 50 recommended books for the genre(s) selected are",
      DT::dataTableOutput(outputId = "bookreco"))
    
  )
)
)

The code for server.R is given below:

#Loading the required libraries
library(shiny)
library(DT)
library(tidyverse)

# Reading the CSV files
books <- read.csv("books.csv")
final_book_tags <- read.csv("final_book_tags.csv")

# Server function
shinyServer(function(input, output) {
  
  datasetInput <- reactive({
    
    # Aggregating all selected genres
    selected <- final_book_tags %>%
      filter(new_tag_name %in% as.vector(input$Columns))
    
    if (nrow(selected) == 0) {
      # Placeholder so the joins below still work when no genre is selected
      result <- tibble(goodreads_book_id = c(1),
                       Genre = c("temp"))
    } else {
      # One row per book, with its matching genres collapsed into one string
      result <- selected %>% 
        aggregate(new_tag_name ~ goodreads_book_id, data = ., paste, collapse = ", ") %>%
        rename(Genre  = new_tag_name)
    }
    
    # Filtering the books based on genre and rating
    final_book_tags %>% 
      filter(new_tag_name %in% as.vector(input$Columns)) %>%
      group_by(goodreads_book_id) %>%
      summarise(num_tags = n()) %>%
      ungroup() %>%
      left_join(result, by = "goodreads_book_id") %>%
      left_join(books, by = "goodreads_book_id") %>%
      filter(average_rating >= as.numeric(input$range[1]), average_rating <= as.numeric(input$range[2])) %>%
      arrange(desc(num_tags), desc(average_rating)) %>%
      select(title,  average_rating,  goodreads_book_id, Genre ) %>%
      mutate(Book_Name = paste0("<a href='https://www.goodreads.com/book/show/", goodreads_book_id, "' target='_blank'>", title, "</a>")) %>%
      select(Book_Name, Genre, average_rating) %>%
      rename( Rating = average_rating, `Book` = Book_Name, `Genre(s)` = Genre)
    
    
  })
  
  #Rendering the table
  output$bookreco <- DT::renderDataTable({
    
    DT::datatable(head(datasetInput(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
  })
})
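
The ranking logic inside the reactive can also be written as a plain function, which makes it possible to test outside Shiny. The sketch below is only an illustration: recommend_books() and the toy data frames are hypothetical and not part of the app, although the column names follow the ones used in server.R.

```r
library(dplyr)

# Hypothetical helper (not in the original app): same ranking idea as server.R,
# written as a plain function so it can be run without Shiny
recommend_books <- function(book_tags, books, genres, rating_range = c(3, 5)) {
  book_tags %>%
    filter(new_tag_name %in% genres) %>%
    group_by(goodreads_book_id) %>%
    summarise(num_tags = n(),                                  # genres matched
              Genre = paste(new_tag_name, collapse = ", ")) %>%
    left_join(books, by = "goodreads_book_id") %>%
    filter(average_rating >= rating_range[1],
           average_rating <= rating_range[2]) %>%
    arrange(desc(num_tags), desc(average_rating)) %>%          # most genres, then rating
    select(title, Genre, average_rating)
}

# Toy inputs (made up for illustration)
toy_tags <- data.frame(
  goodreads_book_id = c(1, 1, 2, 3),
  new_tag_name = c("fantasy", "classics", "fantasy", "romance"),
  stringsAsFactors = FALSE
)
toy_books <- data.frame(
  goodreads_book_id = c(1, 2, 3),
  title = c("Book A", "Book B", "Book C"),
  average_rating = c(3.9, 4.5, 4.2),
  stringsAsFactors = FALSE
)

recs <- recommend_books(toy_tags, toy_books, genres = c("fantasy", "classics"))
recs
```

Here Book A comes first despite its lower rating because it matches both selected genres, which is the same ordering rule the app applies.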

Summary

We split the summary into three sections

Genre

The most read books are from the genres fiction, non-fiction, fantasy, mystery, and romance
Contrary to popular belief, most books are not from the old times; rather, 75% of the books are from 1980 or later
James Patterson, Stephen King, and Nora Roberts appear frequently in the popular list

Ratings

Popular books are indeed popular because they are a good read; the average rating for these popular books is 4.03
Paranormal and Graphic novels are the genres with the best ratings. The average rating of these genres is higher than that of most other genres
For books that are part of a series, the average rating increases up to book 5 of the series, suggesting that quality keeps improving until the 5th part
Ratings show no clear association with the other factors present in the dataset (publication year, books count, total ratings, and total reviews)
Hence the hypothesis is that the quality of a book matters more than its other attributes

User level

The number of books that a user reads follows an approximately normal distribution with a mean of 111.9 (a very large number)
The bibliophiles were segregated into quintiles, and the insights below are for the quintiles
Users tend to become slightly more critical as the number of books they have read increases. We observe a steady decline in the average rating across the 5 quintiles
People tend to start with romance novels; however, the craze seems to fade once people read more books
We see children’s books among the top 5 genres for the last 3 quintiles
- This suggests that parents tend to rate these books a lot, leading to children’s books showing up in the top lists

To understand how a user decides to mark a book as “to-read”, we ran a correlation analysis:

A strong correlation exists between the number of users who have marked a book as to-read and the number of text reviews the book has received
Other variables that correlate with the to-read count are the number of editions of the book (0.3) and the number of users that have rated the book (0.63)

The hypothesis is that popular books (in terms of ratings and comments) feature in the different lists on goodreads, which motivates users to read these books

Future Analysis

Since we couldn’t find any associations for the rating of a book, the next step would be to get the text of the books and run a text analysis to see whether common words are associated with better ratings
The “to-read” count for a book could be predicted from the variables identified in the correlation analysis

: What will you read next?

Rohit Jain

December 3, 2017
