
What will you read next?

Rohit Jain

December 3, 2017

Analysis of the most popular books

Introduction

  1. Problem Statement: People love books, and they enjoy discussing and sharing the books that they love. Goodreads is one amazing platform designed for such bibliophiles (and budding bibliophiles!). What I felt was missing from this website is a summary analysis of popular books. The objective of this project is to analyze the data of popular books from Goodreads, identify any interesting trends for these books, and understand what makes them popular.

  2. How I did this?: The data for this analysis is a contiguous dataset of 10,000 most popular books, with user level information on 6 million ratings for these books. We also have several generic and user defined genres (/tags/shelves) for each book. A combination of this information helped answer the above questions. The analysis was divided into three parts:

  • The first analysis was on genres of books and associated authors
  • Second, a ratings analysis was done across different variables to identify how the books are related
  • Last, a user-level analysis was done to segment users based on the number of books that they read and identify common patterns
  3. What can you do with the analysis?:
  • Identify factors that affect the ratings, patterns among what avid readers actually like to read
  • Associations between different genres, and what genres to invest in (if you’re a publisher)
  • Identify segments of bibliophiles based on the number of books they have rated (assuming rated as read)
  • As an avid reader, find genres that are popular but were off your radar
    • If you love that genre, get a recommended book for it
  • As a newbie, see what people read and hopefully become a reader yourself!

Packages Used

The packages used in the project (currently) are:

  • data.table: To read the csv files in the fastest possible way
  • tidyverse : Collection of R packages for data manipulation, exploration and visualization. I am currently using
    • dplyr: Data manipulation using filter, joins, summarise etc.
    • magrittr: The pipe %>% operator!
    • purrr: Apply functions over the elements of lists and vectors
  • stringr: String replacements and pattern matching
  • DT : Filtering, pagination, and sorting of data tables in html outputs
  • knitr : Aligned displays of table in a html doc
  • kableExtra : Manipulate table styles for good visualizations
  • gridExtra : Arrange multiple plots
  • wordcloud : Creating wordclouds
  • RColorBrewer : Provide colors to the wordcloud created
  • GGally & corrplot : Plot correlation results
# Loading the packages
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(stringr)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
library(GGally)
library(corrplot)

Data Preparation

Original Data and Import

Description: goodbooks-10k

The original data, goodbooks-10k, was scraped in August 2017, with the aim of providing a database for books similar to those available for movies and songs. The creator’s objective was to enable recommendation engines for books, similar to those for music and movies.

The dataset was first published on Kaggle. Contributors then identified anomalies in the data, such as duplicate records and multiple ratings by the same user for the same book. Such anomalies were removed, and the clean data was then posted to GitHub. We will load the data from this GitHub repository.

The column information is provided for the key variables. Some assumptions are made for the other variables based on data present (and looking up the goodreads website).

The data is contiguous for 10k books and 50k users. The database has 5 tables, descriptions of which are given below

Table | Description | Notes
books | Data of the 10k most popular books with metadata (author, ratings etc.) | Popularity is determined by the total number of users who have rated the book
book_tags | Genres associated with each book, along with the number of tags for the genre | Both generic and user defined tags are included; user defined tags can be anything
tags | Description of the genres |
ratings | User, book, rating level data for 10k books and 50k users | For 10k books and 50k users who have made at least 2 ratings
to_read | User, book pairs that a user has marked to read | For 10k books and 50k users

Importing the data

On initial checks, I found that the data sets had blanks, which R could not identify as NA. So I added na.strings to help R identify nulls

# Reading all the datasets from github using fread
ratings   <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/ratings.csv",   na.strings = c("", "NA"), showProgress = F)
book_tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/book_tags.csv", na.strings = c("", "NA"), showProgress = F)
books     <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/books.csv",     na.strings = c("", "NA"), showProgress = F)
tags      <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/tags.csv",      na.strings = c("", "NA"), showProgress = F)
to_read   <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/to_read.csv",   na.strings = c("", "NA"), showProgress = F)
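
To see what na.strings changes, here is a small offline sketch. It uses base R's read.csv on an inline string instead of fread on the real files, and the toy rows are made up for illustration. fread accepts an na.strings argument in the same way.

```r
# Toy CSV with one blank text cell and one blank numeric cell (made-up data)
csv_text <- "title,language_code,year\nThe Odyssey,,-720\nDune,eng,\n"

# Default settings: the blank text cell is read as an empty string, not NA
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings, blank cells become NA in every column
clean <- read.csv(text = csv_text, na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)

# Compare the number of missing values detected per column
colSums(is.na(raw))
colSums(is.na(clean))
```

Without the extra argument, a later is.na() check would miss the blank language code entirely.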

The attributes of the different dataframes are described below

#Creating a List of all tables
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)

# Writing a function to get dimensions
dim_func <- function(x){
  dimension = str_trim(paste0(dim(tbl_list[[x]]), sep = "  ", collapse = ""))
}

# Writing a function to get all column names
names_func <- function(y){  
  vars = str_trim(paste0(names(tbl_list[[y]]), sep = " | ", collapse = ""))
}

# Creating a table
table_attributes <- data_frame(
  Table = names(tbl_list),
  `Rows  Columns` = unlist(lapply(1:5, dim_func)),
  Variables = unlist(lapply(1:5, names_func ))
)

# Removing the list
rm(tbl_list)

# Printing the table
kable(table_attributes, format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
    column_spec(2, width = "12em")
Table Rows Columns Variables
books 10000 23 book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | title | language_code | average_rating | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
book_tags 999912 3 goodreads_book_id | tag_id | count |
tags 34252 2 tag_id | tag_name |
ratings 5976479 3 user_id | book_id | rating |
to_read 912705 2 user_id | book_id |

Cleaning the Data

Keeping the relevant columns across all tables

As a first step, I would remove the columns that I do not intend to use from the books table. All other tables have only the relevant columns

# Selecting the relevant columns
books <- books %>% select(-c(best_book_id, work_id, isbn, isbn13), -(ratings_1:small_image_url))
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)

Are there missing values?

The next step in cleaning the data was to identify any missing values across datasets. A quick is.na() check reveals that only books has missing values in some columns

# Identifying missing across all tables using a loop
for (x in 1:length(tbl_list)) {
  dd <- colSums(is.na(tbl_list[[x]]))
  if (sum(dd[dd > 0]) > 0) {
    print(paste("For the data -",names(tbl_list[x])))
    print(dd[dd > 0])
  }
}
## [1] "For the data - books"
## original_publication_year            original_title 
##                        21                       590 
##             language_code 
##                      1084
# Removing the list
rm(tbl_list)

I have decided to retain all the columns based on the explanation below

  • original_publication_year: Missing for only 21 out of the 10,000 books; retain and take care of this issue during individual analysis
  • original_title: We have the title column that can be used in its place if the value is missing
  • language_code: Retain and take care of this issue in individual analysis
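
A minimal sketch of what "retain and handle during analysis" could look like. The toy table and the display_title column are hypothetical and only illustrate the idea; they are not part of the analysis that follows.

```r
# Toy stand-in for the books table (values are illustrative)
books_toy <- data.frame(
  original_title = c("The Hobbit", NA, "Emma"),
  title = c("The Hobbit", "Dune (Dune Chronicles, #1)", "Emma"),
  original_publication_year = c(1937, 1965, NA),
  stringsAsFactors = FALSE
)

# Fall back to title when original_title is missing
books_toy$display_title <- ifelse(is.na(books_toy$original_title),
                                  books_toy$title,
                                  books_toy$original_title)

# Drop missing years only inside the calculation that needs them
mean_year <- mean(books_toy$original_publication_year, na.rm = TRUE)
```

Keeping the rows and excluding NA values per analysis avoids losing books that are complete in every other column.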

Anomaly in the data recorded

Among the different summary() codes that were run across all tables, only two anomalies were observed

  • The original_publication_year in books had a negative year
  • The count in book_tags which records how many times a book was given the particular genre has negative values
# Display summary statistics of the important columns
summary(books$original_publication_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1750    1990    2004    1982    2011    2017      21
summary(book_tags$count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     -1.0      7.0     15.0    208.9     40.0 596234.0

On checking a few records with a negative original publication year, such as
original_publication_year | title | goodreads_book_id
-720 | The Odyssey | 1381
-750 | The Iliad | 1371
-500 | The Art of War | 10534
-380 | The Republic | 30289
-430 | Oedipus Rex (The Theban Plays, #1) | 1554

I noticed that these books were actually written in B.C., so the negative years are valid values rather than data errors.

The count in book_tags was mutated to 0 if negative

# Mutating negative values to 0
book_tags[which(book_tags$count < 0), "count"] = 0

The tags dataset

Most of the issues were uncovered in the tags dataset. The tag_name column can have several generic tags and several user defined tags. User defined tags generally do not qualify as a genre of the book, as most of them are “to read”, “currently reading” etc., as shown below:

# Identifying the top tags across all books
book_tags %>% 
  group_by(tag_id) %>%
  summarise(num_books = n_distinct(goodreads_book_id)) %>%
  mutate(rank = rank(desc(num_books))) %>%
  filter(rank <= 5) %>%
  left_join(tags, by = "tag_id") %>%
  arrange(rank) %>%
  select(tag_name, num_books) %>%
  ungroup() %>%
  kable()
tag_name num_books
to-read 9983
favorites 9881
owned 9857
books-i-own 9799
currently-reading 9776

To identify 35 logical and distinct tags that could cover most books, several steps are performed:

  • Remove the special characters ‘-’ and ‘_’ from the tag name
  • Remove non-alphanumeric characters from the tags
  • Remove tags that contain to read, reading, own, club, favorites etc.
  • Remove tags that are ya, novel, series (which cannot be tagged as genres)
# Remove the '-' and '_' from the tag name

tags <- tags %>%
  mutate(new_tag_name = str_replace_all(str_replace_all(tags$tag_name, "-", " "),"\\_", " ")) %>%
  select(tag_id, new_tag_name) %>%
  distinct(tag_id, new_tag_name)

# Remove all user defined tags and non alphanumeric ones
list_remove <- "+to read+|+reading+|^[[:digit:]]*$|+i own+|+currently+|+owned+|[^[:alnum:] ]|+favorites+|+club+|+buy+|+library+|+read+|+borrowed+|+abandoned+|+audio+|+ebook+|+kindle+|+default+"
tag_remove <- tags[grepl(list_remove,tags$new_tag_name),]

# Hard code removal of certain tags
tag_remove <- rbind(tag_remove,tags[tolower(tags$new_tag_name) %in% c('ya','novels','series'),])
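
The cleaning rules above can be tried on a toy vector. This is a simplified sketch in base R: the tags are made up, and shelf_pattern is a shortened, hypothetical version of the full removal pattern rather than a copy of it.

```r
# Made-up raw tags for illustration
raw_tags <- c("sci-fi", "to-read", "books_i_own", "historical-fiction",
              "2015", "ya", "classics")

# Replace '-' and '_' with spaces
clean_tags <- gsub("[-_]", " ", raw_tags)

# Simplified pattern: shelf-style words, or tags made only of digits
shelf_pattern <- "read|i own|owned|favorites|club|^[[:digit:]]+$"

# Tags that are descriptive but not genres
not_genre <- c("ya", "novels", "series")

# Keep only tags that pass both filters
keep <- !grepl(shelf_pattern, clean_tags) & !(clean_tags %in% not_genre)
genre_tags <- clean_tags[keep]
genre_tags
```

Substring patterns like these are blunt, so it is worth inspecting what gets removed before trusting the result.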

  • Next, the common genres with slightly different names, for example classics and classic, were collated into a single tag_id in both tags and book_tags. The same procedure was repeated for children’s books, non fiction, graphic novels, science fiction, and dystopian. I created a function and called it once per group using invoke_map
# Writing a function to change the tags
Remove_Genre <- function(x, y) {

  keep_tags <- tags[which(tags$new_tag_name %in% x), "tag_id"]
  new_tag   <- tags[which(tags$new_tag_name %in% y), "tag_id"][1]
  
  tags[which(tags$new_tag_name %in% x), "tag_id"] <<- new_tag
  tags[which(tags$new_tag_name %in% x), "new_tag_name"] <<- y
  
  book_tags[which(book_tags$tag_id %in% keep_tags ), "tag_id"] <<- new_tag
  return(NULL)

}
  
#Creating the list of tags to change
list_change <- list(list(x =  c("children s", "children s books", "childrens", "kids", "childrens books"),
           y = c("children")),
      list(x =  c("classic"),
           y = c("classics")),  
      list(x =  c("graphic novel"),
           y = c("graphic novels")),        
      list(x =  c("non fiction"),
           y = c("nonfiction")),  
      list(x =  c("sci fi", "scifi", "sci fi fantasy"),
           y = c("science fiction")),  
      list(x =  c("dystopia"),
           y = c("dystopian")))

# Calling the invoke map 
invoke_map(Remove_Genre, list_change) 

new_tag   <- tags[which(tags$new_tag_name %in% c("science fiction")), "tag_id"][1]

book_tags[which(book_tags$tag_id == 26894), "tag_id"] <- new_tag
tags[which(tags$tag_id == 26894), "tag_id"] <- new_tag

# Removing the duplicates from tags dataset
tags <- unique(tags)

# summarising the book_tags to get common counts
book_tags <- book_tags %>%
  group_by(goodreads_book_id, tag_id) %>%
  summarise(count = sum(count)) %>%
  ungroup()
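
As an alternative way to think about the merge, the same idea can be sketched with a lookup vector instead of a function that modifies global tables. The genre_lookup vector and the toy counts below are hypothetical, and this works on tag names rather than tag ids to keep the example short.

```r
# Hypothetical lookup: variant spelling -> canonical genre name
genre_lookup <- c(
  "classic" = "classics",
  "non fiction" = "nonfiction",
  "sci fi" = "science fiction",
  "scifi" = "science fiction",
  "dystopia" = "dystopian"
)

# Toy per-book tag counts (made-up numbers)
toy <- data.frame(
  book = c(1, 1, 1, 2, 2),
  tag = c("sci fi", "science fiction", "classics", "classic", "scifi"),
  count = c(10, 5, 2, 7, 3),
  stringsAsFactors = FALSE
)

# Swap in the canonical name wherever a variant is found
hit <- toy$tag %in% names(genre_lookup)
toy$tag[hit] <- unname(genre_lookup[toy$tag[hit]])

# Merged variants now share a name, so sum their counts per book
merged <- aggregate(count ~ book + tag, data = toy, FUN = sum)
merged
```

The final aggregation plays the same role as the group_by and summarise step: once two variants share an identifier, their counts have to be added together.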

As a last step, I subset the top 5 genres for each book (based on the count) and the top 35 genres across all books (by the number of books that have the genre in their top 5). These tags are checked for uniqueness and logical sense

# Removing the tags that are not actual genres
book_tags2 <- book_tags %>% anti_join(tag_remove, by = "tag_id")

# Ranking based on count and selecting top 5 genres
subset_tags <- book_tags2 %>% 
  group_by(goodreads_book_id) %>%
  mutate( rr = rank(desc(count))) %>%
  filter(rr <= 5) %>%
  ungroup()
 
# Selecting the top 35 genres across all books
subset_tags %>%  
  group_by(tag_id) %>% 
  summarise(num_of_books = n()) %>% 
  mutate(rank = rank(desc(num_of_books))) %>% 
  filter(rank <= 35) %>%
  ungroup() %>%
  left_join(tags, by = "tag_id") %>% 
  arrange(rank) %>%
  select(new_tag_name) %>%
  t %>%
  paste0
##  [1] "fiction"              "fantasy"              "romance"             
##  [4] "young adult"          "nonfiction"           "mystery"             
##  [7] "classics"             "contemporary"         "science fiction"     
## [10] "historical fiction"   "children"             "thriller"            
## [13] "paranormal"           "chick lit"            "historical"          
## [16] "crime"                "horror"               "humor"               
## [19] "history"              "biography"            "literature"          
## [22] "memoir"               "graphic novels"       "dystopian"           
## [25] "vampires"             "contemporary romance" "adventure"           
## [28] "comics"               "urban fantasy"        "picture books"       
## [31] "short stories"        "suspense"             "philosophy"          
## [34] "new adult"            "epic fantasy"
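
One detail worth knowing when ranking within groups: rank() averages tied values by default, so a filter on the rank can return a different number of rows per book than intended. The sketch below uses base R and made-up counts to show a top-2 selection with ties broken by row order; it illustrates the technique and is not the exact code used above.

```r
# Toy tag counts for two books (made-up numbers)
toy <- data.frame(
  book = c(1, 1, 1, 1, 2, 2),
  tag = c("fantasy", "romance", "fiction", "classics", "horror", "crime"),
  count = c(50, 20, 20, 5, 9, 9),
  stringsAsFactors = FALSE
)

# Rank within each book, highest count first; ties broken by row order
toy$rr <- ave(-toy$count, toy$book,
              FUN = function(v) rank(v, ties.method = "first"))

# Keep the top 2 tags per book
top2 <- toy[toy$rr <= 2, ]

# Count how many books carry each tag in their top 2
sort(table(top2$tag), decreasing = TRUE)
```

With the default tie handling, romance and fiction in book 1 would both get rank 2.5 and both be dropped by the same filter.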

I’ll join the tag names to the book_tags and discard the tags dataset

# subsetting the top 35 tags
good_tags <- subset_tags %>% 
  group_by(tag_id) %>% 
  summarise(num_of_books = n()) %>% 
  mutate(rank = rank(desc(num_of_books))) %>% 
  filter(rank <= 35) %>%
  ungroup()

# Getting only those tags for the books
temp_book_tags <- book_tags2 %>% semi_join(good_tags, by = "tag_id")
final_tags <- tags %>% semi_join(good_tags, by = "tag_id")

# Joining
final_book_tags <- temp_book_tags %>% 
  left_join(final_tags, by = "tag_id")

Snapshot of the data

Books

datatable(head(books, 500))

Book Tags

datatable(head(final_book_tags, 500))

Ratings

datatable(head(ratings, 500))

To Read

datatable(head(to_read, 500))

Data Description

Below is the snapshot of all the important columns
Variables | Description | Present_In
book_id | User assigned book id based on popularity, integer (1 - 10000) | books, ratings, to_read
goodreads_book_id | Book id linked to goodreads | books, book_tags
books_count | Number of editions of the book | books
authors | Author(s) of the book | books
original_publication_year | First year in which the book was published | books
original_title | First title of the book | books
title | Current running title of the edition selected | books
language_code | Language of the book | books
average_rating | Average rating of the book (1 - 5) | books
ratings_count | Total users who have rated the book | books
work_ratings_count | Ratings of the current edition selected (1 - 5) | books
work_text_reviews_count | Total comments on the current edition selected | books
tag_id | ID of the tag | book_tags
count | Number of users who have tagged the genre for the book | book_tags
new_tag_name | Cleaned tag name | book_tags
user_id | ID of the user (1 - 53424) | ratings, to_read
rating | Rating that the user has given, integer (1 - 5) | ratings

Exploratory Data Analysis

What makes up the popular list?

The popular genre list

First, I wanted to understand which genres feature in the top 10,000 popular books on Goodreads. Since a book can be associated with multiple genres, I divided them into a main genre and associated genres. The main genre has the most votes for a book, while the associated genres can be thought of as the sub-genres of the book. The associated genres are limited to 4.

Most of the tags appear in both cases. That is, they are the main genre for some books and associated genre for other books. From the below graph, I assume fiction to thriller as the main genres. Genres from urban fantasy to epic fantasy are considered as the associated tags for the book

If we ignore fiction and non-fiction, people like to read fantasy, mystery and romantic books!

final_book_tags <- final_book_tags %>% 
  group_by(goodreads_book_id) %>%
  mutate(rank = rank(desc(count), ties.method = "first")) %>%
  ungroup() %>%
  filter(rank <= 5) %>%
  arrange(goodreads_book_id, rank) 

#Associated tags for each book
top_5 <- final_book_tags %>%  
  filter(rank > 1) %>%
  within(new_tag_name <- factor(new_tag_name, levels = names(sort(table(new_tag_name), decreasing = F))))

#Main Genre for each book
top_1 <- final_book_tags %>%  
  filter(rank == 1) %>%
  within(new_tag_name <- factor(new_tag_name, levels = names(sort(table(new_tag_name), decreasing = F))))

#Bar graph
ggplot(NULL, aes(x = new_tag_name)) + 
  geom_bar(data = top_1, aes(y = (..count..)/sum(..count..), fill = "Main Genre"),alpha = 0.5) +
  geom_bar(data = top_5, aes(y = (..count..)/sum(..count..), fill = "Associated Genre"),alpha = 0.5) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 0.20), labels = scales::percent) +
  coord_flip() +
  labs(title = "Which Genre features most in the top books?", subtitle = "Main vs. Associated Genre", x = "Genre", y = "Percentage of Books", fill = "Genre Type")


Old books or new books?

Similar to what movies follow, I wanted to check if older books (books written before the 1980s) make up a large chunk of the popular books.

Interestingly, 75% of the books are from 1980 or later, indicating a young audience base using the website. The reference can be found here

#Older books are more in the popular lists?:
plot_year <- books %>% 
  select(original_publication_year) %>%
  group_by(original_publication_year) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = original_publication_year, y = count)) +
  geom_line() +
  theme(legend.position = "none") +
  labs(title = "When were the popular books written", x = "Year", 
       y = "Number of Books")

plot_zoom_year <- books %>% 
  select(original_publication_year) %>%
  group_by(original_publication_year) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = original_publication_year, y = count, color = (original_publication_year >= 1983 &  
                                                                original_publication_year <= 2011))) +
  geom_line(aes(group = 1)) +
  theme(legend.position = "none") +
  labs(title = "Zoomed in for 1900 and beyond", x = "Year") +
  ylab(NULL) +
  coord_cartesian(xlim = c(1900,2016)) 

grid.arrange(plot_year, plot_zoom_year, ncol = 2)
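
A share such as "books from 1980 or later" can be checked directly from the year column. The sketch below uses a made-up vector of years, so the resulting number is illustrative only; on the real data one would pass books$original_publication_year instead.

```r
# Made-up publication years, including a B.C. year and a missing value
years <- c(-720, 1813, 1960, 1985, 1997, 2005, 2011, NA)

# Share of books with a known year that were published in 1980 or later
share_recent <- mean(years >= 1980, na.rm = TRUE)
share_recent
```

Taking the mean of a logical vector gives the proportion of TRUE values, and na.rm = TRUE keeps books with a missing year out of the denominator.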


The Goldmine Authors

After we’ve found the genres that are popular, I wanted to know which authors appear more in the popular lists. Authors were split for books that had multiple authors.

James Patterson, Stephen King, and Nora Roberts appear frequently in this list. These authors are famous for writing fiction, fantasy, and romantic novels respectively, which moves them to the top of our goldmine author list.

If you’re a publisher, you know whose books to publish. If you’re a newbie, you know which author to look for!

#Authors that feature in the famous list
authors <- books %>% 
  select(authors) %>%
  t() %>%
  as.character() %>%
  str_split(",\\s*") %>%
  unlist() %>%
  factor() %>%
  table() %>%
  as.data.frame() %>%
  rename("Author" = ".", "Freq" = "Freq") %>%
  filter(Freq >= 15) %>%
  arrange(desc(Freq)) 

wordcloud(authors$Author, authors$Freq, scale = c(2.5,0.15), random.order =  FALSE, colors = brewer.pal(8,"Dark2"), 
          rot.per = .20)
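
When multi-author strings are split, the whitespace after each comma matters: without trimming, the same author can be counted under two different names. A small base R sketch with made-up author strings shows the effect.

```r
# Made-up author strings in the same "A, B" format as the authors column
authors_raw <- c("Terry Pratchett, Neil Gaiman",
                 "Neil Gaiman",
                 "Stephen King, Peter Straub")

# Splitting on a bare comma keeps the leading space of later authors
naive <- unlist(strsplit(authors_raw, ","))

# Trimming whitespace makes both spellings of the same name match
trimmed <- trimws(naive)

# Frequency table per author, most frequent first
freq <- sort(table(trimmed), decreasing = TRUE)
freq
```

In the naive split, " Neil Gaiman" and "Neil Gaiman" are two separate entries with one book each, which would understate the author's frequency in the wordcloud.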


Genre within a Genre (Genre-ception!)

After we’ve got a list of genres and associated genres, I wanted to check the association between the genres. This metric is a calculation of how many times a genre features alongside another one. invoke_map was used to run across the list of tags.

The strongest associations were observed between science fiction & paranormal with fantasy, thriller with mystery (duh), and graphic novels with comics.

Interestingly, chick lit (chick literature) authors pivot their books around romance only. The young adult genre has a wider focus involving fantasy, romance, paranormal and adventure books.

#Picking the 15/35 genres that have the top tag for about 91% of books
genres_for_analysis <- top_1 %>% 
  group_by(new_tag_name) %>%
  summarise(num_books = n()) %>%
  ungroup() %>%
  arrange(num_books) %>%
  top_n(n = 15, wt = num_books)

#Creating an empty dataframe
assoc_table <- data_frame(main = character(), assoc = character(), val = as.integer(character()))

# Writing the function for association mapping that checks each genre
association_map <- function(x, df){
  book_ids <- df[which(df$new_tag_name == x), "goodreads_book_id"]
  
  all_genres <- df[which(df$goodreads_book_id %in% (book_ids$goodreads_book_id)),] %>% 
    group_by(new_tag_name) %>%
    summarise(n = n())
  
  temp_table <- data_frame( main = rep(x, times = nrow(all_genres)), assoc = as.character(all_genres$new_tag_name), val = all_genres$n)

  assoc_table <<- union(assoc_table, temp_table)
  return(NULL)
}

#Calling the function on all genres
invoke_map(association_map, final_tags$new_tag_name, df = top_5)
# Filter and spread

assoc_table <- assoc_table %>%
filter(assoc_table$main != assoc_table$assoc, main %in% (genres_for_analysis$new_tag_name), !assoc %in% c('fiction')) %>%
  spread(assoc, val)                                  #Spread the table to get the lower matrix

assoc_table[is.na(assoc_table)]  <- 0

assoc_title <- assoc_table[,1]
assoc_head <- assoc_table[,-1]
assoc_head <- prop.table(as.matrix(assoc_head), margin = 1)

#Binding back to the first column
assoc_table <- assoc_head %>%
  cbind(assoc_title) %>%                       #Binding back the main column
  gather(key = "assoc", value = "val", -main) #Gather the table 

ggplot(assoc_table, aes(y = main, x = assoc)) +
  theme_bw() +
  geom_tile(aes(fill = val), color = 'white') +
  scale_fill_gradient(low = 'white', high = 'darkblue', space = 'Lab') +
  theme(axis.text.x = element_text(angle = 90), panel.grid.major = element_line(color = '#eeeeee')) +
  labs(title = "Association between Genres", y = "Main Genre", x = "Associated genre", fill = "Association Score")
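
The association score can also be understood as a row-normalised co-occurrence matrix. The sketch below is a compact base R version of that idea on made-up data; it is an illustration of the technique, not a drop-in replacement for the code above.

```r
# Toy book-genre pairs (made-up data)
toy <- data.frame(
  book = c(1, 1, 2, 2, 3, 3, 3),
  genre = c("fantasy", "paranormal", "fantasy", "romance",
            "mystery", "thriller", "romance"),
  stringsAsFactors = FALSE
)

# Book-by-genre indicator matrix (1 if the book carries the genre)
m <- unclass(table(toy$book, toy$genre))

# co[i, j] = number of books tagged with both genre i and genre j
co <- crossprod(m)

# A genre's co-occurrence with itself is not informative
diag(co) <- 0

# Row-normalise so each main genre's associations sum to 1
assoc <- prop.table(co, margin = 1)
round(assoc, 2)
```

Because of the row normalisation the matrix is not symmetric: every paranormal book here is also fantasy, but only half of the fantasy associations point to paranormal.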

What are the Ratings across popular books?

What is the general distribution?

First, I’d like to see the general distribution of the rating across all 10,000 books. A weighted average was taken for the mean, as the number of votes for each book can vary. It was observed that the ratings follow an approximately normal distribution with a weighted mean rating of 4.03.

We now know that the popular books are indeed popular because they are a good read. However, we do see a wide range, from 2.47 to 4.82

mean_rat <- books %>%
  summarise(average_rating = sum(average_rating * work_ratings_count)/sum(work_ratings_count)) %>%
  round(digits = 2)

ggplot(data =NULL, aes(x = average_rating)) +
  geom_histogram(data = books,bins = 50) +
  geom_vline(data = books, aes(xintercept = sum(average_rating*work_ratings_count)/sum(work_ratings_count)), color = "blue", linetype = "dashed", size = 0.75) +
  geom_text(data = mean_rat, aes(label = paste("Mean =",average_rating) , y = 820), hjust = -0.17, color = "blue") +
  labs(title = "How are ratings distributed for the popular books", x = "Average Rating", y = "Number of Books")
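
To make the weighting concrete, here is a two-book example with made-up numbers. A book with many ratings pulls the weighted mean towards its own average, while the simple mean treats both books equally.

```r
# Two hypothetical books: average rating and number of ratings
avg_rating <- c(4.5, 3.0)
n_ratings <- c(900, 100)

# Simple mean ignores how many ratings each book received
simple_mean <- mean(avg_rating)

# Weighted mean, written out and via the base R helper
manual <- sum(avg_rating * n_ratings) / sum(n_ratings)
weighted <- weighted.mean(avg_rating, n_ratings)

c(simple = simple_mean, weighted = weighted)
```

The manual formula is the same one used in the summarise call above; weighted.mean is just a shorter way to write it.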


Which are the top rated genres?

Now that we know the top genres, a more directed approach to choosing your next book would be to see how other people rate these genres.

For this analysis, I have considered the top genre for each book and selected the top 15 genres to analyze. The top 15 genres cover 91% of the books as shown below

genres_for_analysis %>%
  summarise(`Books Covered` = sum(num_books)) %>%
  kable(format = "html") %>%
  kable_styling(bootstrap_options = "striped")
Books Covered
9059

The below graph explains the variation across genre ratings. From the boxplot below, we infer that:

  • Paranormal and Graphic novels are the genres which have the best ratings. The average rating of these genres is higher than most other genres
  • Chick lit has the lowest ratings among all genres (I think there might be bias largely due to disapprovals from critics)
  • Fiction as a genre is subject to many outlier ratings, which decreases the overall average
#Getting the top 15 genres
filter_books <- top_1 %>%
  filter(new_tag_name %in% genres_for_analysis$new_tag_name)

#Creating the boxplot
ratings_with_genre <- books %>%
  inner_join(filter_books, by = "goodreads_book_id")

ggplot(ratings_with_genre, aes(x = new_tag_name, y = average_rating)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "How are ratings distributed across genres", x = "Genre", y = "Average Rating")


Is old gold?

The next analysis was to see if older books actually get better ratings than newer ones. The line graph below shows that this is not the case. In fact, we see a slight increase in ratings across the years. Note that this is subject to the low count of books prior to the 1980s.

So sadly, old is actually not gold for books. The quality of the book does matter for the bibliophiles

ratings_with_genre %>% 
  group_by(original_publication_year) %>%
  summarise(num_books = n(), rating = sum(average_rating*work_ratings_count)/sum(work_ratings_count)) %>%
  ungroup() %>%
  ggplot(aes(x = original_publication_year, y = rating)) +
  geom_line() +
  coord_cartesian(xlim = c(1900,2017)) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Rating vs. Year of publication", x = "Year of publication", y = "Average rating")


Is rating even associated with any factor?

I ran a simple correlation analysis to understand whether ratings are influenced by any of the other variables. I would have treated a coefficient above 0.3 as a meaningful correlation.

However, ratings show no meaningful linear correlation with any of the variables that we have, and are probably driven by other factors

books_numeric <- books %>%
  mutate(num_authors = str_count(books$authors, ",") + 1) %>%
  select(books_count, original_publication_year, average_rating, ratings_count, work_ratings_count, work_text_reviews_count, num_authors)
 
books_numeric <- books_numeric[complete.cases(books_numeric),]

ggcorr(books_numeric,  nbreaks = 5, label = TRUE, label_size = 3, hjust = 0.7, label_round = 4, size = 2.5) +
  labs(title = "Correlation of average rating with other variables")  
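
The 0.3 rule of thumb can also be applied programmatically instead of reading it off the plot. The sketch below uses a small synthetic data frame with hypothetical column names (target, linear, alternating) to show the mechanics.

```r
# Synthetic data: one variable strongly related to the target, one not
toy <- data.frame(
  target = 1:10,
  linear = 2 * (1:10) + 1,
  alternating = rep(c(1, -1), times = 5)
)

# Correlation of every column with the target
cors <- cor(toy, use = "complete.obs")["target", ]

# Drop the target's correlation with itself
cors <- cors[names(cors) != "target"]

# Flag variables whose absolute correlation exceeds the threshold
flagged <- names(cors)[abs(cors) > 0.3]
flagged
```

Keep in mind that a correlation coefficient only captures linear relationships, so a low value does not rule out other kinds of dependence.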


Part 1 or Part 2 or … Part n of the book?

We can identify whether a book is part of a series by checking cases where the title differs from the original title and contains a series number (marked with #). For such books, the rating generally increases up to the 5th part and then stabilizes from the 6th to the 10th part.

So if you start reading a series, like Harry Potter, you tend to like the books even more. That’s how they create the fan base, don’t they?

# Getting the books that have editions
books_editions <- filter(books, original_title != title, str_detect(title, "#"))

# Getting the edition number
start <- regexpr("\\#[^\\#]*$", books_editions$title)
end <-  regexpr("\\)[^\\)]*$", books_editions$title)

books_editions$edition  <- as.numeric(str_sub(books_editions$title, start = start + 1, end = end - 1))
books_editions <- books_editions[which(books_editions$edition %in% c(1:10)),]

correlation <- round(cor(books_editions$edition, books_editions$average_rating ), digits = 3)

ggplot(data = books_editions, aes(x = factor(edition), y = average_rating)) +
  geom_boxplot(notch = TRUE) +
  geom_rect(xmin = 9.18, xmax = 10.32, ymin = 4.68, ymax = 4.82, color = "blue", fill = "white") +
  annotate("text",  x = 9.75, y = 4.75, label =  paste("Corr:",as.character(correlation)), color = "blue") +
  labs(title = "What about Book Series?", subtitle = "Ratings by Book Series", x = "Number in the Series", 
       y = "Average Rating")
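
The series number extraction can be illustrated on a few titles. This sketch uses a single regular expression in base R rather than the position-based approach above, and the titles are typed in by hand in the "Title (Series, #n)" format.

```r
# Example titles in the "Title (Series, #n)" format
titles <- c("Harry Potter and the Chamber of Secrets (Harry Potter, #2)",
            "The Hobbit",
            "Catching Fire (The Hunger Games, #2)",
            "A Game of Thrones (A Song of Ice and Fire, #1)")

# Digits that follow '#' and are closed by ')'
pattern <- "#([0-9]+)\\)"
has_number <- grepl(pattern, titles)

# Extract the captured digits; titles without a match stay NA
series_no <- rep(NA_integer_, length(titles))
series_no[has_number] <- as.integer(
  sub(paste0(".*", pattern, ".*"), "\\1", titles[has_number])
)
series_no
```

Titles that cover several parts at once, such as box sets with a range after the #, do not match this pattern and would stay NA.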

What about the bibliophiles?

Bibliophiles form the most important component of goodreads. They are the people who actually rate and recommend books so that everyone can have a good read and avoid the bad ones.

How many books do these users rate/read?

It is assumed that if a user rates a book, he/she has read the book. The analysis further assumes books read as the metric.

The first analysis is to check the distribution of the number of books that people read. Similar to ratings across books, the number of books that users read follows a roughly normal distribution with a mean of 111.9 (well, that’s a lot of books!). The minimum number of books that a person has read is 19, while the maximum is 200.

To reiterate, since this is a contiguous dataset, these numbers might not reflect the actual population parameters.

# Grouping the ratings and books read by user
rating_user <- ratings %>%
  group_by(user_id) %>%
  summarise(n = n_distinct(book_id), average_rating = mean(rating)) %>%
  ungroup()

#Calculating the min, max, and mean
mean_b <- mean(rating_user$n)
min_b <-  min(rating_user$n)
max_b <-  max(rating_user$n)

#How many books do users read
ggplot(rating_user, aes(x = n)) +
  geom_histogram(bins = 40) +
  geom_vline(aes(xintercept = mean_b), color = "blue") +
  geom_vline(aes(xintercept = min_b), color = "tan2") +
  geom_vline(aes(xintercept = max_b), color = "tan2") +
  annotate("text", x = mean_b + 15, y = 4500, label = paste("Mean =", round(mean_b, 1)), color = "blue") +
  annotate("text", x = min_b + 10, y = 4500, label = paste("Min =", round(min_b, 1)), color = "tan2") +
  annotate("text", x = max_b - 10, y = 4500, label = paste("Max =", round(max_b, 1)), color = "tan2") +
  labs(title = "How many books do users read", x = "Number of books read", y = "Count of users")


Segregating the bibliophiles

We decided to split the users into quintiles based on the number of books that they’ve read. People are distributed into Q1-Q5, with Q1 being the lowest quintile

# Assigning the quintiles (stored in a column named Quartile)
rating_user$Quartile <- with(rating_user, factor(findInterval(n, c(-Inf,
                     quantile(n, probs=c(0.2, .4, .6, .8)), Inf)), labels = c("Q1", "Q2", "Q3", "Q4","Q5")))

rating_user %>% 
  group_by(Quartile) %>%
  summarise(Mean_Books = mean(n), Min_Books = min(n), Max_Books = max(n)) %>%
  kable(format = "html") %>%
  kable_styling(bootstrap_options = "striped")
Quartile Mean_Books Min_Books Max_Books
Q1 75.87592 19 91
Q2 98.33239 92 104
Q3 110.39975 105 116
Q4 123.86480 117 132
Q5 148.92101 133 200
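
The same kind of binning can be sketched with cut() and quantile(). The example below uses the integers 1 to 100 as a stand-in for the number of books per user, so the group sizes come out exactly equal; real data with many tied values will give uneven groups.

```r
# Stand-in for books read per user (made-up, evenly spread values)
n_books <- 1:100

# Quintile boundaries at 0%, 20%, ..., 100%
breaks <- quantile(n_books, probs = seq(0, 1, by = 0.2))

# Assign each user to Q1..Q5; include.lowest keeps the minimum in Q1
quintile <- cut(n_books, breaks = breaks, include.lowest = TRUE,
                labels = paste0("Q", 1:5))

table(quintile)
```

Note that cut() closes intervals on the right by default while findInterval() closes them on the left, so values that fall exactly on a boundary can land in different groups depending on which function is used.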

Critical - Critical as you read

As the title reveals, users tend to get slightly more critical of the books they read as the number of books read increases. We observe a steady decline in average ratings across the 5 quintiles.

New readers tend to rate the book generously and begin to get critical as they start reading good content.

# Quintile-level rating
rating_user %>%
  group_by(Quartile) %>%
  summarise(mean_avg = sum(average_rating*n)/sum(n), min = min(n), max = max(n)) %>%
  ungroup() %>%
  ggplot(aes(x = paste0(Quartile, " (", min," - ", max, ")"), y = mean_avg)) +
  geom_bar(stat = "identity") +
  coord_cartesian(ylim = c(3.5, 4.02)) +
  labs(title = "How do users rate books?", x = "Number of books read", y = "Average Rating given")
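
The quintile average in the code above is a weighted mean: each user's average rating counts in proportion to the number of books that user rated. The sketch below uses made-up numbers for three hypothetical users to show that the formula matches base R's weighted.mean() and differs from a plain mean.

```r
# Hypothetical users in one quintile: books rated and average rating given
n <- c(20, 50, 90)
average_rating <- c(4.5, 4.0, 3.8)

# Same formula as in the summarise() call: sum(average_rating * n) / sum(n)
manual <- sum(average_rating * n) / sum(n)

# Base R equivalent
builtin <- weighted.mean(average_rating, w = n)

# Unweighted mean for comparison; it treats every user equally
simple <- mean(average_rating)

c(manual = manual, builtin = builtin, simple = simple)
```

Here the heaviest rater is also the most critical one, so the weighted figure comes out lower than the unweighted one.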


Is there a difference in the reading patterns?

Our next objective was to identify any differences between the reading patterns of our bibliophile quintiles. The percentage of books read was calculated for each genre within each quintile. The top 5 genres in each quintile were selected for analysis.

Some interesting observations were:

  • The percentage of fiction books read increases with the number of books a user has read
  • People tend to start with romance novels. However, the craze seems to fade once people read more books
  • Children’s books appear among the top 5 genres for the last 3 quintiles
    • This suggests that parents tend to rate these books a lot, leading to children’s books showing up in the top lists
  • Teenagers tend to read a little more than the lowest quintile, which results in young adult books featuring among the top 5 genres for quintile 2
  • Fiction, Non-Fiction, Classics, and Fantasy are common across all quintiles
rating_w_tags <- ratings %>% 
  left_join(rating_user, by = "user_id") %>%
  left_join(books, by = "book_id") %>%
  left_join(top_1, by = "goodreads_book_id")

rating_w_tags %>%
  group_by(Quartile, new_tag_name) %>%
  summarise( n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup() %>%
  group_by(Quartile) %>%
  mutate(rank = rank(desc(freq))) %>%
  filter(rank <= 5) %>%
  ungroup() %>%
  arrange(Quartile, rank) %>%
  ggplot(aes(x = reorder(new_tag_name, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~ Quartile, scales = "free_x", ncol = 2) +
  labs(title = "What books do the segmented users read?", x = "Top 5 genres", 
       y = "Percentage of books read in the genre")
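
The share-then-rank step in the pipeline above can be followed on a toy table. The sketch below is a base R illustration with made-up rows: toy, genre and top_k are hypothetical names, and it keeps the top 2 genres rather than the top 5 so the result stays small.

```r
# Toy ratings: one row per rating, tagged with the user's quintile and the genre
toy <- data.frame(
  Quartile = rep(c("Q1", "Q2"), each = 6),
  genre = c("romance", "romance", "romance", "fiction", "fiction", "fantasy",
            "fiction", "fiction", "fiction", "classics", "classics", "romance"),
  stringsAsFactors = FALSE
)

# Count ratings per quintile and genre
counts <- as.data.frame(table(toy), stringsAsFactors = FALSE)

# Share of each genre within its quintile
counts$share <- counts$Freq / ave(counts$Freq, counts$Quartile, FUN = sum)

# Rank genres within each quintile, largest share first
counts$rank <- ave(-counts$share, counts$Quartile, FUN = rank)

# Keep the top 2 genres per quintile
top_k <- counts[counts$rank <= 2, ]
top_k <- top_k[order(top_k$Quartile, top_k$rank), ]
top_k
```

rank() averages tied ranks by default, so with ties a quintile can return fewer or more genres than the cutoff suggests.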


What makes a book worth reading?

Having identified what the different segments of users read, our last analysis looks at what makes a book worth reading.

This analysis is carried out by counting the number of users who have marked each book as “to-read” on goodreads. This variable is then compared against the other attributes of a book to see whether a relationship exists.

A strong correlation exists between the number of users who have marked a book as to-read and the number of text reviews the book has received. Books with more text reviews tend to be on more to-read lists. Interestingly, a book's average rating appears to have little bearing on whether users want to read it.

Other variables that correlate well with the to-read count are the number of editions of the book (0.3) and the number of users that have rated the book (0.63).

The hypothesis is that popular books (in terms of ratings and comments) feature in the different lists on goodreads, which motivates users to read these books.

So if you launch a new book, try to generate more hype for it in terms of comments and ratings.

# Get the number of users who have marked a book as to read
to_read_total <- to_read %>% 
  group_by(book_id) %>%
  summarise(number_users = n_distinct(user_id))

# Join with the attributes of the book
book_tr <- books %>%
  left_join(to_read_total, by = "book_id") %>%
  select(number_users, books_count, original_publication_year, average_rating, ratings_count, work_ratings_count, work_text_reviews_count)

# Correlation
correlation <- cor(book_tr[complete.cases(book_tr),])
corrplot(correlation, type = "lower", addCoef.col = "black", diag = FALSE,
         number.cex = 0.8, tl.srt = 45, tl.col = "black", tl.cex = 0.6,
         title = "Correlation of books marked as to read", mar = c(0, 0, 1, 0))
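
The correlation step above first drops incomplete rows and then calls cor(). The sketch below uses a small made-up data frame (the values are not from the goodreads data) to show that cor() with use = "complete.obs" gives the same matrix. It also computes a Spearman correlation, a rank-based alternative that may be worth trying on skewed count variables. The Spearman step is a suggestion and not part of the original analysis.

```r
# Made-up stand-in for book_tr, with one missing to-read count
toy <- data.frame(
  number_users   = c(120, 300, NA, 80, 950, 400),
  ratings_count  = c(1500, 4200, 2600, 900, 15000, 5100),
  average_rating = c(4.1, 3.9, 4.3, 4.0, 4.2, 3.8)
)

# Approach used in the analysis: drop incomplete rows, then correlate
pearson_a <- cor(toy[complete.cases(toy), ])

# Equivalent call that lets cor() drop the incomplete rows itself
pearson_b <- cor(toy, use = "complete.obs")

# Rank-based alternative, less sensitive to a few very large counts
spearman <- cor(toy, use = "complete.obs", method = "spearman")

round(pearson_a, 2)
round(spearman, 2)
```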

Get a book recommendation

The Shiny app below takes one or more genres from the user and outputs recommended books for those genres.

The books are sorted first by the number of selected genres they match (most genres first) and then by rating.

Clicking on the name of the book will direct you to the goodreads page of the book.

The app is hosted here

Book recommender app

Code for the app

First, the data frames books, final_tags, and final_book_tags need to be written to CSV files and placed in the same folder as the ui and server code.

The code for ui.R is given below:

# Loading the required libraries
library(shiny)
library(DT)

# Reading the files
final_tags <- read.csv("final_tags.csv")
books <- read.csv("books.csv")


shinyUI(fluidPage(
  # Adding title
  titlePanel("Book Recommender"),
  
  # Sidebar 
  sidebarLayout(
    sidebarPanel(
      # Genre Selection
      
      selectInput(inputId = "Columns", label = "Which genres do you like?",
                  unique(final_tags$new_tag_name), multiple = TRUE),
      verbatimTextOutput("fiction"),
      
      # Rating Selection
      sliderInput(inputId = "range",
                  label = "Range of ratings that you wish to read?",
                  min = min(books$average_rating),
                  max = 5,
                  value = c(3,5))
      
    ),
    
    # Datatable output
    mainPanel(
      "The top 50 recommended books for the genre(s) selected are",
      DT::dataTableOutput(outputId = "bookreco"))
    
  )
)
)

The code for server.R is given below:

#Loading the required libraries
library(shiny)
library(DT)
library(tidyverse)

# Reading the CSV files
books <- read.csv("books.csv")
final_book_tags <- read.csv("final_book_tags.csv")

# Server function
shinyServer(function(input, output) {
  
  datasetInput <- reactive({
    
    # Aggregating all selected genres
    selected_tags <- final_book_tags %>%
      filter(new_tag_name %in% as.vector(input$Columns))
    if (nrow(selected_tags) == 0) {
      # Placeholder so that the joins below still work when no genre is selected
      result <- tibble(goodreads_book_id = c(1),
                       Genre = c("temp"))
    } else {
      # One row per book, with its matching genres collapsed into a single string
      result <- selected_tags %>%
        aggregate(new_tag_name ~ goodreads_book_id, data = ., paste, collapse = ", ") %>%
        rename(Genre = new_tag_name)
    }
    
    # Filtering the books based on genre and rating
    final_book_tags %>% 
      filter(new_tag_name %in% as.vector(input$Columns)) %>%
      group_by(goodreads_book_id) %>%
      summarise(num_tags = n()) %>%
      ungroup() %>%
      left_join(result, by = "goodreads_book_id") %>%
      left_join(books, by = "goodreads_book_id") %>%
      filter(average_rating >= as.numeric(input$range[1]), average_rating <= as.numeric(input$range[2])) %>%
      arrange(desc(num_tags), desc(average_rating)) %>%
      select(title,  average_rating,  goodreads_book_id, Genre ) %>%
      mutate(Book_Name = paste0("<a href='", paste0('https://www.goodreads.com/book/show/', goodreads_book_id), "' target='_blank'>", title, "</a>")) %>%
      select(Book_Name, Genre, average_rating) %>%
      rename( Rating = average_rating, `Book` = Book_Name, `Genre(s)` = Genre)
    
    
  })
  
  #Rendering the table
  output$bookreco <- DT::renderDataTable({
    
    DT::datatable(head(datasetInput(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
  })
})
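
The ranking logic inside the reactive can be tried outside Shiny. The sketch below is a base R approximation and not the app's code: the two small data frames stand in for final_book_tags and books, and recommend() is a hypothetical helper. It counts how many of the selected genres each book matches, filters by rating, and sorts by match count and then by rating.

```r
# Hypothetical stand-ins for final_book_tags and books
book_tags <- data.frame(
  goodreads_book_id = c(1, 1, 2, 3, 3, 4),
  new_tag_name = c("fantasy", "classics", "fantasy", "fantasy", "classics", "romance"),
  stringsAsFactors = FALSE
)
books <- data.frame(
  goodreads_book_id = 1:4,
  title = c("Book A", "Book B", "Book C", "Book D"),
  average_rating = c(4.2, 4.6, 3.9, 4.8),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the filter, count, join and sort steps
recommend <- function(genres, min_rating = 3, max_rating = 5, top_n = 50) {
  hits <- book_tags[book_tags$new_tag_name %in% genres, ]
  if (nrow(hits) == 0) {
    # Nothing selected or nothing matched: return an empty result
    return(data.frame(title = character(0), Genre = character(0),
                      average_rating = numeric(0), num_tags = integer(0)))
  }
  # Matching genres per book, collapsed into one string
  genre <- aggregate(new_tag_name ~ goodreads_book_id, data = hits,
                     FUN = paste, collapse = ", ")
  names(genre)[2] <- "Genre"
  # Number of selected genres each book matches
  count <- aggregate(new_tag_name ~ goodreads_book_id, data = hits, FUN = length)
  names(count)[2] <- "num_tags"
  out <- merge(merge(count, genre, by = "goodreads_book_id"),
               books, by = "goodreads_book_id")
  # Keep books inside the requested rating range
  out <- out[out$average_rating >= min_rating & out$average_rating <= max_rating, ]
  # Most matching genres first, then highest rating
  out <- out[order(-out$num_tags, -out$average_rating), ]
  head(out[, c("title", "Genre", "average_rating", "num_tags")], top_n)
}

recommend(c("fantasy", "classics"))
```

Returning an empty data frame when nothing matches is one alternative to the placeholder row used in server.R.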

Summary

We split the summary into three sections

Genre

  • The most read books are from the genres fiction, non-fiction, fantasy, mystery, and romance
  • Contrary to popular belief, most books are not from older times; rather, 75% of the books are from 1980 or later
  • James Patterson, Stephen King, and Nora Roberts appear frequently in the popular list

Ratings

  • Popular books are indeed popular because they are a good read; the average rating for these popular books is 4.03
  • Paranormal and Graphic novels are the best-rated genres; their average rating is higher than that of most other genres
  • For books that are part of a series, the average rating increases up to the 5th book in the series, suggesting that quality keeps improving until then
  • Ratings are not associated with any of the other factors present in the dataset (publication year, books count, total ratings, and total reviews)
  • Hence the hypothesis is that the quality of a book matters more than its other attributes

User level

  • The number of books that a user reads is roughly normally distributed with a mean of 111.9, which is a very large number
  • The bibliophiles were segregated into quintiles, and the insights below are for these quintiles
  • Users become slightly more critical as they read more books. We observe a steady decline in ratings across the 5 quintiles
  • People tend to start with romance novels; however, the craze seems to fade once people read more books
  • Children’s books appear among the top 5 genres for the last 3 quintiles
    • This suggests that parents tend to rate these books a lot, leading to children’s books showing up in the top lists

To understand what leads a user to mark a book as “to-read”, we ran a correlation analysis:

  • A strong correlation exists between the number of users who have marked a book as to-read and the number of text reviews the book has received
  • Other variables that correlate well with the to-read count are the number of editions of the book (0.3) and the number of users that have rated the book (0.63)

The hypothesis is that popular books (in terms of ratings and comments) feature in the different lists on goodreads, which motivates users to read these books.

Future Analysis

  • Since we couldn’t find any associations for the ratings of a book, the next step would be to get the text of the books and run a text analysis to see whether common words are associated with better ratings
  • The “to-read” count for a book could be predicted from the variables that the correlation analysis flagged as related
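
One possible starting point for the prediction idea in the last bullet is a linear regression. This is a suggestion and not something done in the analysis above. The sketch below runs on simulated data: the column names mirror book_tr, but the values and the coefficients used to generate them are arbitrary assumptions.

```r
set.seed(1)

# Simulated stand-in for book_tr; the generating coefficients are arbitrary
n_obs <- 200
toy <- data.frame(
  work_text_reviews_count = rpois(n_obs, 300),
  books_count = rpois(n_obs, 40)
)
toy$number_users <- 50 + 2 * toy$work_text_reviews_count +
  3 * toy$books_count + rnorm(n_obs, sd = 25)

# Linear model of the to-read count on two of the correlated variables
fit <- lm(number_users ~ work_text_reviews_count + books_count, data = toy)
coef(fit)

# Predicted to-read count for a hypothetical new book
new_book <- data.frame(work_text_reviews_count = 500, books_count = 20)
predicted <- predict(fit, newdata = new_book)
predicted
```

On the real data the counts are heavily skewed, so a log transform or a count model may fit better than a plain linear model.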