Problem Statement: People love books, and they enjoy discussing and sharing the books they love. Goodreads is an amazing platform designed for such bibliophiles (and budding bibliophiles!). What I felt was missing from the website is a summary analysis of popular books. The objective of this project is to analyze data on popular books from Goodreads and identify interesting trends in these books and what makes them popular.
Lastly, I want to build a recommendation engine based on the genre(s) that the user selects.
The packages used in the project (currently) are listed below. The %>% pipe operator, which comes along with the tidyverse, is used throughout!
# Loading the packages
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(stringr)
The original goodbooks-10k data was scraped in August 2017. The creator's objective was to provide a book dataset comparable to those available for movies and songs, and to build a recommendation engine for books.
The dataset was first published on Kaggle. Contributors then identified anomalies in the data, such as duplicate records and multiple ratings by the same user for the same book. These anomalies were removed, and the cleaned data was posted to GitHub. We will load the data from this GitHub repository.
The column information is provided for the key variables. Some assumptions are made for the other variables based on the data present (and by looking up the Goodreads website).
The data is contiguous for 10k books and 50k users. The dataset has five tables, described below:
| Table | Description | Notes |
|---|---|---|
| books | Data of the 10k most popular books with metadata (author, ratings, etc.) | Popular books are determined by the total number of users who have rated the book |
| book_tags | Genres associated with each book, along with the number of times each tag was applied | Both generic and user-defined tags are included; user-defined tags can be anything |
| tags | Description of the genres | |
| ratings | User, book, rating level data for 10k books and 50k users | For 10k books and 50k users who have made at least 2 ratings |
| to_read | User, book pair that a user has marked to read | For 10k books and 50k users |
On initial checks, I found that the datasets had blanks that R could not identify as NA, so I added na.strings to help R identify them as nulls.
# Reading all the datasets from github using fread
ratings <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/ratings.csv", na.strings = c("", "NA"))
book_tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/book_tags.csv", na.strings = c("", "NA"))
books <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/books.csv", na.strings = c("", "NA"))
tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/tags.csv", na.strings = c("", "NA"))
to_read <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/to_read.csv", na.strings = c("", "NA"))
The attributes of the different data frames are described below.
# Creating a list of all tables
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)
# Writing a function to get the dimensions of a table as a "rows columns" string
dim_func <- function(x){
  paste(dim(tbl_list[[x]]), collapse = " ")
}
# Writing a function to get all column names as a comma-separated string
names_func <- function(y){
  paste(names(tbl_list[[y]]), collapse = ", ")
}
# Creating a table
table_attributes <- tibble(
  Table = names(tbl_list),
  `Rows Columns` = unlist(lapply(seq_along(tbl_list), dim_func)),
  Variables = unlist(lapply(seq_along(tbl_list), names_func))
)
# Removing the list
rm(tbl_list)
# Printing the table
kable(table_attributes, format = "html") %>%
kable_styling(bootstrap_options = "striped") %>%
column_spec(2, width = "12em")
| Table | Rows Columns | Variables |
|---|---|---|
| books | 10000 23 | book_id, goodreads_book_id, best_book_id, work_id, books_count, isbn, isbn13, authors, original_publication_year, original_title, title, language_code, average_rating, ratings_count, work_ratings_count, work_text_reviews_count, ratings_1, ratings_2, ratings_3, ratings_4, ratings_5, image_url, small_image_url |
| book_tags | 999912 3 | goodreads_book_id, tag_id, count |
| tags | 34252 2 | tag_id, tag_name |
| ratings | 5976479 3 | user_id, book_id, rating |
| to_read | 912705 2 | user_id, book_id |
As a first step, I will remove the columns that I do not intend to use from the books table. All other tables contain only the relevant columns.
# Selecting the relevant columns
books <- books %>% select(-c(best_book_id, work_id, isbn, isbn13), -(ratings_1:small_image_url))
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)
The next step in cleaning the data was to identify any missing values across the datasets. A quick is.na() check reveals that only the books table has missing values in some columns.
# Identifying missing values across all tables using a loop
for (x in seq_along(tbl_list)) {
dd <- colSums(is.na(tbl_list[[x]]))
if (sum(dd[dd > 0]) > 0) {
print(paste("For the data -",names(tbl_list[x])))
print(dd[dd > 0])
}
}
## [1] "For the data - books"
## original_publication_year original_title
## 21 590
## language_code
## 1084
# Removing the list
rm(tbl_list)
I have decided to retain all the columns, based on the explanation below.
Among the different summary() calls that were run across all the tables, only two anomalies were observed:
- original_publication_year in books has negative years
- count in book_tags, which records how many times a book was given a particular genre, has negative values
# Display summary statistics of the important columns
summary(books$original_publication_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1750 1990 2004 1982 2011 2017 21
summary(book_tags$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0 7.0 15.0 208.9 40.0 596234.0
| original_publication_year | title | goodreads_book_id |
|---|---|---|
| -720 | The Odyssey | 1381 |
| -750 | The Iliad | 1371 |
| -500 | The Art of War | 10534 |
| -380 | The Republic | 30289 |
| -430 | Oedipus Rex (The Theban Plays, #1) | 1554 |
I noticed that these books were actually written in B.C. (Wow!), so the negative publication years are legitimate.
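For reference, these rows can be pulled with a filter along the following lines (a sketch rather than the original chunk, using the dplyr verbs loaded via the tidyverse):
# Listing the popular books with a negative (B.C.) publication year
books %>%
  filter(original_publication_year < 0) %>%
  select(original_publication_year, title, goodreads_book_id)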
The count in book_tags was set to 0 wherever it was negative.
# Setting negative counts to 0 in place, using data.table's := operator
book_tags[count < 0, count := 0]
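# Interactive preview of the first 500 rows of the books, ratings and to_read tables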
datatable(head(books, 500))
datatable(head(ratings, 500))
datatable(head(to_read, 500))
| Variables | Description | Present_In |
|---|---|---|
| book_id | User-assigned book ID based on popularity, integer (1 - 10000) | books, ratings, to_read |
| goodreads_book_id | Book ID linked to Goodreads | books, book_tags |
| books_count | Number of editions of the book | books |
| authors | Author(s) of the book | books |
| original_publication_year | First year in which the book was published | books |
| original_title | First title of the book | books |
| title | Current running title of the edition selected | books |
| language_code | Language of the book | books |
| average_rating | Average Rating of the book (1 - 5) | books |
| ratings_count | Total users who have rated the book | books |
| work_ratings_count | Ratings of the current edition selected (1 - 5) | books |
| work_text_reviews_count | Total comments on the current edition selected | books |
| tag_id | ID of the tag | book_tags |
| count | Number of users who have tagged the book with the genre | book_tags |
| new_tag_name | Cleaned Tag name | book_tags |
| user_id | ID of the user (1 - 53424) | ratings, to_read |
| rating | Rating that the user has given integer(1 - 5) | ratings |
The first step is to understand how many books are written in each genre and language, and then to identify whether particular publication-year clusters account for a large share of the popular books.
Next, I would like to look at the distribution of ratings across years, genres, editions, languages, and combinations of these to identify any significant trends.
The prominent factors can then be used in SVM/CHAID/clustering to identify any strong indicators of what earns a book good ratings (this depends on the trends above).
Finally, I want to build a recommendation engine based on the genre(s) selected.
All of the above analysis will require joining data frames and slicing the data. The plots that can help include bar graphs, box plots, histograms, and heat maps (there may be others I am not aware of yet). A sketch of a typical join is given below.
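As a preview of the joins this will involve, here is a sketch (my own illustration, not part of the analysis yet; the object name book_genres is made up) that attaches tag names to each book and counts how many of the popular books carry each tag:
# Sketch: attach tag names to books via book_tags and tags
book_genres <- book_tags %>%
  inner_join(tags, by = "tag_id") %>%
  inner_join(books %>% select(goodreads_book_id, title, average_rating),
             by = "goodreads_book_id")
# Ten most frequently applied tags across the 10k popular books
book_genres %>%
  count(tag_name, sort = TRUE) %>%
  head(10)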
I hope this is doable :)