• Show All Code
  • Hide All Code

: Project Proposal

Rohit Jain

November 4, 2017

Analysis of the most popular books

Introduction

  1. Problem Statement: People love books. And people enjoy discussing and sharing books that they love. Goodreads is one amazing platform designed for such bibliophiles (and budding bibliophiles!). What I felt was missing from this website is a summary analysis of popular books. The objective of this project is to analyze the data of popular books from goodreads and identify any interesting trends for these books and what makes them popular.

  2. How would I do this?: The data for this analysis is a contiguous dataset of 10,000 most popular books, with user level information on 6 million ratings for these books. We also have several generic and user defined genres (/tags/shelves) for each book. A combination of this information would help me answer the above questions. I want to explore ratings across different verticals such as genre, year of publication, editions, etc. Next, I would analyze the users to identify any strong relations among different readers. The different analytic techniques that I wish to apply post data preparation include, but not limited to:
    • Univariate analysis
    • Bivariate analysis
    • Clustering/ CHAID/ SVM (any one)

    Last, I want to build a recommendation engine based on the genre(s) that the user selects.

  3. What can you do with the analysis?:
  • Identify factors that affect the ratings, patterns among what avid readers actually like to read,
  • Associations between different genres, and what genres to invest in (if you’re a publisher)
  • Identify segments of bibliophiles based on the number of books they have rated (assuming rated as read)
  • As an avid reader, find genres that are popular but were off your radar
    • If you love that genre, get the recommended book for it
  • As a newbie, see what people read and hopefully become a reader yourself!

Packages Used

The packages used in the project (currently) are:

  • data.table: To read the csv files in the fastest possible way
  • tidyverse : Collection of R packages for data manipulation, exploration and visualization. I am currently using
    • dplyr: Data manipulation using filter, joins, summarise etc.
    • magrittr: The pipe %>% operator!
  • stringr: string replacements and pattern matching
  • DT : Filtering, pagination, and sorting of data tables in html outputs
  • knitr : Aligned displays of table in a html doc
  • kableExtra : Manipulate table styles for good visualizations
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(stringr)

Data Preparation

Original Data and Import

Description: goodbooks-10k

The original data gooodooks-10k was scraped in August 2017, with an aim of providing a database similar to movies and songs. The objective for creator was to create recommendation engine for books, similar to music and movies.

The dataset was first published in kaggle. Contributors then identified anomalies in the data like duplicate records, and multiple user ratings for the same book. Such anomalies were removed, and the clean data was then posted into github. We will load the data from this github repository.

The column information is provided for the key variables. Some assumptions are made for the other variables based on data present (and looking up the goodreads website).

The data is contiguous for 10k books and 50k users. The database has 5 tables, descriptions of which are given below

Table Description Notes
books Data of the 10k most popular books with metadata (author, ratings etc) Popular books are determined by number of total # of users who have rated the book
book_tags Genres associated with each book along with the number of tags for the genre Genric and User defined tags both are included; user defined can be anything
tags Description of the genres
ratings User, book, raating level data for 10k books and 50k users For 10k books and 50k users who have made at least 2 ratings
to_read User, book pair that a user has marked to read For 10k books and 50k users

Importing the data

On initial checks, I found that the data sets had blanks, which R could not identify as NA. So I added na.strings to help R identify nulls

ratings   <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/ratings.csv",   na.strings = c("", "NA"))
book_tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/book_tags.csv", na.strings = c("", "NA"))
books     <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/books.csv",     na.strings = c("", "NA"))
tags      <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/tags.csv",      na.strings = c("", "NA"))
to_read   <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/to_read.csv",   na.strings = c("", "NA"))

The attributes of the different dataframes are described below

tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)

dim_func <- function(x){
  dimension = str_trim(paste0(dim(tbl_list[[x]]), sep = "  ", collapse = ""))
}

names_func <- function(y){  
  vars = str_trim(paste0(names(tbl_list[[y]]), sep = " | ", collapse = ""))
}

table_attributes <- data_frame(
  Table = names(tbl_list),
  `Rows  Columns` = unlist(lapply(1:5, dim_func)),
  Variables = unlist(lapply(1:5, names_func ))
)

rm(tbl_list)

kable(table_attributes, format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
    column_spec(2, width = "12em")
Table Rows Columns Variables
books 10000 23 book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | title | language_code | average_rating | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
book_tags 999912 3 goodreads_book_id | tag_id | count |
tags 34252 2 tag_id | tag_name |
ratings 5976479 3 user_id | book_id | rating |
to_read 912705 2 user_id | book_id |

Cleaning the Data

Keeping the relevant columns across all tables

As a first step, I would remove the columns that I do not intend to use from the books table. All other tables have only the relevant columns

books <- books %>% select(-c(best_book_id, work_id, isbn, isbn13), -(ratings_1:small_image_url))
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)

Are there missing values?

The next step in cleaning the data was to identify any missing values across datasets. A quick is.na() check reveals that only books has missing values in some columns

for (x in 1:length(tbl_list)) {
  dd <- colSums(is.na(tbl_list[[x]]))
  if (sum(dd[dd > 0]) > 0) {
    print(paste("For the data -",names(tbl_list[x])))
    print(dd[dd > 0])
  }
}
## [1] "For the data - books"
## original_publication_year            original_title 
##                        21                       590 
##             language_code 
##                      1084
rm(tbl_list)

I have decided to retain all the columns based on the explanation below

  • original_publication_year: Missing only for 21 out of the 10 books; Retain and take care of this issue during individual analysis
  • original_title: We have the title column that can be used in its place if missing
  • language_code: Retain and take care of this issue during individual analysis

Anomaly in the data recorded

Among the different summary() codes that were run across all tables, only two anomalies were observed

  • The original_publication_year in books had a negative year
  • The count in book_tags which records how many times a book was given the particular genre has negative values
summary(books$original_publication_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1750    1990    2004    1982    2011    2017      21
summary(book_tags$count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     -1.0      7.0     15.0    208.9     40.0 596234.0

On checking a few records for the original publication year, like
original_publication_year title goodreads_book_id
-720 The Odyssey 1381
-750 The Iliad 1371
-500 The Art of War 10534
-380 The Republic 30289
-430 Oedipus Rex (The Theban Plays, #1) 1554

I noticed that the books were actually written in B.C (Wow!) ***

The count in book_tags was mutated to 0 if negative

book_tags[which(book_tags$count < 0), "count"] = 0

The tags dataset

Most of the issues were uncovered in the tags dataset. The tag_name column can have several generic tags and several user defined tags. User defined tags generally do not qualify under the genre of the book, as most of them are like to read, currently reading etc. as described below:

book_tags %>% 
  group_by(tag_id) %>%
  summarise(num_books = n_distinct(goodreads_book_id)) %>%
  mutate(rank = rank(desc(num_books))) %>%
  filter(rank <= 5) %>%
  left_join(tags, by = "tag_id") %>%
  arrange(rank) %>%
  select(tag_name, num_books) %>%
  ungroup() %>%
  kable()
tag_name num_books
to-read 9983
favorites 9881
owned 9857
books-i-own 9799
currently-reading 9776

To identify 35 logical and distinct tags that could cover most books, several steps are performed:

  • Remove the special characters ‘-’ and ’_’ from the tag name
  • Remove non-alphanumeric characters from the tags
  • Remove tags that contain to read, reading, own, club, favorites etc.
  • Remove tags that are ya, novel, series (which cannot be tagged as genres)
# Remove the '-' and '_' from the tag name

tags <- tags %>%
  mutate(new_tag_name = str_replace_all(str_replace_all(tags$tag_name, "-", " "),"\\_", " ")) %>%
  select(tag_id, new_tag_name) %>%
  distinct(tag_id, new_tag_name)

# Remove all user defined tags and non alphanumeric ones
list_remove <- "+to read+|+reading+|^[[:digit:]]*$|+i own+|+currently+|+owned+|[^[:alnum:] ]|+favorites+|+club+|+buy+|+library+|+read+|+borrowed+|+abandoned+|+audio+|+ebook+|+kindle+|+default+"
tag_remove <- tags[grepl(list_remove,tags$new_tag_name),]

# Hard code removal of certain tags
tag_remove <- rbind(tag_remove,tags[tolower(tags$new_tag_name) %in% c('ya','novels','series'),])

  • Next the common genres with slightly different names, for example, classics and classic, were collated into a single tag_id for both tags and book_tags. The same procedure was repeated for children’s books, non fiction, graphic novels, and science fiction

  • This was a recursive step till we get unique genres A sample code for one case is below

# Identifying what to replace
keep_tags_classic <- tags[which(tags$new_tag_name %in% c("classic")), "tag_id"]
new_tag_classic   <- tags[which(tags$new_tag_name %in% c("classics")), "tag_id"]

# Replacing the tags in the dataset
tags[which(tags$new_tag_name %in% c("classic")), "tag_id"] = new_tag_classic
tags[which(tags$new_tag_name %in% c("classic")), "new_tag_name"] = "classics"
book_tags[which(book_tags$tag_id %in% keep_tags_classic ), "tag_id"] = new_tag_classic

# Getting the unique ids
tags <- unique(tags)
book_tags <- group_by(goodreads_book_id, tag_id) %>%
  summarise(count = sum(count)) %>%
  ungroup()

# For children's genre which have similar names
keep_tags_children <- tags[which(tags$new_tag_name %in% c("children s", "children s books", "childrens", "kids",
                                                          "childrens books")), "tag_id"]
new_tag_children   <- tags[which(tags$new_tag_name %in% c("children")), "tag_id"]

tags[which(tags$new_tag_name %in% c("children s", "children s books", "childrens", "kids", 
                                    "childrens books")), "tag_id"] = new_tag_children
tags[which(tags$new_tag_name %in% c("children s", "children s books", "childrens", "kids", "childrens books")),
     "new_tag_name"] = "children"

book_tags[which(book_tags$tag_id %in% keep_tags_children ), "tag_id"] = new_tag_children

# For classics genre which have similar names
keep_tags_classic <- tags[which(tags$new_tag_name %in% c("classic")), "tag_id"]
new_tag_classic   <- tags[which(tags$new_tag_name %in% c("classics")), "tag_id"]

tags[which(tags$new_tag_name %in% c("classic")), "tag_id"] = new_tag_classic
tags[which(tags$new_tag_name %in% c("classic")), "new_tag_name"] = "classics"

book_tags[which(book_tags$tag_id %in% keep_tags_classic ), "tag_id"] = new_tag_classic

# For graphic novels genre which have similar names
keep_tags_graphic <- tags[which(tags$new_tag_name %in% c("graphic novel")), "tag_id"]
new_tag_graphic   <- tags[which(tags$new_tag_name %in% c("graphic novels")), "tag_id"][1]

tags[which(tags$new_tag_name %in% c("graphic novel")), "tag_id"] = new_tag_graphic
tags[which(tags$new_tag_name %in% c("graphic novel")), "new_tag_name"] = "graphic novels"

book_tags[which(book_tags$tag_id %in% keep_tags_graphic ), "tag_id"] = new_tag_graphic

# For non-fiction genre which have similar names
keep_tags_non <- tags[which(tags$new_tag_name %in% c("nonfiction")), "tag_id"]
new_tag_non   <- tags[which(tags$new_tag_name %in% c("non fiction")), "tag_id"][1]

tags[which(tags$new_tag_name %in% c("nonfiction")), "tag_id"] = new_tag_non
tags[which(tags$new_tag_name %in% c("nonfiction")), "new_tag_name"] = "non fiction"

book_tags[which(book_tags$tag_id %in% keep_tags_non), "tag_id"] = new_tag_non

# For science fiction genre which have similar names
keep_tags_sci <- tags[which(tags$new_tag_name %in% c("sci fi", "scifi", "sci fi fantasy")), "tag_id"]
new_tag_sci   <- tags[which(tags$new_tag_name %in% c("science fiction")), "tag_id"][1]

tags[which(tags$new_tag_name %in% c("sci fi", "scifi", "sci fi fantasy")), "tag_id"] = new_tag_sci
tags[which(tags$new_tag_name %in% c("sci fi", "scifi", "sci fi fantasy")), "new_tag_name"] = "science fiction"

book_tags[which(book_tags$tag_id %in% keep_tags_sci), "tag_id"] = new_tag_sci

book_tags[which(book_tags$tag_id == 26894), "tag_id"] = new_tag_sci
tags[which(tags$tag_id == 26894), "tag_id"] = new_tag_sci

# For dystopian genre which have similar names
keep_tags_dys <- tags[which(tags$new_tag_name %in% c("dystopia")), "tag_id"]
new_tag_dys  <- tags[which(tags$new_tag_name %in% c("dystopian")), "tag_id"]

tags[which(tags$new_tag_name %in% c("dystopia")), "tag_id"] = new_tag_dys
tags[which(tags$new_tag_name %in% c("dystopia")), "new_tag_name"] = "dystopian"

book_tags[which(book_tags$tag_id %in% keep_tags_dys), "tag_id"] = new_tag_dys

# Removing the duplicates from tags dataset
tags <- unique(tags)

# summarising the book_tags to get common counts
book_tags <- book_tags %>%
  group_by(goodreads_book_id, tag_id) %>%
  summarise(count = sum(count)) %>%
  ungroup()

As a last step, I subset the top 5 genres for each book (based on the count) and the top 35 genres across all books (by the number of books that have the genre in top 5). Thrse tags are checked for uniqueness and logical sense

# Removing the tags that are not actual genres
book_tags2 <- book_tags %>% anti_join(tag_remove, by = "tag_id")

# Ranking based on count and selecting top 5 genres
subset_tags <- book_tags2 %>% 
  group_by(goodreads_book_id) %>%
  mutate( rr = rank(desc(count))) %>%
  filter(rr <= 5) %>%
  ungroup()
 
# Selecting the top 35 genres across all books
subset_tags %>%  
  group_by(tag_id) %>% 
  summarise(num_of_books = n()) %>% 
  mutate(rank = rank(desc(num_of_books))) %>% 
  filter(rank <= 35) %>%
  ungroup() %>%
  left_join(tags, by = "tag_id") %>% 
  arrange(rank) %>%
  select(new_tag_name) %>%
  t %>%
  paste0
##  [1] "fiction"              "fantasy"              "romance"             
##  [4] "young adult"          "non fiction"          "mystery"             
##  [7] "classics"             "contemporary"         "science fiction"     
## [10] "historical fiction"   "children"             "thriller"            
## [13] "paranormal"           "chick lit"            "historical"          
## [16] "crime"                "horror"               "humor"               
## [19] "history"              "biography"            "literature"          
## [22] "memoir"               "graphic novels"       "dystopian"           
## [25] "vampires"             "contemporary romance" "adventure"           
## [28] "comics"               "urban fantasy"        "picture books"       
## [31] "short stories"        "suspense"             "philosophy"          
## [34] "new adult"            "epic fantasy"

I’ll join the tag names to the book_tags and discard the tags dataset

good_tags <- subset_tags %>% 
  group_by(tag_id) %>% 
  summarise(num_of_books = n()) %>% 
  mutate(rank = rank(desc(num_of_books))) %>% 
  filter(rank <= 35) %>%
  ungroup()

temp_book_tags <- book_tags2 %>% semi_join(good_tags, by = "tag_id")
final_tags <- tags %>% semi_join(good_tags, by = "tag_id")

final_book_tags <- temp_book_tags %>% 
  left_join(final_tags, by = "tag_id")

Snapshot of the data

Books

datatable(head(books, 500))

Book Tags

datatable(head(final_book_tags, 500))

Ratings

datatable(head(ratings, 500))

To Read

datatable(head(to_read, 500))

Data Description

Below is the snapshot of all the important columns
Variables Description Present_In
book_id User assigned book id based on popularity integer(1 - 10000) books, ratings, to_read
goodreads_book_id Book id linked to goodreads book_tags
books_count Number of editions of the book books
authors Author(s) of the book books
original_publication_year First year in which the book was published books
original_title First title of the book books
title Current running title of the edition selected books
language_code Language of the book books
average_rating Average Rating of the book (1 - 5) books
ratings_count Total users who have rated the book books
work_ratings_count Ratings of the current edition selected (1 - 5) books
work_text_reviews_count Total comments on the current edition selected books
tag_id ID of the tag book_tags
count Number of users who have tagged the genre for the book book_tags
new_tag_name Cleaned Tag name book_tags
user_id ID of the user (1 - 53424) ratings, to_read
rating Rating that the user has given integer(1 - 5) ratings

The future

Next steps

  • The first step is to understand how many books are written in which genre, language. Then to identify if there are particular year clusters that occupy a chunk of the popular books

  • Then I would like to do a ratings distribution across years, genres, editions, languages and a combination of these to identify any significant trends

  • The prominent factors can be used in SVM/CHAID/clustering to identify any strong indicators on how a book gets good ratings. (This depends on above trends)

  • Next would be a user analysis, based on the number of ratings they have given to identify any differences in what they read
    • Analysis 3 and 4 might be either-or, that is I might end up doing the significant one.
  • A recommendation engine based on the genre selected

Coding in R - What I would want to do

All the above analysis would require joining dataframes and slicing of the dapa The different plots that can help are: bargraphs, boxplots, histograms, and heatmaps (I might not know some now)

What I would want to learn

  • Plots: ggplots and what are the different plots that can be done
  • Machine Learning Techniques: How to implement clustering/CHAID/SVM in R
  • Shiny: To create the recommendation engine

Why do I feel that this might be too much to do?