Problem Statement: People love books, and they enjoy discussing and sharing the books they love. Goodreads is an amazing platform designed for such bibliophiles (and budding bibliophiles!). What I felt was missing from the website is a summary analysis of popular books. The objective of this project is to analyze data on popular books from Goodreads and identify interesting trends in these books and what makes them popular.
Lastly, I want to build a recommendation engine based on the genre(s) that the user selects.
The packages used in the project (currently) are listed below. The %>% pipe operator, which comes along with the tidyverse, is used throughout!
# Loading the packages
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(stringr)
The original goodbooks-10k data was scraped in August 2017. The creator's objective was to provide a book dataset comparable to those available for movies and songs, and to build a recommendation engine for books.
The dataset was first published on Kaggle. Contributors then identified anomalies in the data, such as duplicate records and multiple ratings by the same user for the same book. These anomalies were removed, and the cleaned data was posted to GitHub. We will load the data from this GitHub repository.
The column information is provided for the key variables. Some assumptions are made for the other variables based on the data present (and by looking up the Goodreads website).
The data is contiguous for 10k books and 50k users. The dataset has five tables, described below:
| Table | Description | Notes |
|---|---|---|
| books | Data of the 10k most popular books with metadata (author, ratings, etc.) | Popular books are determined by the total number of users who have rated the book |
| book_tags | Genres associated with each book, along with the number of times each tag was applied | Both generic and user-defined tags are included; user-defined tags can be anything |
| tags | Description of the genres | |
| ratings | User, book, rating level data for 10k books and 50k users | For 10k books and 50k users who have made at least 2 ratings |
| to_read | User, book pair that a user has marked to read | For 10k books and 50k users |
On initial checks, I found that the datasets had blanks that R could not identify as NA, so I added na.strings to help R identify them as nulls.
# Reading all the datasets from github using fread
ratings <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/ratings.csv", na.strings = c("", "NA"))
book_tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/book_tags.csv", na.strings = c("", "NA"))
books <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/books.csv", na.strings = c("", "NA"))
tags <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/tags.csv", na.strings = c("", "NA"))
to_read <- fread("https://github.com/zygmuntz/goodbooks-10k/raw/master/to_read.csv", na.strings = c("", "NA"))
The attributes of the different data frames are described below.
# Creating a list of all tables
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)
# Writing a function to get the dimensions of a table as a "rows columns" string
dim_func <- function(x){
  paste(dim(tbl_list[[x]]), collapse = " ")
}
# Writing a function to get all column names as a comma-separated string
names_func <- function(y){
  paste(names(tbl_list[[y]]), collapse = ", ")
}
# Creating a table
table_attributes <- tibble(
  Table = names(tbl_list),
  `Rows Columns` = unlist(lapply(seq_along(tbl_list), dim_func)),
  Variables = unlist(lapply(seq_along(tbl_list), names_func))
)
# Removing the list
rm(tbl_list)
# Printing the table
kable(table_attributes, format = "html") %>%
kable_styling(bootstrap_options = "striped") %>%
column_spec(2, width = "12em")
| Table | Rows Columns | Variables |
|---|---|---|
| books | 10000 23 | book_id, goodreads_book_id, best_book_id, work_id, books_count, isbn, isbn13, authors, original_publication_year, original_title, title, language_code, average_rating, ratings_count, work_ratings_count, work_text_reviews_count, ratings_1, ratings_2, ratings_3, ratings_4, ratings_5, image_url, small_image_url |
| book_tags | 999912 3 | goodreads_book_id, tag_id, count |
| tags | 34252 2 | tag_id, tag_name |
| ratings | 5976479 3 | user_id, book_id, rating |
| to_read | 912705 2 | user_id, book_id |
As a first step, I will remove the columns that I do not intend to use from the books table. All other tables contain only the relevant columns.
# Selecting the relevant columns
books <- books %>% select(-c(best_book_id, work_id, isbn, isbn13), -(ratings_1:small_image_url))
tbl_list <- list("books" = books, "book_tags" = book_tags, "tags" = tags, "ratings" = ratings, "to_read" = to_read)
The next step in cleaning the data was to identify any missing values across the datasets. A quick is.na() check reveals that only the books table has missing values in some columns.
# Identifying missing values across all tables using a loop
for (x in seq_along(tbl_list)) {
dd <- colSums(is.na(tbl_list[[x]]))
if (sum(dd[dd > 0]) > 0) {
print(paste("For the data -",names(tbl_list[x])))
print(dd[dd > 0])
}
}
## [1] "For the data - books"
## original_publication_year original_title
## 21 590
## language_code
## 1084
# Removing the list
rm(tbl_list)
I have decided to retain all the columns, based on the explanation below.
Among the different summary() calls that were run across all the tables, only two anomalies were observed:
- original_publication_year in books has negative years
- count in book_tags, which records how many times a book was given a particular genre, has negative values
# Display summary statistics of the important columns
summary(books$original_publication_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1750 1990 2004 1982 2011 2017 21
summary(book_tags$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0 7.0 15.0 208.9 40.0 596234.0
| original_publication_year | title | goodreads_book_id |
|---|---|---|
| -720 | The Odyssey | 1381 |
| -750 | The Iliad | 1371 |
| -500 | The Art of War | 10534 |
| -380 | The Republic | 30289 |
| -430 | Oedipus Rex (The Theban Plays, #1) | 1554 |
I noticed that these books were actually written in B.C. (Wow!), so the negative publication years are legitimate.
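For reference, these rows can be pulled with a filter along the following lines (a sketch rather than the original chunk, using the dplyr verbs loaded via the tidyverse):
# Listing the popular books with a negative (B.C.) publication year
books %>%
  filter(original_publication_year < 0) %>%
  select(original_publication_year, title, goodreads_book_id)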
The count in book_tags was set to 0 wherever it was negative.
# Setting negative counts to 0 in place, using data.table's := operator
book_tags[count < 0, count := 0]
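# Interactive preview of the first 500 rows of the books, ratings and to_read tables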
datatable(head(books, 500))
datatable(head(ratings, 500))
datatable(head(to_read, 500))
| Variables | Description | Present_In |
|---|---|---|
| book_id | User-assigned book ID based on popularity, integer (1 - 10000) | books, ratings, to_read |
| goodreads_book_id | Book ID linked to Goodreads | books, book_tags |
| books_count | Number of editions of the book | books |
| authors | Author(s) of the book | books |
| original_publication_year | First year in which the book was published | books |
| original_title | First title of the book | books |
| title | Current running title of the edition selected | books |
| language_code | Language of the book | books |
| average_rating | Average Rating of the book (1 - 5) | books |
| ratings_count | Total users who have rated the book | books |
| work_ratings_count | Ratings of the current edition selected (1 - 5) | books |
| work_text_reviews_count | Total comments on the current edition selected | books |
| tag_id | ID of the tag | book_tags |
| count | Number of users who have tagged the book with the genre | book_tags |
| new_tag_name | Cleaned Tag name | book_tags |
| user_id | ID of the user (1 - 53424) | ratings, to_read |
| rating | Rating that the user has given integer(1 - 5) | ratings |
The first step is to understand how many books are written in each genre and language, and then to identify whether particular publication-year clusters account for a large share of the popular books.
Next, I would like to look at the distribution of ratings across years, genres, editions, languages, and combinations of these to identify any significant trends.
The prominent factors can then be used in SVM/CHAID/clustering to identify any strong indicators of what earns a book good ratings (this depends on the trends above).
Finally, I want to build a recommendation engine based on the genre(s) selected.
All of the above analysis will require joining data frames and slicing the data. The plots that can help include bar graphs, box plots, histograms, and heat maps (there may be others I am not aware of yet). A sketch of a typical join is given below.
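As a preview of the joins this will involve, here is a sketch (my own illustration, not part of the analysis yet; the object name book_genres is made up) that attaches tag names to each book and counts how many of the popular books carry each tag:
# Sketch: attach tag names to books via book_tags and tags
book_genres <- book_tags %>%
  inner_join(tags, by = "tag_id") %>%
  inner_join(books %>% select(goodreads_book_id, title, average_rating),
             by = "goodreads_book_id")
# Ten most frequently applied tags across the 10k popular books
book_genres %>%
  count(tag_name, sort = TRUE) %>%
  head(10)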
I hope this is doable :)