In this Rpubs, we are going to explore the goodreads dataset provided by Soumik on Kaggle and to performed a simple EDA with it. The dataset that I used can be found in here. From EDA we can find several insights, such as figuring out which books that are heavily discussed, most popular books among goodreads user, which companies that have published the highest number of books in the last few years, etc. There are plenty of insights that we can extract with the goodreads dataset by performing EDA only. More advanced analysis of the same dataset will be made for future report but for now, let’s do some data cleansing and exploratory data analysis.
library(readr)
goodreads <- read_csv("books.csv")
head(goodreads)library(dplyr)
glimpse(goodreads)## Rows: 11,127
## Columns: 12
## $ bookID <dbl> 1, 2, 4, 5, 8, 9, 10, 12, 13, 14, 16, 18, 21, 22, 2~
## $ title <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors <chr> "J.K. Rowling/Mary GrandPré", "J.K. Rowling/Mary Gr~
## $ average_rating <dbl> 4.57, 4.49, 4.42, 4.56, 4.78, 3.74, 4.73, 4.38, 4.3~
## $ isbn <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13 <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 815, 215,~
## $ ratings_count <dbl> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <dbl> 27591, 29221, 244, 36325, 164, 1, 808, 254, 4080, 4~
## $ publication_date <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
We have a total of 11,131 observations and 10 columns. The summary above also shows us several columns that need to be converted to their correct data types. Before we began to processed and analysed them, we will perform data wrangling first.
For data wrangling process we are going to change some of the columns that were mistyped or were stored in incorrect data types. Here are some of the columns that we are going to converted to the correct data types: * Change to factor: language_code * Change to numeric: average_rating * Change to integer: num_pages * changed to datetime: publication_date * Dropped unwanted columns: isbn, isbn13
library(lubridate)
goodreads <- goodreads %>%
mutate(language_code = as.factor(language_code),
average_rating = as.numeric(average_rating),
num_pages = as.integer(num_pages),
publication_date = mdy(publication_date)) %>%
select(-c(isbn, isbn13, bookID))
goodreadssummary(goodreads)## title authors average_rating language_code
## Length:11127 Length:11127 Min. :0.000 eng :8908
## Class :character Class :character 1st Qu.:3.770 en-US :1408
## Mode :character Mode :character Median :3.960 spa : 218
## Mean :3.934 en-GB : 214
## 3rd Qu.:4.140 fre : 144
## Max. :5.000 ger : 99
## NA's :4 (Other): 136
## num_pages ratings_count text_reviews_count publication_date
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. :1900-01-01
## 1st Qu.: 192.0 1st Qu.: 104 1st Qu.: 9.0 1st Qu.:1998-07-17
## Median : 299.0 Median : 745 Median : 47.0 Median :2003-03-01
## Mean : 336.4 Mean : 17936 Mean : 541.9 Mean :2000-08-29
## 3rd Qu.: 416.0 3rd Qu.: 4994 3rd Qu.: 237.5 3rd Qu.:2005-10-01
## Max. :6576.0 Max. :4597666 Max. :94265.0 Max. :2020-03-31
## NA's :4 NA's :6
## publisher
## Length:11127
## Class :character
## Mode :character
##
##
##
##
Good, now every columns are stored in the correct data types. But, as we had run through the process of doing explicit coercion with our columns, it is possible to have NA or missing values in our rows. Now let’s check whether we have a duplicated and missing values in our dataset or not.
sum(duplicated(goodreads))## [1] 0
There’s no duplicated rows in our goodreads dataset.
anyNA(goodreads)## [1] TRUE
However, we do have missing values. Let’s detect the columns that contained missing values.
colSums(is.na(goodreads))## title authors average_rating language_code
## 0 0 4 0
## num_pages ratings_count text_reviews_count publication_date
## 4 0 0 6
## publisher
## 0
Three columns appeared to have missing values in their rows, column average_rating, num_pages and publication_date. If we have missing values in our dataset, what kind of treatment should we give to those columns?
For numerical data, we can either drop those rows or replaced them with the mean number for each columns. There’s also another condition where we might consider dropping an observation if it contains more than 2 missing values, such as observation 3349.
which(is.na(goodreads$average_rating), arr.ind = TRUE)## [1] 3349 4703 5878 8980
goodreads[3349,]If we look here, observation number 3349, 4702, 5877 and 8979 have more than 2 missing values in their rows. Therefore, we’re going to exclude them from our dataset.
goodreads <- goodreads[-c(3349, 4702, 5877, 8979),]
goodreadsCheck again for missing values.
colSums(is.na(goodreads))## title authors average_rating language_code
## 0 0 3 0
## num_pages ratings_count text_reviews_count publication_date
## 3 0 0 5
## publisher
## 0
For publication date, since there were only 2 rows that have missing values, we will also exclude them from our dataset by using fucntion na.omit().
which(is.na(goodreads$publication_date), arr.ind = TRUE)## [1] 4701 5875 8178 8976 11095
goodreads <- goodreads %>% na.omit()
anyNA(goodreads)## [1] FALSE
At this stage, we have performed data cleansing as we have stored all columns to their correct data types and we no longer have duplicated nor missing values in our dataset.
head(goodreads)table(goodreads$language_code)##
## 9780674842113 9780851742717 9781563841552 9781593600112 ale
## 0 0 0 0 1
## ara en-CA en-GB en-US eng
## 1 7 214 1408 8904
## enm fre ger gla glg
## 3 143 99 1 1
## grc ita jpn lat msa
## 11 5 46 3 1
## mul nl nor por rus
## 19 1 1 10 2
## spa srp swe tur wel
## 218 1 2 1 1
## zho
## 14
There are some unused levels in column language_code that we need to drop. These levels include: “9780674842113” “9780851742717”, “9781563841552”, “9781593600112”
goodreads$language_code <- droplevels(goodreads$language_code)
# check again whether the unused levels have been dropped or not
table(goodreads$language_code)##
## ale ara en-CA en-GB en-US eng enm fre ger gla glg grc ita
## 1 1 7 214 1408 8904 3 143 99 1 1 11 5
## jpn lat msa mul nl nor por rus spa srp swe tur wel
## 46 3 1 19 1 1 10 2 218 1 2 1 1
## zho
## 14
The unused levels are now gone. The next step that we’re going to do is to combine level “en-CA”, “en-GB”, “en-US” and “eng” as a single level, which is “eng” only. We won’t differentiate books that were translated into the US, UK or Canadian version of english.
library(forcats)
# rename factor levels to "eng"
new_lc <- fct_recode(goodreads$language_code, "eng" = "en-US", "eng" = "en-GB", "eng" = "en-CA")
table(new_lc)## new_lc
## ale ara eng enm fre ger gla glg grc ita jpn lat msa
## 1 1 10533 3 143 99 1 1 11 5 46 3 1
## mul nl nor por rus spa srp swe tur wel zho
## 19 1 1 10 2 218 1 2 1 1 14
Combine new_lc to goodreads dataframe.
goodreads_clean <- cbind(goodreads, new_lc)
goodreads_cleanDrop column language_code then renamed column new_lc.
goodreads_clean <- goodreads_clean %>%
select(-language_code) %>%
rename("language_code" = "new_lc")
table(goodreads_clean$language_code)##
## ale ara eng enm fre ger gla glg grc ita jpn lat msa
## 1 1 10533 3 143 99 1 1 11 5 46 3 1
## mul nl nor por rus spa srp swe tur wel zho
## 19 1 1 10 2 218 1 2 1 1 14
Next, we’ll do some feature engineering with column publication_date and to only extract the year which the books were published. In order to do that, we can do it both with base R and the package lubridate.
library(lubridate)
goodreads_clean$year <- year(goodreads_clean$publication_date)
goodreads_clean %>%
select(year, publication_date) %>%
head(10)Check year range.
range(goodreads_clean$year)## [1] 1900 2020
The publication date range from 1900 until 2020.
Check distributions for all numerical columns.
library(Hmisc)
goodreads_clean %>%
select(is.numeric) %>%
hist.data.frame()Column average_rating is normally distributed and most likely to have a mean of around 4. Most of the books also have a length of less than 1500 pages. It also seems that a majority number of books didn’t receive a rating nor text reviews from users hence why there’s more than 10,000 books out of 11,118 observations that have 0 ratings count.
library(ggplot2)
goodreads_clean %>%
ggplot(aes(x = ratings_count, y = text_reviews_count)) +
geom_point(color = "orange", size = 3) +
geom_smooth() +
labs(title = "Correlation between ratings and text reviews count",
x = "ratings count",
y = "text reviews count") +
theme_minimal()## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Ratings_count and text_reviews_count have a positive linear correlation. As the ratings count increases, so does the text reviews count.
Next, we’ll check the ten most popular and heavily discussed books among goodreads user based on the total number of their ratings count.
ratings <- goodreads_clean %>%
select(c(title, ratings_count, text_reviews_count, average_rating)) %>%
arrange(desc(ratings_count)) %>%
head(10)
ratingsratings %>%
ggplot(aes(x = ratings_count, y = reorder(title, ratings_count))) +
geom_col(aes(fill = ratings_count)) +
scale_fill_gradient(low = "#76777B", high = "#FFCB05") +
xlab("ratings count") +
ylab("") +
guides(fill="none") +
theme_minimal()Proportion of books published in english and other languages.
# rename all rows that have a level of "eng" to "English" and then assigned it to a new column
goodreads_clean$english <- gsub("eng","English",goodreads_clean$language_code)
# change column "english" to a factor
goodreads_clean$english <- as.factor(goodreads_clean$english)
# rename all rows that were not published in english to "Other languages"
language_vec <- ifelse(goodreads_clean$english == "English", "English", "Other languages")
goodreads_clean <- cbind(goodreads_clean, language_vec)Drop column english then changed column language_vec into a factor.
goodreads_clean <- goodreads_clean %>%
select(-english) %>%
mutate(language_vec = as.factor(language_vec))
head(goodreads_clean)Check levels for column language_vec
levels(goodreads_clean$language_vec)## [1] "English" "Other languages"
We’ve done with our feature engineering by creating a new column that categorized levels in column language_code to either “English” or “Other languages”.
# create new data frame that only contains the frequency of each labels
languages <- as.data.frame(table(goodreads_clean$language_vec))
languagesNow, we’ll visualized it with a pie chart.
library(scales)##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
sum_of_observation <- 11118
languages %>%
ggplot(aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat="identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
theme_void() +
scale_fill_manual(values = c("#76777B", "#FFCB05")) +
labs(title = "Language proportion",
subtitle = "Number of books published in english and in other languages",
fill = "Language") +
geom_text(aes(label = paste0(round(Freq/sum_of_observation, 2)*100, "%")), position = position_stack(vjust = 0.5)) #geom_text(aes(y = Freq/2 + c(0, cumsum(Freq)[-length(Freq)]),
#label = percent(Freq/sum_of_observation)), size=5)Now let’s further check the language proportion from non-english books.
# Prepare the dataset
non_english <- goodreads_clean %>%
filter(!language_code == "eng") %>%
select(language_code)
non_english <- as.data.frame(table(non_english))
# limit the plot to 15 languages
non_english <- non_english %>%
arrange(desc(Freq)) %>%
head(15)
non_englishLet’s visualized it with a lollipop plot.
non_english %>%
ggplot(aes(x = non_english, y = Freq)) +
geom_segment(aes(x = non_english, xend = non_english, y = 0, yend = Freq), color = "grey") +
geom_point( color="orange", size=4) +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank()
) +
xlab("languages other than english") +
ylab("number of books") +
labs(title = "Number of books published in other languages")Among the non-english books, spain, french, germany and japan are the languages that have the most number of books from the goodreads dataset.
Next, we’ll find out which publisher had published the most number of books between 2000 - 2020.
twenty_years <- goodreads_clean %>%
filter(year >= 2000)
publisher <- as.data.frame(table(twenty_years$publisher))publisher %>%
arrange(desc(Freq)) %>%
head(15) %>%
ggplot(aes(x = Freq, y = reorder(Var1, Freq))) +
geom_col(aes(fill = Freq)) +
scale_fill_gradient(low = "#76777B", high = "#FFCB05") +
xlab("number of books published") +
ylab("publisher") +
labs(title = "Top 15 publishing companies",
subtitle = "Based on number of books published in 2000 - 2020") +
guides(fill="none") +
theme_minimal()median(twenty_years$average_rating)## [1] 3.95
The average rating is 3.95. Now let’s have a conditional subsetting with books that have an average rating of more than that number.
above_average <- twenty_years %>%
filter(average_rating >= 3.95)
above_averageFind out which authors produced the most number of books with an average rating score of above the average.
above_average_freq <- as.data.frame(table(above_average$authors))
above_average_freq %>%
arrange(desc(Freq)) %>%
head(20) %>%
ggplot(aes(x = Freq, y = reorder(Var1, Freq))) +
geom_col(aes(fill = Freq)) +
scale_fill_gradient(low = "#76777B", high = "#FFCB05") +
xlab("number of books") +
ylab("author") +
labs(title = "Top 20 Authors between 2000 - 2020",
subtitle = "Based on number of books with an average ratings above 3.95") +
guides(fill="none") +
theme_minimal()Through the process of EDA, we were able to extract several insights from our goodreads dataset. One of them was to find out which of the ten most popular and heavily discussed books among goodreads user based on the ratings count. The following books were part of that list: Twilight #1, The Hobbit, Harry Potter, Angel or Demons and The Catcher in the Rye. Another insight that we found is that there’s an overwhelming amount of books published in english (95%) compared to books published in other languages (5%). Vintage has become the number one publishing company with the highest number of books published since 2000-2020. The average rating from books published between 2000-2020 was 3.95 and P. G. Wodehouse was the only author who has more than 30 books which all were rated above the average.