1 Background

In this Rpubs, we are going to explore the goodreads dataset provided by Soumik on Kaggle and to performed a simple EDA with it. The dataset that I used can be found in here. From EDA we can find several insights, such as figuring out which books that are heavily discussed, most popular books among goodreads user, which companies that have published the highest number of books in the last few years, etc. There are plenty of insights that we can extract with the goodreads dataset by performing EDA only. More advanced analysis of the same dataset will be made for future report but for now, let’s do some data cleansing and exploratory data analysis.

2 Let’s get started

2.1 Import and read file

library(readr)
goodreads <- read_csv("books.csv")
head(goodreads)

2.2 Inspect dataset

library(dplyr)
glimpse(goodreads)
## Rows: 11,127
## Columns: 12
## $ bookID             <dbl> 1, 2, 4, 5, 8, 9, 10, 12, 13, 14, 16, 18, 21, 22, 2~
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling/Mary GrandPré", "J.K. Rowling/Mary Gr~
## $ average_rating     <dbl> 4.57, 4.49, 4.42, 4.56, 4.78, 3.74, 4.73, 4.38, 4.3~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages          <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 815, 215,~
## $ ratings_count      <dbl> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <dbl> 27591, 29221, 244, 36325, 164, 1, 808, 254, 4080, 4~
## $ publication_date   <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~

We have a total of 11,131 observations and 10 columns. The summary above also shows us several columns that need to be converted to their correct data types. Before we began to processed and analysed them, we will perform data wrangling first.

3 Data wrangling

For data wrangling process we are going to change some of the columns that were mistyped or were stored in incorrect data types. Here are some of the columns that we are going to converted to the correct data types: * Change to factor: language_code * Change to numeric: average_rating * Change to integer: num_pages * changed to datetime: publication_date * Dropped unwanted columns: isbn, isbn13

library(lubridate)

goodreads <- goodreads %>% 
  mutate(language_code = as.factor(language_code),
         average_rating = as.numeric(average_rating),
         num_pages = as.integer(num_pages),
         publication_date = mdy(publication_date)) %>%
  select(-c(isbn, isbn13, bookID))

goodreads
summary(goodreads)
##     title             authors          average_rating  language_code 
##  Length:11127       Length:11127       Min.   :0.000   eng    :8908  
##  Class :character   Class :character   1st Qu.:3.770   en-US  :1408  
##  Mode  :character   Mode  :character   Median :3.960   spa    : 218  
##                                        Mean   :3.934   en-GB  : 214  
##                                        3rd Qu.:4.140   fre    : 144  
##                                        Max.   :5.000   ger    :  99  
##                                        NA's   :4       (Other): 136  
##    num_pages      ratings_count     text_reviews_count publication_date    
##  Min.   :   0.0   Min.   :      0   Min.   :    0.0    Min.   :1900-01-01  
##  1st Qu.: 192.0   1st Qu.:    104   1st Qu.:    9.0    1st Qu.:1998-07-17  
##  Median : 299.0   Median :    745   Median :   47.0    Median :2003-03-01  
##  Mean   : 336.4   Mean   :  17936   Mean   :  541.9    Mean   :2000-08-29  
##  3rd Qu.: 416.0   3rd Qu.:   4994   3rd Qu.:  237.5    3rd Qu.:2005-10-01  
##  Max.   :6576.0   Max.   :4597666   Max.   :94265.0    Max.   :2020-03-31  
##  NA's   :4                                             NA's   :6           
##   publisher        
##  Length:11127      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Good, now every columns are stored in the correct data types. But, as we had run through the process of doing explicit coercion with our columns, it is possible to have NA or missing values in our rows. Now let’s check whether we have a duplicated and missing values in our dataset or not.

sum(duplicated(goodreads))
## [1] 0

There’s no duplicated rows in our goodreads dataset.

anyNA(goodreads)
## [1] TRUE

However, we do have missing values. Let’s detect the columns that contained missing values.

colSums(is.na(goodreads))
##              title            authors     average_rating      language_code 
##                  0                  0                  4                  0 
##          num_pages      ratings_count text_reviews_count   publication_date 
##                  4                  0                  0                  6 
##          publisher 
##                  0

Three columns appeared to have missing values in their rows, column average_rating, num_pages and publication_date. If we have missing values in our dataset, what kind of treatment should we give to those columns?

For numerical data, we can either drop those rows or replaced them with the mean number for each columns. There’s also another condition where we might consider dropping an observation if it contains more than 2 missing values, such as observation 3349.

which(is.na(goodreads$average_rating), arr.ind = TRUE)
## [1] 3349 4703 5878 8980
goodreads[3349,]

If we look here, observation number 3349, 4702, 5877 and 8979 have more than 2 missing values in their rows. Therefore, we’re going to exclude them from our dataset.

goodreads <- goodreads[-c(3349, 4702, 5877, 8979),]
goodreads

Check again for missing values.

colSums(is.na(goodreads))
##              title            authors     average_rating      language_code 
##                  0                  0                  3                  0 
##          num_pages      ratings_count text_reviews_count   publication_date 
##                  3                  0                  0                  5 
##          publisher 
##                  0

For publication date, since there were only 2 rows that have missing values, we will also exclude them from our dataset by using fucntion na.omit().

which(is.na(goodreads$publication_date), arr.ind = TRUE)
## [1]  4701  5875  8178  8976 11095
goodreads <- goodreads %>% na.omit()
anyNA(goodreads)
## [1] FALSE

At this stage, we have performed data cleansing as we have stored all columns to their correct data types and we no longer have duplicated nor missing values in our dataset.

head(goodreads)
table(goodreads$language_code)
## 
## 9780674842113 9780851742717 9781563841552 9781593600112           ale 
##             0             0             0             0             1 
##           ara         en-CA         en-GB         en-US           eng 
##             1             7           214          1408          8904 
##           enm           fre           ger           gla           glg 
##             3           143            99             1             1 
##           grc           ita           jpn           lat           msa 
##            11             5            46             3             1 
##           mul            nl           nor           por           rus 
##            19             1             1            10             2 
##           spa           srp           swe           tur           wel 
##           218             1             2             1             1 
##           zho 
##            14

There are some unused levels in column language_code that we need to drop. These levels include: “9780674842113” “9780851742717”, “9781563841552”, “9781593600112”

goodreads$language_code <- droplevels(goodreads$language_code)

# check again whether the unused levels have been dropped or not
table(goodreads$language_code)
## 
##   ale   ara en-CA en-GB en-US   eng   enm   fre   ger   gla   glg   grc   ita 
##     1     1     7   214  1408  8904     3   143    99     1     1    11     5 
##   jpn   lat   msa   mul    nl   nor   por   rus   spa   srp   swe   tur   wel 
##    46     3     1    19     1     1    10     2   218     1     2     1     1 
##   zho 
##    14

The unused levels are now gone. The next step that we’re going to do is to combine level “en-CA”, “en-GB”, “en-US” and “eng” as a single level, which is “eng” only. We won’t differentiate books that were translated into the US, UK or Canadian version of english.

library(forcats)
# rename factor levels to "eng"
new_lc <- fct_recode(goodreads$language_code, "eng" = "en-US", "eng" = "en-GB", "eng" = "en-CA")
table(new_lc)
## new_lc
##   ale   ara   eng   enm   fre   ger   gla   glg   grc   ita   jpn   lat   msa 
##     1     1 10533     3   143    99     1     1    11     5    46     3     1 
##   mul    nl   nor   por   rus   spa   srp   swe   tur   wel   zho 
##    19     1     1    10     2   218     1     2     1     1    14

Combine new_lc to goodreads dataframe.

goodreads_clean <- cbind(goodreads, new_lc)
goodreads_clean

Drop column language_code then renamed column new_lc.

goodreads_clean <- goodreads_clean %>% 
  select(-language_code) %>% 
  rename("language_code" = "new_lc")

table(goodreads_clean$language_code)
## 
##   ale   ara   eng   enm   fre   ger   gla   glg   grc   ita   jpn   lat   msa 
##     1     1 10533     3   143    99     1     1    11     5    46     3     1 
##   mul    nl   nor   por   rus   spa   srp   swe   tur   wel   zho 
##    19     1     1    10     2   218     1     2     1     1    14

Next, we’ll do some feature engineering with column publication_date and to only extract the year which the books were published. In order to do that, we can do it both with base R and the package lubridate.

library(lubridate)

goodreads_clean$year <- year(goodreads_clean$publication_date)

goodreads_clean %>% 
  select(year, publication_date) %>% 
  head(10)

Check year range.

range(goodreads_clean$year)
## [1] 1900 2020

The publication date range from 1900 until 2020.

4 EDA

Check distributions for all numerical columns.

library(Hmisc)

goodreads_clean %>% 
  select(is.numeric) %>% 
  hist.data.frame()

Column average_rating is normally distributed and most likely to have a mean of around 4. Most of the books also have a length of less than 1500 pages. It also seems that a majority number of books didn’t receive a rating nor text reviews from users hence why there’s more than 10,000 books out of 11,118 observations that have 0 ratings count.

library(ggplot2)
goodreads_clean %>% 
  ggplot(aes(x = ratings_count, y = text_reviews_count)) +
  geom_point(color = "orange", size = 3) +
  geom_smooth() +
  labs(title = "Correlation between ratings and text reviews count",
       x = "ratings count",
       y = "text reviews count") +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Ratings_count and text_reviews_count have a positive linear correlation. As the ratings count increases, so does the text reviews count.

Next, we’ll check the ten most popular and heavily discussed books among goodreads user based on the total number of their ratings count.

ratings <- goodreads_clean %>% 
  select(c(title, ratings_count, text_reviews_count, average_rating)) %>%
  arrange(desc(ratings_count)) %>% 
  head(10)

ratings
ratings %>% 
  ggplot(aes(x = ratings_count, y = reorder(title, ratings_count))) +
  geom_col(aes(fill = ratings_count)) +
  scale_fill_gradient(low = "#76777B", high = "#FFCB05") +
  xlab("ratings count") +
  ylab("") +
  guides(fill="none") +
  theme_minimal()

Proportion of books published in english and other languages.

# rename all rows that have a level of "eng" to "English" and then assigned it to a new column
goodreads_clean$english <- gsub("eng","English",goodreads_clean$language_code)

# change column "english" to a factor
goodreads_clean$english <- as.factor(goodreads_clean$english)

# rename all rows that were not published in english to "Other languages"
language_vec <- ifelse(goodreads_clean$english == "English", "English", "Other languages")
goodreads_clean <- cbind(goodreads_clean, language_vec)

Drop column english then changed column language_vec into a factor.

goodreads_clean <- goodreads_clean %>% 
  select(-english) %>% 
  mutate(language_vec = as.factor(language_vec))

head(goodreads_clean)

Check levels for column language_vec

levels(goodreads_clean$language_vec)
## [1] "English"         "Other languages"

We’ve done with our feature engineering by creating a new column that categorized levels in column language_code to either “English” or “Other languages”.

# create new data frame that only contains the frequency of each labels
languages <- as.data.frame(table(goodreads_clean$language_vec))
languages

Now, we’ll visualized it with a pie chart.

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
sum_of_observation <- 11118

languages %>% 
  ggplot(aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat="identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  theme_void() +
  scale_fill_manual(values = c("#76777B", "#FFCB05")) +
  labs(title = "Language proportion",
       subtitle = "Number of books published in english and in other languages",
       fill = "Language") +
  geom_text(aes(label = paste0(round(Freq/sum_of_observation, 2)*100, "%")), position = position_stack(vjust = 0.5))

  #geom_text(aes(y = Freq/2 + c(0, cumsum(Freq)[-length(Freq)]),
                #label = percent(Freq/sum_of_observation)), size=5)

Now let’s further check the language proportion from non-english books.

# Prepare the dataset
non_english <- goodreads_clean %>% 
  filter(!language_code == "eng") %>% 
  select(language_code)

non_english <- as.data.frame(table(non_english))

# limit the plot to 15 languages
non_english <- non_english %>% 
  arrange(desc(Freq)) %>% 
  head(15)

non_english

Let’s visualized it with a lollipop plot.

non_english %>% 
  ggplot(aes(x = non_english, y = Freq)) +
  geom_segment(aes(x = non_english, xend = non_english, y = 0, yend = Freq), color = "grey") +
  geom_point( color="orange", size=4) +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  xlab("languages other than english") +
  ylab("number of books") +
  labs(title = "Number of books published in other languages")

Among the non-english books, spain, french, germany and japan are the languages that have the most number of books from the goodreads dataset.

Next, we’ll find out which publisher had published the most number of books between 2000 - 2020.

twenty_years <- goodreads_clean %>% 
  filter(year >= 2000)

publisher <- as.data.frame(table(twenty_years$publisher))
publisher %>% 
  arrange(desc(Freq)) %>% 
  head(15) %>%
  ggplot(aes(x = Freq, y = reorder(Var1, Freq))) +
  geom_col(aes(fill = Freq)) +
  scale_fill_gradient(low = "#76777B", high = "#FFCB05") +
  xlab("number of books published") +
  ylab("publisher") +
  labs(title = "Top 15 publishing companies",
       subtitle = "Based on number of books published in 2000 - 2020") +
  guides(fill="none") +
  theme_minimal()

median(twenty_years$average_rating)
## [1] 3.95

The average rating is 3.95. Now let’s have a conditional subsetting with books that have an average rating of more than that number.

above_average <- twenty_years %>%
  filter(average_rating >= 3.95)

above_average

Find out which authors produced the most number of books with an average rating score of above the average.

above_average_freq <- as.data.frame(table(above_average$authors))

above_average_freq %>% 
  arrange(desc(Freq)) %>% 
  head(20) %>% 
  ggplot(aes(x = Freq, y = reorder(Var1, Freq))) +
  geom_col(aes(fill = Freq)) +
  scale_fill_gradient(low = "#76777B", high = "#FFCB05") +
  xlab("number of books") +
  ylab("author") +
  labs(title = "Top 20 Authors between 2000 - 2020",
       subtitle = "Based on number of books with an average ratings above 3.95") +
  guides(fill="none") +
  theme_minimal()

5 Conclusion

Through the process of EDA, we were able to extract several insights from our goodreads dataset. One of them was to find out which of the ten most popular and heavily discussed books among goodreads user based on the ratings count. The following books were part of that list: Twilight #1, The Hobbit, Harry Potter, Angel or Demons and The Catcher in the Rye. Another insight that we found is that there’s an overwhelming amount of books published in english (95%) compared to books published in other languages (5%). Vintage has become the number one publishing company with the highest number of books published since 2000-2020. The average rating from books published between 2000-2020 was 3.95 and P. G. Wodehouse was the only author who has more than 30 books which all were rated above the average.