Introduction

Hey book enthusiasts! Our report, “Beyond the Pages,” is your backstage pass to the fascinating world of Goodreads books. We’re breaking down the data to find out what makes your favorite books tick. From finding out who the rockstar authors are to exploring the most-loved genres, we’re on a mission to unveil the secrets hidden in the numbers.

Curious about the trends over the years? Wondering which awards stand out? We’ve got you covered. Navigate through the story lengths, languages, and even the little details like publisher influence. This report isn’t just about numbers; it’s about the heart of the books you adore.

Overview

The analysis involved a comprehensive exploration of book ratings, spanning diverse genres, authors, and timeframes. To ensure the reliability and accuracy of our findings, a rigorous data cleaning and preprocessing phase preceded the analysis.

Key Findings

#load package
library(pacman)
p_load(tidyverse, janitor, lubridate, ggthemes, ggeasy, wordcloud, RColorBrewer)
# -------- load data ------
books_df <- read_csv("books.csv", show_col_types = FALSE)
# ------------ Data Cleaning ------------
books_df <- clean_names(books_df)
#Problem 2: convert date column to date
books_df$first_publish_date <- mdy(books_df$first_publish_date)
## Warning: 1186 failed to parse.
books_df <- books_df %>% mutate(year = year(first_publish_date))
#Problem 4: filter year from 1990-2020 inclusive
books_df <- books_df %>% filter(year >= 1990 & year <= 2020)
#Problem 5: drop unwanted columns
books_col_rm <- select(books_df, -c(publish_date, edition, characters, price, genres, setting, isbn))
#Problem 6: filter for pages below 1200
books <- books_col_rm %>% filter(pages < 1200)
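
The `mdy()` step above reported 1186 values that failed to parse. Those values become `NA` and are then dropped by the year filter, so it is worth looking at them before moving on. The following is a minimal base R sketch of how such failures could be counted and inspected. The sample strings and the `%m/%d/%y` format are assumptions for illustration, not values taken from `books.csv`.

```r
# Hypothetical raw date strings; the real column may use other formats
raw_dates <- c("09/14/08", "04/01/03", "November 1st 2005", NA, "Published 1999")

# Parse with one assumed format; anything that does not match becomes NA
parsed <- as.Date(raw_dates, format = "%m/%d/%y")

# Flag values that were present but could not be parsed
failed <- !is.na(raw_dates) & is.na(parsed)
n_failed <- sum(failed)

# Inspect the offending strings before deciding how to handle them
failed_values <- raw_dates[failed]
print(n_failed)
print(failed_values)
```

If the unparsed strings follow a second recognisable pattern, they could be parsed separately rather than discarded.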

1. Distribution of Rating:

The histogram illustrates the distribution of book ratings on Goodreads. The majority of books fall within the 3.79 to 4.15 rating range, indicating a generally positive sentiment among users.

#histogram of book ratings
ggplot(books, aes(x=rating)) + 
  geom_histogram(binwidth = .25, fill='red') +
  labs(title = 'Histogram of Book Ratings',
       x = 'Rating',
       y = 'Number of Books') +
  theme_bw()
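
The 3.79 to 4.15 range quoted above reads like an interquartile range. Assuming that is how it was obtained, the sketch below shows how such a range and the share of books inside it could be computed. The `ratings` vector is made-up data standing in for `books$rating`.

```r
# Hypothetical ratings standing in for books$rating
ratings <- c(3.2, 3.6, 3.8, 3.9, 4.0, 4.0, 4.1, 4.2, 4.4, 4.6)

# First and third quartiles bound the middle of the distribution
q <- quantile(ratings, probs = c(0.25, 0.75))

# Share of books whose rating falls inside that range
share_inside <- mean(ratings >= q[1] & ratings <= q[2])
print(q)
print(share_inside)
```

Reporting the share alongside the range makes it clear how many books the range actually covers.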

2. Pages Count:

The median number of pages per book is 319, meaning that about half of the books in our dataset have 319 pages or fewer. Additionally, the boxplot reveals that certain books in the dataset extend beyond 700 pages, with a subset exceeding 1000 pages.

#boxplot of number of pages
ggplot(books, aes(x = pages)) + geom_boxplot(fill= "magenta") +
  labs(title = "Box Plot of Page Counts",
       x = "Pages") +
  theme_economist()
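
The numbers behind a boxplot can also be computed directly. The sketch below uses the usual 1.5 x IQR rule to flag unusually long books. The `pages` vector is hypothetical and only stands in for `books$pages`.

```r
# Hypothetical page counts standing in for books$pages
pages <- c(120, 250, 300, 319, 320, 350, 400, 450, 720, 1100)

# Median: half of the values lie at or below it
med <- median(pages)

# Points more than 1.5 * IQR above the third quartile are treated as outliers
q <- quantile(pages, probs = c(0.25, 0.75))
upper_fence <- unname(q[2]) + 1.5 * IQR(pages)
long_books <- pages[pages > upper_fence]
print(med)
print(long_books)
```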

Now, when it comes to genres (like mystery, romance, or science fiction), they usually have certain expectations for how long a book should be. We dug a bit deeper and noticed something cool: books that fall into multiple genres tend to have more pages. This means if a book is in both the mystery and romance categories, it might have more pages. It’s like genres teaming up to create longer stories!

# group by genres and find the book count and average page count per genre
books_genres <- books_df %>% filter(pages <1200) %>% 
  group_by(genres) %>% 
  summarise(genres_count = n(),
            avg_page = mean(pages)) %>% 
  arrange(desc(genres_count))
#sort by descending order (avg_page)
books_genres %>% arrange(desc(avg_page))
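
The grouping above treats each distinct `genres` string as its own category, so on its own it does not compare single-genre books with multi-genre books. One way to check that claim more directly is to count the genres listed for each book and compare average page counts. This sketch assumes the `genres` column stores a bracketed, quoted list such as `['Fiction', 'Fantasy']`. Both that format and the toy data are assumptions.

```r
# Toy data; the genres format is an assumption about books.csv
toy <- data.frame(
  genres = c("['Fiction']", "['Mystery', 'Romance']",
             "['Fantasy', 'Young Adult', 'Magic']", "['Poetry']"),
  pages  = c(210, 380, 520, 96),
  stringsAsFactors = FALSE
)

# Count the quoted entries in each genres string
toy$n_genres <- lengths(regmatches(toy$genres, gregexpr("'[^']+'", toy$genres)))

# Average page count for single-genre vs multi-genre books
toy$multi_genre <- toy$n_genres > 1
avg_pages <- tapply(toy$pages, toy$multi_genre, mean)
print(toy$n_genres)
print(avg_pages)
```

Genre names that contain an apostrophe would need a more careful pattern than the one used here.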

3. Genres:

Among the individual genres, ‘Fiction’ is the most prevalent, with 59 books, indicating its popularity among readers. ‘Fantasy’ follows with 42 books, while ‘Poetry’ and ‘Nonfiction’ contribute 34 and 30 books and ‘Historical Fiction’ 22.

books_genres <- books_genres[-1,]
books_genres
options(warn=-1)
books_genres <- head(books_genres, n=5000)
wordcloud(words = books_genres$genres, freq = books_genres$genres_count, 
          min.freq = 1, max.words=2000, random.order=FALSE,
          colors = brewer.pal(8, "Dark2"))

4. Publishers:

Looking at the publishers, ‘Random House’ is the clear leader, accounting for around 42.8% of the books from publishers with at least 250 titles in the dataset (the group shown in the chart). Right behind is ‘Harper Collins’, and together these two cover about 67.6% of that group. This means most of these books come from just a couple of big publishers, showing their strong influence. It makes us think about how this might affect the types of books we see and read, with these big publishers having a big say in what’s available. Notably, Random House is an American book publisher and holds the distinction of being the largest general-interest paperback publisher globally.

#--------------- Problem 5:
book_publishers <- books %>% group_by(publisher) %>% 
  summarise(total_books = n()) 

book_publishers <- book_publishers %>% na.omit() %>%
  filter(total_books >= 250) %>%
  arrange(desc(total_books)) %>% 
  mutate(publisher = fct_inorder(publisher),
         cum_count = cumsum(total_books),
         rel_freq = total_books/sum(total_books),
         cum_freq = cumsum(rel_freq))
#Pareto chart
ggplot(book_publishers, aes(x = publisher, y= total_books)) +
  geom_bar(stat = "identity", fill='cyan') +
  geom_line(aes(x = publisher, y= cum_count, group=1)) +
  geom_point(aes(x = publisher, y= cum_count)) +
  labs(title = "Pareto and Ogive of Publisher Book Counts (1990 - 2020)",
       x = "Publisher",
       y = "Number of Books") +
  theme_clean() +
  easy_rotate_x_labels(angle = 45, side = c('right'))
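
The relative and cumulative shares used for the Pareto chart are computed only over the publishers that pass the `total_books >= 250` filter. The sketch below shows the same calculation on made-up counts. The publisher names and numbers are hypothetical.

```r
# Hypothetical publisher counts, already sorted in descending order
total_books <- c(PublisherA = 600, PublisherB = 300, PublisherC = 100)

# Relative frequency of each publisher and the running (cumulative) share
rel_freq <- total_books / sum(total_books)
cum_freq <- cumsum(rel_freq)
print(round(100 * cum_freq, 1))
```

Because the denominator is the filtered total, the shares describe the publishers on the chart rather than every book in the dataset.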

5. Scatter Plot of Pages Vs Rating Vs Year

Looking at the scatter plot, it suggests that there’s a weak positive connection between the number of pages in a book and its rating. This means that longer books tend to have slightly higher ratings, but the link isn’t very strong.

Additionally, the plot hints at a small trend: books from earlier years, like the 90s, seem to have more pages compared to books from more recent years, starting from around 2010.

#scatter plot of Pages vs. Rating
ggplot(books, aes(x=pages, y=rating, color = year)) + 
  geom_point() +
  labs(title = "Scatter Plot of Pages vs. Rating",
       x = "Pages",
       y = "Rating") +
  theme_tufte()

#---- Function: Get numeric columns -----
# this function selects only the numeric columns of a data frame
get_numeric_feat <- function(df){
  select(df, where(is.numeric))
}

The patterns observed in the scatter plot are consistent with the correlation table. A correlation coefficient of 0.0971 between ‘Rating’ and ‘Pages’ indicates a weak positive relationship: longer books tend to have slightly higher ratings. Likewise, the coefficient of -0.0789 between ‘Pages’ and ‘Year’ indicates a weak negative relationship, in line with the scatter plot: books with more pages are associated with earlier years.

cor(get_numeric_feat(books[c('rating','pages','year')]))
##            rating       pages        year
## rating 1.00000000  0.09714094  0.04262525
## pages  0.09714094  1.00000000 -0.07893865
## year   0.04262525 -0.07893865  1.00000000
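
To put the size of these coefficients in perspective, squaring 0.0971 gives roughly 0.0094, so page count is linearly associated with under 1% of the variation in rating. A correlation test adds a confidence interval and p-value to the point estimate. The sketch below runs one on simulated data; the simulated `pages` and `rating` vectors are assumptions and do not reproduce the report's numbers.

```r
set.seed(1)
# Simulated stand-ins for pages and rating with a weak built-in relationship
pages  <- round(runif(500, min = 50, max = 1200))
rating <- 3.9 + 0.0001 * pages + rnorm(500, sd = 0.3)

# Pearson correlation with a confidence interval and p-value
res <- cor.test(pages, rating, method = "pearson")

# r^2 is the share of rating variance linearly associated with pages
r <- unname(res$estimate)
r_squared <- r^2
print(res$conf.int)
print(r_squared)
```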

6. Total Number of Books

It appears that the number of books over the years follows a trend of peaking in 2011 and then gradually decreasing.

#Problem 8:
by_year <- books %>% group_by(year) %>%
  summarise(total_books = n(),
            avg_rating = mean(rating))
by_year
#------line plot of total number of books per year------
ggplot(by_year, aes(x = year, y=total_books)) + 
  geom_line() +
  geom_point(aes(size=avg_rating, color = avg_rating)) +
  labs(title = "Total Number of Books Rated Per Year") +
  theme_excel_new()

#observation: the number of books per year tends to
#drop after reaching a peak in 2011

7. Sample Mean

#Problem 10:
#this function calculates the mean of a population
pop_avg <- function(vect){
  avg <- sum(vect)/length(vect)
  avg
}

#----- function: variance-----
# this function calculates the population variance of a vector
pop_var <- function(vect) {
  N <- length(vect) #length of the vector
  variance <- sum((vect-pop_avg(vect))^2) / N
  
  return(variance) #return the result
}

#----- Function: standard deviation-----
#this function calculates the population standard deviation
pop_sd <- function(vect){
  deviation <- sqrt(pop_var(vect))
  return(deviation)
}
#using the functions above, calculate the average, variance, and sd
books %>% summarise(avg_rating = pop_avg(rating),
                    variance = pop_var(rating),
                    sd = pop_sd(rating))
#Problem 12
#set seed
set.seed(42)

# sample the data
sample_1 <- sample_n(books, size = 100)
sample_2 <- sample_n(books, size = 100)
sample_3 <- sample_n(books, size = 100)

#sample statistic for each sample
sample_1_stat <- c(mean(sample_1$rating),var(sample_1$rating),
                   sd(sample_1$rating))
sample_2_stat <- c(mean(sample_2$rating),var(sample_2$rating),
                   sd(sample_2$rating))
sample_3_stat <- c(mean(sample_3$rating),var(sample_3$rating),
                   sd(sample_3$rating))
#create dataframe with the sample statistic
sample_stats <- tibble("Statistic"=c('mean','variance','sd'),sample_1_stat, sample_2_stat, 
                       sample_3_stat)
sample_stats #call the data

Upon comparing the sample statistics with the population values, the sample means (3.97, 4.00, and 3.96) align closely with the population mean of 3.98. This suggests that our samples capture the central tendency of the entire population well. However, the sample variances (0.115, 0.0741, and 0.118) and standard deviations (0.339, 0.272, and 0.344) vary somewhat around the population values (variance 0.0963, standard deviation 0.310). This variability is expected, given the inherent uncertainty introduced by random sampling.
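
Two details help when reading this comparison. First, `pop_var()` divides by N while R's built-in `var()` divides by n - 1, so the two are not computed in quite the same way. Second, with a population standard deviation of 0.310 and samples of 100 books, the standard error of the mean is 0.310 / sqrt(100) = 0.031, and the three sample means differ from 3.98 by at most 0.02. The sketch below illustrates both points; the vector `x` is made-up data.

```r
# Hypothetical ratings standing in for one sample
x <- c(3.6, 3.9, 4.1, 4.3, 3.8, 4.0, 4.2, 3.7)
n <- length(x)

# Population-style variance divides by N (as pop_var() does above)
var_pop <- sum((x - mean(x))^2) / n

# var() divides by n - 1, so it is slightly larger for the same data
var_sample <- var(x)

# The two differ only by the factor n / (n - 1)
ratio <- var_sample / var_pop

# Standard error of the mean for a sample of 100 and population sd 0.310
se_100 <- 0.310 / sqrt(100)
print(c(var_pop = var_pop, var_sample = var_sample, ratio = ratio))
print(se_100)
```

For samples of 100 the n / (n - 1) factor is about 1.01, so it accounts for very little of the spread seen in the sample variances.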

8. Authors Count

#-------- Author's frequency
author_freq <- unlist(strsplit((books$author), ","))
author_freq <- as.data.frame(table(author_freq)) %>% arrange(desc(Freq))
author_freq <- head(author_freq, n=10)[-1,]
author_freq
ggplot(author_freq, aes(y = author_freq, x = Freq)) +
  geom_col(fill='steelblue') +
  labs(title = "Top 10 Authors",
       x= "Frequency",
       y = "Names of Author") +
  theme_classic()
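
Splitting on commas can leave leading spaces on names, and some author strings may carry role labels, so the same person could be counted under several spellings. The sketch below shows one way to normalise names before counting. The example strings and the parenthesised role label are assumptions about how the `author` column might look.

```r
# Hypothetical author strings; the real column may be formatted differently
authors <- c("Jane Roe, John Doe (Illustrator)", "John Doe", "Jane Roe")

# Split on commas, then strip surrounding whitespace from each name
names_split <- trimws(unlist(strsplit(authors, ",")))

# Drop a trailing role label in parentheses, e.g. " (Illustrator)"
names_clean <- sub("\\s*\\([^)]*\\)$", "", names_split)

# Frequency table sorted from most to least frequent
author_counts <- sort(table(names_clean), decreasing = TRUE)
print(author_counts)
```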

Conclusion

In conclusion, the median book in the dataset has 319 pages, but some run to more than 1000 pages. Genres such as Fiction, Fantasy, Nonfiction, Children’s, and Historical Fiction emerge as reader favorites. Books with multiple genres also tend to have more pages than single-genre books.

Additionally, an interesting trend surfaced in the data. Books published in recent years tend to garner slightly higher ratings than those from the early ’90s. Similarly, books published in recent years have fewer pages than books from the ’90s.

References

Kabacoff, Robert I. n.d. “R in Action, Third Edition - Learning.oreilly.com.” https://learning.oreilly.com/library/view/r-in-action/9781617296055/OEBPS/Text/Ch-03.htm#heading_id_25.
“Penguin Random House - Penguinrandomhouse.com.” n.d. https://www.penguinrandomhouse.com/.
“R: Easily Rotate 'x' Axis Labels - Search.r-project.org.” n.d. https://search.r-project.org/CRAN/refmans/ggeasy/html/easy_rotate_labels.html.
“Reorder Factor Levels by First Appearance, Frequency, or Numeric Order - Fct_inorder - Forcats.tidyverse.org.” n.d. https://forcats.tidyverse.org/reference/fct_inorder.html.
Zach. n.d. “How to Change Point Size in Ggplot2 (3 Examples) - Statology - Statology.org.” https://www.statology.org/ggplot-point-size/.