Key Findings

#load package
library(pacman)
p_load(tidyverse, janitor, lubridate, ggthemes, ggeasy, wordcloud, RColorBrewer)
# -------- load data ------
books_df <- read_csv("books.csv", show_col_types = FALSE)
# ------------ Data Cleaning ------------
books_df <- clean_names(books_df)
#Problem 2: convert date column to date
books_df$first_publish_date <- mdy(books_df$first_publish_date)

## Warning: 1186 failed to parse.

books_df <- books_df %>% mutate(year = year(first_publish_date))
#Problem 4: filter year from 1990-2020 inclusive
books_df <- books_df %>% filter(year >= 1990 & year <= 2020)
#Problem 5: drop unwanted columns
books_col_rm <- select(books_df, -c(publish_date, edition, characters, price, genres, setting, isbn))
#Problem 6: filter for pages below 1200
books <- books_col_rm %>% filter(pages < 1200)

1. Distribution of Rating:

The histogram illustrates the distribution of book ratings on Goodreads. The majority of books fall within the 3.79 to 4.15 rating range, indicating a generally positive sentiment among users.

#histogram of book ratings
ggplot(books, aes(x=rating)) + 
  geom_histogram(binwidth = .25, fill='red') +
  labs(title = 'Histogram of Book Ratings',
       x = 'Rating',
       y = 'Number of Books') +
  theme_bw()
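
One way to check what share of books falls inside the quoted range is to take the mean of a logical condition. The sketch below uses a small made-up vector (`toy_ratings` is hypothetical, not the Goodreads data); with the real data, `books$rating` would take its place.

```r
# Hypothetical ratings, for illustration only (not taken from books.csv)
toy_ratings <- c(3.50, 3.80, 3.95, 4.00, 4.10, 4.15, 4.30, 4.60)

# Proportion of ratings that fall inside the 3.79 to 4.15 range
share_in_range <- mean(toy_ratings >= 3.79 & toy_ratings <= 4.15)
share_in_range

# First and third quartiles, for comparison with the quoted range
quantile(toy_ratings, probs = c(0.25, 0.75))
```

If the quoted bounds turn out to match the first and third quartiles, the range holds the middle half of the books rather than a clear majority.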

2. Pages Count:

The median number of pages per book is 319, meaning that half of the books in our dataset have 319 pages or fewer. The boxplot also shows that some books run beyond 700 pages, with a small subset exceeding 1000 pages.

#boxplot of number of pages
ggplot(books, aes(x = pages)) + geom_boxplot(fill= "magenta") +
  labs(title = "Box Plot of Page Counts",
       x = "Pages") +
  theme_economist()
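
The median and the points drawn beyond the whiskers can also be computed directly. This is a minimal sketch on made-up page counts (`toy_pages` is hypothetical, not the Goodreads data), using base R's `boxplot.stats()`, which returns the values lying beyond the whiskers in its `out` element.

```r
# Made-up page counts, for illustration only
toy_pages <- c(120, 250, 300, 319, 340, 400, 450, 1050)

# Middle value of the distribution
median(toy_pages)

# Values beyond the whiskers (default: 1.5 times the box length past the hinges)
long_books <- boxplot.stats(toy_pages)$out
long_books
```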

Genres (like mystery, romance, or science fiction) usually come with certain expectations for how long a book should be. We dug a bit deeper and noticed something cool: books that fall into multiple genres tend to have more pages. So a book that is in both the mystery and romance categories might run longer than a book in just one of them. It’s like genres teaming up to create longer stories!

# group by genres, then count the books and average the pages per group
books_genres <- books_df %>% filter(pages <1200) %>% 
  group_by(genres) %>% 
  summarise(genres_count = n(),
            avg_page = mean(pages)) %>% 
  arrange(desc(genres_count))
#sort by descending order of avg_page
books_genres %>% arrange(desc(avg_page))
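
Note that `group_by(genres)` treats each distinct genre list as one group, so the table compares genre combinations rather than the number of genres per book. Here is a minimal sketch of a more direct check, assuming `genres` is stored as a comma-separated string (the toy data frame and the `n_genres` column are hypothetical, and the real column format may differ):

```r
# Toy data for illustration only; not taken from books.csv
toy_books <- data.frame(
  genres = c("Mystery", "Mystery, Romance", "Fantasy, Romance, Fiction", "Poetry"),
  pages  = c(250, 400, 520, 120),
  stringsAsFactors = FALSE
)

# Count the comma-separated entries in each genre string
toy_books$n_genres <- lengths(strsplit(toy_books$genres, ",\\s*"))

# Average page count for each number of genres
aggregate(pages ~ n_genres, data = toy_books, FUN = mean)

# Correlation between number of genres and page count
genre_page_cor <- cor(toy_books$n_genres, toy_books$pages)
genre_page_cor
```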

3. Genres:

Looking more closely at the individual genres, ‘Fiction’ emerges as the most prevalent with 59 books, which points to its popularity among readers. ‘Fantasy’ follows with 42 books. ‘Poetry’ and ‘Nonfiction’ contribute 34 and 30 books, and ‘Historical Fiction’ contributes 22.

# drop the first row (the most frequent genre group, since the table is sorted by genres_count)
books_genres <- books_genres[-1,]
books_genres

options(warn=-1)
books_genres <- head(books_genres, n=5000)
wordcloud(words = books_genres$genres, freq = books_genres$genres_count, 
          min.freq = 1, max.words=2000, random.order=FALSE,
          colors = brewer.pal(8, "Dark2"))

4. Publishers:

Looking at the publishers with at least 250 books in the dataset, ‘Random House’ is the big leader, making up around 42.8% of the books in that group. Right behind them is ‘Harper Collins,’ and together these two publishers cover about 67.6%. Because these shares are calculated only among publishers that pass the 250-book cutoff, they are larger than each publisher’s share of the full dataset, but they still show how much output is concentrated in a couple of big publishers. It makes us think about how this might affect the types of books we see and read, with these big publishers having a big say in what’s available. Notably, Random House is an American book publisher and holds the distinction of being the largest general-interest paperback publisher globally.

#--------------- Problem 5:
book_publishers <- books %>% group_by(publisher) %>% 
  summarise(total_books = n()) 

book_publishers <- book_publishers %>% na.omit() %>%
  filter(total_books >= 250) %>%
  arrange(desc(total_books)) %>% 
  mutate(publisher = fct_inorder(publisher),
         cum_count = cumsum(total_books),
         rel_freq = total_books/sum(total_books),
         cum_freq = cumsum(rel_freq))

#Pareto chart
ggplot(book_publishers, aes(x = publisher, y= total_books)) +
  geom_bar(stat = "identity", fill='cyan') +
  geom_line(aes(x = publisher, y= cum_count, group=1)) +
  geom_point(aes(x = publisher, y= cum_count)) +
  labs(title = "Pareto and Ogive of Publisher Book Counts (1990 - 2020)",
       x = "Publisher",
       y = "Number of Books") +
  theme_clean() +
  easy_rotate_x_labels(angle = 45, side = c('right'))
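
Because `rel_freq` is computed after the `filter(total_books >= 250)` step, its denominator is the number of books from the retained publishers, not every book in the dataset. A small sketch with invented counts (all names and numbers below are hypothetical) shows how the two shares can differ:

```r
# Invented publisher counts, for illustration only
toy_counts <- c(A = 600, B = 400, C = 100, D = 50)

# Share of each publisher among ALL books
share_all <- toy_counts / sum(toy_counts)

# Share among publishers that pass a minimum-count filter (here, 250 books)
kept <- toy_counts[toy_counts >= 250]
share_kept <- kept / sum(kept)

round(share_all["A"], 3)   # share of the full toy dataset
round(share_kept["A"], 3)  # share of the filtered subset
```

To report shares of the whole dataset, the relative frequency would need to be computed before the filter is applied.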

5. Scatter Plot of Pages Vs Rating Vs Year

The scatter plot suggests a weak positive relationship between the number of pages in a book and its rating: longer books tend to have slightly higher ratings, but the link isn’t very strong.

Additionally, the plot hints at a small trend: books from earlier years, like the 90s, seem to have more pages compared to books from more recent years, starting from around 2010.

#scatter plot of Pages vs. Rating
ggplot(books, aes(x=pages, y=rating, color = year)) + 
  geom_point() +
  labs(title = "Scatter Plot of Pages vs. Rating",
       x = "Pages",
       y = "Rating") +
  theme_tufte()

#---- Function: Get numeric columns -----
#this function selects only the numeric columns of a dataframe
get_numeric_feat <- function(df){
  num_col <- c()
  # seq_along() avoids iterating over 1:0 when df has no columns
  for (i in seq_along(df)){
    if (is.numeric(df[[i]])){
      num_col <- append(num_col, i)
    }
  }
  select(df, all_of(num_col))
}

The correlation table supports the patterns seen in the scatter plot. The correlation coefficient of 0.0971 between ‘Rating’ and ‘Pages’ indicates a very weak positive relationship: longer books tend to have slightly higher ratings, but page count accounts for less than 1% of the variation in rating (0.0971 squared is about 0.0094). Likewise, the coefficient of -0.0789 between ‘Pages’ and ‘Year’ indicates a very weak negative relationship, meaning that books with more pages are slightly more likely to come from earlier years.

cor(get_numeric_feat(books[c('rating','pages','year')]))

##            rating       pages        year
## rating 1.00000000  0.09714094  0.04262525
## pages  0.09714094  1.00000000 -0.07893865
## year   0.04262525 -0.07893865  1.00000000
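
Coefficients this close to zero are worth testing formally. The sketch below runs base R's `cor.test()` on simulated data (the vectors `x` and `y` are hypothetical); on the real data the equivalent call would be `cor.test(books$pages, books$rating)`.

```r
set.seed(1)
# Simulated data with a weak positive relationship (illustration only)
x <- rnorm(500)
y <- 0.1 * x + rnorm(500)

# Pearson correlation with a confidence interval and p-value
result <- cor.test(x, y, method = "pearson")
result$estimate   # sample correlation coefficient
result$conf.int   # 95% confidence interval for the correlation
result$p.value    # p-value for the null hypothesis of zero correlation
```

With a large number of books, even a coefficient near 0.1 can be statistically significant while remaining small in practical terms.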

6. Total Number of Books

The number of books per year peaks in 2011 and then gradually declines.

#Problem 8:
by_year <- books %>% group_by(year) %>%
  summarise(total_books = n(),
            avg_rating = mean(rating))
by_year

#------line plot of total number of books per year------
ggplot(by_year, aes(x = year, y=total_books)) + 
  geom_line() +
  geom_point(aes(size=avg_rating, color = avg_rating)) +
  labs(title = "Total Number of Books Rated Per Year") +
  theme_excel_new()

#observation: the number of books per year tends to
#drop after reaching a peak in 2011
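
The peak year can also be read off programmatically rather than from the plot. This is a minimal sketch on invented yearly counts (`toy_by_year` is hypothetical; with the real data, `by_year` would take its place):

```r
# Invented yearly counts, for illustration only
toy_by_year <- data.frame(
  year        = 2008:2013,
  total_books = c(120, 150, 180, 210, 190, 160)
)

# Row with the largest number of books
peak <- toy_by_year[which.max(toy_by_year$total_books), ]
peak$year
```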

7. Sample Mean

#Problem 10:
#this function calculates the mean of a population
pop_avg <- function(vect){
  avg <- sum(vect)/length(vect)
  avg
}

#----- function: variance-----
# this function calculates the population variance of a vector
pop_var <- function(vect) {
  N <- length(vect) #length of the vector
  variance <- sum((vect-pop_avg(vect))^2) / N
  
  return(variance) #return the result
}

#----- Function: standard deviation-----
#this function calculates the population standard deviation
pop_sd <- function(vect){
  deviation <- sqrt(pop_var(vect))
  return(deviation)
}

#using the function above calculate the average, variance, sd
books %>% summarise(avg_rating = pop_avg(rating),
                    variance = pop_var(rating),
                    sd = pop_sd(rating))

#Problem 12
#set seed
set.seed(42)

# sample the data
sample_1 <- sample_n(books, size = 100)
sample_2 <- sample_n(books, size = 100)
sample_3 <- sample_n(books, size = 100)

#sample statistic for each sample
sample_1_stat <- c(mean(sample_1$rating),var(sample_1$rating),
                   sd(sample_1$rating))
sample_2_stat <- c(mean(sample_2$rating),var(sample_2$rating),
                   sd(sample_2$rating))
sample_3_stat <- c(mean(sample_3$rating),var(sample_3$rating),
                   sd(sample_3$rating))
#create dataframe with the sample statistic
sample_stats <- tibble("Statistic"=c('mean','variance','sd'),sample_1_stat, sample_2_stat, 
                       sample_3_stat)
sample_stats #call the data

Comparing the sample statistics with the population values, the sample means (3.97, 4.00, and 3.96) are all close to the population mean of 3.98, which suggests that samples of 100 books capture the central tendency of the population well. The sample variances (0.115, 0.0741, and 0.118) and standard deviations (0.339, 0.272, and 0.344) vary more around the population variance of 0.0963 and standard deviation of 0.310. This variability is expected, given the uncertainty introduced by random sampling.
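
The sample figures use base R's `var()` and `sd()`, which divide by n - 1, while `pop_var()` divides by N. With n = 100 the factor (n - 1)/n is 0.99, so this accounts for only a small part of the gap. A short sketch on a made-up vector (`v` is hypothetical) shows the conversion between the two:

```r
# Made-up values, for illustration only
v <- c(3.6, 3.9, 4.0, 4.1, 4.4)
n <- length(v)

# Population variance: divide the sum of squared deviations by n
pop_variance <- sum((v - mean(v))^2) / n

# Sample variance from base R: divides by n - 1
sample_variance <- var(v)

# The two differ by a factor of (n - 1) / n
all.equal(pop_variance, sample_variance * (n - 1) / n)
```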

8. Authors Count

#-------- Author's frequency
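# split multi-author entries on commas, dropping the space after each comma
author_freq <- unlist(strsplit(books$author, ",\\s*"))
author_freq <- as.data.frame(table(author_freq)) %>% arrange(desc(Freq))
# keep the ten most frequent names, then drop the first row
author_freq <- head(author_freq, n=10)[-1,]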
author_freq

ggplot(author_freq, aes(y = author_freq, x = Freq)) +
  geom_col(fill='steelblue') +
  labs(title = "Top 10 Authors",
       x= "Frequency",
       y = "Names of Author") +
  theme_classic()

Beyond the Pages: An In-Depth Exploration of the Goodreads Books Dataset

Northeastern University

ALY6000: Introduction to Analytics

Prof. John Wilder

Sheila Kwartemaa Boateng

2024-01-30

Introduction

Overview