Final Project - Rafiza Rahman

Goodreads Analysis

Source: https://www.fontinlogo.com/logo/goodreads

Source: https://www.pinterest.com/pin/102738435234090056/

Introduction

The data set that I decided to work with was scraped using the Goodreads API. It is a compilation of various authors, their works, the ratings of those works, and the publishing companies they are under. I chose to work with this data set because I have always had an interest in highly rated fantasy novels, and I was intrigued to learn whether there is a hidden formula for producing a book with high ratings. To proceed, I decided to focus mainly on the following variables: average ratings, authors, publishers, and the ratings count. While there are other variables that I did in fact delve into, these are the main ones that will be highlighted in my analysis.

Load Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(ggplot2)
library(dplyr)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
library(leaflet)
library(readr)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(RColorBrewer)

Load Dataset

suppressWarnings({
setwd("C:/Users/rafiz/Downloads")
data <- read_csv("books.csv")})
Rows: 11127 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): title, authors, isbn, isbn13, language_code, publication_date, publ...
dbl (5): bookID, average_rating, num_pages, ratings_count, text_reviews_count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the Data

#check for any NA's
sum(is.na(data))
[1] 8
colSums(is.na(data))
            bookID              title            authors     average_rating 
                 0                  0                  0                  4 
              isbn             isbn13      language_code          num_pages 
                 0                  0                  0                  4 
     ratings_count text_reviews_count   publication_date          publisher 
                 0                  0                  0                  0 
#clear NA's
data_clean <- data |>
  filter(!is.na(`average_rating`)) |>
  filter(!is.na(`num_pages`)) |>
  filter(!is.na(`authors`))

#recheck
colSums(is.na(data_clean))
            bookID              title            authors     average_rating 
                 0                  0                  0                  0 
              isbn             isbn13      language_code          num_pages 
                 0                  0                  0                  0 
     ratings_count text_reviews_count   publication_date          publisher 
                 0                  0                  0                  0 
#convert american and british english to just "eng"
data_clean$language_code <- ifelse(grepl("en-US", data_clean$language_code), "eng", data_clean$language_code)

data_clean$language_code <- ifelse(grepl("en-GB", data_clean$language_code), "eng", data_clean$language_code)

data_clean2 <- data_clean[data_clean$language_code == "eng", ]
#find the unique values for publishers in order to manually check how many there are and how many occur
options(max.print = 2500)

publisher_counts <- table(data_clean2$publisher)

# get value and sort alphabetically
unique_sorted_values <- sort(unique(data_clean2$publisher))

# print w counts
#for (publisher_name in unique_sorted_values) {
  #cat(publisher_name, ": ", publisher_counts[publisher_name], "\n")}
#consolidate the names of the publishers so that the multiple variations fall into one
data_clean2$publisher <- ifelse(grepl("Harper", data_clean2$publisher), "Harper", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Penguin", data_clean2$publisher), "Penguin", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Bantam", data_clean2$publisher), "Bantam", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Vintage", data_clean2$publisher), "Vintage", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("VIZ", data_clean2$publisher), "VIZ Media", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Dover", data_clean2$publisher), "Dover Publications", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Mariner", data_clean2$publisher), "Mariner Books", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Pocket", data_clean2$publisher), "Pocket Books", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Ballantine", data_clean2$publisher), "Ballantine Books", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Berkley", data_clean2$publisher), "Berkley", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Bloomsbury", data_clean2$publisher), "Bloomsbury", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Scholastic", data_clean2$publisher), "Scholastic", data_clean2$publisher)

data_clean2$publisher <- ifelse(grepl("Simon", data_clean2$publisher), "Simon & Schuster", data_clean2$publisher)

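#note: "." is a regex wildcard, so "W." matches any publisher name containing a capital "W" followed by another character, not only W. W. Norton; grepl("W. W. Norton", ..., fixed = TRUE) would match the literal name
data_clean2$publisher <- ifelse(grepl("W.", data_clean2$publisher), "W. W. Norton Company", data_clean2$publisher)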

data_clean2$publisher <- ifelse(grepl("Random House", data_clean2$publisher), "Random House", data_clean2$publisher)
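
The patterns passed to grepl() above are treated as regular expressions by default, which matters for any pattern containing a "." character. The sketch below uses a handful of hypothetical publisher names (not taken from the data set) to show how a regex match and a literal match with fixed = TRUE can differ.

```r
# Hypothetical publisher names, used only to illustrate the matching behaviour
publishers <- c("W. W. Norton & Company", "Warner Books", "William Morrow", "Tor Books")

# Default: the pattern is a regular expression, and "." matches any character
regex_match <- grepl("W.", publishers)

# fixed = TRUE treats the pattern as a literal string
literal_match <- grepl("W. W. Norton", publishers, fixed = TRUE)

print(data.frame(publishers, regex_match, literal_match))
```

Under these assumptions, the regex version flags every name that contains a capital "W" followed by another character, while the literal version flags only the Norton entry.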
#find the top 10 publishers with the most amount of books
value_counts <- table(data_clean2$publisher)

sorted_values <- sort(value_counts, decreasing = TRUE)

top_ten <- names(sorted_values)[1:min(10, length(sorted_values))]
top_ten
 [1] "W. W. Norton Company" "Penguin"              "Harper"              
 [4] "Vintage"              "Bantam"               "Random House"        
 [7] "Simon & Schuster"     "Berkley"              "Ballantine Books"    
[10] "Mariner Books"       
#create a new data set with only the top 10 publishers
ten_pub <- c("W. W. Norton Company", "Penguin", "Harper", "Vintage", "Random House", "Bantam", "Simon & Schuster", "Berkley", "Ballantine Books", "Mariner Books")
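#note: == recycles ten_pub along the rows, so a row is kept only when its publisher equals the ten_pub entry at that position in the cycle; data_clean2$publisher %in% ten_pub would keep every book from these ten publishers
data_clean3 <- data_clean2[data_clean2$publisher == ten_pub, ]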

#getting rid of an unnecessary line that would create an outlier in my visualizations
data_clean3 <- data_clean3[-294, ]
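
When a column is subset against a vector of several values, == and %in% behave differently. The following is a minimal sketch on hypothetical data (the books data frame and keep vector are made up for illustration) comparing the two.

```r
# Hypothetical data, used only to illustrate the two operators
books <- data.frame(
  title = paste("Book", 1:6),
  publisher = c("Penguin", "Harper", "Penguin", "Tor", "Harper", "Penguin")
)
keep <- c("Penguin", "Harper")

# `==` recycles `keep` along the column: row 1 is compared with "Penguin",
# row 2 with "Harper", row 3 with "Penguin", and so on, so whether a row
# matches depends on its position
by_equals <- books[books$publisher == keep, ]

# `%in%` tests each row against the whole set of names
by_in <- books[books$publisher %in% keep, ]

nrow(by_equals)
nrow(by_in)
```

In this toy example the == version keeps 3 rows and the %in% version keeps 5, even though five of the six books belong to the two publishers of interest.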

Statistical Analysis

#find the p-value and r-squared values
fit1 <- lm(data = data_clean3, average_rating ~ num_pages + ratings_count + text_reviews_count)
summary(fit1)

Call:
lm(formula = average_rating ~ num_pages + ratings_count + text_reviews_count, 
    data = data_clean3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01822 -0.17512  0.01558  0.16740  0.58126 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         3.847e+00  2.750e-02 139.894   <2e-16 ***
num_pages           1.706e-04  6.967e-05   2.448   0.0149 *  
ratings_count       6.587e-07  6.578e-07   1.001   0.3174    
text_reviews_count -1.708e-05  1.969e-05  -0.867   0.3863    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2576 on 336 degrees of freedom
Multiple R-squared:  0.02333,   Adjusted R-squared:  0.01461 
F-statistic: 2.675 on 3 and 336 DF,  p-value: 0.04722

As you can see from this regression summary, the p-value for the overall F-test is 0.047.

fit2 <- lm(data = data_clean3, average_rating ~ num_pages)
summary(fit2)

Call:
lm(formula = average_rating ~ num_pages, data = data_clean3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.02043 -0.17556  0.01796  0.16590  0.58312 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.844e+00  2.716e-02 141.517  < 2e-16 ***
num_pages   1.797e-04  6.913e-05   2.599  0.00975 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2574 on 338 degrees of freedom
Multiple R-squared:  0.0196,    Adjusted R-squared:  0.0167 
F-statistic: 6.757 on 1 and 338 DF,  p-value: 0.009749
fit3 <- lm(data = data_clean3, average_rating ~ ratings_count)
summary(fit3)

Call:
lm(formula = average_rating ~ ratings_count, data = data_clean3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01232 -0.16551  0.01275  0.16741  0.60772 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.902e+00  1.438e-02 271.450   <2e-16 ***
ratings_count 1.262e-07  1.393e-07   0.906    0.366    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2596 on 338 degrees of freedom
Multiple R-squared:  0.002421,  Adjusted R-squared:  -0.0005304 
F-statistic: 0.8203 on 1 and 338 DF,  p-value: 0.3657

As we can see from the summaries above, the number of pages a novel has does appear to be associated with the ratings it is given. According to the p-value of the second fit (0.00975), the relationship is statistically significant at the 0.05 level, and therefore we can reject the null hypothesis of no effect. In contrast, the ratings count on its own is not significant (p = 0.366). The R-squared value remains extremely small throughout, at about 0.02; it rises only slightly, from 0.0196 to 0.0233, when the model uses multiple variables rather than just the page count, while the adjusted R-squared falls from 0.0167 to 0.0146.
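
As a rough illustration of where the overall p-value and the two R-squared figures in an lm() summary come from, the sketch below recomputes them by hand. It uses the built-in mtcars data as a stand-in for the book data, and overall_p is a hypothetical helper name, not a function from any package.

```r
# mtcars ships with base R and stands in for the book data here
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + qsec + drat, data = mtcars)

# Overall F-test p-value, recomputed from the F statistic stored in the summary
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

overall_p(fit_small)

# With a single predictor, the overall p-value equals the slope's t-test p-value
summary(fit_small)$coefficients["wt", "Pr(>|t|)"]

# R-squared never decreases when predictors are added; adjusted R-squared can
c(r2 = summary(fit_small)$r.squared, adj = summary(fit_small)$adj.r.squared)
c(r2 = summary(fit_large)$r.squared, adj = summary(fit_large)$adj.r.squared)
```

This is why comparing models with different numbers of predictors is usually done on the adjusted R-squared rather than the raw one.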

par(mfrow = c(2,2))
plot(fit1)

The equation for my model is average_rating = 3.847 + 1.706e-04(num_pages) + 6.587e-07(ratings_count) - 1.708e-05(text_reviews_count)
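
To make the fitted equation concrete, here is a small worked example that plugs in the coefficients as printed in the summary(fit1) output. The book itself (300 pages, 10,000 ratings, 500 text reviews) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(fit1) output
coefs <- c(intercept = 3.847, num_pages = 1.706e-04,
           ratings_count = 6.587e-07, text_reviews_count = -1.708e-05)

# Hypothetical book, chosen only for illustration
book <- c(num_pages = 300, ratings_count = 10000, text_reviews_count = 500)

# Intercept plus the sum of each coefficient times its predictor value
predicted <- unname(coefs["intercept"] + sum(coefs[names(book)] * book))
round(predicted, 3)  # 3.896
```

The prediction sits very close to the intercept, which is consistent with how small the slopes and the R-squared are.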

Visualizations

#basic histogram
total_books <- length(data_clean2$title)
mean(data_clean2$average_rating)
[1] 3.931251
hist(data_clean2$average_rating)

I wanted to begin by making a simple histogram of the average ratings of all English-language books across all publishers. I chose to do this so that I would have an idea of where the majority of my data lies.

Plot 1

custom_colors <- c("#EA6D59", "#EAA659", "#EAE859", "#53DA57", "#33C7FF", "#B153DA", "#DA53CA", "#5355DA", "#E73533", "#005E0D")


#creating plot
plot1 <- ggplot(data_clean3, aes(x = average_rating, fill = publisher)) +
  geom_histogram(binwidth = 0.3, position = "dodge") +
  labs(caption = "Source: Goodreads", title = "<b> Histogram of Book Ratings by Publisher",
       x = "<b> Rating",
       y = "<b> Published Books") +
  scale_fill_manual(values = custom_colors) +
  theme_minimal() +
  theme(legend.position = "top")

# ggplot to plotly
ggplotly(plot1)

In this visualization, I specifically took the ten publishers with the largest number of books and compared their average ratings. I chose to do this to find out whether or not an author would have an inherent advantage by choosing to publish with a certain company. I also chose to reduce the bin width so that the viewer would be able to notice not only the company with the most high ratings, but also the disparity between the high and low averages.

Plot 2

# allow for color variation
num_colors <- 7
colors <- brewer.pal(num_colors, "Spectral")

# making plot
plot2 <- ggplot(data_clean3, aes(x = num_pages, y = average_rating, size = ratings_count)) +
  geom_point(aes(color = ratings_count)) +
  geom_smooth(method = "lm", se = FALSE, aes(text = NULL)) +
  labs(caption = "Source: Goodreads",
       title = '<b> Bubble Chart of Books',
       x = '<b> Number of Pages',
       y = '<b> Average Rating') +
  theme_minimal() +
  scale_color_gradientn(colors = colors)

# ggplot to plotly
ggplotly(plot2)
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
ggplotly(plot2) |>
  style(text = paste("Title: ", data_clean3$title, "<br>Author: ", data_clean3$authors, "<br> Rating: ", data_clean3$average_rating, "<br> Ratings Count: ", data_clean3$ratings_count))
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

In this visualization, I wanted to find out whether the length of a book had anything to do with the ratings it received. I also used the size and color of the bubbles to indicate how many ratings each book received, as I think there is an important distinction between a book that has high ratings from a small sample and one that has high ratings from a bigger crowd. Because many of the bubbles were similar in size, the different colors were an important way to tell the titles apart. Although I increased the number of colors represented in the legend to 7, the fact that the colors are still mainly red indicates that a majority of the novels have around the same number of ratings, save for the select few that have a major jump.

Plot 3

# aggregate by company
publisher_stats <- data_clean3 |>
  group_by(publisher) |>
  summarise(books_published = n(),
            avg_rating = mean(average_rating))

# sort by books published
publisher_stats <- publisher_stats |>
  arrange(desc(books_published))

# heatmap ???
heatmap <- plot_ly(
  data = publisher_stats,
  x = ~publisher,
  y = ~avg_rating,
  z = ~books_published,
  type = "heatmap",
  colorscale = "Viridis",
  colorbar = list(title = "<b> Number of Books"),
  text = ~paste("Publisher: ", publisher, "<br>",
                "Avg Rating: ", round(avg_rating, 2), "<br>",
                "Books Published: ", books_published),
  hoverinfo = "text"
  
) |>
  layout(
    title = "<b> Publishing Companies Heatmap",
    xaxis = list(title = "<b> Publishing Company", showgrid = FALSE),
    yaxis = list(title = "<b> Average Rating", showgrid = FALSE),
    caption = "Source: Goodreads",
    plot_bgcolor = "lightgray"
  )

heatmap
Warning: 'layout' objects don't have these attributes: 'caption'
Valid attributes include:
'_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'

I found this to be my most interesting visualization simply because it literally did not work as I wanted it to. I initially wanted to create a heat map that showed publishing companies and their popularity. However, even though this visualization did not go as planned, I found it to still be informative to a viewer. The axes clearly highlight what I am comparing, and the length of each box gives something akin to box plots. I also think that the inclusion of the different colors for the number of books published is helpful in indicating the accomplishments of each company. While completing these visualizations, I found it important to work as if my audience were an aspiring author who wanted their work to achieve the best results when published, which is why I focused on the variables that matter most to an author.
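
The layout() warning printed for the heatmap arises because plotly layouts have no caption attribute. One possible way to attach a source line is an annotation positioned in paper coordinates. The sketch below assumes the plotly package is installed and uses a hypothetical summary table (demo_stats) and a simple bar chart in place of publisher_stats and the heatmap.

```r
library(plotly)

# Hypothetical summary table standing in for publisher_stats
demo_stats <- data.frame(
  publisher = c("Publisher A", "Publisher B", "Publisher C"),
  avg_rating = c(3.9, 4.0, 3.8)
)

p <- plot_ly(demo_stats, x = ~publisher, y = ~avg_rating, type = "bar") |>
  layout(
    title = "<b> Average Rating by Publisher",
    margin = list(b = 80),  # leave room below the axis for the source line
    annotations = list(
      list(
        text = "Source: Goodreads",
        xref = "paper", yref = "paper",  # position relative to the plot area
        x = 1, y = -0.25,
        xanchor = "right", yanchor = "top",
        showarrow = FALSE
      )
    )
  )

p
```

The x and y offsets are guesses that may need adjusting depending on the axis labels and figure size.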

Conclusion

For background research on my topic, I wanted to understand the relationship between authors and publishing companies. Going into this, I initially knew very little about how publishing with big companies worked in comparison to traditional self-publishing, and I wanted to investigate whether or not the particular company had a significant impact. After some research, it was clear to me that the decision to work with a company was a large financial burden, especially on authors who are just starting out. I believe that my analysis of ratings, which correlate with popularity, would give authors a chance to compare and see whether a guaranteed hit with their novel is worth shelling out a couple of thousand extra dollars.

Looking back on my analysis, the things I wish I could have changed or spent more time on mostly circle back to my data set. I would have wanted to go further in depth with works in other languages, since cutting down to the top publishing companies and focusing on books written in English dropped my observations from over 11,000 to around 300. This data set is massive and has a lot of potential for other types of analysis, which I wish I had the time to look into. If possible, I would have liked to correlate the publishing companies with the locations in which they are based and see if that also had any sort of impact.

Bibliography

Lucid Books

Books and Partnerships

Partner Publishing