The data set I decided to work with was scraped using the Goodreads API. It is a compilation of authors, their works, the ratings of those works, and the publishing companies they are under. I chose this data set because I have always had an interest in highly rated fantasy novels, and I was intrigued to learn whether there is a hidden formula for producing a book with high ratings. To proceed, I decided to focus mainly on the following variables: average rating, authors, publishers, and ratings count. While I did delve into other variables, these are the main ones highlighted in my analysis.
Load Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
library(leaflet)
library(readr)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 11127 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): title, authors, isbn, isbn13, language_code, publication_date, publ...
dbl (5): bookID, average_rating, num_pages, ratings_count, text_reviews_count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# convert American and British English to just "eng"
data_clean$language_code <- ifelse(grepl("en-US", data_clean$language_code), "eng", data_clean$language_code)
data_clean$language_code <- ifelse(grepl("en-GB", data_clean$language_code), "eng", data_clean$language_code)
# keep only the English-language books
data_clean2 <- data_clean[data_clean$language_code == "eng", ]
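The recoding step above can also be written as a single membership test. The following is a minimal sketch on a made-up vector of language codes (`codes` is hypothetical and not taken from the data set); it assumes the same two variants, "en-US" and "en-GB", are the only ones being folded into "eng".

```r
# Hypothetical language codes, for illustration only
codes <- c("eng", "en-US", "en-GB", "fre", "spa", "en-CA")

# Fold the two English variants into "eng" in one step
codes[codes %in% c("en-US", "en-GB")] <- "eng"

# Keep only the entries now labelled "eng"
english_only <- codes[codes == "eng"]

print(codes)
print(length(english_only))
```

In this sketch, any other variant such as "en-CA" is left untouched and would be filtered out along with the non-English codes, which is also how the two-pattern recoding above behaves.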
# find the unique values for publishers in order to manually check how many there are and how many occur
options(max.print = 2500)
publisher_counts <- table(data_clean2$publisher)
# get value and sort alphabetically
unique_sorted_values <- sort(unique(data_clean2$publisher))
# print w counts
# for (publisher_name in unique_sorted_values) {
#   cat(publisher_name, ": ", publisher_counts[publisher_name], "\n")
# }
# consolidate the names of the publishers so that the multiple variations fall into one
data_clean2$publisher <- ifelse(grepl("Harper", data_clean2$publisher), "Harper", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Penguin", data_clean2$publisher), "Penguin", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Bantam", data_clean2$publisher), "Bantam", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Vintage", data_clean2$publisher), "Vintage", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("VIZ", data_clean2$publisher), "VIZ Media", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Dover", data_clean2$publisher), "Dover Publications", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Mariner", data_clean2$publisher), "Mariner Books", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Pocket", data_clean2$publisher), "Pocket Books", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Ballantine", data_clean2$publisher), "Ballantine Books", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Berkley", data_clean2$publisher), "Berkley", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Bloomsbury", data_clean2$publisher), "Bloomsbury", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Scholastic", data_clean2$publisher), "Scholastic", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Simon", data_clean2$publisher), "Simon & Schuster", data_clean2$publisher)
# note: "W." is a regular expression here, so "." matches any character and
# every publisher name containing a capital W followed by another character is relabelled
data_clean2$publisher <- ifelse(grepl("W.", data_clean2$publisher), "W. W. Norton Company", data_clean2$publisher)
data_clean2$publisher <- ifelse(grepl("Random House", data_clean2$publisher), "Random House", data_clean2$publisher)
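The `"W."` pattern in the consolidation step deserves a closer look, because `grepl()` treats its pattern as a regular expression by default and `.` matches any character. The following is a minimal sketch on made-up publisher names (the vector `pubs` is hypothetical, not taken from the data set) comparing the default behaviour with a literal match using `fixed = TRUE`.

```r
# Hypothetical publisher names, for illustration only
pubs <- c("W. W. Norton & Company", "William Morrow", "Warner Books", "Tor Books")

# Default: "W." is a regex, so "." matches any character after the W
regex_hits <- grepl("W.", pubs)

# fixed = TRUE matches the literal two characters "W."
literal_hits <- grepl("W.", pubs, fixed = TRUE)

print(data.frame(pubs, regex_hits, literal_hits))
```

With the default regex matching, three of the four names are flagged in this sketch, while the literal match flags only the name that actually contains "W.". Whether the same effect changes the publisher counts in the Goodreads data would need to be checked against the real publisher list.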
# find the top 10 publishers with the most books
value_counts <- table(data_clean2$publisher)
sorted_values <- sort(value_counts, decreasing = TRUE)
top_ten <- names(sorted_values)[1:min(10, length(sorted_values))]
top_ten
# create a new data set with only the top 10 publishers
ten_pub <- c("W. W. Norton Company", "Penguin", "Harper", "Vintage", "Random House",
             "Bantam", "Simon & Schuster", "Berkley", "Ballantine Books", "Mariner Books")
# note: `==` compares each row with the recycled element of ten_pub at the same position;
# `%in%` is the usual membership test and would keep every row from these publishers
data_clean3 <- data_clean2[data_clean2$publisher == ten_pub, ]
# getting rid of an unnecessary line that would create an outlier in my visualizations
data_clean3 <- data_clean3[-294, ]
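The difference between `==` and `%in%` when filtering against a vector of names is easy to miss, so here is a minimal sketch on made-up data (the `books` data frame and `keep` vector are hypothetical). It is meant only to illustrate R's recycling rule, not to reproduce the filtering result on the Goodreads data.

```r
# Hypothetical data, for illustration only
books <- data.frame(
  publisher = c("Penguin", "Harper", "Penguin", "Tor", "Harper", "Penguin")
)
keep <- c("Penguin", "Harper")

# `==` recycles `keep` along the column:
# row 1 vs "Penguin", row 2 vs "Harper", row 3 vs "Penguin", row 4 vs "Harper", ...
eq_rows <- books$publisher == keep

# `%in%` asks, for each row, whether its publisher appears anywhere in `keep`
in_rows <- books$publisher %in% keep

print(sum(eq_rows))
print(sum(in_rows))
```

In this sketch the `==` comparison keeps fewer rows than `%in%`, because a row only matches when its publisher happens to line up with the recycled position in `keep`.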
Statistical Analysis
# find the p-value and r-squared values
fit1 <- lm(average_rating ~ num_pages + ratings_count + text_reviews_count, data = data_clean3)
summary(fit1)
Call:
lm(formula = average_rating ~ num_pages + ratings_count + text_reviews_count,
data = data_clean3)
Residuals:
Min 1Q Median 3Q Max
-1.01822 -0.17512 0.01558 0.16740 0.58126
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.847e+00 2.750e-02 139.894 <2e-16 ***
num_pages 1.706e-04 6.967e-05 2.448 0.0149 *
ratings_count 6.587e-07 6.578e-07 1.001 0.3174
text_reviews_count -1.708e-05 1.969e-05 -0.867 0.3863
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2576 on 336 degrees of freedom
Multiple R-squared: 0.02333, Adjusted R-squared: 0.01461
F-statistic: 2.675 on 3 and 336 DF, p-value: 0.04722
As you can see from this regression summary, the p-value of the overall F-test is 0.047.
Call:
lm(formula = average_rating ~ ratings_count, data = data_clean3)
Residuals:
Min 1Q Median 3Q Max
-1.01232 -0.16551 0.01275 0.16741 0.60772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.902e+00 1.438e-02 271.450 <2e-16 ***
ratings_count 1.262e-07 1.393e-07 0.906 0.366
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2596 on 338 degrees of freedom
Multiple R-squared: 0.002421, Adjusted R-squared: -0.0005304
F-statistic: 0.8203 on 1 and 338 DF, p-value: 0.3657
As we can see from the summaries above, the number of pages a novel has does appear to have an impact on the ratings given. In the first fit, the coefficient for num_pages has a p-value of 0.0149 and the overall F-test has a p-value of 0.047, so at the 0.05 level we can reject the null hypothesis of no impact. The second fit, which uses ratings_count alone, is not statistically significant (p = 0.366). Although the R-squared value remains extremely small throughout (0.023 at most), it does rise slightly when several variables are used together rather than the ratings count alone.
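The p-values and R-squared values discussed above can also be pulled out of a model summary programmatically instead of being read off the printout. The following is a minimal sketch that uses R's built-in `mtcars` data as a stand-in, since the Goodreads file is not reproduced here; the model `mpg ~ wt + hp` is purely illustrative.

```r
# Stand-in model on the built-in mtcars data (not the Goodreads data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Per-coefficient p-values are in the "Pr(>|t|)" column of the coefficient table
coef_p <- s$coefficients[, "Pr(>|t|)"]

# The overall F-test p-value is not stored directly; compute it from the F statistic
f <- s$fstatistic
overall_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# R-squared and adjusted R-squared
r_squared <- s$r.squared
adj_r_squared <- s$adj.r.squared

print(coef_p)
print(overall_p)
print(c(r_squared, adj_r_squared))
```

Keeping the per-coefficient p-values separate from the overall F-test p-value makes it easier to say which predictor, if any, is driving a significant result.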
par(mfrow = c(2, 2))
plot(fit1)
The equation for my model is average_rating = 3.847 + 1.706e-04(num_pages) + 6.587e-07(ratings_count) - 1.708e-05(text_reviews_count)
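To make the model equation concrete, here is a small worked example that plugs in values by hand. The coefficients are copied from the summary of `fit1` above; the book itself (300 pages, 10,000 ratings, 500 text reviews) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary of fit1
b0 <- 3.847
b_pages <- 1.706e-04
b_ratings <- 6.587e-07
b_reviews <- -1.708e-05

# Hypothetical book: 300 pages, 10,000 ratings, 500 text reviews
predicted <- b0 + b_pages * 300 + b_ratings * 10000 + b_reviews * 500

print(predicted)
```

For this hypothetical book the predicted average rating is roughly 3.90, only slightly above the intercept, which reflects how small the fitted coefficients are.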
I wanted to begin with a simple histogram comparing the average ratings of English-language books across publishers, so that I would have an idea of where the majority of my data lies.
Plot 1
custom_colors <- c("#EA6D59", "#EAA659", "#EAE859", "#53DA57", "#33C7FF",
                   "#B153DA", "#DA53CA", "#5355DA", "#E73533", "#005E0D")
# creating plot
plot1 <- ggplot(data_clean3, aes(x = average_rating, fill = publisher)) +
  geom_histogram(binwidth = 0.3, position = "dodge") +
  labs(caption = "Source: Goodreads",
       title = "<b> Histogram of Book Ratings by Publisher",
       x = "<b> Rating",
       y = "<b> Published Books") +
  scale_fill_manual(values = custom_colors) +
  theme_minimal() +
  theme(legend.position = "top")
# ggplot to plotly
ggplotly(plot1)
In this visualization, I took the ten publishers with the largest number of books and compared their average ratings. I chose to do this to find out whether an author would have an inherent advantage by choosing to publish with a certain company. I also chose to reduce the bin width so that the viewer can see not only which company has the most high ratings, but also the disparity between the high and low averages.
Plot 2
# load RColorBrewer for brewer.pal()
library(RColorBrewer)
# allow for color variance
num_colors <- 7
colors <- brewer.pal(num_colors, "Spectral")
# making plot
plot2 <- ggplot(data_clean3, aes(x = num_pages, y = average_rating, size = ratings_count)) +
  geom_point(aes(color = ratings_count)) +
  geom_smooth(method = "lm", se = FALSE, aes(text = NULL)) +
  labs(caption = "Source: Goodreads",
       title = '<b> Bubble Chart of Books',
       x = '<b> Number of Pages',
       y = '<b> Average Rating') +
  theme_minimal() +
  scale_color_gradientn(colors = colors)
# ggplot to plotly
ggplotly(plot2)
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
In this visualization, I wanted to find out whether the length of a book had anything to do with the ratings it received. I also used the size and color of the bubbles to indicate how many ratings each book received, as I think there is an important distinction between a book that has high ratings from a small sample and one that has high ratings from a bigger crowd. Because many of the bubbles were similar in size, the different colors were an important way to tell the titles apart. Although I increased the number of colors represented in the legend to 7, the colors are still mainly red, which indicates that the majority of the novels have around the same number of ratings, save for the select few that show a major jump.
I found this to be my most interesting visualization simply because it did not work as I wanted it to. I initially wanted to create a heat map that showed publishing companies and their popularity. Even though the visualization did not go as planned, I still found it informative for a viewer. The axes clearly highlight what I am comparing, and the length of each box gives something akin to a box plot. I also think that the different colors for the number of books published are helpful in indicating the accomplishments of each company. While completing these visualizations, I found it important to work as if my audience were an aspiring author who wanted their work to achieve the best results when published, which is why I focused on variables that matter to an author.
Conclusion
As background research on my topic, I wanted to understand the relationship between authors and publishing companies. Going into this, I knew very little about how publishing with big companies compares to self-publishing, and I wanted to investigate whether the particular company has a significant impact. After some research, it was clear to me that the decision to work with a company is a large financial burden, especially for authors who are just starting out. I believe that my analysis of ratings, which correlate with popularity, would give authors a chance to compare and see whether a guaranteed hit with their novel is worth shelling out a couple of thousand extra dollars.
Looking back on my analysis, most of the things I wish I could have changed or spent more time on relate to my data set. I would have wanted to go further in depth with works in other languages, as cutting down to the top publishing companies and focusing on books written in English dropped my observations from over 11,000 to around 300. This data set is massive and has a lot of potential for other types of analysis, which I wish I had the time to look into. If possible, I would have liked to correlate the publishing companies with the locations in which they are based and see whether that also had any sort of impact.