Book Sales Analysis

By Eitan Cohen and Joseph Slater

Abstract

This project analyzes a CSV file of book sales. Using this information, the goal is to help authors achieve their goals, whether that is pursuing a career in writing or producing a massively successful book. We start with some preliminary information, then dig into the analysis and the methods used. After that, the results are presented concisely, along with the conclusions to draw from them. Finally, there is a discussion of how to expand this analysis in the future, along with any notable references.



Introduction

The goal of our project is to help authors in 2 main regards.

Firstly, we want to help authors get off the ground. Writing is a difficult career to break into, and making a sustainable income can be nearly impossible. With our information, we will attempt to compile advice to assist authors in building a sustainable career through writing.

Secondly, I am no author, but I imagine every author wants to write a book adored by the masses. This leads to our second goal: to find information authors can use to write popular books. Whether that means highly rated or best selling is up to each author, but we will try to cover both possibilities.



Data Set Description

The link to the data set is here:

Book Sales and Rating

This data set contains 13 columns, and each will be explained below.

Note the category titles are identical to the data set, which is why some are capitalized, snake case, etc.

Category Title        Data Type     Explanation
Publishing Year       Numeric       The year in which the book was published
Book Name             Categorical   The title of the book
Author                Categorical   The name of the author of the book
language_code         Categorical   The code representing the language in which the book is written
Author_Rating         Categorical   The rating of the author based on their previous work
Book_average_rating   Numeric       The average rating given to the book by readers
Book_ratings_count    Numeric       The number of ratings given to the book by readers
genre                 Categorical   The genre or category to which the book belongs
gross sales           Numeric       The total sales revenue generated by a specific book
publisher revenue     Numeric       The revenue earned by a publisher from selling a specific book
sale price            Numeric       The price at which a specific book was sold
sales rank            Numeric       The rank of a particular book based on its sales performance
units sold            Numeric       The number of units sold for any particular book

This is a quick overview of the data set, and it will be analyzed further in Project Analysis.



Methods and Algorithms

When analyzing this data set, we had one overarching question. As we continued our analysis, smaller questions cropped up that fed into our primary question. For the first sub-question, we used t-tests to compare the means of sale price and gross sales. For the second sub-question, we used permutation testing to see whether the correlation between author rating and average book rating is significant. Next, we analyzed units sold: we compared it to a few columns and then, after finding the best match, split the data and used linear regression to see whether rating count has a relationship with units sold.



Project Analysis



Pre Processing

Firstly, let’s do a quick overview of our data

df <- read.csv("Books_Data_Clean.csv")
print(head(df, 3))
##   index Publishing.Year         Book.Name
## 1     0            1975           Beowulf
## 2     1            1987  Batman: Year One
## 3     2            2015 Go Set a Watchman
##                                                            Author language_code
## 1                                          Unknown, Seamus Heaney         en-US
## 2 Frank Miller, David Mazzucchelli, Richmond Lewis, Dennis O'Neil           eng
## 3                                                      Harper Lee           eng
##   Author_Rating Book_average_rating Book_ratings_count         genre
## 1        Novice                3.42             155903 genre fiction
## 2  Intermediate                4.23             145267 genre fiction
## 3        Novice                3.31             138669 genre fiction
##   gross.sales publisher.revenue sale.price sales.rank
## 1     34160.0           20496.0       4.88          1
## 2     12437.5            7462.5       1.99          2
## 3     47795.0           28677.0       8.69          3
##                        Publisher units.sold
## 1       HarperCollins Publishers       7000
## 2       HarperCollins Publishers       6250
## 3 Amazon Digital Services,  Inc.       5500

Now that we have done a basic search, let’s check if there are null values, and if so let’s remove those rows

any_null <- any(is.na(df))

if (any_null){
  df <- df[complete.cases(df), ]
}
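
Before dropping rows, it can help to see where the missing values are and how many rows the filter removes. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the book data). Note that `read.csv` by default treats blank fields as `NA` only in numeric and logical columns, so blank strings in text columns would not be caught by `is.na`.

```r
# Toy data frame standing in for the book data (all values are made up).
toy <- data.frame(
  units.sold = c(7000, NA, 5500, 4200),
  sale.price = c(4.88, 1.99, NA, 3.99),
  genre      = c("fiction", "nonfiction", "fiction", NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_per_column <- colSums(is.na(toy))

# Keep only complete rows and record how many rows were lost.
rows_before  <- nrow(toy)
toy_clean    <- toy[complete.cases(toy), ]
rows_dropped <- rows_before - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```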

Let’s break our data into categorical and numeric data

numeric_df <- df[, sapply(df, is.numeric)]
categorical_df <- df[, !sapply(df, is.numeric)]

numeric_col <- names(numeric_df)
categorical_col <- names(categorical_df)

print(paste("Numeric DF Columns: ", numeric_col))
## [1] "Numeric DF Columns:  index"              
## [2] "Numeric DF Columns:  Publishing.Year"    
## [3] "Numeric DF Columns:  Book_average_rating"
## [4] "Numeric DF Columns:  Book_ratings_count" 
## [5] "Numeric DF Columns:  gross.sales"        
## [6] "Numeric DF Columns:  publisher.revenue"  
## [7] "Numeric DF Columns:  sale.price"         
## [8] "Numeric DF Columns:  sales.rank"         
## [9] "Numeric DF Columns:  units.sold"
print(paste("Categorical DF Columns: ", categorical_col))
## [1] "Categorical DF Columns:  Book.Name"    
## [2] "Categorical DF Columns:  Author"       
## [3] "Categorical DF Columns:  language_code"
## [4] "Categorical DF Columns:  Author_Rating"
## [5] "Categorical DF Columns:  genre"        
## [6] "Categorical DF Columns:  Publisher"

Now let’s fix the labeling in the fiction genre so everything is simply labeled as “fiction”

old <- c("genre fiction")
new <- c("fiction")

categorical_df$genre <- ifelse(categorical_df$genre %in% old, new[match(categorical_df$genre, old)], categorical_df$genre)

Finally, let’s make a correlation matrix to see any possible relationships in the numeric df

print(cor(numeric_df))
##                           index Publishing.Year Book_average_rating
## index                1.00000000    0.0159927245        -0.019470901
## Publishing.Year      0.01599272    1.0000000000         0.054552921
## Book_average_rating -0.01947090    0.0545529211         1.000000000
## Book_ratings_count  -0.84341585    0.0002904512         0.093393346
## gross.sales         -0.45560864    0.0089077156        -0.041750527
## publisher.revenue   -0.35079440    0.0076285140        -0.043192279
## sale.price           0.07335896    0.0286262524         0.001157624
## sales.rank           0.99947505    0.0167963023        -0.019720424
## units.sold           0.22193908   -0.0152313433        -0.008008357
##                     Book_ratings_count  gross.sales publisher.revenue
## index                    -0.8434158480 -0.455608644      -0.350794395
## Publishing.Year           0.0002904512  0.008907716       0.007628514
## Book_average_rating       0.0933933464 -0.041750527      -0.043192279
## Book_ratings_count        1.0000000000  0.502108212       0.401299011
## gross.sales               0.5021082123  1.000000000       0.912889523
## publisher.revenue         0.4012990107  0.912889523       1.000000000
## sale.price               -0.0686507627  0.272735247       0.313569892
## sales.rank               -0.8355349389 -0.442249337      -0.340668751
## units.sold               -0.2522869232 -0.150707200      -0.116350361
##                       sale.price  sales.rank   units.sold
## index                0.073358959  0.99947505  0.221939081
## Publishing.Year      0.028626252  0.01679630 -0.015231343
## Book_average_rating  0.001157624 -0.01972042 -0.008008357
## Book_ratings_count  -0.068650763 -0.83553494 -0.252286923
## gross.sales          0.272735247 -0.44224934 -0.150707200
## publisher.revenue    0.313569892 -0.34066875 -0.116350361
## sale.price           1.000000000  0.07177560  0.004092362
## sales.rank           0.071775601  1.00000000  0.211478492
## units.sold           0.004092362  0.21147849  1.000000000

All Notable Relationships

  1. Moderate Positive: Gross Sales, Rating Count

  2. Weak Positive: Revenue, Ratings Count

  3. Strong Negative: Sales Rank, Ratings Count

  4. Strong Positive: Revenue, Sales

  5. Weak Negative: Rank, Sales

  6. Weak Positive: Price, Revenue

  7. Weak Negative: Rank, Revenue
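
Reading the notable pairs off a large matrix by eye is error prone, so a small helper can list them instead. This is only a sketch: the function name `strong_pairs`, the 0.3 threshold, and the toy data are all assumptions made for illustration, not part of the original analysis.

```r
# Toy numeric data (made-up values) standing in for numeric_df.
set.seed(1)
x <- rnorm(50)
toy <- data.frame(a = x, b = 2 * x + rnorm(50, sd = 0.1), c = rnorm(50))

# Hypothetical helper: list variable pairs whose |r| meets a threshold.
strong_pairs <- function(data, threshold = 0.3) {
  cm <- cor(data)
  # Look only at the upper triangle so each pair appears once
  # and the diagonal (always 1) is skipped.
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first, regardless of sign.
  out[order(-abs(out$r)), ]
}

res <- strong_pairs(toy, threshold = 0.3)
print(res)
```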

After looking at the categorical data, there are some we can convert to numeric to ease the analysis

genre_mappings <- c("fiction" = 1, "nonfiction" = 2, "children" = 3)
author_rating_mapping <- c("Novice" = 1, "Intermediate" = 2, "Excellent" = 3, "Famous" = 4)

genre_encoded <- genre_mappings[categorical_df$genre]
author_rating_encoded <- author_rating_mapping[categorical_df$Author_Rating]

numeric_df$genre <- genre_encoded
numeric_df$author_rating <- author_rating_encoded
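
One thing to keep in mind with this encoding: author rating has a natural order (Novice up to Famous), while the genre codes 1 to 3 are arbitrary labels, so correlations involving `genre` are hard to interpret. As a hedged alternative for the ordered case, the sketch below uses an ordered factor, which makes the level order explicit and turns unexpected labels into `NA` instead of miscoding them. The sample values are made up.

```r
# Made-up sample of author ratings.
ratings <- c("Novice", "Famous", "Intermediate", "Excellent", "Intermediate")
levels_in_order <- c("Novice", "Intermediate", "Excellent", "Famous")

# An ordered factor records the ranking explicitly.
rating_factor <- factor(ratings, levels = levels_in_order, ordered = TRUE)

# The integer codes follow the declared level order (1 = Novice, 4 = Famous).
rating_codes <- as.integer(rating_factor)
print(rating_codes)

# A label outside the declared levels becomes NA, which is easy to detect.
unexpected <- factor(c("Novice", "Legend"), levels = levels_in_order)
print(sum(is.na(unexpected)))
```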

Now let’s make a new correlation matrix with these new columns

print(cor(numeric_df))
##                           index Publishing.Year Book_average_rating
## index                1.00000000    0.0159927245        -0.019470901
## Publishing.Year      0.01599272    1.0000000000         0.054552921
## Book_average_rating -0.01947090    0.0545529211         1.000000000
## Book_ratings_count  -0.84341585    0.0002904512         0.093393346
## gross.sales         -0.45560864    0.0089077156        -0.041750527
## publisher.revenue   -0.35079440    0.0076285140        -0.043192279
## sale.price           0.07335896    0.0286262524         0.001157624
## sales.rank           0.99947505    0.0167963023        -0.019720424
## units.sold           0.22193908   -0.0152313433        -0.008008357
## genre                0.10840438    0.0389150871         0.027771790
## author_rating        0.25761656    0.0323569041         0.736201004
##                     Book_ratings_count  gross.sales publisher.revenue
## index                    -0.8434158480 -0.455608644      -0.350794395
## Publishing.Year           0.0002904512  0.008907716       0.007628514
## Book_average_rating       0.0933933464 -0.041750527      -0.043192279
## Book_ratings_count        1.0000000000  0.502108212       0.401299011
## gross.sales               0.5021082123  1.000000000       0.912889523
## publisher.revenue         0.4012990107  0.912889523       1.000000000
## sale.price               -0.0686507627  0.272735247       0.313569892
## sales.rank               -0.8355349389 -0.442249337      -0.340668751
## units.sold               -0.2522869232 -0.150707200      -0.116350361
## genre                    -0.0830835189 -0.065021358      -0.020057114
## author_rating            -0.1836376430 -0.198307809      -0.173678973
##                       sale.price  sales.rank   units.sold       genre
## index                0.073358959  0.99947505  0.221939081  0.10840438
## Publishing.Year      0.028626252  0.01679630 -0.015231343  0.03891509
## Book_average_rating  0.001157624 -0.01972042 -0.008008357  0.02777179
## Book_ratings_count  -0.068650763 -0.83553494 -0.252286923 -0.08308352
## gross.sales          0.272735247 -0.44224934 -0.150707200 -0.06502136
## publisher.revenue    0.313569892 -0.34066875 -0.116350361 -0.02005711
## sale.price           1.000000000  0.07177560  0.004092362  0.08707355
## sales.rank           0.071775601  1.00000000  0.211478492  0.10684124
## units.sold           0.004092362  0.21147849  1.000000000  0.04156756
## genre                0.087073550  0.10684124  0.041567561  1.00000000
## author_rating        0.018691326  0.25208362  0.105066888  0.06463014
##                     author_rating
## index                  0.25761656
## Publishing.Year        0.03235690
## Book_average_rating    0.73620100
## Book_ratings_count    -0.18363764
## gross.sales           -0.19830781
## publisher.revenue     -0.17367897
## sale.price             0.01869133
## sales.rank             0.25208362
## units.sold             0.10506689
## genre                  0.06463014
## author_rating          1.00000000

Let’s do a quick summary of our data

print(summary(numeric_df))
##      index        Publishing.Year Book_average_rating Book_ratings_count
##  Min.   :   0.0   Min.   :-560    Min.   :2.970       Min.   : 27308    
##  1st Qu.: 267.0   1st Qu.:1985    1st Qu.:3.850       1st Qu.: 70382    
##  Median : 535.0   Median :2003    Median :4.010       Median : 89262    
##  Mean   : 534.6   Mean   :1971    Mean   :4.007       Mean   : 94914    
##  3rd Qu.: 802.0   3rd Qu.:2010    3rd Qu.:4.170       3rd Qu.:113915    
##  Max.   :1069.0   Max.   :2016    Max.   :4.770       Max.   :206792    
##   gross.sales      publisher.revenue   sale.price       sales.rank    
##  Min.   :  104.9   Min.   :    0.0   Min.   : 0.990   Min.   :   1.0  
##  1st Qu.:  371.2   1st Qu.:    0.0   1st Qu.: 1.990   1st Qu.: 287.0  
##  Median :  810.8   Median :  272.9   Median : 3.990   Median : 596.0  
##  Mean   : 1857.9   Mean   :  843.8   Mean   : 4.871   Mean   : 611.8  
##  3rd Qu.: 1488.3   3rd Qu.:  721.6   3rd Qu.: 6.990   3rd Qu.: 933.0  
##  Max.   :47795.0   Max.   :28677.0   Max.   :33.860   Max.   :1273.0  
##    units.sold        genre       author_rating  
##  Min.   :  106   Min.   :1.000   Min.   :1.000  
##  1st Qu.:  550   1st Qu.:1.000   1st Qu.:2.000  
##  Median : 3915   Median :1.000   Median :2.000  
##  Mean   : 9682   Mean   :1.187   Mean   :2.408  
##  3rd Qu.: 5319   3rd Qu.:1.000   3rd Qu.:3.000  
##  Max.   :61560   Max.   :3.000   Max.   :4.000

Now to visualize some of these values, let’s make some charts

hist(numeric_df$Book_average_rating, main = "Book Average Rating", xlab="Rating", 
     ylab = "Amount", col="royalblue")

hist(numeric_df$author_rating, main = "Author Average Rating", xlab="Rating", 
     ylab = "Amount", col="maroon")

hist(numeric_df$units.sold, main = "Units Sold", xlab="Quanitity", 
     ylab = "Amount", col="forestgreen")

hist(numeric_df$Book_ratings_count, main = "Rating Count", xlab="Count", 
     ylab = "Amount", col="azure")

hist(numeric_df$sale.price, main = "Sales Price", xlab="Book Price", 
     ylab = "Amount", col="gold")

  1. We can see that Book Average Rating is a bell curve with a left skew, and 4 is the mean rating
  2. The most populated author group is 2.0 (intermediate), followed by 3.0 (excellent), then 4 and 1 (famous and novice respectively)
  3. Most books sell somewhere from 0 to 5,000 copies, but there is a significant portion of outliers past the 25,000 range
  4. Books average about 95,000 ratings, and the graph has a right skew (the mean of 94,914 sits above the median of 89,262)
  5. The average book sells for just under $5, and the graph has a right skew with some significant outliers at $25 and $35
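
A quick numeric cross-check for skew direction is to compare the mean with the median: a long tail on the right pulls the mean above the median, and a long tail on the left pulls it below. This is a rule of thumb rather than a formal test, and the helper name `skew_direction` and the toy vectors are assumptions made for this sketch.

```r
# Rule-of-thumb skew check: compare the mean to the median.
skew_direction <- function(x) {
  m  <- mean(x)
  md <- median(x)
  if (m > md) {
    "right"
  } else if (m < md) {
    "left"
  } else {
    "symmetric"
  }
}

# Made-up vectors: one with a long right tail, and its mirror image.
toy_right <- c(1, 2, 2, 3, 3, 3, 4, 20)
toy_left  <- -toy_right

print(skew_direction(toy_right))
print(skew_direction(toy_left))
```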

Next let’s plot some of the stronger relationships to see if we can find anything interesting to dig deeper into

plot(numeric_df$sales.rank, numeric_df$Book_ratings_count, 
     main = "Sales Rank vs Rating Count", xlab = "Sales Rank", 
     ylab = "Rating Count", col= "navy", pch = 16)

plot(numeric_df$publisher.revenue, numeric_df$gross.sales, 
     main = "Publisher Revenue vs Gross Sales", xlab = "Publisher Revenue", 
     ylab = "Gross Sales", col= "goldenrod", pch = 16)

plot(numeric_df$gross.sales, numeric_df$Book_ratings_count, 
     main = "Gross Sales vs Book Rating Count", xlab = "Gross Sales", 
     ylab = "Rating Count", col= "darkviolet", pch = 16)

plot(numeric_df$Book_average_rating, numeric_df$author_rating, 
     main = "Average Book Rating vs Author Rating", xlab = "Average Book Rating", 
     ylab = "Author Rating", col= "thistle", pch = 16)

  1. As sales rank decreases, rating count increases. It is important to note that a sales rank of 1 is the best, so although the relationship here is negative, a lower sales rank is actually beneficial.
  2. As publisher revenue increases so does gross sales, however there is a cluster of books at 0 publisher revenue
  3. Rating count has only a moderate relationship with gross sales
  4. A more well known author generally has higher ratings on average



Self Published Books

The first interesting relationship to be analyzed from these charts is the publisher revenue vs gross sales scatter plot. There is clearly a notable cluster where x = 0, which we interpreted as self published books. Self publishing is the only practical option for some new authors, so let’s dig deeper into it. From the graph it appears that having a publisher greatly enhances your odds of having a successful book, but is that true?

First, let’s make 2 new data sets. A self published data set, and a data set with no self published books.

self_published <- subset(numeric_df, numeric_df$publisher.revenue == 0)
publisher <- subset(numeric_df, numeric_df$publisher.revenue != 0)

Now let’s compare the gross sales from our 3 groups

print(summary(self_published$gross.sales))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.9   331.9   477.3  1297.7  1019.6 19960.0
print(summary(numeric_df$gross.sales))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.9   371.2   810.8  1857.9  1488.3 47795.0
print(summary(publisher$gross.sales))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.9   478.4   977.5  2156.8  1675.1 47795.0

So, let’s run t-tests on these 3 groupings and see what we get

t_self_all <- t.test(self_published$gross.sales, numeric_df$gross.sales)
t_self_pub <- t.test(self_published$gross.sales, publisher$gross.sales)
t_all_pub <- t.test(numeric_df$gross.sales, publisher$gross.sales)

print(t_self_all)
## 
##  Welch Two Sample t-test
## 
## data:  self_published$gross.sales and numeric_df$gross.sales
## t = -3.1426, df = 1009.1, p-value = 0.001723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -909.9793 -210.3929
## sample estimates:
## mean of x mean of y 
##  1297.676  1857.862
print(t_self_pub)
## 
##  Welch Two Sample t-test
## 
## data:  self_published$gross.sales and publisher$gross.sales
## t = -4, df = 1063.7, p-value = 6.771e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1280.6249  -437.7078
## sample estimates:
## mean of x mean of y 
##  1297.676  2156.842
print(t_all_pub)
## 
##  Welch Two Sample t-test
## 
## data:  numeric_df$gross.sales and publisher$gross.sales
## t = -1.4355, df = 1349.3, p-value = 0.1514
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -707.5722  109.6117
## sample estimates:
## mean of x mean of y 
##  1857.862  2156.842

With this information, we can reject the null hypothesis for two of our three comparisons. For both self published vs all books and self published vs publisher, the difference in mean gross sales is statistically significant. For all books vs publisher, we cannot reject the null hypothesis.
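
To make the output above easier to read, here is a sketch of what a Welch two-sample t-test computes: the difference in means divided by the combined standard error, with degrees of freedom from the Welch-Satterthwaite approximation. The helper `welch_t` and the two small vectors are made up for illustration, and the result is compared against R's own `t.test`.

```r
# Hypothetical helper that reproduces the Welch t-test by hand.
welch_t <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation for the degrees of freedom.
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  # Two-sided p-value from the t distribution.
  p <- 2 * pt(-abs(t_stat), df)
  c(t = t_stat, df = df, p.value = p)
}

# Made-up samples.
a <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5)
b <- c(6.8, 7.1, 6.5, 7.9, 7.4, 6.9, 7.3)

manual  <- welch_t(a, b)
builtin <- t.test(a, b)

print(manual)
print(c(builtin$statistic, builtin$parameter, p.value = builtin$p.value))
```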

Let’s plot this information out for a visual representation

means <- c(mean(self_published$gross.sales), mean(numeric_df$gross.sales), mean(publisher$gross.sales))
barplot(means, names.arg = c("Self Published", "All", "With Publisher"), col = c("lightblue", "lightgreen", "lightcoral"), ylim = c(0, max(means) + 100),  ylab = "Gross Sales", main = "Mean Gross Sales")

Now, maybe someone still believes that self published books can peak at the same height as books with a publisher, so let’s create another bar plot with error bars showing the standard deviation.

# Despite the name, these are standard deviations, not standard errors
standard_errors <- c(sd(self_published$gross.sales), sd(numeric_df$gross.sales), sd(publisher$gross.sales))
# barplot() returns the bar midpoints, which are the x positions the error bars need
bar_x <- barplot(means, names.arg = c("Self Published", "All", "With Publisher"), col = c("lightblue", "lightgreen", "lightcoral"), ylim = c(0, max(means) + max(standard_errors) + 50), ylab = "Gross Sales", main = "Mean Gross Sales with SD Error Line")
arrows(bar_x, means, bar_x, means + standard_errors, angle = 90, code = 3, length = 0.1)

Even with this plot it is abundantly clear that both the mean and the peak of self published book sales are lower than those of books with a publisher.

Now, let’s do the same process again with sale price. Maybe the actual price of the books is higher even though the total gross sales are lower.

print(summary(self_published$sale.price))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.990   0.990   2.990   3.031   3.990  12.990
print(summary(numeric_df$sale.price))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.990   1.990   3.990   4.871   6.990  33.860
print(summary(publisher$sale.price))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.990   2.240   5.990   5.854   7.990  33.860

Again let’s run some t-tests to see if we can confirm a statistical difference

t_self_all_p <- t.test(self_published$sale.price, numeric_df$sale.price)
t_self_pub_p <- t.test(self_published$sale.price, publisher$sale.price)
t_all_pub_p <- t.test(numeric_df$sale.price, publisher$sale.price)

print(t_self_all_p)
## 
##  Welch Two Sample t-test
## 
## data:  self_published$sale.price and numeric_df$sale.price
## t = -13.108, df = 1304.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.116313 -1.565303
## sample estimates:
## mean of x mean of y 
##  3.030511  4.871319
print(t_self_pub_p)
## 
##  Welch Two Sample t-test
## 
## data:  self_published$sale.price and publisher$sale.price
## t = -16.422, df = 1036.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.160632 -2.485922
## sample estimates:
## mean of x mean of y 
##  3.030511  5.853788
print(t_all_pub_p)
## 
##  Welch Two Sample t-test
## 
## data:  numeric_df$sale.price and publisher$sale.price
## t = -5.3629, df = 1393.6, p-value = 9.575e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3418394 -0.6230979
## sample estimates:
## mean of x mean of y 
##  4.871319  5.853788

From these three t-tests we can see that all three null hypotheses can be rejected. So we can say that the mean price of a self published book is less than the mean price across all books, which in turn is less than the mean price of books published with a publisher. So, let’s quickly visualize this.

means_p <- c(mean(self_published$sale.price), mean(numeric_df$sale.price), mean(publisher$sale.price))

# Despite the name, these are standard deviations, not standard errors
standard_errors_p <- c(sd(self_published$sale.price), sd(numeric_df$sale.price), sd(publisher$sale.price))
# Use the bar midpoints returned by barplot() so the error bars sit on the bars
bar_x_p <- barplot(means_p, names.arg = c("Self Published", "All", "With Publisher"), col = c("lightblue", "lightgreen", "lightcoral"), ylim = c(0, max(means_p) + max(standard_errors_p) + 1), ylab = "Sales Price", main = "Mean Sales Price with SD Error Line")
arrows(bar_x_p, means_p, bar_x_p, means_p + standard_errors_p, angle = 90, code = 3, length = 0.1)

From here we can see that, once again, both the mean and the range of prices are markedly higher with a publisher. Books with a publisher sell at nearly double the mean price of self published books, which can add up to quite a lot of income.

And now that we know both the mean of gross sales and of sale price is influenced by the presence of a publisher, we can make a scatter plot to visualize this relationship one more time.

plot(self_published$gross.sales, self_published$sale.price, col = "lightblue", pch = 16, main = "Comparing Self Published to Base Dataset", xlab = "Gross Sales", ylab = "Sales Price")
points(publisher$gross.sales, publisher$sale.price, col = "lightcoral", pch = 16)
legend("topright", legend = c("Self Published", "With Publisher"), col = c("lightblue", "lightcoral"), pch = 16)

As we can clearly see, the bottom left corner (low gross sales and low price) is a mix of both groups, but everywhere else is dominated by books with a publisher. Especially once you cross the $6 threshold, only a handful of self published books are present, while a significant portion of the books with a publisher are present.



Top Rated Books

For some people, writing a successful book isn’t defined by sales, but by public perception (or rating). So here we are attempting to figure out how an author can best write a book that is rated highly by the general public. We’re going to key in on two specific variables, author rating and genre. We believe being a household name helps a book be rated more highly. Along with that, we also believe genre will have minimal (if any) effect on ratings. People will read what they want to, so genre shouldn’t matter.

Firstly, let’s split our data into 3 data sets: one for highly rated books (greater than 4.25), one for well received books (3.5 to 4.25), and one for poorly received books (less than 3.5)

high <- subset(numeric_df, numeric_df$Book_average_rating > 4.25)
med <- subset(numeric_df, numeric_df$Book_average_rating <= 4.25 & numeric_df$Book_average_rating >= 3.5)
low <- subset(numeric_df, numeric_df$Book_average_rating < 3.5)

First let’s analyze the author rating. Just to refresh, 1 is novice, 2 is intermediate, 3 is excellent and 4 is famous

observed <- function(data){
  corr <- cor(data$Book_average_rating, data$author_rating)
  return (corr)
}

high_stats <- observed(high)
med_stats <- observed(med)
low_stats <- observed(low)

print(high_stats)
## [1] 0.6930298
print(med_stats)
## [1] 0.624698
print(low_stats)
## [1] 0.1074156

So within two of our groups (high and medium) there is a fairly strong correlation between book rating and author rating, while in the low group it is weak. Let’s check the mean author rating of these books to see what authors we are working with

print(mean(high$author_rating))
## [1] 3.110429
print(mean(med$author_rating))
## [1] 2.322248
print(mean(low$author_rating))
## [1] 1.235294

So it seems our highly rated books are mostly written by excellent and famous authors. Our middle tier of books seems to be mainly written by intermediate and excellent authors. However, we can’t just say this without proving it, so let’s conduct a permutation test to see if the relationship between author rating and book average rating is significant.

permutation_test <- function(data, n){
  observed_stat <- observed(data)
  perm_stats <- numeric(n)
  
  for (i in 1:n){
    shuffled <- data
    shuffled$author_rating <- sample(data$author_rating)
    
    perm_stats[i] <- observed(shuffled)
  }
  
  p_value <- mean(perm_stats >= observed_stat)
  
  return (list(
    observed = observed_stat,
    perm_stats = perm_stats,
    p_value = p_value
  ))
}

high_perm_test <- permutation_test(high, 10000)
med_perm_test <- permutation_test(med, 10000)
low_perm_test <- permutation_test(low, 10000)

Now that each permutation test has run for 10,000 iterations, let’s analyze the results

print_results <- function(data){
  print("Summary of the observed:")
  print(summary(data$observed))
  print("Summary of the permuted:") 
  print(summary(data$perm_stats))
  print(paste("P-Value:", data$p_value))
}

print("Highly Rated Books")
## [1] "Highly Rated Books"
print_results(high_perm_test)
## [1] "Summary of the observed:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.693   0.693   0.693   0.693   0.693   0.693 
## [1] "Summary of the permuted:"
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.2729878 -0.0534804  0.0002386 -0.0002017  0.0530316  0.2799484 
## [1] "P-Value: 0"
print("Medium Rated Books")
## [1] "Medium Rated Books"
print_results(med_perm_test)
## [1] "Summary of the observed:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6247  0.6247  0.6247  0.6247  0.6247  0.6247 
## [1] "Summary of the permuted:"
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.361e-01 -2.277e-02  1.101e-04  3.153e-05  2.299e-02  1.145e-01 
## [1] "P-Value: 0"
print("Poorly Rated Books")
## [1] "Poorly Rated Books"
print_results(low_perm_test)
## [1] "Summary of the observed:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1074  0.1074  0.1074  0.1074  0.1074  0.1074 
## [1] "Summary of the permuted:"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.74184 -0.10406  0.02350 -0.00133  0.13091  0.32896 
## [1] "P-Value: 0.3002"

With these results it’s clear that for highly rated books there is a strong linear relationship, as the observed correlation of 0.693 shows. The reported p-value of 0 means that none of the 10,000 permuted correlations reached the observed value, so the true p-value is below roughly 1 in 10,000 and the result is statistically significant. The mean of the permuted correlations is close to 0, as expected when there is no relationship, and even the largest permuted value (about 0.28) falls far short of the observed correlation.

The same can be said for the medium rated books. The observed correlation (0.625) also implies a strong linear relationship, though slightly weaker than for highly rated books. The p-value is again 0, indicating statistical significance, and the permuted correlations are again centered near 0.

However, things change when it comes to the poorly rated books. Here our p-value is about 0.30, far too high to reject our null hypothesis (that there is no relationship between book rating and author rating). So any author can have their books received poorly by consumers.
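
A reported p-value of exactly 0 is an artifact of the finite number of permutations. A commonly used adjustment adds 1 to both the count and the number of permutations, so the estimate can never fall below 1 / (n + 1). The sketch below also uses absolute values to make the test two-sided. Both choices differ from the one-sided, unadjusted `permutation_test` above; the helper name `perm_p_value` and the simulated data are assumptions made for illustration.

```r
# Two-sided permutation p-value with the add-one adjustment.
perm_p_value <- function(perm_stats, observed_stat) {
  extreme <- sum(abs(perm_stats) >= abs(observed_stat))
  (extreme + 1) / (length(perm_stats) + 1)
}

# Simulated (made-up) data with a real positive relationship.
set.seed(42)
x <- rnorm(40)
y <- x + rnorm(40, sd = 0.5)

observed_stat <- cor(x, y)
# Shuffling y breaks any link between x and y.
perm_stats <- replicate(2000, cor(x, sample(y)))

p <- perm_p_value(perm_stats, observed_stat)
print(p)
```

With 10,000 permutations and no permuted value as extreme as the observed one, this adjusted estimate would be 1 / 10,001 rather than 0.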

Units Sold

Now, according to some authors, units sold could be the metric of success. For some it’s selling enough to get by; for others it would be selling enough to become a recognizable title. Regardless, let’s analyze units sold to see what information we can gather about what would be considered a successful book. Sale price, average rating, and rating count are the columns we are most interested in.

First let’s look at some general information

units_sold <- numeric_df$units.sold
print(summary(units_sold))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     106     550    3915    9682    5319   61560

From here we can see there’s a massive range in units sold. Let’s visualize this data quickly

quartiles <- summary(numeric_df$units.sold)

boxplot(units_sold, main = "Boxplot of Units Sold", ylab = "Unit Sold",
        col = "firebrick", pch = 16, at=0)

From this visualization we can clearly see that we have two main groups. There is the main group, which makes up the box of our graph, and then a massive cluster of extremely well selling books. So why not split these up and see if we can gain any insights? From the look of this graph, 20,000 units sold seems like an appropriate cutoff.

lower_end <- subset(numeric_df, numeric_df$units.sold < 20000)
higher_end <- subset(numeric_df, numeric_df$units.sold >= 20000)
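
The 20,000 cutoff above was chosen by eye from the boxplot. For comparison only, the conventional boxplot rule flags points more than 1.5 times the interquartile range above the third quartile. The sketch below applies that rule to the quartiles reported in the summary (550 and 5,319); the toy vector is made up, and R's `boxplot` uses hinges that can differ slightly from these quartiles.

```r
# Quartiles of units sold as reported by summary() above.
q1 <- 550
q3 <- 5319

# Conventional upper fence: Q3 + 1.5 * IQR.
upper_fence <- q3 + 1.5 * (q3 - q1)
print(upper_fence)

# Made-up unit counts to show which values the fence would flag.
toy_units <- c(300, 800, 4000, 5200, 12000, 13000, 25000, 61560)
flagged <- toy_units[toy_units > upper_fence]
print(flagged)
```

Under these assumptions the fence lands near 12,500, below the 20,000 used here, so the visual cutoff is the stricter definition of a top seller.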

Let’s plot these out to see the difference from our initial boxplot

quartiles <- summary(lower_end$units.sold)

boxplot(lower_end$units.sold, main = "Fair Selling Units Sold",
        ylab = "Unit Sold", col = "firebrick", pch = 16, at = 1)

boxplot(higher_end$units.sold, main = "Great Selling Units Sold: ",
        ylab = "Unit Sold", col = "midnightblue", pch = 16, at = 1)

Now that we have a basic representation of this, let’s do some analysis using our aforementioned columns

print_corr <- function(data){
  print(cor.test(data$units.sold, data$sale.price)$p.value)
  print(cor.test(data$units.sold, data$Book_ratings_count)$p.value)
  print(cor.test(data$units.sold, data$Book_average_rating)$p.value)
}

print("Lower End")
## [1] "Lower End"
print_corr(lower_end)
## [1] 0.3056349
## [1] 4.437602e-14
## [1] 0.7706901
print("Higher End")
## [1] "Higher End"
print_corr(higher_end)
## [1] 0.6570966
## [1] 0.001818884
## [1] 0.4847973

Interestingly enough, for both the lower and higher end, the relationship with book rating count has by far the smallest p-value (about 4.4e-14 and 0.0018 respectively), while the p-values for sale price and average rating are all well above 0.05. So only for rating count can we reject the null hypothesis that there is no relationship with units sold. Let’s plot these out and see what we have
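
A p-value on its own says nothing about the direction or size of a relationship, so it can be useful to report the correlation estimate next to it. The following sketch does that on simulated data; the helper name `corr_summary` and all values are hypothetical, and only the column names mirror the data set.

```r
# Hypothetical helper: correlation estimate and p-value for each predictor.
corr_summary <- function(data, target, predictors) {
  rows <- lapply(predictors, function(p) {
    test <- cor.test(data[[target]], data[[p]])
    data.frame(
      predictor = p,
      r         = unname(test$estimate),
      p_value   = test$p.value,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Simulated (made-up) data: rating count depends on units sold, price does not.
set.seed(7)
toy <- data.frame(units.sold = rnorm(30, mean = 5000, sd = 1000))
toy$Book_ratings_count <- 20 * toy$units.sold + rnorm(30, sd = 5000)
toy$sale.price <- runif(30, min = 1, max = 10)

res <- corr_summary(toy, "units.sold", c("sale.price", "Book_ratings_count"))
print(res)
```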

plot(lower_end$units.sold, lower_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
     col = "firebrick", main = "Lower End: Units Sold vs Book Rating Count")

plot(higher_end$units.sold, higher_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
     col = "midnightblue", main = "Higher End: Units Sold vs Book Rating Count")

With our lower end, we seem to have four clusters:
1. 0 - 100 units sold and 50,000 - 80,000 reviews
2. 200 - 1,000 units sold and 100,000 - 170,000 reviews
3. 2,900 - 3,000 units sold and 50,000 - 100,000 reviews
4. 4,200 - 4,500 units sold and 50,000 - 100,000 reviews

Now obviously it doesn’t make sense to have more reviews than units sold, so we have to assume that units sold is recorded in multiples of some kind. However, our data set doesn’t include that information, so this is the best we can do with what is provided.

Looking at our higher end, things are much more sporadic, with no true clusters.

So, let’s fit a linear regression to each of these and see if we can gain anything from that.

plot(lower_end$units.sold, lower_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
     col = "firebrick", main = "Lower End: Units Sold vs Book Rating Count")

lower_model <- lm(lower_end$Book_ratings_count ~ lower_end$units.sold, data = lower_end)
abline(lower_model, col = "maroon1")

plot(higher_end$units.sold, higher_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
     col = "midnightblue", main = "Higher End: Units Sold vs Book Rating Count")

higher_model <- lm(higher_end$Book_ratings_count ~ higher_end$units.sold, data = higher_end)
abline(higher_model, col = "steelblue2")

Now, before we go further into this analysis, let’s plot the residuals to see how good each fit is.

residuals_l <- residuals(lower_model)
fitted_values_l <- fitted(lower_model)

plot(fitted_values_l, residuals_l, main = "Lower End: Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals", col="firebrick")
abline(h = 0, col = "magenta3", lty = 2)  

residuals_h <- residuals(higher_model)
fitted_values_h <- fitted(higher_model)

plot(fitted_values_h, residuals_h, main = "Higher End: Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals", col="midnightblue")
abline(h = 0, col = "royalblue", lty = 2)  

Based on these residuals we can surmise two things:
1. The lower end regression isn’t very helpful. The residual plot shows the same patterns as the regression (just flipped), and as the residuals don’t appear to be random, we can say the relationship doesn’t appear to be linear.
2. The high end regression can be used for further analysis. The residual plot shows a bit of clustering, but nothing serious enough to say there is a lack of a linear relationship.

So for now, let’s continue analyzing the high end relationship. First let’s look at the model summary.

print(summary(higher_model))
## 
## Call:
## lm(formula = higher_end$Book_ratings_count ~ higher_end$units.sold, 
##     data = higher_end)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45131  -8955  -1564   8557  33373 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.494e+04  4.588e+03   14.15  < 2e-16 ***
## higher_end$units.sold 3.531e-01  1.117e-01    3.16  0.00182 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13510 on 202 degrees of freedom
## Multiple R-squared:  0.04711,    Adjusted R-squared:  0.04239 
## F-statistic: 9.987 on 1 and 202 DF,  p-value: 0.001819

From this we can gather that there is in fact a positive relationship between units sold and rating count. Our Y intercept is estimated at 64,940, which is the model’s predicted rating count at zero units sold, and for every additional unit sold the ratings count increases by 0.3531. So a book that sells 20,000 copies should have approximately 64,940 + 0.3531 × 20,000 ≈ 72,000 ratings. Now, the R^2 of 0.047 means units sold explains only about 5% of the variation in rating count, which implies that there are multiple other factors going into an increase in rating counts, but units sold is one of them.
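To make the fitted line concrete, the printed coefficients can be plugged in by hand. This is a small sketch using only the two estimates from the summary above; predict_ratings is a hypothetical helper name, and the intercept by itself corresponds to zero units sold, which lies outside the range of this subset.

```r
# Coefficients copied from the printed summary of higher_model
intercept <- 6.494e+04   # predicted rating count at zero units sold
slope <- 3.531e-01       # additional ratings per additional unit sold

# Hypothetical helper: predicted rating count for a given number of units sold
predict_ratings <- function(units) intercept + slope * units

pred_20k <- predict_ratings(20000)
pred_30k <- predict_ratings(30000)
print(pred_20k)
print(pred_30k - pred_20k)   # change in predicted ratings for 10,000 more units
```

With the fitted model object available, predict(higher_model) gives the same fitted values for the books in the subset without copying coefficients by hand.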

Now, let’s conduct hypothesis testing to confirm there is a significant relationship between units sold and ratings count.

coef_estimate <- coef(summary(higher_model))[2, 1]
std_error <- coef(summary(higher_model))[2, 2]

t_stat <- coef_estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df.residual(higher_model))
print(p_value)
## [1] 0.001818884

With this P value, we can confidently reject the null hypothesis and say that there is a significant relationship between units sold and ratings count for books selling over 20,000 copies.
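The manual calculation above reproduces the slope p-value that summary() already reports. The sketch below checks that equivalence on synthetic data (made-up values; demo and demo_model are assumed names, not objects from the analysis) and also prints a 95% confidence interval for the slope with confint(), which shows the range of plausible slopes rather than just a yes/no decision.

```r
# Synthetic data (made up) with the same number of rows as the higher-end fit
set.seed(7)
demo <- data.frame(units.sold = runif(204, 20000, 60000))
demo$Book_ratings_count <- 65000 + 0.35 * demo$units.sold +
  rnorm(204, sd = 13500)
demo_model <- lm(Book_ratings_count ~ units.sold, data = demo)

# Two-sided t test on the slope, computed by hand
coefs <- coef(summary(demo_model))
t_stat_demo <- coefs[2, 1] / coefs[2, 2]
manual_p <- 2 * pt(-abs(t_stat_demo), df = df.residual(demo_model))

# Should agree with the Pr(>|t|) column of the summary table
same_p <- isTRUE(all.equal(manual_p, coefs[2, 4]))
print(same_p)

# 95% confidence interval for the slope
slope_ci <- confint(demo_model, "units.sold", level = 0.95)
print(slope_ci)
```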

Final Results

First, let’s go over the self published data. We hypothesized that having a publisher greatly increases the success of a given book. The results show a clear relationship between having a publisher and more success: with a publisher, you sell more copies, and at a higher price, than without one. T tests supported this, so we can say our hypothesis was correct.

Second, we looked at the top rated books. We broke the books up into 3 tiers based on ratings and analyzed each tier individually. Our hypothesis was that each tier would be dominated by its corresponding author ratings. So, highly rated books would come mostly from excellent and famous authors, well rated books from intermediate and excellent authors, and poorly rated books from novice and intermediate authors. This proved two thirds true. The permutation test showed that the highly rated and well rated books both followed the pattern we predicted. However, that was not the case for poorly rated books, which did not have a significant correlation with the author’s rating.

Lastly, we looked at the units sold. We hypothesized units sold would have a correlation with rating count, average rating, and sale price. After splitting the data into “best-seller” and other categories, both showed that the only possible correlation was with rating count. After doing linear regression on both groups, the only possible correlation which still remained was the best-seller rating count relationship. After more testing, we were able to confirm that this was a true correlation.

Conclusions

So, let’s review our initial questions. First, what can an author do to maximize the odds of a best seller? And second, what can an author do to jump start a career?

So, what we saw was that if you want a best seller, you need to have a recognizable name. Now this is pure conjecture, but people are probably predisposed to assume authors they have heard of before are great at writing, and they rate the books better because of that. And as we saw later, rating count and units sold have a positive correlation, so a book that is gaining more ratings tends to be one that is selling more units. So, the author needs to build a name, and then release a book which gets people talking. That is the path to a best seller.

Second, what about a struggling new author? For a new author, there’s far less information. But one thing that is a necessity is to find yourself a publisher. Publishers increase gross sales and sale price, which allows an author to continue writing. Now this is more conjecture, but if an author has a publisher and can generate enough money to continue writing, they can eventually build themselves a large enough name to have that best seller.

Future Scope

In the future there are some things that could improve the results of this analysis. Firstly, if the genres were broken up further, it would allow more nuanced analysis. You could look at a variety of genres and see how they perform against one another, something this data set lacks. Besides that, the data set could use more actual data. We were able to make some subsets and do some interesting analysis, but with more data you can break things down further and be more precise.

References

“Books Sales and Ratings.” Kaggle, www.kaggle.com/datasets/thedevastator/books-sales-and-ratings. Accessed 18 Dec. 2023.