By Eitan Cohen and Joseph Slater
This project is an analysis of a book sales CSV. Using this information, the goal is to help authors achieve their goals, whether that be pursuing a career in writing or producing a massively successful book. We start by going over some preliminary information, then dig into the analysis and the methods used. After that, the results are presented in a concise manner, along with the conclusions to draw from them. Finally, there is a discussion of how to expand this analysis in the future, along with notable references.
The goal of our project is to help authors in two main regards.
Firstly, we want to help authors get off the ground. Writing is a difficult career to break into, and making a sustainable income can be nearly impossible. With our findings, we will attempt to compile advice to assist authors in building a sustainable career through writing.
Secondly, we are no authors, but we imagine every author wants to write a book adored by the masses. This leads to our second goal: to find information authors can use in order to write popular books. Whether that means highly rated books or best sellers is each author's own distinction, but we will try to cover both possibilities.
The link to the data set is here: www.kaggle.com/datasets/thedevastator/books-sales-and-ratings (see References).
This data set contains 13 columns, and each will be explained below.
Note the category titles are copied exactly from the data set, which is why the casing is inconsistent (some capitalized, some snake case, etc.).
| Category Title | Data Type | Explanation |
|---|---|---|
| Publishing Year | Numeric | The year in which the book was published |
| Book Name | Categorical | The title of the book |
| Author | Categorical | The name of the author of the book |
| language_code | Categorical | The code representing the language in which the book is written |
| Author_Rating | Categorical | The rating of the author based on their previous work |
| Book_average_rating | Numeric | The average rating given to the book by readers |
| Book_ratings_count | Numeric | The number of ratings given to the book by readers |
| genre | Categorical | The genre or category to which the book belongs |
| gross sales | Numeric | The total sales revenue generated by a specific book |
| publisher revenue | Numeric | The revenue earned by a publisher from selling a specific book |
| sales price | Numeric | The price at which a specific book was sold |
| sales rank | Numeric | The rank of a particular book based on its sale performance |
| units sold | Numeric | The number of units sold for any particular book |
This is a quick overview of the data set; it will be analyzed further in the Project Analysis section.
When analyzing this data set, we had one overarching question. As we continued our analysis, smaller questions began to crop up which fed into our primary question. For the first sub-question, we used t-tests to compare the means of sale price and gross sales. The second sub-question used permutation testing to see whether the correlation between author rating and average book rating is significant. Next, we analyzed units sold: we compared it against a few columns, and after finding the best match we subset the data and fit linear regressions to determine whether ratings count has a relationship with units sold.
Firstly, let’s do a quick overview of our data
df <- read.csv("Books_Data_Clean.csv")
print(head(df, 3))
## index Publishing.Year Book.Name
## 1 0 1975 Beowulf
## 2 1 1987 Batman: Year One
## 3 2 2015 Go Set a Watchman
## Author language_code
## 1 Unknown, Seamus Heaney en-US
## 2 Frank Miller, David Mazzucchelli, Richmond Lewis, Dennis O'Neil eng
## 3 Harper Lee eng
## Author_Rating Book_average_rating Book_ratings_count genre
## 1 Novice 3.42 155903 genre fiction
## 2 Intermediate 4.23 145267 genre fiction
## 3 Novice 3.31 138669 genre fiction
## gross.sales publisher.revenue sale.price sales.rank
## 1 34160.0 20496.0 4.88 1
## 2 12437.5 7462.5 1.99 2
## 3 47795.0 28677.0 8.69 3
## Publisher units.sold
## 1 HarperCollins Publishers 7000
## 2 HarperCollins Publishers 6250
## 3 Amazon Digital Services, Inc. 5500
Now that we have done a basic inspection, let's check for null values and, if so, remove those rows
any_null <- any(is.na(df))
if (any_null){
df <- df[complete.cases(df), ]
}
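It can also be worth reporting how many rows are affected; a small sketch (this must run before the filtering step above, or it will trivially report zero):
n_incomplete <- sum(!complete.cases(df))  # rows with at least one NA
print(paste("Incomplete rows:", n_incomplete))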
Let’s break our data into categorical and numeric data
numeric_df <- df[, sapply(df, is.numeric)]
categorical_df <- df[, !sapply(df, is.numeric)]
numeric_col <- names(numeric_df)
categorical_col <- names(categorical_df)
print(paste("Numeric DF Columns: ", numeric_col))
## [1] "Numeric DF Columns: index"
## [2] "Numeric DF Columns: Publishing.Year"
## [3] "Numeric DF Columns: Book_average_rating"
## [4] "Numeric DF Columns: Book_ratings_count"
## [5] "Numeric DF Columns: gross.sales"
## [6] "Numeric DF Columns: publisher.revenue"
## [7] "Numeric DF Columns: sale.price"
## [8] "Numeric DF Columns: sales.rank"
## [9] "Numeric DF Columns: units.sold"
print(paste("Categorical DF Columns: ", categorical_col))
## [1] "Categorical DF Columns: Book.Name"
## [2] "Categorical DF Columns: Author"
## [3] "Categorical DF Columns: language_code"
## [4] "Categorical DF Columns: Author_Rating"
## [5] "Categorical DF Columns: genre"
## [6] "Categorical DF Columns: Publisher"
Now let’s fix the labeling in the fiction genre so everything is simply labeled as “fiction”
old <- c("genre fiction")
new <- c("fiction")
categorical_df$genre <- ifelse(categorical_df$genre %in% old, new[match(categorical_df$genre, old)], categorical_df$genre)
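A quick frequency table can verify the relabel took effect (exact counts will depend on the data):
# Every former "genre fiction" row should now be counted under "fiction"
print(table(categorical_df$genre))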
Finally, let’s make a correlation matrix to see any possible relationships in the numeric df
print(cor(numeric_df))
## index Publishing.Year Book_average_rating
## index 1.00000000 0.0159927245 -0.019470901
## Publishing.Year 0.01599272 1.0000000000 0.054552921
## Book_average_rating -0.01947090 0.0545529211 1.000000000
## Book_ratings_count -0.84341585 0.0002904512 0.093393346
## gross.sales -0.45560864 0.0089077156 -0.041750527
## publisher.revenue -0.35079440 0.0076285140 -0.043192279
## sale.price 0.07335896 0.0286262524 0.001157624
## sales.rank 0.99947505 0.0167963023 -0.019720424
## units.sold 0.22193908 -0.0152313433 -0.008008357
## Book_ratings_count gross.sales publisher.revenue
## index -0.8434158480 -0.455608644 -0.350794395
## Publishing.Year 0.0002904512 0.008907716 0.007628514
## Book_average_rating 0.0933933464 -0.041750527 -0.043192279
## Book_ratings_count 1.0000000000 0.502108212 0.401299011
## gross.sales 0.5021082123 1.000000000 0.912889523
## publisher.revenue 0.4012990107 0.912889523 1.000000000
## sale.price -0.0686507627 0.272735247 0.313569892
## sales.rank -0.8355349389 -0.442249337 -0.340668751
## units.sold -0.2522869232 -0.150707200 -0.116350361
## sale.price sales.rank units.sold
## index 0.073358959 0.99947505 0.221939081
## Publishing.Year 0.028626252 0.01679630 -0.015231343
## Book_average_rating 0.001157624 -0.01972042 -0.008008357
## Book_ratings_count -0.068650763 -0.83553494 -0.252286923
## gross.sales 0.272735247 -0.44224934 -0.150707200
## publisher.revenue 0.313569892 -0.34066875 -0.116350361
## sale.price 1.000000000 0.07177560 0.004092362
## sales.rank 0.071775601 1.00000000 0.211478492
## units.sold 0.004092362 0.21147849 1.000000000
All Notable Relationships
- Moderate positive: gross sales vs. ratings count (r ≈ 0.50)
- Weak positive: publisher revenue vs. ratings count (r ≈ 0.40)
- Strong negative: sales rank vs. ratings count (r ≈ -0.84)
- Strong positive: publisher revenue vs. gross sales (r ≈ 0.91)
- Weak negative: sales rank vs. gross sales (r ≈ -0.44)
- Weak positive: sale price vs. publisher revenue (r ≈ 0.31)
- Weak negative: sales rank vs. publisher revenue (r ≈ -0.34)
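These pairs were read off the matrix by eye; they could also be extracted programmatically. A sketch, using an assumed threshold of |r| > 0.25 (note the index column will also surface, since it is nearly identical to sales rank):
# Flag variable pairs whose absolute correlation exceeds a threshold
cm <- cor(numeric_df)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
notable <- which(abs(cm) > 0.25, arr.ind = TRUE)
for (k in seq_len(nrow(notable))){
  i <- notable[k, 1]
  j <- notable[k, 2]
  print(paste(rownames(cm)[i], "vs", colnames(cm)[j], ":", round(cm[i, j], 3)))
}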
After looking at the categorical data, there are some columns we can convert to numeric to ease the analysis
genre_mappings <- c("fiction" = 1, "nonfiction" = 2, "children" = 3)
author_rating_mapping <- c("Novice" = 1, "Intermediate" = 2, "Excellent" = 3, "Famous" = 4)
genre_encoded <- genre_mappings[categorical_df$genre]
author_rating_encoded <- author_rating_mapping[categorical_df$Author_Rating]
numeric_df$genre <- genre_encoded
numeric_df$author_rating <- author_rating_encoded
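An alternative to hand-rolled integer codes is R's ordered factors, which preserve the Novice < Intermediate < Excellent < Famous ordering without implying equal spacing; a sketch, assuming those four labels are the only ones present:
# as.integer() on the ordered factor recovers the same 1-4 codes as above
ar_factor <- factor(categorical_df$Author_Rating,
                    levels = c("Novice", "Intermediate", "Excellent", "Famous"),
                    ordered = TRUE)
stopifnot(all(as.integer(ar_factor) == author_rating_encoded))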
Now let’s make a new correlation matrix with these new columns
print(cor(numeric_df))
## index Publishing.Year Book_average_rating
## index 1.00000000 0.0159927245 -0.019470901
## Publishing.Year 0.01599272 1.0000000000 0.054552921
## Book_average_rating -0.01947090 0.0545529211 1.000000000
## Book_ratings_count -0.84341585 0.0002904512 0.093393346
## gross.sales -0.45560864 0.0089077156 -0.041750527
## publisher.revenue -0.35079440 0.0076285140 -0.043192279
## sale.price 0.07335896 0.0286262524 0.001157624
## sales.rank 0.99947505 0.0167963023 -0.019720424
## units.sold 0.22193908 -0.0152313433 -0.008008357
## genre 0.10840438 0.0389150871 0.027771790
## author_rating 0.25761656 0.0323569041 0.736201004
## Book_ratings_count gross.sales publisher.revenue
## index -0.8434158480 -0.455608644 -0.350794395
## Publishing.Year 0.0002904512 0.008907716 0.007628514
## Book_average_rating 0.0933933464 -0.041750527 -0.043192279
## Book_ratings_count 1.0000000000 0.502108212 0.401299011
## gross.sales 0.5021082123 1.000000000 0.912889523
## publisher.revenue 0.4012990107 0.912889523 1.000000000
## sale.price -0.0686507627 0.272735247 0.313569892
## sales.rank -0.8355349389 -0.442249337 -0.340668751
## units.sold -0.2522869232 -0.150707200 -0.116350361
## genre -0.0830835189 -0.065021358 -0.020057114
## author_rating -0.1836376430 -0.198307809 -0.173678973
## sale.price sales.rank units.sold genre
## index 0.073358959 0.99947505 0.221939081 0.10840438
## Publishing.Year 0.028626252 0.01679630 -0.015231343 0.03891509
## Book_average_rating 0.001157624 -0.01972042 -0.008008357 0.02777179
## Book_ratings_count -0.068650763 -0.83553494 -0.252286923 -0.08308352
## gross.sales 0.272735247 -0.44224934 -0.150707200 -0.06502136
## publisher.revenue 0.313569892 -0.34066875 -0.116350361 -0.02005711
## sale.price 1.000000000 0.07177560 0.004092362 0.08707355
## sales.rank 0.071775601 1.00000000 0.211478492 0.10684124
## units.sold 0.004092362 0.21147849 1.000000000 0.04156756
## genre 0.087073550 0.10684124 0.041567561 1.00000000
## author_rating 0.018691326 0.25208362 0.105066888 0.06463014
## author_rating
## index 0.25761656
## Publishing.Year 0.03235690
## Book_average_rating 0.73620100
## Book_ratings_count -0.18363764
## gross.sales -0.19830781
## publisher.revenue -0.17367897
## sale.price 0.01869133
## sales.rank 0.25208362
## units.sold 0.10506689
## genre 0.06463014
## author_rating 1.00000000
Let’s do a quick summary of our data
print(summary(numeric_df))
## index Publishing.Year Book_average_rating Book_ratings_count
## Min. : 0.0 Min. :-560 Min. :2.970 Min. : 27308
## 1st Qu.: 267.0 1st Qu.:1985 1st Qu.:3.850 1st Qu.: 70382
## Median : 535.0 Median :2003 Median :4.010 Median : 89262
## Mean : 534.6 Mean :1971 Mean :4.007 Mean : 94914
## 3rd Qu.: 802.0 3rd Qu.:2010 3rd Qu.:4.170 3rd Qu.:113915
## Max. :1069.0 Max. :2016 Max. :4.770 Max. :206792
## gross.sales publisher.revenue sale.price sales.rank
## Min. : 104.9 Min. : 0.0 Min. : 0.990 Min. : 1.0
## 1st Qu.: 371.2 1st Qu.: 0.0 1st Qu.: 1.990 1st Qu.: 287.0
## Median : 810.8 Median : 272.9 Median : 3.990 Median : 596.0
## Mean : 1857.9 Mean : 843.8 Mean : 4.871 Mean : 611.8
## 3rd Qu.: 1488.3 3rd Qu.: 721.6 3rd Qu.: 6.990 3rd Qu.: 933.0
## Max. :47795.0 Max. :28677.0 Max. :33.860 Max. :1273.0
## units.sold genre author_rating
## Min. : 106 Min. :1.000 Min. :1.000
## 1st Qu.: 550 1st Qu.:1.000 1st Qu.:2.000
## Median : 3915 Median :1.000 Median :2.000
## Mean : 9682 Mean :1.187 Mean :2.408
## 3rd Qu.: 5319 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :61560 Max. :3.000 Max. :4.000
Now to visualize some of these values, let’s make some charts
hist(numeric_df$Book_average_rating, main = "Book Average Rating", xlab="Rating",
ylab = "Amount", col="royalblue")
hist(numeric_df$author_rating, main = "Author Rating", xlab="Rating",
ylab = "Amount", col="maroon")
hist(numeric_df$units.sold, main = "Units Sold", xlab="Quantity",
ylab = "Amount", col="forestgreen")
hist(numeric_df$Book_ratings_count, main = "Rating Count", xlab="Count",
ylab = "Amount", col="azure")
hist(numeric_df$sale.price, main = "Sales Price", xlab="Book Price",
ylab = "Amount", col="gold")
Next let’s plot some of the stronger relationships to see if we can find anything interesting to dig deeper into
plot(numeric_df$sales.rank, numeric_df$Book_ratings_count,
main = "Sales Rank vs Rating Count", xlab = "Sales Rank",
ylab = "Rating Count", col= "navy", pch = 16)
plot(numeric_df$publisher.revenue, numeric_df$gross.sales,
main = "Publisher Revenue vs Gross Sales", xlab = "Publisher Revenue",
ylab = "Gross Sales", col= "goldenrod", pch = 16)
plot(numeric_df$gross.sales, numeric_df$Book_ratings_count,
main = "Gross Sales vs Book Rating Count", xlab = "Gross Sales",
ylab = "Rating Count", col= "darkviolet", pch = 16)
plot(numeric_df$Book_average_rating, numeric_df$author_rating,
main = "Average Book Rating vs Author Rating", xlab = "Average Book Rating",
ylab = "Author Rating", col= "thistle", pch = 16)
The first interesting relationship to analyze from these charts is the publisher revenue vs. gross sales scatter plot. There is clearly a notable cluster at x = 0, which we interpret as self-published books. Self publishing is the only practical option for some new authors, so let’s dig deeper into it. From the graph it appears having a publisher greatly enhances the odds of a successful book, but is that true?
First, let’s make two new data sets: one of only self-published books, and one with no self-published books.
self_published <- subset(numeric_df, numeric_df$publisher.revenue == 0)
publisher <- subset(numeric_df, numeric_df$publisher.revenue != 0)
Now let’s compare the gross sales across our three groups (self-published, all, and with publisher)
print(summary(self_published$gross.sales))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.9 331.9 477.3 1297.7 1019.6 19960.0
print(summary(numeric_df$gross.sales))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.9 371.2 810.8 1857.9 1488.3 47795.0
print(summary(publisher$gross.sales))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.9 478.4 977.5 2156.8 1675.1 47795.0
So, let’s run Welch two-sample t-tests on these three groupings and see what we get
t_self_all <- t.test(self_published$gross.sales, numeric_df$gross.sales)
t_self_pub <- t.test(self_published$gross.sales, publisher$gross.sales)
t_all_pub <- t.test(numeric_df$gross.sales, publisher$gross.sales)
print(t_self_all)
##
## Welch Two Sample t-test
##
## data: self_published$gross.sales and numeric_df$gross.sales
## t = -3.1426, df = 1009.1, p-value = 0.001723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -909.9793 -210.3929
## sample estimates:
## mean of x mean of y
## 1297.676 1857.862
print(t_self_pub)
##
## Welch Two Sample t-test
##
## data: self_published$gross.sales and publisher$gross.sales
## t = -4, df = 1063.7, p-value = 6.771e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1280.6249 -437.7078
## sample estimates:
## mean of x mean of y
## 1297.676 2156.842
print(t_all_pub)
##
## Welch Two Sample t-test
##
## data: numeric_df$gross.sales and publisher$gross.sales
## t = -1.4355, df = 1349.3, p-value = 0.1514
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -707.5722 109.6117
## sample estimates:
## mean of x mean of y
## 1857.862 2156.842
With this information, we can reject the null hypothesis for two of our three comparisons. We can confidently say that for both self-published books vs. all books and self-published vs. published books, the difference in means is statistically significant. For all books vs. published books, we cannot reject the null hypothesis.
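One caveat: the summaries above show gross sales are heavily right-skewed (for self-published books the median is 477 while the mean is 1,298), and t-tests compare means. As a robustness check beyond the original analysis, a non-parametric rank-based test could be run alongside; a minimal sketch:
# Wilcoxon rank-sum test is less sensitive to the heavy right skew
print(wilcox.test(self_published$gross.sales, publisher$gross.sales))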
Let’s plot this information out for a visual representation
means <- c(mean(self_published$gross.sales), mean(numeric_df$gross.sales), mean(publisher$gross.sales))
barplot(means, names.arg = c("Self Published", "All", "With Publisher"), col = c("lightblue", "lightgreen", "lightcoral"), ylim = c(0, max(means) + 100), ylab = "Gross Sales", main = "Mean Gross Sales")
Now, maybe someone still believes that self-published books can peak at the same height as books with a publisher, so let’s redraw the bar plot with error bars showing one standard deviation.
standard_errors <- c(sd(self_published$gross.sales), sd(numeric_df$gross.sales), sd(publisher$gross.sales))
barplot(means, names.arg = c("Self Published", "All", "With Publisher"), col = c("lightblue", "lightgreen", "lightcoral"), ylim = c(0, max(means) + max(standard_errors) + 50), ylab = "Gross Sales", main = "Mean Gross Sales with SD Error Line")
arrows(1:3, means, 1:3, means + standard_errors, angle = 90, code = 3, length = 0.1)
Even with this plot it is abundantly clear that both the mean and the peak of self-published book sales are lower than those of books with a publisher.
Now, let’s repeat the same process with sale price. Maybe the actual prices of the books are higher even though total gross sales are lower.
print(summary(self_published$sale.price))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.990 0.990 2.990 3.031 3.990 12.990
print(summary(numeric_df$sale.price))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.990 1.990 3.990 4.871 6.990 33.860
print(summary(publisher$sale.price))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.990 2.240 5.990 5.854 7.990 33.860
Again, let’s run some t-tests to see if we can confirm a statistical difference
t_self_all_p <- t.test(self_published$sale.price, numeric_df$sale.price)
t_self_pub_p <- t.test(self_published$sale.price, publisher$sale.price)
t_all_pub_p <- t.test(numeric_df$sale.price, publisher$sale.price)
print(t_self_all_p)
##
## Welch Two Sample t-test
##
## data: self_published$sale.price and numeric_df$sale.price
## t = -13.108, df = 1304.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.116313 -1.565303
## sample estimates:
## mean of x mean of y
## 3.030511 4.871319
print(t_self_pub_p)
##
## Welch Two Sample t-test
##
## data: self_published$sale.price and publisher$sale.price
## t = -16.422, df = 1036.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.160632 -2.485922
## sample estimates:
## mean of x mean of y
## 3.030511 5.853788
print(t_all_pub_p)
##
## Welch Two Sample t-test
##
## data: numeric_df$sale.price and publisher$sale.price
## t = -5.3629, df = 1393.6, p-value = 9.575e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3418394 -0.6230979
## sample estimates:
## mean of x mean of y
## 4.871319 5.853788
From these three t-tests we can see that all three null hypotheses can be rejected. So, we can say with confidence that the price of a self-published book is less than the price across all books, which in turn is less than the price of books with a publisher. Let’s quickly visualize this.
means_p <- c(mean(self_published$sale.price), mean(numeric_df$sale.price), mean(publisher$sale.price))
standard_errors_p <- c(sd(self_published$sale.price), sd(numeric_df$sale.price), sd(publisher$sale.price))
barplot(means_p, names.arg = c("Self Published", "All", "With Publisher"), col = c("lightblue", "lightgreen", "lightcoral"), ylim = c(0, max(means_p) + max(standard_errors_p) + 50), ylab = "Sales Price", main = "Mean Sales Price with SD Error Line")
arrows(1:3, means_p, 1:3, means_p + standard_errors_p, angle = 90, code = 3, length = 0.1)
From here we can see that, once again, both the mean and the spread of prices are heavily influenced by having a publisher. A publisher nearly doubles the average price, which can sum to quite a lot of income.
And now that we know both mean gross sales and mean sale price are influenced by the presence of a publisher, we can make a scatter plot to visualize this relationship one more time.
plot(self_published$gross.sales, self_published$sale.price, col = "lightblue", pch = 16, main = "Comparing Self Published to Base Dataset", xlab = "Gross Sales", ylab = "Sales Price")
points(publisher$gross.sales, publisher$sale.price, col = "lightcoral", pch = 16)
legend("topright", legend = c("Self Published", "With Publisher"), col = c("lightblue", "lightcoral"), pch = 16)
As we can clearly see, the low-price region is a mix of both groups, but everywhere else is dominated by books with a publisher. Especially once you cross the $6 threshold, only a handful of self-published books appear, while a significant portion of the published books are present.
For some people a successful book isn’t defined by sales, but by public perception (i.e., rating). So here we attempt to figure out how an author can best write a book that is rated highly by the general public. We key in on two specific variables: author rating and genre. We believe being a household name helps a book be rated more highly. We also believe genre will have minimal (if any) effect on ratings: people read what they want to, so genre shouldn’t matter.
Firstly, let’s split our data into three data sets: one for highly rated books (greater than 4.25), one for well received books (3.5 to 4.25), and one for poorly received books (less than 3.5)
high <- subset(numeric_df, numeric_df$Book_average_rating > 4.25)
med <- subset(numeric_df, numeric_df$Book_average_rating <= 4.25 & numeric_df$Book_average_rating >= 3.5)
low <- subset(numeric_df, numeric_df$Book_average_rating < 3.5)
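Before digging in, it may help to confirm how many books fall into each tier; a quick sketch (exact counts will depend on the data):
# Confirm how the rating split partitions the data
print(sapply(list(high = high, med = med, low = low), nrow))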
First, let’s analyze the author rating. As a refresher: 1 is Novice, 2 is Intermediate, 3 is Excellent, and 4 is Famous
observed <- function(data){
corr <- cor(data$Book_average_rating, data$author_rating)
return (corr)
}
high_stats <- observed(high)
med_stats <- observed(med)
low_stats <- observed(low)
print(high_stats)
## [1] 0.6930298
print(med_stats)
## [1] 0.624698
print(low_stats)
## [1] 0.1074156
So within two of our groups there is a high correlation between book rating and author rating. Let’s check the mean author rating in each group to see what authors we are working with
print(mean(high$author_rating))
## [1] 3.110429
print(mean(med$author_rating))
## [1] 2.322248
print(mean(low$author_rating))
## [1] 1.235294
So it seems our highly rated books are written mostly by Excellent and Famous authors, while our middle tier seems to be written mainly by Intermediate and Excellent authors. However, we can’t just assert this without evidence, so let’s get to that. Let’s conduct a permutation test to see if the relationship between author rating and book average rating is significant.
permutation_test <- function(data, n){
observed_stat <- observed(data)
perm_stats <- numeric(n)
for (i in 1:n){
shuffled <- data
shuffled$author_rating <- sample(data$author_rating)
perm_stats[i] <- observed(shuffled)
}
p_value <- mean(perm_stats >= observed_stat)
return (list(
observed = observed_stat,
perm_stats = perm_stats,
p_value = p_value
))
}
high_perm_test <- permutation_test(high, 10000)
med_perm_test <- permutation_test(med, 10000)
low_perm_test <- permutation_test(low, 10000)
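Note that the p-value in permutation_test is one-sided (the fraction of permuted correlations at least as large as the observed one). If a negative correlation were also of interest, a two-sided variant could be computed from the same saved results; a sketch:
# Two-sided variant: fraction of permuted correlations whose magnitude
# meets or exceeds the observed magnitude
p_two_sided <- function(test){
  mean(abs(test$perm_stats) >= abs(test$observed))
}
# e.g. p_two_sided(high_perm_test)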
Now that each permutation test has run for 10,000 iterations, let’s analyze the results
print_results <- function(data){
print("Summary of the observed:")
print(summary(data$observed))
print("Summary of the permuted:")
print(summary(data$perm_stats))
print(paste("P-Value:", data$p_value))
}
print("Highly Rated Books")
## [1] "Highly Rated Books"
print_results(high_perm_test)
## [1] "Summary of the observed:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.693 0.693 0.693 0.693 0.693 0.693
## [1] "Summary of the permuted:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2729878 -0.0534804 0.0002386 -0.0002017 0.0530316 0.2799484
## [1] "P-Value: 0"
print("Medium Rated Books")
## [1] "Medium Rated Books"
print_results(med_perm_test)
## [1] "Summary of the observed:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6247 0.6247 0.6247 0.6247 0.6247 0.6247
## [1] "Summary of the permuted:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.361e-01 -2.277e-02 1.101e-04 3.153e-05 2.299e-02 1.145e-01
## [1] "P-Value: 0"
print("Poorly Rated Books")
## [1] "Poorly Rated Books"
print_results(low_perm_test)
## [1] "Summary of the observed:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1074 0.1074 0.1074 0.1074 0.1074 0.1074
## [1] "Summary of the permuted:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.74184 -0.10406 0.02350 -0.00133 0.13091 0.32896
## [1] "P-Value: 0.3002"
With these results it’s clear that for highly rated books there is a strong linear relationship; we can see that from the observed correlation. Looking at the p-value, we can determine this is in fact statistically significant. The permuted correlations have a mean close to 0, further strengthening the claim that the observed correlation is real, and a strong one at that.
The same can be said for the medium rated books. The observed correlation also implies a strong linear relationship (slightly weaker than for highly rated books). The p-value remains low, indicating this is statistically significant, and the permutation mean further strengthens that point.
However, things change for the poorly rated books. Here our p-value is above 0.3, far too high to reject our null hypothesis (that there is no relationship between book rating and author rating). So any author can have their books received poorly by consumers.
Now, according to some authors, units sold could be the metric of success. For some it’s selling enough to get by; for others it’s selling enough to become a recognizable title. Regardless, let’s analyze units sold to see what information we can garner about what makes a successful book. Sale price, average rating, and ratings count are the columns we are most interested in.
First let’s look at some general information
units_sold <- numeric_df$units.sold
print(summary(units_sold))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 106 550 3915 9682 5319 61560
From here we can see there’s a massive range across the books. Let’s visualize this data quickly
boxplot(units_sold, main = "Boxplot of Units Sold", ylab = "Units Sold",
col = "firebrick", pch = 16, at = 0)
From this visualization we can clearly see that we have two main groups: the bulk of the data, which makes up the box of our graph, and a cluster of extremely well-selling books above it. So why not split these up and see if we can gain any insights? From the look of this graph, 20,000 units sold seems like an appropriate cutoff.
lower_end <- subset(numeric_df, numeric_df$units.sold < 20000)
higher_end <- subset(numeric_df, numeric_df$units.sold >= 20000)
Let’s plot these out to see the difference from our initial boxplot
boxplot(lower_end$units.sold, main = "Fair Selling Units Sold",
ylab = "Units Sold", col = "firebrick", pch = 16, at = 1)
boxplot(higher_end$units.sold, main = "Great Selling Units Sold",
ylab = "Units Sold", col = "midnightblue", pch = 16, at = 1)
Now that we have a basic representation of this, let’s do some analysis using our aforementioned columns
print_corr <- function(data){
print(cor.test(data$units.sold, data$sale.price)$p.value)
print(cor.test(data$units.sold, data$Book_ratings_count)$p.value)
print(cor.test(data$units.sold, data$Book_average_rating)$p.value)
}
print("Lower End")
## [1] "Lower End"
print_corr(lower_end)
## [1] 0.3056349
## [1] 4.437602e-14
## [1] 0.7706901
print("Higher End")
## [1] "Higher End"
print_corr(higher_end)
## [1] 0.6570966
## [1] 0.001818884
## [1] 0.4847973
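For context, print_corr reports only p-values; a variant that also reports the correlation estimates (direction and strength) might look like this sketch:
# Report both the correlation estimate and the p-value for each column
print_corr_full <- function(data){
  for (col in c("sale.price", "Book_ratings_count", "Book_average_rating")){
    ct <- cor.test(data$units.sold, data[[col]])
    print(paste(col, "- r:", round(ct$estimate, 3), "p:", signif(ct$p.value, 3)))
  }
}
# e.g. print_corr_full(lower_end); print_corr_full(higher_end)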
Interestingly enough, for both the lower and higher end, the relationship with ratings count is the strongest. These are all p-values, and the small ones for ratings count indicate we can reject the null hypothesis that there is no relationship between units sold and ratings count. Let’s plot these out and see what we have
plot(lower_end$units.sold, lower_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
col = "firebrick", main = "Lower End: Units Sold vs Book Rating Count")
plot(higher_end$units.sold, higher_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
col = "midnightblue", main = "Higher End: Units Sold vs Book Rating Count")
With our lower end, we seem to have four clusters:
1. 0 - 100 units sold and 50,000 - 80,000 ratings
2. 200 - 1,000 units sold and 100,000 - 170,000 ratings
3. 2,900 - 3,000 units sold and 50,000 - 100,000 ratings
4. 4,200 - 4,500 units sold and 50,000 - 100,000 ratings
Now, it obviously doesn’t make sense to have more ratings than units sold, so we have to assume that units sold is reported in some multiple. However, our data set doesn’t include that information, so this is the best we can do with what is provided.
Looking at our higher end, things are much more sporadic, with no true clusters.
So, let’s fit both groups with linear regressions and see if we can gain anything from that information
plot(lower_end$units.sold, lower_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
col = "firebrick", main = "Lower End: Units Sold vs Book Rating Count")
lower_model <- lm(lower_end$Book_ratings_count ~ lower_end$units.sold, data = lower_end)
abline(lower_model, col = "maroon1")
plot(higher_end$units.sold, higher_end$Book_ratings_count, xlab = "Units Sold", ylab = "Book Rating Count",
col = "midnightblue", main = "Higher End: Units Sold vs Book Rating Count")
higher_model <- lm(higher_end$Book_ratings_count ~ higher_end$units.sold, data = higher_end)
abline(higher_model, col = "steelblue2")
Now, before we go further into this analysis, let’s examine the residuals to see how good a fit everything is
residuals_l <- residuals(lower_model)
fitted_values_l <- fitted(lower_model)
plot(fitted_values_l, residuals_l, main = "Lower End: Residuals vs. Fitted Values",
xlab = "Fitted Values", ylab = "Residuals", col="firebrick")
abline(h = 0, col = "magenta3", lty = 2)
residuals_h <- residuals(higher_model)
fitted_values_h <- fitted(higher_model)
plot(fitted_values_h, residuals_h, main = "Higher End: Residuals vs. Fitted Values",
xlab = "Fitted Values", ylab = "Residuals", col="midnightblue")
abline(h = 0, col = "royalblue", lty = 2)
Based on these residuals we can surmise two things:
1. The lower-end regression isn’t very helpful. The residuals show the same pattern as the scatter (just flipped), and since they don’t appear random we can say the relationship doesn’t appear to be linear.
2. The higher-end regression can be used for further analysis. The residuals show some clustering, but nothing serious enough to say there is no linear relationship.
So for now, let’s continue analyzing the high end relationship. First let’s look at the model summary.
print(summary(higher_model))
##
## Call:
## lm(formula = higher_end$Book_ratings_count ~ higher_end$units.sold,
## data = higher_end)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45131 -8955 -1564 8557 33373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.494e+04 4.588e+03 14.15 < 2e-16 ***
## higher_end$units.sold 3.531e-01 1.117e-01 3.16 0.00182 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13510 on 202 degrees of freedom
## Multiple R-squared: 0.04711, Adjusted R-squared: 0.04239
## F-statistic: 9.987 on 1 and 202 DF, p-value: 0.001819
From this we can gather that there is in fact a positive relationship between units sold and ratings count. The intercept is estimated at 64,940 (an extrapolation, since this subset only contains books selling at least 20,000 copies), and for every additional unit sold the ratings count increases by about 0.3531; so a book selling 20,000 copies is predicted to have approximately 64,940 + 0.3531 × 20,000 ≈ 72,000 ratings. The low R² implies that multiple factors drive ratings count, but units sold is definitely one of them.
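To turn the coefficients into predictions directly, the model can be refit with bare column names in the formula (the fit is identical; the original higher_end$ form just makes predict() awkward). A sketch, using the hypothetical 20,000-copy book from the text:
# Equivalent refit via the data argument so predict() can take new data
higher_model2 <- lm(Book_ratings_count ~ units.sold, data = higher_end)
print(predict(higher_model2, newdata = data.frame(units.sold = 20000)))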
Now, let’s conduct a hypothesis test to confirm there is a definitive relationship between units sold and ratings count
coef_estimate <- coef(summary(higher_model))[2, 1]
std_error <- coef(summary(higher_model))[2, 2]
t_stat <- coef_estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df.residual(higher_model))
print(p_value)
## [1] 0.001818884
With this p-value, we can confidently reject the null hypothesis and say that there is a significant relationship between units sold and ratings count for books selling over 20,000 copies.
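Equivalently, a 95% confidence interval for the slope should exclude zero; a one-line check:
# Slope interval excluding zero is consistent with the p-value above
print(confint(higher_model, level = 0.95))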
First, let’s go over the self-published data. We hypothesized that having a publisher greatly increases the success of a given book, and we found a definitive relationship between having a publisher and greater success: with a publisher, you sell more copies and at a higher price than without. We showed this through t-tests, and can say our hypothesis was correct.
Second, we looked at the top rated books. We broke the books into three categories based on rating and analyzed each tier individually. Our hypothesis was that each group would be dominated by its respective author-rating tier: highly rated books written mostly by Excellent and Famous authors, well rated books by Intermediate and Excellent authors, and poorly rated books by Novice and Intermediate authors. This proved two-thirds true. Through the permutation tests we showed the highly rated and well rated books both followed the predicted pattern. However, poorly rated books did not have a significant correlation with the author’s rating.
Lastly, we looked at units sold. We hypothesized units sold would correlate with ratings count, average rating, and sale price. After splitting the data into “best-seller” and other categories, both groups showed that the only plausible correlation was with ratings count. After fitting linear regressions to both groups, the only relationship that held up was the best-seller vs. ratings count one, and further testing confirmed that this was a true correlation.
So, let’s review our initial questions. First, what can an author do to maximize their odds of a best seller? And second, what can an author do to jump-start their career?
What we saw was that if you want a best seller, you need a recognizable name. This is pure conjecture, but people are probably predisposed to assume authors they have heard of before are great at writing, and they rate those books higher because of it. And as we saw later, ratings count and units sold have a positive correlation, so if your book is gaining more ratings we can expect it to affect units sold positively. So, the author needs to build a name, and then release a book which gets people talking. That is the path to a best seller.
Second, what about a struggling new author? For a new author there is far less information, but one thing that appears to be a necessity is finding a publisher. Publishers increase gross sales and sale price, which allows an author to continue writing. This is more conjecture, but if an author has a publisher and can generate enough money to keep writing, they can eventually build a large enough name to have that best seller.
In the future, some things could improve the results of this analysis. Firstly, if the genres were broken up further, it would allow more nuanced analysis: you could look at a variety of genres and see how they perform against one another, something this data set lacks. Besides that, the data set could simply use more data. We were able to make some subsets and do some interesting analysis, but with more data you could break things down further and be more precise.
“Books Sales and Ratings.” Kaggle, www.kaggle.com/datasets/thedevastator/books-sales-and-ratings. Accessed 18 Dec. 2023.