A Nonparametric Curve for Bestselling Books

Data

The amazonBest data set was found on kaggle.com and was originally recorded from amazon.com. The data are a collection of 550 bestselling books on the Amazon website. There are 550 observations and 7 variables in total.

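library(tidyverse)  # provides read_csv() and ggplot()
library(mblm)       # provides mblm() for median-based regression
amazonBest <- read_csv("~/Documents/Math 248 2022/Amazon Bestsellers - bestsellers with categories.csv")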
head(amazonBest)
## # A tibble: 6 × 7
##   Name                                  Author User …¹ Reviews Price  Year Genre
##   <chr>                                 <chr>    <dbl>   <dbl> <dbl> <dbl> <chr>
## 1 10-Day Green Smoothie Cleanse         JJ Sm…     4.7   17350     8  2016 Non …
## 2 11/22/63: A Novel                     Steph…     4.6    2052    22  2011 Fict…
## 3 12 Rules for Life: An Antidote to Ch… Jorda…     4.7   18979    15  2018 Non …
## 4 1984 (Signet Classics)                Georg…     4.7   21424     6  2017 Fict…
## 5 5,000 Awesome Facts (About Everythin… Natio…     4.8    7665    12  2019 Non …
## 6 A Dance with Dragons (A Song of Ice … Georg…     4.4   12643    11  2011 Fict…
## # … with abbreviated variable name ¹`User Rating`

The variables are: Name, Author, User Rating, Reviews, Price, Year, and Genre.

Name refers to the title of the book on the list.

Author refers to the name of the book's author.

User Rating is a quantitative variable that refers to the book's average rating from Amazon users, out of 5 stars.

Reviews is a quantitative variable that refers to the number of reviews of the book on the Amazon website.

Price is a quantitative variable that refers to the price of the book in US dollars.

Year is a quantitative variable that refers to the year in which the book appeared on the bestseller list. It is not the original publication year; for example, 1984 (Signet Classics) is listed under 2017.

Genre is a categorical variable that refers to the genre of the book. A book can only be ‘Fiction’ or ‘Non Fiction’.

The response variable of the model being created is Price, and the explanatory variable is User Rating.

Introduction

Can we create a model to predict the Price of a book on the Amazon bestsellers list from its User Rating?

To answer this question as well as we can, we first look at the distribution of each variable in use.

ggplot(data = amazonBest, aes(x = `Price`)) +
  geom_histogram(bins=30)

The distribution of Price is very skewed to the right and looks like it is centered around 7 or 8.

User Rating is renamed so that it has an underscore instead of a space, which lets it be used in code and formulas without backticks.

names(amazonBest)[names(amazonBest) == "User Rating"] <- "User_Rating"

ggplot(data = amazonBest, aes(x = User_Rating)) + 
    geom_bar()

The distribution of User Rating is very skewed to the left and looks like it is centered around 4.8.
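The visual impression of skewness can also be checked numerically, since in a right-skewed variable the mean is typically pulled above the median, and in a left-skewed variable below it. The sketch below illustrates this on a small made-up vector; `center_summary` and `toy_prices` are hypothetical names and values, not part of the original analysis.

```r
# Hypothetical helper: report the median and mean of a numeric vector
center_summary <- function(x) {
  c(median = median(x), mean = mean(x))
}

# Made-up, right-skewed values for illustration only (not from amazonBest)
toy_prices <- c(5, 6, 7, 8, 8, 9, 11, 15, 25, 105)

toy_summary <- center_summary(toy_prices)
print(toy_summary)

# A mean well above the median is consistent with right skew
toy_summary["mean"] > toy_summary["median"]
```

Applied to the real data, the same call would be center_summary(amazonBest$Price) or center_summary(amazonBest$User_Rating).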

Methodology

Because both variables are skewed, nonparametric regression is a good choice of model for this research question.

Nonparametric regression is often used when the conditions for parametric regression are not fully met. The method used here, a median-based linear model, estimates the slope from the median of the slopes between pairs of data points rather than from means, which makes it resistant to outliers.
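To make the idea of a median-based slope concrete, here is a minimal base R sketch of the Theil-Sen estimator, which takes the median of the slopes between all pairs of points. It is an illustration on made-up data, not the exact computation inside mblm(): by default mblm() uses Siegel's repeated-medians variant (repeated = TRUE). The vectors `x` and `y` below are hypothetical.

```r
# Toy data (hypothetical): five points on the line y = 2x plus one outlier
x <- 1:6
y <- c(2, 4, 6, 8, 10, 60)

# All pairs of indices i < j (a 2 x 15 matrix)
pairs <- combn(length(x), 2)

# Slope of the line through each pair of points
pair_slopes <- (y[pairs[2, ]] - y[pairs[1, ]]) / (x[pairs[2, ]] - x[pairs[1, ]])

# Theil-Sen estimates: median pairwise slope, then median intercept
ts_slope     <- median(pair_slopes)
ts_intercept <- median(y - ts_slope * x)

# Least-squares slope for comparison
ls_slope <- unname(coef(lm(y ~ x))[2])

print(c(theil_sen = ts_slope, least_squares = ls_slope))
```

Here the single outlier drags the least-squares slope well above 2, while the median of the pairwise slopes stays at 2. This simple version assumes no two points share the same x value; tied x values, which are common in User_Rating, would need to be dropped from the pairs first.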

Because the response and explanatory variables are both skewed, the normality condition for parametric regression is unlikely to be met. Fortunately, normality is not a condition for nonparametric regression.

The theoretical model for the regression being fit is: \[Price = \beta_0 + \beta_1 UserRating + \epsilon\]

Conditions for Nonparametric Regression

Median error of the calculated nonparametric model is zero.

Continuous distribution of the variables (not necessarily normal).

Errors in the model are independent of each other.

Results

Fitting the Nonparametric Model

NonParmodel <- mblm(Price ~ User_Rating,
                    dataframe = amazonBest)

summary(NonParmodel)
## 
## Call:
## mblm(formula = Price ~ User_Rating, dataframe = amazonBest)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##    -10     -2      2      7     95 
## 
## Coefficients:
##             Estimate    MAD V value Pr(>|V|)    
## (Intercept)    55.00  62.27  122188   <2e-16 ***
## User_Rating   -10.00  14.83   33555   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.61 on 548 degrees of freedom

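The MAD, or median absolute deviation, reported for User Rating is 14.83. It measures the variability of the slope estimates that the model takes the median of, and plays a role similar to a standard error. Here it is larger in magnitude than the slope estimate of -10 itself, so the slope is not estimated very precisely. The V value, 33555, is a Wilcoxon signed-rank test statistic, a rank-based analogue of the t value. Its size depends heavily on the sample size, so a large value does not by itself mean the model fits well. The p-value is less than 2e-16, which is strong evidence that the slope differs from zero, although it does not measure how accurately Price can be predicted from User Rating.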
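The MAD and V columns can be illustrated with base R. The sketch below assumes, as the column labels suggest, that they correspond to R's mad() (the median absolute deviation, scaled by 1.4826 by default) and to the Wilcoxon signed-rank statistic from wilcox.test(), both applied to a set of candidate slopes. The vector `toy_slopes` is made up for illustration.

```r
# Made-up candidate slopes for illustration only
toy_slopes <- c(-3, -1, 2, -4, -5)

# Median absolute deviation: median distance from the median,
# multiplied by the default consistency constant 1.4826
mad_value <- mad(toy_slopes)

# Wilcoxon signed-rank test of whether the slopes are centered at zero;
# V is the sum of the ranks of |slope| over the positive slopes
signed_rank <- wilcox.test(toy_slopes)
v_value     <- unname(signed_rank$statistic)

print(mad_value)
print(v_value)
print(signed_rank$p.value)
```

Under that assumption, the reported MAD of 14.83 would correspond to an unscaled median absolute deviation of about 10, since 10 × 1.4826 ≈ 14.83.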

One bad sign, though, is that the median of the residuals is 2. According to the conditions of this model, the median error should be zero, so a median residual of 2 suggests that this condition is not fully met.
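The median-residual condition can be checked directly from the fitted line. As a minimal sketch, the code below uses only the six rows shown in the head() output above together with the estimated coefficients 55 and -10, so it illustrates the calculation and is not expected to reproduce the full-data median of 2. The names `rating`, `price`, and `resid_six` are introduced here for illustration.

```r
# User Rating and Price for the six books shown in the head() output
rating <- c(4.7, 4.6, 4.7, 4.7, 4.8, 4.4)
price  <- c(8, 22, 15, 6, 12, 11)

# Fitted prices from the estimated intercept (55) and slope (-10)
fitted_price <- 55 - 10 * rating

# Residual = observed price minus fitted price
resid_six <- price - fitted_price

# The condition asks for this median to be close to zero
median(resid_six)
```

A similar check on the full data would likely be median(residuals(NonParmodel)), assuming the fitted object stores its residuals the way lm objects do.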

The fitted model is \[\widehat{Price} = 55 -10UserRating\]
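As a worked example of using the fitted equation, the sketch below wraps it in a small function and evaluates it at a few ratings. The function name `predict_price` is hypothetical, and the coefficients are the rounded estimates from the summary output.

```r
# Hypothetical helper implementing Price-hat = 55 - 10 * UserRating
predict_price <- function(user_rating) {
  55 - 10 * user_rating
}

# Predicted prices (US dollars) for ratings of 4.0, 4.5 and 4.9
ratings   <- c(4.0, 4.5, 4.9)
predicted <- predict_price(ratings)
print(data.frame(User_Rating = ratings, Predicted_Price = predicted))
```

Each additional 0.1 in User Rating lowers the predicted price by 1 dollar, so a book rated 4.0 is predicted to cost 15 dollars and a book rated 4.5 is predicted to cost 10 dollars.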

Visualization of the Model

plot(Price ~ User_Rating,
     data = amazonBest,
     pch  = 16)

abline(NonParmodel,
       col="blue",
       lwd=2)


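# Pull the key estimates out of the coefficient table for later reporting
coefs     <- summary(NonParmodel)$coefficients
Pvalue    <- as.numeric(coefs[2, 4])  # p-value for the User_Rating slope
Intercept <- as.numeric(coefs[1, 1])  # estimated intercept
Slope     <- as.numeric(coefs[2, 1])  # estimated slope
R2        <- NULL                     # no R-squared appears in the summary above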

The line above looks like a reasonable fit for the data. This is a hard data set to build a predictive model for because it is so skewed and has a lot of variation. The line seems to pass through the middle of the prices at each user rating and is less affected by outliers because it is calculated from medians rather than means.

Conclusion

An interesting result of this model is that Price is negatively related to User Rating. This goes against my intuition, because I would have assumed that more expensive books are better liked. The model shows the opposite. One possible explanation is that more expensive books sometimes leave readers feeling that they could have gotten a cheaper book for the same level of satisfaction.

A strength of the model is that outliers have little effect on the fitted line, and the p-value for the explanatory variable User Rating gives strong evidence of a relationship with the response variable Price. A weakness of the model is that the median residual is 2 when it is supposed to be zero. In addition, the MAD of 14.83 is large relative to the slope of -10, so the slope is not estimated precisely.