Week 8 | Data Dive — Regression Modeling

Response Variable

The response variable is “Reviews,” the number of reviews each book has received. It is a count that we treat as a continuous variable. It reflects a book's popularity and, potentially, its quality, which makes it relevant to both buyers and sellers.

Explanatory Variable

We use “Genre” as the categorical explanatory variable. A book's genre may affect the number of reviews it receives, since different genres attract different audiences.

ANOVA Test for Genre Influence

We begin by testing whether the mean number of reviews differs across genres using a one-way ANOVA.

# Load the data
bestsellers <- read.csv("bestsellers.csv")

# Consolidate Genre categories if needed
# Not needed here: the ANOVA output below reports 1 degree of freedom for Genre, so it has only two levels

# Perform ANOVA
anova_results <- aov(Reviews ~ Genre, data = bestsellers)
summary(anova_results)

##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Genre         1 5.926e+09 5.926e+09   46.64 2.27e-11 ***
## Residuals   548 6.963e+10 1.271e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test output indicates the following:

Df: Degrees of freedom, the number of independent pieces of information used to estimate a quantity. The Genre factor has 1 degree of freedom, which means Genre has two levels (number of groups minus one). The residuals have 548 degrees of freedom, which corresponds to 550 observations.
Sum Sq: The sum of squares, which measures variability in the data. For the Genre factor (variation between genre means), the sum of squares is 5.926e+09; for the residuals (variation within genres), it is 6.963e+10.
Mean Sq: The mean square, obtained by dividing the sum of squares by its degrees of freedom. For the Genre factor, the mean square is 5.926e+09, and for the residuals, it is 1.271e+08.
F value: The F-statistic, calculated as the ratio of the mean square for the factor to the mean square for the residuals. It tests whether the group means differ by more than within-group variation would explain. Here, the F value is 46.64.
Pr(>F): The p-value associated with the F-statistic, which is the probability of observing an F value at least as large as this one if the null hypothesis (no difference between group means) were true. A lower p-value is stronger evidence against the null hypothesis. In this case, the p-value is 2.27e-11, which is very small.
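The relationships among these columns can be checked by hand. The sketch below re-derives the mean squares, the F value, and the p-value from the rounded figures printed in the table. The variable names are illustrative and not part of the original analysis, and small differences from the printed table come from rounding.

```r
# Values copied from the printed ANOVA table (rounded as displayed)
ss_genre <- 5.926e9   # Sum Sq for Genre
ss_resid <- 6.963e10  # Sum Sq for Residuals
df_genre <- 1         # Df for Genre
df_resid <- 548       # Df for Residuals

# Mean square = sum of squares / degrees of freedom
ms_genre <- ss_genre / df_genre
ms_resid <- ss_resid / df_resid

# F statistic = ratio of the two mean squares
f_value <- ms_genre / ms_resid

# p-value = upper-tail probability of the F distribution
p_value <- pf(f_value, df_genre, df_resid, lower.tail = FALSE)

print(signif(ms_resid, 4))
print(round(f_value, 2))
print(signif(p_value, 3))
```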

Interpretation:

The p-value (2.27e-11) is far below the conventional significance level of 0.05, which is strong evidence against the null hypothesis.
Therefore, we reject the null hypothesis and conclude that the mean number of reviews differs between genres.
Because these are observational data, the result shows that Genre is associated with the number of reviews a book receives; it does not by itself establish that genre causes the difference.
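Statistical significance does not say how large the difference is. One common effect-size measure for ANOVA is eta-squared, the share of the total sum of squares attributed to the factor. The following small sketch computes it from the rounded table values; it is added for illustration and is not a step in the original analysis.

```r
# Sums of squares copied from the printed ANOVA table (rounded as displayed)
ss_genre <- 5.926e9   # between-genre variation
ss_resid <- 6.963e10  # within-genre (residual) variation

# Eta-squared = factor sum of squares / total sum of squares
eta_sq <- ss_genre / (ss_genre + ss_resid)

print(round(eta_sq, 3))
```

With these figures, Genre accounts for roughly 8% of the variance in Reviews, so most of the variation lies within genres rather than between them.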

This result matters to publishers, authors, and readers because it suggests that a book's genre is related to its popularity, as measured by the number of reviews. Publishers may weigh this when deciding which genres to focus on or promote, and authors may weigh it when choosing the genre of their next book. Readers may also want to know how review counts tend to differ by genre when comparing books.

Next, let’s consider a continuous variable that might influence the number of reviews a book receives.

Continuous Explanatory Variable: We use “Price,” since a book's price could plausibly affect its popularity and therefore the number of reviews it receives.

Now, let’s build a linear regression model:

# Build linear regression model
lm_model <- lm(Reviews ~ Price, data = bestsellers)

# Evaluate the model
summary(lm_model)

## 
## Call:
## lm(formula = Reviews ~ Price, data = bestsellers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12805  -7575  -3477   5183  76112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13500.82     780.98  17.287   <2e-16 ***
## Price        -118.13      45.94  -2.571   0.0104 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11670 on 548 degrees of freedom
## Multiple R-squared:  0.01192,    Adjusted R-squared:  0.01012 
## F-statistic: 6.611 on 1 and 548 DF,  p-value: 0.0104

The linear regression model output provides the following information:

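Residuals: These are the differences between the observed values of the dependent variable (Reviews) and the values predicted by the model. They show how well the model fits the data. Here the median is well below zero and the largest residual (76112) is about six times the magnitude of the smallest (-12805), so the residuals are strongly right-skewed, which a straight-line model with symmetric errors does not capture well.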
- Min: The minimum residual value is -12805.
- 1Q: The first quartile of the residuals is -7575.
- Median: The median of the residuals is -3477.
- 3Q: The third quartile of the residuals is 5183.
- Max: The maximum residual value is 76112.
Coefficients: These represent the estimated coefficients of the linear regression model.
- Intercept: The intercept estimate is 13500.82. It represents the predicted number of reviews when the price is zero, which is an extrapolation if few books in the data are priced near zero.
- Price: The coefficient estimate for Price is -118.13. It indicates the average change in the number of reviews for a one-dollar increase in price.
Significance codes: These indicate the level of significance of each coefficient.
- The intercept is significant at the 0.001 level (p < 2e-16, marked ***), and the coefficient for Price is significant at the 0.05 level (p = 0.0104, marked *).
Residual standard error: This estimates the standard deviation of the residuals and measures the typical size of the model’s prediction errors. Here it is 11670 reviews on 548 degrees of freedom.
R-squared: This indicates the proportion of variance in the dependent variable (Reviews) that is explained by the independent variable (Price). The multiple R-squared is 0.01192 and the adjusted R-squared is 0.01012, so price explains only about 1% of the variance in the number of reviews.
F-statistic: This tests the overall significance of the model. The F-statistic is 6.611 on 1 and 548 degrees of freedom with a p-value of 0.0104, so the model is statistically significant at the 0.05 level. With a single predictor, this test is equivalent to the t-test on the Price coefficient, which is why the two p-values match.
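Several of these quantities are tied together by simple formulas. As a rough check, the sketch below recomputes the t value, its p-value, the F-statistic, and the adjusted R-squared from the rounded numbers printed in the summary. The variable names are illustrative, and the results match the printed output only up to rounding.

```r
# Values copied from the printed summary(lm_model) output (rounded as displayed)
estimate  <- -118.13   # Price coefficient
std_error <- 45.94     # standard error of the Price coefficient
r_squared <- 0.01192   # Multiple R-squared
df_resid  <- 548       # residual degrees of freedom
n_pred    <- 1         # number of predictors in the model

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df_resid)

# With a single predictor, the overall F statistic equals t squared
f_value <- t_value^2

# Adjusted R-squared penalizes R-squared for the number of predictors
n <- df_resid + n_pred + 1   # number of observations
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

print(round(t_value, 3))
print(round(p_value, 4))
print(round(f_value, 2))
print(round(adj_r_squared, 5))
```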

Interpretation:

The negative coefficient for Price (-118.13) suggests that, on average, each one-dollar increase in price is associated with about 118 fewer reviews.
The model explains a very small proportion of the variance in the number of reviews (multiple R-squared of 0.01192, adjusted R-squared of 0.01012), so Price alone is a weak predictor of the number of reviews.
However, the relationship is statistically significant at the 0.05 level (p-value of 0.0104), so the data do show a negative association between Price and the number of reviews, even though it is weak.
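To make the size of this effect concrete, the fitted line can be evaluated at a few prices, and the uncertainty in the slope can be expressed as a confidence interval. The sketch below does both using the rounded coefficients from the printed summary. The two prices are hypothetical values chosen only for illustration, and the variable names are not part of the original analysis.

```r
# Coefficients copied from the printed summary(lm_model) output
intercept <- 13500.82
slope     <- -118.13
slope_se  <- 45.94
df_resid  <- 548

# Hypothetical prices in dollars, chosen only for illustration
prices <- c(10, 20)

# Predicted reviews from the fitted line: intercept + slope * price
predicted_reviews <- intercept + slope * prices
print(predicted_reviews)

# Approximate 95% confidence interval for the slope
t_crit   <- qt(0.975, df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 1))
```

The interval excludes zero, which is consistent with the p-value of 0.0104, but it is wide relative to the estimate, so the size of the price effect is not pinned down precisely.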

library(ggplot2)
# Plotting the data with a regression line
ggplot(bestsellers, aes(x = Price, y = Reviews)) +
  geom_point(alpha = 0.5, color = "blue") + # Adds scatter plot points
  geom_smooth(method = "lm", se = TRUE, color = "red") + # Adds a linear regression line with confidence interval
  labs(title = "Relationship Between Book Price and Number of Reviews",
       x = "Price ($)",
       y = "Number of Reviews") +
  theme_minimal() # Adds a minimal theme for visual clarity

## `geom_smooth()` using formula = 'y ~ x'

Conclusion and Recommendations

Genre Influence: Publishers and authors should consider genre carefully, as the mean number of reviews differs significantly between genres.
Price Influence: While book price shows a statistically significant negative relationship with reviews, it explains only about 1% of their variance, so its practical impact appears minor. Publishers and authors who want to maximize reviews should look to other factors, such as genre, rather than price alone.

By understanding these relationships, stakeholders can make informed decisions to enhance the popularity and success of their books.

Shresta

2024-04-06
