Response Variable

We use “Reviews” as the response variable. It is a count, which we treat as continuous for this analysis. The number of reviews a book has received reflects its popularity, and possibly its quality, which makes it relevant to both buyers and sellers.

Explanatory Variable

We consider “Genre” as the explanatory variable. Different genres attract different audiences, so the genre of a book may influence the number of reviews it receives.

ANOVA Test for Genre Influence

We begin by testing whether the mean number of reviews differs across genres using a one-way ANOVA.

# Load the data
bestsellers <- read.csv("bestsellers.csv")

# Consolidate Genre categories if needed
# Genre has only two levels here (Df = 1 in the ANOVA table), so no consolidation is required

# Perform ANOVA
anova_results <- aov(Reviews ~ Genre, data = bestsellers)
summary(anova_results)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Genre         1 5.926e+09 5.926e+09   46.64 2.27e-11 ***
## Residuals   548 6.963e+10 1.271e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

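The ANOVA test output indicates the following:

Degrees of freedom: Genre has 1 degree of freedom, so the data contain two genre categories and the test compares two group means. The residual degrees of freedom are 548.

F value: 46.64, the ratio of the between-genre mean square (5.926e+09) to the residual mean square (1.271e+08).

p-value: 2.27e-11, far below the conventional 0.05 threshold.

Interpretation:

We reject the null hypothesis that the mean number of reviews is the same in both genres. The ANOVA table alone does not show which genre has the higher mean; that requires comparing the group means directly.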

This result matters to publishers, authors, and readers because it shows that genre is associated with the popularity of a book, as measured by the number of reviews. Publishers may take it into account when deciding which genres to focus on or promote, and authors when choosing the genre of their next book. Readers may want to know how genre relates to the popularity of the books they are considering. Because the data are observational, the result shows an association and does not prove that changing genre would change the number of reviews.
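The F test indicates that the genre difference is unlikely to be due to chance, but not how large it is. One common summary of effect size is eta-squared, the share of the total sum of squares attributed to Genre. The following is a minimal sketch, assuming simulated data: the `demo` data frame, the labels `GenreA` and `GenreB`, and the log-normal parameters are hypothetical stand-ins, not the real bestsellers data.

```r
# Simulated stand-in for the bestsellers data (hypothetical values only)
set.seed(42)
demo <- data.frame(
  Genre   = rep(c("GenreA", "GenreB"), each = 275),
  Reviews = round(c(rlnorm(275, meanlog = 9.3, sdlog = 0.9),
                    rlnorm(275, meanlog = 8.8, sdlog = 0.9)))
)

fit <- aov(Reviews ~ Genre, data = demo)
tab <- summary(fit)[[1]]   # ANOVA table as a data frame

# Eta-squared: Genre sum of squares divided by the total sum of squares
ss     <- tab[["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
eta_sq

# The same calculation using the sums of squares printed in the ANOVA table
reported_eta_sq <- 5.926e9 / (5.926e9 + 6.963e10)
round(reported_eta_sq, 3)
```

Applied to the sums of squares reported in the table, this gives roughly 0.078, meaning that Genre accounts for about 8% of the variation in Reviews. The effect is clearly detectable, yet most of the variation lies within genres.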

Next, let’s consider a continuous variable that might influence the number of reviews a book receives.

Continuous Explanatory Variable: We use “Price,” since it is plausible that the price of a book influences its popularity and thus the number of reviews it receives.

Now, let’s build a linear regression model:

# Build linear regression model
lm_model <- lm(Reviews ~ Price, data = bestsellers)

# Evaluate the model
summary(lm_model)
## 
## Call:
## lm(formula = Reviews ~ Price, data = bestsellers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12805  -7575  -3477   5183  76112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13500.82     780.98  17.287   <2e-16 ***
## Price        -118.13      45.94  -2.571   0.0104 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11670 on 548 degrees of freedom
## Multiple R-squared:  0.01192,    Adjusted R-squared:  0.01012 
## F-statistic: 6.611 on 1 and 548 DF,  p-value: 0.0104

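The linear regression model output provides the following information:

Intercept: 13500.82, the fitted mean number of reviews at a price of $0. This is an extrapolation and mainly serves to anchor the line.

Price coefficient: -118.13 (standard error 45.94, p = 0.0104). Each additional dollar in price is associated with about 118 fewer reviews on average.

R-squared: 0.01192, so Price explains about 1.2% of the variation in Reviews. The residual standard error is 11670 on 548 degrees of freedom.

Interpretation:

The negative relationship between Price and Reviews is statistically significant at the 5% level, but it is weak: almost all of the variation in review counts is left unexplained. The residuals are also strongly right-skewed (median -3477, maximum 76112), so the standard errors and p-value should be read with some caution. As with Genre, this is an association in observational data, not evidence that lowering the price would raise the number of reviews.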
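To make the size and uncertainty of the Price effect concrete, the sketch below works only from the coefficients printed in the summary output, so it runs without the data file. The prices 5, 10, and 20 are illustrative inputs chosen here, not values taken from the dataset.

```r
# Coefficients copied from the regression summary output
intercept <- 13500.82
slope     <- -118.13
slope_se  <- 45.94
df_resid  <- 548

# Approximate 95% confidence interval for the Price slope
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
round(slope_ci, 1)

# Fitted mean number of reviews at a few illustrative prices (hypothetical inputs)
prices         <- c(5, 10, 20)
fitted_reviews <- intercept + slope * prices
data.frame(Price = prices, FittedReviews = fitted_reviews)
```

The interval runs from roughly -208 to -28 reviews per dollar: it excludes zero but is wide. With the fitted model object available, `confint(lm_model)` and `predict(lm_model, newdata = data.frame(Price = c(5, 10, 20)))` should return the same quantities without copying numbers by hand.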

library(ggplot2)
# Plotting the data with a regression line
ggplot(bestsellers, aes(x = Price, y = Reviews)) +
  geom_point(alpha = 0.5, color = "blue") + # Adds scatter plot points
  geom_smooth(method = "lm", se = TRUE, color = "red") + # Adds a linear regression line with confidence interval
  labs(title = "Relationship Between Book Price and Number of Reviews",
       x = "Price ($)",
       y = "Number of Reviews") +
  theme_minimal() # Adds a minimal theme for visual clarity
## `geom_smooth()` using formula = 'y ~ x'

Conclusion and Recommendations

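Genre shows a clear association with the number of reviews (F = 46.64, p = 2.27e-11), while Price shows a statistically significant but weak negative association (slope -118.13, R-squared about 0.012). Stakeholders can use these relationships to inform decisions about which genres to promote and how to price books, keeping in mind that Price alone explains very little of a book's popularity.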