In this analysis, we extend the simple linear regression model built last week by adding one or more predictor variables and evaluating the resulting model's performance.
Let’s start by loading the dataset and reviewing its structure.
# Load the data
bestsellers <- read.csv("bestsellers.csv")
# Display structure of the data
str(bestsellers)
## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
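Since Reviews is the response variable we will be modeling, it is also worth a quick look at its distribution up front; a heavily skewed response often foreshadows the diagnostic issues examined later. A minimal check in base R:
# Five-number summary (plus mean) of the response variable
summary(bestsellers$Reviews)
# A histogram makes any skew easier to see
hist(bestsellers$Reviews, breaks = 30, main = "Distribution of Reviews", xlab = "Reviews")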
We will add one or more variables to our regression model. Let’s consider including the “User.Rating” and “Year” variables alongside “Price.”
Including the user rating can be insightful as it reflects the perceived quality of the book. Higher user ratings may lead to more reviews, assuming that satisfied readers are more likely to leave reviews.
The year of publication may also influence the number of reviews. Newer books might attract more attention and thus receive more reviews compared to older ones.
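As a quick, informal check on these two hypotheses, we can correlate Reviews with each candidate predictor before fitting anything:
# Correlation of the response with each candidate predictor
cor(bestsellers$Reviews, bestsellers[, c("Price", "User.Rating", "Year")])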
Before adding these variables, let’s check for multicollinearity.
# Calculate correlation matrix
correlation_matrix <- cor(bestsellers[c("Price", "User.Rating", "Year")])
# Display correlation matrix
correlation_matrix
##                  Price User.Rating       Year
## Price        1.0000000  -0.1330863 -0.1539786
## User.Rating -0.1330863   1.0000000  0.2423830
## Year        -0.1539786   0.2423830  1.0000000
Price and User.Rating:
The correlation coefficient is approximately -0.133, indicating a weak negative correlation: on average, higher-priced books tend to have slightly lower user ratings, but the relationship is not strong.
Price and Year:
The correlation coefficient is approximately -0.154, indicating a weak negative correlation: on average, newer books tend to be priced slightly lower.
User.Rating and Year:
The correlation coefficient is approximately 0.242, indicating a weak positive correlation: on average, books published in more recent years tend to have slightly higher user ratings.
Based on these correlations, we do not observe multicollinearity issues, as all correlation coefficients are relatively low (less than 0.5). Therefore, it seems reasonable to include all three variables (Price, User.Rating, and Year) in our regression model.
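For a more formal check than eyeballing pairwise correlations, variance inflation factors (VIFs) quantify how much each coefficient's variance is inflated by correlation among the predictors; values near 1 indicate little multicollinearity, while values above 5 or 10 are commonly treated as problematic. A sketch using the car package (assuming it is installed):
# Variance inflation factors for the main-effects model
# install.packages("car")  # uncomment if the package is not installed
library(car)
vif(lm(Reviews ~ Price + User.Rating + Year, data = bestsellers))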
To explore potential interactions, we’ll create an interaction term between “Price” and “User.Rating.” This interaction may capture the combined effect of price and perceived quality on the number of reviews.
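As an aside on R's formula syntax, the ':' operator adds only the product term (so the main effects must be listed explicitly), while '*' is shorthand for the main effects plus their interaction. The two fits sketched below specify the same model:
# Equivalent ways to include the Price-by-User.Rating interaction
fit_colon <- lm(Reviews ~ Price + User.Rating + Year + Price:User.Rating, data = bestsellers)
fit_star  <- lm(Reviews ~ Price * User.Rating + Year, data = bestsellers)
all.equal(coef(fit_colon), coef(fit_star))  # TRUE: identical coefficients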
Let’s construct the regression model with the selected variables.
# Build regression model with Price, User.Rating, Year, and their interaction
model <- lm(Reviews ~ Price + User.Rating + Year + Price:User.Rating, data = bestsellers)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = Reviews ~ Price + User.Rating + Year + Price:User.Rating,
## data = bestsellers)
##
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14425  -7031  -3586   4212  71703 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1989537.6   316847.1  -6.279 6.96e-10 ***
## Price                 -389.1     1314.2  -0.296    0.767    
## User.Rating          -4893.3     4387.4  -1.115    0.265    
## Year                  1005.6      158.9   6.327 5.20e-10 ***
## Price:User.Rating       66.4      286.1   0.232    0.817    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11290 on 545 degrees of freedom
## Multiple R-squared: 0.07999, Adjusted R-squared: 0.07324
## F-statistic: 11.85 on 4 and 545 DF, p-value: 3.099e-09
The summary of the regression model provides insights into the relationships between the predictor variables (Price, User.Rating, Year, and the interaction term Price:User.Rating) and the response variable (Reviews). Here’s the interpretation:
Intercept: The intercept estimate (-1989537.6) is the predicted number of reviews when all predictor variables are zero. Because Year = 0 lies far outside the range of the data, this value is a mathematical extrapolation with no practical interpretation.
Price, User.Rating, Year, and Interaction Term:
The coefficients for Price, User.Rating, and Year represent the change in the number of reviews associated with a one-unit increase in each variable. Note that because the interaction term is in the model, the Price coefficient is the effect of Price when User.Rating is zero (and vice versa), not an overall effect.
The coefficient for the interaction term (Price:User.Rating) represents how the effect of Price on reviews changes with each one-unit increase in User.Rating.
The p-value for each coefficient is the probability of observing an estimate at least as extreme as the one obtained, assuming the null hypothesis (no effect of the predictor on the response) is true.
A lower p-value provides stronger evidence against the null hypothesis.
Intercept: Highly significant (p < 0.001), indicating that the intercept is significantly different from zero.
Year: Highly significant (p = 5.20e-10), with each additional year of publication associated with roughly 1006 more reviews. Price, User.Rating, and the interaction term are not significant at the 0.05 level, as their p-values all exceed 0.05.
Residual standard error: An estimate of the standard deviation of the error term, i.e. the typical size of the model's prediction errors. Here it is 11290 reviews.
Multiple R-squared: The proportion of variance in the response variable explained by the predictors. Here it is 0.07999, meaning the model explains about 8% of the variance in the number of reviews.
Adjusted R-squared: A version of R-squared adjusted for the number of predictors in the model. In this case, it is 0.07324.
F-statistic: A test of the model's overall significance. The associated p-value (3.099e-09) is very small, indicating that at least one predictor carries genuine explanatory power.
The model as a whole is significant, but among the individual predictors only Year is significant at the 0.05 level; Price, User.Rating, and the interaction term are not.
The adjusted R-squared is low (0.07324), indicating that the predictors collectively explain only a small proportion of the variance in the number of reviews.
The weak individual effects and low explanatory power suggest that the model may not adequately capture the relationship between the predictors and the number of reviews.
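Confidence intervals complement the p-values above: an interval that spans zero corresponds to a coefficient that is not significant at the matching level. Base R provides these directly:
# 95% confidence intervals for all model coefficients
confint(model, level = 0.95)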
# Diagnostic plots
par(mfrow=c(2,2))
plot(model)
The top left plot is a residuals vs fitted plot. It shows the residuals (the differences between observed and fitted values) on the y-axis and the fitted values on the x-axis. In a well-specified model, the residuals should be scattered randomly around zero with roughly constant spread. Here the residuals are centered near zero for most fitted values, but the handful of very large positive residuals (consistent with the skewed residual summary: median -3586, max 71703) suggests the model fits some books much worse than others.
The top right plot is a Q-Q plot, or quantile-quantile plot. This plot compares the quantiles of the residuals to the quantiles of a standard normal distribution. If the residuals are normally distributed, the points should fall close to a straight line. This plot suggests that the residuals may not be normally distributed.
The bottom left plot is a scale-location plot. It shows the square root of the absolute standardized residuals on the y-axis and the fitted values on the x-axis (standardized residuals are residuals divided by their estimated standard deviation). If the error variance is constant, the points should spread evenly around a roughly horizontal line. Here most points sit low on the scale with a few pronounced outliers, which hints at non-constant error variance.
The bottom right plot is a residuals vs leverage plot. Leverage measures how much influence a point's predictor values have on the fitted regression line; the plot shows standardized residuals on the y-axis against leverage on the x-axis, typically with Cook's distance contours overlaid. Points that combine high leverage with large residuals are potentially influential. This plot suggests there are a few high-leverage outliers that merit a closer look.
Overall, these diagnostic plots suggest that the linear regression model may not be ideal. There is evidence that the errors may not be homoscedastic or normally distributed, and there may be some outliers. It is important to consider these issues before drawing conclusions from the model.
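Given these diagnostics, one possible next step (a sketch, not a definitive fix) would be to drop the non-significant interaction and model the response on the log scale, which often stabilizes the variance of heavily skewed count-like data such as review counts:
# Drop the interaction and log-transform the response;
# log1p() is safe even if any book had zero reviews
model_log <- lm(log1p(Reviews) ~ Price + User.Rating + Year, data = bestsellers)
summary(model_log)
# Re-run the diagnostic plots on the revised model
par(mfrow = c(2, 2))
plot(model_log)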