mario <- read.csv("mariokart.csv")
model1 <- lm(total_pr ~ duration + cond + stock_photo + wheels, data = mario)
summary(model1)
##
## Call:
## lm(formula = total_pr ~ duration + cond + stock_photo + wheels,
## data = mario)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.485 -6.511 -2.530 1.836 263.025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.5201 8.3701 5.199 7.05e-07 ***
## duration 0.3788 0.9388 0.403 0.687206
## condused -2.5816 5.2272 -0.494 0.622183
## stock_photoyes -6.7542 5.1729 -1.306 0.193836
## wheels 9.9476 2.7184 3.659 0.000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.4 on 138 degrees of freedom
## Multiple R-squared: 0.1235, Adjusted R-squared: 0.09808
## F-statistic: 4.86 on 4 and 138 DF, p-value: 0.001069
The slope for wheels is approximately 9.95, and it is statistically significant (p = 0.000359).
Interpretation: holding all other variables constant, each additional wheel included in the auction is associated with an average increase of $9.95 in total auction price.
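As a quick sanity check on the significance of wheels, the t value and p-value can be recomputed by hand from the rounded estimate and standard error printed in the summary. This is only a sketch: the variable names are illustrative, and small differences from the printed output come from rounding.

```r
# Values copied (rounded) from the model1 summary output
est_wheels <- 9.9476   # estimated slope for wheels
se_wheels  <- 2.7184   # standard error of that slope
df_resid   <- 138      # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_wheels <- est_wheels / se_wheels

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_wheels <- 2 * pt(-abs(t_wheels), df = df_resid)

cat("t =", round(t_wheels, 3), " p =", signif(p_wheels, 3), "\n")
```

The same calculation applied to condused (estimate -2.5816, standard error 5.2272) should give a t value close to the -0.494 shown in the table.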
The coefficient for condused is approximately -2.58, with a high p-value of 0.62.
Interpretation: holding the other variables constant, a used game is associated with an average decrease of $2.58 in total price compared to a new game. The high p-value means this difference is not statistically significant.
R^2 = 0.123, adjusted R^2 = 0.098
Interpretation: approximately 12.3% of the variability in total auction price is explained by this model. The adjusted R^2 is slightly lower because it penalizes the number of predictors. Both values are low, which indicates that the model's explanatory power is relatively weak.
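The relationship between the two values can be checked with the usual adjusted R^2 formula, 1 - (1 - R^2)(n - 1)/(n - k - 1). The sketch below assumes n = 143 observations, inferred from the 138 residual degrees of freedom plus 5 estimated coefficients; the variable names are illustrative.

```r
r2 <- 0.1235   # Multiple R-squared from the model1 summary
n  <- 143      # assumed: 138 residual df + 5 estimated coefficients
k  <- 4        # number of predictor terms in model1

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

cat("Adjusted R^2 =", round(adj_r2, 3), "\n")
```

The result should agree with the printed 0.09808 up to rounding of the inputs.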
new_data <- data.frame(duration = 7,
                       cond = "used",
                       stock_photo = "no",
                       wheels = 2)
prediction <- predict(model1, newdata = new_data)
cat("Predicted Total Price for the specified auction: $", round(prediction, 2))
## Predicted Total Price for the specified auction: $ 63.49
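The prediction can also be reproduced by hand from the fitted coefficients, which makes the role of the dummy variables explicit. This is a minimal sketch using the rounded coefficients from the model1 summary; the vector names `b` and `x` are illustrative.

```r
# Coefficients copied (rounded) from the model1 summary output
b <- c(intercept = 43.5201, duration = 0.3788, condused = -2.5816,
       stock_photoyes = -6.7542, wheels = 9.9476)

# Design row for the auction: duration = 7, cond = "used" (dummy = 1),
# stock_photo = "no" (dummy = 0), wheels = 2
x <- c(intercept = 1, duration = 7, condused = 1,
       stock_photoyes = 0, wheels = 2)

# Predicted price is the sum of coefficient * value over all terms
pred_by_hand <- sum(b * x)

cat("Prediction by hand: $", round(pred_by_hand, 2), "\n")
```

Up to rounding of the coefficients, this should match the value returned by predict().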
Intercept = 43.52
This is the predicted total price for an auction where all numeric predictors are zero (duration = 0, wheels = 0) and all categorical predictors are at their reference level (cond = new, stock_photo = no). Since an auction with duration = 0 is not meaningful, the intercept has no practical interpretation.
Using the step-down (backward elimination) approach, we drop the least significant variable (highest p-value) at each step. In model1, duration has the highest p-value (0.687), so it is dropped first.
model2 <- update(model1, . ~ . - duration)
summary(model2)
##
## Call:
## lm(formula = total_pr ~ cond + stock_photo + wheels, data = mario)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.772 -6.002 -2.424 1.840 263.738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.123 7.345 6.143 8.03e-09 ***
## condused -1.919 4.947 -0.388 0.698726
## stock_photoyes -7.267 4.999 -1.454 0.148281
## wheels 9.784 2.680 3.651 0.000369 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.32 on 139 degrees of freedom
## Multiple R-squared: 0.1224, Adjusted R-squared: 0.1035
## F-statistic: 6.465 on 3 and 139 DF, p-value: 0.0003961
In model2, the highest p-value belongs to condused (p = 0.699), so cond is the next variable to drop.
model4 <- update(model2, . ~ . - cond)
summary(model4)
##
## Call:
## lm(formula = total_pr ~ stock_photo + wheels, data = mario)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.401 -6.502 -2.369 1.503 263.109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.931 4.679 9.175 5.29e-16 ***
## stock_photoyes -6.522 4.601 -1.418 0.159
## wheels 10.235 2.407 4.251 3.86e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.25 on 140 degrees of freedom
## Multiple R-squared: 0.1215, Adjusted R-squared: 0.1089
## F-statistic: 9.681 on 2 and 140 DF, p-value: 0.0001153
In model4, which turns out to be our best model, the highest p-value belongs to stock_photoyes (p = 0.159). We drop it to see whether the fit improves.
model5 <- update(model4, . ~ . - stock_photo)
summary(model5)
##
## Call:
## lm(formula = total_pr ~ wheels, data = mario)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.401 -6.411 -2.417 0.579 268.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.405 3.433 11.188 < 2e-16 ***
## wheels 10.006 2.411 4.151 5.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.34 on 141 degrees of freedom
## Multiple R-squared: 0.1089, Adjusted R-squared: 0.1026
## F-statistic: 17.23 on 1 and 141 DF, p-value: 5.704e-05
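The adjusted R^2 dropped from 0.109 (model4) to 0.103 (model5), so removing stock_photo makes the model worse by this criterion. Therefore, we select model4 as the final model: it has the highest adjusted R^2 of the models considered.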
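The comparison across all four fits can be laid out side by side. The sketch below simply collects the adjusted R^2 values printed in the summaries above into a named vector (the name `adj_r2` is illustrative) and picks the largest.

```r
# Adjusted R-squared values copied from the four summary outputs
adj_r2 <- c(model1 = 0.09808, model2 = 0.1035,
            model4 = 0.1089,  model5 = 0.1026)

# Name of the model with the highest adjusted R-squared
best <- names(which.max(adj_r2))

print(round(adj_r2, 3))
cat("Highest adjusted R^2:", best, "\n")
```

With the fitted model objects available, the same values could be read directly with `summary(model4)$adj.r.squared` instead of being copied by hand.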
# Check for Nearly Normal Residuals
hist(model4$residuals, main = "Histogram of Residuals (Model 4)", xlab = "Residuals")
We can observe that the residuals are highly right-skewed, suggesting a violation of the Nearly Normal Residuals condition due to extreme positive outliers (the maximum residual is about 263).
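One way to locate the observations behind such a long right tail is to flag large standardized residuals. The sketch below uses simulated data, not the mariokart auctions, with two artificially inflated prices; the cutoff of 3 is a common rule of thumb and an assumption here, not something taken from the analysis above.

```r
set.seed(1)  # simulated data only; these are NOT the mariokart auctions
n <- 143
wheels <- sample(0:4, n, replace = TRUE)
price  <- 40 + 10 * wheels + rnorm(n, sd = 5)

# Inject two extreme prices to mimic a long right tail
outlier_rows <- c(20, 65)
price[outlier_rows] <- price[outlier_rows] + 250

fit <- lm(price ~ wheels)

# Flag observations whose standardized residual exceeds 3 in absolute value
# (the threshold of 3 is an assumed rule of thumb)
flagged <- which(abs(rstandard(fit)) > 3)
print(flagged)
```

Applied to the real fit, the analogous call would be `which(abs(rstandard(model4)) > 3)`, and the flagged rows of `mario` could then be inspected individually.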
# absolute residuals vs. fitted values
plot(model4$fitted.values, abs(model4$residuals),
     xlab = "Fitted Values of Total Price (USD)",
     ylab = "|Residuals| (USD)",
     main = "Absolute Residuals vs. Fitted Values (Model 4)")
abline(h = 0, lty = 2, col = "red")
The plot shows a noticeable fan shape: the variability of the residuals increases as the fitted values increase. This violates the Constant Variability condition.
# Set up a 2-panel plot
par(mfrow = c(1, 2))
# Residuals vs. wheels
plot(mario$wheels, model4$residuals,
     xlab = "Number of Wheels",
     ylab = "Residuals",
     main = "Residuals vs. Wheels (Model 4)")
abline(h = 0, lty = 2, col = "red")
# Residuals vs. Index
plot(model4$residuals,
     xlab = "Observation Index (Order of Data Collection)",
     ylab = "Residuals",
     main = "Residuals in Order of Collection (Model 4)")
abline(h = 0, lty = 2, col = "red")
Conclusion: The diagnostic plots show significant problems with non-normal residuals (skewness and outliers) and nonconstant variance. The model assumptions are severely violated, indicating that the linear model is inappropriate for the data as is.