HW 5 Mario Kart Garcia Andres

mario <- read.csv("mariokart.csv")

model1 <- lm(total_pr ~ duration + cond + stock_photo + wheels, data = mario)


summary(model1)

## 
## Call:
## lm(formula = total_pr ~ duration + cond + stock_photo + wheels, 
##     data = mario)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.485  -6.511  -2.530   1.836 263.025 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     43.5201     8.3701   5.199 7.05e-07 ***
## duration         0.3788     0.9388   0.403 0.687206    
## condused        -2.5816     5.2272  -0.494 0.622183    
## stock_photoyes  -6.7542     5.1729  -1.306 0.193836    
## wheels           9.9476     2.7184   3.659 0.000359 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.4 on 138 degrees of freedom
## Multiple R-squared:  0.1235, Adjusted R-squared:  0.09808 
## F-statistic:  4.86 on 4 and 138 DF,  p-value: 0.001069

9.21 Slope for Wheels

The slope for wheels is approximately 9.95, and it is statistically significant (p = 0.00036).

Interpretation: holding all other variables constant, each additional wheel included in the auction is associated with an average increase of $9.95 in total auction price.
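To gauge how precise this slope is, a 95% confidence interval can be rebuilt by hand from the printed estimate and standard error. This is only a sketch: the numbers are copied from the summary output above and the variable names are illustrative. With the fitted object available, `confint(model1, "wheels")` should give essentially the same interval.

```r
# Values copied from the summary(model1) output (not recomputed from the data)
est <- 9.9476   # slope estimate for wheels
se  <- 2.7184   # standard error of that estimate
df  <- 138      # residual degrees of freedom

# 95% confidence interval: estimate +/- t* times the standard error
t_star <- qt(0.975, df)
ci <- c(lower = est - t_star * se, upper = est + t_star * se)
print(round(ci, 2))
```

An interval that lies entirely above zero is consistent with the small p-value reported for wheels.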

9.22 Condused Coefficient and P-value

The coefficient is approximately -2.58, and the p-value is high (0.62).

Interpretation: holding the other variables constant, a used game is associated with an average decrease of $2.58 in price compared to a new game. The high p-value means this difference is not statistically significant.
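The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. The following is a minimal sketch using numbers copied from the output above; the variable names are illustrative.

```r
# Values copied from the condused row of summary(model1)
est <- -2.5816  # coefficient estimate
se  <- 5.2272   # standard error
df  <- 138      # residual degrees of freedom

# t statistic: how many standard errors the estimate is from zero
t_value <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df)
print(round(c(t = t_value, p = p_value), 3))
```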

9.23 R^2 and Adjusted R^2

R^2 = 0.123, adjusted R^2 = 0.098

Interpretation: approximately 12.3% of the variability in total auction price is explained by this model, so its predictive power is relatively weak. The adjusted R^2 is slightly lower because it penalizes the model for the number of predictors it uses.
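The link between the two values can be checked with the usual adjusted R^2 formula, 1 - (1 - R^2)(n - 1)/(n - k - 1). The sketch below assumes n = 143 observations, inferred from the 138 residual degrees of freedom plus 5 estimated coefficients; the variable names are illustrative.

```r
# Values taken from the summary(model1) output
r2 <- 0.1235   # multiple R-squared
k  <- 4        # number of predictor terms (excluding the intercept)
n  <- 138 + 5  # residual df + number of coefficients = 143 observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))
```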

9.24 Prediction

new_data <- data.frame(duration = 7,
                       cond = "used",
                       stock_photo = "no",
                       wheels = 2)

prediction <- predict(model1, newdata = new_data)
cat("Predicted Total Price for the specified auction: $", round(prediction, 2))

## Predicted Total Price for the specified auction: $ 63.49
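As a check on the `predict()` result, the same value can be computed by hand from the printed coefficients. This is a sketch with coefficients copied from the summary output; the vector names are illustrative, and the indicator values (condused = 1, stock_photoyes = 0) encode the used, non-stock-photo auction described above.

```r
# Coefficients copied from summary(model1)
b <- c(intercept = 43.5201, duration = 0.3788, condused = -2.5816,
       stock_photoyes = -6.7542, wheels = 9.9476)

# Predictor values for the auction: 7 days, used, no stock photo, 2 wheels
x <- c(intercept = 1, duration = 7, condused = 1,
       stock_photoyes = 0, wheels = 2)

# The prediction is the sum of coefficient * value over all terms
pred <- sum(b * x)
print(round(pred, 2))
```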

9.25

Intercept = 43.52

This is the predicted total price for an auction where all numeric predictors are zero (duration = 0, wheels = 0) and all categorical predictors are at their reference level (cond = new, stock_photo = no). Since duration = 0 is not meaningful, the intercept has no practical interpretation.

9.4.2

Using the step-down (backward elimination) approach, we drop the least significant variable, i.e. the one with the highest p-value. In model1, the variable duration has the highest p-value (0.687).

model2 <- update(model1, . ~ . - duration)
summary(model2)

## 
## Call:
## lm(formula = total_pr ~ cond + stock_photo + wheels, data = mario)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.772  -6.002  -2.424   1.840 263.738 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      45.123      7.345   6.143 8.03e-09 ***
## condused         -1.919      4.947  -0.388 0.698726    
## stock_photoyes   -7.267      4.999  -1.454 0.148281    
## wheels            9.784      2.680   3.651 0.000369 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.32 on 139 degrees of freedom
## Multiple R-squared:  0.1224, Adjusted R-squared:  0.1035 
## F-statistic: 6.465 on 3 and 139 DF,  p-value: 0.0003961

9.27

In model2, the highest p-value belongs to condused (p = 0.699). This is the next candidate to drop.

9.28

Drop cond

model4 <- update(model2, . ~ . - cond)
summary(model4)

## 
## Call:
## lm(formula = total_pr ~ stock_photo + wheels, data = mario)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.401  -6.502  -2.369   1.503 263.109 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      42.931      4.679   9.175 5.29e-16 ***
## stock_photoyes   -6.522      4.601  -1.418    0.159    
## wheels           10.235      2.407   4.251 3.86e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.25 on 140 degrees of freedom
## Multiple R-squared:  0.1215, Adjusted R-squared:  0.1089 
## F-statistic: 9.681 on 2 and 140 DF,  p-value: 0.0001153

9.29

In model4, which is our best model so far, the highest p-value belongs to stock_photoyes (p = 0.159). We drop it and check whether the fit improves.

model5 <- update(model4, . ~ . - stock_photo)
summary(model5)

## 
## Call:
## lm(formula = total_pr ~ wheels, data = mario)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.401  -6.411  -2.417   0.579 268.093 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.405      3.433  11.188  < 2e-16 ***
## wheels        10.006      2.411   4.151  5.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.34 on 141 degrees of freedom
## Multiple R-squared:  0.1089, Adjusted R-squared:  0.1026 
## F-statistic: 17.23 on 1 and 141 DF,  p-value: 5.704e-05

The adjusted R^2 dropped from 0.109 (model4) to 0.103 (model5). Therefore, we select model4 as the final model, since it has the highest adjusted R^2 of the models considered.
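The comparison across all four fits can be summarized in one place. The sketch below uses the adjusted R^2 values copied from the summaries above; with the fitted objects available, `summary(model4)$adj.r.squared` returns the same quantity directly.

```r
# Adjusted R-squared values copied from the four summary() outputs
adj_r2 <- c(model1 = 0.09808, model2 = 0.1035,
            model4 = 0.1089,  model5 = 0.1026)

# Rank the models from best to worst by adjusted R-squared
print(sort(adj_r2, decreasing = TRUE))

# Name of the model with the highest adjusted R-squared
best <- names(which.max(adj_r2))
print(best)
```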

9.4.3 Checking model conditions using graphs

# Check for Nearly Normal Residuals
hist(model4$residuals, main = "Histogram of Residuals (Model 4)", xlab = "Residuals")

We can observe that the residuals are highly right-skewed, suggesting a violation of the Nearly Normal Residuals condition due to extreme positive outliers.
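A normal Q-Q plot is another common way to see this kind of departure, and a simple rule can flag the observations responsible. The sketch below uses simulated values as a hypothetical stand-in for `model4$residuals` (mostly small residuals plus one extreme value near the maximum reported above); the 3-standard-deviation cutoff is an assumption chosen for illustration, not a rule from the assignment.

```r
set.seed(1)

# Hypothetical stand-in for model4$residuals: 142 small residuals
# plus one extreme positive outlier
res <- c(rnorm(142, mean = 0, sd = 8), 263)

# Normal Q-Q plot: points far from the reference line indicate non-normality
qqnorm(res, main = "Normal Q-Q Plot (simulated residuals)")
qqline(res, lty = 2, col = "red")

# Flag residuals more than 3 standard deviations from the mean
flagged <- which(abs(res - mean(res)) > 3 * sd(res))
print(flagged)
```

With the real model, replacing `res` by `model4$residuals` would show which auctions drive the skew.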

# absolute residuals vs. fitted values
plot(model4$fitted.values, abs(model4$residuals),
     xlab = "Fitted Values of Total Price (USD)",
     ylab = "|Residuals| (USD)",
     main = "Absolute Residuals vs. Fitted Values (Model 4)")
abline(h = 0, lty = 2, col = "red")

The plot shows a noticeable fan shape, where the variability of the residuals increases as the fitted values increase. This violates the Constant Variability condition.

Residuals against each predictor variable

# Set up a 2-panel plot
par(mfrow = c(1, 2))

# Residuals vs. wheels
plot(mario$wheels, model4$residuals,
     xlab = "Number of Wheels",
     ylab = "Residuals",
     main = "Residuals vs. Wheels (Model 4)")
abline(h = 0, lty = 2, col = "red")

# Residuals vs. Index
plot(model4$residuals,
     xlab = "Observation Index (Order of Data Collection)",
     ylab = "Residuals",
     main = "Residuals in Order of Collection (Model 4)")
abline(h = 0, lty = 2, col = "red")

Conclusion: The diagnostic plots show significant issues with non-normal residuals (skewness and outliers) and non-constant variance. The model assumptions are severely violated, indicating that the linear model is inappropriate for the data as is.