This exercise involves the Auto dataset. We first remove
missing values.
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
mpg,
cylinders, displacement,
horsepower, weight, acceleration,
yearorigin (1 = American, 2 =
European, 3 = Japanese), nameAlthough origin is stored as numeric (1, 2, 3), it
represents categories with no meaningful order. It should be converted
to a factor in R to ensure it is treated correctly in any modelling:
quant_vars <- Auto[, c("mpg", "cylinders", "displacement",
"horsepower", "weight", "acceleration", "year")]
sapply(quant_vars, range)## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
Auto_subset <- Auto[-(10:85), ]
quant_subset <- Auto_subset[, c("mpg", "cylinders", "displacement",
"horsepower", "weight", "acceleration", "year")]
sapply(quant_subset, range)## mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0 3 68 46 1649 8.5 70
## [2,] 46.6 8 455 230 4997 24.8 82
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
After removing observations 10 through 85, the ranges, means, and standard deviations shift noticeably for some variables, indicating those rows contained values spread across the distribution.
par(mfrow = c(1, 3))
plot(Auto$weight, Auto$mpg, pch = 16, col = "steelblue",
xlab = "Weight", ylab = "MPG", main = "Weight vs MPG")
plot(Auto$horsepower, Auto$mpg, pch = 16, col = "tomato",
xlab = "Horsepower", ylab = "MPG", main = "Horsepower vs MPG")
plot(Auto$year, Auto$mpg, pch = 16, col = "darkgreen",
xlab = "Year", ylab = "MPG", main = "Year vs MPG")The scatterplot matrix reveals several clear patterns among the
predictors. Weight, displacement, and horsepower are strongly correlated
with each other, and all three show a negative relationship with
mpg. The variable year shows a positive trend
with mpg.
## mpg cylinders displacement horsepower weight acceleration
## 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442 0.4233285
## year
## 0.5805410
Based on the correlation values and scatterplots:
mpg —
heavier and more powerful cars are less fuel efficient.These variables would all be useful predictors of
mpg.
pairs(Auto[, !(names(Auto) %in% c("name", "origin"))],
main = "Scatterplot Matrix - Full Auto Dataset")## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## acceleration year
## mpg 0.4233285 0.5805410
## cylinders -0.5046834 -0.3456474
## displacement -0.5438005 -0.3698552
## horsepower -0.6891955 -0.4163615
## weight -0.4168392 -0.3091199
## acceleration 1.0000000 0.2903161
## year 0.2903161 1.0000000
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originEuropean 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originJapanese 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the
response?
Yes. The F-statistic is large with a very small p-value, indicating a
significant overall relationship.
ii. Which predictors are statistically
significant?
displacement, weight, year, and
origin all have p-values below 0.05 and are statistically
significant.
iii. What does the year coefficient
suggest?
The coefficient for year is approximately 0.75, meaning
that holding all other variables constant, each additional model year is
associated with an increase of ~0.75 mpg. This reflects improvements in
fuel efficiency over time.
##
## Call:
## lm(formula = mpg ~ displacement * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8664 -2.4801 -0.3355 1.8071 17.9429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.372e+01 1.940e+00 27.697 < 2e-16 ***
## displacement -7.831e-02 1.131e-02 -6.922 1.85e-11 ***
## weight -8.931e-03 8.474e-04 -10.539 < 2e-16 ***
## displacement:weight 1.744e-05 2.789e-06 6.253 1.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.097 on 388 degrees of freedom
## Multiple R-squared: 0.7265, Adjusted R-squared: 0.7244
## F-statistic: 343.6 on 3 and 388 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = mpg ~ horsepower * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7725 -2.2074 -0.2708 1.9973 14.7314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.356e+01 2.343e+00 27.127 < 2e-16 ***
## horsepower -2.508e-01 2.728e-02 -9.195 < 2e-16 ***
## weight -1.077e-02 7.738e-04 -13.921 < 2e-16 ***
## horsepower:weight 5.355e-05 6.649e-06 8.054 9.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared: 0.7484, Adjusted R-squared: 0.7465
## F-statistic: 384.8 on 3 and 388 DF, p-value: < 2.2e-16
lm_interact3 <- lm(mpg ~ acceleration + year + displacement:weight, data = Auto)
summary(lm_interact3)##
## Call:
## lm(formula = mpg ~ acceleration + year + displacement:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7447 -2.7976 -0.3147 2.2772 15.6057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.245e+01 4.827e+00 -4.651 4.53e-06 ***
## acceleration -1.002e-01 8.917e-02 -1.123 0.262
## year 7.137e-01 6.115e-02 11.671 < 2e-16 ***
## displacement:weight -1.024e-05 4.839e-07 -21.157 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.108 on 388 degrees of freedom
## Multiple R-squared: 0.7251, Adjusted R-squared: 0.723
## F-statistic: 341.1 on 3 and 388 DF, p-value: < 2.2e-16
The interaction term displacement:weight is
statistically significant, suggesting that the effect of displacement on
mpg depends on the weight of the car. Specifically, the
negative interaction coefficient indicates that the negative impact of
displacement on fuel efficiency is even more severe in heavier vehicles
— a large engine in a heavy car compounds the fuel efficiency penalty
beyond what either variable alone would suggest. The
horsepower:weight interaction is similarly significant and
follows the same logic.
lm_log <- lm(mpg ~ log(horsepower), data = Auto)
lm_sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
lm_sq <- lm(mpg ~ I(horsepower^2), data = Auto)
par(mfrow = c(1, 3))
plot(Auto$horsepower, Auto$mpg, pch = 16, col = "gray",
main = "log(horsepower)", xlab = "Horsepower", ylab = "MPG")
lines(sort(Auto$horsepower),
fitted(lm_log)[order(Auto$horsepower)], col = "red", lwd = 2)
plot(Auto$horsepower, Auto$mpg, pch = 16, col = "gray",
main = "sqrt(horsepower)", xlab = "Horsepower", ylab = "MPG")
lines(sort(Auto$horsepower),
fitted(lm_sqrt)[order(Auto$horsepower)], col = "blue", lwd = 2)
plot(Auto$horsepower, Auto$mpg, pch = 16, col = "gray",
main = "horsepower^2", xlab = "Horsepower", ylab = "MPG")
lines(sort(Auto$horsepower),
fitted(lm_sq)[order(Auto$horsepower)], col = "darkgreen", lwd = 2)## R² - log: 0.6683348
## R² - sqrt: 0.6437036
## R² - sq: 0.507367
The log(horsepower) transformation achieves the highest
R² and produces the best visual fit. This confirms that the relationship
between horsepower and mpg is non-linear, and a log
transformation appropriately captures this curvature.
# Collect p-values for each simple regression
slr_pvals <- sapply(boston_vars, function(var) {
formula <- as.formula(paste("crim ~", var))
fit <- lm(formula, data = Boston)
coef(summary(fit))[2, 4]
})
slr_pvals## zn indus chas nox rm age
## 5.506472e-06 1.450349e-21 2.094345e-01 3.751739e-23 6.346703e-07 2.854869e-16
## dis rad tax ptratio lstat medv
## 8.519949e-19 2.693844e-56 2.357127e-47 2.942922e-11 2.654277e-27 1.173987e-19
par(mfrow = c(1, 4))
for (var in c("lstat", "medv", "rad", "tax")) {
plot(Boston[[var]], Boston$crim, pch = 16, col = "steelblue",
xlab = var, ylab = "crim", main = paste("crim vs", var))
abline(lm(as.formula(paste("crim ~", var)), data = Boston),
col = "red", lwd = 2)
}Almost all predictors show a statistically significant simple
association with crim. The strongest associations are with
rad (index of accessibility to highways), tax,
lstat, and medv.
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
In the multiple regression model, we can reject H₀ (βj = 0) for:
zn, dis, rad, black,
and medv. Many predictors that were significant in simple
regression lose significance here, likely due to multicollinearity.
slr_coefs <- sapply(boston_vars, function(var) {
coef(lm(as.formula(paste("crim ~", var)), data = Boston))[2]
})
mlr_coefs <- coef(mlr_fit)[-1]
plot(slr_coefs, mlr_coefs,
xlab = "Simple Regression Coefficients",
ylab = "Multiple Regression Coefficients",
main = "Univariate vs Multiple Regression Coefficients",
pch = 16, col = "steelblue")
abline(0, 1, col = "red", lty = 2)
text(slr_coefs, mlr_coefs, labels = boston_vars, cex = 0.7, pos = 3)The most striking difference is for nox, which has a
large positive coefficient in simple regression but a negative
coefficient in multiple regression. This is a sign of confounding — when
other correlated predictors are included, the estimated effect of
nox changes dramatically. This makes sense contextually:
nitrogen oxide levels are closely tied to industrial zones and heavy
traffic, which are themselves associated with higher crime rates, so in
simple regression nox absorbs the effect of these omitted
variables.
# Show p-values for quadratic and cubic terms for select variables
for (var in c("medv", "lstat", "dis", "nox")) {
formula <- as.formula(paste("crim ~ poly(", var, ", 3)"))
fit <- lm(formula, data = Boston)
cat("\n---", var, "---\n")
print(coef(summary(fit))[, 4])
}##
## --- medv ---
## (Intercept) poly(medv, 3)1 poly(medv, 3)2 poly(medv, 3)3
## 7.024110e-31 4.930818e-27 2.928577e-35 1.046510e-12
##
## --- lstat ---
## (Intercept) poly(lstat, 3)1 poly(lstat, 3)2 poly(lstat, 3)3
## 4.939398e-24 1.678072e-27 3.780418e-02 1.298906e-01
##
## --- dis ---
## (Intercept) poly(dis, 3)1 poly(dis, 3)2 poly(dis, 3)3
## 1.060226e-25 1.253249e-21 7.869767e-14 1.088832e-08
##
## --- nox ---
## (Intercept) poly(nox, 3)1 poly(nox, 3)2 poly(nox, 3)3
## 2.742908e-26 2.457491e-26 7.736755e-05 6.961110e-16
For several predictors (medv, lstat,
dis, nox), the quadratic and/or cubic terms
are statistically significant, providing evidence of non-linear
relationships with crim.
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
par(mfrow = c(1, 2))
plot(Weekly$Volume, type = "l", col = "steelblue",
xlab = "Index", ylab = "Volume",
main = "Trading Volume Over Time")
plot(Weekly$Year, Weekly$Volume, pch = 16, col = "tomato",
xlab = "Year", ylab = "Volume",
main = "Volume by Year")Trading volume has increased substantially over time. The lag
variables show very low correlations with each other and with
Direction. The only strong correlation is between
Year and Volume.
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm_fit)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only Lag2 appears statistically significant (p <
0.05). All other predictors, including Volume, are not
significant.
glm_probs <- predict(glm_fit, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, "Up", "Down")
table(Predicted = glm_pred, Actual = Weekly$Direction)## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
##
## Overall accuracy: 0.5610652
The overall accuracy is approximately 56%, only slightly better than random guessing. The confusion matrix reveals a clear imbalance in the model’s performance:
This means the model is biased toward the majority class (“Up”), making it unreliable for detecting market downturns despite its modest overall accuracy.
train <- Weekly$Year <= 2008
test <- Weekly[!train, ]
glm_fit2 <- glm(Direction ~ Lag2,
data = Weekly,
family = binomial,
subset = train)
summary(glm_fit2)##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
glm_probs2 <- predict(glm_fit2, newdata = test, type = "response")
glm_pred2 <- ifelse(glm_probs2 > 0.5, "Up", "Down")
table(Predicted = glm_pred2, Actual = test$Direction)## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
##
## Test accuracy: 0.625
Using only Lag2 as a predictor and evaluating on
held-out data from 2009–2010, the test accuracy improves to
approximately 62.5%. This is better than the full-data model, suggesting
that a simpler model with a proper train/test split generalizes better.
The model still tends to predict “Up” more frequently than “Down”.
End of Midterm Homework