This exercise involves the Auto data set. Make sure that the missing
values have been removed from the data.
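The str() output below shows an "na.action" attribute, so na.omit() was
evidently applied; a minimal setup sketch, assuming the Auto data frame
from the ISLR package is already loaded:

Auto <- na.omit(Auto)  # drops the rows flagged in the na.action attribute (5 here)
str(Auto)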
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Answer:
Quantitative: mpg, cylinders, displacement, horsepower, weight,
acceleration, year.
Qualitative: name, origin (origin is coded as 1/2/3 representing
American/European/Japanese, so it is treated as categorical).

quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")
sapply(Auto[, quant_vars], range)

## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
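The mean/SD table that follows has no generating code in the original;
it was presumably produced by something along these lines:

rbind(
  Mean = sapply(Auto[, quant_vars], mean),
  SD   = sapply(Auto[, quant_vars], sd)
)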
## mpg cylinders displacement horsepower weight acceleration
## Mean 23.445918 5.471939 194.412 104.46939 2977.5842 15.541327
## SD 7.805007 1.705783 104.644 38.49116 849.4026 2.758864
## year
## Mean 75.979592
## SD 3.683737
Auto_sub <- Auto[-(10:85), ]
rbind(
  Range_Min = sapply(Auto_sub[, quant_vars], min),
  Range_Max = sapply(Auto_sub[, quant_vars], max),
  Mean = sapply(Auto_sub[, quant_vars], mean),
  SD = sapply(Auto_sub[, quant_vars], sd)
)

## mpg cylinders displacement horsepower weight acceleration
## Range_Min 11.000000 3.000000 68.00000 46.00000 1649.0000 8.500000
## Range_Max 46.600000 8.000000 455.00000 230.00000 4997.0000 24.800000
## Mean 24.404430 5.373418 187.24051 100.72152 2935.9715 15.726899
## SD 7.867283 1.654179 99.67837 35.70885 811.3002 2.693721
## year
## Range_Min 70.000000
## Range_Max 82.000000
## Mean 77.145570
## SD 3.106217
Findings:

- mpg has a clear negative relationship with displacement, horsepower,
  and weight: heavier, more powerful cars tend to be less fuel-efficient.
- displacement, horsepower, and weight are highly correlated with each
  other (multicollinearity).
- year shows a modest positive relationship with mpg, suggesting newer
  cars tend to be more fuel-efficient.
- cylinders correlates strongly with displacement and weight.

Do the plots suggest that any of the other variables might be useful in
predicting mpg?

par(mfrow = c(2, 3))
for (v in setdiff(quant_vars, "mpg")) {
  plot(Auto[[v]], Auto$mpg,
       xlab = v, ylab = "mpg",
       main = paste("mpg vs", v),
       pch = 19, col = "steelblue", cex = 0.6)
  abline(lm(Auto$mpg ~ Auto[[v]]), col = "red", lwd = 2)
}

Answer:
Based on the plots, displacement, horsepower, weight, and year all
appear useful for predicting mpg:

- displacement, horsepower, and weight show strong negative linear (or
  slightly curved) relationships with mpg.
- year shows a positive trend: newer cars are more fuel efficient.
- cylinders is also associated but is largely captured by displacement.
- acceleration has a weaker relationship.

Multiple linear regression on the Auto data set.
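The correlation matrix below covers the seven quantitative variables;
its generating call is missing from the original, but was presumably:

cor(Auto[, quant_vars])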
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
## year 0.581 -0.346 -0.370 -0.416 -0.309 0.290
## year
## mpg 0.581
## cylinders -0.346
## displacement -0.370
## horsepower -0.416
## weight -0.309
## acceleration 0.290
## year 1.000
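The regression output that follows matches the Call shown in it; a
sketch of the fit (the object name lm_fit is an assumption):

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)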
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comments:
i. Is there a relationship between the predictors and the
response?
Yes. The overall F-statistic is very large with a p-value essentially 0,
indicating at least one predictor is significantly related to
mpg.
ii. Which predictors have a statistically significant
relationship?
Based on the p-values (< 0.05): displacement,
weight, year, and origin are
statistically significant.
iii. What does the coefficient for year
suggest?
The positive coefficient for year (approximately +0.75)
suggests that, holding all other variables constant, each additional
model year is associated with about 0.75 more miles per gallon. Cars
have become more fuel-efficient over time.
# Test some meaningful interactions
lm_interact <- lm(mpg ~ displacement + weight + year + origin +
                    displacement:weight + weight:year, data = Auto)
summary(lm_interact)

##
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin + displacement:weight +
## weight:year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9585 -1.6093 -0.0373 1.3563 12.7428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.140e+01 1.374e+01 -4.468 1.04e-05 ***
## displacement -6.043e-02 9.407e-03 -6.424 3.92e-10 ***
## weight 9.045e-03 4.895e-03 1.848 0.0654 .
## year 1.498e+00 1.739e-01 8.616 < 2e-16 ***
## origin 3.388e-01 2.525e-01 1.342 0.1805
## displacement:weight 1.708e-05 2.383e-06 7.169 3.89e-12 ***
## weight:year -2.486e-04 6.159e-05 -4.037 6.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.958 on 385 degrees of freedom
## Multiple R-squared: 0.8586, Adjusted R-squared: 0.8564
## F-statistic: 389.6 on 6 and 385 DF, p-value: < 2.2e-16
Comment:
Both interaction terms are statistically significant here:
displacement:weight (p ≈ 3.9e-12) and weight:year (p ≈ 6.5e-05). The
significant weight:year term suggests the effect of weight on mpg
changes across model years, and adding the interactions raises the
adjusted R² from about 0.82 to 0.86.
par(mfrow = c(1, 3))
plot(log(Auto$horsepower), Auto$mpg,
     xlab = "log(horsepower)", ylab = "mpg",
     main = "mpg vs log(horsepower)", pch = 19, col = "steelblue", cex = 0.6)
plot(sqrt(Auto$horsepower), Auto$mpg,
     xlab = "sqrt(horsepower)", ylab = "mpg",
     main = "mpg vs sqrt(horsepower)", pch = 19, col = "darkgreen", cex = 0.6)
plot(Auto$horsepower^2, Auto$mpg,
     xlab = "horsepower^2", ylab = "mpg",
     main = "mpg vs horsepower^2", pch = 19, col = "tomato", cex = 0.6)

lm_log  <- lm(mpg ~ log(horsepower) + weight + year + origin, data = Auto)
lm_sqrt <- lm(mpg ~ sqrt(horsepower) + weight + year + origin, data = Auto)
lm_sq   <- lm(mpg ~ I(horsepower^2) + weight + year + origin, data = Auto)
# The output shows three R² values, so the original chunk evidently
# printed all three; the two missing cat() calls are restored here.
cat("log(horsepower) R²:", summary(lm_log)$r.squared, "\n")
cat("sqrt(horsepower) R²:", summary(lm_sqrt)$r.squared, "\n")
cat("horsepower^2 R²:", summary(lm_sq)$r.squared, "\n")

## log(horsepower) R²: 0.827627
## sqrt(horsepower) R²: 0.8214188
## horsepower^2 R²: 0.8185632
Comment:
The log(horsepower) and sqrt(horsepower)
transformations tend to produce a more linear relationship with
mpg and slightly improve R². The scatterplot with
log(horsepower) shows a cleaner linear pattern compared to
the raw or squared values.
Predict the per capita crime rate (crim) using the Boston data set.
# Define the predictor set (crim is the response). Note: `predictors`
# was used below but never defined in the original; this definition,
# assuming the ISLR2 version of Boston, matches the 12 predictors in
# the output.
predictors <- setdiff(names(Boston), "crim")

# Fit simple regression for each predictor and collect p-values & coefficients
slr_results <- data.frame(
  predictor = predictors,
  coef = NA,
  p_value = NA
)
for (i in seq_along(predictors)) {
  v <- predictors[i]
  fit <- lm(crim ~ Boston[[v]], data = Boston)
  s <- summary(fit)
  slr_results$coef[i] <- coef(fit)[2]
  slr_results$p_value[i] <- s$coefficients[2, 4]
}
slr_results$significant <- slr_results$p_value < 0.05
print(slr_results)

## predictor coef p_value significant
## 1 zn -0.07393498 5.506472e-06 TRUE
## 2 indus 0.50977633 1.450349e-21 TRUE
## 3 chas -1.89277655 2.094345e-01 FALSE
## 4 nox 31.24853120 3.751739e-23 TRUE
## 5 rm -2.68405122 6.346703e-07 TRUE
## 6 age 0.10778623 2.854869e-16 TRUE
## 7 dis -1.55090168 8.519949e-19 TRUE
## 8 rad 0.61791093 2.693844e-56 TRUE
## 9 tax 0.02974225 2.357127e-47 TRUE
## 10 ptratio 1.15198279 2.942922e-11 TRUE
## 11 lstat 0.54880478 2.654277e-27 TRUE
## 12 medv -0.36315992 1.173987e-19 TRUE
par(mfrow = c(4, 4))
for (v in predictors) {
  plot(Boston[[v]], Boston$crim,
       xlab = v, ylab = "crim",
       main = paste("crim ~", v),
       pch = 19, col = "steelblue", cex = 0.4)
  abline(lm(crim ~ Boston[[v]], data = Boston), col = "red", lwd = 1.5)
}
par(mfrow = c(1, 1))

Answer:
Almost all predictors have a statistically significant association with
crim in simple linear regression. The only exception is chas (the
Charles River dummy variable), with a p-value of about 0.21.
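The multiple regression output below matches the Call shown in it, and
the later coefficient-comparison code references mlr_fit, so the fit
was presumably:

mlr_fit <- lm(crim ~ ., data = Boston)
summary(mlr_fit)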
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
Answer:
In the multiple regression model, we can reject H₀: βⱼ = 0 (at α = 0.05)
for zn, dis, rad, and medv. Many predictors that were significant in
simple regression lose significance here due to multicollinearity.
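One way to check the multicollinearity claim (not part of the original
analysis; assumes the car package is installed) is to compute variance
inflation factors:

library(car)
vif(mlr_fit)  # VIFs well above 5 flag strongly collinear predictors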
mlr_coefs <- coef(mlr_fit)[-1]  # drop intercept
# Align the two coefficient vectors by predictor name
common <- intersect(slr_results$predictor, names(mlr_coefs))
slr_c <- slr_results$coef[slr_results$predictor %in% common]
mlr_c <- mlr_coefs[common]
plot(slr_c, mlr_c,
     xlab = "Simple Regression Coefficient",
     ylab = "Multiple Regression Coefficient",
     main = "Univariate vs. Multiple Regression Coefficients",
     pch = 19, col = "steelblue")
abline(0, 1, lty = 2, col = "red")
text(slr_c, mlr_c, labels = common, cex = 0.7, pos = 3)

Comment:
The coefficients differ substantially. nox (nitrogen oxides
concentration) has a large positive coefficient (≈ 31.2) in simple
regression but turns negative (≈ -10.0) in the multiple regression,
because many of the other predictors are correlated with nox.
poly_results <- data.frame(
  predictor = predictors,
  p_X2 = NA,
  p_X3 = NA
)
for (i in seq_along(predictors)) {
  v <- predictors[i]
  # Skip the binary variable chas
  if (length(unique(Boston[[v]])) <= 2) next
  fit3 <- lm(crim ~ poly(Boston[[v]], 3), data = Boston)
  s <- summary(fit3)$coefficients
  poly_results$p_X2[i] <- s[3, 4]
  poly_results$p_X3[i] <- s[4, 4]
}
poly_results$nonlinear <- with(poly_results,
  (!is.na(p_X2) & p_X2 < 0.05) | (!is.na(p_X3) & p_X3 < 0.05))
print(poly_results)

## predictor p_X2 p_X3 nonlinear
## 1 zn 4.420507e-03 2.295386e-01 TRUE
## 2 indus 1.086057e-03 1.196405e-12 TRUE
## 3 chas NA NA FALSE
## 4 nox 7.736755e-05 6.961110e-16 TRUE
## 5 rm 1.508545e-03 5.085751e-01 TRUE
## 6 age 2.291156e-06 6.679915e-03 TRUE
## 7 dis 7.869767e-14 1.088832e-08 TRUE
## 8 rad 9.120558e-03 4.823138e-01 TRUE
## 9 tax 3.665348e-06 2.438507e-01 TRUE
## 10 ptratio 2.405468e-03 6.300514e-03 TRUE
## 11 lstat 3.780418e-02 1.298906e-01 TRUE
## 12 medv 2.928577e-35 1.046510e-12 TRUE
Answer:
Several predictors show evidence of non-linear association with
crim (significant X² or X³ terms), including
indus, nox, age,
dis, ptratio, and medv. This
suggests polynomial or other non-linear models may be more appropriate
for those predictors.
Logistic regression using the Weekly data set.
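The summary output below lacks its generating call in the original;
presumably:

summary(Weekly)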
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly[, -9], col = ifelse(Weekly$Direction == "Up", "steelblue", "tomato"),
      main = "Weekly Data — Blue = Up, Red = Down")
## Down Up
## 484 605
##
## Down Up
## 0.4444444 0.5555556
Patterns:

- The market went Up about 55.6% of the time vs. Down 44.4%, a slight
  upward bias.
- Volume (average shares traded) increases notably over time,
  reflecting growth in market activity.
- The lag variables show no clear pattern with Direction, making
  prediction challenging.

glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Weekly,
                family = binomial)
summary(glm_full)

##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Answer:
Only Lag2 appears statistically significant (p ≈ 0.03). The
other lag variables and Volume are not significant
predictors of Direction.
glm_probs <- predict(glm_full, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, "Up", "Down")
conf_matrix <- table(Predicted = glm_pred, Actual = Weekly$Direction)
print(conf_matrix)
# The "Overall fraction correct" line below implies a call like this,
# restored here to match the output:
cat("Overall fraction correct:", mean(glm_pred == Weekly$Direction), "\n")

## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
##
## Overall fraction correct: 0.5610652
Interpretation:

- The model predicts Up the vast majority of the time, reflecting the
  base rate.
- It identifies most Up weeks correctly but does poorly on Down weeks
  (low sensitivity for "Down").

train <- Weekly$Year <= 2008
test <- !train
glm_lag2 <- glm(Direction ~ Lag2,
                data = Weekly,
                subset = train,
                family = binomial)
# Predict on held-out data (2009–2010)
test_probs <- predict(glm_lag2, newdata = Weekly[test, ], type = "response")
test_pred <- ifelse(test_probs > 0.5, "Up", "Down")
conf_test <- table(Predicted = test_pred, Actual = Weekly$Direction[test])
print(conf_test)
# The "Fraction correct on test set" line below implies a call like
# this, restored to match the output:
cat("Fraction correct on test set:", mean(test_pred == Weekly$Direction[test]), "\n")

## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
##
## Fraction correct on test set: 0.625
Interpretation:

- Using only Lag2 as a predictor and testing on held-out 2009–2010
  data, the model achieves approximately 62.5% correct predictions,
  better than the 56.1% the full model achieved on its own training
  data.
- Lag2 is the most informative predictor, and the simpler model
  generalizes better.

End of Midterm Homework