This exercise involves the
Autodata set. Missing values have been removed.
## [1] 392 9
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Quantitative predictors: mpg,
cylinders, displacement,
horsepower, weight, acceleration,
year
Qualitative predictor: name (car name,
a character/factor), and origin is often treated as
qualitative (1 = American, 2 = European, 3 = Japanese) even though it is
stored as integer.
quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
"weight", "acceleration", "year")
sapply(Auto[, quant_vars], range)## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
## mpg cylinders displacement horsepower weight acceleration
## Mean 23.445918 5.471939 194.412 104.46939 2977.5842 15.541327
## SD 7.805007 1.705783 104.644 38.49116 849.4026 2.758864
## year
## Mean 75.979592
## SD 3.683737
Auto_sub <- Auto[-(10:85), ]
rbind(
Range_Min = sapply(Auto_sub[, quant_vars], min),
Range_Max = sapply(Auto_sub[, quant_vars], max),
Mean = sapply(Auto_sub[, quant_vars], mean),
SD = sapply(Auto_sub[, quant_vars], sd)
)## mpg cylinders displacement horsepower weight acceleration
## Range_Min 11.000000 3.000000 68.00000 46.00000 1649.0000 8.500000
## Range_Max 46.600000 8.000000 455.00000 230.00000 4997.0000 24.800000
## Mean 24.404430 5.373418 187.24051 100.72152 2935.9715 15.726899
## SD 7.867283 1.654179 99.67837 35.70885 811.3002 2.693721
## year
## Range_Min 70.000000
## Range_Max 82.000000
## Mean 77.145570
## SD 3.106217
Removing rows 10–85 slightly changes the ranges, means, and standard deviations for every predictor, since those 76 observations are no longer included in the calculations.
ggpairs(Auto[, quant_vars],
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.6)),
diag = list(continuous = wrap("densityDiag")),
upper = list(continuous = wrap("cor", size = 3))) +
theme_bw(base_size = 9) +
ggtitle("Scatterplot Matrix – Auto Dataset (Quantitative Variables)")Findings:
displacement, horsepower, and
weight are strongly positively correlated with each
other.mpg.cylinders closely tracks displacement and
weight.year shows a moderate positive correlation with
mpg, suggesting newer cars tend to be more
fuel-efficient.acceleration is only weakly correlated with most other
variables.mpgp1 <- ggplot(Auto, aes(x = weight, y = mpg)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", se = TRUE, colour = "steelblue") +
labs(title = "mpg vs. weight") + theme_bw()
p2 <- ggplot(Auto, aes(x = year, y = mpg)) +
geom_jitter(alpha = 0.4, width = 0.3) +
geom_smooth(method = "lm", se = TRUE, colour = "tomato") +
labs(title = "mpg vs. year") + theme_bw()
gridExtra::grid.arrange(p1, p2, ncol = 2)Conclusion: weight,
displacement, horsepower, and
cylinders all appear highly useful for predicting
mpg (strong negative relationships). year also
shows a clear positive trend (newer cars are more efficient).
acceleration has a weaker, less consistent relationship
with mpg.
Multiple linear regression on the
Autodata set.
ggpairs(Auto[, quant_vars],
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.6)),
diag = list(continuous = wrap("densityDiag")),
upper = list(continuous = wrap("cor", size = 3))) +
theme_bw(base_size = 9) +
ggtitle("Scatterplot Matrix – Auto (Chapter 3)")## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
## year 0.581 -0.346 -0.370 -0.416 -0.309 0.290
## year
## mpg 0.581
## cylinders -0.346
## displacement -0.370
## horsepower -0.416
## weight -0.309
## acceleration 0.290
## year 1.000
The correlation matrix confirms the strong negative relationships
between mpg and displacement,
horsepower, and weight, and positive
relationship between mpg and year.
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
Yes. The overall F-statistic is very large with an extremely small
p-value, indicating that at least one predictor has a statistically
significant relationship with mpg.
ii. Which predictors appear statistically significant?
At the 5% significance level, displacement,
weight, year, and origin all have
p-values well below 0.05. horsepower is marginally
significant. cylinders and acceleration are
not significant in the presence of the other predictors.
iii. What does the coefficient for year
suggest?
The positive coefficient for year (approximately +0.75)
indicates that, holding all other predictors constant, each additional
model year is associated with an increase of about 0.75 miles per
gallon. This reflects the trend toward more fuel-efficient vehicles over
the sample period.
Comments:
# Test a few theoretically motivated interactions
lm_int <- lm(mpg ~ displacement + weight + year + origin +
acceleration + horsepower +
displacement:weight +
year:origin,
data = Auto)
summary(lm_int)##
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin + acceleration +
## horsepower + displacement:weight + year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9885 -1.7706 -0.0383 1.4555 12.3455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.889e+01 8.075e+00 2.340 0.019806 *
## displacement -6.954e-02 9.080e-03 -7.658 1.56e-13 ***
## weight -1.044e-02 6.863e-04 -15.215 < 2e-16 ***
## year 4.720e-01 1.000e-01 4.718 3.34e-06 ***
## origin -1.399e+01 4.173e+00 -3.351 0.000885 ***
## acceleration 7.653e-02 8.666e-02 0.883 0.377712
## horsepower -3.029e-02 1.217e-02 -2.490 0.013215 *
## displacement:weight 2.236e-05 2.175e-06 10.281 < 2e-16 ***
## year:origin 1.873e-01 5.357e-02 3.496 0.000527 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.918 on 383 degrees of freedom
## Multiple R-squared: 0.8631, Adjusted R-squared: 0.8602
## F-statistic: 301.7 on 8 and 383 DF, p-value: < 2.2e-16
The interaction displacement:weight is statistically
significant (p < 0.05), indicating that the effect of engine
displacement on fuel economy depends on the weight of the vehicle. The
year:origin interaction is also worth noting – the rate of
improvement in fuel economy over the years differs by region of
manufacture.
# Log transformation of horsepower
lm_log <- lm(mpg ~ log(horsepower) + weight + year + origin, data = Auto)
# Square root of displacement
lm_sqrt <- lm(mpg ~ sqrt(displacement) + weight + year + origin, data = Auto)
# Quadratic horsepower
lm_quad <- lm(mpg ~ horsepower + I(horsepower^2) + weight + year + origin, data = Auto)
data.frame(
Model = c("log(horsepower)", "sqrt(displacement)", "hp + hp^2"),
Adj_R2 = c(summary(lm_log)$adj.r.squared,
summary(lm_sqrt)$adj.r.squared,
summary(lm_quad)$adj.r.squared)
)## Model Adj_R2
## 1 log(horsepower) 0.8258454
## 2 sqrt(displacement) 0.8159244
## 3 hp + hp^2 0.8486682
All three transformations improve model fit compared to a purely
linear specification. The log(horsepower) model achieves a
high adjusted R², confirming that the relationship between
mpg and horsepower is non-linear (diminishing
returns at higher power levels). The quadratic horsepower
model similarly captures this curvature.
Predicting per capita crime rate (
crim) using theBostondata set.
# Fit a simple regression for every predictor and collect coefficients + p-values
simple_results <- lapply(predictors, function(var) {
fit <- lm(reformulate(var, "crim"), data = Boston)
s <- summary(fit)
data.frame(
predictor = var,
coef = coef(fit)[2],
p_value = s$coefficients[2, 4],
row.names = NULL
)
})
simple_df <- do.call(rbind, simple_results)
simple_df$significant <- simple_df$p_value < 0.05
print(simple_df)## predictor coef p_value significant
## 1 zn -0.07393498 5.506472e-06 TRUE
## 2 indus 0.50977633 1.450349e-21 TRUE
## 3 chas -1.89277655 2.094345e-01 FALSE
## 4 nox 31.24853120 3.751739e-23 TRUE
## 5 rm -2.68405122 6.346703e-07 TRUE
## 6 age 0.10778623 2.854869e-16 TRUE
## 7 dis -1.55090168 8.519949e-19 TRUE
## 8 rad 0.61791093 2.693844e-56 TRUE
## 9 tax 0.02974225 2.357127e-47 TRUE
## 10 ptratio 1.15198279 2.942922e-11 TRUE
## 11 lstat 0.54880478 2.654277e-27 TRUE
## 12 medv -0.36315992 1.173987e-19 TRUE
# Plot for two illustrative predictors
p_lstat <- ggplot(Boston, aes(x = lstat, y = crim)) +
geom_point(alpha = 0.4) + geom_smooth(method = "lm", colour = "steelblue") +
labs(title = "crim ~ lstat") + theme_bw()
p_medv <- ggplot(Boston, aes(x = medv, y = crim)) +
geom_point(alpha = 0.4) + geom_smooth(method = "lm", colour = "tomato") +
labs(title = "crim ~ medv") + theme_bw()
gridExtra::grid.arrange(p_lstat, p_medv, ncol = 2)Almost all predictors show a statistically significant univariate
association with crim. Exceptions (if any) tend to be
chas (Charles River dummy variable), which is not
significant.
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
In the multiple regression, far fewer predictors remain statistically
significant. We can reject H₀: βⱼ = 0 for zn,
dis, rad, black, and
medv at conventional significance levels. Many predictors
that were significant in simple regressions lose significance once the
other variables are controlled for, indicating collinearity.
multi_coefs <- coef(lm_multi)[-1] # exclude intercept
# Align with simple_df
plot_df <- simple_df %>%
filter(predictor %in% names(multi_coefs)) %>%
mutate(multi_coef = multi_coefs[predictor])
ggplot(plot_df, aes(x = coef, y = multi_coef, label = predictor)) +
geom_point(colour = "steelblue", size = 3) +
ggrepel::geom_text_repel(size = 3) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
labs(x = "Univariate coefficient",
y = "Multiple regression coefficient",
title = "Simple vs. Multiple Regression Coefficients – Boston") +
theme_bw()The coefficients differ substantially between univariate and multiple
models, particularly for nox and rad,
illustrating the confounding effect of correlated predictors.
poly_results <- lapply(predictors, function(var) {
# Skip binary variable chas
if (length(unique(Boston[[var]])) <= 2) return(NULL)
fit <- lm(reformulate(c(var, paste0("I(", var, "^2)"), paste0("I(", var, "^3)")),
"crim"), data = Boston)
s <- summary(fit)$coefficients
data.frame(
predictor = var,
p_X2 = round(s[3, 4], 4), # p-value of X^2 term
p_X3 = round(s[4, 4], 4), # p-value of X^3 term
row.names = NULL
)
})
poly_df <- do.call(rbind, poly_results)
poly_df$nonlinear <- (poly_df$p_X2 < 0.05 | poly_df$p_X3 < 0.05)
print(poly_df)## predictor p_X2 p_X3 nonlinear
## 1 zn 0.0938 0.2295 FALSE
## 2 indus 0.0000 0.0000 TRUE
## 3 nox 0.0000 0.0000 TRUE
## 4 rm 0.3641 0.5086 FALSE
## 5 age 0.0474 0.0067 TRUE
## 6 dis 0.0000 0.0000 TRUE
## 7 rad 0.6130 0.4823 FALSE
## 8 tax 0.1375 0.2439 FALSE
## 9 ptratio 0.0041 0.0063 TRUE
## 10 lstat 0.0646 0.1299 FALSE
## 11 medv 0.0000 0.0000 TRUE
Several predictors show statistically significant polynomial (X² or
X³) terms, indicating non-linear associations with crime rate. Notable
examples typically include dis, nox,
lstat, and medv.
Logistic regression using the
Weeklydata set (1990–2010).
## [1] 1089 9
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
## Lag1 Lag2 Lag3 Lag4
## Min. :-18.1950 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580 1st Qu.: -1.1580
## Median : 0.2410 Median : 0.2410 Median : 0.2410 Median : 0.2380
## Mean : 0.1506 Mean : 0.1511 Mean : 0.1472 Mean : 0.1458
## 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. : 12.0260 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag5 Volume Today
## Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. :9.32821 Max. : 12.0260
# Volume over time
ggplot(Weekly, aes(x = Year, y = Volume)) +
geom_jitter(alpha = 0.3, width = 0.3, colour = "steelblue") +
geom_smooth(method = "loess", colour = "tomato", se = FALSE) +
labs(title = "Trading Volume Over Time – Weekly Data",
x = "Year", y = "Volume (billions of shares)") +
theme_bw()Patterns: Trading volume increased substantially
from 1990 to around 2004–2005 and then leveled off. The lag returns
(Lag1 through Lag5) show little obvious trend,
as expected for financial returns. The market was more likely to go
Up than Down across the full period (roughly
55% Up weeks).
glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly,
family = binomial)
summary(glm_full)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Significant predictors: Only Lag2 has a
p-value below 0.05. The other lag variables and Volume are
not statistically significant in the presence of all predictors.
prob_full <- predict(glm_full, type = "response")
pred_full <- ifelse(prob_full > 0.5, "Up", "Down")
conf_full <- table(Predicted = pred_full, Actual = Weekly$Direction)
conf_full## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
# Overall fraction of correct predictions
accuracy_full <- mean(pred_full == Weekly$Direction)
cat("Overall accuracy (full model, training data):", round(accuracy_full, 4), "\n")## Overall accuracy (full model, training data): 0.5611
Interpretation: The confusion matrix reveals that
the model correctly predicts most Up weeks but struggles
significantly with Down weeks — it almost always predicts
Up. This is consistent with the imbalanced base rate (more
Up weeks than Down). The overall training accuracy of ~56% is only
marginally better than the naive strategy of always predicting
Up.
train <- Weekly$Year <= 2008
test <- Weekly[!train, ]
glm_lag2 <- glm(Direction ~ Lag2,
data = Weekly,
subset = train,
family = binomial)
prob_test <- predict(glm_lag2, newdata = test, type = "response")
pred_test <- ifelse(prob_test > 0.5, "Up", "Down")
conf_test <- table(Predicted = pred_test, Actual = test$Direction)
conf_test## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
accuracy_test <- mean(pred_test == test$Direction)
cat("Overall accuracy (Lag2 model, test 2009-2010):", round(accuracy_test, 4), "\n")## Overall accuracy (Lag2 model, test 2009-2010): 0.625
Interpretation: Using only Lag2 as a
predictor and evaluating on held-out data from 2009–2010, the model
achieves an accuracy of approximately 62.5%. This is a
meaningful improvement over random guessing and suggests that
Lag2 carries some genuine predictive signal for the
direction of weekly returns.
End of Midterm Homework