data(Auto)
Auto <- na.omit(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Answer:
mpg,
cylinders, displacement,
horsepower, weight, acceleration,
yearname, origin
(origin is coded numerically but represents region: 1=American,
2=European, 3=Japanese)quant_vars <- Auto[, c("mpg", "cylinders", "displacement",
"horsepower", "weight", "acceleration", "year")]
sapply(quant_vars, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
cat("--- Mean ---\n")
## --- Mean ---
sapply(quant_vars, mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
cat("--- Standard Deviation ---\n")
## --- Standard Deviation ---
sapply(quant_vars, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
auto_sub <- Auto[-(10:85), ]
quant_sub <- auto_sub[, c("mpg", "cylinders", "displacement",
"horsepower", "weight", "acceleration", "year")]
cat("--- Range (subset) ---\n")
## --- Range (subset) ---
sapply(quant_sub, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0 3 68 46 1649 8.5 70
## [2,] 46.6 8 455 230 4997 24.8 82
cat("--- Mean (subset) ---\n")
## --- Mean (subset) ---
sapply(quant_sub, mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
cat("--- SD (subset) ---\n")
## --- SD (subset) ---
sapply(quant_sub, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
pairs(quant_vars, main = "Scatterplot Matrix — Auto Dataset")
Comment: Several clear relationships emerge from the
scatterplot matrix. displacement, weight, and
horsepower are strongly positively correlated with each
other. mpg shows a strong negative relationship with
displacement, weight, and
horsepower. year appears to have a mild
positive relationship with mpg, suggesting cars have become
more fuel-efficient over time.
mpgpar(mfrow = c(2, 3))
for (v in c("cylinders", "displacement", "horsepower", "weight", "acceleration", "year")) {
plot(Auto[[v]], Auto$mpg, xlab = v, ylab = "mpg",
main = paste("mpg vs", v), pch = 20, col = "steelblue")
abline(lm(mpg ~ Auto[[v]], data = Auto), col = "red", lwd = 2)
}
par(mfrow = c(1, 1))
Comment: displacement,
horsepower, and weight all show a strong
negative relationship with mpg — heavier, more powerful
cars tend to have lower fuel economy. year shows a positive
trend, suggesting newer cars are more fuel-efficient.
acceleration has a weak positive relationship. These
variables (especially weight, horsepower, and
displacement) would likely be most useful as predictors of
mpg.
This question involves multiple linear regression on the
Autodata set.
pairs(Auto[, -9], main = "Scatterplot Matrix — Auto (all variables)")
cor(Auto[, -9]) # Exclude 'name' (qualitative)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
mpg ~ all predictors
except namelm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
Yes. The overall F-statistic is large and the p-value is essentially
zero, indicating that at least one predictor has a statistically
significant relationship with mpg.
ii. Which predictors have a statistically significant relationship to the response?
At the 0.05 significance level: displacement,
weight, year, and origin all show
significant p-values. horsepower and
acceleration are not significant when the others are
included.
iii. What does the coefficient for year
suggest?
The positive coefficient for year (approximately 0.75)
suggests that, holding all other predictors constant, each additional
year is associated with about 0.75 more miles per gallon — reflecting
steady improvements in fuel efficiency over time.
par(mfrow = c(2, 2))
plot(lm_fit)
par(mfrow = c(1, 1))
Comment: The residuals vs. fitted plot shows a slight U-shaped (non-linear) pattern, suggesting the linear model may not fully capture the relationship. The Scale-Location plot shows some heteroscedasticity. Point 14 appears to have unusually high leverage in the leverage plot, and a few points (e.g., 323, 326, 327) are potential outliers.
# Test some plausible interactions
lm_interact <- lm(mpg ~ . - name + displacement:weight + horsepower:weight, data = Auto)
summary(lm_interact)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9445 -1.6144 -0.1607 1.5992 12.0129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.142e+00 4.723e+00 0.242 0.809055
## cylinders 3.110e-02 2.921e-01 0.106 0.915276
## displacement -1.670e-02 1.960e-02 -0.852 0.394640
## horsepower -1.799e-01 4.796e-02 -3.752 0.000202 ***
## weight -1.122e-02 7.281e-04 -15.406 < 2e-16 ***
## acceleration -4.948e-02 9.447e-02 -0.524 0.600789
## year 7.748e-01 4.512e-02 17.174 < 2e-16 ***
## origin 7.315e-01 2.647e-01 2.763 0.005998 **
## displacement:weight 6.763e-06 5.494e-06 1.231 0.219098
## horsepower:weight 4.082e-05 1.286e-05 3.173 0.001628 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.929 on 382 degrees of freedom
## Multiple R-squared: 0.8624, Adjusted R-squared: 0.8591
## F-statistic: 266 on 9 and 382 DF, p-value: < 2.2e-16
Comment: The interaction between
displacement and weight and between
horsepower and weight are worth exploring.
Statistically significant interactions (p < 0.05) would indicate that
the effect of one predictor on mpg depends on the value of
another.
# Log transformation
lm_log <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +
acceleration + year + origin, data = Auto)
summary(lm_log)
##
## Call:
## lm(formula = mpg ~ log(displacement) + log(horsepower) + log(weight) +
## acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5892 -1.7692 -0.0696 1.5646 12.8531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.96196 9.88311 10.823 < 2e-16 ***
## log(displacement) 0.08762 1.05613 0.083 0.933923
## log(horsepower) -5.66760 1.56474 -3.622 0.000331 ***
## log(weight) -13.99299 2.19174 -6.384 4.96e-10 ***
## acceleration -0.19698 0.10271 -1.918 0.055867 .
## year 0.72422 0.04697 15.419 < 2e-16 ***
## origin 0.91679 0.27198 3.371 0.000826 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.077 on 385 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8445
## F-statistic: 355 on 6 and 385 DF, p-value: < 2.2e-16
# Square root transformation
lm_sqrt <- lm(mpg ~ sqrt(horsepower) + sqrt(weight) +
acceleration + year + origin, data = Auto)
summary(lm_sqrt)
##
## Call:
## lm(formula = mpg ~ sqrt(horsepower) + sqrt(weight) + acceleration +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7979 -1.8869 -0.0932 1.7232 13.0860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.73553 4.93400 0.960 0.3378
## sqrt(horsepower) -0.59693 0.29991 -1.990 0.0473 *
## sqrt(weight) -0.57618 0.05805 -9.925 < 2e-16 ***
## acceleration -0.07373 0.09989 -0.738 0.4609
## year 0.72892 0.04890 14.905 < 2e-16 ***
## origin 1.02110 0.25469 4.009 7.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.208 on 386 degrees of freedom
## Multiple R-squared: 0.8332, Adjusted R-squared: 0.831
## F-statistic: 385.6 on 5 and 386 DF, p-value: < 2.2e-16
# Quadratic transformation
lm_quad <- lm(mpg ~ horsepower + I(horsepower^2) + weight +
acceleration + year + origin, data = Auto)
summary(lm_quad)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2) + weight + acceleration +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7149 -1.6679 -0.1828 1.5876 12.0605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.215e+00 4.544e+00 0.267 0.78929
## horsepower -3.088e-01 3.298e-02 -9.361 < 2e-16 ***
## I(horsepower^2) 9.614e-04 9.774e-05 9.836 < 2e-16 ***
## weight -3.445e-03 5.059e-04 -6.810 3.76e-11 ***
## acceleration -3.123e-01 9.486e-02 -3.292 0.00109 **
## year 7.367e-01 4.568e-02 16.127 < 2e-16 ***
## origin 1.085e+00 2.349e-01 4.618 5.30e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.998 on 385 degrees of freedom
## Multiple R-squared: 0.8547, Adjusted R-squared: 0.8524
## F-statistic: 377.4 on 6 and 385 DF, p-value: < 2.2e-16
Comment: Log transformations of
displacement, horsepower, and
weight tend to improve the model fit (higher R²) and
produce more linear residual patterns, consistent with the
multiplicative nature of these physical quantities. The quadratic term
for horsepower is also significant, capturing the curved
relationship seen in the scatterplots.
Predict per capita crime rate (
crim) using theBostondata set.
data(Boston)
predictors <- setdiff(names(Boston), "crim")
# Fit simple regression for each predictor and store coefficients + p-values
slr_results <- lapply(predictors, function(var) {
fit <- lm(as.formula(paste("crim ~", var)), data = Boston)
s <- summary(fit)
data.frame(
predictor = var,
coef = coef(fit)[2],
p_value = s$coefficients[2, 4]
)
})
slr_df <- do.call(rbind, slr_results)
print(slr_df)
## predictor coef p_value
## zn zn -0.07393498 5.506472e-06
## indus indus 0.50977633 1.450349e-21
## chas chas -1.89277655 2.094345e-01
## nox nox 31.24853120 3.751739e-23
## rm rm -2.68405122 6.346703e-07
## age age 0.10778623 2.854869e-16
## dis dis -1.55090168 8.519949e-19
## rad rad 0.61791093 2.693844e-56
## tax tax 0.02974225 2.357127e-47
## ptratio ptratio 1.15198279 2.942922e-11
## lstat lstat 0.54880478 2.654277e-27
## medv medv -0.36315992 1.173987e-19
# Plot a few significant ones
par(mfrow = c(2, 3))
sig_preds <- slr_df$predictor[slr_df$p_value < 0.05][1:6]
for (v in sig_preds) {
plot(Boston[[v]], Boston$crim, xlab = v, ylab = "crim",
pch = 20, col = "tomato", main = paste("crim vs", v))
abline(lm(as.formula(paste("crim ~", v)), data = Boston), col = "blue", lwd = 2)
}
par(mfrow = c(1, 1))
Comment: Almost all predictors show a statistically
significant univariate association with crim. Variables
like rad (accessibility to radial highways),
tax (property tax rate), and lstat (% lower
status population) have particularly strong associations.
mlr_fit <- lm(crim ~ ., data = Boston)
summary(mlr_fit)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
Comment: In the multiple regression model, only a
subset of predictors remain significant: zn,
dis, rad, black, and
medv. Many predictors that were significant in simple
regression lose significance when others are controlled for — a sign of
multicollinearity.
mlr_coefs <- coef(mlr_fit)[-1] # exclude intercept
# Align
common <- intersect(names(mlr_coefs), slr_df$predictor)
slr_coefs_aligned <- slr_df$coef[match(common, slr_df$predictor)]
mlr_coefs_aligned <- mlr_coefs[common]
plot(slr_coefs_aligned, mlr_coefs_aligned,
xlab = "Simple Regression Coefficients",
ylab = "Multiple Regression Coefficients",
main = "Univariate vs. Multiple Regression Coefficients",
pch = 19, col = "steelblue")
abline(0, 1, lty = 2, col = "red")
text(slr_coefs_aligned, mlr_coefs_aligned, labels = common, cex = 0.7, pos = 3)
Comment: There is substantial divergence between the
simple and multiple regression coefficients, especially for
nox — a strong predictor in simple regression but with a
very different coefficient in the multiple model. This reflects the
confounding influence of correlated predictors.
poly_results <- lapply(predictors, function(var) {
# Skip variables with fewer than 4 unique values (can't fit degree-3 poly)
if (length(unique(Boston[[var]])) < 4) return(NULL)
fit <- lm(as.formula(paste("crim ~ poly(", var, ", 3)")), data = Boston)
s <- summary(fit)
pvals <- s$coefficients[-1, 4]
data.frame(
predictor = var,
p_linear = pvals[1],
p_quad = ifelse(length(pvals) >= 2, pvals[2], NA),
p_cubic = ifelse(length(pvals) >= 3, pvals[3], NA)
)
})
do.call(rbind, Filter(Negate(is.null), poly_results))
## predictor p_linear p_quad p_cubic
## poly(zn, 3)1 zn 4.697806e-06 4.420507e-03 2.295386e-01
## poly(indus, 3)1 indus 8.854243e-24 1.086057e-03 1.196405e-12
## poly(nox, 3)1 nox 2.457491e-26 7.736755e-05 6.961110e-16
## poly(rm, 3)1 rm 5.128048e-07 1.508545e-03 5.085751e-01
## poly(age, 3)1 age 4.878803e-17 2.291156e-06 6.679915e-03
## poly(dis, 3)1 dis 1.253249e-21 7.869767e-14 1.088832e-08
## poly(rad, 3)1 rad 1.053211e-56 9.120558e-03 4.823138e-01
## poly(tax, 3)1 tax 6.976314e-49 3.665348e-06 2.438507e-01
## poly(ptratio, 3)1 ptratio 1.565484e-11 2.405468e-03 6.300514e-03
## poly(lstat, 3)1 lstat 1.678072e-27 3.780418e-02 1.298906e-01
## poly(medv, 3)1 medv 4.930818e-27 2.928577e-35 1.046510e-12
Comment: For many predictors (e.g.,
indus, nox, age,
dis, ptratio, medv), the
quadratic or cubic terms are statistically significant, suggesting
non-linear relationships with crime rate.
Using the
Weeklydata set to predict market direction via logistic regression.
data(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly[, -9], col = ifelse(Weekly$Direction == "Up", "steelblue", "tomato"),
main = "Weekly Data — Blue=Up, Red=Down")
par(mfrow = c(1, 2))
plot(Weekly$Year, Weekly$Volume, type = "l", col = "darkgreen",
xlab = "Year", ylab = "Volume", main = "Trading Volume Over Time")
barplot(table(Weekly$Direction), col = c("tomato", "steelblue"),
main = "Direction Counts", ylab = "Frequency")
par(mfrow = c(1, 1))
Comment: Trading volume has increased substantially over the 21-year period. The market went “Up” in approximately 56% of weeks. The lag variables (Lag1–Lag5) show little obvious pattern, but Volume appears to trend upward over time.
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm_fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Comment: Only Lag2 appears to be
statistically significant (p ≈ 0.03). The other lag variables and Volume
are not significant predictors in this full model.
glm_probs <- predict(glm_fit, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, "Up", "Down")
conf_mat <- table(Predicted = glm_pred, Actual = Weekly$Direction)
print(conf_mat)
## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
accuracy <- mean(glm_pred == Weekly$Direction)
cat("\nOverall fraction of correct predictions:", round(accuracy, 4), "\n")
##
## Overall fraction of correct predictions: 0.5611
Comment: The model achieves about 56% accuracy on the training data — not much better than always predicting “Up” (which itself would be correct ~56% of the time). The confusion matrix shows the model is biased toward predicting “Up,” correctly identifying most “Up” weeks but missing most “Down” weeks.
train <- Weekly$Year <= 2008
test <- Weekly[!train, ]
glm_lag2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
summary(glm_lag2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
# Predictions on held-out data
lag2_probs <- predict(glm_lag2, newdata = test, type = "response")
lag2_pred <- ifelse(lag2_probs > 0.5, "Up", "Down")
conf_mat2 <- table(Predicted = lag2_pred, Actual = test$Direction)
print(conf_mat2)
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
accuracy2 <- mean(lag2_pred == test$Direction)
cat("\nTest accuracy (2009–2010):", round(accuracy2, 4), "\n")
##
## Test accuracy (2009–2010): 0.625
Comment: Using only Lag2 as the
predictor and evaluating on the held-out 2009–2010 data, the model
achieves approximately 62.5% accuracy — an improvement over the full
model trained on the entire dataset. This suggests Lag2
carries genuine predictive signal, and the simpler model generalises
better.
End of Midterm Homework