1 Chapter 2 — Question 9 (Auto Dataset)

This exercise involves the Auto dataset. We first remove missing values.

Auto <- na.omit(Auto)

1.1 (a) Quantitative vs Qualitative Predictors

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
  • Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
  • Qualitative: origin (1 = American, 2 = European, 3 = Japanese), name

Although origin is stored as numeric (1, 2, 3), it represents categories with no meaningful order. It should be converted to a factor in R to ensure it is treated correctly in any modelling:

Auto$origin <- factor(Auto$origin, levels = c(1, 2, 3),
                      labels = c("American", "European", "Japanese"))

1.2 (b) Range of Each Quantitative Predictor

quant_vars <- Auto[, c("mpg", "cylinders", "displacement",
                        "horsepower", "weight", "acceleration", "year")]
sapply(quant_vars, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

1.3 (c) Mean and Standard Deviation

sapply(quant_vars, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
sapply(quant_vars, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

1.4 (d) Remove Rows 10–85 and Recompute

Auto_subset <- Auto[-(10:85), ]
quant_subset <- Auto_subset[, c("mpg", "cylinders", "displacement",
                                 "horsepower", "weight", "acceleration", "year")]
sapply(quant_subset, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82
sapply(quant_subset, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570
sapply(quant_subset, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217

After removing observations 10 through 85, the ranges, means, and standard deviations shift noticeably for some variables, indicating those rows contained values spread across the distribution.

1.5 (e) Graphical Investigation

pairs(quant_vars, main = "Scatterplot Matrix - Auto Dataset", col = "steelblue")

par(mfrow = c(1, 3))

plot(Auto$weight, Auto$mpg, pch = 16, col = "steelblue",
     xlab = "Weight", ylab = "MPG", main = "Weight vs MPG")

plot(Auto$horsepower, Auto$mpg, pch = 16, col = "tomato",
     xlab = "Horsepower", ylab = "MPG", main = "Horsepower vs MPG")

plot(Auto$year, Auto$mpg, pch = 16, col = "darkgreen",
     xlab = "Year", ylab = "MPG", main = "Year vs MPG")

par(mfrow = c(1, 1))

The scatterplot matrix reveals several clear patterns among the predictors. Weight, displacement, and horsepower are strongly correlated with each other, and all three show a negative relationship with mpg. The variable year shows a positive trend with mpg.

1.6 (f) Variables Useful for Predicting MPG

cor(quant_vars)[, "mpg"]
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    1.0000000   -0.7776175   -0.8051269   -0.7784268   -0.8322442    0.4233285 
##         year 
##    0.5805410

Based on the correlation values and scatterplots:

  • Weight, displacement, horsepower, cylinders have strong negative correlations with mpg — heavier and more powerful cars are less fuel efficient.
  • Year has a notable positive correlation — fuel efficiency has improved over time.
  • Acceleration has a mild positive correlation.

These variables would all be useful predictors of mpg.


2 Chapter 3 — Question 9 (Multiple Linear Regression on Auto)

2.1 (a) Scatterplot Matrix

pairs(Auto[, !(names(Auto) %in% c("name", "origin"))],
      main = "Scatterplot Matrix - Full Auto Dataset")

2.2 (b) Correlation Matrix

cor(Auto[, !(names(Auto) %in% c("name", "origin"))])  # exclude qualitative variables
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

2.3 (c) Multiple Linear Regression

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders      -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement    2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower     -1.818e-02  1.371e-02  -1.326 0.185488    
## weight         -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration    7.910e-02  9.822e-02   0.805 0.421101    
## year            7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originEuropean  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originJapanese  2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?
Yes. The F-statistic is large with a very small p-value, indicating a significant overall relationship.

ii. Which predictors are statistically significant?
displacement, weight, year, and origin all have p-values below 0.05 and are statistically significant.

iii. What does the year coefficient suggest?
The coefficient for year is approximately 0.75, meaning that holding all other variables constant, each additional model year is associated with an increase of ~0.75 mpg. This reflects improvements in fuel efficiency over time.

2.4 (d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))
  • Residuals vs Fitted: A slight non-linear pattern is visible, suggesting the linear model may not fully capture the relationship.
  • Scale-Location: Mild heteroscedasticity is present.
  • Leverage Plot: Observation 14 has unusually high leverage and should be investigated.
  • No extreme outliers are present but a few points deviate from the trend.

2.5 (e) Interaction Effects

lm_interact1 <- lm(mpg ~ displacement * weight, data = Auto)
summary(lm_interact1)
## 
## Call:
## lm(formula = mpg ~ displacement * weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.8664  -2.4801  -0.3355   1.8071  17.9429 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.372e+01  1.940e+00  27.697  < 2e-16 ***
## displacement        -7.831e-02  1.131e-02  -6.922 1.85e-11 ***
## weight              -8.931e-03  8.474e-04 -10.539  < 2e-16 ***
## displacement:weight  1.744e-05  2.789e-06   6.253 1.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.097 on 388 degrees of freedom
## Multiple R-squared:  0.7265, Adjusted R-squared:  0.7244 
## F-statistic: 343.6 on 3 and 388 DF,  p-value: < 2.2e-16
lm_interact2 <- lm(mpg ~ horsepower * weight, data = Auto)
summary(lm_interact2)
## 
## Call:
## lm(formula = mpg ~ horsepower * weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7725  -2.2074  -0.2708   1.9973  14.7314 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
## horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
## weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
## horsepower:weight  5.355e-05  6.649e-06   8.054 9.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared:  0.7484, Adjusted R-squared:  0.7465 
## F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16
lm_interact3 <- lm(mpg ~ acceleration + year + displacement:weight, data = Auto)
summary(lm_interact3)
## 
## Call:
## lm(formula = mpg ~ acceleration + year + displacement:weight, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7447 -2.7976 -0.3147  2.2772 15.6057 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.245e+01  4.827e+00  -4.651 4.53e-06 ***
## acceleration        -1.002e-01  8.917e-02  -1.123    0.262    
## year                 7.137e-01  6.115e-02  11.671  < 2e-16 ***
## displacement:weight -1.024e-05  4.839e-07 -21.157  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.108 on 388 degrees of freedom
## Multiple R-squared:  0.7251, Adjusted R-squared:  0.723 
## F-statistic: 341.1 on 3 and 388 DF,  p-value: < 2.2e-16

The interaction term displacement:weight is statistically significant, suggesting that the effect of displacement on mpg depends on the weight of the car. Specifically, the negative interaction coefficient indicates that the negative impact of displacement on fuel efficiency is even more severe in heavier vehicles — a large engine in a heavy car compounds the fuel efficiency penalty beyond what either variable alone would suggest. The horsepower:weight interaction is similarly significant and follows the same logic.

2.6 (f) Variable Transformations

lm_log  <- lm(mpg ~ log(horsepower), data = Auto)
lm_sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
lm_sq   <- lm(mpg ~ I(horsepower^2), data = Auto)

par(mfrow = c(1, 3))

plot(Auto$horsepower, Auto$mpg, pch = 16, col = "gray",
     main = "log(horsepower)", xlab = "Horsepower", ylab = "MPG")
lines(sort(Auto$horsepower),
      fitted(lm_log)[order(Auto$horsepower)], col = "red", lwd = 2)

plot(Auto$horsepower, Auto$mpg, pch = 16, col = "gray",
     main = "sqrt(horsepower)", xlab = "Horsepower", ylab = "MPG")
lines(sort(Auto$horsepower),
      fitted(lm_sqrt)[order(Auto$horsepower)], col = "blue", lwd = 2)

plot(Auto$horsepower, Auto$mpg, pch = 16, col = "gray",
     main = "horsepower^2", xlab = "Horsepower", ylab = "MPG")
lines(sort(Auto$horsepower),
      fitted(lm_sq)[order(Auto$horsepower)], col = "darkgreen", lwd = 2)

par(mfrow = c(1, 1))
cat("R² - log:  ", summary(lm_log)$r.squared, "\n")
## R² - log:   0.6683348
cat("R² - sqrt: ", summary(lm_sqrt)$r.squared, "\n")
## R² - sqrt:  0.6437036
cat("R² - sq:   ", summary(lm_sq)$r.squared, "\n")
## R² - sq:    0.507367

The log(horsepower) transformation achieves the highest R² and produces the best visual fit. This confirms that the relationship between horsepower and mpg is non-linear, and a log transformation appropriately captures this curvature.


3 Chapter 3 — Question 15 (Boston Dataset: Predicting Crime Rate)

boston_vars <- names(Boston)[-1]  # all except 'crim'

3.1 (a) Simple Linear Regression for Each Predictor

# Collect p-values for each simple regression
slr_pvals <- sapply(boston_vars, function(var) {
  formula <- as.formula(paste("crim ~", var))
  fit <- lm(formula, data = Boston)
  coef(summary(fit))[2, 4]
})
slr_pvals
##           zn        indus         chas          nox           rm          age 
## 5.506472e-06 1.450349e-21 2.094345e-01 3.751739e-23 6.346703e-07 2.854869e-16 
##          dis          rad          tax      ptratio        lstat         medv 
## 8.519949e-19 2.693844e-56 2.357127e-47 2.942922e-11 2.654277e-27 1.173987e-19
par(mfrow = c(1, 4))

for (var in c("lstat", "medv", "rad", "tax")) {
  plot(Boston[[var]], Boston$crim, pch = 16, col = "steelblue",
       xlab = var, ylab = "crim", main = paste("crim vs", var))
  abline(lm(as.formula(paste("crim ~", var)), data = Boston),
         col = "red", lwd = 2)
}

par(mfrow = c(1, 1))

Almost all predictors show a statistically significant simple association with crim. The strongest associations are with rad (index of accessibility to highways), tax, lstat, and medv.

3.2 (b) Multiple Linear Regression

mlr_fit <- lm(crim ~ ., data = Boston)
summary(mlr_fit)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

In the multiple regression model, we can reject H₀ (βj = 0) for: zn, dis, rad, black, and medv. Many predictors that were significant in simple regression lose significance here, likely due to multicollinearity.

3.3 (c) Univariate vs Multiple Regression Coefficients

slr_coefs <- sapply(boston_vars, function(var) {
  coef(lm(as.formula(paste("crim ~", var)), data = Boston))[2]
})

mlr_coefs <- coef(mlr_fit)[-1]

plot(slr_coefs, mlr_coefs,
     xlab = "Simple Regression Coefficients",
     ylab = "Multiple Regression Coefficients",
     main = "Univariate vs Multiple Regression Coefficients",
     pch = 16, col = "steelblue")
abline(0, 1, col = "red", lty = 2)
text(slr_coefs, mlr_coefs, labels = boston_vars, cex = 0.7, pos = 3)

The most striking difference is for nox, which has a large positive coefficient in simple regression but a negative coefficient in multiple regression. This is a sign of confounding — when other correlated predictors are included, the estimated effect of nox changes dramatically. This makes sense contextually: nitrogen oxide levels are closely tied to industrial zones and heavy traffic, which are themselves associated with higher crime rates, so in simple regression nox absorbs the effect of these omitted variables.

3.4 (d) Non-Linear Association (Polynomial Regression)

# Show p-values for quadratic and cubic terms for select variables
for (var in c("medv", "lstat", "dis", "nox")) {
  formula <- as.formula(paste("crim ~ poly(", var, ", 3)"))
  fit <- lm(formula, data = Boston)
  cat("\n---", var, "---\n")
  print(coef(summary(fit))[, 4])
}
## 
## --- medv ---
##    (Intercept) poly(medv, 3)1 poly(medv, 3)2 poly(medv, 3)3 
##   7.024110e-31   4.930818e-27   2.928577e-35   1.046510e-12 
## 
## --- lstat ---
##     (Intercept) poly(lstat, 3)1 poly(lstat, 3)2 poly(lstat, 3)3 
##    4.939398e-24    1.678072e-27    3.780418e-02    1.298906e-01 
## 
## --- dis ---
##   (Intercept) poly(dis, 3)1 poly(dis, 3)2 poly(dis, 3)3 
##  1.060226e-25  1.253249e-21  7.869767e-14  1.088832e-08 
## 
## --- nox ---
##   (Intercept) poly(nox, 3)1 poly(nox, 3)2 poly(nox, 3)3 
##  2.742908e-26  2.457491e-26  7.736755e-05  6.961110e-16

For several predictors (medv, lstat, dis, nox), the quadratic and/or cubic terms are statistically significant, providing evidence of non-linear relationships with crim.


4 Chapter 4 — Question 13 (Weekly Dataset: Logistic Regression)

4.1 (a) Numerical and Graphical Summaries

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
cor(Weekly[, -9])  # exclude Direction
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
par(mfrow = c(1, 2))

plot(Weekly$Volume, type = "l", col = "steelblue",
     xlab = "Index", ylab = "Volume",
     main = "Trading Volume Over Time")

plot(Weekly$Year, Weekly$Volume, pch = 16, col = "tomato",
     xlab = "Year", ylab = "Volume",
     main = "Volume by Year")

par(mfrow = c(1, 1))

Trading volume has increased substantially over time. The lag variables show very low correlations with each other and with Direction. The only strong correlation is between Year and Volume.

4.2 (b) Logistic Regression — Full Data

glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)
summary(glm_fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Only Lag2 appears statistically significant (p < 0.05). All other predictors, including Volume, are not significant.

4.3 (c) Confusion Matrix — Full Data

glm_probs <- predict(glm_fit, type = "response")
glm_pred  <- ifelse(glm_probs > 0.5, "Up", "Down")

table(Predicted = glm_pred, Actual = Weekly$Direction)
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
cat("\nOverall accuracy:", mean(glm_pred == Weekly$Direction), "\n")
## 
## Overall accuracy: 0.5610652

The overall accuracy is approximately 56%, only slightly better than random guessing. The confusion matrix reveals a clear imbalance in the model’s performance:

  • Sensitivity (ability to correctly predict “Up” weeks) is high — the model identifies most actual “Up” weeks correctly.
  • Specificity (ability to correctly predict “Down” weeks) is very low — the model misses the majority of actual “Down” weeks, predicting “Up” instead.

This means the model is biased toward the majority class (“Up”), making it unreliable for detecting market downturns despite its modest overall accuracy.

4.4 (d) Train on 1990–2008, Test on 2009–2010 (Lag2 only)

train <- Weekly$Year <= 2008
test  <- Weekly[!train, ]

glm_fit2 <- glm(Direction ~ Lag2,
                data = Weekly,
                family = binomial,
                subset = train)
summary(glm_fit2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
glm_probs2 <- predict(glm_fit2, newdata = test, type = "response")
glm_pred2  <- ifelse(glm_probs2 > 0.5, "Up", "Down")

table(Predicted = glm_pred2, Actual = test$Direction)
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
cat("\nTest accuracy:", mean(glm_pred2 == test$Direction), "\n")
## 
## Test accuracy: 0.625

Using only Lag2 as a predictor and evaluating on held-out data from 2009–2010, the test accuracy improves to approximately 62.5%. This is better than the full-data model, suggesting that a simpler model with a proper train/test split generalizes better. The model still tends to predict “Up” more frequently than “Down”.


End of Midterm Homework