Chapter 2 — Exercise 9

data(Auto)
Auto <- na.omit(Auto)

(a) Quantitative vs. Qualitative Predictors

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Answer:

  • Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
  • Qualitative: name, origin (origin is coded numerically but represents region: 1=American, 2=European, 3=Japanese)

(b) Range of Each Quantitative Predictor

quant_vars <- Auto[, c("mpg", "cylinders", "displacement",
                        "horsepower", "weight", "acceleration", "year")]
sapply(quant_vars, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

(c) Mean and Standard Deviation of Each Quantitative Predictor

cat("--- Mean ---\n")
## --- Mean ---
sapply(quant_vars, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
cat("--- Standard Deviation ---\n")
## --- Standard Deviation ---
sapply(quant_vars, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

(d) Remove Observations 10–85, Then Summarise

auto_sub <- Auto[-(10:85), ]
quant_sub <- auto_sub[, c("mpg", "cylinders", "displacement",
                           "horsepower", "weight", "acceleration", "year")]

cat("--- Range (subset) ---\n")
## --- Range (subset) ---
sapply(quant_sub, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82
cat("--- Mean (subset) ---\n")
## --- Mean (subset) ---
sapply(quant_sub, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570
cat("--- SD (subset) ---\n")
## --- SD (subset) ---
sapply(quant_sub, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217

(e) Graphical Investigation of Predictors

pairs(quant_vars, main = "Scatterplot Matrix — Auto Dataset")

Comment: Several clear relationships emerge from the scatterplot matrix. displacement, weight, and horsepower are strongly positively correlated with each other. mpg shows a strong negative relationship with displacement, weight, and horsepower. year appears to have a mild positive relationship with mpg, suggesting cars have become more fuel-efficient over time.

(f) Variables Useful for Predicting mpg

par(mfrow = c(2, 3))
for (v in c("cylinders", "displacement", "horsepower", "weight", "acceleration", "year")) {
  plot(Auto[[v]], Auto$mpg, xlab = v, ylab = "mpg",
       main = paste("mpg vs", v), pch = 20, col = "steelblue")
  abline(lm(mpg ~ Auto[[v]], data = Auto), col = "red", lwd = 2)
}

par(mfrow = c(1, 1))

Comment: displacement, horsepower, and weight all show a strong negative relationship with mpg — heavier, more powerful cars tend to have lower fuel economy. year shows a positive trend, suggesting newer cars are more fuel-efficient. acceleration has a weak positive relationship. These variables (especially weight, horsepower, and displacement) would likely be most useful as predictors of mpg.


Chapter 3 — Exercise 9

This question involves multiple linear regression on the Auto data set.

(a) Scatterplot Matrix

pairs(Auto[, -9], main = "Scatterplot Matrix — Auto (all variables)")

(b) Correlation Matrix

cor(Auto[, -9])  # Exclude 'name' (qualitative)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Multiple Linear Regression: mpg ~ all predictors except name

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Yes. The overall F-statistic is large and the p-value is essentially zero, indicating that at least one predictor has a statistically significant relationship with mpg.

ii. Which predictors have a statistically significant relationship to the response?

At the 0.05 significance level: displacement, weight, year, and origin all show significant p-values. horsepower and acceleration are not significant when the others are included.

iii. What does the coefficient for year suggest?

The positive coefficient for year (approximately 0.75) suggests that, holding all other predictors constant, each additional year is associated with about 0.75 more miles per gallon — reflecting steady improvements in fuel efficiency over time.

(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))

Comment: The residuals vs. fitted plot shows a slight U-shaped (non-linear) pattern, suggesting the linear model may not fully capture the relationship. The Scale-Location plot shows some heteroscedasticity. Point 14 appears to have unusually high leverage in the leverage plot, and a few points (e.g., 323, 326, 327) are potential outliers.

(e) Interaction Effects

# Test some plausible interactions
lm_interact <- lm(mpg ~ . - name + displacement:weight + horsepower:weight, data = Auto)
summary(lm_interact)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower:weight, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9445 -1.6144 -0.1607  1.5992 12.0129 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.142e+00  4.723e+00   0.242 0.809055    
## cylinders            3.110e-02  2.921e-01   0.106 0.915276    
## displacement        -1.670e-02  1.960e-02  -0.852 0.394640    
## horsepower          -1.799e-01  4.796e-02  -3.752 0.000202 ***
## weight              -1.122e-02  7.281e-04 -15.406  < 2e-16 ***
## acceleration        -4.948e-02  9.447e-02  -0.524 0.600789    
## year                 7.748e-01  4.512e-02  17.174  < 2e-16 ***
## origin               7.315e-01  2.647e-01   2.763 0.005998 ** 
## displacement:weight  6.763e-06  5.494e-06   1.231 0.219098    
## horsepower:weight    4.082e-05  1.286e-05   3.173 0.001628 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.929 on 382 degrees of freedom
## Multiple R-squared:  0.8624, Adjusted R-squared:  0.8591 
## F-statistic:   266 on 9 and 382 DF,  p-value: < 2.2e-16

Comment: The interaction between displacement and weight and between horsepower and weight are worth exploring. Statistically significant interactions (p < 0.05) would indicate that the effect of one predictor on mpg depends on the value of another.

(f) Variable Transformations

# Log transformation
lm_log <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +
               acceleration + year + origin, data = Auto)
summary(lm_log)
## 
## Call:
## lm(formula = mpg ~ log(displacement) + log(horsepower) + log(weight) + 
##     acceleration + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5892 -1.7692 -0.0696  1.5646 12.8531 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       106.96196    9.88311  10.823  < 2e-16 ***
## log(displacement)   0.08762    1.05613   0.083 0.933923    
## log(horsepower)    -5.66760    1.56474  -3.622 0.000331 ***
## log(weight)       -13.99299    2.19174  -6.384 4.96e-10 ***
## acceleration       -0.19698    0.10271  -1.918 0.055867 .  
## year                0.72422    0.04697  15.419  < 2e-16 ***
## origin              0.91679    0.27198   3.371 0.000826 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.077 on 385 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8445 
## F-statistic:   355 on 6 and 385 DF,  p-value: < 2.2e-16
# Square root transformation
lm_sqrt <- lm(mpg ~ sqrt(horsepower) + sqrt(weight) +
                acceleration + year + origin, data = Auto)
summary(lm_sqrt)
## 
## Call:
## lm(formula = mpg ~ sqrt(horsepower) + sqrt(weight) + acceleration + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7979 -1.8869 -0.0932  1.7232 13.0860 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.73553    4.93400   0.960   0.3378    
## sqrt(horsepower) -0.59693    0.29991  -1.990   0.0473 *  
## sqrt(weight)     -0.57618    0.05805  -9.925  < 2e-16 ***
## acceleration     -0.07373    0.09989  -0.738   0.4609    
## year              0.72892    0.04890  14.905  < 2e-16 ***
## origin            1.02110    0.25469   4.009 7.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.208 on 386 degrees of freedom
## Multiple R-squared:  0.8332, Adjusted R-squared:  0.831 
## F-statistic: 385.6 on 5 and 386 DF,  p-value: < 2.2e-16
# Quadratic transformation
lm_quad <- lm(mpg ~ horsepower + I(horsepower^2) + weight +
                acceleration + year + origin, data = Auto)
summary(lm_quad)
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2) + weight + acceleration + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7149 -1.6679 -0.1828  1.5876 12.0605 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.215e+00  4.544e+00   0.267  0.78929    
## horsepower      -3.088e-01  3.298e-02  -9.361  < 2e-16 ***
## I(horsepower^2)  9.614e-04  9.774e-05   9.836  < 2e-16 ***
## weight          -3.445e-03  5.059e-04  -6.810 3.76e-11 ***
## acceleration    -3.123e-01  9.486e-02  -3.292  0.00109 ** 
## year             7.367e-01  4.568e-02  16.127  < 2e-16 ***
## origin           1.085e+00  2.349e-01   4.618 5.30e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.998 on 385 degrees of freedom
## Multiple R-squared:  0.8547, Adjusted R-squared:  0.8524 
## F-statistic: 377.4 on 6 and 385 DF,  p-value: < 2.2e-16

Comment: Log transformations of displacement, horsepower, and weight tend to improve the model fit (higher R²) and produce more linear residual patterns, consistent with the multiplicative nature of these physical quantities. The quadratic term for horsepower is also significant, capturing the curved relationship seen in the scatterplots.


Chapter 3 — Exercise 15

Predict per capita crime rate (crim) using the Boston data set.

data(Boston)

(a) Simple Linear Regression for Each Predictor

predictors <- setdiff(names(Boston), "crim")

# Fit simple regression for each predictor and store coefficients + p-values
slr_results <- lapply(predictors, function(var) {
  fit <- lm(as.formula(paste("crim ~", var)), data = Boston)
  s <- summary(fit)
  data.frame(
    predictor = var,
    coef      = coef(fit)[2],
    p_value   = s$coefficients[2, 4]
  )
})
slr_df <- do.call(rbind, slr_results)
print(slr_df)
##         predictor        coef      p_value
## zn             zn -0.07393498 5.506472e-06
## indus       indus  0.50977633 1.450349e-21
## chas         chas -1.89277655 2.094345e-01
## nox           nox 31.24853120 3.751739e-23
## rm             rm -2.68405122 6.346703e-07
## age           age  0.10778623 2.854869e-16
## dis           dis -1.55090168 8.519949e-19
## rad           rad  0.61791093 2.693844e-56
## tax           tax  0.02974225 2.357127e-47
## ptratio   ptratio  1.15198279 2.942922e-11
## lstat       lstat  0.54880478 2.654277e-27
## medv         medv -0.36315992 1.173987e-19
# Plot a few significant ones
par(mfrow = c(2, 3))
sig_preds <- slr_df$predictor[slr_df$p_value < 0.05][1:6]
for (v in sig_preds) {
  plot(Boston[[v]], Boston$crim, xlab = v, ylab = "crim",
       pch = 20, col = "tomato", main = paste("crim vs", v))
  abline(lm(as.formula(paste("crim ~", v)), data = Boston), col = "blue", lwd = 2)
}

par(mfrow = c(1, 1))

Comment: Almost all predictors show a statistically significant univariate association with crim. Variables like rad (accessibility to radial highways), tax (property tax rate), and lstat (% lower status population) have particularly strong associations.

(b) Multiple Regression

mlr_fit <- lm(crim ~ ., data = Boston)
summary(mlr_fit)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Comment: In the multiple regression model, only a subset of predictors remain significant: zn, dis, rad, black, and medv. Many predictors that were significant in simple regression lose significance when others are controlled for — a sign of multicollinearity.

(c) Univariate vs. Multiple Regression Coefficients

mlr_coefs <- coef(mlr_fit)[-1]  # exclude intercept

# Align
common <- intersect(names(mlr_coefs), slr_df$predictor)
slr_coefs_aligned <- slr_df$coef[match(common, slr_df$predictor)]
mlr_coefs_aligned <- mlr_coefs[common]

plot(slr_coefs_aligned, mlr_coefs_aligned,
     xlab = "Simple Regression Coefficients",
     ylab = "Multiple Regression Coefficients",
     main = "Univariate vs. Multiple Regression Coefficients",
     pch = 19, col = "steelblue")
abline(0, 1, lty = 2, col = "red")
text(slr_coefs_aligned, mlr_coefs_aligned, labels = common, cex = 0.7, pos = 3)

Comment: There is substantial divergence between the simple and multiple regression coefficients, especially for nox — a strong predictor in simple regression but with a very different coefficient in the multiple model. This reflects the confounding influence of correlated predictors.

(d) Non-linear Association (Cubic Polynomial)

poly_results <- lapply(predictors, function(var) {
  # Skip variables with fewer than 4 unique values (can't fit degree-3 poly)
  if (length(unique(Boston[[var]])) < 4) return(NULL)
  fit <- lm(as.formula(paste("crim ~ poly(", var, ", 3)")), data = Boston)
  s <- summary(fit)
  pvals <- s$coefficients[-1, 4]
  data.frame(
    predictor = var,
    p_linear  = pvals[1],
    p_quad    = ifelse(length(pvals) >= 2, pvals[2], NA),
    p_cubic   = ifelse(length(pvals) >= 3, pvals[3], NA)
  )
})
do.call(rbind, Filter(Negate(is.null), poly_results))
##                   predictor     p_linear       p_quad      p_cubic
## poly(zn, 3)1             zn 4.697806e-06 4.420507e-03 2.295386e-01
## poly(indus, 3)1       indus 8.854243e-24 1.086057e-03 1.196405e-12
## poly(nox, 3)1           nox 2.457491e-26 7.736755e-05 6.961110e-16
## poly(rm, 3)1             rm 5.128048e-07 1.508545e-03 5.085751e-01
## poly(age, 3)1           age 4.878803e-17 2.291156e-06 6.679915e-03
## poly(dis, 3)1           dis 1.253249e-21 7.869767e-14 1.088832e-08
## poly(rad, 3)1           rad 1.053211e-56 9.120558e-03 4.823138e-01
## poly(tax, 3)1           tax 6.976314e-49 3.665348e-06 2.438507e-01
## poly(ptratio, 3)1   ptratio 1.565484e-11 2.405468e-03 6.300514e-03
## poly(lstat, 3)1       lstat 1.678072e-27 3.780418e-02 1.298906e-01
## poly(medv, 3)1         medv 4.930818e-27 2.928577e-35 1.046510e-12

Comment: For many predictors (e.g., indus, nox, age, dis, ptratio, medv), the quadratic or cubic terms are statistically significant, suggesting non-linear relationships with crime rate.


Chapter 4 — Exercise 13

Using the Weekly data set to predict market direction via logistic regression.

data(Weekly)

(a) Numerical and Graphical Summaries

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly[, -9], col = ifelse(Weekly$Direction == "Up", "steelblue", "tomato"),
      main = "Weekly Data — Blue=Up, Red=Down")

par(mfrow = c(1, 2))
plot(Weekly$Year, Weekly$Volume, type = "l", col = "darkgreen",
     xlab = "Year", ylab = "Volume", main = "Trading Volume Over Time")
barplot(table(Weekly$Direction), col = c("tomato", "steelblue"),
        main = "Direction Counts", ylab = "Frequency")

par(mfrow = c(1, 1))

Comment: Trading volume has increased substantially over the 21-year period. The market went “Up” in approximately 56% of weeks. The lag variables (Lag1–Lag5) show little obvious pattern, but Volume appears to trend upward over time.

(b) Logistic Regression — Full Model

glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)
summary(glm_fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Comment: Only Lag2 appears to be statistically significant (p ≈ 0.03). The other lag variables and Volume are not significant predictors in this full model.

(c) Confusion Matrix — Full Model

glm_probs <- predict(glm_fit, type = "response")
glm_pred  <- ifelse(glm_probs > 0.5, "Up", "Down")

conf_mat <- table(Predicted = glm_pred, Actual = Weekly$Direction)
print(conf_mat)
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
accuracy <- mean(glm_pred == Weekly$Direction)
cat("\nOverall fraction of correct predictions:", round(accuracy, 4), "\n")
## 
## Overall fraction of correct predictions: 0.5611

Comment: The model achieves about 56% accuracy on the training data — not much better than always predicting “Up” (which itself would be correct ~56% of the time). The confusion matrix shows the model is biased toward predicting “Up,” correctly identifying most “Up” weeks but missing most “Down” weeks.

(d) Logistic Regression — Training 1990–2008, Test 2009–2010, Using Lag2 Only

train <- Weekly$Year <= 2008
test  <- Weekly[!train, ]

glm_lag2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
summary(glm_lag2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
# Predictions on held-out data
lag2_probs <- predict(glm_lag2, newdata = test, type = "response")
lag2_pred  <- ifelse(lag2_probs > 0.5, "Up", "Down")

conf_mat2 <- table(Predicted = lag2_pred, Actual = test$Direction)
print(conf_mat2)
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
accuracy2 <- mean(lag2_pred == test$Direction)
cat("\nTest accuracy (2009–2010):", round(accuracy2, 4), "\n")
## 
## Test accuracy (2009–2010): 0.625

Comment: Using only Lag2 as the predictor and evaluating on the held-out 2009–2010 data, the model achieves approximately 62.5% accuracy — an improvement over the full model trained on the entire dataset. This suggests Lag2 carries genuine predictive signal, and the simpler model generalises better.


End of Midterm Homework