Chapter 2 – Exercise 9

This exercise involves the Auto data set. Missing values have been removed.

data(Auto)
Auto <- na.omit(Auto)
dim(Auto)
## [1] 392   9
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

(a) Quantitative vs. Qualitative Predictors

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative predictor: name (car name, a character/factor), and origin is often treated as qualitative (1 = American, 2 = European, 3 = Japanese) even though it is stored as integer.

(b) Range of Each Quantitative Predictor

quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

sapply(Auto[, quant_vars], range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

(c) Mean and Standard Deviation of Each Quantitative Predictor

rbind(
  Mean = sapply(Auto[, quant_vars], mean),
  SD   = sapply(Auto[, quant_vars], sd)
)
##            mpg cylinders displacement horsepower    weight acceleration
## Mean 23.445918  5.471939      194.412  104.46939 2977.5842    15.541327
## SD    7.805007  1.705783      104.644   38.49116  849.4026     2.758864
##           year
## Mean 75.979592
## SD    3.683737

(d) Stats After Removing Observations 10–85

Auto_sub <- Auto[-(10:85), ]

rbind(
  Range_Min = sapply(Auto_sub[, quant_vars], min),
  Range_Max = sapply(Auto_sub[, quant_vars], max),
  Mean      = sapply(Auto_sub[, quant_vars], mean),
  SD        = sapply(Auto_sub[, quant_vars], sd)
)
##                 mpg cylinders displacement horsepower    weight acceleration
## Range_Min 11.000000  3.000000     68.00000   46.00000 1649.0000     8.500000
## Range_Max 46.600000  8.000000    455.00000  230.00000 4997.0000    24.800000
## Mean      24.404430  5.373418    187.24051  100.72152 2935.9715    15.726899
## SD         7.867283  1.654179     99.67837   35.70885  811.3002     2.693721
##                year
## Range_Min 70.000000
## Range_Max 82.000000
## Mean      77.145570
## SD         3.106217

Removing rows 10–85 slightly changes the ranges, means, and standard deviations for every predictor, since those 76 observations are no longer included in the calculations.

(e) Graphical Investigation of Predictors

ggpairs(Auto[, quant_vars],
        lower  = list(continuous = wrap("points", alpha = 0.3, size = 0.6)),
        diag   = list(continuous = wrap("densityDiag")),
        upper  = list(continuous = wrap("cor", size = 3))) +
  theme_bw(base_size = 9) +
  ggtitle("Scatterplot Matrix – Auto Dataset (Quantitative Variables)")

Findings:

  • displacement, horsepower, and weight are strongly positively correlated with each other.
  • All three of those variables are strongly negatively correlated with mpg.
  • cylinders closely tracks displacement and weight.
  • year shows a moderate positive correlation with mpg, suggesting newer cars tend to be more fuel-efficient.
  • acceleration is only weakly correlated with most other variables.

(f) Predicting mpg

p1 <- ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = TRUE, colour = "steelblue") +
  labs(title = "mpg vs. weight") + theme_bw()

p2 <- ggplot(Auto, aes(x = year, y = mpg)) +
  geom_jitter(alpha = 0.4, width = 0.3) +
  geom_smooth(method = "lm", se = TRUE, colour = "tomato") +
  labs(title = "mpg vs. year") + theme_bw()

gridExtra::grid.arrange(p1, p2, ncol = 2)

Conclusion: weight, displacement, horsepower, and cylinders all appear highly useful for predicting mpg (strong negative relationships). year also shows a clear positive trend (newer cars are more efficient). acceleration has a weaker, less consistent relationship with mpg.


Chapter 3 – Exercise 9

Multiple linear regression on the Auto data set.

(a) Scatterplot Matrix

ggpairs(Auto[, quant_vars],
        lower = list(continuous = wrap("points", alpha = 0.3, size = 0.6)),
        diag  = list(continuous = wrap("densityDiag")),
        upper = list(continuous = wrap("cor", size = 3))) +
  theme_bw(base_size = 9) +
  ggtitle("Scatterplot Matrix – Auto (Chapter 3)")

(b) Correlation Matrix

round(cor(Auto[, quant_vars]), 3)
##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
##                year
## mpg           0.581
## cylinders    -0.346
## displacement -0.370
## horsepower   -0.416
## weight       -0.309
## acceleration  0.290
## year          1.000

The correlation matrix confirms the strong negative relationships between mpg and displacement, horsepower, and weight, and positive relationship between mpg and year.

(c) Multiple Linear Regression

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Yes. The overall F-statistic is very large with an extremely small p-value, indicating that at least one predictor has a statistically significant relationship with mpg.

ii. Which predictors appear statistically significant?

At the 5% significance level, displacement, weight, year, and origin all have p-values well below 0.05. horsepower is marginally significant. cylinders and acceleration are not significant in the presence of the other predictors.

iii. What does the coefficient for year suggest?

The positive coefficient for year (approximately +0.75) indicates that, holding all other predictors constant, each additional model year is associated with an increase of about 0.75 miles per gallon. This reflects the trend toward more fuel-efficient vehicles over the sample period.

(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))

Comments:

  • The Residuals vs Fitted plot shows a slight non-linear pattern (U-shape), suggesting the linear model may not capture all the curvature in the data.
  • The Scale-Location plot reveals mild heteroscedasticity.
  • The Normal Q-Q plot is mostly linear but shows some deviation in the tails.
  • The Residuals vs Leverage plot identifies observation 14 as having unusually high leverage. A few points have moderately large residuals but none are extreme outliers dramatically influencing the fit.

(e) Interaction Effects

# Test a few theoretically motivated interactions
lm_int <- lm(mpg ~ displacement + weight + year + origin +
               acceleration + horsepower +
               displacement:weight +
               year:origin,
             data = Auto)
summary(lm_int)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin + acceleration + 
##     horsepower + displacement:weight + year:origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9885 -1.7706 -0.0383  1.4555 12.3455 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.889e+01  8.075e+00   2.340 0.019806 *  
## displacement        -6.954e-02  9.080e-03  -7.658 1.56e-13 ***
## weight              -1.044e-02  6.863e-04 -15.215  < 2e-16 ***
## year                 4.720e-01  1.000e-01   4.718 3.34e-06 ***
## origin              -1.399e+01  4.173e+00  -3.351 0.000885 ***
## acceleration         7.653e-02  8.666e-02   0.883 0.377712    
## horsepower          -3.029e-02  1.217e-02  -2.490 0.013215 *  
## displacement:weight  2.236e-05  2.175e-06  10.281  < 2e-16 ***
## year:origin          1.873e-01  5.357e-02   3.496 0.000527 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.918 on 383 degrees of freedom
## Multiple R-squared:  0.8631, Adjusted R-squared:  0.8602 
## F-statistic: 301.7 on 8 and 383 DF,  p-value: < 2.2e-16

The interaction displacement:weight is statistically significant (p < 0.05), indicating that the effect of engine displacement on fuel economy depends on the weight of the vehicle. The year:origin interaction is also worth noting – the rate of improvement in fuel economy over the years differs by region of manufacture.

(f) Variable Transformations

# Log transformation of horsepower
lm_log   <- lm(mpg ~ log(horsepower) + weight + year + origin, data = Auto)

# Square root of displacement
lm_sqrt  <- lm(mpg ~ sqrt(displacement) + weight + year + origin, data = Auto)

# Quadratic horsepower
lm_quad  <- lm(mpg ~ horsepower + I(horsepower^2) + weight + year + origin, data = Auto)

data.frame(
  Model      = c("log(horsepower)", "sqrt(displacement)", "hp + hp^2"),
  Adj_R2     = c(summary(lm_log)$adj.r.squared,
                 summary(lm_sqrt)$adj.r.squared,
                 summary(lm_quad)$adj.r.squared)
)
##                Model    Adj_R2
## 1    log(horsepower) 0.8258454
## 2 sqrt(displacement) 0.8159244
## 3          hp + hp^2 0.8486682

All three transformations improve model fit compared to a purely linear specification. The log(horsepower) model achieves a high adjusted R², confirming that the relationship between mpg and horsepower is non-linear (diminishing returns at higher power levels). The quadratic horsepower model similarly captures this curvature.


Chapter 3 – Exercise 15

Predicting per capita crime rate (crim) using the Boston data set.

data(Boston)
predictors <- setdiff(names(Boston), "crim")

(a) Simple Linear Regression for Each Predictor

# Fit a simple regression for every predictor and collect coefficients + p-values
simple_results <- lapply(predictors, function(var) {
  fit <- lm(reformulate(var, "crim"), data = Boston)
  s   <- summary(fit)
  data.frame(
    predictor = var,
    coef      = coef(fit)[2],
    p_value   = s$coefficients[2, 4],
    row.names = NULL
  )
})
simple_df <- do.call(rbind, simple_results)
simple_df$significant <- simple_df$p_value < 0.05

print(simple_df)
##    predictor        coef      p_value significant
## 1         zn -0.07393498 5.506472e-06        TRUE
## 2      indus  0.50977633 1.450349e-21        TRUE
## 3       chas -1.89277655 2.094345e-01       FALSE
## 4        nox 31.24853120 3.751739e-23        TRUE
## 5         rm -2.68405122 6.346703e-07        TRUE
## 6        age  0.10778623 2.854869e-16        TRUE
## 7        dis -1.55090168 8.519949e-19        TRUE
## 8        rad  0.61791093 2.693844e-56        TRUE
## 9        tax  0.02974225 2.357127e-47        TRUE
## 10   ptratio  1.15198279 2.942922e-11        TRUE
## 11     lstat  0.54880478 2.654277e-27        TRUE
## 12      medv -0.36315992 1.173987e-19        TRUE
# Plot for two illustrative predictors
p_lstat <- ggplot(Boston, aes(x = lstat, y = crim)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm", colour = "steelblue") +
  labs(title = "crim ~ lstat") + theme_bw()

p_medv <- ggplot(Boston, aes(x = medv, y = crim)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm", colour = "tomato") +
  labs(title = "crim ~ medv") + theme_bw()

gridExtra::grid.arrange(p_lstat, p_medv, ncol = 2)

Almost all predictors show a statistically significant univariate association with crim. Exceptions (if any) tend to be chas (Charles River dummy variable), which is not significant.

(b) Multiple Regression

lm_multi <- lm(crim ~ ., data = Boston)
summary(lm_multi)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

In the multiple regression, far fewer predictors remain statistically significant. We can reject H₀: βⱼ = 0 for zn, dis, rad, black, and medv at conventional significance levels. Many predictors that were significant in simple regressions lose significance once the other variables are controlled for, indicating collinearity.

(c) Univariate vs. Multiple Regression Coefficients

multi_coefs <- coef(lm_multi)[-1]   # exclude intercept

# Align with simple_df
plot_df <- simple_df %>%
  filter(predictor %in% names(multi_coefs)) %>%
  mutate(multi_coef = multi_coefs[predictor])

ggplot(plot_df, aes(x = coef, y = multi_coef, label = predictor)) +
  geom_point(colour = "steelblue", size = 3) +
  ggrepel::geom_text_repel(size = 3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  labs(x = "Univariate coefficient",
       y = "Multiple regression coefficient",
       title = "Simple vs. Multiple Regression Coefficients – Boston") +
  theme_bw()

The coefficients differ substantially between univariate and multiple models, particularly for nox and rad, illustrating the confounding effect of correlated predictors.

(d) Non-linear Associations (Cubic Polynomial)

poly_results <- lapply(predictors, function(var) {
  # Skip binary variable chas
  if (length(unique(Boston[[var]])) <= 2) return(NULL)
  fit <- lm(reformulate(c(var, paste0("I(", var, "^2)"), paste0("I(", var, "^3)")),
                        "crim"), data = Boston)
  s   <- summary(fit)$coefficients
  data.frame(
    predictor  = var,
    p_X2       = round(s[3, 4], 4),   # p-value of X^2 term
    p_X3       = round(s[4, 4], 4),   # p-value of X^3 term
    row.names  = NULL
  )
})
poly_df <- do.call(rbind, poly_results)
poly_df$nonlinear <- (poly_df$p_X2 < 0.05 | poly_df$p_X3 < 0.05)
print(poly_df)
##    predictor   p_X2   p_X3 nonlinear
## 1         zn 0.0938 0.2295     FALSE
## 2      indus 0.0000 0.0000      TRUE
## 3        nox 0.0000 0.0000      TRUE
## 4         rm 0.3641 0.5086     FALSE
## 5        age 0.0474 0.0067      TRUE
## 6        dis 0.0000 0.0000      TRUE
## 7        rad 0.6130 0.4823     FALSE
## 8        tax 0.1375 0.2439     FALSE
## 9    ptratio 0.0041 0.0063      TRUE
## 10     lstat 0.0646 0.1299     FALSE
## 11      medv 0.0000 0.0000      TRUE

Several predictors show statistically significant polynomial (X² or X³) terms, indicating non-linear associations with crime rate. Notable examples typically include dis, nox, lstat, and medv.


Chapter 4 – Exercise 13

Logistic regression using the Weekly data set (1990–2010).

data(Weekly)
dim(Weekly)
## [1] 1089    9
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 

(a) Numerical and Graphical Summaries

# Numerical summary
summary(Weekly[, c("Lag1","Lag2","Lag3","Lag4","Lag5","Volume","Today")])
##       Lag1               Lag2               Lag3               Lag4         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580   1st Qu.: -1.1580  
##  Median :  0.2410   Median :  0.2410   Median :  0.2410   Median :  0.2380  
##  Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472   Mean   :  0.1458  
##  3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260
# Volume over time
ggplot(Weekly, aes(x = Year, y = Volume)) +
  geom_jitter(alpha = 0.3, width = 0.3, colour = "steelblue") +
  geom_smooth(method = "loess", colour = "tomato", se = FALSE) +
  labs(title = "Trading Volume Over Time – Weekly Data",
       x = "Year", y = "Volume (billions of shares)") +
  theme_bw()

Patterns: Trading volume increased substantially from 1990 to around 2004–2005 and then leveled off. The lag returns (Lag1 through Lag5) show little obvious trend, as expected for financial returns. The market was more likely to go Up than Down across the full period (roughly 55% Up weeks).

(b) Logistic Regression – Full Data

glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data   = Weekly,
                family = binomial)
summary(glm_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Significant predictors: Only Lag2 has a p-value below 0.05. The other lag variables and Volume are not statistically significant in the presence of all predictors.

(c) Confusion Matrix – Full Data

prob_full  <- predict(glm_full, type = "response")
pred_full  <- ifelse(prob_full > 0.5, "Up", "Down")

conf_full  <- table(Predicted = pred_full, Actual = Weekly$Direction)
conf_full
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
# Overall fraction of correct predictions
accuracy_full <- mean(pred_full == Weekly$Direction)
cat("Overall accuracy (full model, training data):", round(accuracy_full, 4), "\n")
## Overall accuracy (full model, training data): 0.5611

Interpretation: The confusion matrix reveals that the model correctly predicts most Up weeks but struggles significantly with Down weeks — it almost always predicts Up. This is consistent with the imbalanced base rate (more Up weeks than Down). The overall training accuracy of ~56% is only marginally better than the naive strategy of always predicting Up.

(d) Logistic Regression – Training 1990–2008, Test 2009–2010

train <- Weekly$Year <= 2008
test  <- Weekly[!train, ]

glm_lag2 <- glm(Direction ~ Lag2,
                data   = Weekly,
                subset = train,
                family = binomial)

prob_test  <- predict(glm_lag2, newdata = test, type = "response")
pred_test  <- ifelse(prob_test > 0.5, "Up", "Down")

conf_test  <- table(Predicted = pred_test, Actual = test$Direction)
conf_test
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
accuracy_test <- mean(pred_test == test$Direction)
cat("Overall accuracy (Lag2 model, test 2009-2010):", round(accuracy_test, 4), "\n")
## Overall accuracy (Lag2 model, test 2009-2010): 0.625

Interpretation: Using only Lag2 as a predictor and evaluating on held-out data from 2009–2010, the model achieves an accuracy of approximately 62.5%. This is a meaningful improvement over random guessing and suggests that Lag2 carries some genuine predictive signal for the direction of weekly returns.


End of Midterm Homework