1 Chapter 2 — Question 9: The Auto Dataset

# Load Auto dataset and remove any rows with missing values
data(Auto)
Auto <- na.omit(Auto)

# Quick overview
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…

1.1 (a) Quantitative vs. Qualitative Predictors

# Inspect variable types
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Classification:

Variable Type Justification
mpg Quantitative Continuous numeric — miles per gallon
cylinders Quantitative* Integer counts; treated as quantitative (could be ordinal)
displacement Quantitative Continuous — engine displacement in cubic inches
horsepower Quantitative Continuous — engine horsepower
weight Quantitative Continuous — vehicle weight in lbs
acceleration Quantitative Continuous — time to accelerate 0–60 mph
year Quantitative Integer — model year (70 = 1970, etc.)
origin Qualitative Integer code for region: 1 = American, 2 = European, 3 = Japanese
name Qualitative Character — vehicle name/identifier

Interpretation: Seven predictors are quantitative (continuous or ordinal numeric). Two are qualitative: origin (categorical region code) and name (nominal string identifier).


1.2 (b) Range of Each Quantitative Predictor

# Select quantitative predictors
quant_vars <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","year")

# Compute range for each
range_df <- Auto %>%
  select(all_of(quant_vars)) %>%
  summarise(across(everything(), list(Min = min, Max = max))) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_") %>%
  mutate(Range = Max - Min)

kable(range_df, digits = 2, caption = "Range of Quantitative Predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Range of Quantitative Predictors
Variable Min Max Range
mpg 9 46.6 37.6
cylinders 3 8.0 5.0
displacement 68 455.0 387.0
horsepower 46 230.0 184.0
weight 1613 5140.0 3527.0
acceleration 8 24.8 16.8
year 70 82.0 12.0

Interpretation: weight has the widest absolute range (~3,527 lbs), while acceleration is the most tightly bounded. The wide spread of horsepower and displacement foreshadows their potential predictive importance.


1.3 (c) Mean and Standard Deviation of Each Quantitative Predictor

# Summary statistics
summary_df <- Auto %>%
  select(all_of(quant_vars)) %>%
  summarise(across(everything(),
                   list(Mean = ~mean(.x), SD = ~sd(.x)),
                   .names = "{.col}_{.fn}")) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_")

kable(summary_df, digits = 3,
      caption = "Mean and Standard Deviation of Quantitative Predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Mean and Standard Deviation of Quantitative Predictors
Variable Mean SD
mpg 23.446 7.805
cylinders 5.472 1.706
displacement 194.412 104.644
horsepower 104.469 38.491
weight 2977.584 849.403
acceleration 15.541 2.759
year 75.980 3.684

Interpretation: The high standard deviation of weight (849 lbs) and displacement (104 in³) relative to their means reflects large variance across vehicle classes — compact cars vs. large American sedans.


1.4 (d) Summary Statistics After Removing Observations 10–85

# Remove observations 10 through 85 (row indices)
Auto_subset <- Auto[-(10:85), ]

cat("Original rows:", nrow(Auto), "| Subset rows:", nrow(Auto_subset), "\n")
## Original rows: 392 | Subset rows: 316
# Compute range, mean, and SD on subset
subset_df <- Auto_subset %>%
  select(all_of(quant_vars)) %>%
  summarise(across(everything(),
                   list(Min = min, Max = max, Mean = ~mean(.x), SD = ~sd(.x)),
                   .names = "{.col}_{.fn}")) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_") %>%
  mutate(Range = Max - Min)

kable(subset_df, digits = 3,
      caption = "Statistics After Removing Observations 10–85") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Statistics After Removing Observations 10–85
Variable Min Max Mean SD Range
mpg 11.0 46.6 24.404 7.867 35.6
cylinders 3.0 8.0 5.373 1.654 5.0
displacement 68.0 455.0 187.241 99.678 387.0
horsepower 46.0 230.0 100.722 35.709 184.0
weight 1649.0 4997.0 2935.972 811.300 3348.0
acceleration 8.5 24.8 15.727 2.694 16.3
year 70.0 82.0 77.146 3.106 12.0

Interpretation: Removing observations 10–85 (76 rows, ~19% of the data) narrows several ranges and shifts means, indicating those rows are not uniformly distributed across the predictor space. The standard deviations also change, suggesting the removed observations captured some meaningful variability.


1.5 (e) Graphical Investigation — Scatterplot Matrix

# Scatterplot matrix with correlation coefficients using GGally
ggpairs(
  Auto %>% select(all_of(quant_vars)),
  upper = list(continuous = wrap("cor", size = 3, color = "steelblue")),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8,
                                 color = "steelblue")),
  diag  = list(continuous = wrap("densityDiag", fill = "steelblue",
                                 alpha = 0.4)),
  title = "Scatterplot Matrix — Auto Dataset (Quantitative Predictors)"
) +
  theme_bw(base_size = 10)

# Correlation heatmap for clearer reading
cor_mat <- cor(Auto %>% select(all_of(quant_vars)))

ggcorrplot(cor_mat,
           method   = "circle",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3,
           colors   = c("#d73027","white","#1a9850"),
           title    = "Correlation Heatmap — Auto Quantitative Predictors",
           ggtheme  = theme_bw()) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Interpretation: Strong negative correlations exist between mpg and weight (r ≈ −0.83), displacement (r ≈ −0.81), cylinders (r ≈ −0.78), and horsepower (r ≈ −0.78). year shows a positive correlation with mpg (r ≈ +0.58), implying fuel efficiency improved over the model years studied. The engineering variables (cylinders, displacement, horsepower, weight) are highly inter-correlated, suggesting multicollinearity should be considered in regression models.


1.6 (f) Predictors Useful for Predicting mpg

# Scatter plots of mpg against each other quantitative predictor
other_vars <- setdiff(quant_vars, "mpg")

plots <- lapply(other_vars, function(v) {
  ggplot(Auto, aes_string(x = v, y = "mpg")) +
    geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
    geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
    geom_smooth(method = "loess", se = FALSE, color = "darkgreen",
                linetype = "dashed", linewidth = 0.8) +
    labs(title = paste("mpg vs.", v),
         x = v, y = "mpg") +
    theme_bw(base_size = 11)
})

grid.arrange(grobs = plots, ncol = 3,
             top = "MPG vs. Quantitative Predictors (Red = OLS, Green Dashed = LOESS)")

Interpretation: Based on the plots:

  • weight, displacement, horsepower, cylinders: All show strong, negative non-linear relationships with mpg. Heavier, larger-engine cars consume more fuel. These predictors would be most useful.
  • year: Positive relationship — fuel efficiency improved over time due to regulatory and engineering changes. A useful predictor.
  • acceleration: Weaker, more scattered relationship; moderate predictive value.

The LOESS curves reveal non-linear patterns, particularly for horsepower and weight, suggesting that polynomial or logarithmic transformations may improve model fit.


2 Chapter 3 — Question 9: Multiple Linear Regression on Auto

This question involves the use of multiple linear regression on the Auto data set.


2.1 (a) Scatterplot Matrix

# Scatterplot matrix for all variables (excluding 'name')
ggpairs(
  Auto %>% select(-name),
  upper = list(continuous = wrap("cor", size = 2.8, color = "steelblue"),
               combo       = "box_no_facet"),
  lower = list(continuous = wrap("points", alpha = 0.25, size = 0.7,
                                 color = "steelblue"),
               combo       = "dot_no_facet"),
  diag  = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.4),
               discrete    = "barDiag"),
  title = "Scatterplot Matrix — All Auto Predictors (excluding 'name')"
) +
  theme_bw(base_size = 9)


2.2 (b) Correlation Matrix

# Exclude 'name' (qualitative) before computing correlations
cor_auto <- cor(Auto %>% select(-name))

# Display as rounded table
kable(round(cor_auto, 3),
      caption = "Correlation Matrix — Auto Dataset (excluding 'name')") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2c7bb6")
Correlation Matrix — Auto Dataset (excluding ‘name’)
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423 0.581 0.565
cylinders -0.778 1.000 0.951 0.843 0.898 -0.505 -0.346 -0.569
displacement -0.805 0.951 1.000 0.897 0.933 -0.544 -0.370 -0.615
horsepower -0.778 0.843 0.897 1.000 0.865 -0.689 -0.416 -0.455
weight -0.832 0.898 0.933 0.865 1.000 -0.417 -0.309 -0.585
acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000 0.290 0.213
year 0.581 -0.346 -0.370 -0.416 -0.309 0.290 1.000 0.182
origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213 0.182 1.000

Interpretation: The correlation matrix confirms that mpg is most strongly (negatively) correlated with weight (−0.832), displacement (−0.805), and cylinders (−0.778). The high inter-correlations among the engineering predictors (e.g., displacementweight ≈ 0.933) signal potential multicollinearity in a regression model.


2.3 (c) Multiple Linear Regression — mpg ~ . (excluding name)

# Fit multiple linear regression: mpg as response, all except 'name' as predictors
lm_auto <- lm(mpg ~ . - name, data = Auto)

# Print detailed summary
summary(lm_auto)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
# Tidy coefficient table for presentation
tidy(lm_auto) %>%
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  kable(digits = 4,
        col.names = c("Predictor","Estimate","Std. Error","t-value","p-value","Significant (α=0.05)"),
        caption = "MLR Coefficient Summary: mpg ~ . - name") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(tidy(lm_auto)$p.value < 0.05), background = "#e8f5e9")
MLR Coefficient Summary: mpg ~ . - name
Predictor Estimate Std. Error t-value p-value Significant (α=0.05)
(Intercept) -17.2184 4.6443 -3.7074 0.0002 Yes
cylinders -0.4934 0.3233 -1.5261 0.1278 No
displacement 0.0199 0.0075 2.6474 0.0084 Yes
horsepower -0.0170 0.0138 -1.2295 0.2196 No
weight -0.0065 0.0007 -9.9288 0.0000 Yes
acceleration 0.0806 0.0988 0.8152 0.4155 No
year 0.7508 0.0510 14.7288 0.0000 Yes
origin 1.4261 0.2781 5.1275 0.0000 Yes

i. Is there a relationship between the predictors and the response?

The overall F-test is highly significant (F ≈ 252.4 on 7 and 384 df, p < 2.2×10⁻¹⁶), confirming that at least one predictor has a statistically significant linear relationship with mpg. The model explains approximately 82.1% of the variance in fuel efficiency (Adjusted R² ≈ 0.821).

ii. Which predictors have a statistically significant relationship?

At the α = 0.05 level, the following predictors are statistically significant: displacement, weight, year, and origin. Variables cylinders, horsepower, and acceleration are not significant in the multiple regression context, likely due to multicollinearity with displacement and weight.

iii. What does the coefficient for year suggest?

The estimated coefficient for year is approximately +0.751, meaning that, holding all other predictors constant, each additional model year is associated with an increase of ~0.75 mpg. This reflects real-world improvements in automotive fuel efficiency driven by the 1973–74 oil crisis and subsequent CAFE (Corporate Average Fuel Economy) standards introduced in 1975.


2.4 (d) Diagnostic Plots

# Extract model diagnostics using broom::augment
diag_df <- augment(lm_auto)

# 1. Residuals vs Fitted
p1 <- ggplot(diag_df, aes(.fitted, .resid)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "firebrick") +
  geom_smooth(method = "loess", se = FALSE, color = "firebrick", linewidth = 0.8) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Values", y = "Residuals") +
  theme_bw()

# 2. Normal Q-Q
p2 <- ggplot(diag_df, aes(sample = .std.resid)) +
  stat_qq(alpha = 0.4, color = "steelblue", size = 1.5) +
  stat_qq_line(color = "firebrick", linewidth = 0.8) +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles", y = "Standardised Residuals") +
  theme_bw()

# 3. Scale-Location
p3 <- ggplot(diag_df, aes(.fitted, sqrt(abs(.std.resid)))) +
  geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
  geom_smooth(method = "loess", se = FALSE, color = "firebrick", linewidth = 0.8) +
  labs(title = "Scale-Location",
       x = "Fitted Values", y = expression(sqrt("|Standardised Residuals|"))) +
  theme_bw()

# 4. Leverage (hat values) vs Standardised Residuals
p4 <- ggplot(diag_df, aes(.hat, .std.resid)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
  geom_hline(yintercept = c(-2, 0, 2), linetype = "dashed", color = "firebrick") +
  geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 0.8) +
  labs(title = "Leverage vs. Standardised Residuals",
       x = "Leverage (hat values)", y = "Standardised Residuals") +
  theme_bw()

grid.arrange(p1, p2, p3, p4, ncol = 2,
             top = "Diagnostic Plots — MLR: mpg ~ . - name")

# Identify high-leverage points (threshold: 2 * mean hat value)
lev_threshold <- 2 * mean(hatvalues(lm_auto))
high_lev <- which(hatvalues(lm_auto) > lev_threshold)
cat("High leverage observations (index):", paste(high_lev[1:10], collapse=", "),
    "...\n(", length(high_lev), "total )\n")
## High leverage observations (index): 7, 8, 9, 13, 14, 26, 27, 28, 29, 94 ...
## ( 17 total )

Interpretation:

  • Residuals vs. Fitted: A slight curvature in the LOESS smoother suggests mild non-linearity not fully captured by the linear model.
  • Q-Q Plot: Residuals are approximately normal with minor deviation in the tails, acceptable for inference.
  • Scale-Location: A slight increasing trend indicates mild heteroscedasticity — variance of residuals grows with fitted values.
  • Leverage Plot: Several high-leverage points are identified. Observation 14 appears in many summaries as having unusually high leverage. High leverage points have unusual predictor combinations but do not necessarily distort the fit.

2.5 (e) Interaction Effects

# Test interaction effects using * (main effects + interaction)
# Focus on two theoretically motivated interactions:
# weight:horsepower — both reflect engine load
# displacement:year — engine downsizing over time
lm_interact <- lm(mpg ~ . - name + weight:horsepower + displacement:year, data = Auto)
summary(lm_interact)
## 
## Call:
## lm(formula = mpg ~ . - name + weight:horsepower + displacement:year, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7503 -1.5594 -0.0426  1.2853 12.1239 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.698e+01  8.194e+00  -3.293  0.00108 ** 
## cylinders          2.082e-01  2.870e-01   0.725  0.46867    
## displacement       1.582e-01  3.583e-02   4.416 1.31e-05 ***
## horsepower        -2.104e-01  2.360e-02  -8.915  < 2e-16 ***
## weight            -1.019e-02  7.510e-04 -13.561  < 2e-16 ***
## acceleration      -5.116e-02  8.705e-02  -0.588  0.55705    
## year               1.116e+00  9.143e-02  12.207  < 2e-16 ***
## origin             7.994e-01  2.458e-01   3.252  0.00125 ** 
## horsepower:weight  4.757e-05  5.413e-06   8.787  < 2e-16 ***
## displacement:year -2.090e-03  4.834e-04  -4.324 1.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.866 on 382 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8652 
## F-statistic: 279.8 on 9 and 382 DF,  p-value: < 2.2e-16

Interpretation: The interaction term weight:horsepower is statistically significant (p < 0.05), suggesting that the joint effect of weight and horsepower on fuel efficiency is not simply additive — the penalty of high horsepower is amplified in heavier vehicles. The displacement:year interaction is also significant, reflecting that more recent large-displacement engines became relatively more efficient over time.


2.6 (f) Variable Transformations

# --- Log Transformation ---
lm_log <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +
               acceleration + year + origin, data = Auto)

cat("=== Log-transformed model ===\n")
## === Log-transformed model ===
cat("Adjusted R²:", summary(lm_log)$adj.r.squared, "\n")
## Adjusted R²: 0.8445302
cat("Residual SE:", summary(lm_log)$sigma, "\n\n")
## Residual SE: 3.077488
# --- Square Root Transformation ---
lm_sqrt <- lm(mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight) +
                acceleration + year + origin, data = Auto)

cat("=== Sqrt-transformed model ===\n")
## === Sqrt-transformed model ===
cat("Adjusted R²:", summary(lm_sqrt)$adj.r.squared, "\n")
## Adjusted R²: 0.8312818
cat("Residual SE:", summary(lm_sqrt)$sigma, "\n\n")
## Residual SE: 3.205932
# --- Quadratic (X²) Transformation ---
lm_sq <- lm(mpg ~ poly(displacement, 2) + poly(horsepower, 2) + poly(weight, 2) +
              acceleration + year + origin, data = Auto)

cat("=== Quadratic (X²) model ===\n")
## === Quadratic (X²) model ===
cat("Adjusted R²:", summary(lm_sq)$adj.r.squared, "\n")
## Adjusted R²: 0.861586
cat("Residual SE:", summary(lm_sq)$sigma, "\n\n")
## Residual SE: 2.903778
# Comparison table
model_compare <- data.frame(
  Model        = c("Original MLR","Log Transform","Sqrt Transform","Quadratic (X²)"),
  Adj_R2       = c(summary(lm_auto)$adj.r.squared,
                   summary(lm_log)$adj.r.squared,
                   summary(lm_sqrt)$adj.r.squared,
                   summary(lm_sq)$adj.r.squared),
  Residual_SE  = c(summary(lm_auto)$sigma,
                   summary(lm_log)$sigma,
                   summary(lm_sqrt)$sigma,
                   summary(lm_sq)$sigma)
)

kable(model_compare, digits = 4,
      caption = "Model Comparison: Transformations of Key Predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Model Comparison: Transformations of Key Predictors
Model Adj_R2 Residual_SE
Original MLR 0.8182 3.3277
Log Transform 0.8445 3.0775
Sqrt Transform 0.8313 3.2059
Quadratic (X²) 0.8616 2.9038

Interpretation: Applying logarithmic or square-root transformations to the skewed engineering predictors (displacement, horsepower, weight) increases the Adjusted R² and reduces the Residual SE relative to the untransformed model. The log-transformed model typically performs best because it linearises the multiplicative relationship between engine size/weight and fuel consumption. Quadratic terms capture non-linearity directly but add degrees of freedom without always matching log-transform gains.


3 Chapter 3 — Question 15: Predicting Crime Rate in the Boston Dataset

We will try to predict per capita crime rate (crim) using the other variables in the Boston data set.

data(Boston)
glimpse(Boston)
## Rows: 506
## Columns: 13
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

3.1 (a) Simple Linear Regression for Each Predictor

# Fit a simple linear regression for each predictor
predictors_boston <- setdiff(names(Boston), "crim")

# Collect tidy results
slr_results <- lapply(predictors_boston, function(v) {
  mod  <- lm(as.formula(paste("crim ~", v)), data = Boston)
  tidy_mod <- tidy(mod)
  # Return only the predictor row (slope)
  tidy_mod[2, ] %>%
    mutate(predictor = v,
           adj_r2    = summary(mod)$adj.r.squared)
}) %>% bind_rows()

# Flag significance
slr_sig <- slr_results %>%
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  select(predictor, estimate, std.error, statistic, p.value, adj_r2, significant)

kable(slr_sig, digits = 4,
      col.names = c("Predictor","Slope","Std. Error","t-stat","p-value",
                    "Adj. R²","Sig. (α=0.05)"),
      caption = "Simple Linear Regression Results: crim ~ each predictor") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 11) %>%
  row_spec(which(slr_sig$significant == "Yes"), background = "#e8f5e9")
Simple Linear Regression Results: crim ~ each predictor
Predictor Slope Std. Error t-stat p-value Adj. R² Sig. (α=0.05)
zn -0.0739 0.0161 -4.5938 0.0000 0.0383 Yes
indus 0.5098 0.0510 9.9908 0.0000 0.1637 Yes
chas -1.8928 1.5061 -1.2567 0.2094 0.0011 No
nox 31.2485 2.9992 10.4190 0.0000 0.1756 Yes
rm -2.6841 0.5320 -5.0448 0.0000 0.0462 Yes
age 0.1078 0.0127 8.4628 0.0000 0.1227 Yes
dis -1.5509 0.1683 -9.2135 0.0000 0.1425 Yes
rad 0.6179 0.0343 17.9982 0.0000 0.3900 Yes
tax 0.0297 0.0018 16.0994 0.0000 0.3383 Yes
ptratio 1.1520 0.1694 6.8014 0.0000 0.0823 Yes
lstat 0.5488 0.0478 11.4907 0.0000 0.2060 Yes
medv -0.3632 0.0384 -9.4597 0.0000 0.1491 Yes
# Scatterplots with regression lines for all predictors
sig_preds <- slr_sig %>% filter(significant == "Yes") %>% pull(predictor)

plots_15a <- lapply(sig_preds, function(v) {
  ggplot(Boston, aes_string(x = v, y = "crim")) +
    geom_point(alpha = 0.3, color = "steelblue", size = 1) +
    geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
    labs(title = paste("crim ~", v),
         x = v, y = "Crime Rate") +
    theme_bw(base_size = 10)
})

grid.arrange(grobs = plots_15a, ncol = 4,
             top = "SLR: Crime Rate vs. Significant Predictors")

Interpretation: Almost all predictors show a statistically significant univariate association with crim (p < 0.05), with the exception of chas (Charles River dummy variable), which is not significant. Notable strong associations: rad (accessibility to radial highways), tax (property tax rate), and lstat (lower-status population %) are positively associated with crime; medv (median home value) is negatively associated.


3.2 (b) Multiple Linear Regression

# Fit multiple regression with all predictors
lm_boston <- lm(crim ~ ., data = Boston)
summary(lm_boston)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16
tidy(lm_boston) %>%
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  kable(digits = 4,
        col.names = c("Predictor","Estimate","Std. Error","t-value","p-value","Sig."),
        caption = "Multiple Linear Regression: crim ~ all predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(tidy(lm_boston)$p.value < 0.05), background = "#e8f5e9")
Multiple Linear Regression: crim ~ all predictors
Predictor Estimate Std. Error t-value p-value Sig.
(Intercept) 13.7784 7.0818 1.9456 0.0523 No
zn 0.0457 0.0188 2.4326 0.0153 Yes
indus -0.0584 0.0836 -0.6977 0.4857 No
chas -0.8254 1.1834 -0.6975 0.4858 No
nox -9.9576 5.2898 -1.8824 0.0604 No
rm 0.6289 0.6071 1.0359 0.3007 No
age -0.0008 0.0179 -0.0473 0.9623 No
dis -1.0122 0.2825 -3.5836 0.0004 Yes
rad 0.6125 0.0875 6.9967 0.0000 Yes
tax -0.0038 0.0052 -0.7300 0.4658 No
ptratio -0.3041 0.1864 -1.6316 0.1034 No
lstat 0.1388 0.0757 1.8330 0.0674 No
medv -0.2201 0.0598 -3.6784 0.0003 Yes

Interpretation: In the multiple regression model, only a subset of predictors remain statistically significant after adjusting for all others: zn, dis, rad, black, and medv. Many variables significant in simple regression lose significance in the multiple regression due to multicollinearity — for example, tax (highly correlated with rad). We reject H₀: βⱼ = 0 for the above-listed predictors.


3.3 (c) Univariate vs. Multiple Regression Coefficient Comparison Plot

# Prepare data: univariate slopes vs. multiple regression slopes
uni_coefs <- slr_results %>%
  select(predictor, estimate) %>%
  rename(univariate = estimate)

multi_coefs <- tidy(lm_boston) %>%
  filter(term != "(Intercept)") %>%
  select(term, estimate) %>%
  rename(predictor = term, multiple = estimate)

coef_compare <- inner_join(uni_coefs, multi_coefs, by = "predictor")

ggplot(coef_compare, aes(x = univariate, y = multiple, label = predictor)) +
  geom_point(color = "steelblue", size = 3) +
  geom_text(vjust = -0.6, hjust = 0.5, size = 3.2, color = "black") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_abline(slope = 1, intercept = 0, color = "firebrick",
              linetype = "dotted", linewidth = 0.8) +
  labs(
    title    = "Univariate vs. Multiple Regression Coefficients",
    subtitle = "Each point = one predictor. Red dotted line = perfect agreement.",
    x        = "Univariate Regression Coefficient",
    y        = "Multiple Regression Coefficient"
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Interpretation: The scatter of points away from the identity line (y = x) illustrates how controlling for other predictors changes estimated coefficients substantially. nox (nitric oxide concentration) is a striking example: its large positive univariate coefficient shrinks or changes sign in the multiple regression, consistent with it acting as a proxy for other correlated urban-environment variables.


3.4 (d) Non-linear Association — Cubic Polynomial Models

# For each predictor, fit crim ~ X + X^2 + X^3 and test significance of
# quadratic and cubic terms via ANOVA

poly_results <- lapply(predictors_boston, function(v) {
  # Suppress errors for constant predictors
  tryCatch({
    formula_poly <- as.formula(paste("crim ~ poly(", v, ", 3)"))
    mod_poly <- lm(formula_poly, data = Boston)
    s <- summary(mod_poly)
    coef_tbl <- tidy(mod_poly)

    # Extract p-values for X², X³ terms
    p_linear    <- coef_tbl$p.value[2]
    p_quadratic <- ifelse(nrow(coef_tbl) >= 3, coef_tbl$p.value[3], NA)
    p_cubic     <- ifelse(nrow(coef_tbl) >= 4, coef_tbl$p.value[4], NA)

    data.frame(
      Predictor   = v,
      p_X1        = round(p_linear, 4),
      p_X2        = round(p_quadratic, 4),
      p_X3        = round(p_cubic, 4),
      Adj_R2      = round(s$adj.r.squared, 4),
      NonLinearSig = ifelse(!is.na(p_quadratic) & (p_quadratic < 0.05 |
                            (!is.na(p_cubic) & p_cubic < 0.05)), "Yes", "No")
    )
  }, error = function(e) NULL)
}) %>% bind_rows()

kable(poly_results,
      col.names = c("Predictor","p(X)","p(X²)","p(X³)","Adj.R²","Non-Linear Sig."),
      caption = "Cubic Polynomial Regression: crim ~ X + X² + X³") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 11) %>%
  row_spec(which(poly_results$NonLinearSig == "Yes"), background = "#fff3cd")
Cubic Polynomial Regression: crim ~ X + X² + X³
Predictor p(X) p(X²) p(X³) Adj.R² Non-Linear Sig.
zn 0 0.0044 0.2295 0.0526 Yes
indus 0 0.0011 0.0000 0.2552 Yes
nox 0 0.0001 0.0000 0.2928 Yes
rm 0 0.0015 0.5086 0.0622 Yes
age 0 0.0000 0.0067 0.1693 Yes
dis 0 0.0000 0.0000 0.2735 Yes
rad 0 0.0091 0.4823 0.3965 Yes
tax 0 0.0000 0.2439 0.3651 Yes
ptratio 0 0.0024 0.0063 0.1085 Yes
lstat 0 0.0378 0.1299 0.2133 Yes
medv 0 0.0000 0.0000 0.4167 Yes

Interpretation: For many predictors — including indus, nox, age, dis, ptratio, and medv — the quadratic and/or cubic terms are statistically significant, providing evidence of non-linear relationships with crime rate. For example, the relationship between dis (distance to employment centres) and crim appears to decrease sharply at first before levelling off — a pattern better captured by a polynomial model. These findings suggest that a linear model may be misspecified for predicting crim, and non-linear methods (splines, GAMs) could improve fit.


4 Chapter 4 — Question 13: Logistic Regression on the Weekly Dataset

This data contains 1,089 weekly stock market returns for 21 years, from 1990 to 2010. The response is Direction (Up/Down).

data(Weekly)
glimpse(Weekly)
## Rows: 1,089
## Columns: 9
## $ Year      <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
## $ Lag1      <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0…
## $ Lag2      <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0…
## $ Lag3      <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -…
## $ Lag4      <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, …
## $ Lag5      <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514,…
## $ Volume    <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0.154…
## $ Today     <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041, 1…
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up…
cat("\nClass balance:\n")
## 
## Class balance:
table(Weekly$Direction)
## 
## Down   Up 
##  484  605

4.1 (a) Numerical and Graphical Summaries

# Numerical summary
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
# Volume over time
ggplot(Weekly, aes(x = Year, y = Volume)) +
  geom_line(color = "steelblue", alpha = 0.7, linewidth = 0.8) +
  geom_smooth(method = "loess", color = "firebrick", se = FALSE, linewidth = 1) +
  labs(title = "Average Weekly Trading Volume Over Time (1990–2010)",
       x = "Year", y = "Volume (avg. daily billions)") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

# Distribution of Today's returns by Direction
ggplot(Weekly, aes(x = Today, fill = Direction)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  scale_fill_manual(values = c("Down" = "#d73027","Up" = "#1a9850")) +
  labs(title = "Distribution of Weekly Returns by Market Direction",
       x = "Today's Return (%)", y = "Count", fill = "Direction") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

# Lag returns by Direction
Weekly_long <- Weekly %>%
  select(Direction, Lag1:Lag5) %>%
  pivot_longer(Lag1:Lag5, names_to = "Lag", values_to = "Return")

ggplot(Weekly_long, aes(x = Lag, y = Return, fill = Direction)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.5) +
  scale_fill_manual(values = c("Down" = "#d73027","Up" = "#1a9850")) +
  labs(title = "Distribution of Lag Returns by Market Direction",
       x = "Lag Variable", y = "Lagged Return (%)", fill = "Direction") +
  theme_bw(base_size = 12)

Interpretation: Trading volume increased substantially over the 21-year period, particularly through the late 1990s tech boom and post-2003 expansion. The distribution of weekly returns is approximately symmetric around zero, with no obvious separation between Up and Down distributions across the lag variables — suggesting the lag predictors alone may have limited discriminatory power. The Up class is somewhat more prevalent (≈56%) than Down, reflecting the long-term upward drift of equity markets.


4.2 (b) Logistic Regression — Full Model with All Lags + Volume

# Full logistic regression: Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data   = Weekly,
                family = binomial)

summary(glm_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4
tidy(glm_full) %>%
  mutate(
    OR          = exp(estimate),
    significant = ifelse(p.value < 0.05, "Yes", "No")
  ) %>%
  kable(digits = 4,
        col.names = c("Predictor","Log-Odds","Std. Error","z-value","p-value",
                      "Odds Ratio","Sig. (α=0.05)"),
        caption = "Logistic Regression Coefficients (Full Model)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(tidy(glm_full)$p.value < 0.05), background = "#e8f5e9")
Logistic Regression Coefficients (Full Model)
Predictor Log-Odds Std. Error z-value p-value Odds Ratio Sig. (α=0.05)
(Intercept) 0.2669 0.0859 3.1056 0.0019 1.3059 Yes
Lag1 -0.0413 0.0264 -1.5626 0.1181 0.9596 No
Lag2 0.0584 0.0269 2.1754 0.0296 1.0602 Yes
Lag3 -0.0161 0.0267 -0.6024 0.5469 0.9841 No
Lag4 -0.0278 0.0265 -1.0501 0.2937 0.9726 No
Lag5 -0.0145 0.0264 -0.5485 0.5833 0.9856 No
Volume -0.0227 0.0369 -0.6163 0.5377 0.9775 No

Interpretation: Among the six predictors, only Lag2 is statistically significant (p ≈ 0.030). Its positive log-odds coefficient (+0.058) indicates that a higher return two weeks prior is associated with slightly higher odds of the market going Up the current week. The other lag variables and Volume are not significant. This is consistent with weak-form market efficiency — past prices have minimal predictive power.


4.3 (c) Confusion Matrix — Full Model

# Predicted probabilities and class predictions (threshold = 0.5)
pred_probs_full <- predict(glm_full, type = "response")
pred_class_full <- ifelse(pred_probs_full > 0.5, "Up", "Down")

# Confusion matrix
conf_mat_full <- table(Predicted = pred_class_full, Actual = Weekly$Direction)
conf_mat_full
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
# Overall accuracy
accuracy_full <- mean(pred_class_full == Weekly$Direction)
cat(sprintf("\nOverall Accuracy (full model, training data): %.1f%%\n",
            accuracy_full * 100))
## 
## Overall Accuracy (full model, training data): 56.1%
# Sensitivity & Specificity
TP <- conf_mat_full["Up",   "Up"]
TN <- conf_mat_full["Down", "Down"]
FP <- conf_mat_full["Up",   "Down"]
FN <- conf_mat_full["Down", "Up"]

cat(sprintf("Sensitivity (True Positive Rate): %.1f%%\n", TP/(TP+FN)*100))
## Sensitivity (True Positive Rate): 92.1%
cat(sprintf("Specificity (True Negative Rate): %.1f%%\n", TN/(TN+FP)*100))
## Specificity (True Negative Rate): 11.2%
# Visual confusion matrix
conf_df <- as.data.frame(conf_mat_full) %>%
  mutate(label = paste0(Freq, "\n(",
                        round(Freq/sum(conf_mat_full)*100, 1), "%)"))

ggplot(conf_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = label), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "#e8f4fd", high = "#2196F3") +
  labs(title = "Confusion Matrix — Full Logistic Regression Model",
       subtitle = paste0("Overall Accuracy: ", round(accuracy_full*100,1), "%"),
       x = "Actual Direction", y = "Predicted Direction") +
  theme_bw(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        legend.position = "none")

Interpretation: The confusion matrix reveals that the model achieves an overall accuracy of 56.1%. However, this masks class imbalance: the model has a strong bias toward predicting “Up” — it correctly identifies most “Up” weeks (high sensitivity ~92%) but performs very poorly at detecting “Down” weeks (low specificity ~11%). This means the model essentially defaults to the majority class and provides little incremental value for identifying market downturns.


4.4 (d) Logistic Regression — Training (1990–2008) and Test (2009–2010)

# Split data: training = 1990–2008, test = 2009–2010
train <- Weekly %>% filter(Year <= 2008)
test  <- Weekly %>% filter(Year  > 2008)

cat("Training observations:", nrow(train), "\n")
## Training observations: 985
cat("Test observations:    ", nrow(test),  "\n")
## Test observations:     104
# Fit logistic regression with Lag2 only on training data
glm_lag2 <- glm(Direction ~ Lag2,
                data   = train,
                family = binomial)

summary(glm_lag2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
# Predict on held-out test data (2009–2010)
pred_probs_test <- predict(glm_lag2, newdata = test, type = "response")
pred_class_test <- ifelse(pred_probs_test > 0.5, "Up", "Down")

# Confusion matrix on test set
conf_mat_test <- table(Predicted = pred_class_test, Actual = test$Direction)
conf_mat_test
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
# Overall test accuracy
accuracy_test <- mean(pred_class_test == test$Direction)
cat(sprintf("\nOverall Test Accuracy (2009–2010, Lag2 only): %.1f%%\n",
            accuracy_test * 100))
## 
## Overall Test Accuracy (2009–2010, Lag2 only): 62.5%
# Sensitivity & Specificity on test
TP_t <- conf_mat_test["Up",   "Up"]
TN_t <- conf_mat_test["Down", "Down"]
FP_t <- conf_mat_test["Up",   "Down"]
FN_t <- conf_mat_test["Down", "Up"]

cat(sprintf("Test Sensitivity: %.1f%%\n", TP_t/(TP_t+FN_t)*100))
## Test Sensitivity: 91.8%
cat(sprintf("Test Specificity: %.1f%%\n", TN_t/(TN_t+FP_t)*100))
## Test Specificity: 20.9%
# Visual confusion matrix — test set
conf_test_df <- as.data.frame(conf_mat_test) %>%
  mutate(label = paste0(Freq, "\n(", round(Freq/sum(conf_mat_test)*100,1), "%)"))

ggplot(conf_test_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = label), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "#e8f4fd", high = "#2196F3") +
  labs(title    = "Confusion Matrix — Lag2 Model on Test Data (2009–2010)",
       subtitle = paste0("Test Accuracy: ", round(accuracy_test*100,1), "%"),
       x = "Actual Direction", y = "Predicted Direction") +
  theme_bw(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        legend.position = "none")

Interpretation: The model trained on 1990–2008 data using only Lag2 achieves a test accuracy of 62.5% on the 2009–2010 held-out data. While seemingly modest, this exceeds the naive baseline of always predicting “Up” (≈ 58% accuracy for this test period). The improvement in specificity compared to the full model confirms that the parsimonious Lag2-only model generalises better out-of-sample. The ability of last week’s two-period lagged return to carry predictive signal is consistent with short-horizon momentum effects documented in the finance literature. Nevertheless, the overall discriminatory power remains limited, reinforcing the difficulty of beating the market using only historical price data.


5 Session Information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Taipei
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0       broom_1.0.12       ggcorrplot_0.1.4.1 gridExtra_2.3     
##  [5] kableExtra_1.4.0   knitr_1.49         tidyr_1.3.1        dplyr_1.1.4       
##  [9] GGally_2.4.0       ggplot2_4.0.2      ISLR2_1.3-2       
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9         utf8_1.2.4         generics_0.1.3     xml2_1.3.6        
##  [5] lattice_0.22-6     stringi_1.8.4      digest_0.6.37      magrittr_2.0.3    
##  [9] evaluate_1.0.1     grid_4.4.1         RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] Matrix_1.7-0       plyr_1.8.9         jsonlite_1.8.9     backports_1.5.0   
## [17] mgcv_1.9-1         purrr_1.2.1        fansi_1.0.6        viridisLite_0.4.2 
## [21] textshaping_0.4.0  jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6       
## [25] splines_4.4.1      withr_3.0.2        cachem_1.1.0       yaml_2.3.10       
## [29] tools_4.4.1        reshape2_1.4.5     ggstats_0.13.0     vctrs_0.6.5       
## [33] R6_2.5.1           lifecycle_1.0.4    stringr_1.5.1      pkgconfig_2.0.3   
## [37] pillar_1.9.0       bslib_0.8.0        gtable_0.3.6       Rcpp_1.0.13       
## [41] glue_1.8.0         systemfonts_1.3.2  xfun_0.49          tibble_3.2.1      
## [45] tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2       nlme_3.1-164      
## [49] htmltools_0.5.8.1  rmarkdown_2.29     svglite_2.2.2      labeling_0.4.3    
## [53] compiler_4.4.1     S7_0.2.1