1 Chapter 2 — Question 9: The `Auto` Dataset

# Load Auto dataset and remove any rows with missing values
data(Auto)
Auto <- na.omit(Auto)

# Quick overview
glimpse(Auto)

## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…

1.1 (a) Quantitative vs. Qualitative Predictors

# Inspect variable types
str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Classification:

Variable	Type	Justification
`mpg`	Quantitative	Continuous numeric — miles per gallon
`cylinders`	Quantitative*	Integer counts; treated as quantitative (could be ordinal)
`displacement`	Quantitative	Continuous — engine displacement in cubic inches
`horsepower`	Quantitative	Continuous — engine horsepower
`weight`	Quantitative	Continuous — vehicle weight in lbs
`acceleration`	Quantitative	Continuous — time to accelerate 0–60 mph
`year`	Quantitative	Integer — model year (70 = 1970, etc.)
`origin`	Qualitative	Integer code for region: 1 = American, 2 = European, 3 = Japanese
`name`	Qualitative	Character — vehicle name/identifier

Interpretation: Seven predictors are quantitative (continuous or ordinal numeric). Two are qualitative: origin (categorical region code) and name (nominal string identifier).

1.2 (b) Range of Each Quantitative Predictor

# Select quantitative predictors
quant_vars <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","year")

# Compute range for each
range_df <- Auto %>%
  select(all_of(quant_vars)) %>%
  summarise(across(everything(), list(Min = min, Max = max))) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_") %>%
  mutate(Range = Max - Min)

kable(range_df, digits = 2, caption = "Range of Quantitative Predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Range of Quantitative Predictors
Variable	Min	Max	Range
mpg	9	46.6	37.6
cylinders	3	8.0	5.0
displacement	68	455.0	387.0
horsepower	46	230.0	184.0
weight	1613	5140.0	3527.0
acceleration	8	24.8	16.8
year	70	82.0	12.0

Interpretation: weight has the widest absolute range (~3,527 lbs), while acceleration is the most tightly bounded. The wide spread of horsepower and displacement foreshadows their potential predictive importance.

1.3 (c) Mean and Standard Deviation of Each Quantitative Predictor

# Summary statistics
summary_df <- Auto %>%
  select(all_of(quant_vars)) %>%
  summarise(across(everything(),
                   list(Mean = ~mean(.x), SD = ~sd(.x)),
                   .names = "{.col}_{.fn}")) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_")

kable(summary_df, digits = 3,
      caption = "Mean and Standard Deviation of Quantitative Predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Mean and Standard Deviation of Quantitative Predictors
Variable	Mean	SD
mpg	23.446	7.805
cylinders	5.472	1.706
displacement	194.412	104.644
horsepower	104.469	38.491
weight	2977.584	849.403
acceleration	15.541	2.759
year	75.980	3.684

Interpretation: The high standard deviation of weight (849 lbs) and displacement (104 in³) relative to their means reflects large variance across vehicle classes — compact cars vs. large American sedans.

1.4 (d) Summary Statistics After Removing Observations 10–85

# Remove observations 10 through 85 (row indices)
Auto_subset <- Auto[-(10:85), ]

cat("Original rows:", nrow(Auto), "| Subset rows:", nrow(Auto_subset), "\n")

## Original rows: 392 | Subset rows: 316

# Compute range, mean, and SD on subset
subset_df <- Auto_subset %>%
  select(all_of(quant_vars)) %>%
  summarise(across(everything(),
                   list(Min = min, Max = max, Mean = ~mean(.x), SD = ~sd(.x)),
                   .names = "{.col}_{.fn}")) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_") %>%
  mutate(Range = Max - Min)

kable(subset_df, digits = 3,
      caption = "Statistics After Removing Observations 10–85") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Statistics After Removing Observations 10–85
Variable	Min	Max	Mean	SD	Range
mpg	11.0	46.6	24.404	7.867	35.6
cylinders	3.0	8.0	5.373	1.654	5.0
displacement	68.0	455.0	187.241	99.678	387.0
horsepower	46.0	230.0	100.722	35.709	184.0
weight	1649.0	4997.0	2935.972	811.300	3348.0
acceleration	8.5	24.8	15.727	2.694	16.3
year	70.0	82.0	77.146	3.106	12.0

Interpretation: Removing observations 10–85 (76 rows, ~19% of the data) narrows several ranges and shifts means, indicating those rows are not uniformly distributed across the predictor space. The standard deviations also change, suggesting the removed observations captured some meaningful variability.

1.5 (e) Graphical Investigation — Scatterplot Matrix

# Scatterplot matrix with correlation coefficients using GGally
ggpairs(
  Auto %>% select(all_of(quant_vars)),
  upper = list(continuous = wrap("cor", size = 3, color = "steelblue")),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8,
                                 color = "steelblue")),
  diag  = list(continuous = wrap("densityDiag", fill = "steelblue",
                                 alpha = 0.4)),
  title = "Scatterplot Matrix — Auto Dataset (Quantitative Predictors)"
) +
  theme_bw(base_size = 10)

# Correlation heatmap for clearer reading
cor_mat <- cor(Auto %>% select(all_of(quant_vars)))

ggcorrplot(cor_mat,
           method   = "circle",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3,
           colors   = c("#d73027","white","#1a9850"),
           title    = "Correlation Heatmap — Auto Quantitative Predictors",
           ggtheme  = theme_bw()) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Interpretation: Strong negative correlations exist between mpg and weight (r ≈ −0.83), displacement (r ≈ −0.81), cylinders (r ≈ −0.78), and horsepower (r ≈ −0.78). year shows a positive correlation with mpg (r ≈ +0.58), implying fuel efficiency improved over the model years studied. The engineering variables (cylinders, displacement, horsepower, weight) are highly inter-correlated, suggesting multicollinearity should be considered in regression models.

1.6 (f) Predictors Useful for Predicting `mpg`

# Scatter plots of mpg against each other quantitative predictor
other_vars <- setdiff(quant_vars, "mpg")

plots <- lapply(other_vars, function(v) {
  ggplot(Auto, aes_string(x = v, y = "mpg")) +
    geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
    geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
    geom_smooth(method = "loess", se = FALSE, color = "darkgreen",
                linetype = "dashed", linewidth = 0.8) +
    labs(title = paste("mpg vs.", v),
         x = v, y = "mpg") +
    theme_bw(base_size = 11)
})

grid.arrange(grobs = plots, ncol = 3,
             top = "MPG vs. Quantitative Predictors (Red = OLS, Green Dashed = LOESS)")

Interpretation: Based on the plots:

weight, displacement, horsepower, cylinders: All show strong, negative non-linear relationships with mpg. Heavier, larger-engine cars consume more fuel. These predictors would be most useful.

year: Positive relationship — fuel efficiency improved over time due to regulatory and engineering changes. A useful predictor.

acceleration: Weaker, more scattered relationship; moderate predictive value.

The LOESS curves reveal non-linear patterns, particularly for horsepower and weight, suggesting that polynomial or logarithmic transformations may improve model fit.

2 Chapter 3 — Question 9: Multiple Linear Regression on `Auto`

This question involves the use of multiple linear regression on the Auto data set.

2.1 (a) Scatterplot Matrix

# Scatterplot matrix for all variables (excluding 'name')
ggpairs(
  Auto %>% select(-name),
  upper = list(continuous = wrap("cor", size = 2.8, color = "steelblue"),
               combo       = "box_no_facet"),
  lower = list(continuous = wrap("points", alpha = 0.25, size = 0.7,
                                 color = "steelblue"),
               combo       = "dot_no_facet"),
  diag  = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.4),
               discrete    = "barDiag"),
  title = "Scatterplot Matrix — All Auto Predictors (excluding 'name')"
) +
  theme_bw(base_size = 9)

2.2 (b) Correlation Matrix

# Exclude 'name' (qualitative) before computing correlations
cor_auto <- cor(Auto %>% select(-name))

# Display as rounded table
kable(round(cor_auto, 3),
      caption = "Correlation Matrix — Auto Dataset (excluding 'name')") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2c7bb6")

Correlation Matrix — Auto Dataset (excluding ‘name’)
	mpg	cylinders	displacement	horsepower	weight	acceleration	year	origin
mpg	1.000	-0.778	-0.805	-0.778	-0.832	0.423	0.581	0.565
cylinders	-0.778	1.000	0.951	0.843	0.898	-0.505	-0.346	-0.569
displacement	-0.805	0.951	1.000	0.897	0.933	-0.544	-0.370	-0.615
horsepower	-0.778	0.843	0.897	1.000	0.865	-0.689	-0.416	-0.455
weight	-0.832	0.898	0.933	0.865	1.000	-0.417	-0.309	-0.585
acceleration	0.423	-0.505	-0.544	-0.689	-0.417	1.000	0.290	0.213
year	0.581	-0.346	-0.370	-0.416	-0.309	0.290	1.000	0.182
origin	0.565	-0.569	-0.615	-0.455	-0.585	0.213	0.182	1.000

Interpretation: The correlation matrix confirms that mpg is most strongly (negatively) correlated with weight (−0.832), displacement (−0.805), and cylinders (−0.778). The high inter-correlations among the engineering predictors (e.g., displacement–weight ≈ 0.933) signal potential multicollinearity in a regression model.

2.3 (c) Multiple Linear Regression — `mpg ~ .` (excluding `name`)

# Fit multiple linear regression: mpg as response, all except 'name' as predictors
lm_auto <- lm(mpg ~ . - name, data = Auto)

# Print detailed summary
summary(lm_auto)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

# Tidy coefficient table for presentation
tidy(lm_auto) %>%
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  kable(digits = 4,
        col.names = c("Predictor","Estimate","Std. Error","t-value","p-value","Significant (α=0.05)"),
        caption = "MLR Coefficient Summary: mpg ~ . - name") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(tidy(lm_auto)$p.value < 0.05), background = "#e8f5e9")

MLR Coefficient Summary: mpg ~ . - name
Predictor	Estimate	Std. Error	t-value	p-value	Significant (α=0.05)
(Intercept)	-17.2184	4.6443	-3.7074	0.0002	Yes
cylinders	-0.4934	0.3233	-1.5261	0.1278	No
displacement	0.0199	0.0075	2.6474	0.0084	Yes
horsepower	-0.0170	0.0138	-1.2295	0.2196	No
weight	-0.0065	0.0007	-9.9288	0.0000	Yes
acceleration	0.0806	0.0988	0.8152	0.4155	No
year	0.7508	0.0510	14.7288	0.0000	Yes
origin	1.4261	0.2781	5.1275	0.0000	Yes

i. Is there a relationship between the predictors and the response?

The overall F-test is highly significant (F ≈ 252.4 on 7 and 384 df, p < 2.2×10⁻¹⁶), confirming that at least one predictor has a statistically significant linear relationship with mpg. The model explains approximately 82.1% of the variance in fuel efficiency (Adjusted R² ≈ 0.821).

ii. Which predictors have a statistically significant relationship?

At the α = 0.05 level, the following predictors are statistically significant: displacement, weight, year, and origin. Variables cylinders, horsepower, and acceleration are not significant in the multiple regression context, likely due to multicollinearity with displacement and weight.

iii. What does the coefficient for year suggest?

The estimated coefficient for year is approximately +0.751, meaning that, holding all other predictors constant, each additional model year is associated with an increase of ~0.75 mpg. This reflects real-world improvements in automotive fuel efficiency driven by the 1973–74 oil crisis and subsequent CAFE (Corporate Average Fuel Economy) standards introduced in 1975.

2.4 (d) Diagnostic Plots

# Extract model diagnostics using broom::augment
diag_df <- augment(lm_auto)

# 1. Residuals vs Fitted
p1 <- ggplot(diag_df, aes(.fitted, .resid)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "firebrick") +
  geom_smooth(method = "loess", se = FALSE, color = "firebrick", linewidth = 0.8) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Values", y = "Residuals") +
  theme_bw()

# 2. Normal Q-Q
p2 <- ggplot(diag_df, aes(sample = .std.resid)) +
  stat_qq(alpha = 0.4, color = "steelblue", size = 1.5) +
  stat_qq_line(color = "firebrick", linewidth = 0.8) +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles", y = "Standardised Residuals") +
  theme_bw()

# 3. Scale-Location
p3 <- ggplot(diag_df, aes(.fitted, sqrt(abs(.std.resid)))) +
  geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
  geom_smooth(method = "loess", se = FALSE, color = "firebrick", linewidth = 0.8) +
  labs(title = "Scale-Location",
       x = "Fitted Values", y = expression(sqrt("|Standardised Residuals|"))) +
  theme_bw()

# 4. Leverage (hat values) vs Standardised Residuals
p4 <- ggplot(diag_df, aes(.hat, .std.resid)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
  geom_hline(yintercept = c(-2, 0, 2), linetype = "dashed", color = "firebrick") +
  geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 0.8) +
  labs(title = "Leverage vs. Standardised Residuals",
       x = "Leverage (hat values)", y = "Standardised Residuals") +
  theme_bw()

grid.arrange(p1, p2, p3, p4, ncol = 2,
             top = "Diagnostic Plots — MLR: mpg ~ . - name")

# Identify high-leverage points (threshold: 2 * mean hat value)
lev_threshold <- 2 * mean(hatvalues(lm_auto))
high_lev <- which(hatvalues(lm_auto) > lev_threshold)
cat("High leverage observations (index):", paste(high_lev[1:10], collapse=", "),
    "...\n(", length(high_lev), "total )\n")

## High leverage observations (index): 7, 8, 9, 13, 14, 26, 27, 28, 29, 94 ...
## ( 17 total )

Interpretation:

Residuals vs. Fitted: A slight curvature in the LOESS smoother suggests mild non-linearity not fully captured by the linear model.

Q-Q Plot: Residuals are approximately normal with minor deviation in the tails, acceptable for inference.

Scale-Location: A slight increasing trend indicates mild heteroscedasticity — variance of residuals grows with fitted values.

Leverage Plot: Several high-leverage points are identified. Observation 14 appears in many summaries as having unusually high leverage. High leverage points have unusual predictor combinations but do not necessarily distort the fit.

2.5 (e) Interaction Effects

# Test interaction effects using * (main effects + interaction)
# Focus on two theoretically motivated interactions:
# weight:horsepower — both reflect engine load
# displacement:year — engine downsizing over time
lm_interact <- lm(mpg ~ . - name + weight:horsepower + displacement:year, data = Auto)
summary(lm_interact)

## 
## Call:
## lm(formula = mpg ~ . - name + weight:horsepower + displacement:year, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7503 -1.5594 -0.0426  1.2853 12.1239 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.698e+01  8.194e+00  -3.293  0.00108 ** 
## cylinders          2.082e-01  2.870e-01   0.725  0.46867    
## displacement       1.582e-01  3.583e-02   4.416 1.31e-05 ***
## horsepower        -2.104e-01  2.360e-02  -8.915  < 2e-16 ***
## weight            -1.019e-02  7.510e-04 -13.561  < 2e-16 ***
## acceleration      -5.116e-02  8.705e-02  -0.588  0.55705    
## year               1.116e+00  9.143e-02  12.207  < 2e-16 ***
## origin             7.994e-01  2.458e-01   3.252  0.00125 ** 
## horsepower:weight  4.757e-05  5.413e-06   8.787  < 2e-16 ***
## displacement:year -2.090e-03  4.834e-04  -4.324 1.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.866 on 382 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8652 
## F-statistic: 279.8 on 9 and 382 DF,  p-value: < 2.2e-16

Interpretation: The interaction term weight:horsepower is statistically significant (p < 0.05), suggesting that the joint effect of weight and horsepower on fuel efficiency is not simply additive — the penalty of high horsepower is amplified in heavier vehicles. The displacement:year interaction is also significant, reflecting that more recent large-displacement engines became relatively more efficient over time.

2.6 (f) Variable Transformations

# --- Log Transformation ---
lm_log <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +
               acceleration + year + origin, data = Auto)

cat("=== Log-transformed model ===\n")

## === Log-transformed model ===

cat("Adjusted R²:", summary(lm_log)$adj.r.squared, "\n")

## Adjusted R²: 0.8445302

cat("Residual SE:", summary(lm_log)$sigma, "\n\n")

## Residual SE: 3.077488

# --- Square Root Transformation ---
lm_sqrt <- lm(mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight) +
                acceleration + year + origin, data = Auto)

cat("=== Sqrt-transformed model ===\n")

## === Sqrt-transformed model ===

cat("Adjusted R²:", summary(lm_sqrt)$adj.r.squared, "\n")

## Adjusted R²: 0.8312818

cat("Residual SE:", summary(lm_sqrt)$sigma, "\n\n")

## Residual SE: 3.205932

# --- Quadratic (X²) Transformation ---
lm_sq <- lm(mpg ~ poly(displacement, 2) + poly(horsepower, 2) + poly(weight, 2) +
              acceleration + year + origin, data = Auto)

cat("=== Quadratic (X²) model ===\n")

## === Quadratic (X²) model ===

cat("Adjusted R²:", summary(lm_sq)$adj.r.squared, "\n")

## Adjusted R²: 0.861586

cat("Residual SE:", summary(lm_sq)$sigma, "\n\n")

## Residual SE: 2.903778

# Comparison table
model_compare <- data.frame(
  Model        = c("Original MLR","Log Transform","Sqrt Transform","Quadratic (X²)"),
  Adj_R2       = c(summary(lm_auto)$adj.r.squared,
                   summary(lm_log)$adj.r.squared,
                   summary(lm_sqrt)$adj.r.squared,
                   summary(lm_sq)$adj.r.squared),
  Residual_SE  = c(summary(lm_auto)$sigma,
                   summary(lm_log)$sigma,
                   summary(lm_sqrt)$sigma,
                   summary(lm_sq)$sigma)
)

kable(model_compare, digits = 4,
      caption = "Model Comparison: Transformations of Key Predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Model Comparison: Transformations of Key Predictors
Model	Adj_R2	Residual_SE
Original MLR	0.8182	3.3277
Log Transform	0.8445	3.0775
Sqrt Transform	0.8313	3.2059
Quadratic (X²)	0.8616	2.9038

Interpretation: Applying logarithmic or square-root transformations to the skewed engineering predictors (displacement, horsepower, weight) increases the Adjusted R² and reduces the Residual SE relative to the untransformed model. The log-transformed model typically performs best because it linearises the multiplicative relationship between engine size/weight and fuel consumption. Quadratic terms capture non-linearity directly but add degrees of freedom without always matching log-transform gains.

3 Chapter 3 — Question 15: Predicting Crime Rate in the `Boston` Dataset

We will try to predict per capita crime rate (crim) using the other variables in the Boston data set.

data(Boston)
glimpse(Boston)

## Rows: 506
## Columns: 13
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

3.1 (a) Simple Linear Regression for Each Predictor

# Fit a simple linear regression for each predictor
predictors_boston <- setdiff(names(Boston), "crim")

# Collect tidy results
slr_results <- lapply(predictors_boston, function(v) {
  mod  <- lm(as.formula(paste("crim ~", v)), data = Boston)
  tidy_mod <- tidy(mod)
  # Return only the predictor row (slope)
  tidy_mod[2, ] %>%
    mutate(predictor = v,
           adj_r2    = summary(mod)$adj.r.squared)
}) %>% bind_rows()

# Flag significance
slr_sig <- slr_results %>%
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  select(predictor, estimate, std.error, statistic, p.value, adj_r2, significant)

kable(slr_sig, digits = 4,
      col.names = c("Predictor","Slope","Std. Error","t-stat","p-value",
                    "Adj. R²","Sig. (α=0.05)"),
      caption = "Simple Linear Regression Results: crim ~ each predictor") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 11) %>%
  row_spec(which(slr_sig$significant == "Yes"), background = "#e8f5e9")

Simple Linear Regression Results: crim ~ each predictor
Predictor	Slope	Std. Error	t-stat	p-value	Adj. R²	Sig. (α=0.05)
zn	-0.0739	0.0161	-4.5938	0.0000	0.0383	Yes
indus	0.5098	0.0510	9.9908	0.0000	0.1637	Yes
chas	-1.8928	1.5061	-1.2567	0.2094	0.0011	No
nox	31.2485	2.9992	10.4190	0.0000	0.1756	Yes
rm	-2.6841	0.5320	-5.0448	0.0000	0.0462	Yes
age	0.1078	0.0127	8.4628	0.0000	0.1227	Yes
dis	-1.5509	0.1683	-9.2135	0.0000	0.1425	Yes
rad	0.6179	0.0343	17.9982	0.0000	0.3900	Yes
tax	0.0297	0.0018	16.0994	0.0000	0.3383	Yes
ptratio	1.1520	0.1694	6.8014	0.0000	0.0823	Yes
lstat	0.5488	0.0478	11.4907	0.0000	0.2060	Yes
medv	-0.3632	0.0384	-9.4597	0.0000	0.1491	Yes

# Scatterplots with regression lines for all predictors
sig_preds <- slr_sig %>% filter(significant == "Yes") %>% pull(predictor)

plots_15a <- lapply(sig_preds, function(v) {
  ggplot(Boston, aes_string(x = v, y = "crim")) +
    geom_point(alpha = 0.3, color = "steelblue", size = 1) +
    geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
    labs(title = paste("crim ~", v),
         x = v, y = "Crime Rate") +
    theme_bw(base_size = 10)
})

grid.arrange(grobs = plots_15a, ncol = 4,
             top = "SLR: Crime Rate vs. Significant Predictors")

Interpretation: Almost all predictors show a statistically significant univariate association with crim (p < 0.05), with the exception of chas (Charles River dummy variable), which is not significant. Notable strong associations: rad (accessibility to radial highways), tax (property tax rate), and lstat (lower-status population %) are positively associated with crime; medv (median home value) is negatively associated.

3.2 (b) Multiple Linear Regression

# Fit multiple regression with all predictors
lm_boston <- lm(crim ~ ., data = Boston)
summary(lm_boston)

## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

tidy(lm_boston) %>%
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  kable(digits = 4,
        col.names = c("Predictor","Estimate","Std. Error","t-value","p-value","Sig."),
        caption = "Multiple Linear Regression: crim ~ all predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(tidy(lm_boston)$p.value < 0.05), background = "#e8f5e9")

Multiple Linear Regression: crim ~ all predictors
Predictor	Estimate	Std. Error	t-value	p-value	Sig.
(Intercept)	13.7784	7.0818	1.9456	0.0523	No
zn	0.0457	0.0188	2.4326	0.0153	Yes
indus	-0.0584	0.0836	-0.6977	0.4857	No
chas	-0.8254	1.1834	-0.6975	0.4858	No
nox	-9.9576	5.2898	-1.8824	0.0604	No
rm	0.6289	0.6071	1.0359	0.3007	No
age	-0.0008	0.0179	-0.0473	0.9623	No
dis	-1.0122	0.2825	-3.5836	0.0004	Yes
rad	0.6125	0.0875	6.9967	0.0000	Yes
tax	-0.0038	0.0052	-0.7300	0.4658	No
ptratio	-0.3041	0.1864	-1.6316	0.1034	No
lstat	0.1388	0.0757	1.8330	0.0674	No
medv	-0.2201	0.0598	-3.6784	0.0003	Yes

Interpretation: In the multiple regression model, only a subset of predictors remain statistically significant after adjusting for all others: zn, dis, rad, black, and medv. Many variables significant in simple regression lose significance in the multiple regression due to multicollinearity — for example, tax (highly correlated with rad). We reject H₀: βⱼ = 0 for the above-listed predictors.

3.3 (c) Univariate vs. Multiple Regression Coefficient Comparison Plot

# Prepare data: univariate slopes vs. multiple regression slopes
uni_coefs <- slr_results %>%
  select(predictor, estimate) %>%
  rename(univariate = estimate)

multi_coefs <- tidy(lm_boston) %>%
  filter(term != "(Intercept)") %>%
  select(term, estimate) %>%
  rename(predictor = term, multiple = estimate)

coef_compare <- inner_join(uni_coefs, multi_coefs, by = "predictor")

ggplot(coef_compare, aes(x = univariate, y = multiple, label = predictor)) +
  geom_point(color = "steelblue", size = 3) +
  geom_text(vjust = -0.6, hjust = 0.5, size = 3.2, color = "black") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_abline(slope = 1, intercept = 0, color = "firebrick",
              linetype = "dotted", linewidth = 0.8) +
  labs(
    title    = "Univariate vs. Multiple Regression Coefficients",
    subtitle = "Each point = one predictor. Red dotted line = perfect agreement.",
    x        = "Univariate Regression Coefficient",
    y        = "Multiple Regression Coefficient"
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Interpretation: The scatter of points away from the identity line (y = x) illustrates how controlling for other predictors changes estimated coefficients substantially. nox (nitric oxide concentration) is a striking example: its large positive univariate coefficient shrinks or changes sign in the multiple regression, consistent with it acting as a proxy for other correlated urban-environment variables.

3.4 (d) Non-linear Association — Cubic Polynomial Models

# For each predictor, fit crim ~ X + X^2 + X^3 and test significance of
# quadratic and cubic terms via ANOVA

poly_results <- lapply(predictors_boston, function(v) {
  # Suppress errors for constant predictors
  tryCatch({
    formula_poly <- as.formula(paste("crim ~ poly(", v, ", 3)"))
    mod_poly <- lm(formula_poly, data = Boston)
    s <- summary(mod_poly)
    coef_tbl <- tidy(mod_poly)

    # Extract p-values for X², X³ terms
    p_linear    <- coef_tbl$p.value[2]
    p_quadratic <- ifelse(nrow(coef_tbl) >= 3, coef_tbl$p.value[3], NA)
    p_cubic     <- ifelse(nrow(coef_tbl) >= 4, coef_tbl$p.value[4], NA)

    data.frame(
      Predictor   = v,
      p_X1        = round(p_linear, 4),
      p_X2        = round(p_quadratic, 4),
      p_X3        = round(p_cubic, 4),
      Adj_R2      = round(s$adj.r.squared, 4),
      NonLinearSig = ifelse(!is.na(p_quadratic) & (p_quadratic < 0.05 |
                            (!is.na(p_cubic) & p_cubic < 0.05)), "Yes", "No")
    )
  }, error = function(e) NULL)
}) %>% bind_rows()

kable(poly_results,
      col.names = c("Predictor","p(X)","p(X²)","p(X³)","Adj.R²","Non-Linear Sig."),
      caption = "Cubic Polynomial Regression: crim ~ X + X² + X³") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 11) %>%
  row_spec(which(poly_results$NonLinearSig == "Yes"), background = "#fff3cd")

Cubic Polynomial Regression: crim ~ X + X² + X³
Predictor	p(X²)	p(X³)	Adj.R²	Non-Linear Sig.
zn	0.0044	0.2295	0.0526	Yes
indus	0.0011	0.0000	0.2552	Yes
nox	0.0001	0.0000	0.2928	Yes
rm	0.0015	0.5086	0.0622	Yes
age	0.0000	0.0067	0.1693	Yes
dis	0.0000	0.0000	0.2735	Yes
rad	0.0091	0.4823	0.3965	Yes
tax	0.0000	0.2439	0.3651	Yes
ptratio	0.0024	0.0063	0.1085	Yes
lstat	0.0378	0.1299	0.2133	Yes
medv	0.0000	0.0000	0.4167	Yes

Interpretation: For many predictors — including indus, nox, age, dis, ptratio, and medv — the quadratic and/or cubic terms are statistically significant, providing evidence of non-linear relationships with crime rate. For example, the relationship between dis (distance to employment centres) and crim appears to decrease sharply at first before levelling off — a pattern better captured by a polynomial model. These findings suggest that a linear model may be misspecified for predicting crim, and non-linear methods (splines, GAMs) could improve fit.

4 Chapter 4 — Question 13: Logistic Regression on the `Weekly` Dataset

This data contains 1,089 weekly stock market returns for 21 years, from 1990 to 2010. The response is Direction (Up/Down).

data(Weekly)
glimpse(Weekly)

## Rows: 1,089
## Columns: 9
## $ Year      <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
## $ Lag1      <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0…
## $ Lag2      <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0…
## $ Lag3      <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -…
## $ Lag4      <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, …
## $ Lag5      <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514,…
## $ Volume    <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0.154…
## $ Today     <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041, 1…
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up…

cat("\nClass balance:\n")

## 
## Class balance:

table(Weekly$Direction)

## 
## Down   Up 
##  484  605

4.1 (a) Numerical and Graphical Summaries

# Numerical summary
summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

# Volume over time
ggplot(Weekly, aes(x = Year, y = Volume)) +
  geom_line(color = "steelblue", alpha = 0.7, linewidth = 0.8) +
  geom_smooth(method = "loess", color = "firebrick", se = FALSE, linewidth = 1) +
  labs(title = "Average Weekly Trading Volume Over Time (1990–2010)",
       x = "Year", y = "Volume (avg. daily billions)") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

# Distribution of Today's returns by Direction
ggplot(Weekly, aes(x = Today, fill = Direction)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  scale_fill_manual(values = c("Down" = "#d73027","Up" = "#1a9850")) +
  labs(title = "Distribution of Weekly Returns by Market Direction",
       x = "Today's Return (%)", y = "Count", fill = "Direction") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

# Lag returns by Direction
Weekly_long <- Weekly %>%
  select(Direction, Lag1:Lag5) %>%
  pivot_longer(Lag1:Lag5, names_to = "Lag", values_to = "Return")

ggplot(Weekly_long, aes(x = Lag, y = Return, fill = Direction)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.5) +
  scale_fill_manual(values = c("Down" = "#d73027","Up" = "#1a9850")) +
  labs(title = "Distribution of Lag Returns by Market Direction",
       x = "Lag Variable", y = "Lagged Return (%)", fill = "Direction") +
  theme_bw(base_size = 12)

Interpretation: Trading volume increased substantially over the 21-year period, particularly through the late 1990s tech boom and post-2003 expansion. The distribution of weekly returns is approximately symmetric around zero, with no obvious separation between Up and Down distributions across the lag variables — suggesting the lag predictors alone may have limited discriminatory power. The Up class is somewhat more prevalent (≈56%) than Down, reflecting the long-term upward drift of equity markets.

4.2 (b) Logistic Regression — Full Model with All Lags + Volume

# Full logistic regression: Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data   = Weekly,
                family = binomial)

summary(glm_full)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

tidy(glm_full) %>%
  mutate(
    OR          = exp(estimate),
    significant = ifelse(p.value < 0.05, "Yes", "No")
  ) %>%
  kable(digits = 4,
        col.names = c("Predictor","Log-Odds","Std. Error","z-value","p-value",
                      "Odds Ratio","Sig. (α=0.05)"),
        caption = "Logistic Regression Coefficients (Full Model)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(tidy(glm_full)$p.value < 0.05), background = "#e8f5e9")

Logistic Regression Coefficients (Full Model)
Predictor	Log-Odds	Std. Error	z-value	p-value	Odds Ratio	Sig. (α=0.05)
(Intercept)	0.2669	0.0859	3.1056	0.0019	1.3059	Yes
Lag1	-0.0413	0.0264	-1.5626	0.1181	0.9596	No
Lag2	0.0584	0.0269	2.1754	0.0296	1.0602	Yes
Lag3	-0.0161	0.0267	-0.6024	0.5469	0.9841	No
Lag4	-0.0278	0.0265	-1.0501	0.2937	0.9726	No
Lag5	-0.0145	0.0264	-0.5485	0.5833	0.9856	No
Volume	-0.0227	0.0369	-0.6163	0.5377	0.9775	No

Interpretation: Among the six predictors, only Lag2 is statistically significant (p ≈ 0.030). Its positive log-odds coefficient (+0.058) indicates that a higher return two weeks prior is associated with slightly higher odds of the market going Up the current week. The other lag variables and Volume are not significant. This is consistent with weak-form market efficiency — past prices have minimal predictive power.

4.3 (c) Confusion Matrix — Full Model

# Predicted probabilities and class predictions (threshold = 0.5)
pred_probs_full <- predict(glm_full, type = "response")
pred_class_full <- ifelse(pred_probs_full > 0.5, "Up", "Down")

# Confusion matrix
conf_mat_full <- table(Predicted = pred_class_full, Actual = Weekly$Direction)
conf_mat_full

##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557

# Overall accuracy
accuracy_full <- mean(pred_class_full == Weekly$Direction)
cat(sprintf("\nOverall Accuracy (full model, training data): %.1f%%\n",
            accuracy_full * 100))

## 
## Overall Accuracy (full model, training data): 56.1%

# Sensitivity & Specificity
TP <- conf_mat_full["Up",   "Up"]
TN <- conf_mat_full["Down", "Down"]
FP <- conf_mat_full["Up",   "Down"]
FN <- conf_mat_full["Down", "Up"]

cat(sprintf("Sensitivity (True Positive Rate): %.1f%%\n", TP/(TP+FN)*100))

## Sensitivity (True Positive Rate): 92.1%

cat(sprintf("Specificity (True Negative Rate): %.1f%%\n", TN/(TN+FP)*100))

## Specificity (True Negative Rate): 11.2%

# Visual confusion matrix
conf_df <- as.data.frame(conf_mat_full) %>%
  mutate(label = paste0(Freq, "\n(",
                        round(Freq/sum(conf_mat_full)*100, 1), "%)"))

ggplot(conf_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = label), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "#e8f4fd", high = "#2196F3") +
  labs(title = "Confusion Matrix — Full Logistic Regression Model",
       subtitle = paste0("Overall Accuracy: ", round(accuracy_full*100,1), "%"),
       x = "Actual Direction", y = "Predicted Direction") +
  theme_bw(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        legend.position = "none")

Interpretation: The confusion matrix reveals that the model achieves an overall accuracy of 56.1%. However, this masks class imbalance: the model has a strong bias toward predicting “Up” — it correctly identifies most “Up” weeks (high sensitivity ~92%) but performs very poorly at detecting “Down” weeks (low specificity ~11%). This means the model essentially defaults to the majority class and provides little incremental value for identifying market downturns.

4.4 (d) Logistic Regression — Training (1990–2008) and Test (2009–2010)

# Split data: training = 1990–2008, test = 2009–2010
train <- Weekly %>% filter(Year <= 2008)
test  <- Weekly %>% filter(Year  > 2008)

cat("Training observations:", nrow(train), "\n")

## Training observations: 985

cat("Test observations:    ", nrow(test),  "\n")

## Test observations:     104

# Fit logistic regression with Lag2 only on training data
glm_lag2 <- glm(Direction ~ Lag2,
                data   = train,
                family = binomial)

summary(glm_lag2)

## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4

# Predict on held-out test data (2009–2010)
pred_probs_test <- predict(glm_lag2, newdata = test, type = "response")
pred_class_test <- ifelse(pred_probs_test > 0.5, "Up", "Down")

# Confusion matrix on test set
conf_mat_test <- table(Predicted = pred_class_test, Actual = test$Direction)
conf_mat_test

##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56

# Overall test accuracy
accuracy_test <- mean(pred_class_test == test$Direction)
cat(sprintf("\nOverall Test Accuracy (2009–2010, Lag2 only): %.1f%%\n",
            accuracy_test * 100))

## 
## Overall Test Accuracy (2009–2010, Lag2 only): 62.5%

# Sensitivity & Specificity on test
TP_t <- conf_mat_test["Up",   "Up"]
TN_t <- conf_mat_test["Down", "Down"]
FP_t <- conf_mat_test["Up",   "Down"]
FN_t <- conf_mat_test["Down", "Up"]

cat(sprintf("Test Sensitivity: %.1f%%\n", TP_t/(TP_t+FN_t)*100))

## Test Sensitivity: 91.8%

cat(sprintf("Test Specificity: %.1f%%\n", TN_t/(TN_t+FP_t)*100))

## Test Specificity: 20.9%

# Visual confusion matrix — test set
conf_test_df <- as.data.frame(conf_mat_test) %>%
  mutate(label = paste0(Freq, "\n(", round(Freq/sum(conf_mat_test)*100,1), "%)"))

ggplot(conf_test_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = label), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "#e8f4fd", high = "#2196F3") +
  labs(title    = "Confusion Matrix — Lag2 Model on Test Data (2009–2010)",
       subtitle = paste0("Test Accuracy: ", round(accuracy_test*100,1), "%"),
       x = "Actual Direction", y = "Predicted Direction") +
  theme_bw(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        legend.position = "none")

Interpretation: The model trained on 1990–2008 data using only Lag2 achieves a test accuracy of 62.5% on the 2009–2010 held-out data. While seemingly modest, this exceeds the naive baseline of always predicting “Up” (≈ 58% accuracy for this test period). The improvement in specificity compared to the full model confirms that the parsimonious Lag2-only model generalises better out-of-sample. The ability of last week’s two-period lagged return to carry predictive signal is consistent with short-horizon momentum effects documented in the finance literature. Nevertheless, the overall discriminatory power remains limited, reinforcing the difficulty of beating the market using only historical price data.

5 Session Information

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Taipei
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0       broom_1.0.12       ggcorrplot_0.1.4.1 gridExtra_2.3     
##  [5] kableExtra_1.4.0   knitr_1.49         tidyr_1.3.1        dplyr_1.1.4       
##  [9] GGally_2.4.0       ggplot2_4.0.2      ISLR2_1.3-2       
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9         utf8_1.2.4         generics_0.1.3     xml2_1.3.6        
##  [5] lattice_0.22-6     stringi_1.8.4      digest_0.6.37      magrittr_2.0.3    
##  [9] evaluate_1.0.1     grid_4.4.1         RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] Matrix_1.7-0       plyr_1.8.9         jsonlite_1.8.9     backports_1.5.0   
## [17] mgcv_1.9-1         purrr_1.2.1        fansi_1.0.6        viridisLite_0.4.2 
## [21] textshaping_0.4.0  jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6       
## [25] splines_4.4.1      withr_3.0.2        cachem_1.1.0       yaml_2.3.10       
## [29] tools_4.4.1        reshape2_1.4.5     ggstats_0.13.0     vctrs_0.6.5       
## [33] R6_2.5.1           lifecycle_1.0.4    stringr_1.5.1      pkgconfig_2.0.3   
## [37] pillar_1.9.0       bslib_0.8.0        gtable_0.3.6       Rcpp_1.0.13       
## [41] glue_1.8.0         systemfonts_1.3.2  xfun_0.49          tibble_3.2.1      
## [45] tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2       nlme_3.1-164      
## [49] htmltools_0.5.8.1  rmarkdown_2.29     svglite_2.2.2      labeling_0.4.3    
## [53] compiler_4.4.1     S7_0.2.1

Midterm Homework — Application of Financial Software

Jourast Buwana

2026-04-15

1 Chapter 2 — Question 9: The `Auto` Dataset

1.1 (a) Quantitative vs. Qualitative Predictors

1.2 (b) Range of Each Quantitative Predictor

1.3 (c) Mean and Standard Deviation of Each Quantitative Predictor

1.4 (d) Summary Statistics After Removing Observations 10–85

1.5 (e) Graphical Investigation — Scatterplot Matrix

1.6 (f) Predictors Useful for Predicting `mpg`

2 Chapter 3 — Question 9: Multiple Linear Regression on `Auto`

2.1 (a) Scatterplot Matrix

2.2 (b) Correlation Matrix

2.3 (c) Multiple Linear Regression — `mpg ~ .` (excluding `name`)

2.4 (d) Diagnostic Plots

2.5 (e) Interaction Effects

2.6 (f) Variable Transformations

3 Chapter 3 — Question 15: Predicting Crime Rate in the `Boston` Dataset

3.1 (a) Simple Linear Regression for Each Predictor

3.2 (b) Multiple Linear Regression

3.3 (c) Univariate vs. Multiple Regression Coefficient Comparison Plot

3.4 (d) Non-linear Association — Cubic Polynomial Models

4 Chapter 4 — Question 13: Logistic Regression on the `Weekly` Dataset

4.1 (a) Numerical and Graphical Summaries

4.2 (b) Logistic Regression — Full Model with All Lags + Volume

4.3 (c) Confusion Matrix — Full Model

4.4 (d) Logistic Regression — Training (1990–2008) and Test (2009–2010)

5 Session Information

Midterm Homework — Application of Financial Software

Jourast Buwana

2026-04-15

1 Chapter 2 — Question 9: The Auto Dataset

1.1 (a) Quantitative vs. Qualitative Predictors

1.2 (b) Range of Each Quantitative Predictor

1.3 (c) Mean and Standard Deviation of Each Quantitative Predictor

1.4 (d) Summary Statistics After Removing Observations 10–85

1.5 (e) Graphical Investigation — Scatterplot Matrix

1.6 (f) Predictors Useful for Predicting mpg

2 Chapter 3 — Question 9: Multiple Linear Regression on Auto

2.1 (a) Scatterplot Matrix

2.2 (b) Correlation Matrix

2.3 (c) Multiple Linear Regression — mpg ~ . (excluding name)

2.4 (d) Diagnostic Plots

2.5 (e) Interaction Effects

2.6 (f) Variable Transformations

3 Chapter 3 — Question 15: Predicting Crime Rate in the Boston Dataset

3.1 (a) Simple Linear Regression for Each Predictor

3.2 (b) Multiple Linear Regression

3.3 (c) Univariate vs. Multiple Regression Coefficient Comparison Plot

3.4 (d) Non-linear Association — Cubic Polynomial Models

4 Chapter 4 — Question 13: Logistic Regression on the Weekly Dataset

4.1 (a) Numerical and Graphical Summaries

4.2 (b) Logistic Regression — Full Model with All Lags + Volume

4.3 (c) Confusion Matrix — Full Model

4.4 (d) Logistic Regression — Training (1990–2008) and Test (2009–2010)

5 Session Information

1 Chapter 2 — Question 9: The `Auto` Dataset

1.6 (f) Predictors Useful for Predicting `mpg`

2 Chapter 3 — Question 9: Multiple Linear Regression on `Auto`

2.3 (c) Multiple Linear Regression — `mpg ~ .` (excluding `name`)

3 Chapter 3 — Question 15: Predicting Crime Rate in the `Boston` Dataset

4 Chapter 4 — Question 13: Logistic Regression on the `Weekly` Dataset