Chapter 2 — Exercise 9: The Auto Dataset

Setup: Load and Clean Data

data(Auto)
Auto <- na.omit(Auto)

# Overview
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
dim(Auto)
## [1] 392   9

(a) Quantitative vs. Qualitative Predictors

The Auto dataset contains 9 variables. We classify them as follows:

  • Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
  • Qualitative: name (car name, character string), origin (1 = American, 2 = European, 3 = Japanese — encodes geography, not magnitude)

Although cylinders and origin are stored as integers, origin is inherently qualitative. cylinders takes only a handful of discrete values (3, 4, 5, 6, 8) but is treated as quantitative here for summary purposes.

(b) Range of Each Quantitative Predictor

quant_vars <- c("mpg", "cylinders", "displacement",
                "horsepower", "weight", "acceleration", "year")

range_df <- as.data.frame(t(sapply(Auto[, quant_vars], range)))
colnames(range_df) <- c("Min", "Max")

knitr::kable(range_df, caption = "Range of Each Quantitative Predictor",
             digits = 2)
Range of Each Quantitative Predictor
Min Max
mpg 9 46.6
cylinders 3 8.0
displacement 68 455.0
horsepower 46 230.0
weight 1613 5140.0
acceleration 8 24.8
year 70 82.0

(c) Mean and Standard Deviation of Each Quantitative Predictor

stats_df <- data.frame(
  Mean = sapply(Auto[, quant_vars], mean),
  SD   = sapply(Auto[, quant_vars], sd)
)

knitr::kable(round(stats_df, 3),
             caption = "Mean and Standard Deviation of Quantitative Predictors")
Mean and Standard Deviation of Quantitative Predictors
Mean SD
mpg 23.446 7.805
cylinders 5.472 1.706
displacement 194.412 104.644
horsepower 104.469 38.491
weight 2977.584 849.403
acceleration 15.541 2.759
year 75.980 3.684

(d) Statistics After Removing Observations 10–85

Auto_sub <- Auto[-(10:85), ]
cat("Rows remaining:", nrow(Auto_sub), "\n")
## Rows remaining: 316
range_sub <- as.data.frame(t(sapply(Auto_sub[, quant_vars], range)))
colnames(range_sub) <- c("Min", "Max")

stats_sub <- data.frame(
  Mean = sapply(Auto_sub[, quant_vars], mean),
  SD   = sapply(Auto_sub[, quant_vars], sd)
)

knitr::kable(range_sub,
             caption = "Range After Removing Obs. 10–85", digits = 2)
Range After Removing Obs. 10–85
Min Max
mpg 11.0 46.6
cylinders 3.0 8.0
displacement 68.0 455.0
horsepower 46.0 230.0
weight 1649.0 4997.0
acceleration 8.5 24.8
year 70.0 82.0
knitr::kable(round(stats_sub, 3),
             caption = "Mean & SD After Removing Obs. 10–85")
Mean & SD After Removing Obs. 10–85
Mean SD
mpg 24.404 7.867
cylinders 5.373 1.654
displacement 187.241 99.678
horsepower 100.722 35.709
weight 2935.972 811.300
acceleration 15.727 2.694
year 77.146 3.106

Removing observations 10–85 (76 rows) reduces the sample to 316 cars. The ranges narrow — particularly for mpg and horsepower — because that block included many early-1970s high-displacement, low-efficiency vehicles. The mean mpg rises slightly, and standard deviations shrink across most predictors, reflecting reduced variability in the remaining subset.

(e) Graphical Investigation of Predictors

ggpairs(
  Auto[, c(quant_vars, "origin")],
  aes(colour = factor(origin), alpha = 0.4),
  upper = list(continuous = wrap("cor", size = 3)),
  lower = list(continuous = wrap("points", size = 0.7)),
  diag  = list(continuous = wrap("densityDiag"))
) +
  scale_colour_manual(
    values = c("1" = "#E63946", "2" = "#2A9D8F", "3" = "#457B9D"),
    labels = c("American", "European", "Japanese"),
    name   = "Origin"
  ) +
  labs(title = "Scatterplot Matrix — Auto Dataset")
Scatterplot matrix of quantitative predictors coloured by origin

Scatterplot matrix of quantitative predictors coloured by origin

Key findings:

  • displacement, horsepower, and weight are strongly positively correlated (r > 0.85), forming a cluster of engine-size variables.
  • mpg has strong negative correlations with displacement (r ≈ −0.81), weight (r ≈ −0.83), and horsepower (r ≈ −0.78): heavier, more powerful engines are less fuel-efficient.
  • American cars (red) cluster in the high-displacement, low-mpg region; Japanese cars (blue) concentrate in the low-displacement, high-mpg region.
  • year shows a positive relationship with mpg, suggesting fuel efficiency improved over time due to regulatory pressure (CAFE standards).

(f) Which Predictors Are Useful for Predicting mpg?

Auto_long <- Auto %>%
  pivot_longer(cols = c(displacement, horsepower, weight, acceleration, year),
               names_to  = "predictor",
               values_to = "value")

ggplot(Auto_long, aes(x = value, y = mpg)) +
  geom_point(aes(colour = predictor), alpha = 0.3, size = 0.9) +
  geom_smooth(method = "loess", se = TRUE, colour = "black", linewidth = 0.8) +
  facet_wrap(~predictor, scales = "free_x", nrow = 2) +
  scale_colour_brewer(palette = "Set1", guide = "none") +
  labs(title = "mpg vs. Key Predictors",
       x = "Predictor Value", y = "Miles per Gallon")
mpg vs. key predictors with LOESS smoother

mpg vs. key predictors with LOESS smoother

Conclusion: displacement, horsepower, and weight show strong non-linear negative relationships with mpg — each would be a powerful predictor. year exhibits a clear positive trend: newer cars are consistently more fuel-efficient. acceleration has a weaker positive association and would contribute less predictive power on its own.


Chapter 3 — Exercise 9: Multiple Linear Regression on Auto

(a) Scatterplot Matrix — All Variables

ggpairs(
  Auto %>% select(-name),
  aes(alpha = 0.3),
  upper = list(continuous = wrap("cor", size = 3)),
  lower = list(continuous = wrap("points", size = 0.6)),
  diag  = list(continuous = "densityDiag")
) +
  labs(title = "Full Scatterplot Matrix — Auto (excl. name)")
Scatterplot matrix of all Auto variables (excluding name)

Scatterplot matrix of all Auto variables (excluding name)

The matrix confirms the multicollinearity noted in Chapter 2: cylinders, displacement, horsepower, and weight form a tightly correlated group. mpg correlates negatively with all of them, while year and origin both correlate positively with mpg.

(b) Correlation Matrix

cor_matrix <- cor(Auto %>% select(-name))
knitr::kable(round(cor_matrix, 3),
             caption = "Correlation Matrix (all numeric variables)")
Correlation Matrix (all numeric variables)
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423 0.581 0.565
cylinders -0.778 1.000 0.951 0.843 0.898 -0.505 -0.346 -0.569
displacement -0.805 0.951 1.000 0.897 0.933 -0.544 -0.370 -0.615
horsepower -0.778 0.843 0.897 1.000 0.865 -0.689 -0.416 -0.455
weight -0.832 0.898 0.933 0.865 1.000 -0.417 -0.309 -0.585
acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000 0.290 0.213
year 0.581 -0.346 -0.370 -0.416 -0.309 0.290 1.000 0.182
origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213 0.182 1.000

The highest absolute correlations with mpg are weight (−0.832), displacement (−0.805), and cylinders (−0.778). year is positively correlated with mpg (r ≈ 0.581).

(c) Multiple Linear Regression

lm_auto <- lm(mpg ~ . - name, data = Auto)
summary(lm_auto)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response? Yes. The F-statistic is highly significant (p < 2.2e-16), and the model explains approximately 82% of the variance in mpg (Adjusted R² ≈ 0.82). This confirms a strong collective linear relationship.

ii. Which predictors are statistically significant? At α = 0.05, the significant predictors are:

  • displacement (p < 0.01)
  • weight (p < 0.001) — the strongest individual contributor
  • year (p < 0.001) — highly significant positive predictor
  • origin (p < 0.001)
  • horsepower (p < 0.05)

cylinders and acceleration are not significant once the others are controlled for, likely due to multicollinearity.

iii. What does the coefficient for year suggest? The coefficient for year is approximately +0.75, meaning that — holding all other variables constant — each one-year increase in model year is associated with an increase of ~0.75 mpg. This reflects mandatory fuel economy standards (CAFE) introduced after the 1973 oil crisis and continuing engineering improvements throughout the 1970s–80s.

(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_auto)
Regression diagnostic plots

Regression diagnostic plots

par(mfrow = c(1, 1))

Interpretation:

  • Residuals vs. Fitted: A mild curved pattern indicates some non-linearity — the model slightly under-predicts at very low and very high fitted values. A log transformation of mpg could help.
  • Normal Q-Q: Residuals follow the reference line closely, with minor deviation in the upper tail. No severe normality violation.
  • Scale-Location: The spread of residuals is not entirely constant across fitted values, indicating mild heteroscedasticity.
  • Residuals vs. Leverage: Observation 14 has notably high leverage. A few points (e.g., obs. 327) have large residuals. However, Cook’s Distance is well below 0.5 for all points, so no single observation is dramatically influential.

(e) Interaction Effects

# Test two interactions: horsepower:weight and cylinders:displacement
lm_interact <- lm(mpg ~ . - name + horsepower:weight + cylinders:displacement,
                  data = Auto)
summary(lm_interact)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight + cylinders:displacement, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3228 -1.5862 -0.0291  1.5536 11.9296 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.053e+00  4.544e+00   0.892  0.37297    
## cylinders              -7.357e-01  4.834e-01  -1.522  0.12883    
## displacement           -2.010e-02  1.584e-02  -1.269  0.20531    
## horsepower             -2.080e-01  2.683e-02  -7.754 8.17e-14 ***
## weight                 -1.014e-02  9.351e-04 -10.849  < 2e-16 ***
## acceleration           -7.057e-02  8.895e-02  -0.793  0.42805    
## year                    7.692e-01  4.480e-02  17.168  < 2e-16 ***
## origin                  7.159e-01  2.589e-01   2.765  0.00597 ** 
## horsepower:weight       4.699e-05  6.930e-06   6.781 4.55e-11 ***
## cylinders:displacement  3.932e-03  2.165e-03   1.816  0.07008 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.923 on 382 degrees of freedom
## Multiple R-squared:  0.863,  Adjusted R-squared:  0.8598 
## F-statistic: 267.4 on 9 and 382 DF,  p-value: < 2.2e-16

The interaction horsepower:weight is statistically significant (p < 0.05), suggesting that the negative effect of weight on mpg is amplified at higher horsepower levels. This makes physical sense: a heavy car with a powerful engine is disproportionately fuel-inefficient. The cylinders:displacement interaction is not significant once horsepower:weight is included.

(f) Variable Transformations

# Log, square-root, and squared transformations of horsepower
lm_log   <- lm(mpg ~ log(horsepower)   + weight + year + origin, data = Auto)
lm_sqrt  <- lm(mpg ~ sqrt(horsepower)  + weight + year + origin, data = Auto)
lm_sq    <- lm(mpg ~ I(horsepower^2)   + weight + year + origin, data = Auto)

# Compare adjusted R²
data.frame(
  Model    = c("log(horsepower)", "sqrt(horsepower)", "horsepower^2"),
  Adj_R2   = c(summary(lm_log)$adj.r.squared,
               summary(lm_sqrt)$adj.r.squared,
               summary(lm_sq)$adj.r.squared)
) |> knitr::kable(digits = 4, caption = "Adjusted R² for Transformed Models")
Adjusted R² for Transformed Models
Model Adj_R2
log(horsepower) 0.8258
sqrt(horsepower) 0.8196
horsepower^2 0.8167
ggplot(Auto, aes(x = log(horsepower), y = mpg)) +
  geom_point(alpha = 0.4, colour = "#457B9D") +
  geom_smooth(method = "lm", se = TRUE, colour = "#E63946") +
  labs(title = "mpg vs. log(horsepower)",
       x = "log(Horsepower)", y = "Miles per Gallon")
mpg vs. log(horsepower) — the log transformation linearises the relationship

mpg vs. log(horsepower) — the log transformation linearises the relationship

The log(horsepower) transformation achieves the highest Adjusted R² and the most linear relationship with mpg, effectively correcting the non-linearity observed in the diagnostics. The squared term performs worst, introducing curvature that does not fit the data well.


Chapter 3 — Exercise 15: The Boston Dataset

Setup

data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Response variable: crim (per-capita crime rate)
# Predictors: all other variables

(a) Simple Linear Regression for Each Predictor

predictors <- setdiff(names(Boston), "crim")

# Fit a simple regression for each predictor; store coefficient and p-value
simple_results <- lapply(predictors, function(var) {
  formula <- as.formula(paste("crim ~", var))
  fit     <- lm(formula, data = Boston)
  s       <- summary(fit)
  data.frame(
    predictor  = var,
    coef       = coef(fit)[2],
    p_value    = s$coefficients[2, 4],
    adj_r2     = s$adj.r.squared
  )
})

simple_df <- do.call(rbind, simple_results)
rownames(simple_df) <- NULL

knitr::kable(simple_df[order(simple_df$p_value), ],
             digits  = 4,
             caption = "Simple Regression Results: crim ~ each predictor (sorted by p-value)")
Simple Regression Results: crim ~ each predictor (sorted by p-value)
predictor coef p_value adj_r2
8 rad 0.6179 0.0000 0.3900
9 tax 0.0297 0.0000 0.3383
11 lstat 0.5488 0.0000 0.2060
4 nox 31.2485 0.0000 0.1756
2 indus 0.5098 0.0000 0.1637
12 medv -0.3632 0.0000 0.1491
7 dis -1.5509 0.0000 0.1425
6 age 0.1078 0.0000 0.1227
10 ptratio 1.1520 0.0000 0.0823
5 rm -2.6841 0.0000 0.0462
1 zn -0.0739 0.0000 0.0383
3 chas -1.8928 0.2094 0.0011
Boston_long <- Boston %>%
  pivot_longer(cols = all_of(predictors),
               names_to  = "predictor",
               values_to = "value")

ggplot(Boston_long, aes(x = value, y = crim)) +
  geom_point(alpha = 0.2, size = 0.7, colour = "#457B9D") +
  geom_smooth(method = "lm", se = TRUE, colour = "#E63946", linewidth = 0.9) +
  facet_wrap(~predictor, scales = "free_x", ncol = 4) +
  labs(title = "Crime Rate vs. Each Predictor (Simple Linear Regression)",
       x = "Predictor Value", y = "Crime Rate (crim)")
crim vs. each predictor with regression line

crim vs. each predictor with regression line

Finding: Almost all predictors have a statistically significant simple relationship with crim. Exceptions are chas (Charles River dummy variable, p ≈ 0.21), which is not significant. The strongest associations (lowest p-value, highest Adj. R²) are with rad (accessibility to radial highways), tax (property tax rate), and lstat (lower-status population %).

(b) Multiple Linear Regression

lm_boston <- lm(crim ~ ., data = Boston)
summary(lm_boston)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

In the multiple regression context, only a subset of predictors retain significance at α = 0.05:

  • zn (residential zoning) — negative, significant
  • dis (distance to employment centres) — negative, significant
  • rad (highway accessibility) — positive, significant
  • black — negative, significant
  • medv (median home value) — negative, significant

Many predictors that were significant in simple regression (e.g., nox, age, tax) lose significance once the others are controlled for — a classic symptom of multicollinearity among urban socioeconomic variables.

(c) Comparing Univariate vs. Multiple Regression Coefficients

# Extract multiple regression coefficients (excluding intercept)
multi_coefs <- data.frame(
  predictor  = names(coef(lm_boston))[-1],
  multi_coef = coef(lm_boston)[-1]
)

# Merge with simple regression coefficients
coef_compare <- merge(
  simple_df[, c("predictor", "coef")],
  multi_coefs,
  by = "predictor"
)
colnames(coef_compare) <- c("predictor", "simple_coef", "multi_coef")

ggplot(coef_compare, aes(x = simple_coef, y = multi_coef, label = predictor)) +
  geom_point(colour = "#E63946", size = 3) +
  geom_text(nudge_y = 0.3, size = 3.5, colour = "grey30") +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
  labs(
    title = "Univariate vs. Multiple Regression Coefficients",
    subtitle = "Boston dataset — response: crim",
    x = "Simple Regression Coefficient",
    y = "Multiple Regression Coefficient"
  )
Univariate vs. multiple regression coefficients for each predictor

Univariate vs. multiple regression coefficients for each predictor

Interpretation: Several predictors switch sign between simple and multiple regression (notably nox — nitrogen oxide concentration), indicating confounding and multicollinearity. In simple regression, nox has a large positive coefficient (high pollution ↔︎ high crime), but once distance to employment centres (dis) and other urban factors are controlled for, the marginal effect changes. This illustrates why multiple regression is needed to isolate individual effects.

(d) Evidence of Non-Linear Association (Polynomial Regression)

# Fit cubic polynomial for each predictor; test if X^2 or X^3 terms are significant
poly_results <- lapply(predictors, function(var) {
  formula <- as.formula(paste("crim ~ poly(", var, ", 1)"))
  fit     <- lm(formula, data = Boston)
  s       <- summary(fit)
  pvals   <- s$coefficients[-1, 4]  # p-values for linear, quadratic, cubic
  data.frame(
    predictor  = var,
    p_linear   = pvals[1],
    p_quadratic = ifelse(length(pvals) >= 2, pvals[2], NA),
    p_cubic    = ifelse(length(pvals) >= 3, pvals[3], NA)
  )
})

poly_df <- do.call(rbind, poly_results)
rownames(poly_df) <- NULL

Finding: There is strong evidence of non-linear relationships for several predictors. medv, dis, lstat, nox, and age all have significant quadratic or cubic terms (p < 0.05), suggesting U-shaped or S-shaped relationships with crime rate. For example, the effect of distance to employment centres (dis) diminishes and reverses at very large distances, likely reflecting suburban areas with distinct crime profiles.


Chapter 4 — Exercise 13: Logistic Regression on Weekly

Setup

data(Weekly)
str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
dim(Weekly)
## [1] 1089    9
# Response: Direction (Up/Down); Predictors: Lag1-Lag5, Volume, Today

(a) Numerical and Graphical Summaries

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
p1 <- ggplot(Weekly, aes(x = seq_along(Volume), y = Volume)) +
  geom_line(colour = "#457B9D", linewidth = 0.5) +
  labs(title = "Weekly Trading Volume Over Time",
       x = "Observation Index (1990–2010)", y = "Volume")

p2 <- ggplot(Weekly, aes(x = Direction, y = Today, fill = Direction)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Down" = "#E63946", "Up" = "#2A9D8F")) +
  labs(title = "Weekly Return (Today) by Market Direction",
       x = "Direction", y = "Return (%)")

gridExtra::grid.arrange(p1, p2, ncol = 2)
Trading volume over time and return distribution by direction

Trading volume over time and return distribution by direction

Weekly_num <- Weekly %>% select(-Direction)
cor_weekly <- cor(Weekly_num)
knitr::kable(round(cor_weekly, 3),
             caption = "Correlation Matrix — Weekly Dataset")
Correlation Matrix — Weekly Dataset
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
Year 1.000 -0.032 -0.033 -0.030 -0.031 -0.031 0.842 -0.032
Lag1 -0.032 1.000 -0.075 0.059 -0.071 -0.008 -0.065 -0.075
Lag2 -0.033 -0.075 1.000 -0.076 0.058 -0.072 -0.086 0.059
Lag3 -0.030 0.059 -0.076 1.000 -0.075 0.061 -0.069 -0.071
Lag4 -0.031 -0.071 0.058 -0.075 1.000 -0.076 -0.061 -0.008
Lag5 -0.031 -0.008 -0.072 0.061 -0.076 1.000 -0.059 0.011
Volume 0.842 -0.065 -0.086 -0.069 -0.061 -0.059 1.000 -0.033
Today -0.032 -0.075 0.059 -0.071 -0.008 0.011 -0.033 1.000

Patterns observed:

  • Volume increases substantially over the 21-year period, reflecting the growth of equity markets.
  • The lag variables (Lag1–Lag5) have near-zero correlations with each other and with Today, indicating that past returns have little linear predictive power for current returns — consistent with the weak-form efficient market hypothesis.
  • The strongest correlation is between Volume and Year (r ≈ 0.84), purely reflecting the upward trend in trading activity.

(b) Logistic Regression — Full Model

logit_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                  data   = Weekly,
                  family = binomial)
summary(logit_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Are any predictors statistically significant? Only Lag2 is statistically significant (p ≈ 0.030) at the α = 0.05 level. Its positive coefficient indicates that a positive return two weeks prior is associated with a higher probability of an “Up” week. All other lag variables and Volume are not significant, confirming that the market is largely unpredictable from recent returns alone.

(c) Confusion Matrix — Full Training Set

# Predicted probabilities and class labels (threshold = 0.5)
probs_full <- predict(logit_full, type = "response")
preds_full <- ifelse(probs_full > 0.5, "Up", "Down")

# Confusion matrix
cm_full <- table(Predicted = preds_full, Actual = Weekly$Direction)
cm_full
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
# Accuracy
acc_full <- mean(preds_full == Weekly$Direction)
cat("\nOverall Fraction of Correct Predictions (Accuracy):", round(acc_full, 4), "\n")
## 
## Overall Fraction of Correct Predictions (Accuracy): 0.5611
# False Positive and False Negative rates
TN <- cm_full["Down", "Down"]
FP <- cm_full["Up",   "Down"]
FN <- cm_full["Down", "Up"]
TP <- cm_full["Up",   "Up"]

FPR <- FP / (FP + TN)
FNR <- FN / (FN + TP)

cat("False Positive Rate (predicting Up when actually Down):", round(FPR, 4), "\n")
## False Positive Rate (predicting Up when actually Down): 0.8884
cat("False Negative Rate (predicting Down when actually Up):", round(FNR, 4), "\n")
## False Negative Rate (predicting Down when actually Up): 0.0793

Interpretation:

  • Accuracy: 0.5611 — only marginally better than the ~56% “Up” base rate in the data. The model is not genuinely predictive.
  • False Positive Rate (0.8884): The model incorrectly predicts “Up” for 88.8% of actual “Down” weeks — it over-predicts upward moves.
  • False Negative Rate (0.0793): Only 7.9% of actual “Up” weeks are missed. The model is biased toward predicting “Up” (consistent with the majority class), which inflates sensitivity but hurts specificity.

The confusion matrix shows the model nearly always predicts “Up”, essentially learning the marginal class distribution rather than the true decision boundary.

(d) Logistic Regression — Train 1990–2008, Test 2009–2010, Predictor: Lag2

# Split data
train <- Weekly %>% filter(Year <= 2008)
test  <- Weekly %>% filter(Year  > 2008)

cat("Training observations:", nrow(train), "\n")
## Training observations: 985
cat("Test observations:    ", nrow(test),  "\n")
## Test observations:     104
# Fit model on training data with Lag2 only
logit_lag2 <- glm(Direction ~ Lag2,
                  data   = train,
                  family = binomial)
summary(logit_lag2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
# Predict on held-out test set
probs_test <- predict(logit_lag2, newdata = test, type = "response")
preds_test <- ifelse(probs_test > 0.5, "Up", "Down")

# Confusion matrix
cm_test <- table(Predicted = preds_test, Actual = test$Direction)
cm_test
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
# Accuracy on test set
acc_test <- mean(preds_test == test$Direction)
cat("\nTest Set Accuracy (2009–2010):", round(acc_test, 4), "\n")
## 
## Test Set Accuracy (2009–2010): 0.625
# Test set FPR and FNR
TN_t <- cm_test["Down", "Down"]
FP_t <- cm_test["Up",   "Down"]
FN_t <- cm_test["Down", "Up"]
TP_t <- cm_test["Up",   "Up"]

FPR_t <- FP_t / (FP_t + TN_t)
FNR_t <- FN_t / (FN_t + TP_t)

cat("Test False Positive Rate:", round(FPR_t, 4), "\n")
## Test False Positive Rate: 0.7907
cat("Test False Negative Rate:", round(FNR_t, 4), "\n")
## Test False Negative Rate: 0.082
test_plot <- test %>%
  mutate(prob_up = probs_test,
         predicted = preds_test)

ggplot(test_plot, aes(x = seq_along(prob_up), y = prob_up,
                      colour = Direction, shape = (Direction == predicted))) +
  geom_point(size = 2.5, alpha = 0.8) +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = "grey40") +
  scale_colour_manual(values = c("Down" = "#E63946", "Up" = "#2A9D8F")) +
  scale_shape_manual(values = c("TRUE" = 16, "FALSE" = 4),
                     labels = c("Correct", "Incorrect"),
                     name   = "Prediction") +
  labs(title   = "Predicted P(Up) on Test Set (2009–2010)",
       subtitle = "Model: Direction ~ Lag2, trained on 1990–2008",
       x = "Test Observation Index", y = "P(Direction = Up)",
       colour = "Actual Direction")
Predicted probabilities on the 2009-2010 test set by actual direction

Predicted probabilities on the 2009-2010 test set by actual direction

Summary of Test Performance:

  • Test Accuracy: 0.625 — the Lag2-only model achieves approximately 62.5% accuracy on the held-out 2009–2010 data, outperforming the full model on new data. This suggests that the Lag2 signal, while modest, has some genuine out-of-sample predictive content.
  • The model correctly identifies upward weeks more reliably than downward weeks, maintaining the upward bias, but the FNR drops compared to the full model.
  • Overall, the results are consistent with weak market predictability: Lag2 provides a small but detectable signal, not enough for reliable trading but statistically non-trivial.

End of Midterm Homework — Application of Financial Software