Chapter 2 — Question 9: The Auto Dataset

Overview

This exercise explores the Auto dataset from the ISLR2 package. The dataset contains information about 392 vehicles and is used to practice exploratory data analysis including descriptive statistics and graphical investigation.

data(Auto)
Auto <- na.omit(Auto)
cat("Dimensions after removing missing values:", dim(Auto), "\n")
## Dimensions after removing missing values: 392 9

(a) Quantitative vs. Qualitative Predictors

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Quantitative predictors (continuous/discrete numeric): mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative predictors (categorical): origin (1 = American, 2 = European, 3 = Japanese), name (character identifier)

Although cylinders and origin are stored as integers, they represent categorical groupings and should be treated accordingly in modeling contexts.


(b) Range of Each Quantitative Predictor

quant_vars <- Auto[, c("mpg","cylinders","displacement","horsepower",
                        "weight","acceleration","year")]
range_df <- as.data.frame(t(sapply(quant_vars, range)))
colnames(range_df) <- c("Min", "Max")
range_df$Range <- range_df$Max - range_df$Min
knitr::kable(range_df, caption = "Range of Quantitative Predictors")
Range of Quantitative Predictors
Min Max Range
mpg 9 46.6 37.6
cylinders 3 8.0 5.0
displacement 68 455.0 387.0
horsepower 46 230.0 184.0
weight 1613 5140.0 3527.0
acceleration 8 24.8 16.8
year 70 82.0 12.0

weight and displacement show the widest absolute ranges, suggesting high variability across vehicle types.


(c) Mean and Standard Deviation

summary_df <- data.frame(
  Mean = sapply(quant_vars, mean),
  SD   = sapply(quant_vars, sd)
)
knitr::kable(round(summary_df, 3), caption = "Mean and Standard Deviation of Quantitative Predictors")
Mean and Standard Deviation of Quantitative Predictors
Mean SD
mpg 23.446 7.805
cylinders 5.472 1.706
displacement 194.412 104.644
horsepower 104.469 38.491
weight 2977.584 849.403
acceleration 15.541 2.759
year 75.980 3.684

The high standard deviation of displacement (104.6) and weight (849.4) indicates substantial spread around the mean, consistent with their wide ranges. In contrast, acceleration has a relatively low SD, suggesting more homogeneity across vehicles.


(d) Subset Analysis: Removing Observations 10–85

Auto_sub  <- Auto[-(10:85), ]
quant_sub <- Auto_sub[, c("mpg","cylinders","displacement","horsepower",
                           "weight","acceleration","year")]

range_sub <- as.data.frame(t(sapply(quant_sub, range)))
colnames(range_sub) <- c("Min", "Max")

stats_sub <- data.frame(
  Mean = sapply(quant_sub, mean),
  SD   = sapply(quant_sub, sd)
)

knitr::kable(round(cbind(range_sub, stats_sub), 3),
             caption = "Summary Statistics After Removing Observations 10–85")
Summary Statistics After Removing Observations 10–85
Min Max Mean SD
mpg 11.0 46.6 24.404 7.867
cylinders 3.0 8.0 5.373 1.654
displacement 68.0 455.0 187.241 99.678
horsepower 46.0 230.0 100.722 35.709
weight 1649.0 4997.0 2935.972 811.300
acceleration 8.5 24.8 15.727 2.694
year 70.0 82.0 77.146 3.106

Removing 76 observations (roughly 19% of the data) from the middle of the dataset noticeably shifts means and ranges for several predictors, particularly horsepower and cylinders. This reflects the fact that those observations were not a random sample — removing a contiguous block can induce selection bias in descriptive statistics.


(e) Graphical Investigation of Predictors

pairs(quant_vars,
      main        = "Scatterplot Matrix — Auto Dataset",
      col         = "steelblue",
      pch         = 16,
      cex         = 0.5,
      upper.panel = panel.smooth)

Key observations:

  • displacement, horsepower, and weight are strongly positively correlated with one another — vehicles with larger engines tend to be heavier.
  • All three are strongly negatively correlated with mpg, suggesting a clear fuel efficiency penalty for heavier, more powerful vehicles.
  • year shows a mild positive association with mpg, indicating gradual improvements in fuel economy over time.
  • acceleration shows a weaker and more diffuse relationship with the other predictors.

(f) Predicting Gas Mileage (mpg)

par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
pred_vars <- c("cylinders","displacement","horsepower","weight","acceleration","year")

for (var in pred_vars) {
  plot(Auto[[var]], Auto$mpg,
       xlab  = var, ylab = "mpg",
       main  = paste("mpg vs.", var),
       col   = adjustcolor("steelblue", alpha.f = 0.5),
       pch   = 16, cex = 0.8)
  abline(lm(Auto$mpg ~ Auto[[var]]), col = "red", lwd = 2)
}

Conclusion: weight, displacement, horsepower, and cylinders all exhibit strong negative linear relationships with mpg and are likely the most useful predictors in a regression model. year shows a positive trend reflecting technological improvements. acceleration has a weaker and less consistent relationship.


Chapter 3 — Question 9: Multiple Linear Regression on Auto

Overview

This section fits multiple linear regression models using mpg as the response variable and investigates predictor significance, diagnostics, interaction effects, and variable transformations.


(a) Scatterplot Matrix

pairs(Auto[, -9],
      main        = "Scatterplot Matrix — All Auto Variables",
      col         = "darkslateblue",
      pch         = 16,
      cex         = 0.4,
      upper.panel = panel.smooth)


(b) Correlation Matrix

cor_mat <- cor(Auto[, -9])
knitr::kable(round(cor_mat, 3), caption = "Correlation Matrix (excluding 'name')")
Correlation Matrix (excluding ‘name’)
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423 0.581 0.565
cylinders -0.778 1.000 0.951 0.843 0.898 -0.505 -0.346 -0.569
displacement -0.805 0.951 1.000 0.897 0.933 -0.544 -0.370 -0.615
horsepower -0.778 0.843 0.897 1.000 0.865 -0.689 -0.416 -0.455
weight -0.832 0.898 0.933 0.865 1.000 -0.417 -0.309 -0.585
acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000 0.290 0.213
year 0.581 -0.346 -0.370 -0.416 -0.309 0.290 1.000 0.182
origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213 0.182 1.000

The highest correlations with mpg are seen for weight (−0.832), displacement (−0.805), and cylinders (−0.778). These strong negative correlations confirm that heavier, larger-engine vehicles are significantly less fuel efficient. year has a moderate positive correlation (0.581), confirming the fuel economy improvement trend over time.


(c) Multiple Linear Regression

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between predictors and response? Yes. The model F-statistic is highly significant (p < 2.2e-16), providing strong evidence of a collective linear relationship between the predictors and mpg.

ii. Which predictors are statistically significant? At the 5% significance level: displacement, weight, year, and origin all have p-values below 0.05. cylinders, horsepower, and acceleration are not individually significant when controlling for others — likely due to multicollinearity with displacement and weight.

iii. Interpretation of the year coefficient: The coefficient for year is approximately +0.751, meaning that holding all other predictors constant, fuel efficiency increases by about 0.75 mpg per model year. This reflects systematic improvements in automotive engineering during the 1970s and 1980s.


(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_fit)

Interpretation:

  • Residuals vs. Fitted: A curved pattern suggests mild non-linearity not captured by the linear model. A polynomial or log transformation may improve fit.
  • Normal Q-Q: Residuals are approximately normal, with slight deviation in the tails — acceptable for a dataset of this size.
  • Scale-Location: Some heteroscedasticity is present; variance appears to increase slightly at higher fitted values.
  • Residuals vs. Leverage: Observation 14 stands out with unusually high leverage. It may be influencing the regression coefficients and warrants closer inspection.

(e) Interaction Effects

lm_interact <- lm(mpg ~ displacement * weight + acceleration * year +
                        horsepower * weight, data = Auto)
summary(lm_interact)
## 
## Call:
## lm(formula = mpg ~ displacement * weight + acceleration * year + 
##     horsepower * weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6205 -1.5878  0.0325  1.3787 11.8148 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.144e+02  1.817e+01   6.296 8.38e-10 ***
## displacement        -3.801e-02  1.607e-02  -2.365 0.018549 *  
## weight              -1.073e-02  6.881e-04 -15.589  < 2e-16 ***
## acceleration        -7.104e+00  1.118e+00  -6.353 6.00e-10 ***
## year                -6.792e-01  2.340e-01  -2.902 0.003917 ** 
## horsepower          -1.620e-01  4.459e-02  -3.633 0.000318 ***
## displacement:weight  1.029e-05  4.968e-06   2.071 0.039049 *  
## acceleration:year    9.203e-02  1.450e-02   6.345 6.28e-10 ***
## weight:horsepower    3.414e-05  1.207e-05   2.829 0.004910 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.812 on 383 degrees of freedom
## Multiple R-squared:  0.8729, Adjusted R-squared:  0.8702 
## F-statistic: 328.7 on 8 and 383 DF,  p-value: < 2.2e-16

The interaction between acceleration:year and displacement:weight are statistically significant, suggesting that the effect of engine size on fuel economy is moderated by vehicle weight, and that the effect of acceleration has changed over time. Including interactions improves model fit but reduces interpretability.


(f) Variable Transformations

# Fit models with different transformations of horsepower (within full model)
trans_models <- list(
  "Linear"      = lm(mpg ~ . - name,                                   data = Auto),
  "Log"         = lm(mpg ~ . - name - horsepower + log(horsepower),    data = Auto),
  "Square Root" = lm(mpg ~ . - name - horsepower + sqrt(horsepower),   data = Auto),
  "Quadratic"   = lm(mpg ~ . - name - horsepower + I(horsepower^2),    data = Auto)
)

# Extract fit statistics
trans_stats <- do.call(rbind, lapply(names(trans_models), function(name) {
  s <- summary(trans_models[[name]])
  data.frame(
    Transformation = name,
    Adj_R_Squared  = round(s$adj.r.squared, 4),
    RSE            = round(s$sigma, 4)
  )
}))

knitr::kable(trans_stats, row.names = FALSE,
             caption = "Model Fit Comparison Across Horsepower Transformations (Full Model)")
Model Fit Comparison Across Horsepower Transformations (Full Model)
Transformation Adj_R_Squared RSE
Linear 0.8182 3.3277
Log 0.8340 3.1801
Square Root 0.8237 3.2772
Quadratic 0.8193 3.3179
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

smooth_line <- function(model, xvar, transform_fn, col = "red") {
  x_seq <- seq(min(xvar), max(xvar), length.out = 300)
  newdata <- data.frame(
    cylinders    = mean(Auto$cylinders),
    displacement = mean(Auto$displacement),
    horsepower   = x_seq,
    weight       = mean(Auto$weight),
    acceleration = mean(Auto$acceleration),
    year         = mean(Auto$year),
    origin       = mean(Auto$origin),
    name         = Auto$name[1]          # dummy value to satisfy the model
  )
  y_hat <- predict(model, newdata = newdata)
  lines(transform_fn(x_seq), y_hat, col = col, lwd = 2)
}

# --- Plot 1: Linear ---
plot(Auto$horsepower, Auto$mpg,
     main = paste0("Linear (horsepower)  |  Adj R² = ",
                   round(summary(trans_models$Linear)$adj.r.squared, 3)),
     xlab = "horsepower", ylab = "mpg",
     col  = adjustcolor("steelblue", 0.5), pch = 16)

# --- Plot 2: Log ---
plot(log(Auto$horsepower), Auto$mpg,
     main = paste0("log(horsepower)  |  Adj R² = ",
                   round(summary(trans_models$Log)$adj.r.squared, 3)),
     xlab = "log(horsepower)", ylab = "mpg",
     col  = adjustcolor("steelblue", 0.5), pch = 16)

# --- Plot 3: Square Root ---
plot(sqrt(Auto$horsepower), Auto$mpg,
     main = paste0("sqrt(horsepower)  |  Adj R² = ",
                   round(summary(trans_models$`Square Root`)$adj.r.squared, 3)),
     xlab = "sqrt(horsepower)", ylab = "mpg",
     col  = adjustcolor("darkgreen", 0.5), pch = 16)

# --- Plot 4: Quadratic ---
plot(Auto$horsepower^2, Auto$mpg,
     main = paste0("horsepower²  |  Adj R² = ",
                   round(summary(trans_models$Quadratic)$adj.r.squared, 3)),
     xlab = "horsepower²", ylab = "mpg",
     col  = adjustcolor("tomato", 0.5), pch = 16)


Chapter 3 — Question 15: Crime Rate Prediction with Boston

Overview

This section uses the Boston dataset to predict per capita crime rate (crim) using simple and multiple linear regression, and explores non-linear associations via polynomial regression.

data(Boston)
predictors <- setdiff(names(Boston), "crim")

(a) Simple Linear Regression for Each Predictor

results <- lapply(predictors, function(var) {
  fit <- lm(Boston$crim ~ Boston[[var]])
  s   <- summary(fit)
  data.frame(
    Predictor   = var,
    Coefficient = round(coef(fit)[2], 4),
    R_Squared   = round(s$r.squared, 4),
    P_value     = signif(s$coefficients[2, 4], 4)
  )
})
results_df <- do.call(rbind, results)
results_df$Significant <- ifelse(results_df$P_value < 0.05, "Yes", "No")
knitr::kable(results_df, row.names = FALSE,
             caption = "Simple Linear Regression Results: crim vs. Each Predictor")
Simple Linear Regression Results: crim vs. Each Predictor
Predictor Coefficient R_Squared P_value Significant
zn -0.0739 0.0402 0.0000055 Yes
indus 0.5098 0.1653 0.0000000 Yes
chas -1.8928 0.0031 0.2094000 No
nox 31.2485 0.1772 0.0000000 Yes
rm -2.6841 0.0481 0.0000006 Yes
age 0.1078 0.1244 0.0000000 Yes
dis -1.5509 0.1441 0.0000000 Yes
rad 0.6179 0.3913 0.0000000 Yes
tax 0.0297 0.3396 0.0000000 Yes
ptratio 1.1520 0.0841 0.0000000 Yes
lstat 0.5488 0.2076 0.0000000 Yes
medv -0.3632 0.1508 0.0000000 Yes

Almost all predictors show a statistically significant association with crim. Notably, rad (accessibility to radial highways) and tax (property tax rate) have the highest R² values, suggesting that areas with better highway access and higher tax burdens tend to have higher crime rates — possibly reflecting urban density effects.

sig_preds <- results_df$Predictor[results_df$Significant == "Yes"]
par(mfrow = c(3, 5), mar = c(3, 3, 2, 1))
for (var in sig_preds) {
  plot(Boston[[var]], Boston$crim,
       xlab = var, ylab = "crim",
       pch  = 16, col = adjustcolor("steelblue", 0.4), cex = 0.6,
       main = paste("crim ~", var))
  abline(lm(Boston$crim ~ Boston[[var]]), col = "red", lwd = 2)
}


(b) Multiple Linear Regression

lm_boston <- lm(crim ~ ., data = Boston)
summary(lm_boston)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

In the multiple regression context, most individually-significant predictors lose their significance — a classic sign of multicollinearity. Only zn, dis, rad, black, and medv retain significance (p < 0.05). This suggests that many simple associations in (a) were confounded by correlated predictors.


(c) Univariate vs. Multiple Regression Coefficients

uni_coefs   <- setNames(results_df$Coefficient, results_df$Predictor)
multi_coefs <- coef(lm_boston)[-1]
common      <- intersect(names(uni_coefs), names(multi_coefs))

plot(uni_coefs[common], multi_coefs[common],
     xlab = "Simple Regression Coefficients",
     ylab = "Multiple Regression Coefficients",
     main = "Univariate vs. Multiple Regression Coefficients",
     pch  = 19, col = "steelblue", cex = 1.2)
abline(0, 1, col = "red", lty = 2, lwd = 1.5)
text(uni_coefs[common], multi_coefs[common],
     labels = common, cex = 0.65, pos = 3)

The coefficients differ substantially between the two model types, particularly for nox and dis. The divergence from the diagonal confirms that simple regression coefficients are biased when predictors are correlated — the multiple regression model controls for confounders and provides more reliable estimates.


(d) Non-linear Associations (Cubic Polynomial)

poly_results <- lapply(predictors, function(var) {
  if (length(unique(Boston[[var]])) < 4) return(NULL)
  fit <- lm(Boston$crim ~ poly(Boston[[var]], 3))
  s   <- summary(fit)
  pvals <- s$coefficients[-1, 4]
  data.frame(
    Predictor   = var,
    P_quadratic = signif(pvals[2], 4),
    P_cubic     = signif(pvals[3], 4)
  )
})
poly_df <- do.call(rbind, poly_results)
poly_df$NonLinear <- ifelse(poly_df$P_quadratic < 0.05 | poly_df$P_cubic < 0.05, "Yes", "No")
knitr::kable(poly_df, row.names = FALSE,
             caption = "Polynomial Regression: Evidence of Non-linearity")
Polynomial Regression: Evidence of Non-linearity
Predictor P_quadratic P_cubic NonLinear
zn 0.0044210 0.229500 Yes
indus 0.0010860 0.000000 Yes
nox 0.0000774 0.000000 Yes
rm 0.0015090 0.508600 Yes
age 0.0000023 0.006680 Yes
dis 0.0000000 0.000000 Yes
rad 0.0091210 0.482300 Yes
tax 0.0000037 0.243900 Yes
ptratio 0.0024050 0.006301 Yes
lstat 0.0378000 0.129900 Yes
medv 0.0000000 0.000000 Yes

Several predictors — including indus, nox, age, dis, and medv — show statistically significant quadratic or cubic terms, indicating non-linear relationships with crime rate. This suggests that linear models may be insufficient and that more flexible methods (e.g., splines or GAMs) could yield better predictive performance.


Chapter 4 — Question 13: The Weekly Dataset

Overview

This section uses the Weekly stock market dataset to practice logistic regression for binary classification. The response variable is Direction (Up/Down), and we evaluate model performance using confusion matrices.

data(Weekly)
cat("Dataset dimensions:", dim(Weekly), "\n")
## Dataset dimensions: 1089 9
cat("Class distribution:\n")
## Class distribution:
print(table(Weekly$Direction))
## 
## Down   Up 
##  484  605

(a) Numerical and Graphical Summaries

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly[, -9],
      col  = ifelse(Weekly$Direction == "Up", "steelblue", "tomato"),
      pch  = 16, cex = 0.4,
      main = "Weekly Data — Blue = Up, Red = Down")

plot(Weekly$Year, Weekly$Volume,
     type = "l", col = "steelblue", lwd = 1.5,
     xlab = "Year", ylab = "Volume",
     main = "Trading Volume Over Time (1990–2010)")

Key patterns: Trading volume increased substantially from 1990 to the early 2000s before declining slightly. The lag variables show no obvious visual separation between Up and Down weeks, suggesting that predicting market direction from recent returns is inherently difficult — consistent with the efficient market hypothesis.


(b) Logistic Regression — Full Dataset

glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data   = Weekly,
                family = binomial)
summary(glm_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Only Lag2 is statistically significant (p ≈ 0.030). All other predictors, including Volume and the remaining lag terms, do not contribute meaningfully when controlling for each other. The positive coefficient on Lag2 suggests that a positive return two weeks prior slightly increases the probability of an Up week.


(c) Confusion Matrix — Full Data Training

prob_full <- predict(glm_full, type = "response")
pred_full <- ifelse(prob_full > 0.5, "Up", "Down")

conf_full <- table(Predicted = pred_full, Actual = Weekly$Direction)
knitr::kable(conf_full, caption = "Confusion Matrix — Full Data Logistic Regression")
Confusion Matrix — Full Data Logistic Regression
Down Up
Down 54 48
Up 430 557
accuracy_full <- mean(pred_full == Weekly$Direction)
cat("Overall Accuracy:", round(accuracy_full, 4), "\n")
## Overall Accuracy: 0.5611

The model achieves an overall accuracy of ~56.1%, only marginally better than a naive classifier that always predicts “Up” (which would achieve ~55.6%). The confusion matrix reveals severe asymmetry: the model correctly identifies most Up weeks (true positive rate ~92%) but performs poorly on Down weeks (true positive rate ~11%). This reflects the class imbalance in the dataset and the model’s tendency to predict the majority class.


(d) Logistic Regression — Train 1990–2008, Test 2009–2010

train    <- Weekly$Year <= 2008
test_set <- Weekly[!train, ]

glm_lag2 <- glm(Direction ~ Lag2,
                data   = Weekly,
                family = binomial,
                subset = train)

prob_test <- predict(glm_lag2, newdata = test_set, type = "response")
pred_test <- ifelse(prob_test > 0.5, "Up", "Down")

conf_test <- table(Predicted = pred_test, Actual = test_set$Direction)
knitr::kable(conf_test, caption = "Confusion Matrix — Hold-Out Test Set (2009–2010)")
Confusion Matrix — Hold-Out Test Set (2009–2010)
Down Up
Down 9 5
Up 34 56
accuracy_test <- mean(pred_test == test_set$Direction)
cat("Test Set Accuracy:", round(accuracy_test, 4), "\n")
## Test Set Accuracy: 0.625

The held-out test accuracy is approximately 62.5%, an improvement over the full-data model. Using only Lag2 as a predictor reduces overfitting and generalizes better to unseen data. The model still tends to over-predict “Up” weeks, but achieves a better balance on the 2009–2010 test period. This exercise illustrates the importance of proper train/test splitting for honest model evaluation.