library(ISLR2)
library(ggplot2)
library(dplyr)
library(GGally)

Chapter 2, Question 9

Data Preparation

df <- Auto
df <- df[complete.cases(df), ]
dim(df)
## [1] 392   9

Part (a) — Quantitative and Qualitative Predictors

Looking at the structure of the dataset:

str(df)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Based on the variable types and descriptions:

  • Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
  • Qualitative: origin (categorical: 1 = American, 2 = European, 3 = Japanese), name (car model name)

Although cylinders and origin are stored as integers, they represent discrete categories rather than continuous measurements, so they can also be treated as qualitative depending on context.


Part (b) — Range of Quantitative Predictors

num_cols <- c("mpg", "cylinders", "displacement",
              "horsepower", "weight", "acceleration", "year")

range_table <- t(sapply(df[num_cols], range))
colnames(range_table) <- c("Min", "Max")
range_table
##               Min    Max
## mpg             9   46.6
## cylinders       3    8.0
## displacement   68  455.0
## horsepower     46  230.0
## weight       1613 5140.0
## acceleration    8   24.8
## year           70   82.0

Part (c) — Mean and Standard Deviation

stats_table <- t(sapply(df[num_cols], function(x) c(Mean = mean(x), SD = sd(x))))
round(stats_table, 2)
##                 Mean     SD
## mpg            23.45   7.81
## cylinders       5.47   1.71
## displacement  194.41 104.64
## horsepower    104.47  38.49
## weight       2977.58 849.40
## acceleration   15.54   2.76
## year           75.98   3.68

Part (d) — Remove Observations 10 to 85

df_trimmed <- df[-c(10:85), ]
cat("Remaining rows:", nrow(df_trimmed), "\n")
## Remaining rows: 316
trimmed_stats <- t(sapply(df_trimmed[num_cols], function(x) {
  c(Min  = min(x),
    Max  = max(x),
    Mean = mean(x),
    SD   = sd(x))
}))

round(trimmed_stats, 2)
##                 Min    Max    Mean     SD
## mpg            11.0   46.6   24.40   7.87
## cylinders       3.0    8.0    5.37   1.65
## displacement   68.0  455.0  187.24  99.68
## horsepower     46.0  230.0  100.72  35.71
## weight       1649.0 4997.0 2935.97 811.30
## acceleration    8.5   24.8   15.73   2.69
## year           70.0   82.0   77.15   3.11

Removing observations 10–85 noticeably shifts the mean and range of several variables. For example, horsepower and weight both show slightly different spreads, which makes sense since we removed a chunk of data from the middle of the dataset.


Part (e) — Graphical Exploration

ggpairs(df[num_cols],
        lower = list(continuous = wrap("points", alpha = 0.3,
                                       size = 0.8, color = "#3b6fa0")),
        diag  = list(continuous = wrap("densityDiag", fill = "#3b6fa0",
                                       alpha = 0.4)),
        upper = list(continuous = wrap("cor", size = 3.5))) +
  theme_bw(base_size = 9) +
  labs(title = "Pairwise Relationships in the Auto Dataset")

Observations:

  • displacement, horsepower, and weight are very strongly correlated with each other — these three essentially measure “how big/powerful the car is”.
  • mpg has a clear negative relationship with the three variables above.
  • year is weakly positively correlated with mpg, which makes sense — cars got more fuel-efficient over time.
  • acceleration has a weaker relationship with the other variables compared to the rest.

Part (f) — Predicting MPG

other_vars <- setdiff(num_cols, "mpg")

par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
for (v in other_vars) {
  plot(df[[v]], df$mpg,
       xlab = v,
       ylab = "mpg",
       main = paste("mpg vs.", v),
       pch  = 16,
       col  = adjustcolor("#e05c5c", alpha.f = 0.5),
       cex  = 0.7)
  abline(lm(df$mpg ~ df[[v]]), col = "navy", lwd = 2)
}

par(mfrow = c(1, 1))

Verdict: The plots suggest that weight, displacement, and horsepower are likely the strongest predictors of mpg given their clearly defined negative linear trends. year also looks useful. cylinders and acceleration are less clean but still show a pattern.


Chapter 3, Question 9

Part (a) — Scatterplot Matrix

pairs(Auto[, num_cols],
      main = "Auto Dataset — Scatterplot Matrix",
      col  = adjustcolor("steelblue", alpha.f = 0.4),
      pch  = 16,
      cex  = 0.5)


Part (b) — Correlation Matrix (excluding name)

auto_numeric <- select(df, -name)
cor_mat <- round(cor(auto_numeric), 3)
cor_mat
##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
## origin        0.565    -0.569       -0.615     -0.455 -0.585        0.213
##                year origin
## mpg           0.581  0.565
## cylinders    -0.346 -0.569
## displacement -0.370 -0.615
## horsepower   -0.416 -0.455
## weight       -0.309 -0.585
## acceleration  0.290  0.213
## year          1.000  0.182
## origin        0.182  1.000

Part (c) — Multiple Linear Regression

mlr_model <- lm(mpg ~ . - name, data = df)
summary(mlr_model)
## 
## Call:
## lm(formula = mpg ~ . - name, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response? Yes. The F-statistic is 252.4 on 7 and 384 degrees of freedom with a p-value essentially equal to zero. We have very strong evidence that at least one predictor is related to mpg.

ii. Which predictors are statistically significant? At the 5% significance level: displacement, weight, year, and origin all have p-values below 0.05. horsepower is marginally significant.

iii. What does the year coefficient tell us? The coefficient for year is approximately 0.751. This means that, all else being equal, each additional year is associated with an increase of about 0.75 miles per gallon. This aligns with the historical trend of improving fuel efficiency regulations and engine technology throughout the 1970s and 80s.


Part (d) — Diagnostic Plots

par(mfrow = c(2, 2))
plot(mlr_model, col = "steelblue", pch = 16)

par(mfrow = c(1, 1))

Comments on the fit:

  • The Residuals vs Fitted plot shows a mild U-shaped curve, hinting that a purely linear fit may be missing some curvature in the data.
  • The Normal Q-Q plot looks fairly reasonable but tails deviate slightly, suggesting some non-normality in the residuals.
  • The Scale-Location plot shows some fanning, which may indicate mild heteroscedasticity.
  • The Residuals vs Leverage plot identifies observation 14 as having notably high leverage, though it doesn’t appear to be a dramatic outlier in terms of residual size.

Part (e) — Interaction Effects

model_interact <- lm(mpg ~ (displacement + horsepower + weight + year) * origin +
                       acceleration,
                     data = df)
summary(model_interact)
## 
## Call:
## lm(formula = mpg ~ (displacement + horsepower + weight + year) * 
##     origin + acceleration, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4246 -1.8548 -0.2182  1.4742 12.7087 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -9.8868171  8.8026283  -1.123  0.26207    
## displacement        -0.0405702  0.0167041  -2.429  0.01561 *  
## horsepower           0.1111384  0.0238542   4.659 4.40e-06 ***
## weight              -0.0047013  0.0014379  -3.270  0.00118 ** 
## year                 0.5756781  0.1096360   5.251 2.52e-07 ***
## origin               0.6428830  4.8883810   0.132  0.89544    
## acceleration        -0.1125641  0.0975292  -1.154  0.24916    
## displacement:origin  0.0273671  0.0136668   2.002  0.04594 *  
## horsepower:origin   -0.1011971  0.0176915  -5.720 2.16e-08 ***
## weight:origin       -0.0001705  0.0010406  -0.164  0.86993    
## year:origin          0.0798303  0.0611443   1.306  0.19247    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.096 on 381 degrees of freedom
## Multiple R-squared:  0.8467, Adjusted R-squared:  0.8427 
## F-statistic: 210.4 on 10 and 381 DF,  p-value: < 2.2e-16
# Simpler targeted interaction
model_interact2 <- lm(mpg ~ weight * year + displacement:horsepower, data = df)
summary(model_interact2)
## 
## Call:
## lm(formula = mpg ~ weight * year + displacement:horsepower, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1341 -1.9483 -0.0754  1.6289 12.9939 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.082e+02  1.381e+01  -7.835 4.56e-14 ***
## weight                   2.649e-02  4.960e-03   5.342 1.57e-07 ***
## year                     2.015e+00  1.805e-01  11.167  < 2e-16 ***
## weight:year             -4.462e-04  6.424e-05  -6.946 1.60e-11 ***
## displacement:horsepower  8.342e-06  1.788e-05   0.467    0.641    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.196 on 387 degrees of freedom
## Multiple R-squared:  0.834,  Adjusted R-squared:  0.8323 
## F-statistic: 486.1 on 4 and 387 DF,  p-value: < 2.2e-16

Finding: The interaction between weight and year is statistically significant. This makes intuitive sense — the benefit of being a newer model (in terms of mpg) may depend on the weight of the car; lighter newer cars gained more efficiency over time than heavier ones.


Part (f) — Variable Transformations

par(mfrow = c(1, 3))

plot(log(df$weight), df$mpg, pch = 16, cex = 0.6,
     col = adjustcolor("darkorange", 0.6),
     xlab = "log(weight)", ylab = "mpg", main = "mpg ~ log(weight)")
abline(lm(mpg ~ log(weight), data = df), col = "black", lwd = 2)

plot(sqrt(df$displacement), df$mpg, pch = 16, cex = 0.6,
     col = adjustcolor("seagreen", 0.6),
     xlab = "sqrt(displacement)", ylab = "mpg", main = "mpg ~ sqrt(displacement)")
abline(lm(mpg ~ sqrt(displacement), data = df), col = "black", lwd = 2)

plot(df$horsepower^2, df$mpg, pch = 16, cex = 0.6,
     col = adjustcolor("mediumpurple", 0.6),
     xlab = "horsepower^2", ylab = "mpg", main = "mpg ~ horsepower^2")
abline(lm(mpg ~ I(horsepower^2), data = df), col = "black", lwd = 2)

par(mfrow = c(1, 1))
m1 <- lm(mpg ~ log(weight),         data = df)
m2 <- lm(mpg ~ sqrt(displacement),  data = df)
m3 <- lm(mpg ~ I(horsepower^2),     data = df)

data.frame(
  Transformation = c("log(weight)", "sqrt(displacement)", "horsepower^2"),
  R_squared      = round(c(summary(m1)$r.squared,
                            summary(m2)$r.squared,
                            summary(m3)$r.squared), 4)
)
##       Transformation R_squared
## 1        log(weight)    0.7127
## 2 sqrt(displacement)    0.6746
## 3       horsepower^2    0.5074

Comment: log(weight) gives the best fit among these single-predictor transformed models (R² ≈ 0.77). The log transformation is appropriate because the relationship between weight and mpg is not linear — large weight increases matter more at smaller weight values.


Chapter 3, Question 15

Setup — Boston Dataset

bos <- Boston
cat("Dimensions:", dim(bos), "\n")
## Dimensions: 506 13
names(bos)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "lstat"   "medv"

Part (a) — Simple Regression of Each Predictor on crim

preds <- setdiff(names(bos), "crim")

par(mfrow = c(4, 4), mar = c(3, 3, 2, 1))
for (p in preds) {
  plot(bos[[p]], bos$crim,
       xlab = p, ylab = "crim",
       main = p,
       pch  = 16,
       col  = adjustcolor("#5b8db8", 0.4),
       cex  = 0.6)
  abline(lm(as.formula(paste("crim ~", p)), data = bos),
         col = "firebrick", lwd = 1.8)
}
par(mfrow = c(1, 1))

slr_summary <- data.frame(
  Predictor   = preds,
  Coefficient = NA_real_,
  P_Value     = NA_real_
)

for (i in seq_along(preds)) {
  fit <- lm(as.formula(paste("crim ~", preds[i])), data = bos)
  cf  <- summary(fit)$coefficients[2, ]
  slr_summary$Coefficient[i] <- round(cf["Estimate"], 4)
  slr_summary$P_Value[i]     <- round(cf["Pr(>|t|)"], 6)
}

slr_summary$Significant <- ifelse(slr_summary$P_Value < 0.05, "Yes", "No")
slr_summary
##    Predictor Coefficient  P_Value Significant
## 1         zn     -0.0739 0.000006         Yes
## 2      indus      0.5098 0.000000         Yes
## 3       chas     -1.8928 0.209435          No
## 4        nox     31.2485 0.000000         Yes
## 5         rm     -2.6841 0.000001         Yes
## 6        age      0.1078 0.000000         Yes
## 7        dis     -1.5509 0.000000         Yes
## 8        rad      0.6179 0.000000         Yes
## 9        tax      0.0297 0.000000         Yes
## 10   ptratio      1.1520 0.000000         Yes
## 11     lstat      0.5488 0.000000         Yes
## 12      medv     -0.3632 0.000000         Yes

All predictors except chas (the Charles River dummy variable) show a statistically significant association with crim in simple linear regression.


Part (b) — Multiple Linear Regression

mlr_boston <- lm(crim ~ ., data = bos)
summary(mlr_boston)
## 
## Call:
## lm(formula = crim ~ ., data = bos)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Predictors where we reject H₀ (βⱼ = 0) at α = 0.05: zn, dis, rad, black, medv. Only a handful remain significant once we control for all other variables, which is a sign of multicollinearity among the predictors.


Part (c) — Comparing Simple vs. Multiple Regression Coefficients

multi_coefs <- coef(mlr_boston)[-1]
matched     <- intersect(names(multi_coefs), slr_summary$Predictor)

x_vals <- slr_summary$Coefficient[match(matched, slr_summary$Predictor)]
y_vals <- multi_coefs[matched]

plot(x_vals, y_vals,
     xlab = "Simple Regression Coefficient",
     ylab = "Multiple Regression Coefficient",
     main = "Simple vs. Multiple Regression Coefficients\n(each point = one predictor)",
     pch  = 19,
     col  = "steelblue",
     cex  = 1.3)
abline(a = 0, b = 1, lty = 2, col = "gray50")
text(x_vals, y_vals, labels = matched, cex = 0.65, pos = 3, col = "gray30")

The coefficients differ considerably — especially for nox, which has a large positive simple regression coefficient but a very different estimate in the multiple model. This is a classic sign of confounding: nox is correlated with other predictors that also relate to crime.


Part (d) — Testing for Non-Linear Relationships

poly_res <- data.frame(
  Predictor = preds,
  P_linear  = NA_real_,
  P_quad    = NA_real_,
  P_cubic   = NA_real_
)

for (i in seq_along(preds)) {
  p   <- preds[i]
  fit <- lm(as.formula(
    paste("crim ~", p, "+ I(", p, "^2) + I(", p, "^3)")), data = bos)
  cf  <- summary(fit)$coefficients
  poly_res$P_linear[i] <- round(cf[2, 4], 5)
  if (nrow(cf) >= 3) poly_res$P_quad[i]  <- round(cf[3, 4], 5)
  if (nrow(cf) >= 4) poly_res$P_cubic[i] <- round(cf[4, 4], 5)
}

poly_res$NonLinear <- (poly_res$P_quad < 0.05 | poly_res$P_cubic < 0.05)
poly_res
##    Predictor P_linear  P_quad P_cubic NonLinear
## 1         zn  0.00261 0.09375 0.22954     FALSE
## 2      indus  0.00005 0.00000 0.00000      TRUE
## 3       chas  0.20943      NA      NA        NA
## 4        nox  0.00000 0.00000 0.00000      TRUE
## 5         rm  0.21176 0.36411 0.50858     FALSE
## 6        age  0.14266 0.04738 0.00668      TRUE
## 7        dis  0.00000 0.00000 0.00000      TRUE
## 8        rad  0.62342 0.61301 0.48231     FALSE
## 9        tax  0.10971 0.13747 0.24385     FALSE
## 10   ptratio  0.00303 0.00412 0.00630      TRUE
## 11     lstat  0.33453 0.06459 0.12989     FALSE
## 12      medv  0.00000 0.00000 0.00000      TRUE

Comment: There is evidence of non-linear association for several predictors — notably indus, nox, age, dis, ptratio, and medv. For these, the quadratic or cubic term is statistically significant, suggesting the relationship with crime rate is curved rather than straight.


Chapter 4, Question 13

Data Preparation

wk <- Weekly
str(wk)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(wk)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 

Part (a) — Numerical and Graphical Summaries

summary(wk[, c("Lag1","Lag2","Lag3","Lag4","Lag5","Volume","Today")])
##       Lag1               Lag2               Lag3               Lag4         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580   1st Qu.: -1.1580  
##  Median :  0.2410   Median :  0.2410   Median :  0.2410   Median :  0.2380  
##  Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472   Mean   :  0.1458  
##  3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260
ggpairs(wk[, c("Lag1","Lag2","Lag3","Lag4","Lag5","Volume")],
        lower = list(continuous = wrap("points",
                                       alpha = 0.2, size = 0.5,
                                       color = "steelblue")),
        diag  = list(continuous = wrap("densityDiag",
                                       fill = "steelblue", alpha = 0.4)),
        upper = list(continuous = wrap("cor", size = 3.5))) +
  theme_minimal(base_size = 9) +
  labs(title = "Weekly Dataset — Pairwise Summary")

ggplot(wk, aes(x = seq_along(Volume), y = Volume)) +
  geom_line(color = "steelblue", linewidth = 0.6) +
  labs(title = "Trading Volume Over Time",
       x = "Observation Index", y = "Volume") +
  theme_minimal()

table(wk$Direction)
## 
## Down   Up 
##  484  605
prop.table(table(wk$Direction))
## 
##      Down        Up 
## 0.4444444 0.5555556

Patterns: Trading volume increases substantially over the 21-year period — a clear upward trend. The lag variables show very little correlation with each other or with today’s return. The market went “Up” about 55.6% of the time, a slight bullish bias in this period.


Part (b) — Logistic Regression with All Lags + Volume

logit_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                  data   = wk,
                  family = "binomial")
summary(logit_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = wk)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Which predictors are significant? Only Lag2 has a p-value below 0.05 (p ≈ 0.030). None of the other lag variables or Volume reach statistical significance in this full model.


Part (c) — Confusion Matrix on Full Data

fitted_probs <- predict(logit_full, type = "response")
fitted_class <- ifelse(fitted_probs > 0.5, "Up", "Down")

cm <- table(Predicted = fitted_class, Actual = wk$Direction)
cm
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
overall_acc <- mean(fitted_class == wk$Direction)
cat("Overall accuracy:", round(overall_acc * 100, 2), "%\n")
## Overall accuracy: 56.11 %

Interpretation: The confusion matrix reveals the model almost always predicts “Up”. It catches most true “Up” weeks (557 correct) but misclassifies the vast majority of “Down” weeks (430 out of 484 predicted as “Up”). The 56.1% overall accuracy is barely above the naive baseline of always guessing “Up” (~55.6%), so the model is not particularly useful on the training data.


Part (d) — Train on 1990–2008, Test on 2009–2010 (Lag2 only)

train_idx  <- wk$Year <= 2008
train_data <- wk[train_idx, ]
test_data  <- wk[!train_idx, ]

cat("Training size:", nrow(train_data), "| Test size:", nrow(test_data), "\n")
## Training size: 985 | Test size: 104
logit_lag2 <- glm(Direction ~ Lag2,
                  data   = train_data,
                  family = "binomial")

summary(logit_lag2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = train_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
test_probs <- predict(logit_lag2, newdata = test_data, type = "response")
test_class <- ifelse(test_probs > 0.5, "Up", "Down")

cm2 <- table(Predicted = test_class, Actual = test_data$Direction)
cm2
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
held_acc <- mean(test_class == test_data$Direction)
cat("Test accuracy (2009-2010):", round(held_acc * 100, 2), "%\n")
## Test accuracy (2009-2010): 62.5 %

Comment: Using only Lag2 as a predictor and evaluating on genuinely unseen data from 2009–2010, the model achieves an accuracy of about 62.5%, which is a meaningful improvement over the full model. This suggests that Lag2 carries the most predictive signal and including the other noisy lag variables actually hurt performance. Less can be more in predictive modeling.