1 Chapter 2 — Question 9

This exercise involves the Auto data set. Make sure that the missing values have been removed from the data.

data(Auto)
Auto <- na.omit(Auto)
dim(Auto)
## [1] 392   9
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

1.1 (a) Quantitative vs. Qualitative Predictors

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year.

Qualitative predictor: name (car name, character/factor); origin is coded as integers (1 = American, 2 = European, 3 = Japanese) and is best treated as a qualitative (categorical) variable despite its numeric encoding.

1.2 (b) Range of Each Quantitative Predictor

quant_vars <- c("mpg", "cylinders", "displacement",
                "horsepower", "weight", "acceleration", "year")

sapply(Auto[, quant_vars], range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

1.3 (c) Mean and Standard Deviation of Each Quantitative Predictor

rbind(
  Mean = sapply(Auto[, quant_vars], mean),
  SD   = sapply(Auto[, quant_vars], sd)
)
##            mpg cylinders displacement horsepower    weight acceleration
## Mean 23.445918  5.471939      194.412  104.46939 2977.5842    15.541327
## SD    7.805007  1.705783      104.644   38.49116  849.4026     2.758864
##           year
## Mean 75.979592
## SD    3.683737

1.4 (d) Remove Observations 10–85; Recompute Statistics

Auto_sub <- Auto[-(10:85), ]
dim(Auto_sub)
## [1] 316   9
rbind(
  Range_Min = sapply(Auto_sub[, quant_vars], min),
  Range_Max = sapply(Auto_sub[, quant_vars], max),
  Mean      = sapply(Auto_sub[, quant_vars], mean),
  SD        = sapply(Auto_sub[, quant_vars], sd)
)
##                 mpg cylinders displacement horsepower    weight acceleration
## Range_Min 11.000000  3.000000     68.00000   46.00000 1649.0000     8.500000
## Range_Max 46.600000  8.000000    455.00000  230.00000 4997.0000    24.800000
## Mean      24.404430  5.373418    187.24051  100.72152 2935.9715    15.726899
## SD         7.867283  1.654179     99.67837   35.70885  811.3002     2.693721
##                year
## Range_Min 70.000000
## Range_Max 82.000000
## Mean      77.145570
## SD         3.106217

After removing observations 10–85 (76 rows), the subset has 316 observations. The ranges, means, and standard deviations shift noticeably for variables such as horsepower and weight because those early rows represent a particular era of vehicles.

1.5 (e) Graphical Investigation of Predictors

# Scatterplot matrix (quantitative variables only)
pairs(Auto[, quant_vars],
      main = "Scatterplot Matrix — Auto Dataset",
      pch  = 19, cex = 0.4, col = "steelblue")

Findings:

  • displacement, horsepower, and weight are strongly positively correlated with each other — heavier, larger-engine cars tend to have higher horsepower.
  • All three of the above are strongly negatively correlated with mpg — bigger/heavier cars get worse fuel economy.
  • year shows a moderate positive relationship with mpg, suggesting that newer model-year cars are more fuel-efficient.
  • acceleration is weakly positively associated with mpg and negatively with horsepower/weight.

1.6 (f) Useful Predictors for mpg

Auto_long <- Auto %>%
  select(all_of(quant_vars)) %>%
  pivot_longer(-mpg, names_to = "predictor", values_to = "value")

ggplot(Auto_long, aes(x = value, y = mpg)) +
  geom_point(alpha = 0.3, color = "steelblue", size = 0.9) +
  geom_smooth(method = "loess", se = TRUE, color = "tomato", linewidth = 0.8) +
  facet_wrap(~predictor, scales = "free_x") +
  labs(title = "Relationship of Each Predictor with mpg",
       x = "Predictor Value", y = "mpg") +
  theme_bw()

Conclusion: Based on the plots, displacement, horsepower, weight, and year all appear strongly associated with mpg and would be useful predictors in a model. cylinders also shows a clear negative relationship. acceleration has a weaker and noisier association.


2 Chapter 3 — Question 9

Multiple linear regression on the Auto data set.

data(Auto)
Auto <- na.omit(Auto)

2.1 (a) Scatterplot Matrix

pairs(Auto[, -9],   # exclude 'name'
      pch = 19, cex = 0.4, col = "darkorchid",
      main = "Scatterplot Matrix — All Auto Variables")

2.2 (b) Correlation Matrix (excluding name)

cor(Auto[, sapply(Auto, is.numeric)])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

2.3 (c) Multiple Linear Regression: mpg ~ . (excluding name)

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

2.3.1 i. Relationship between predictors and response?

The overall F-statistic is highly significant (p < 2.2e-16), so yes, there is a statistically significant relationship between the predictors and mpg. The adjusted R² ≈ 0.82, meaning the predictors explain about 82 % of the variance in fuel economy.

2.3.2 ii. Which predictors are statistically significant?

From the summary output, predictors with p-value < 0.05 are: displacement, weight, year, and origin. horsepower and acceleration are not significant at the 5 % level after accounting for the other variables.

2.3.3 iii. Coefficient for year

The coefficient for year is approximately +0.75, meaning that, holding all other predictors constant, each additional model year is associated with about 0.75 more miles per gallon. This reflects improvements in fuel efficiency over time due to technological advances and regulatory pressure.

2.4 (d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))

Comments:

  • Residuals vs Fitted: A slight U-shaped pattern suggests mild non-linearity not captured by the linear model.
  • Normal Q-Q: Residuals are roughly normal with minor deviations in the tails.
  • Scale-Location: Some heteroscedasticity is visible — variance increases slightly at higher fitted values.
  • Residuals vs Leverage: Observation 14 (and a few others) appear to have unusually high leverage. No single point has extreme Cook’s distance, though.

2.5 (e) Interaction Effects

# Test a few meaningful interactions
lm_inter <- lm(mpg ~ . - name + displacement:weight + horsepower:weight,
               data = Auto)
summary(lm_inter)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower:weight, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9445 -1.6144 -0.1607  1.5992 12.0129 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.142e+00  4.723e+00   0.242 0.809055    
## cylinders            3.110e-02  2.921e-01   0.106 0.915276    
## displacement        -1.670e-02  1.960e-02  -0.852 0.394640    
## horsepower          -1.799e-01  4.796e-02  -3.752 0.000202 ***
## weight              -1.122e-02  7.281e-04 -15.406  < 2e-16 ***
## acceleration        -4.948e-02  9.447e-02  -0.524 0.600789    
## year                 7.748e-01  4.512e-02  17.174  < 2e-16 ***
## origin               7.315e-01  2.647e-01   2.763 0.005998 ** 
## displacement:weight  6.763e-06  5.494e-06   1.231 0.219098    
## horsepower:weight    4.082e-05  1.286e-05   3.173 0.001628 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.929 on 382 degrees of freedom
## Multiple R-squared:  0.8624, Adjusted R-squared:  0.8591 
## F-statistic:   266 on 9 and 382 DF,  p-value: < 2.2e-16

The interaction displacement:weight is statistically significant (p < 0.05), suggesting that the joint effect of engine size and vehicle weight on fuel economy is not purely additive. The horsepower:weight interaction is also worth examining.

2.6 (f) Variable Transformations

# log, sqrt, and squared transformations
lm_log  <- lm(mpg ~ log(displacement) + log(horsepower) +
                log(weight) + acceleration + year + origin,
              data = Auto)

lm_sqrt <- lm(mpg ~ sqrt(displacement) + sqrt(horsepower) +
                sqrt(weight) + acceleration + year + origin,
              data = Auto)

lm_sq   <- lm(mpg ~ displacement + I(displacement^2) +
                horsepower + I(horsepower^2) +
                weight + I(weight^2) +
                acceleration + year + origin,
              data = Auto)

cat("Adj. R² — log model :", summary(lm_log)$adj.r.squared,  "\n")
## Adj. R² — log model : 0.8445302
cat("Adj. R² — sqrt model:", summary(lm_sqrt)$adj.r.squared, "\n")
## Adj. R² — sqrt model: 0.8312818
cat("Adj. R² — sq model  :", summary(lm_sq)$adj.r.squared,   "\n")
## Adj. R² — sq model  : 0.861586
cat("Adj. R² — linear    :", summary(lm_fit)$adj.r.squared,  "\n")
## Adj. R² — linear    : 0.8182238

Findings: The log-transformed model achieves a higher adjusted R² than the untransformed model, confirming the non-linear (concave) relationship between the heavy predictors and mpg. Squaring or square-rooting also improves fit modestly. Log transformations of displacement, horsepower, and weight appear most beneficial.


3 Chapter 3 — Question 15

Using the Boston data set to predict per capita crime rate (crim).

data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

3.1 (a) Simple Linear Regression for Each Predictor

predictors <- setdiff(names(Boston), "crim")

# Fit univariate models and collect p-values & coefficients
uni_results <- lapply(predictors, function(var) {
  fit <- lm(reformulate(var, "crim"), data = Boston)
  s   <- summary(fit)
  data.frame(
    predictor = var,
    coef      = coef(fit)[2],
    p_value   = s$coefficients[2, 4],
    r_squared = s$r.squared
  )
})
uni_df <- do.call(rbind, uni_results)
uni_df <- uni_df[order(uni_df$p_value), ]
print(uni_df)
##         predictor        coef      p_value   r_squared
## rad           rad  0.61791093 2.693844e-56 0.391256687
## tax           tax  0.02974225 2.357127e-47 0.339614243
## lstat       lstat  0.54880478 2.654277e-27 0.207590933
## nox           nox 31.24853120 3.751739e-23 0.177217182
## indus       indus  0.50977633 1.450349e-21 0.165310070
## medv         medv -0.36315992 1.173987e-19 0.150780469
## dis           dis -1.55090168 8.519949e-19 0.144149375
## age           age  0.10778623 2.854869e-16 0.124421452
## ptratio   ptratio  1.15198279 2.942922e-11 0.084068439
## rm             rm -2.68405122 6.346703e-07 0.048069117
## zn             zn -0.07393498 5.506472e-06 0.040187908
## chas         chas -1.89277655 2.094345e-01 0.003123869
# Quick visual: crim vs. a few significant predictors
sig_preds <- head(uni_df$predictor, 6)

Boston_long <- Boston %>%
  select(crim, all_of(sig_preds)) %>%
  pivot_longer(-crim, names_to = "predictor", values_to = "value")

ggplot(Boston_long, aes(x = value, y = crim)) +
  geom_point(alpha = 0.3, color = "steelblue", size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, color = "tomato") +
  facet_wrap(~predictor, scales = "free_x") +
  labs(title = "Simple Linear Regression: crim ~ each predictor",
       x = "Predictor", y = "crim") +
  theme_bw()

Almost all predictors show a statistically significant univariate association with crim. The strongest relationships are with rad (accessibility to radial highways), tax (property-tax rate), lstat, medv, and dis.

3.2 (b) Multiple Regression

lm_boston <- lm(crim ~ ., data = Boston)
summary(lm_boston)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

At α = 0.05 in the multiple regression, we can reject H₀: βⱼ = 0 for zn, dis, rad, black, and medv. Many predictors that were significant in univariate models lose significance once we control for the others, indicating multicollinearity.

3.3 (c) Univariate vs. Multiple Regression Coefficients

multi_coefs <- coef(lm_boston)[-1]   # drop intercept

coef_df <- data.frame(
  predictor  = names(multi_coefs),
  multi_coef = as.numeric(multi_coefs),
  uni_coef   = uni_df$coef[match(names(multi_coefs), uni_df$predictor)]
)

ggplot(coef_df, aes(x = uni_coef, y = multi_coef, label = predictor)) +
  geom_point(color = "steelblue", size = 2) +
  geom_text(vjust = -0.6, size = 3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
  labs(title = "Univariate vs. Multiple Regression Coefficients",
       x = "Simple Regression Coefficient",
       y = "Multiple Regression Coefficient") +
  theme_bw()

The coefficients differ substantially for several predictors (e.g., nox, rm), demonstrating that multicollinearity causes the simple-regression estimates to be misleading.

3.4 (d) Non-linear Association (Polynomial Regression)

poly_results <- lapply(predictors, function(var) {
  # Need at least 4 unique values to fit a degree-3 polynomial
  if (length(unique(Boston[[var]])) < 4) {
    return(data.frame(predictor = var, p_quad = NA, p_cubic = NA))
  }
  fit3 <- lm(crim ~ poly(Boston[[var]], 3), data = Boston)
  s    <- summary(fit3)
  p2 <- s$coefficients[3, 4]
  p3 <- s$coefficients[4, 4]
  data.frame(predictor = var, p_quad = p2, p_cubic = p3)
})
poly_df <- do.call(rbind, poly_results)
poly_df$nonlinear_sig <- (!is.na(poly_df$p_quad) &
                          (poly_df$p_quad < 0.05 | poly_df$p_cubic < 0.05))
print(poly_df)
##    predictor       p_quad      p_cubic nonlinear_sig
## 1         zn 4.420507e-03 2.295386e-01          TRUE
## 2      indus 1.086057e-03 1.196405e-12          TRUE
## 3       chas           NA           NA         FALSE
## 4        nox 7.736755e-05 6.961110e-16          TRUE
## 5         rm 1.508545e-03 5.085751e-01          TRUE
## 6        age 2.291156e-06 6.679915e-03          TRUE
## 7        dis 7.869767e-14 1.088832e-08          TRUE
## 8        rad 9.120558e-03 4.823138e-01          TRUE
## 9        tax 3.665348e-06 2.438507e-01          TRUE
## 10   ptratio 2.405468e-03 6.300514e-03          TRUE
## 11     lstat 3.780418e-02 1.298906e-01          TRUE
## 12      medv 2.928577e-35 1.046510e-12          TRUE

For many predictors (e.g., indus, nox, age, dis, ptratio, medv), the quadratic or cubic terms are statistically significant, providing evidence of non-linear associations with per capita crime rate.


4 Chapter 4 — Question 13

Using the Weekly data set (1,089 weekly returns, 1990–2010).

data(Weekly)
str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

4.1 (a) Numerical and Graphical Summaries

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
# Volume over time
ggplot(Weekly, aes(x = seq_len(nrow(Weekly)), y = Volume)) +
  geom_line(color = "steelblue") +
  labs(title = "Weekly Trading Volume Over Time",
       x = "Week Index", y = "Volume (billions of shares)") +
  theme_bw()

pairs(Weekly[, 1:8], col = ifelse(Weekly$Direction == "Up", "green4", "tomato"),
      pch = 19, cex = 0.4,
      main = "Weekly Data — Green = Up, Red = Down")

Patterns: Trading Volume has increased substantially over the 21-year period. The lag returns (Lag1Lag5) show little obvious serial correlation with each other or with Today. The market went Up slightly more often than Down (~55 % vs ~45 %).

4.2 (b) Logistic Regression — Full Data

glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data    = Weekly,
                family  = binomial)
summary(glm_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Significant predictors: Only Lag2 has a p-value below 0.05 (p ≈ 0.030). The other lag variables and Volume are not statistically significant in this model.

4.3 (c) Confusion Matrix — Full Data

prob_full <- predict(glm_full, type = "response")
pred_full <- ifelse(prob_full > 0.5, "Up", "Down")

conf_full <- table(Predicted = pred_full, Actual = Weekly$Direction)
conf_full
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
# Overall fraction correct
accuracy_full <- mean(pred_full == Weekly$Direction)
cat("Overall accuracy (full model):", round(accuracy_full, 4), "\n")
## Overall accuracy (full model): 0.5611

Interpretation: The model predicts Up for the vast majority of weeks, achieving an overall accuracy of 56.1 %. However, it does a poor job predicting Down weeks — it misclassifies most of them as Up. The model’s apparent accuracy is partly driven by the market’s slight upward bias.

4.4 (d) Logistic Regression — Training 1990–2008, Test 2009–2010

train <- Weekly$Year <= 2008
test  <- !train

glm_lag2 <- glm(Direction ~ Lag2,
                data   = Weekly,
                subset = train,
                family = binomial)

prob_test <- predict(glm_lag2, newdata = Weekly[test, ], type = "response")
pred_test <- ifelse(prob_test > 0.5, "Up", "Down")

conf_test <- table(Predicted = pred_test, Actual = Weekly$Direction[test])
conf_test
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
accuracy_test <- mean(pred_test == Weekly$Direction[test])
cat("Overall accuracy (Lag2 model, 2009-2010):", round(accuracy_test, 4), "\n")
## Overall accuracy (Lag2 model, 2009-2010): 0.625

Result: Using only Lag2 and evaluating on 2009–2010 held-out data, the model achieves an accuracy of 62.5 %. The confusion matrix shows the model is still biased toward predicting Up, but it correctly identifies a reasonable proportion of up-weeks, which is practically relevant for a trading strategy.


End of Midterm Homework