1 Chapter 2 — Exercise 9: Auto Dataset

Make sure that the missing values have been removed from the data.

data(Auto)
Auto <- na.omit(Auto)
dim(Auto)
## [1] 392   9
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

1.1 (a) Quantitative vs. Qualitative Predictors

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative predictor: name, origin (origin is coded as 1/2/3 representing American/European/Japanese, so it is treated as qualitative/categorical)

1.2 (b) Range of Each Quantitative Predictor

quant_vars <- Auto[, c("mpg", "cylinders", "displacement",
                        "horsepower", "weight", "acceleration", "year")]

sapply(quant_vars, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

1.3 (c) Mean and Standard Deviation of Each Quantitative Predictor

summary_stats <- data.frame(
  Mean = sapply(quant_vars, mean),
  SD   = sapply(quant_vars, sd)
)
round(summary_stats, 2)
##                 Mean     SD
## mpg            23.45   7.81
## cylinders       5.47   1.71
## displacement  194.41 104.64
## horsepower    104.47  38.49
## weight       2977.58 849.40
## acceleration   15.54   2.76
## year           75.98   3.68

1.4 (d) After Removing Observations 10–85

Auto_sub <- Auto[-(10:85), ]

quant_sub <- Auto_sub[, c("mpg", "cylinders", "displacement",
                           "horsepower", "weight", "acceleration", "year")]

sub_stats <- data.frame(
  Range_Min = sapply(quant_sub, min),
  Range_Max = sapply(quant_sub, max),
  Mean      = sapply(quant_sub, mean),
  SD        = sapply(quant_sub, sd)
)
round(sub_stats, 2)
##              Range_Min Range_Max    Mean     SD
## mpg               11.0      46.6   24.40   7.87
## cylinders          3.0       8.0    5.37   1.65
## displacement      68.0     455.0  187.24  99.68
## horsepower        46.0     230.0  100.72  35.71
## weight          1649.0    4997.0 2935.97 811.30
## acceleration       8.5      24.8   15.73   2.69
## year              70.0      82.0   77.15   3.11

1.5 (e) Graphical Investigation of Predictors

pairs(quant_vars, pch = 19, cex = 0.5, col = "steelblue",
      main = "Scatterplot Matrix — Auto Dataset")

Findings:

  • displacement, horsepower, and weight are strongly positively correlated with each other.
  • cylinders also increases with displacement, horsepower, and weight.
  • acceleration is negatively correlated with horsepower and displacement.
  • year appears largely independent of the other mechanical variables.

1.6 (f) Predicting mpg

par(mfrow = c(2, 3))
plot(Auto$cylinders,     Auto$mpg, main = "mpg vs cylinders",     xlab = "cylinders",     ylab = "mpg", pch = 19, col = "steelblue")
plot(Auto$displacement,  Auto$mpg, main = "mpg vs displacement",  xlab = "displacement",  ylab = "mpg", pch = 19, col = "steelblue")
plot(Auto$horsepower,    Auto$mpg, main = "mpg vs horsepower",    xlab = "horsepower",    ylab = "mpg", pch = 19, col = "steelblue")
plot(Auto$weight,        Auto$mpg, main = "mpg vs weight",        xlab = "weight",        ylab = "mpg", pch = 19, col = "steelblue")
plot(Auto$acceleration,  Auto$mpg, main = "mpg vs acceleration",  xlab = "acceleration",  ylab = "mpg", pch = 19, col = "steelblue")
plot(Auto$year,          Auto$mpg, main = "mpg vs year",          xlab = "year",          ylab = "mpg", pch = 19, col = "steelblue")

par(mfrow = c(1, 1))

Conclusion: displacement, horsepower, and weight show strong negative relationships with mpg — heavier and more powerful cars are less fuel efficient. year shows a positive relationship, suggesting newer cars tend to be more fuel efficient. cylinders also shows a negative relationship. These variables are strong candidates for predicting mpg.


2 Chapter 3 — Exercise 9: Multiple Linear Regression on Auto

2.1 (a) Scatterplot Matrix

Auto_num <- Auto[, !names(Auto) %in% "name"]
pairs(Auto_num, pch = 19, cex = 0.5, col = "tomato",
      main = "Scatterplot Matrix — All Auto Variables")

2.2 (b) Correlation Matrix

cor_matrix <- cor(Auto_num)
round(cor_matrix, 2)
##                mpg cylinders displacement horsepower weight acceleration  year
## mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42  0.58
## cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35
## displacement -0.81      0.95         1.00       0.90   0.93        -0.54 -0.37
## horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69 -0.42
## weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31
## acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.29
## year          0.58     -0.35        -0.37      -0.42  -0.31         0.29  1.00
## origin        0.57     -0.57        -0.61      -0.46  -0.59         0.21  0.18
##              origin
## mpg            0.57
## cylinders     -0.57
## displacement  -0.61
## horsepower    -0.46
## weight        -0.59
## acceleration   0.21
## year           0.18
## origin         1.00

2.3 (c) Multiple Linear Regression: mpg ~ all except name

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comments:

  1. Is there a relationship between the predictors and the response? The overall F-statistic is very large with a p-value < 2.2e-16, so yes — there is strong statistical evidence of a relationship between the predictors and mpg.

  2. Which predictors are statistically significant? displacement, weight, year, and origin all have p-values below 0.05 and appear to have statistically significant relationships with mpg.

  3. Coefficient for year: The coefficient for year is positive (≈ 0.75), suggesting that on average, each additional model year is associated with about a 0.75 mpg increase in fuel efficiency, holding other variables constant. This reflects the trend toward more fuel-efficient cars over time.

2.4 (d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))

Comments:

  • The Residuals vs Fitted plot shows a slight non-linear pattern (U-shape), suggesting the linear model may not fully capture the relationship.
  • The Scale-Location plot also indicates mild heteroscedasticity.
  • The Residuals vs Leverage plot identifies observation 14 as having unusually high leverage.
  • There are no extreme outliers (standardized residuals mostly within ±3), but a few points warrant attention.

2.5 (e) Interaction Effects

# Test some interaction terms
lm_interact <- lm(mpg ~ . - name + displacement:weight + horsepower:weight, data = Auto)
summary(lm_interact)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower:weight, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9445 -1.6144 -0.1607  1.5992 12.0129 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.142e+00  4.723e+00   0.242 0.809055    
## cylinders            3.110e-02  2.921e-01   0.106 0.915276    
## displacement        -1.670e-02  1.960e-02  -0.852 0.394640    
## horsepower          -1.799e-01  4.796e-02  -3.752 0.000202 ***
## weight              -1.122e-02  7.281e-04 -15.406  < 2e-16 ***
## acceleration        -4.948e-02  9.447e-02  -0.524 0.600789    
## year                 7.748e-01  4.512e-02  17.174  < 2e-16 ***
## origin               7.315e-01  2.647e-01   2.763 0.005998 ** 
## displacement:weight  6.763e-06  5.494e-06   1.231 0.219098    
## horsepower:weight    4.082e-05  1.286e-05   3.173 0.001628 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.929 on 382 degrees of freedom
## Multiple R-squared:  0.8624, Adjusted R-squared:  0.8591 
## F-statistic:   266 on 9 and 382 DF,  p-value: < 2.2e-16

Comment: The interaction between displacement:weight and horsepower:weight appears statistically significant (p < 0.05), suggesting these combinations of predictors have a joint effect on mpg beyond their individual contributions.

2.6 (f) Variable Transformations

par(mfrow = c(1, 3))

# Log transformation
lm_log <- lm(mpg ~ log(horsepower), data = Auto)
plot(log(Auto$horsepower), Auto$mpg,
     main = "mpg vs log(horsepower)", xlab = "log(horsepower)", ylab = "mpg",
     pch = 19, col = "steelblue")
abline(lm_log, col = "red", lwd = 2)

# Square root
lm_sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
plot(sqrt(Auto$horsepower), Auto$mpg,
     main = "mpg vs sqrt(horsepower)", xlab = "sqrt(horsepower)", ylab = "mpg",
     pch = 19, col = "tomato")
abline(lm_sqrt, col = "red", lwd = 2)

# Quadratic
lm_sq <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
hp_seq <- seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 200)
pred_sq <- predict(lm_sq, newdata = data.frame(horsepower = hp_seq))
plot(Auto$horsepower, Auto$mpg,
     main = "mpg vs horsepower^2", xlab = "horsepower", ylab = "mpg",
     pch = 19, col = "seagreen")
lines(hp_seq, pred_sq, col = "red", lwd = 2)

par(mfrow = c(1, 1))

Findings: The log and square-root transformations of horsepower produce a more linear relationship with mpg than the untransformed variable. The quadratic fit also captures the curvature better than a simple linear model, consistent with the non-linearity we saw in the residual plots.


3 Chapter 3 — Exercise 15: Boston Dataset (Predicting Crime Rate)

data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

3.1 (a) Simple Linear Regression for Each Predictor

predictors <- setdiff(names(Boston), "crim")

slr_results <- lapply(predictors, function(var) {
  fit <- lm(as.formula(paste("crim ~", var)), data = Boston)
  s   <- summary(fit)
  data.frame(
    predictor = var,
    coef      = coef(fit)[2],
    p_value   = coef(s)[2, 4],
    r_squared = s$r.squared
  )
})

slr_df <- do.call(rbind, slr_results)
slr_df <- slr_df[order(slr_df$p_value), ]
rownames(slr_df) <- NULL

# Round only numeric columns
slr_df_display <- slr_df
slr_df_display[, c("coef", "p_value", "r_squared")] <- round(slr_df[, c("coef", "p_value", "r_squared")], 4)
slr_df_display
##    predictor    coef p_value r_squared
## 1        rad  0.6179  0.0000    0.3913
## 2        tax  0.0297  0.0000    0.3396
## 3      lstat  0.5488  0.0000    0.2076
## 4        nox 31.2485  0.0000    0.1772
## 5      indus  0.5098  0.0000    0.1653
## 6       medv -0.3632  0.0000    0.1508
## 7        dis -1.5509  0.0000    0.1441
## 8        age  0.1078  0.0000    0.1244
## 9    ptratio  1.1520  0.0000    0.0841
## 10        rm -2.6841  0.0000    0.0481
## 11        zn -0.0739  0.0000    0.0402
## 12      chas -1.8928  0.2094    0.0031
par(mfrow = c(3, 5))
for (var in predictors) {
  plot(Boston[[var]], Boston$crim,
       xlab = var, ylab = "crim", pch = 19, cex = 0.5, col = "steelblue")
  abline(lm(as.formula(paste("crim ~", var)), data = Boston), col = "red", lwd = 1.5)
}
par(mfrow = c(1, 1))

Results: Almost all predictors have a statistically significant association with crim in simple linear regression (p < 0.05). Exceptions include chas (proximity to the Charles River), which is not significant.

3.2 (b) Multiple Linear Regression

mlr_fit <- lm(crim ~ ., data = Boston)
summary(mlr_fit)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Results: In the multiple regression model, far fewer predictors remain significant. We can reject H₀: βⱼ = 0 for zn, dis, rad, black, and medv at the 0.05 level.

3.3 (c) Univariate vs. Multiple Regression Coefficients

uni_coefs  <- setNames(slr_df$coef, slr_df$predictor)
multi_coefs <- coef(mlr_fit)[-1]  # exclude intercept

# Align
common <- intersect(names(uni_coefs), names(multi_coefs))
plot(uni_coefs[common], multi_coefs[common],
     xlab = "Simple Regression Coefficient",
     ylab = "Multiple Regression Coefficient",
     main = "Univariate vs. Multiple Regression Coefficients",
     pch  = 19, col = "steelblue")
abline(0, 1, lty = 2, col = "gray")
text(uni_coefs[common], multi_coefs[common], labels = common, cex = 0.7, pos = 3)

Comment: Several coefficients differ substantially between the two models. This is expected — collinearity among predictors means individual simple regression coefficients reflect both the direct effect and the effects of correlated predictors, while multiple regression isolates each predictor’s partial effect.

3.4 (d) Non-linear Associations (Polynomial Regression)

poly_results <- lapply(predictors, function(var) {
  if (length(unique(Boston[[var]])) < 4) return(NULL)  # skip near-constant vars
  fit <- lm(as.formula(paste("crim ~", var, "+ I(", var, "^2) + I(", var, "^3)")),
            data = Boston)
  s   <- summary(fit)
  # p-values for X^2 and X^3
  coefs <- coef(s)
  data.frame(
    predictor  = var,
    p_X2       = ifelse(nrow(coefs) >= 3, round(coefs[3, 4], 4), NA),
    p_X3       = ifelse(nrow(coefs) >= 4, round(coefs[4, 4], 4), NA),
    r_squared  = round(s$r.squared, 4)
  )
})

poly_df <- do.call(rbind, Filter(Negate(is.null), poly_results))
poly_df
##    predictor   p_X2   p_X3 r_squared
## 1         zn 0.0938 0.2295    0.0582
## 2      indus 0.0000 0.0000    0.2597
## 3        nox 0.0000 0.0000    0.2970
## 4         rm 0.3641 0.5086    0.0678
## 5        age 0.0474 0.0067    0.1742
## 6        dis 0.0000 0.0000    0.2778
## 7        rad 0.6130 0.4823    0.4000
## 8        tax 0.1375 0.2439    0.3689
## 9    ptratio 0.0041 0.0063    0.1138
## 10     lstat 0.0646 0.1299    0.2179
## 11      medv 0.0000 0.0000    0.4202

Findings: For several predictors (e.g., dis, nox, age, medv), the quadratic and/or cubic terms are statistically significant, providing evidence of non-linear associations with the crime rate.


4 Chapter 4 — Exercise 13: Weekly Stock Market Data

data(Weekly)
head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
dim(Weekly)
## [1] 1089    9

4.1 (a) Numerical and Graphical Summaries

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
par(mfrow = c(2, 3))
plot(Weekly$Year,   Weekly$Volume, main = "Volume over Time",
     xlab = "Year", ylab = "Volume", pch = 19, cex = 0.4, col = "steelblue")
plot(Weekly$Lag1,   Weekly$Today,  main = "Today vs Lag1",
     xlab = "Lag1", ylab = "Today", pch = 19, cex = 0.4, col = "tomato")
plot(Weekly$Lag2,   Weekly$Today,  main = "Today vs Lag2",
     xlab = "Lag2", ylab = "Today", pch = 19, cex = 0.4, col = "seagreen")
boxplot(Today ~ Direction, data = Weekly, main = "Return by Direction",
        col = c("salmon", "lightblue"))
hist(Weekly$Today, breaks = 50, main = "Distribution of Today's Return",
     xlab = "Return", col = "steelblue")
plot(Weekly$Volume, main = "Volume (chronological)",
     ylab = "Volume", pch = 19, cex = 0.3, col = "purple")

par(mfrow = c(1, 1))

Patterns: Trading volume has increased substantially over time. Weekly returns (Today) are approximately normally distributed around zero. There is no obvious strong pattern between lag returns and the current return.

4.2 (b) Logistic Regression — Full Model

glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Weekly, family = binomial)
summary(glm_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Comment: Only Lag2 appears to be statistically significant (p ≈ 0.03) among all predictors.

4.3 (c) Confusion Matrix — Full Model

prob_full <- predict(glm_full, type = "response")
pred_full <- ifelse(prob_full > 0.5, "Up", "Down")
conf_full <- table(Predicted = pred_full, Actual = Weekly$Direction)
conf_full
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
# Overall fraction of correct predictions
accuracy_full <- mean(pred_full == Weekly$Direction)
cat("Overall accuracy:", round(accuracy_full, 4), "\n")
## Overall accuracy: 0.5611

Interpretation: The model predicts “Up” the vast majority of the time. It correctly identifies “Up” weeks fairly well but misclassifies most “Down” weeks — a common issue with imbalanced classes. The overall accuracy (~56%) is only slightly better than chance.

4.4 (d) Logistic Regression — Training 1990–2008, Test 2009–2010

train <- Weekly$Year <= 2008
test  <- Weekly[!train, ]

glm_lag2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
summary(glm_lag2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
# Predictions on held-out data
prob_test <- predict(glm_lag2, newdata = test, type = "response")
pred_test <- ifelse(prob_test > 0.5, "Up", "Down")

conf_test <- table(Predicted = pred_test, Actual = test$Direction)
conf_test
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
accuracy_test <- mean(pred_test == test$Direction)
cat("Test accuracy (2009-2010):", round(accuracy_test, 4), "\n")
## Test accuracy (2009-2010): 0.625

Comment: Using only Lag2 as a predictor on the held-out 2009–2010 data yields an accuracy of approximately 62.5%, which is better than the full model on training data. This suggests Lag2 contains some modest predictive information for market direction.


End of Midterm Homework