Chapter 2 — Exercise 9

This exercise involves the Auto data set. Make sure that the missing values have been removed from the data.

library(ISLR2)  # assumed package; provides the Auto, Boston, and Weekly data sets used below
data(Auto)
Auto <- na.omit(Auto)

(a) Quantitative vs. Qualitative Predictors

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Answer:

  • Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
  • Qualitative: name, origin (origin is coded as 1/2/3 representing American/European/Japanese, so it is treated as categorical; see the quick check below)
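
A minimal sketch of that recoding (the 1/2/3 label mapping follows the Auto documentation; origin_f is a new helper object, kept separate so the later regressions on Auto are unaffected):

# Recode origin as a labelled factor to make its qualitative nature explicit
origin_f <- factor(Auto$origin, levels = 1:3,
                   labels = c("American", "European", "Japanese"))
table(origin_f)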

(b) Range of Each Quantitative Predictor

quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

sapply(Auto[, quant_vars], range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

(c) Mean and Standard Deviation of Each Quantitative Predictor

rbind(
  Mean = sapply(Auto[, quant_vars], mean),
  SD   = sapply(Auto[, quant_vars], sd)
)
##            mpg cylinders displacement horsepower    weight acceleration
## Mean 23.445918  5.471939      194.412  104.46939 2977.5842    15.541327
## SD    7.805007  1.705783      104.644   38.49116  849.4026     2.758864
##           year
## Mean 75.979592
## SD    3.683737

(d) Remove Observations 10–85, Then Recompute

Auto_sub <- Auto[-(10:85), ]

rbind(
  Range_Min = sapply(Auto_sub[, quant_vars], min),
  Range_Max = sapply(Auto_sub[, quant_vars], max),
  Mean      = sapply(Auto_sub[, quant_vars], mean),
  SD        = sapply(Auto_sub[, quant_vars], sd)
)
##                 mpg cylinders displacement horsepower    weight acceleration
## Range_Min 11.000000  3.000000     68.00000   46.00000 1649.0000     8.500000
## Range_Max 46.600000  8.000000    455.00000  230.00000 4997.0000    24.800000
## Mean      24.404430  5.373418    187.24051  100.72152 2935.9715    15.726899
## SD         7.867283  1.654179     99.67837   35.70885  811.3002     2.693721
##                year
## Range_Min 70.000000
## Range_Max 82.000000
## Mean      77.145570
## SD         3.106217
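
A small follow-up sketch (a new helper computation, not required by the exercise) makes the shift caused by dropping rows 10–85 explicit:

# Change in each mean after removing observations 10-85
round(sapply(Auto_sub[, quant_vars], mean) -
        sapply(Auto[, quant_vars], mean), 2)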

(e) Graphical Investigation of Predictors

pairs(Auto[, quant_vars], main = "Scatterplot Matrix of Auto Dataset")

Findings:

  • mpg has a clear negative relationship with displacement, horsepower, and weight — heavier, more powerful cars tend to be less fuel-efficient.
  • displacement, horsepower, and weight are highly correlated with each other (multicollinearity).
  • year shows a modest positive relationship with mpg, suggesting newer cars tend to be more fuel-efficient.
  • cylinders correlates strongly with displacement and weight.

(f) Which Variables Are Useful for Predicting mpg?

par(mfrow = c(2, 3))
for (v in setdiff(quant_vars, "mpg")) {
  plot(Auto[[v]], Auto$mpg,
       xlab = v, ylab = "mpg",
       main = paste("mpg vs", v),
       pch  = 19, col = "steelblue", cex = 0.6)
  abline(lm(Auto$mpg ~ Auto[[v]]), col = "red", lwd = 2)
}

par(mfrow = c(1, 1))

Answer:

Based on the plots, displacement, horsepower, weight, and year all appear useful for predicting mpg:

  • displacement, horsepower, and weight show strong negative linear (or slightly curved) relationships with mpg.
  • year shows a positive trend — newer cars are more fuel efficient.
  • cylinders is also associated but is largely captured by displacement.
  • acceleration has a weaker, noisier relationship (the sorted correlations below quantify these impressions).
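
A quick numeric complement to the plots, ordering the correlations already implicit in the scatterplot matrix (sketch):

# Absolute correlation of each quantitative variable with mpg, strongest first
sort(abs(cor(Auto[, quant_vars])[, "mpg"]), decreasing = TRUE)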

Chapter 3 — Exercise 9

Multiple linear regression on the Auto data set.

(a) Scatterplot Matrix

pairs(Auto[, quant_vars], main = "Scatterplot Matrix — Auto")

(b) Correlation Matrix

round(cor(Auto[, quant_vars]), 3)
##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
##                year
## mpg           0.581
## cylinders    -0.346
## displacement -0.370
## horsepower   -0.416
## weight       -0.309
## acceleration  0.290
## year          1.000

(c) Multiple Linear Regression

lm_fit <- lm(mpg ~ . - name, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comments:

i. Is there a relationship between the predictors and the response?
Yes. The overall F-statistic is very large with a p-value essentially 0, indicating at least one predictor is significantly related to mpg.

ii. Which predictors have a statistically significant relationship?
Based on the p-values (< 0.05): displacement, weight, year, and origin are statistically significant.

iii. What does the coefficient for year suggest?
The positive coefficient for year (approximately +0.75) suggests that, holding all other variables constant, each additional model year is associated with about 0.75 more miles per gallon. Cars have become more fuel-efficient over time.
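
A back-of-envelope check of that interpretation, using only the coefficient shown above:

# Implied mpg gain over one decade of model years, other predictors held fixed
coef(lm_fit)["year"] * 10   # 0.750773 * 10, roughly 7.5 mpg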

(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))

Comments:

  • Residuals vs Fitted: There is a slight non-linear pattern (U-shape), suggesting the linear model may not fully capture the relationship — a non-linear transformation could help.
  • Normal Q-Q: Residuals are roughly normal, with slight deviation in the tails.
  • Scale-Location: Some heteroscedasticity is visible.
  • Residuals vs Leverage: Observation 14 has unusually high leverage (confirmed numerically below). A few points (e.g., 327, 394) are potential outliers but do not have extreme leverage.
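
A one-line check of which observation the leverage plot flags (sketch):

# Index of the observation with the largest hat value (leverage)
which.max(hatvalues(lm_fit))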

(e) Interaction Effects

# Test some meaningful interactions
lm_interact <- lm(mpg ~ displacement + weight + year + origin +
                    displacement:weight + weight:year, data = Auto)
summary(lm_interact)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin + displacement:weight + 
##     weight:year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9585 -1.6093 -0.0373  1.3563 12.7428 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.140e+01  1.374e+01  -4.468 1.04e-05 ***
## displacement        -6.043e-02  9.407e-03  -6.424 3.92e-10 ***
## weight               9.045e-03  4.895e-03   1.848   0.0654 .  
## year                 1.498e+00  1.739e-01   8.616  < 2e-16 ***
## origin               3.388e-01  2.525e-01   1.342   0.1805    
## displacement:weight  1.708e-05  2.383e-06   7.169 3.89e-12 ***
## weight:year         -2.486e-04  6.159e-05  -4.037 6.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.958 on 385 degrees of freedom
## Multiple R-squared:  0.8586, Adjusted R-squared:  0.8564 
## F-statistic: 389.6 on 6 and 385 DF,  p-value: < 2.2e-16

Comment:
Both interaction terms are statistically significant in this fit: displacement:weight (p ≈ 3.9e-12) and weight:year (p ≈ 6.5e-05). The weight:year interaction suggests that the effect of weight on mpg changes across model years. The interaction model's adjusted R² (≈ 0.856) is also noticeably higher than that of the earlier main-effects model (≈ 0.818).
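
The two interaction terms can also be tested jointly with a nested-model F-test (a sketch; lm_base is a new helper fit with the same four main effects):

# Main-effects-only fit, then a joint F-test for the two interactions
lm_base <- lm(mpg ~ displacement + weight + year + origin, data = Auto)
anova(lm_base, lm_interact)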

(f) Variable Transformations

par(mfrow = c(1, 3))
plot(log(Auto$horsepower), Auto$mpg,
     xlab = "log(horsepower)", ylab = "mpg",
     main = "mpg vs log(horsepower)", pch = 19, col = "steelblue", cex = 0.6)
plot(sqrt(Auto$horsepower), Auto$mpg,
     xlab = "sqrt(horsepower)", ylab = "mpg",
     main = "mpg vs sqrt(horsepower)", pch = 19, col = "darkgreen", cex = 0.6)
plot(Auto$horsepower^2, Auto$mpg,
     xlab = "horsepower^2", ylab = "mpg",
     main = "mpg vs horsepower^2", pch = 19, col = "tomato", cex = 0.6)

par(mfrow = c(1, 1))
lm_log  <- lm(mpg ~ log(horsepower) + weight + year + origin, data = Auto)
lm_sqrt <- lm(mpg ~ sqrt(horsepower) + weight + year + origin, data = Auto)
lm_sq   <- lm(mpg ~ I(horsepower^2) + weight + year + origin, data = Auto)

cat("log(horsepower) R²:", summary(lm_log)$r.squared, "\n")
## log(horsepower) R²: 0.827627
cat("sqrt(horsepower) R²:", summary(lm_sqrt)$r.squared, "\n")
## sqrt(horsepower) R²: 0.8214188
cat("horsepower^2 R²:", summary(lm_sq)$r.squared, "\n")
## horsepower^2 R²: 0.8185632

Comment:
Among the three fits, log(horsepower) gives the highest R² (≈ 0.828), sqrt(horsepower) comes next (≈ 0.821), and the squared term does worst (≈ 0.819). The scatterplot with log(horsepower) also shows the cleanest linear pattern, making the log transformation the most defensible choice here; the AIC comparison below points the same way.
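
A single-criterion comparison of the three fits (sketch; lower AIC is better):

# Information-criterion comparison of the horsepower transformations
AIC(lm_log, lm_sqrt, lm_sq)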


Chapter 3 — Exercise 15

Predict per capita crime rate (crim) using the Boston data set.

data(Boston)
predictors <- setdiff(names(Boston), "crim")

(a) Simple Linear Regression for Each Predictor

# Fit simple regression for each predictor and collect p-values & coefficients
slr_results <- data.frame(
  predictor = predictors,
  coef      = NA,
  p_value   = NA
)

for (i in seq_along(predictors)) {
  v   <- predictors[i]
  fit <- lm(crim ~ Boston[[v]], data = Boston)
  s   <- summary(fit)
  slr_results$coef[i]    <- coef(fit)[2]
  slr_results$p_value[i] <- s$coefficients[2, 4]
}

slr_results$significant <- slr_results$p_value < 0.05
print(slr_results)
##    predictor        coef      p_value significant
## 1         zn -0.07393498 5.506472e-06        TRUE
## 2      indus  0.50977633 1.450349e-21        TRUE
## 3       chas -1.89277655 2.094345e-01       FALSE
## 4        nox 31.24853120 3.751739e-23        TRUE
## 5         rm -2.68405122 6.346703e-07        TRUE
## 6        age  0.10778623 2.854869e-16        TRUE
## 7        dis -1.55090168 8.519949e-19        TRUE
## 8        rad  0.61791093 2.693844e-56        TRUE
## 9        tax  0.02974225 2.357127e-47        TRUE
## 10   ptratio  1.15198279 2.942922e-11        TRUE
## 11     lstat  0.54880478 2.654277e-27        TRUE
## 12      medv -0.36315992 1.173987e-19        TRUE
par(mfrow = c(4, 4))
for (v in predictors) {
  plot(Boston[[v]], Boston$crim,
       xlab = v, ylab = "crim",
       main = paste("crim ~", v),
       pch = 19, col = "steelblue", cex = 0.4)
  abline(lm(crim ~ Boston[[v]], data = Boston), col = "red", lwd = 1.5)
}
par(mfrow = c(1, 1))

Answer:
Every predictor except chas (the Charles River dummy variable, p ≈ 0.21) has a statistically significant association with crim in simple linear regression.
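
The same conclusion can be pulled straight from the results table (sketch):

# Predictors whose simple-regression slope is not significant at the 5% level
slr_results$predictor[!slr_results$significant]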

(b) Multiple Linear Regression

mlr_fit <- lm(crim ~ ., data = Boston)
summary(mlr_fit)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Answer:
In the multiple regression model, we can reject H₀: βⱼ = 0 (at α = 0.05) for zn, dis, rad, and medv. Many predictors that were significant in simple regression lose significance here, a typical consequence of multicollinearity (checked below).
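
One way to check that multicollinearity explanation, assuming the car package is available (it is not used elsewhere in this homework):

# Variance inflation factors; values well above 5-10 indicate collinearity
library(car)
vif(mlr_fit)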

(c) Univariate vs. Multiple Regression Coefficients

mlr_coefs <- coef(mlr_fit)[-1]  # drop intercept

# Align
common <- intersect(slr_results$predictor, names(mlr_coefs))
slr_c  <- slr_results$coef[slr_results$predictor %in% common]
mlr_c  <- mlr_coefs[common]

plot(slr_c, mlr_c,
     xlab = "Simple Regression Coefficient",
     ylab = "Multiple Regression Coefficient",
     main = "Univariate vs. Multiple Regression Coefficients",
     pch = 19, col = "steelblue")
abline(0, 1, lty = 2, col = "red")
text(slr_c, mlr_c, labels = common, cex = 0.7, pos = 3)

Comment:
The coefficients differ substantially. The most dramatic case is nox (nitrogen oxides concentration): its coefficient is about +31 in simple regression but about -10 in the multiple regression, a sign flip driven by nox being strongly correlated with several of the other predictors.
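
A compact tabular view of the same comparison, reusing the aligned vectors from the plot (sketch):

# Simple vs. multiple regression coefficients, side by side
round(cbind(simple = slr_c, multiple = mlr_c), 3)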

(d) Non-linear Association (Polynomial Regression)

poly_results <- data.frame(
  predictor   = predictors,
  p_X2        = NA,
  p_X3        = NA
)

for (i in seq_along(predictors)) {
  v <- predictors[i]
  # Skip binary variable chas
  if (length(unique(Boston[[v]])) <= 2) next
  fit3 <- lm(crim ~ poly(Boston[[v]], 3), data = Boston)
  s    <- summary(fit3)$coefficients
  poly_results$p_X2[i] <- s[3, 4]
  poly_results$p_X3[i] <- s[4, 4]
}

poly_results$nonlinear <- with(poly_results,
  (!is.na(p_X2) & p_X2 < 0.05) | (!is.na(p_X3) & p_X3 < 0.05))

print(poly_results)
##    predictor         p_X2         p_X3 nonlinear
## 1         zn 4.420507e-03 2.295386e-01      TRUE
## 2      indus 1.086057e-03 1.196405e-12      TRUE
## 3       chas           NA           NA     FALSE
## 4        nox 7.736755e-05 6.961110e-16      TRUE
## 5         rm 1.508545e-03 5.085751e-01      TRUE
## 6        age 2.291156e-06 6.679915e-03      TRUE
## 7        dis 7.869767e-14 1.088832e-08      TRUE
## 8        rad 9.120558e-03 4.823138e-01      TRUE
## 9        tax 3.665348e-06 2.438507e-01      TRUE
## 10   ptratio 2.405468e-03 6.300514e-03      TRUE
## 11     lstat 3.780418e-02 1.298906e-01      TRUE
## 12      medv 2.928577e-35 1.046510e-12      TRUE

Answer:
Several predictors show evidence of non-linear association with crim (significant X² or X³ terms), including indus, nox, age, dis, ptratio, and medv. This suggests polynomial or other non-linear models may be more appropriate for those predictors.
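
For the strongest case in the table, medv, a nested comparison confirms that the polynomial terms add real explanatory power (a sketch; fit_lin and fit_cub are new helper fits):

# Linear vs. cubic fit for medv, compared with an F-test
fit_lin <- lm(crim ~ medv, data = Boston)
fit_cub <- lm(crim ~ poly(medv, 3), data = Boston)
anova(fit_lin, fit_cub)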


Chapter 4 — Exercise 13

Logistic regression using the Weekly data set.

data(Weekly)

(a) Numerical and Graphical Summaries

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly[, -9], col = ifelse(Weekly$Direction == "Up", "steelblue", "tomato"),
      main = "Weekly Data — Blue = Up, Red = Down")

table(Weekly$Direction)
## 
## Down   Up 
##  484  605
prop.table(table(Weekly$Direction))
## 
##      Down        Up 
## 0.4444444 0.5555556

Patterns:

  • The market went Up about 55.6% of the time vs. Down 44.4% — a slight upward bias.
  • Volume (average shares traded) increases notably over time, reflecting growth in market activity (quantified below).
  • The lag variables (Lag1–Lag5) show very little correlation with each other or with Direction, making prediction challenging.
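
A one-line check of the volume trend noted above (sketch):

# Correlation between calendar year and weekly trading volume
cor(Weekly$Year, Weekly$Volume)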

(b) Logistic Regression — Full Model

glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data    = Weekly,
                family  = binomial)
summary(glm_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Answer:
Only Lag2 appears statistically significant (p ≈ 0.03). The other lag variables and Volume are not significant predictors of Direction.

(c) Confusion Matrix — Full Model (Training Data)

glm_probs <- predict(glm_full, type = "response")
glm_pred  <- ifelse(glm_probs > 0.5, "Up", "Down")

conf_matrix <- table(Predicted = glm_pred, Actual = Weekly$Direction)
print(conf_matrix)
##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
cat("\nOverall fraction correct:",
    mean(glm_pred == Weekly$Direction), "\n")
## 
## Overall fraction correct: 0.5610652

Interpretation:

  • The model predicts Up the vast majority of the time (987 of 1089 weeks), reflecting the base rate.
  • It correctly classifies 557/605 ≈ 92% of Up weeks but only 54/484 ≈ 11% of Down weeks (the per-class rates are computed below).
  • Overall accuracy ≈ 56%, barely above the naive baseline of always predicting "Up" (605/1089 ≈ 55.6%), and this is a training-set estimate, so it is likely optimistic.
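
The class-specific rates can be read directly off the confusion matrix (sketch):

# Fraction of each actual class predicted correctly (Down, then Up)
diag(conf_matrix) / colSums(conf_matrix)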

(d) Logistic Regression — Training 1990–2008, Test 2009–2010

train <- Weekly$Year <= 2008
test  <- !train

glm_lag2 <- glm(Direction ~ Lag2,
                data   = Weekly,
                subset = train,
                family = binomial)

# Predict on held-out data (2009–2010)
test_probs <- predict(glm_lag2, newdata = Weekly[test, ], type = "response")
test_pred  <- ifelse(test_probs > 0.5, "Up", "Down")

conf_test <- table(Predicted = test_pred, Actual = Weekly$Direction[test])
print(conf_test)
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
cat("\nFraction correct on test set:",
    mean(test_pred == Weekly$Direction[test]), "\n")
## 
## Fraction correct on test set: 0.625

Interpretation:

  • Using only Lag2 as a predictor and testing on 2009–2010 data, the model achieves approximately 62.5% correct predictions, better than the full model managed even on its own training data.
  • This suggests Lag2 is the most informative predictor and that the simpler model generalizes better, though the test-set baseline below shows that much of this accuracy comes from the prevalence of Up weeks.
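
For context, the accuracy of the naive always-Up rule on the same test period (a sketch; the figure is implied by the confusion matrix above):

# Test-set frequency of "Up" = accuracy of always predicting "Up"
mean(Weekly$Direction[test] == "Up")   # 61/104, about 0.587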

End of Midterm Homework