if (!require(GGally)) install.packages("GGally")

library(ISLR2)
library(tidyverse)
library(GGally)

Chapter 2 – Exercise 9 (Auto Data)

Data Preparation

data(Auto)
Auto <- na.omit(Auto)

The dataset contains 392 observations after removing missing values.

(a) Quantitative vs Qualitative Variables

str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Quantitative variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative variable: name

Although origin is numeric, it represents categories (1 = USA, 2 = Europe, 3 = Japan) and is therefore qualitative in meaning.
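One way to make the categorical meaning of origin explicit is to recode it as a labelled factor. The sketch below is only an illustration: `origin_codes` is a made-up vector standing in for `Auto$origin`, and the labels follow the coding stated above.

```r
# Hypothetical stand-in for Auto$origin (1 = USA, 2 = Europe, 3 = Japan)
origin_codes <- c(1, 1, 2, 3, 3, 1, 2)

# Recode the integer codes as a labelled factor
origin_f <- factor(origin_codes,
                   levels = 1:3,
                   labels = c("USA", "Europe", "Japan"))

# Counts per category
table(origin_f)
```

With the real data the same call would be applied to `Auto$origin`, preferably on a copy, because the later sections of this report treat origin as a numeric column.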

(b) Range of Quantitative Predictors

sapply(Auto[, 1:7], range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

The ranges are wide: displacement spans 68 to 455, horsepower 46 to 230, and weight 1,613 to 5,140, so the predictors vary enough to support regression modeling.

(c) Mean and Standard Deviation

sapply(Auto[, 1:7], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592

sapply(Auto[, 1:7], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

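Standard deviations depend on the measurement scale: weight has the largest value (about 849) mainly because it is recorded in large units. Relative to their means, displacement (104.6 / 194.4 ≈ 0.54) and horsepower (38.5 / 104.5 ≈ 0.37) vary the most. Spread alone does not show that a variable influences fuel efficiency; that question is addressed by the plots and regressions below.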
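Because the raw standard deviations are in different units, a unit-free comparison can help. The following sketch computes the coefficient of variation (sd divided by mean) from the values printed above; the numbers are transcribed from the output rather than recomputed from the data.

```r
# Means and standard deviations transcribed from the output above
vars  <- c("mpg", "cylinders", "displacement", "horsepower",
           "weight", "acceleration", "year")
means <- c(23.445918, 5.471939, 194.411990, 104.469388,
           2977.584184, 15.541327, 75.979592)
sds   <- c(7.805007, 1.705783, 104.644004, 38.491160,
           849.402560, 2.758864, 3.683737)

# Coefficient of variation: spread relative to the mean (unit-free)
cv <- setNames(sds / means, vars)
round(sort(cv, decreasing = TRUE), 3)
```

The ratio is only meaningful for variables with a natural zero, so the value for year (coded 70 to 82) should not be over-interpreted.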

(d) Removing Observations 10–85

Auto_sub <- Auto[-(10:85), ]
sapply(Auto_sub[, 1:7], range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82

sapply(Auto_sub[, 1:7], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570

sapply(Auto_sub[, 1:7], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217

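After removing these observations, the summary statistics shift moderately: mean mpg rises from 23.4 to 24.4 and mean weight falls from 2,978 to 2,936, so rows 10–85 contain heavier, lower-mpg vehicles on average. The lowest mpg (9.0) and the highest weight (5,140) in the full data also belong to the removed rows.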
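The averages of the removed rows themselves can be recovered from the two sets of means, since the full-sample total is the sum of the two group totals. This is a small arithmetic sketch using values transcribed from the outputs above, not a recomputation from the data.

```r
n_full    <- 392                  # rows in Auto
n_removed <- length(10:85)        # 76 rows dropped
n_sub     <- n_full - n_removed   # 316 rows kept

# Means transcribed from the full-data and subset outputs
mean_full <- c(mpg = 23.445918, weight = 2977.584184, horsepower = 104.469388)
mean_sub  <- c(mpg = 24.404430, weight = 2935.971519, horsepower = 100.721519)

# total(full) = total(kept) + total(removed), solved for the removed mean
mean_removed <- (n_full * mean_full - n_sub * mean_sub) / n_removed
round(mean_removed, 1)
```

With the data loaded, `colMeans(Auto[10:85, c("mpg", "weight", "horsepower")])` should give the same figures up to rounding.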

(e) Graphical Investigation

pairs(Auto[, 1:7])

Findings:

Strong negative relationship between mpg and weight
Strong negative relationship between mpg and horsepower
Strong positive correlation among displacement, weight, and cylinders
Evidence of multicollinearity among engine-related predictors

(f) Predicting mpg

Weight, horsepower, displacement, and cylinders appear highly useful for predicting mpg.

Weight appears to be the strongest predictor visually.

Chapter 3 – Exercise 9 (Auto Data)

(a) Scatterplot Matrix

ggpairs(Auto[, 1:7])

The matrix confirms strong negative relationships between mpg and weight, horsepower, and displacement; the scatterplots are visibly curved rather than strictly linear.

(b) Correlation Matrix

cor(Auto[, 1:7])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

mpg shows:

Strong negative correlation with weight (~-0.83)
Strong negative correlation with displacement (~-0.81)
Moderate positive correlation with year (~0.58)
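The large correlations among the engine-related predictors can be summarised as variance inflation factors. A sketch, assuming only the correlation matrix printed above: for standardized predictors the VIFs are the diagonal of the inverse of the predictor correlation matrix. The entries are transcribed from that output and the mpg row and column are left out.

```r
preds <- c("cylinders", "displacement", "horsepower",
           "weight", "acceleration", "year")

# Correlations among the six predictors, transcribed from the matrix above
R <- matrix(c(
   1.0000000,  0.9508233,  0.8429834,  0.8975273, -0.5046834, -0.3456474,
   0.9508233,  1.0000000,  0.8972570,  0.9329944, -0.5438005, -0.3698552,
   0.8429834,  0.8972570,  1.0000000,  0.8645377, -0.6891955, -0.4163615,
   0.8975273,  0.9329944,  0.8645377,  1.0000000, -0.4168392, -0.3091199,
  -0.5046834, -0.5438005, -0.6891955, -0.4168392,  1.0000000,  0.2903161,
  -0.3456474, -0.3698552, -0.4163615, -0.3091199,  0.2903161,  1.0000000),
  nrow = 6, byrow = TRUE, dimnames = list(preds, preds))

# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif <- diag(solve(R))
round(vif, 1)
```

Values well above 1 indicate that a predictor is largely explained by the others, which inflates the standard error of its coefficient in a multiple regression.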

(c) Multiple Linear Regression

fit <- lm(mpg ~ . - name, data = Auto)
summary(fit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Interpretation:

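The F-statistic (252.4 on 7 and 384 DF, p < 2.2e-16) is highly significant → strong evidence of a relationship between the predictors and mpg.
Predictors significant at the 5% level in this fit:

displacement
weight
year
origin

cylinders, horsepower, and acceleration are not significant once the other predictors are in the model.

The coefficient for year (0.75) indicates that, holding the other predictors fixed, each additional model year is associated with about 0.75 more mpg.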
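To attach some uncertainty to the year effect, a 95% confidence interval can be built from the estimate and standard error in the table above. This is a sketch using transcribed values; the variable names are illustrative.

```r
est <- 0.750773   # year coefficient from summary(fit)
se  <- 0.050973   # its standard error
df  <- 384        # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit  <- qt(0.975, df)
ci_year <- est + c(-1, 1) * t_crit * se
round(ci_year, 3)
```

With the fitted object available, `confint(fit, "year")` should return the same interval up to rounding.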

(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(fit)

par(mfrow = c(1, 1))

Findings:

Slight heteroskedasticity visible
Some high-leverage observations
No extreme outliers severely violating assumptions
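The leverage reading of the plots can be backed by numbers. The helper below is a hypothetical sketch, not part of the original analysis: it flags observations whose hat value exceeds `k` times the average leverage (p + 1) / n, where `k = 3` is a rule-of-thumb assumption. It is demonstrated on the built-in `mtcars` data so that it runs on its own.

```r
# Flag observations whose leverage exceeds k times the average (p + 1) / n.
# k = 3 is a rule-of-thumb assumption, not a value taken from the analysis.
flag_high_leverage <- function(model, k = 3) {
  h   <- hatvalues(model)
  avg <- length(coef(model)) / length(h)   # (p + 1) / n
  h[h > k * avg]
}

# Demonstration on built-in data (stand-in for the Auto fit)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
flag_high_leverage(demo_fit)
```

For the model in this section the call would be `flag_high_leverage(fit)`.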

(e) Interaction Effects

fit_inter <- lm(mpg ~ (cylinders + displacement + horsepower + weight +
                         acceleration + year + origin)^2, data = Auto)
summary(fit_inter)

## 
## Call:
## lm(formula = mpg ~ (cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin)^2, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

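Three interaction terms are significant at the 5% level: displacement:year, acceleration:year, and acceleration:origin. None of the interactions involving weight or horsepower reaches that level (horsepower:acceleration is closest, p ≈ 0.053). Adjusted R-squared rises from 0.818 in the main-effects model to 0.881.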
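Whether the 21 interaction terms improve the fit as a group can be checked with a partial F-test against the main-effects model from part (c). The sketch below reconstructs the residual sums of squares from the printed residual standard errors, so the result is approximate because those values are rounded.

```r
# RSS = (residual standard error)^2 * residual degrees of freedom
rss_main  <- 3.328^2 * 384   # main-effects model, part (c)
rss_inter <- 2.695^2 * 363   # model with all pairwise interactions

df_extra <- 384 - 363        # 21 added interaction terms

# Partial F statistic for the nested comparison
f_stat  <- ((rss_main - rss_inter) / df_extra) / (rss_inter / 363)
p_value <- pf(f_stat, df_extra, 363, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

With both fitted objects in memory, `anova(fit, fit_inter)` performs the same comparison without the rounding.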

(f) Transformations

fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fit_quad)

## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

The quadratic term is significant, suggesting a nonlinear relationship between horsepower and mpg.
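The shape implied by the fitted quadratic can be read off its coefficients. A short sketch using the estimates printed above; `predict_mpg` and `hp_turn` are illustrative names, not objects from the analysis.

```r
# Coefficients transcribed from summary(fit_quad)
b0 <- 56.9000997
b1 <- -0.4661896
b2 <- 0.0012305

# Fitted mpg as a function of horsepower
predict_mpg <- function(hp) b0 + b1 * hp + b2 * hp^2

# The parabola b0 + b1*x + b2*x^2 has its turning point at x = -b1 / (2*b2)
hp_turn <- -b1 / (2 * b2)

round(hp_turn, 1)
round(predict_mpg(c(50, 100, 150, 200)), 1)
```

The turning point falls inside the observed horsepower range (46 to 230), so the fitted curve flattens and turns slightly upward for the most powerful cars; that upturn is a property of the quadratic form and need not reflect a real effect.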

Chapter 3 – Exercise 15 (Boston Data)

data(Boston)

(a) Simple Linear Regressions

predictors <- setdiff(names(Boston), "crim")

uni_models <- lapply(predictors, function(var) {
  lm(as.formula(paste("crim ~", var)), data = Boston)
})
names(uni_models) <- predictors

uni_summaries <- lapply(uni_models, summary)
uni_summaries[["lstat"]]  # example: view one model

## 
## Call:
## lm(formula = as.formula(paste("crim ~", var)), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.925  -2.822  -0.664   1.079  82.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
## lstat        0.54880    0.04776  11.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.206 
## F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16

Several predictors (e.g., lstat, rm, dis) show statistically significant associations with crime rate.
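Rather than inspecting each summary in turn, the slope estimates and p-values can be gathered into one table. The helper `slope_table` below is a hypothetical sketch; it is demonstrated on the built-in `mtcars` data so that it runs without ISLR2.

```r
# Collect the slope estimate and p-value from a named list of
# simple regressions (one predictor each), sorted by p-value.
slope_table <- function(models) {
  rows <- lapply(models, function(m) {
    s <- coef(summary(m))   # coefficient matrix; row 2 is the slope
    data.frame(estimate = s[2, "Estimate"],
               p_value  = s[2, "Pr(>|t|)"])
  })
  out <- do.call(rbind, rows)
  out[order(out$p_value), ]
}

# Demonstration on built-in data (stand-in for the Boston models)
demo_vars   <- c("wt", "hp", "qsec")
demo_models <- lapply(demo_vars, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(demo_models) <- demo_vars

slope_table(demo_models)
```

Applied to this section, the call would be `slope_table(uni_models)`.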

(b) Multiple Regression

fit_boston <- lm(crim ~ ., data = Boston)
summary(fit_boston)

## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

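In the multiple regression setting, some predictors lose significance because they are correlated with one another; lstat, for example, is highly significant on its own (p < 2e-16) but not in the full model (p ≈ 0.067).

At the 5% level we reject H₀: βⱼ = 0 for zn, dis, rad, and medv.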

(c) Comparing Coefficients

uni_coef <- sapply(predictors, function(var) {
  coef(uni_models[[var]])[2]
})

multi_coef <- coef(fit_boston)[predictors]

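plot(uni_coef, multi_coef,
     xlab = "Univariate Coefficients",
     ylab = "Multiple Regression Coefficients",
     main = "Univariate vs. Multiple Regression Coefficients")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)
abline(a = 0, b = 1, lty = 3)  # reference line y = x: coefficients that agree fall on it

Points far from the line y = x are predictors whose coefficient changes markedly once the other variables are held fixed, which is typical when predictors are correlated with one another.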

(d) Nonlinear Associations

poly_models <- lapply(predictors, function(var) {
  tryCatch(
    summary(lm(as.formula(paste("crim ~ poly(", var, ", 3)")), data = Boston)),
    error = function(e) {
      message("Skipping ", var, ": ", e$message)
      NULL
    }
  )
})

## Skipping chas: 'degree' must be less than number of unique points

names(poly_models) <- predictors

# Remove skipped variables (e.g. binary 'chas' can't support degree-3 poly)
poly_models <- Filter(Negate(is.null), poly_models)
poly_models[["lstat"]]  # example: view one model

## 
## Call:
## lm(formula = as.formula(paste("crim ~ poly(", var, ", 3)")), 
##     data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.6135     0.3392  10.654   <2e-16 ***
## poly(lstat, 3)1  88.0697     7.6294  11.543   <2e-16 ***
## poly(lstat, 3)2  15.8882     7.6294   2.082   0.0378 *  
## poly(lstat, 3)3 -11.5740     7.6294  -1.517   0.1299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16

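In the lstat example the quadratic term is significant (p ≈ 0.038) but the cubic term is not (p ≈ 0.13), which points to mild curvature; the other predictors can be checked the same way. chas is skipped because a binary variable cannot support a cubic polynomial.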

Chapter 4 – Exercise 13 (Weekly Data)

data(Weekly)

(a) Summary

summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

pairs(Weekly[, -9])

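Direction (column 9) is excluded from this plot, so it shows only the numeric variables. The lag variables show no clear relationship with one another or with Today; the one visible pattern is that Volume increases with Year.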

(b) Logistic Regression

fit_weekly <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                  data = Weekly, family = binomial)
summary(fit_weekly)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Lag2 is the only predictor significant at the 5% level (p ≈ 0.030).
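The size of the Lag2 effect is easier to judge on the odds scale. A sketch using the estimate and standard error printed above, with a Wald-type interval as a simple approximation:

```r
est <- 0.05844   # Lag2 coefficient (log-odds per one-point change in Lag2)
se  <- 0.02686   # its standard error

# Odds ratio for a one-unit increase in Lag2
odds_ratio <- exp(est)

# Approximate 95% Wald interval, computed on the log-odds scale
wald_ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

round(c(odds_ratio = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 3)
```

An odds ratio this close to 1 means the effect is small even though it is statistically distinguishable from zero.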

(c) Confusion Matrix

probs <- predict(fit_weekly, type = "response")
pred <- ifelse(probs > 0.5, "Up", "Down")
table(Predicted = pred, Actual = Weekly$Direction)

##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557

mean(pred == Weekly$Direction)

## [1] 0.5610652

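Accuracy is 56.1% (611 of 1,089 weeks), barely above the 55.6% obtained by always predicting Up. The model predicts Up in 987 of the 1,089 weeks, so it is right on 92.1% of Up weeks but only 11.2% of Down weeks. Because this is accuracy on the training data, it is also optimistic.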
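Overall accuracy hides how differently the model treats the two classes. The sketch below rebuilds the confusion matrix from the counts printed above and derives per-class rates along with the always-predict-Up baseline.

```r
# Confusion matrix transcribed from the table above
# (matrix() fills column by column: Actual Down first, then Actual Up)
cm <- matrix(c(54, 430, 48, 557), nrow = 2,
             dimnames = list(Predicted = c("Down", "Up"),
                             Actual    = c("Down", "Up")))

accuracy  <- sum(diag(cm)) / sum(cm)
up_rate   <- cm["Up", "Up"] / sum(cm[, "Up"])         # share of Up weeks predicted correctly
down_rate <- cm["Down", "Down"] / sum(cm[, "Down"])   # share of Down weeks predicted correctly
baseline  <- max(colSums(cm)) / sum(cm)               # always predict the majority class

round(c(accuracy = accuracy, up_rate = up_rate,
        down_rate = down_rate, baseline = baseline), 3)
```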

(d) Train/Test Split

train <- Weekly$Year <= 2008

fit_train <- glm(Direction ~ Lag2,
                 data = Weekly[train, ], family = binomial)

probs_test <- predict(fit_train, newdata = Weekly[!train, ], type = "response")
pred_test <- ifelse(probs_test > 0.5, "Up", "Down")

table(Predicted = pred_test, Actual = Weekly$Direction[!train])

##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56

mean(pred_test == Weekly$Direction[!train])

## [1] 0.625

Out-of-sample accuracy on 2009–2010 is 62.5% (65 of 104 weeks), compared with 58.7% from always predicting Up, so the gain in predictive power is limited.
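With only 104 test weeks, the accuracy estimate is itself noisy. A sketch, assuming the counts in the table above, that puts an exact binomial confidence interval around the test accuracy and compares it with the always-predict-Up baseline:

```r
# Counts transcribed from the test-set confusion matrix
correct  <- 9 + 56            # correctly classified weeks
n_test   <- 9 + 5 + 34 + 56   # all test weeks
baseline <- (5 + 56) / n_test # accuracy of always predicting Up

# Exact (Clopper-Pearson) 95% interval for the test accuracy
ci <- binom.test(correct, n_test)$conf.int

round(c(accuracy = correct / n_test, lower = ci[1],
        upper = ci[2], baseline = baseline), 3)
```

If the baseline falls inside the interval, the test set alone does not clearly separate the model from the naive rule.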

Conclusion

This analysis demonstrates:

Strong linear relationships between vehicle characteristics and fuel efficiency.
Evidence of multicollinearity in multiple regression settings.
Nonlinear effects in horsepower and crime rate predictors.
Weak predictive power of stock market lag variables for market direction.

Overall, classical linear and logistic regression provide meaningful insights, though predictive performance remains limited in financial time-series contexts.

Midterm Homework - Application of Financial Software

Purevjav

2026-04-14
