library(ISLR2)
library(ggplot2)
library(GGally)
library(corrplot)

1 Chapter 2: Question 9 — Auto Data Set

This exercise involves the Auto data set. We first ensure that missing values have been removed.

data(Auto)
Auto <- na.omit(Auto)
str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

dim(Auto)

## [1] 392   9

head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

1.1 (a) Quantitative vs. Qualitative Predictors

# Identify predictor types
sapply(Auto, class)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    "numeric"    "integer"    "numeric"    "integer"    "integer"    "numeric" 
##         year       origin         name 
##    "integer"    "integer"     "factor"

Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative predictors: origin, name

Note: origin is coded as numeric (1, 2, 3) but represents categorical groups (1 = American, 2 = European, 3 = Japanese), so it is qualitative. cylinders could also be considered qualitative since it takes only a few discrete values, but we treat it as quantitative here as is conventional.

1.2 (b) Range of Each Quantitative Predictor

quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")
sapply(Auto[, quant_vars], range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

range_df <- data.frame(
  Variable = quant_vars,
  Min = sapply(Auto[, quant_vars], min),
  Max = sapply(Auto[, quant_vars], max)
)
knitr::kable(range_df, row.names = FALSE, caption = "Range of Quantitative Predictors")

Range of Quantitative Predictors
Variable	Min	Max
mpg	9	46.6
cylinders	3	8.0
displacement	68	455.0
horsepower	46	230.0
weight	1613	5140.0
acceleration	8	24.8
year	70	82.0

1.3 (c) Mean and Standard Deviation

stats_df <- data.frame(
  Variable = quant_vars,
  Mean = round(sapply(Auto[, quant_vars], mean), 2),
  SD = round(sapply(Auto[, quant_vars], sd), 2)
)
knitr::kable(stats_df, row.names = FALSE, caption = "Mean and Standard Deviation")

Mean and Standard Deviation
Variable	Mean	SD
mpg	23.45	7.81
cylinders	5.47	1.71
displacement	194.41	104.64
horsepower	104.47	38.49
weight	2977.58	849.40
acceleration	15.54	2.76
year	75.98	3.68

1.4 (d) Remove 10th–85th Observations

Auto_sub <- Auto[-c(10:85), ]
cat("Original observations:", nrow(Auto), "\n")

## Original observations: 392

cat("After removal:", nrow(Auto_sub), "\n")

## After removal: 316

stats_sub <- data.frame(
  Variable = quant_vars,
  Range_Min = sapply(Auto_sub[, quant_vars], min),
  Range_Max = sapply(Auto_sub[, quant_vars], max),
  Mean = round(sapply(Auto_sub[, quant_vars], mean), 2),
  SD = round(sapply(Auto_sub[, quant_vars], sd), 2)
)
knitr::kable(stats_sub, row.names = FALSE,
             caption = "Statistics After Removing Observations 10-85")

Statistics After Removing Observations 10-85
Variable	Range_Min	Range_Max	Mean	SD
mpg	11.0	46.6	24.40	7.87
cylinders	3.0	8.0	5.37	1.65
displacement	68.0	455.0	187.24	99.68
horsepower	46.0	230.0	100.72	35.71
weight	1649.0	4997.0	2935.97	811.30
acceleration	8.5	24.8	15.73	2.69
year	70.0	82.0	77.15	3.11

Compared to part (c), removing observations 10–85 changes the statistics slightly. For some variables (e.g., displacement, horsepower, weight), the means increase because many of the removed observations were smaller, fuel-efficient cars.

1.5 (e) Graphical Investigation

pairs(Auto[, quant_vars], pch = 20, col = "steelblue",
      main = "Scatterplot Matrix of Auto Data")

par(mfrow = c(2, 3))

plot(Auto$displacement, Auto$mpg, pch = 20, col = "steelblue",
     xlab = "Displacement", ylab = "MPG", main = "MPG vs Displacement")

plot(Auto$horsepower, Auto$mpg, pch = 20, col = "darkorange",
     xlab = "Horsepower", ylab = "MPG", main = "MPG vs Horsepower")

plot(Auto$weight, Auto$mpg, pch = 20, col = "darkgreen",
     xlab = "Weight", ylab = "MPG", main = "MPG vs Weight")

plot(Auto$acceleration, Auto$mpg, pch = 20, col = "purple",
     xlab = "Acceleration", ylab = "MPG", main = "MPG vs Acceleration")

plot(Auto$year, Auto$mpg, pch = 20, col = "red",
     xlab = "Year", ylab = "MPG", main = "MPG vs Year")

boxplot(mpg ~ origin, data = Auto, col = c("coral", "lightgreen", "lightblue"),
        xlab = "Origin (1=US, 2=EU, 3=JP)", ylab = "MPG",
        main = "MPG by Origin")

par(mfrow = c(1, 1))

Findings:

There is a clear negative relationship between mpg and displacement, horsepower, and weight — heavier, more powerful cars tend to have lower fuel efficiency.
There is a positive relationship between mpg and year — newer cars tend to be more fuel efficient.
acceleration shows a weak positive relationship with mpg.
Japanese and European cars (origin 3 and 2) tend to have higher mpg than American cars (origin 1).
displacement, horsepower, and weight are highly correlated with each other.

1.6 (f) Predicting MPG

# Correlation of quantitative predictors with mpg
cor_with_mpg <- cor(Auto[, quant_vars])[, "mpg"]
sort(abs(cor_with_mpg), decreasing = TRUE)

##          mpg       weight displacement   horsepower    cylinders         year 
##    1.0000000    0.8322442    0.8051269    0.7784268    0.7776175    0.5805410 
## acceleration 
##    0.4233285

Based on the scatterplots and correlations:

weight (r = -0.832) has the strongest negative correlation with mpg and is the most useful predictor.
displacement (r = -0.805) and horsepower (r = -0.778) are also strongly negatively correlated.
year (r = 0.581) is positively correlated, indicating newer cars are more efficient.
cylinders is also useful but is highly correlated with displacement.
acceleration has the weakest relationship and may be less useful on its own.

The qualitative variable origin also appears useful, as shown by the boxplot in part (e).

2 Chapter 3: Question 9 — Multiple Linear Regression on Auto

2.1 (a) Scatterplot Matrix

pairs(Auto[, -which(names(Auto) == "name")], pch = 20, col = "steelblue",
      main = "Scatterplot Matrix — Auto Data (excluding name)")

The scatterplot matrix reveals several strong linear and non-linear relationships among the variables.

2.2 (b) Correlation Matrix

auto_numeric <- Auto[, -which(names(Auto) == "name")]
cor_matrix <- cor(auto_numeric)
round(cor_matrix, 3)

##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
## origin        0.565    -0.569       -0.615     -0.455 -0.585        0.213
##                year origin
## mpg           0.581  0.565
## cylinders    -0.346 -0.569
## displacement -0.370 -0.615
## horsepower   -0.416 -0.455
## weight       -0.309 -0.585
## acceleration  0.290  0.213
## year          1.000  0.182
## origin        0.182  1.000

corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.7,
         tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix of Auto Data",
         mar = c(0, 0, 2, 0))

Key observations:

displacement, horsepower, weight, and cylinders are highly positively correlated with each other (r > 0.8).
These variables are all strongly negatively correlated with mpg.
year has a moderate positive correlation with mpg.

2.3 (c) Multiple Linear Regression

lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Yes. The F-statistic is very large with a p-value near zero (p < 2.2e-16), so we can reject the null hypothesis that all regression coefficients are zero. There is a statistically significant relationship between the predictors and mpg.

ii. Which predictors have a statistically significant relationship to the response?

coef_summary <- summary(lm.fit)$coefficients
sig_predictors <- rownames(coef_summary)[coef_summary[, "Pr(>|t|)"] < 0.05]
sig_predictors

## [1] "(Intercept)"  "displacement" "weight"       "year"         "origin"

Looking at the p-values, displacement, weight, year, and origin appear to have statistically significant relationships with mpg (p < 0.05).

iii. What does the coefficient for the year variable suggest?

coef(lm.fit)["year"]

##      year 
## 0.7507727

The coefficient for year is positive (approximately 0.7508), suggesting that, holding all other predictors fixed, each additional year is associated with an increase of about 0.75 mpg. This indicates that cars have become more fuel-efficient over time.

2.4 (d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm.fit)

par(mfrow = c(1, 1))

Comments:

Residuals vs Fitted: There is a slight U-shaped pattern, suggesting some non-linearity in the data that the linear model does not capture.
Normal Q-Q: The residuals largely follow the diagonal line but deviate at the tails, suggesting slight non-normality (heavy right tail).
Scale-Location: There is some evidence of non-constant variance (heteroscedasticity) — the spread increases slightly with fitted values.
Residuals vs Leverage: A few observations have relatively high leverage. Observation 14 stands out as a potential high-leverage point. We should check if any points exceed Cook’s distance thresholds.

# Identify high leverage points
hatvalues_lm <- hatvalues(lm.fit)
p <- length(coef(lm.fit))
n <- nrow(Auto)
high_leverage <- which(hatvalues_lm > 2 * p / n)
cat("High leverage observations:", head(high_leverage, 10), "\n")

## High leverage observations: 7 8 9 13 14 26 27 28 29 94

cat("Number of high leverage points:", length(high_leverage), "\n")

## Number of high leverage points: 17

2.5 (e) Interaction Effects

# Test several interaction terms
lm.fit.int1 <- lm(mpg ~ . - name + displacement:weight, data = Auto)
summary(lm.fit.int1)

## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9027 -1.8092 -0.0946  1.5549 12.1687 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
## cylinders            1.175e-01  2.943e-01   0.399   0.6899    
## displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
## horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
## weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
## acceleration         6.724e-02  8.805e-02   0.764   0.4455    
## year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
## origin               5.610e-01  2.622e-01   2.139   0.0331 *  
## displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8558 
## F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16

lm.fit.int2 <- lm(mpg ~ . - name + horsepower:weight, data = Auto)
summary(lm.fit.int2)

## 
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.589 -1.617 -0.184  1.541 12.001 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.876e+00  4.511e+00   0.638 0.524147    
## cylinders         -2.955e-02  2.881e-01  -0.103 0.918363    
## displacement       5.950e-03  6.750e-03   0.881 0.378610    
## horsepower        -2.313e-01  2.363e-02  -9.791  < 2e-16 ***
## weight            -1.121e-02  7.285e-04 -15.393  < 2e-16 ***
## acceleration      -9.019e-02  8.855e-02  -1.019 0.309081    
## year               7.695e-01  4.494e-02  17.124  < 2e-16 ***
## origin             8.344e-01  2.513e-01   3.320 0.000986 ***
## horsepower:weight  5.529e-05  5.227e-06  10.577  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.931 on 383 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.859 
## F-statistic: 298.6 on 8 and 383 DF,  p-value: < 2.2e-16

lm.fit.int3 <- lm(mpg ~ . - name + year:origin, data = Auto)
summary(lm.fit.int3)

## 
## Call:
## lm(formula = mpg ~ . - name + year:origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6072 -2.0439 -0.0596  1.7121 12.3368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.492e+00  9.044e+00   0.939 0.348353    
## cylinders    -5.042e-01  3.192e-01  -1.579 0.115082    
## displacement  1.567e-02  7.530e-03   2.081 0.038060 *  
## horsepower   -1.399e-02  1.364e-02  -1.025 0.305786    
## weight       -6.352e-03  6.449e-04  -9.851  < 2e-16 ***
## acceleration  9.185e-02  9.766e-02   0.941 0.347546    
## year          4.189e-01  1.125e-01   3.723 0.000226 ***
## origin       -1.405e+01  4.699e+00  -2.989 0.002978 ** 
## year:origin   1.989e-01  6.030e-02   3.298 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8228 
## F-statistic: 227.9 on 8 and 383 DF,  p-value: < 2.2e-16

# Model with multiple interactions
lm.fit.int4 <- lm(mpg ~ (. - name)^2, data = Auto)
sig_int <- summary(lm.fit.int4)$coefficients
sig_interactions <- rownames(sig_int)[sig_int[, "Pr(>|t|)"] < 0.05]
sig_interactions

## [1] "displacement"        "acceleration"        "origin"             
## [4] "displacement:year"   "acceleration:year"   "acceleration:origin"

Findings: Several interaction terms appear to be statistically significant. The interaction between displacement:weight is significant, as is acceleration:year and acceleration:origin among others. This suggests that the effect of one predictor on mpg depends on the level of another predictor.

2.6 (f) Variable Transformations

# Log transformation
lm.fit.log <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +
                   acceleration + year + factor(origin), data = Auto)
summary(lm.fit.log)

## 
## Call:
## lm(formula = mpg ~ log(displacement) + log(horsepower) + log(weight) + 
##     acceleration + year + factor(origin), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1406 -1.9581 -0.0011  1.6383 13.0738 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       109.11202    9.87221  11.052  < 2e-16 ***
## log(displacement)   0.68018    1.12701   0.604 0.546516    
## log(horsepower)    -5.43488    1.57006  -3.462 0.000597 ***
## log(weight)       -14.86606    2.26541  -6.562 1.72e-10 ***
## acceleration       -0.18354    0.10294  -1.783 0.075386 .  
## year                0.74088    0.04821  15.368  < 2e-16 ***
## factor(origin)2     1.64083    0.55679   2.947 0.003405 ** 
## factor(origin)3     1.87792    0.54392   3.453 0.000617 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.073 on 384 degrees of freedom
## Multiple R-squared:  0.8478, Adjusted R-squared:  0.845 
## F-statistic: 305.6 on 7 and 384 DF,  p-value: < 2.2e-16

# Square root transformation
lm.fit.sqrt <- lm(mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight) +
                    acceleration + year + factor(origin), data = Auto)
summary(lm.fit.sqrt)

## 
## Call:
## lm(formula = mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight) + 
##     acceleration + year + factor(origin), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8811 -1.9810 -0.0654  1.8271 13.2758 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.09260    4.96699   0.824 0.410474    
## sqrt(displacement)  0.31252    0.16782   1.862 0.063328 .  
## sqrt(horsepower)   -0.63396    0.30172  -2.101 0.036276 *  
## sqrt(weight)       -0.67451    0.07897  -8.541 3.15e-16 ***
## acceleration       -0.03662    0.10201  -0.359 0.719838    
## year                0.75929    0.05027  15.105  < 2e-16 ***
## factor(origin)2     2.17072    0.56392   3.849 0.000139 ***
## factor(origin)3     2.34062    0.54877   4.265 2.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.192 on 384 degrees of freedom
## Multiple R-squared:  0.8357, Adjusted R-squared:  0.8327 
## F-statistic:   279 on 7 and 384 DF,  p-value: < 2.2e-16

# Quadratic terms
lm.fit.sq <- lm(mpg ~ displacement + I(displacement^2) +
                   horsepower + I(horsepower^2) +
                   weight + I(weight^2) +
                   acceleration + year + factor(origin), data = Auto)
summary(lm.fit.sq)

## 
## Call:
## lm(formula = mpg ~ displacement + I(displacement^2) + horsepower + 
##     I(horsepower^2) + weight + I(weight^2) + acceleration + year + 
##     factor(origin), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9905 -1.6264 -0.0559  1.4085 12.2330 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.115e+00  4.756e+00   0.865 0.387448    
## displacement      -1.261e-02  1.648e-02  -0.765 0.444903    
## I(displacement^2)  3.581e-05  3.214e-05   1.114 0.266004    
## horsepower        -1.789e-01  4.080e-02  -4.386 1.50e-05 ***
## I(horsepower^2)    4.943e-04  1.381e-04   3.581 0.000387 ***
## weight            -1.261e-02  2.511e-03  -5.022 7.89e-07 ***
## I(weight^2)        1.283e-06  3.344e-07   3.837 0.000146 ***
## acceleration      -1.526e-01  1.003e-01  -1.521 0.129089    
## year               7.860e-01  4.645e-02  16.923  < 2e-16 ***
## factor(origin)2    1.286e+00  5.344e-01   2.406 0.016601 *  
## factor(origin)3    1.353e+00  5.197e-01   2.603 0.009600 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.901 on 381 degrees of freedom
## Multiple R-squared:  0.8654, Adjusted R-squared:  0.8619 
## F-statistic:   245 on 10 and 381 DF,  p-value: < 2.2e-16

# Diagnostic plots for log model
par(mfrow = c(2, 2))
plot(lm.fit.log, main = "Log-transformed Model")

par(mfrow = c(1, 1))

Findings:

The log transformation improves the model fit (higher R²) and reduces the non-linearity pattern in the residuals.
The square root transformation also improves the fit, though slightly less than the log transformation.
The quadratic terms for horsepower and weight are statistically significant, confirming a non-linear relationship.
Overall, transformations help address the non-linearity issues identified in the diagnostic plots from part (d).

3 Chapter 3: Question 15 — Boston Data Set (Crime Rate)

This problem uses the Boston data set to predict per capita crime rate (crim).

data(Boston, package = "ISLR2")
str(Boston)

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

3.1 (a) Simple Linear Regression for Each Predictor

predictors <- names(Boston)[-1]  # All except 'crim'

par(mfrow = c(4, 4))

slr_results <- data.frame(
  Predictor = character(),
  Coefficient = numeric(),
  P_Value = numeric(),
  R_Squared = numeric(),
  Significant = character(),
  stringsAsFactors = FALSE
)

for (pred in predictors) {
  fit <- lm(as.formula(paste("crim ~", pred)), data = Boston)
  s <- summary(fit)

  slr_results <- rbind(slr_results, data.frame(
    Predictor = pred,
    Coefficient = round(coef(fit)[2], 4),
    P_Value = signif(s$coefficients[2, 4], 4),
    R_Squared = round(s$r.squared, 4),
    Significant = ifelse(s$coefficients[2, 4] < 0.05, "Yes", "No")
  ))

  plot(Boston[[pred]], Boston$crim, pch = 20, col = "steelblue",
       xlab = pred, ylab = "crim",
       main = paste("crim vs", pred))
  abline(fit, col = "red", lwd = 2)
}

par(mfrow = c(1, 1))

knitr::kable(slr_results, row.names = FALSE,
             caption = "Simple Linear Regression Results for Each Predictor")

Simple Linear Regression Results for Each Predictor
Predictor	Coefficient	P_Value	R_Squared	Significant
zn	-0.0739	0.0000055	0.0402	Yes
indus	0.5098	0.0000000	0.1653	Yes
chas	-1.8928	0.2094000	0.0031	No
nox	31.2485	0.0000000	0.1772	Yes
rm	-2.6841	0.0000006	0.0481	Yes
age	0.1078	0.0000000	0.1244	Yes
dis	-1.5509	0.0000000	0.1441	Yes
rad	0.6179	0.0000000	0.3913	Yes
tax	0.0297	0.0000000	0.3396	Yes
ptratio	1.1520	0.0000000	0.0841	Yes
lstat	0.5488	0.0000000	0.2076	Yes
medv	-0.3632	0.0000000	0.1508	Yes

Findings: For most predictors, there is a statistically significant association with crim (p < 0.05). The predictor chas (Charles River dummy variable) is the only one that is not statistically significant. Among the significant predictors, rad (index of accessibility to radial highways) and tax (property tax rate) have the highest R² values.

3.2 (b) Multiple Regression with All Predictors

lm.fit.all <- lm(crim ~ ., data = Boston)
summary(lm.fit.all)

## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

coef_all <- summary(lm.fit.all)$coefficients
sig_multi <- rownames(coef_all)[coef_all[, "Pr(>|t|)"] < 0.05]
cat("Significant predictors in multiple regression:\n")

## Significant predictors in multiple regression:

sig_multi[-1]  # Exclude intercept

## [1] "dis"  "rad"  "medv"

In the multiple regression, we can reject H₀: βⱼ = 0 for only a few predictors. The significant predictors are printed above. Many predictors that were significant in simple regression are no longer significant here due to multicollinearity.

3.3 (c) Comparing Simple vs. Multiple Regression Coefficients

# Get simple regression coefficients
simple_coefs <- sapply(predictors, function(pred) {
  fit <- lm(as.formula(paste("crim ~", pred)), data = Boston)
  coef(fit)[2]
})

# Get multiple regression coefficients
multi_coefs <- coef(lm.fit.all)[-1]  # Exclude intercept

# Create comparison plot
plot(simple_coefs, multi_coefs,
     pch = 19, col = "steelblue", cex = 1.5,
     xlab = "Simple Regression Coefficients",
     ylab = "Multiple Regression Coefficients",
     main = "Simple vs. Multiple Regression Coefficients")
text(simple_coefs, multi_coefs, labels = predictors,
     pos = 3, cex = 0.8)
abline(0, 1, lty = 2, col = "red")
abline(h = 0, lty = 3, col = "gray")
abline(v = 0, lty = 3, col = "gray")

Findings: The coefficients from simple and multiple regression differ substantially for several predictors. For example, nox has a very large positive coefficient in simple regression but a much smaller (possibly negative) coefficient in multiple regression. This discrepancy occurs because the simple regression coefficient captures both the direct effect and indirect effects through correlated predictors. The multiple regression adjusts for other variables and isolates the partial effect.

3.4 (d) Non-Linear Association

par(mfrow = c(4, 4))

nonlinear_results <- data.frame(
  Predictor = character(),
  Linear_P = numeric(),
  Quadratic_P = numeric(),
  Cubic_P = numeric(),
  Evidence_Nonlinear = character(),
  stringsAsFactors = FALSE
)

for (pred in predictors) {
  # Skip chas (binary variable)
  if (pred == "chas") {
    next
  }

  fit_poly <- lm(as.formula(paste("crim ~ poly(", pred, ", 3)")), data = Boston)
  s <- summary(fit_poly)
  pvals <- s$coefficients[, "Pr(>|t|)"]

  nonlinear_results <- rbind(nonlinear_results, data.frame(
    Predictor = pred,
    Linear_P = signif(pvals[2], 4),
    Quadratic_P = signif(pvals[3], 4),
    Cubic_P = signif(pvals[4], 4),
    Evidence_Nonlinear = ifelse(pvals[3] < 0.05 | pvals[4] < 0.05, "Yes", "No")
  ))

  # Plot with polynomial fit
  x_range <- seq(min(Boston[[pred]]), max(Boston[[pred]]), length.out = 100)
  pred_vals <- predict(fit_poly, newdata = setNames(data.frame(x_range), pred))

  plot(Boston[[pred]], Boston$crim, pch = 20, col = "steelblue",
       xlab = pred, ylab = "crim",
       main = paste("crim vs", pred, "(cubic fit)"))
  lines(x_range, pred_vals, col = "red", lwd = 2)
}

par(mfrow = c(1, 1))

knitr::kable(nonlinear_results, row.names = FALSE,
             caption = "Evidence of Non-linear Association (Polynomial Degree 3)")

Evidence of Non-linear Association (Polynomial Degree 3)
Predictor	Linear_P	Quadratic_P	Cubic_P	Evidence_Nonlinear
zn	4.7e-06	0.0044210	0.229500	Yes
indus	0.0e+00	0.0010860	0.000000	Yes
nox	0.0e+00	0.0000774	0.000000	Yes
rm	5.0e-07	0.0015090	0.508600	Yes
age	0.0e+00	0.0000023	0.006680	Yes
dis	0.0e+00	0.0000000	0.000000	Yes
rad	0.0e+00	0.0091210	0.482300	Yes
tax	0.0e+00	0.0000037	0.243900	Yes
ptratio	0.0e+00	0.0024050	0.006301	Yes
lstat	0.0e+00	0.0378000	0.129900	Yes
medv	0.0e+00	0.0000000	0.000000	Yes

Findings: There is evidence of non-linear association for several predictors. Specifically, predictors such as indus, nox, age, dis, ptratio, and medv show significant quadratic or cubic terms (p < 0.05), indicating that a simple linear model may not adequately capture the relationship between these predictors and crim.

4 Chapter 4: Question 13 — Weekly Data Set (Logistic Regression)

data(Weekly)
str(Weekly)

## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

head(Weekly)

##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down

dim(Weekly)

## [1] 1089    9

4.1 (a) Numerical and Graphical Summaries

summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

# Correlation matrix (numeric variables only)
cor_weekly <- cor(Weekly[, -which(names(Weekly) == "Direction")])
round(cor_weekly, 3)

##          Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today
## Year    1.000 -0.032 -0.033 -0.030 -0.031 -0.031  0.842 -0.032
## Lag1   -0.032  1.000 -0.075  0.059 -0.071 -0.008 -0.065 -0.075
## Lag2   -0.033 -0.075  1.000 -0.076  0.058 -0.072 -0.086  0.059
## Lag3   -0.030  0.059 -0.076  1.000 -0.075  0.061 -0.069 -0.071
## Lag4   -0.031 -0.071  0.058 -0.075  1.000 -0.076 -0.061 -0.008
## Lag5   -0.031 -0.008 -0.072  0.061 -0.076  1.000 -0.059  0.011
## Volume  0.842 -0.065 -0.086 -0.069 -0.061 -0.059  1.000 -0.033
## Today  -0.032 -0.075  0.059 -0.071 -0.008  0.011 -0.033  1.000

corrplot(cor_weekly, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.7,
         tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix — Weekly Data",
         mar = c(0, 0, 2, 0))

par(mfrow = c(2, 3))

for (var in c("Lag1", "Lag2", "Lag3", "Lag4", "Lag5")) {
  boxplot(Weekly[[var]] ~ Weekly$Direction, col = c("coral", "lightgreen"),
          xlab = "Direction", ylab = var,
          main = paste(var, "by Direction"))
}

par(mfrow = c(1, 1))

plot(Weekly$Volume, type = "l", col = "steelblue",
     xlab = "Observation", ylab = "Volume",
     main = "Trading Volume Over Time")

Patterns observed:

The only substantial correlation is between Year and Volume (r ≈ 0.84), indicating that trading volume has increased over time.
The lag variables show very little correlation with each other, which is expected for weekly returns.
There does not appear to be a clear pattern in the lag variables between Up and Down directions based on the boxplots.

4.2 (b) Logistic Regression — Full Model

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

coef_glm <- summary(glm.fit)$coefficients
sig_glm <- rownames(coef_glm)[coef_glm[, "Pr(>|z|)"] < 0.05]
cat("Statistically significant predictors:\n")

## Statistically significant predictors:

sig_glm

## [1] "(Intercept)" "Lag2"

Findings: Only Lag2 appears to be statistically significant (p < 0.05). Its positive coefficient suggests that a positive return two weeks ago is associated with a higher probability of the market going up this week. None of the other lag variables or Volume are statistically significant.

4.3 (c) Confusion Matrix

glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")

# Confusion matrix
confusion <- table(Predicted = glm.pred, Actual = Weekly$Direction)
confusion

##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557

# Overall accuracy
accuracy <- mean(glm.pred == Weekly$Direction)
cat("Overall accuracy:", round(accuracy * 100, 2), "%\n")

## Overall accuracy: 56.11 %

# Detailed metrics
TP <- confusion["Up", "Up"]
TN <- confusion["Down", "Down"]
FP <- confusion["Up", "Down"]
FN <- confusion["Down", "Up"]

cat("\nTrue Positive (correctly predicted Up):", TP, "\n")

## 
## True Positive (correctly predicted Up): 557

cat("True Negative (correctly predicted Down):", TN, "\n")

## True Negative (correctly predicted Down): 54

cat("False Positive (predicted Up, actual Down):", FP, "\n")

## False Positive (predicted Up, actual Down): 430

cat("False Negative (predicted Down, actual Up):", FN, "\n")

## False Negative (predicted Down, actual Up): 48

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
cat("\nSensitivity (True Positive Rate):", round(sensitivity * 100, 2), "%\n")

## 
## Sensitivity (True Positive Rate): 92.07 %

cat("Specificity (True Negative Rate):", round(specificity * 100, 2), "%\n")

## Specificity (True Negative Rate): 11.16 %

Interpretation: The overall accuracy is about 56.1%, which is slightly better than random guessing (50%). However, the confusion matrix reveals that the model has very high sensitivity (correctly predicting “Up”) but very low specificity (poorly predicting “Down”). The model tends to predict “Up” for most observations, which makes sense since the market went up more often than down during this period. The model’s ability to predict downturns is very poor — it makes many false positive errors (predicting “Up” when the actual direction is “Down”).

4.4 (d) Logistic Regression with Training/Test Split

# Training data: 1990 to 2008
train <- Weekly$Year <= 2008
Weekly.train <- Weekly[train, ]
Weekly.test <- Weekly[!train, ]

cat("Training set size:", nrow(Weekly.train), "\n")

## Training set size: 985

cat("Test set size:", nrow(Weekly.test), "\n")

## Test set size: 104

cat("Test set years:", unique(Weekly.test$Year), "\n")

## Test set years: 2009 2010

# Fit logistic regression using only Lag2
glm.fit2 <- glm(Direction ~ Lag2, data = Weekly.train, family = binomial)
summary(glm.fit2)

## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly.train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4

# Predict on test data
glm.probs2 <- predict(glm.fit2, newdata = Weekly.test, type = "response")
glm.pred2 <- ifelse(glm.probs2 > 0.5, "Up", "Down")

# Confusion matrix for test data
confusion2 <- table(Predicted = glm.pred2, Actual = Weekly.test$Direction)
confusion2

##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56

# Test set accuracy
accuracy2 <- mean(glm.pred2 == Weekly.test$Direction)
cat("Test set accuracy:", round(accuracy2 * 100, 2), "%\n")

## Test set accuracy: 62.5 %

# Detailed metrics
TP2 <- confusion2["Up", "Up"]
TN2 <- confusion2["Down", "Down"]
FP2 <- confusion2["Up", "Down"]
FN2 <- confusion2["Down", "Up"]

sensitivity2 <- TP2 / (TP2 + FN2)
specificity2 <- TN2 / (TN2 + FP2)

cat("Sensitivity:", round(sensitivity2 * 100, 2), "%\n")

## Sensitivity: 91.8 %

cat("Specificity:", round(specificity2 * 100, 2), "%\n")

## Specificity: 20.93 %

Findings: Using only Lag2 as a predictor and evaluating on the held-out test set (2009–2010), the model achieves an accuracy of approximately 62.5%. This is an improvement over the full model’s apparent accuracy because:

We are using proper train/test evaluation (no overfitting to the test data).
We are only using the one significant predictor (Lag2), which reduces noise from irrelevant variables.
The test set accuracy of ~62.5% is better than the naive baseline of always predicting “Up” (~58.7% for the test period).

5 Session Information

sessionInfo()

## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.95 GGally_2.4.0  ggplot2_4.0.2 ISLR2_1.3-2  
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     dplyr_1.2.0        compiler_4.4.1    
##  [5] tidyselect_1.2.1   ggstats_0.13.0     tidyr_1.3.2        jquerylib_0.1.4   
##  [9] scales_1.4.0       yaml_2.3.10        fastmap_1.2.0      R6_2.6.1          
## [13] generics_0.1.3     knitr_1.51         tibble_3.3.1       bslib_0.10.0      
## [17] pillar_1.10.2      RColorBrewer_1.1-3 rlang_1.1.7        cachem_1.1.0      
## [21] xfun_0.52          sass_0.4.10        S7_0.2.0           otel_0.2.0        
## [25] cli_3.6.5          withr_3.0.2        magrittr_2.0.3     digest_0.6.37     
## [29] grid_4.4.1         rstudioapi_0.17.1  lifecycle_1.0.5    vctrs_0.7.1       
## [33] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       rmarkdown_2.31    
## [37] purrr_1.2.1        tools_4.4.1        pkgconfig_2.0.3    htmltools_0.5.8.1

Midterm Homework — Application of Financial Software

Enkhjin.N

2026-04-16