library(ISLR2)
data(Auto)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Chapter 2

#No. 9

(a) Quantitative and Qualitative Variables

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative: origin

Identifier: name

(b) Range of each quantitative predictor

The range shows the minimum and maximum value of each variable.

For example:

mpg ranges from low fuel efficiency cars to high fuel efficiency cars. horsepower ranges from weak engines to strong engines. weight ranges from light cars to heavy cars.

This means the dataset contains many different types of cars with different sizes and performance levels.

range(Auto$mpg)
## [1]  9.0 46.6
range(Auto$cylinders)
## [1] 3 8
range(Auto$displacement)
## [1]  68 455
range(Auto$horsepower)
## [1]  46 230
range(Auto$weight)
## [1] 1613 5140
range(Auto$acceleration)
## [1]  8.0 24.8
range(Auto$year)
## [1] 70 82

(c) Mean and Standard Deviation

The mean represents the average value of each variable.

The standard deviation shows how spread out the data is from the average.

For example:

Average mpg shows the typical fuel efficiency of cars in the dataset. High standard deviation in weight means car weights vary a lot. High standard deviation in horsepower means engine power differs between cars.

This shows the dataset has a wide variety of vehicles.

mean(Auto$mpg)
## [1] 23.44592
sd(Auto$mpg)
## [1] 7.805007
mean(Auto$cylinders)
## [1] 5.471939
sd(Auto$cylinders)
## [1] 1.705783
mean(Auto$displacement)
## [1] 194.412
sd(Auto$displacement)
## [1] 104.644
mean(Auto$horsepower)
## [1] 104.4694
sd(Auto$horsepower)
## [1] 38.49116
mean(Auto$weight)
## [1] 2977.584
sd(Auto$weight)
## [1] 849.4026
mean(Auto$acceleration)
## [1] 15.54133
sd(Auto$acceleration)
## [1] 2.758864
mean(Auto$year)
## [1] 75.97959
sd(Auto$year)
## [1] 3.683737

(d) Remove observations 10 to 85

After removing observations 10 to 85, the summary statistics change slightly.

The average mpg becomes higher, meaning the remaining cars are more fuel efficient.

The average weight and horsepower become lower, meaning heavier and less efficient cars may have been removed.

This indicates that the deleted observations had some influence on the dataset.

Auto2 <- Auto[-c(10:85), ]

# Range
range(Auto2$mpg)
## [1] 11.0 46.6
range(Auto2$cylinders)
## [1] 3 8
range(Auto2$displacement)
## [1]  68 455
range(Auto2$horsepower)
## [1]  46 230
range(Auto2$weight)
## [1] 1649 4997
range(Auto2$acceleration)
## [1]  8.5 24.8
range(Auto2$year)
## [1] 70 82
# Mean
mean(Auto2$mpg)
## [1] 24.40443
mean(Auto2$cylinders)
## [1] 5.373418
mean(Auto2$displacement)
## [1] 187.2405
mean(Auto2$horsepower)
## [1] 100.7215
mean(Auto2$weight)
## [1] 2935.972
mean(Auto2$acceleration)
## [1] 15.7269
mean(Auto2$year)
## [1] 77.14557
# Standard Deviation
sd(Auto2$mpg)
## [1] 7.867283
sd(Auto2$cylinders)
## [1] 1.654179
sd(Auto2$displacement)
## [1] 99.67837
sd(Auto2$horsepower)
## [1] 35.70885
sd(Auto2$weight)
## [1] 811.3002
sd(Auto2$acceleration)
## [1] 2.693721
sd(Auto2$year)
## [1] 3.106217

(e) Scatterplot Matrix

From the scatterplot matrix:

mpg and weight have a strong negative relationship. mpg and horsepower also have a negative relationship. mpg and year have a positive relationship.

This means:

Heavier cars usually have lower mpg. Cars with stronger engines use more fuel. Newer cars tend to be more fuel efficient.

pairs(Auto)

(f) Relationship with mpg

The most important predictors related to mpg are:

Negative Relationship: weight horsepower displacement cylinders

When these variables increase, mpg decreases.

Positive Relationship: year acceleration (weak)

When year increases, mpg tends to increase.

cor(Auto[,1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
# Optional specific scatterplots
plot(Auto$weight, Auto$mpg)

plot(Auto$horsepower, Auto$mpg)

plot(Auto$year, Auto$mpg)

Based on the analysis, fuel efficiency (mpg) is strongly affected by the size, weight, and engine power of the car.

Smaller, lighter, and newer cars generally have better fuel efficiency than older, heavier, and high-powered cars.

Chapter 3

No. 9

(a) Scatterplot Matrix

library(ISLR2)
pairs(Auto)

################################################# # (b) Corellation Matrix #################################################

# Column 9 is 'name'
cor(Auto[, -9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Multiple Linear Regression

fit <- lm(mpg ~ . - name, data = Auto)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. Relationship: Check the F-statistic p-value in the summary. If it is very small, there is a statistically significant relationship between the response and at least one predictor.

  2. Statistically Significant Predictors: Look at the \(Pr(>|t|)\) column. Predictors with a value less than 0.05 (like weight, year, and origin) are typically significant.

  3. Year Variable: The positive coefficient for year suggests that for every unit increase in year (newer cars), mpg increases by that coefficient value, assuming all other variables are held constant.

(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(fit)

-Residual Plots: Look for a non-linear pattern in the “Residuals vs Fitted” plot. Points far from the central line indicate potential outliers.

-Leverage Plot: Check the “Residuals vs Leverage” plot. Points falling outside the dashed red lines (Cook’s distance) indicate observations with unusually high leverage that may be disproportionately influencing the model.

(e) Interaction Effects

# Example: Interaction between horsepower and weight
fit_interact <- lm(mpg ~ horsepower * weight + year + origin, data = Auto)
summary(fit_interact)
## 
## Call:
## lm(formula = mpg ~ horsepower * weight + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6051 -1.7722 -0.1304  1.5205 12.0369 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.145e-01  3.969e+00   0.205  0.83753    
## horsepower        -2.160e-01  2.055e-02 -10.514  < 2e-16 ***
## weight            -1.106e-02  6.343e-04 -17.435  < 2e-16 ***
## year               7.677e-01  4.464e-02  17.195  < 2e-16 ***
## origin             7.224e-01  2.328e-01   3.103  0.00206 ** 
## horsepower:weight  5.501e-05  5.051e-06  10.891  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.931 on 386 degrees of freedom
## Multiple R-squared:  0.8608, Adjusted R-squared:  0.859 
## F-statistic: 477.5 on 5 and 386 DF,  p-value: < 2.2e-16

Significance: If the p-value for the interaction term (e.g., horsepower:weight) is small, the interaction is statistically significant.

(f) Variable Transformations

# Example: Log transformation of displacement and squaring horsepower
fit_trans <- lm(mpg ~ log(displacement) + I(horsepower^2) + weight + year, data = Auto)
summary(fit_trans)
## 
## Call:
## lm(formula = mpg ~ log(displacement) + I(horsepower^2) + weight + 
##     year, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8919  -2.0263  -0.0686   1.7793  13.5477 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.541e+00  5.223e+00   0.295  0.76808    
## log(displacement) -4.641e+00  9.126e-01  -5.085 5.73e-07 ***
## I(horsepower^2)    9.362e-05  3.132e-05   2.989  0.00298 ** 
## weight            -4.768e-03  5.914e-04  -8.062 9.46e-15 ***
## year               7.731e-01  5.019e-02  15.403  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.308 on 387 degrees of freedom
## Multiple R-squared:  0.8223, Adjusted R-squared:  0.8204 
## F-statistic: 447.6 on 4 and 387 DF,  p-value: < 2.2e-16

Findings: Compare the \(R^2\) value of this model to the previous ones; a higher \(R^2\) suggests the transformation better captures the data’s underlying structure

No. 15

(a) Simple Linear Regression

library(ISLR2)
# Get a list of all predictors except 'crim'
predictors <- colnames(Boston)[colnames(Boston) != "crim"]

# Create a vector to store the univariate coefficients
univariate_coeffs <- c()

for (p in predictors) {
  formula <- as.formula(paste("crim ~", p))
  fit <- lm(formula, data = Boston)
  print(summary(fit)) # Describe results 
  univariate_coeffs[p] <- coef(fit)[2]
  # Create plots to back up assertions [cite: 43]
  plot(Boston[[p]], Boston$crim, xlab = p, ylab = "crim", main = paste("crim vs", p))
  abline(fit, col = "red")
}
## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.429 -4.222 -2.620  1.250 84.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
## zn          -0.07393    0.01609  -4.594 5.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared:  0.04019,    Adjusted R-squared:  0.03828 
## F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.972  -2.698  -0.736   0.712  81.813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
## indus        0.50978    0.05102   9.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1637 
## F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## chas         -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.371  -2.738  -0.974   0.559  81.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
## nox           31.249      2.999  10.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1756 
## F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.604 -3.952 -2.654  0.989 87.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.482      3.365   6.088 2.27e-09 ***
## rm            -2.684      0.532  -5.045 6.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared:  0.04807,    Adjusted R-squared:  0.04618 
## F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.789 -4.257 -1.230  1.527 82.849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
## age          0.10779    0.01274   8.463 2.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1227 
## F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.708 -4.134 -1.527  1.516 81.674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4993     0.7304  13.006   <2e-16 ***
## dis          -1.5509     0.1683  -9.213   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared:  0.1441, Adjusted R-squared:  0.1425 
## F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## rad          0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.513  -2.738  -0.194   1.065  77.696 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
## tax          0.029742   0.001847   16.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared:  0.3396, Adjusted R-squared:  0.3383 
## F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.654 -3.985 -1.912  1.825 83.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
## ptratio       1.1520     0.1694   6.801 2.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared:  0.08407,    Adjusted R-squared:  0.08225 
## F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.925  -2.822  -0.664   1.079  82.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
## lstat        0.54880    0.04776  11.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.206 
## F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.071 -4.022 -2.343  1.298 80.957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.79654    0.93419   12.63   <2e-16 ***
## medv        -0.36316    0.03839   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

Significance: Check the p-value for each model. In the Boston data, most predictors (like zn, nox, rm, dis, rad, tax) typically show a statistically significant association with crim when modeled individually.

(b) Multiple Regression

fit_all <- lm(crim ~ ., data = Boston)
summary(fit_all)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Rejecting the Null Hypothesis: You can reject \(H_0: \beta_j = 0\) for predictors with a p-value less than 0.05. Usually, in the full model, only a few variables like zn, dis, rad, black, and medv remain statistically significant.

(c) Comparison of Coefficients

# Extract coefficients from the multiple regression (excluding the intercept)
multiple_coeffs <- coef(fit_all)[-1]

# Ensure the names match for plotting
plot(univariate_coeffs, multiple_coeffs, 
     xlab = "Univariate Regression Coefficients", 
     ylab = "Multiple Regression Coefficients", 
     main = "Comparison of Coefficients",
     pch = 19, col = "blue")
abline(0, 1, lty = 2) # Reference line where simple = multiple

Interpretation: Each point represents a predictor. Points far from the diagonal line indicate variables where the relationship with crime changes significantly when other predictors are added to the model (due to multicollinearity).

(d) Non-Linear Association

\[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\]

library(ISLR2)
predictors <- colnames(Boston)[colnames(Boston) != "crim"]

# (d) Test for non-linear association
for (p in predictors) {
  # Count unique values to avoid the "degree" error
  unique_values <- length(unique(Boston[[p]]))
  
  if (unique_values > 3) {
    # Fit the cubic model if there are enough unique points
    fit_cubic <- lm(crim ~ poly(get(p), 3), data = Boston)
    
    cat("\n--- Non-linear analysis for:", p, "---\n")
    print(summary(fit_cubic))
  } else {
    cat("\nSkipping", p, "because it has only", unique_values, "unique values.\n")
  }
}
## 
## --- Non-linear analysis for: zn ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.821 -4.614 -1.294  0.473 84.130 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.3722   9.709  < 2e-16 ***
## poly(get(p), 3)1 -38.7498     8.3722  -4.628  4.7e-06 ***
## poly(get(p), 3)2  23.9398     8.3722   2.859  0.00442 ** 
## poly(get(p), 3)3 -10.0719     8.3722  -1.203  0.22954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared:  0.05824,    Adjusted R-squared:  0.05261 
## F-statistic: 10.35 on 3 and 502 DF,  p-value: 1.281e-06
## 
## 
## --- Non-linear analysis for: indus ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.278 -2.514  0.054  0.764 79.713 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.614      0.330  10.950  < 2e-16 ***
## poly(get(p), 3)1   78.591      7.423  10.587  < 2e-16 ***
## poly(get(p), 3)2  -24.395      7.423  -3.286  0.00109 ** 
## poly(get(p), 3)3  -54.130      7.423  -7.292  1.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared:  0.2597, Adjusted R-squared:  0.2552 
## F-statistic: 58.69 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## Skipping chas because it has only 2 unique values.
## 
## --- Non-linear analysis for: nox ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.110 -2.068 -0.255  0.739 78.302 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.3216  11.237  < 2e-16 ***
## poly(get(p), 3)1  81.3720     7.2336  11.249  < 2e-16 ***
## poly(get(p), 3)2 -28.8286     7.2336  -3.985 7.74e-05 ***
## poly(get(p), 3)3 -60.3619     7.2336  -8.345 6.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared:  0.297,  Adjusted R-squared:  0.2928 
## F-statistic: 70.69 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## --- Non-linear analysis for: rm ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.485  -3.468  -2.221  -0.015  87.219 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.3703   9.758  < 2e-16 ***
## poly(get(p), 3)1 -42.3794     8.3297  -5.088 5.13e-07 ***
## poly(get(p), 3)2  26.5768     8.3297   3.191  0.00151 ** 
## poly(get(p), 3)3  -5.5103     8.3297  -0.662  0.50858    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared:  0.06779,    Adjusted R-squared:  0.06222 
## F-statistic: 12.17 on 3 and 502 DF,  p-value: 1.067e-07
## 
## 
## --- Non-linear analysis for: age ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.762 -2.673 -0.516  0.019 82.842 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.3485  10.368  < 2e-16 ***
## poly(get(p), 3)1  68.1820     7.8397   8.697  < 2e-16 ***
## poly(get(p), 3)2  37.4845     7.8397   4.781 2.29e-06 ***
## poly(get(p), 3)3  21.3532     7.8397   2.724  0.00668 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared:  0.1742, Adjusted R-squared:  0.1693 
## F-statistic: 35.31 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## --- Non-linear analysis for: dis ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.757  -2.588   0.031   1.267  76.378 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.3259  11.087  < 2e-16 ***
## poly(get(p), 3)1 -73.3886     7.3315 -10.010  < 2e-16 ***
## poly(get(p), 3)2  56.3730     7.3315   7.689 7.87e-14 ***
## poly(get(p), 3)3 -42.6219     7.3315  -5.814 1.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared:  0.2778, Adjusted R-squared:  0.2735 
## F-statistic: 64.37 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## --- Non-linear analysis for: rad ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.381  -0.412  -0.269   0.179  76.217 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.2971  12.164  < 2e-16 ***
## poly(get(p), 3)1 120.9074     6.6824  18.093  < 2e-16 ***
## poly(get(p), 3)2  17.4923     6.6824   2.618  0.00912 ** 
## poly(get(p), 3)3   4.6985     6.6824   0.703  0.48231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared:    0.4,  Adjusted R-squared:  0.3965 
## F-statistic: 111.6 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## --- Non-linear analysis for: tax ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.273  -1.389   0.046   0.536  76.950 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.3047  11.860  < 2e-16 ***
## poly(get(p), 3)1 112.6458     6.8537  16.436  < 2e-16 ***
## poly(get(p), 3)2  32.0873     6.8537   4.682 3.67e-06 ***
## poly(get(p), 3)3  -7.9968     6.8537  -1.167    0.244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3651 
## F-statistic:  97.8 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## --- Non-linear analysis for: ptratio ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.833 -4.146 -1.655  1.408 82.697 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.614      0.361  10.008  < 2e-16 ***
## poly(get(p), 3)1   56.045      8.122   6.901 1.57e-11 ***
## poly(get(p), 3)2   24.775      8.122   3.050  0.00241 ** 
## poly(get(p), 3)3  -22.280      8.122  -2.743  0.00630 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared:  0.1138, Adjusted R-squared:  0.1085 
## F-statistic: 21.48 on 3 and 502 DF,  p-value: 4.171e-13
## 
## 
## --- Non-linear analysis for: lstat ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.6135     0.3392  10.654   <2e-16 ***
## poly(get(p), 3)1  88.0697     7.6294  11.543   <2e-16 ***
## poly(get(p), 3)2  15.8882     7.6294   2.082   0.0378 *  
## poly(get(p), 3)3 -11.5740     7.6294  -1.517   0.1299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## --- Non-linear analysis for: medv ---
## 
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.427  -1.976  -0.437   0.439  73.655 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.614      0.292  12.374  < 2e-16 ***
## poly(get(p), 3)1  -75.058      6.569 -11.426  < 2e-16 ***
## poly(get(p), 3)2   88.086      6.569  13.409  < 2e-16 ***
## poly(get(p), 3)3  -48.033      6.569  -7.312 1.05e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared:  0.4202, Adjusted R-squared:  0.4167 
## F-statistic: 121.3 on 3 and 502 DF,  p-value: < 2.2e-16

Evidence: Look for significant p-values for the quadratic (\(X^2\)) or cubic (\(X^3\)) terms. If these terms are significant, it suggests a non-linear association between that predictor and the crime rate. Variables like indus, nox, age, dis, and medv often show strong evidence of non-linearity.

Chapter 4

No. 13

(a) Numerical and Graphical Summaries

library(ISLR2)
# Numerical summary
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
# Correlation matrix
cor(Weekly[, -9]) # Exclude the qualitative 'Direction' variable
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
# Graphical summary
pairs(Weekly)

plot(Weekly$Volume)

Patterns: The correlation matrix and scatterplot matrix usually show very little correlation between the lag variables and today’s returns. The only notable pattern is typically between Year and Volume, showing that the number of shares traded increased over time.

(b) Logistic Regression (Full Data)

glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
               data = Weekly, family = binomial)
summary(glm_fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Significance: Look at the \(Pr(>|z|)\) column in the summary. Typically, for this dataset, only Lag2 appears to be statistically significant with a p-value less than 0.05.

(c) Confusion Matrix and Accuracy

# Convert probabilities to class labels
glm_probs <- predict(glm_fit, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, "Up", "Down")

# Confusion Matrix
table(glm_pred, Weekly$Direction)
##         
## glm_pred Down  Up
##     Down   54  48
##     Up    430 557
# Overall fraction of correct predictions (Accuracy)
mean(glm_pred == Weekly$Direction)
## [1] 0.5610652

Mistakes: The confusion matrix tells you the types of errors: False Positives (predicting “Up” when the market went “Down”) and False Negatives (predicting “Down” when the market went “Up”). Logistic regression on this data often has a high “Training Error Rate” and tends to over-predict “Up” days.

(d) Training and Testing (Held-out Data)

# Create training and test sets
train <- (Weekly$Year <= 2008)
Weekly_test <- Weekly[!train, ]
Direction_test <- Weekly$Direction[!train]

# Fit model on training data
glm_fit2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)

# Predict on test data
glm_probs2 <- predict(glm_fit2, Weekly_test, type = "response")
glm_pred2 <- ifelse(glm_probs2 > 0.5, "Up", "Down")

# Results
table(glm_pred2, Direction_test)
##          Direction_test
## glm_pred2 Down Up
##      Down    9  5
##      Up     34 56
mean(glm_pred2 == Direction_test) # Accuracy on test data
## [1] 0.625

Interpretation: This part evaluates how the model performs on “unseen” data. You will find the test accuracy, which represents the model’s true predictive power for the years 2009 and 2010.