library(ISLR2)
data(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
#No. 9
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative: origin
Identifier: name
The range shows the minimum and maximum value of each variable.
For example:
mpg ranges from low fuel efficiency cars to high fuel efficiency cars. horsepower ranges from weak engines to strong engines. weight ranges from light cars to heavy cars.
This means the dataset contains many different types of cars with different sizes and performance levels.
range(Auto$mpg)
## [1] 9.0 46.6
range(Auto$cylinders)
## [1] 3 8
range(Auto$displacement)
## [1] 68 455
range(Auto$horsepower)
## [1] 46 230
range(Auto$weight)
## [1] 1613 5140
range(Auto$acceleration)
## [1] 8.0 24.8
range(Auto$year)
## [1] 70 82
The mean represents the average value of each variable.
The standard deviation shows how spread out the data is from the average.
For example:
Average mpg shows the typical fuel efficiency of cars in the dataset. High standard deviation in weight means car weights vary a lot. High standard deviation in horsepower means engine power differs between cars.
This shows the dataset has a wide variety of vehicles.
mean(Auto$mpg)
## [1] 23.44592
sd(Auto$mpg)
## [1] 7.805007
mean(Auto$cylinders)
## [1] 5.471939
sd(Auto$cylinders)
## [1] 1.705783
mean(Auto$displacement)
## [1] 194.412
sd(Auto$displacement)
## [1] 104.644
mean(Auto$horsepower)
## [1] 104.4694
sd(Auto$horsepower)
## [1] 38.49116
mean(Auto$weight)
## [1] 2977.584
sd(Auto$weight)
## [1] 849.4026
mean(Auto$acceleration)
## [1] 15.54133
sd(Auto$acceleration)
## [1] 2.758864
mean(Auto$year)
## [1] 75.97959
sd(Auto$year)
## [1] 3.683737
After removing observations 10 to 85, the summary statistics change slightly.
The average mpg becomes higher, meaning the remaining cars are more fuel efficient.
The average weight and horsepower become lower, meaning heavier and less efficient cars may have been removed.
This indicates that the deleted observations had some influence on the dataset.
Auto2 <- Auto[-c(10:85), ]
# Range
range(Auto2$mpg)
## [1] 11.0 46.6
range(Auto2$cylinders)
## [1] 3 8
range(Auto2$displacement)
## [1] 68 455
range(Auto2$horsepower)
## [1] 46 230
range(Auto2$weight)
## [1] 1649 4997
range(Auto2$acceleration)
## [1] 8.5 24.8
range(Auto2$year)
## [1] 70 82
# Mean
mean(Auto2$mpg)
## [1] 24.40443
mean(Auto2$cylinders)
## [1] 5.373418
mean(Auto2$displacement)
## [1] 187.2405
mean(Auto2$horsepower)
## [1] 100.7215
mean(Auto2$weight)
## [1] 2935.972
mean(Auto2$acceleration)
## [1] 15.7269
mean(Auto2$year)
## [1] 77.14557
# Standard Deviation
sd(Auto2$mpg)
## [1] 7.867283
sd(Auto2$cylinders)
## [1] 1.654179
sd(Auto2$displacement)
## [1] 99.67837
sd(Auto2$horsepower)
## [1] 35.70885
sd(Auto2$weight)
## [1] 811.3002
sd(Auto2$acceleration)
## [1] 2.693721
sd(Auto2$year)
## [1] 3.106217
From the scatterplot matrix:
mpg and weight have a strong negative relationship. mpg and horsepower also have a negative relationship. mpg and year have a positive relationship.
This means:
Heavier cars usually have lower mpg. Cars with stronger engines use more fuel. Newer cars tend to be more fuel efficient.
pairs(Auto)
The most important predictors related to mpg are:
Negative Relationship: weight horsepower displacement cylinders
When these variables increase, mpg decreases.
Positive Relationship: year acceleration (weak)
When year increases, mpg tends to increase.
cor(Auto[,1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
# Optional specific scatterplots
plot(Auto$weight, Auto$mpg)
plot(Auto$horsepower, Auto$mpg)
plot(Auto$year, Auto$mpg)
Based on the analysis, fuel efficiency (mpg) is strongly affected by the
size, weight, and engine power of the car.
Smaller, lighter, and newer cars generally have better fuel efficiency than older, heavier, and high-powered cars.
library(ISLR2)
pairs(Auto)
################################################# # (b) Corellation
Matrix #################################################
# Column 9 is 'name'
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
fit <- lm(mpg ~ . - name, data = Auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Relationship: Check the F-statistic p-value in the summary. If it is very small, there is a statistically significant relationship between the response and at least one predictor.
Statistically Significant Predictors: Look at the \(Pr(>|t|)\) column. Predictors with a value less than 0.05 (like weight, year, and origin) are typically significant.
Year Variable: The positive coefficient for year suggests that for every unit increase in year (newer cars), mpg increases by that coefficient value, assuming all other variables are held constant.
par(mfrow = c(2, 2))
plot(fit)
-Residual Plots: Look for a non-linear pattern in the “Residuals vs
Fitted” plot. Points far from the central line indicate potential
outliers.
-Leverage Plot: Check the “Residuals vs Leverage” plot. Points falling outside the dashed red lines (Cook’s distance) indicate observations with unusually high leverage that may be disproportionately influencing the model.
# Example: Interaction between horsepower and weight
fit_interact <- lm(mpg ~ horsepower * weight + year + origin, data = Auto)
summary(fit_interact)
##
## Call:
## lm(formula = mpg ~ horsepower * weight + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6051 -1.7722 -0.1304 1.5205 12.0369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.145e-01 3.969e+00 0.205 0.83753
## horsepower -2.160e-01 2.055e-02 -10.514 < 2e-16 ***
## weight -1.106e-02 6.343e-04 -17.435 < 2e-16 ***
## year 7.677e-01 4.464e-02 17.195 < 2e-16 ***
## origin 7.224e-01 2.328e-01 3.103 0.00206 **
## horsepower:weight 5.501e-05 5.051e-06 10.891 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.931 on 386 degrees of freedom
## Multiple R-squared: 0.8608, Adjusted R-squared: 0.859
## F-statistic: 477.5 on 5 and 386 DF, p-value: < 2.2e-16
Significance: If the p-value for the interaction term (e.g., horsepower:weight) is small, the interaction is statistically significant.
# Example: Log transformation of displacement and squaring horsepower
fit_trans <- lm(mpg ~ log(displacement) + I(horsepower^2) + weight + year, data = Auto)
summary(fit_trans)
##
## Call:
## lm(formula = mpg ~ log(displacement) + I(horsepower^2) + weight +
## year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8919 -2.0263 -0.0686 1.7793 13.5477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.541e+00 5.223e+00 0.295 0.76808
## log(displacement) -4.641e+00 9.126e-01 -5.085 5.73e-07 ***
## I(horsepower^2) 9.362e-05 3.132e-05 2.989 0.00298 **
## weight -4.768e-03 5.914e-04 -8.062 9.46e-15 ***
## year 7.731e-01 5.019e-02 15.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.308 on 387 degrees of freedom
## Multiple R-squared: 0.8223, Adjusted R-squared: 0.8204
## F-statistic: 447.6 on 4 and 387 DF, p-value: < 2.2e-16
Findings: Compare the \(R^2\) value of this model to the previous ones; a higher \(R^2\) suggests the transformation better captures the data’s underlying structure
library(ISLR2)
# Get a list of all predictors except 'crim'
predictors <- colnames(Boston)[colnames(Boston) != "crim"]
# Create a vector to store the univariate coefficients
univariate_coeffs <- c()
for (p in predictors) {
formula <- as.formula(paste("crim ~", p))
fit <- lm(formula, data = Boston)
print(summary(fit)) # Describe results
univariate_coeffs[p] <- coef(fit)[2]
# Create plots to back up assertions [cite: 43]
plot(Boston[[p]], Boston$crim, xlab = p, ylab = "crim", main = paste("crim vs", p))
abline(fit, col = "red")
}
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.972 -2.698 -0.736 0.712 81.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.06374 0.66723 -3.093 0.00209 **
## indus 0.50978 0.05102 9.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1637
## F-statistic: 99.82 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## chas -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.371 -2.738 -0.974 0.559 81.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.720 1.699 -8.073 5.08e-15 ***
## nox 31.249 2.999 10.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1756
## F-statistic: 108.6 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -3.952 -2.654 0.989 87.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.482 3.365 6.088 2.27e-09 ***
## rm -2.684 0.532 -5.045 6.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.04618
## F-statistic: 25.45 on 1 and 504 DF, p-value: 6.347e-07
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.789 -4.257 -1.230 1.527 82.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.77791 0.94398 -4.002 7.22e-05 ***
## age 0.10779 0.01274 8.463 2.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1227
## F-statistic: 71.62 on 1 and 504 DF, p-value: 2.855e-16
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.708 -4.134 -1.527 1.516 81.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4993 0.7304 13.006 <2e-16 ***
## dis -1.5509 0.1683 -9.213 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.1425
## F-statistic: 84.89 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.164 -1.381 -0.141 0.660 76.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.28716 0.44348 -5.157 3.61e-07 ***
## rad 0.61791 0.03433 17.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared: 0.3913, Adjusted R-squared: 0.39
## F-statistic: 323.9 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.513 -2.738 -0.194 1.065 77.696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.528369 0.815809 -10.45 <2e-16 ***
## tax 0.029742 0.001847 16.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared: 0.3396, Adjusted R-squared: 0.3383
## F-statistic: 259.2 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.654 -3.985 -1.912 1.825 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.6469 3.1473 -5.607 3.40e-08 ***
## ptratio 1.1520 0.1694 6.801 2.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared: 0.08407, Adjusted R-squared: 0.08225
## F-statistic: 46.26 on 1 and 504 DF, p-value: 2.943e-11
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.925 -2.822 -0.664 1.079 82.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.33054 0.69376 -4.801 2.09e-06 ***
## lstat 0.54880 0.04776 11.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.206
## F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.071 -4.022 -2.343 1.298 80.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.79654 0.93419 12.63 <2e-16 ***
## medv -0.36316 0.03839 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
Significance: Check the p-value for each model. In the Boston data, most
predictors (like zn, nox, rm, dis, rad, tax) typically show a
statistically significant association with crim when modeled
individually.
fit_all <- lm(crim ~ ., data = Boston)
summary(fit_all)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
Rejecting the Null Hypothesis: You can reject \(H_0: \beta_j = 0\) for predictors with a p-value less than 0.05. Usually, in the full model, only a few variables like zn, dis, rad, black, and medv remain statistically significant.
# Extract coefficients from the multiple regression (excluding the intercept)
multiple_coeffs <- coef(fit_all)[-1]
# Ensure the names match for plotting
plot(univariate_coeffs, multiple_coeffs,
xlab = "Univariate Regression Coefficients",
ylab = "Multiple Regression Coefficients",
main = "Comparison of Coefficients",
pch = 19, col = "blue")
abline(0, 1, lty = 2) # Reference line where simple = multiple
Interpretation: Each point represents a predictor. Points far from the
diagonal line indicate variables where the relationship with crime
changes significantly when other predictors are added to the model (due
to multicollinearity).
\[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\]
library(ISLR2)
predictors <- colnames(Boston)[colnames(Boston) != "crim"]
# (d) Test for non-linear association
for (p in predictors) {
# Count unique values to avoid the "degree" error
unique_values <- length(unique(Boston[[p]]))
if (unique_values > 3) {
# Fit the cubic model if there are enough unique points
fit_cubic <- lm(crim ~ poly(get(p), 3), data = Boston)
cat("\n--- Non-linear analysis for:", p, "---\n")
print(summary(fit_cubic))
} else {
cat("\nSkipping", p, "because it has only", unique_values, "unique values.\n")
}
}
##
## --- Non-linear analysis for: zn ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.821 -4.614 -1.294 0.473 84.130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3722 9.709 < 2e-16 ***
## poly(get(p), 3)1 -38.7498 8.3722 -4.628 4.7e-06 ***
## poly(get(p), 3)2 23.9398 8.3722 2.859 0.00442 **
## poly(get(p), 3)3 -10.0719 8.3722 -1.203 0.22954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared: 0.05824, Adjusted R-squared: 0.05261
## F-statistic: 10.35 on 3 and 502 DF, p-value: 1.281e-06
##
##
## --- Non-linear analysis for: indus ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.278 -2.514 0.054 0.764 79.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.330 10.950 < 2e-16 ***
## poly(get(p), 3)1 78.591 7.423 10.587 < 2e-16 ***
## poly(get(p), 3)2 -24.395 7.423 -3.286 0.00109 **
## poly(get(p), 3)3 -54.130 7.423 -7.292 1.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared: 0.2597, Adjusted R-squared: 0.2552
## F-statistic: 58.69 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## Skipping chas because it has only 2 unique values.
##
## --- Non-linear analysis for: nox ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.110 -2.068 -0.255 0.739 78.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3216 11.237 < 2e-16 ***
## poly(get(p), 3)1 81.3720 7.2336 11.249 < 2e-16 ***
## poly(get(p), 3)2 -28.8286 7.2336 -3.985 7.74e-05 ***
## poly(get(p), 3)3 -60.3619 7.2336 -8.345 6.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared: 0.297, Adjusted R-squared: 0.2928
## F-statistic: 70.69 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## --- Non-linear analysis for: rm ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -3.468 -2.221 -0.015 87.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3703 9.758 < 2e-16 ***
## poly(get(p), 3)1 -42.3794 8.3297 -5.088 5.13e-07 ***
## poly(get(p), 3)2 26.5768 8.3297 3.191 0.00151 **
## poly(get(p), 3)3 -5.5103 8.3297 -0.662 0.50858
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared: 0.06779, Adjusted R-squared: 0.06222
## F-statistic: 12.17 on 3 and 502 DF, p-value: 1.067e-07
##
##
## --- Non-linear analysis for: age ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.762 -2.673 -0.516 0.019 82.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3485 10.368 < 2e-16 ***
## poly(get(p), 3)1 68.1820 7.8397 8.697 < 2e-16 ***
## poly(get(p), 3)2 37.4845 7.8397 4.781 2.29e-06 ***
## poly(get(p), 3)3 21.3532 7.8397 2.724 0.00668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared: 0.1742, Adjusted R-squared: 0.1693
## F-statistic: 35.31 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## --- Non-linear analysis for: dis ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.757 -2.588 0.031 1.267 76.378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3259 11.087 < 2e-16 ***
## poly(get(p), 3)1 -73.3886 7.3315 -10.010 < 2e-16 ***
## poly(get(p), 3)2 56.3730 7.3315 7.689 7.87e-14 ***
## poly(get(p), 3)3 -42.6219 7.3315 -5.814 1.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared: 0.2778, Adjusted R-squared: 0.2735
## F-statistic: 64.37 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## --- Non-linear analysis for: rad ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.381 -0.412 -0.269 0.179 76.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.2971 12.164 < 2e-16 ***
## poly(get(p), 3)1 120.9074 6.6824 18.093 < 2e-16 ***
## poly(get(p), 3)2 17.4923 6.6824 2.618 0.00912 **
## poly(get(p), 3)3 4.6985 6.6824 0.703 0.48231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3965
## F-statistic: 111.6 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## --- Non-linear analysis for: tax ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.273 -1.389 0.046 0.536 76.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3047 11.860 < 2e-16 ***
## poly(get(p), 3)1 112.6458 6.8537 16.436 < 2e-16 ***
## poly(get(p), 3)2 32.0873 6.8537 4.682 3.67e-06 ***
## poly(get(p), 3)3 -7.9968 6.8537 -1.167 0.244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3651
## F-statistic: 97.8 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## --- Non-linear analysis for: ptratio ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.833 -4.146 -1.655 1.408 82.697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.361 10.008 < 2e-16 ***
## poly(get(p), 3)1 56.045 8.122 6.901 1.57e-11 ***
## poly(get(p), 3)2 24.775 8.122 3.050 0.00241 **
## poly(get(p), 3)3 -22.280 8.122 -2.743 0.00630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared: 0.1138, Adjusted R-squared: 0.1085
## F-statistic: 21.48 on 3 and 502 DF, p-value: 4.171e-13
##
##
## --- Non-linear analysis for: lstat ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.234 -2.151 -0.486 0.066 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3392 10.654 <2e-16 ***
## poly(get(p), 3)1 88.0697 7.6294 11.543 <2e-16 ***
## poly(get(p), 3)2 15.8882 7.6294 2.082 0.0378 *
## poly(get(p), 3)3 -11.5740 7.6294 -1.517 0.1299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared: 0.2179, Adjusted R-squared: 0.2133
## F-statistic: 46.63 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## --- Non-linear analysis for: medv ---
##
## Call:
## lm(formula = crim ~ poly(get(p), 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.427 -1.976 -0.437 0.439 73.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.292 12.374 < 2e-16 ***
## poly(get(p), 3)1 -75.058 6.569 -11.426 < 2e-16 ***
## poly(get(p), 3)2 88.086 6.569 13.409 < 2e-16 ***
## poly(get(p), 3)3 -48.033 6.569 -7.312 1.05e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared: 0.4202, Adjusted R-squared: 0.4167
## F-statistic: 121.3 on 3 and 502 DF, p-value: < 2.2e-16
Evidence: Look for significant p-values for the quadratic (\(X^2\)) or cubic (\(X^3\)) terms. If these terms are significant, it suggests a non-linear association between that predictor and the crime rate. Variables like indus, nox, age, dis, and medv often show strong evidence of non-linearity.
library(ISLR2)
# Numerical summary
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
# Correlation matrix
cor(Weekly[, -9]) # Exclude the qualitative 'Direction' variable
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
# Graphical summary
pairs(Weekly)
plot(Weekly$Volume)
Patterns: The correlation matrix and scatterplot matrix usually show
very little correlation between the lag variables and today’s returns.
The only notable pattern is typically between Year and Volume, showing
that the number of shares traded increased over time.
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm_fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Significance: Look at the \(Pr(>|z|)\) column in the summary. Typically, for this dataset, only Lag2 appears to be statistically significant with a p-value less than 0.05.
# Convert probabilities to class labels
glm_probs <- predict(glm_fit, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, "Up", "Down")
# Confusion Matrix
table(glm_pred, Weekly$Direction)
##
## glm_pred Down Up
## Down 54 48
## Up 430 557
# Overall fraction of correct predictions (Accuracy)
mean(glm_pred == Weekly$Direction)
## [1] 0.5610652
Mistakes: The confusion matrix tells you the types of errors: False Positives (predicting “Up” when the market went “Down”) and False Negatives (predicting “Down” when the market went “Up”). Logistic regression on this data often has a high “Training Error Rate” and tends to over-predict “Up” days.
# Create training and test sets
train <- (Weekly$Year <= 2008)
Weekly_test <- Weekly[!train, ]
Direction_test <- Weekly$Direction[!train]
# Fit model on training data
glm_fit2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
# Predict on test data
glm_probs2 <- predict(glm_fit2, Weekly_test, type = "response")
glm_pred2 <- ifelse(glm_probs2 > 0.5, "Up", "Down")
# Results
table(glm_pred2, Direction_test)
## Direction_test
## glm_pred2 Down Up
## Down 9 5
## Up 34 56
mean(glm_pred2 == Direction_test) # Accuracy on test data
## [1] 0.625
Interpretation: This part evaluates how the model performs on “unseen” data. You will find the test accuracy, which represents the model’s true predictive power for the years 2009 and 2010.