library(MASS)
library(ISLR)
Chapter 3: Linear Regression
Chapter 3 introduces the core ideas of linear regression, one of the most widely used methods for predicting a quantitative response variable. The chapter begins with simple linear regression, where a single predictor variable is used to explain the response through the equation \(Y = \beta_0 + \beta_1X + \epsilon\). It then extends to multiple linear regression, which incorporates several predictors and is more typical in real-world applications. A key part of the chapter is understanding how to assess model accuracy using the \(R^2\) statistic, which measures the proportion of variance explained, and the residual standard error (RSE), which reflects the average deviation of observations from the regression line. The chapter also emphasizes statistical inference: using hypothesis tests, such as the F-test, and confidence intervals to determine whether predictors are significant. Finally, advanced topics are introduced, including interaction effects between variables and handling qualitative predictors, highlighting how regression can adapt to more complex situations.
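For reference, the two accuracy measures mentioned above have standard definitions: \(R^2 = 1 - RSS/TSS\), where \(RSS = \sum_i (y_i - \hat{y}_i)^2\) and \(TSS = \sum_i (y_i - \bar{y})^2\), and \(RSE = \sqrt{RSS/(n - p - 1)}\), which reduces to \(\sqrt{RSS/(n - 2)}\) in simple linear regression with a single predictor.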
Chapter 4: Classification
Chapter 4 shifts from predicting continuous outcomes to predicting categorical responses through classification methods. It begins with logistic regression, the most commonly used approach for binary outcomes, which models the probability of belonging to a given class. The chapter then introduces linear discriminant analysis (LDA), which is effective when classes are well separated or sample sizes are small, and quadratic discriminant analysis (QDA), which provides greater flexibility by allowing quadratic decision boundaries. Another important method discussed is k-nearest neighbors (KNN), a simple and non-parametric algorithm that classifies new observations based on the majority vote of their closest neighbors in the training set. Finally, the chapter explains how to evaluate classifier performance using confusion matrices, which provide a detailed breakdown of correct and incorrect predictions, and stresses the importance of assessing models beyond overall accuracy.
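In formula terms, logistic regression models the class probability as \(p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\), or equivalently \(\log\big(p(X)/(1 - p(X))\big) = \beta_0 + \beta_1 X\), so the coefficients act on the log-odds scale; these are the standard definitions, stated here for reference.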
attach(Boston)
3.6.1 Simple Linear Regression
lm.fit <- lm(medv ~ lstat, data = Boston)
summary(lm.fit)
#>
#> Call:
#> lm(formula = medv ~ lstat, data = Boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.168 -3.990 -1.318 2.034 24.500
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
#> lstat -0.95005 0.03873 -24.53 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.216 on 504 degrees of freedom
#> Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
#> F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
confint(lm.fit)
#> 2.5 % 97.5 %
#> (Intercept) 33.448457 35.6592247
#> lstat -1.026148 -0.8739505
plot(Boston$lstat, Boston$medv, pch = 20, col = "gray40")
abline(lm.fit, col = "blue", lwd = 2)
par(mfrow = c(2, 2))
plot(lm.fit) # Residuals vs Fitted, QQ, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))
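If numeric versions of these diagnostics are preferred, the same quantities can be pulled out directly; a brief sketch using base R accessors (not part of the output reproduced above):
plot(predict(lm.fit), rstudent(lm.fit)) # studentized residuals vs fitted values
plot(hatvalues(lm.fit))                 # leverage statistic for each observation
which.max(hatvalues(lm.fit))            # index of the highest-leverage observation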
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
#> fit lwr upr
#> 1 29.80359 29.00741 30.59978
#> 2 25.05335 24.47413 25.63256
#> 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
#> fit lwr upr
#> 1 29.80359 17.565675 42.04151
#> 2 25.05335 12.827626 37.27907
#> 3 20.30310 8.077742 32.52846
3.6.2 Multiple Linear Regression
Fit a multiple linear regression model to predict ‘medv’ from ‘lstat’ and ‘age’.
lm.fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit2)
#>
#> Call:
#> lm(formula = medv ~ lstat + age, data = Boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.981 -3.978 -1.283 1.968 23.158
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
#> lstat -1.03207 0.04819 -21.416 < 2e-16 ***
#> age 0.03454 0.01223 2.826 0.00491 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.173 on 503 degrees of freedom
#> Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
#> F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
Fit a model using all predictors
lm.fit_all <- lm(medv ~ ., data = Boston)
summary(lm.fit_all)
#>
#> Call:
#> lm(formula = medv ~ ., data = Boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.595 -2.730 -0.518 1.777 26.199
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
#> crim -1.080e-01 3.286e-02 -3.287 0.001087 **
#> zn 4.642e-02 1.373e-02 3.382 0.000778 ***
#> indus 2.056e-02 6.150e-02 0.334 0.738288
#> chas 2.687e+00 8.616e-01 3.118 0.001925 **
#> nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
#> rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
#> age 6.922e-04 1.321e-02 0.052 0.958229
#> dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
#> rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
#> tax -1.233e-02 3.760e-03 -3.280 0.001112 **
#> ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
#> black 9.312e-03 2.686e-03 3.467 0.000573 ***
#> lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.745 on 492 degrees of freedom
#> Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
#> F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
names(lm.fit_all)
#> [1] "coefficients" "residuals" "effects" "rank"
#> [5] "fitted.values" "assign" "qr" "df.residual"
#> [9] "xlevels" "call" "terms" "model"
Check the variance inflation factor (VIF) to assess multicollinearity
library(car)
vif(lm.fit_all)
#> crim zn indus chas nox rm age dis
#> 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
#> rad tax ptratio black lstat
#> 7.484496 9.008554 1.799084 1.348521 2.941491
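As a sanity check on what vif() reports, the VIF for predictor \(j\) is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others; a minimal sketch for ‘tax’ (not part of the original output):
r2_tax <- summary(lm(tax ~ . - medv, data = Boston))$r.squared # regress tax on the other predictors
1 / (1 - r2_tax) # should be close to vif(lm.fit_all)["tax"]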
Fit a model with interactions
lm.fit_interact <- lm(medv ~ lstat * age, data = Boston)
summary(lm.fit_interact)
#>
#> Call:
#> lm(formula = medv ~ lstat * age, data = Boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.806 -4.045 -1.333 2.085 27.552
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
#> lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
#> age -0.0007209 0.0198792 -0.036 0.9711
#> lstat:age 0.0041560 0.0018518 2.244 0.0252 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.149 on 502 degrees of freedom
#> Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
#> F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
3.6.3 Non-linear Transformations of the Predictors
Fit a polynomial regression model
lm.fit_poly <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit_poly)
#>
#> Call:
#> lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.2834 -3.8313 -0.5295 2.3095 25.4148
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
#> lstat -2.332821 0.123803 -18.84 <2e-16 ***
#> I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.524 on 503 degrees of freedom
#> Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
#> F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
Use the anova() function to compare models
The F-statistic suggests that the second model (with the quadratic term) is superior to the first (the simple linear model).
anova(lm.fit, lm.fit_poly)
#>   Res.Df      RSS Df Sum of Sq        F Pr(>F)
#> 1    504 19472.38 NA        NA       NA     NA
#> 2    503 15347.24  1  4125.138 135.1998      0
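The F value in the table can be reproduced by hand from the RSS values, since \(F = \frac{(RSS_1 - RSS_2)/q}{RSS_2/(n - p - 1)}\) with \(q = 1\) added term, \(n = 506\), and \(p = 2\) predictors in the larger model:
((19472.38 - 15347.24) / 1) / (15347.24 / 503) # approximately 135.2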
Fit a polynomial regression of degree 5
lm.fit_poly5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit_poly5)
#>
#> Call:
#> lm(formula = medv ~ poly(lstat, 5), data = Boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -13.5433 -3.1039 -0.7052 2.0844 27.1153
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
#> poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
#> poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
#> poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
#> poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
#> poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.215 on 500 degrees of freedom
#> Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
#> F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
3.6.4 Qualitative Predictors
attach(Carseats)
lm.fit_qual <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit_qual)
#>
#> Call:
#> lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.9208 -0.7503 0.0177 0.6754 3.3413
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
#> CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
#> Income 0.0108940 0.0026044 4.183 3.57e-05 ***
#> Advertising 0.0702462 0.0226091 3.107 0.002030 **
#> Population 0.0001592 0.0003679 0.433 0.665330
#> Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
#> ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
#> ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
#> Age -0.0579466 0.0159506 -3.633 0.000318 ***
#> Education -0.0208525 0.0196131 -1.063 0.288361
#> UrbanYes 0.1401597 0.1124019 1.247 0.213171
#> USYes -0.1575571 0.1489234 -1.058 0.290729
#> Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
#> Price:Age 0.0001068 0.0001333 0.801 0.423812
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.011 on 386 degrees of freedom
#> Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
#> F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
Use ‘contrasts’ to see how R handles qualitative variables
contrasts(ShelveLoc)
#> Good Medium
#> Bad 0 0
#> Good 1 0
#> Medium 0 1
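A hedged aside, not part of the original lab: if a baseline other than ‘Bad’ is wanted, relevel() changes the reference level before refitting with data = Carseats.
Carseats$ShelveLoc <- relevel(Carseats$ShelveLoc, ref = "Good") # make "Good" the baseline level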
Chapter 4 Lab Sessions
library(ISLR2) # Note: The second edition uses the ISLR2 package
library(MASS)
library(class)
4.6.1 The Stock Market Data
attach(Smarket)
names(Smarket)
#> [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
#> [7] "Volume" "Today" "Direction"
summary(Smarket)
#> Year Lag1 Lag2 Lag3
#> Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
#> 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
#> Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
#> Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
#> 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
#> Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
#> Lag4 Lag5 Volume Today
#> Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
#> 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
#> Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
#> Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
#> 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
#> Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
#> Direction
#> Down:602
#> Up :648
#>
#>
#>
#>
4.6.2 Logistic Regression
Fit a logistic regression model to predict ‘Direction’ using ‘Lag1’ through ‘Lag5’ and ‘Volume’
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial)
summary(glm.fits)
#>
#> Call:
#> glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
#> Volume, family = binomial, data = Smarket)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.126000 0.240736 -0.523 0.601
#> Lag1 -0.073074 0.050167 -1.457 0.145
#> Lag2 -0.042301 0.050086 -0.845 0.398
#> Lag3 0.011085 0.049939 0.222 0.824
#> Lag4 0.009359 0.049974 0.187 0.851
#> Lag5 0.010313 0.049511 0.208 0.835
#> Volume 0.135441 0.158360 0.855 0.392
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1731.2 on 1249 degrees of freedom
#> Residual deviance: 1727.6 on 1243 degrees of freedom
#> AIC: 1741.6
#>
#> Number of Fisher Scoring iterations: 3
coef(glm.fits)
#> (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
#> -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
#> Volume
#> 0.135440659
glm.probs <- predict(glm.fits, type = "response")
<- rep("Down", 1250)
glm.pred > 0.5] <- "Up"
glm.pred[glm.probs table(glm.pred, Direction)
#> Direction
#> glm.pred Down Up
#> Down 145 141
#> Up 457 507
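To summarize the table with a single number, the overall fraction of correct training predictions can be computed as below; note this is a training accuracy and will be optimistic relative to test performance.
mean(glm.pred == Direction) # (145 + 507) / 1250 = 0.5216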
Create training and test sets
train <- Year < 2005
glm.fits.train <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial,
                      subset = train)
glm.probs.test <- predict(glm.fits.train, newdata = Smarket[!train, ],
                          type = "response")
glm.pred.test <- rep("Down", 252)
glm.pred.test[glm.probs.test > 0.5] <- "Up"
table(glm.pred.test, Direction[!train])
#>
#> glm.pred.test Down Up
#> Down 35 35
#> Up 76 106
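The corresponding test-set accuracy for the Lag1 + Lag2 model follows directly from the table above:
mean(glm.pred.test == Direction[!train]) # (35 + 106) / 252, roughly 0.56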
4.6.3 Linear Discriminant Analysis
Fit an LDA model using ‘Lag1’ and ‘Lag2’
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit
#> Call:
#> lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
#>
#> Prior probabilities of groups:
#> Down Up
#> 0.491984 0.508016
#>
#> Group means:
#> Lag1 Lag2
#> Down 0.04279022 0.03389409
#> Up -0.03954635 -0.03132544
#>
#> Coefficients of linear discriminants:
#> LD1
#> Lag1 -0.6420190
#> Lag2 -0.5135293
plot(lda.fit)
Get predictions on the test set
lda.pred <- predict(lda.fit, newdata = Smarket[!train, ])
names(lda.pred)
#> [1] "class" "posterior" "x"
table(lda.pred$class, Direction[!train])
#>
#> Down Up
#> Down 35 35
#> Up 76 106
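The ‘posterior’ element returned by predict() holds the class probabilities; applying a 50% threshold to the ‘Down’ column should reproduce the 70 ‘Down’ and 182 ‘Up’ predictions counted in the table above (a brief sketch):
sum(lda.pred$posterior[, "Down"] >= 0.5) # observations predicted as "Down"
sum(lda.pred$posterior[, "Down"] < 0.5)  # observations predicted as "Up"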
4.6.4 Quadratic Discriminant Analysis
Fit a QDA model
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
#> Call:
#> qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
#>
#> Prior probabilities of groups:
#> Down Up
#> 0.491984 0.508016
#>
#> Group means:
#> Lag1 Lag2
#> Down 0.04279022 0.03389409
#> Up -0.03954635 -0.03132544
Get predictions on the test set
qda.pred <- predict(qda.fit, newdata = Smarket[!train, ])
table(qda.pred$class, Direction[!train])
#>
#> Down Up
#> Down 30 20
#> Up 81 121
4.6.5 K-Nearest Neighbors
Create matrices of predictors for the training and test data
train.X <- as.matrix(Smarket[train, c("Lag1", "Lag2")])
test.X <- as.matrix(Smarket[!train, c("Lag1", "Lag2")])
train.Direction <- Smarket[train, "Direction"]
Run KNN with K=3
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction[!train])
#>
#> knn.pred Down Up
#> Down 48 55
#> Up 63 86
Run KNN with K=1
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction[!train])
#>
#> knn.pred Down Up
#> Down 43 58
#> Up 68 83
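A brief sketch, not part of the original lab: since knn() breaks ties at random, setting a seed and looping over a few values of k makes the comparison reproducible.
set.seed(1) # knn() resolves ties randomly, so fix the seed for reproducibility
for (k in c(1, 3, 5, 7)) {
  pred <- knn(train.X, test.X, train.Direction, k = k)
  cat("k =", k, "test accuracy =", mean(pred == Direction[!train]), "\n")
}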