Artificial Intelligence for Health Data Science: Exercise A3

An Introduction to Statistical Learning: Summary and Lab for Chapters 3 and 4

Author: Vusumuzi Mabasa
Affiliation: University of the Witwatersrand (School of Public Health)
Published: September 12, 2025

Chapter 3: Linear Regression

Chapter 3 introduces the core ideas of linear regression, one of the most widely used methods for predicting a quantitative response variable. The chapter begins with simple linear regression, where a single predictor variable is used to explain the response through the equation \(Y = \beta_0 + \beta_1X + \epsilon\). It then extends to multiple linear regression, which incorporates several predictors and is more typical in real-world applications. A key part of the chapter is understanding how to assess model accuracy using the \(R^2\) statistic, which measures the proportion of variance explained, and the residual standard error (RSE), which reflects the average deviation of observations from the regression line. The chapter also emphasizes statistical inference: using hypothesis tests, such as the F-test, and confidence intervals to determine whether predictors are significant. Finally, advanced topics are introduced, including interaction effects between variables and handling qualitative predictors, highlighting how regression can adapt to more complex situations.
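To make these accuracy measures concrete, here is a minimal sketch (on simulated data, not from the book) that recovers \(R^2\) and the RSE by hand from a fitted model:

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)        # simulated data: true beta0 = 2, beta1 = 3
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)       # residual sum of squares
tss <- sum((y - mean(y))^2)        # total sum of squares
r2  <- 1 - rss / tss               # proportion of variance explained
rse <- sqrt(rss / fit$df.residual) # residual standard error
c(R2 = r2, RSE = rse)              # agrees with summary(fit)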

Chapter 4: Classification

Chapter 4 shifts from predicting continuous outcomes to predicting categorical responses through classification methods. It begins with logistic regression, the most commonly used approach for binary outcomes, which models the probability of belonging to a given class. The chapter then introduces linear discriminant analysis (LDA), which is effective when classes are well separated or sample sizes are small, and quadratic discriminant analysis (QDA), which provides greater flexibility by allowing quadratic decision boundaries. Another important method discussed is k-nearest neighbors (KNN), a simple and non-parametric algorithm that classifies new observations based on the majority vote of their closest neighbors in the training set. Finally, the chapter explains how to evaluate classifier performance using confusion matrices, which provide a detailed breakdown of correct and incorrect predictions, and stresses the importance of assessing models beyond overall accuracy.
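To illustrate evaluation beyond accuracy, the sketch below computes sensitivity and specificity from a hypothetical 2x2 confusion matrix (the counts are invented for illustration):

cm <- matrix(c(50, 10,   # predicted Down: 50 actual Down, 10 actual Up
               15, 125), # predicted Up:   15 actual Down, 125 actual Up
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("Down", "Up"),
                             actual    = c("Down", "Up")))
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Up", "Up"] / sum(cm[, "Up"])       # true positive rate
specificity <- cm["Down", "Down"] / sum(cm[, "Down"]) # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)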

library(MASS)
library(ISLR)
attach(Boston)

3.6.1 Simple Linear Regression

lm.fit <- lm(medv ~ lstat, data = Boston)
summary(lm.fit)
#> 
#> Call:
#> lm(formula = medv ~ lstat, data = Boston)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -15.168  -3.990  -1.318   2.034  24.500 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
#> lstat       -0.95005    0.03873  -24.53   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.216 on 504 degrees of freedom
#> Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
#> F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
confint(lm.fit)
#>                 2.5 %     97.5 %
#> (Intercept) 33.448457 35.6592247
#> lstat       -1.026148 -0.8739505
plot(Boston$lstat, Boston$medv, pch = 20, col = "gray40")
abline(lm.fit, col = "blue", lwd = 2)

par(mfrow = c(2, 2))
plot(lm.fit)  # Residuals vs Fitted, QQ, Scale-Location, Residuals vs Leverage

par(mfrow = c(1, 1))
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
#>        fit      lwr      upr
#> 1 29.80359 29.00741 30.59978
#> 2 25.05335 24.47413 25.63256
#> 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
#>        fit       lwr      upr
#> 1 29.80359 17.565675 42.04151
#> 2 25.05335 12.827626 37.27907
#> 3 20.30310  8.077742 32.52846
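The confidence intervals above bound the average response at each value of ‘lstat’, while the wider prediction intervals bound an individual observation. One way to visualise the difference (the plotting choices here are my own, not from the book) is to overlay both bands on the scatterplot:

lstat.grid <- data.frame(lstat = seq(min(lstat), max(lstat), length.out = 100))
conf.int <- predict(lm.fit, lstat.grid, interval = "confidence")
pred.int <- predict(lm.fit, lstat.grid, interval = "prediction")
plot(lstat, medv, pch = 20, col = "gray60")
lines(lstat.grid$lstat, conf.int[, "fit"], col = "blue", lwd = 2)              # fitted line
matlines(lstat.grid$lstat, conf.int[, c("lwr", "upr")], col = "blue", lty = 2) # confidence band
matlines(lstat.grid$lstat, pred.int[, c("lwr", "upr")], col = "red", lty = 3)  # prediction band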

3.6.2 Multiple Linear Regression

Fit a multiple linear regression model to predict ‘medv’ using two predictors, ‘lstat’ and ‘age’.

lm.fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit2)
#> 
#> Call:
#> lm(formula = medv ~ lstat + age, data = Boston)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -15.981  -3.978  -1.283   1.968  23.158 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
#> lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
#> age          0.03454    0.01223   2.826  0.00491 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.173 on 503 degrees of freedom
#> Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
#> F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

Fit a model using all predictors


lm.fit_all <- lm(medv ~ ., data = Boston)
summary(lm.fit_all)
#> 
#> Call:
#> lm(formula = medv ~ ., data = Boston)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -15.595  -2.730  -0.518   1.777  26.199 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
#> crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
#> zn           4.642e-02  1.373e-02   3.382 0.000778 ***
#> indus        2.056e-02  6.150e-02   0.334 0.738288    
#> chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
#> nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
#> rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
#> age          6.922e-04  1.321e-02   0.052 0.958229    
#> dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
#> rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
#> tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
#> ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
#> black        9.312e-03  2.686e-03   3.467 0.000573 ***
#> lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.745 on 492 degrees of freedom
#> Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
#> F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
names(lm.fit_all)
#>  [1] "coefficients"  "residuals"     "effects"       "rank"         
#>  [5] "fitted.values" "assign"        "qr"            "df.residual"  
#>  [9] "xlevels"       "call"          "terms"         "model"

Check the variance inflation factor (VIF) to assess multicollinearity

library(car)
vif(lm.fit_all)
#>     crim       zn    indus     chas      nox       rm      age      dis 
#> 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
#>      rad      tax  ptratio    black    lstat 
#> 7.484496 9.008554 1.799084 1.348521 2.941491
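‘rad’ and ‘tax’ have the largest VIFs, above the common rule-of-thumb cut-off of 5. One possible follow-up, sketched below, is to drop the highest-VIF predictor and recheck; other remedies (combining predictors, regularisation) may be preferable in practice:

lm.fit_notax <- lm(medv ~ . - tax, data = Boston)  # drop the highest-VIF predictor
vif(lm.fit_notax)                                  # rad's VIF should fall as well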

Fit a model with an interaction between ‘lstat’ and ‘age’ (the * syntax includes both main effects and their interaction)

lm.fit_interact <- lm(medv ~ lstat * age, data = Boston)
summary(lm.fit_interact)
#> 
#> Call:
#> lm(formula = medv ~ lstat * age, data = Boston)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -15.806  -4.045  -1.333   2.085  27.552 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
#> lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
#> age         -0.0007209  0.0198792  -0.036   0.9711    
#> lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.149 on 502 degrees of freedom
#> Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
#> F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

3.6.3 Non-linear Transformations of the Predictors

Fit a quadratic regression model by adding a squared term via I(lstat^2)

lm.fit_poly <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit_poly)
#> 
#> Call:
#> lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.2834  -3.8313  -0.5295   2.3095  25.4148 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
#> lstat       -2.332821   0.123803  -18.84   <2e-16 ***
#> I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.524 on 503 degrees of freedom
#> Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
#> F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

Use the anova() function to compare models

As the output below shows, the F-statistic (135.2) and its near-zero p-value indicate that the model with the quadratic term fits substantially better than the simple linear model.

anova(lm.fit, lm.fit_poly)
#>   Res.Df      RSS Df Sum of Sq        F Pr(>F)
#> 1    504 19472.38 NA        NA       NA     NA
#> 2    503 15347.24  1  4125.138 135.1998      0

Fit a polynomial regression of degree 5

lm.fit_poly5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit_poly5)
#> 
#> Call:
#> lm(formula = medv ~ poly(lstat, 5), data = Boston)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13.5433  -3.1039  -0.7052   2.0844  27.1153 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
#> poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
#> poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
#> poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
#> poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
#> poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.215 on 500 degrees of freedom
#> Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
#> F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
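Because poly() produces orthogonal polynomials, models of increasing degree are nested and can be compared with sequential F-tests; a short sketch:

fit1 <- lm(medv ~ poly(lstat, 1), data = Boston)
fit2 <- lm(medv ~ poly(lstat, 2), data = Boston)
fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
anova(fit1, fit2, fit5)  # sequential F-tests for the added terms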

3.6.4 Qualitative Predictors


attach(Carseats)
lm.fit_qual <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit_qual)
#> 
#> Call:
#> lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.9208 -0.7503  0.0177  0.6754  3.3413 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
#> CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
#> Income              0.0108940  0.0026044   4.183 3.57e-05 ***
#> Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
#> Population          0.0001592  0.0003679   0.433 0.665330    
#> Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
#> ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
#> ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
#> Age                -0.0579466  0.0159506  -3.633 0.000318 ***
#> Education          -0.0208525  0.0196131  -1.063 0.288361    
#> UrbanYes            0.1401597  0.1124019   1.247 0.213171    
#> USYes              -0.1575571  0.1489234  -1.058 0.290729    
#> Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
#> Price:Age           0.0001068  0.0001333   0.801 0.423812    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.011 on 386 degrees of freedom
#> Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
#> F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

Use ‘contrasts’ to see how R handles qualitative variables

contrasts(ShelveLoc)
#>        Good Medium
#> Bad       0      0
#> Good      1      0
#> Medium    0      1
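By default R treats the first factor level alphabetically (‘Bad’) as the baseline. A sketch of changing the reference category with relevel():

Carseats$ShelveLoc <- relevel(Carseats$ShelveLoc, ref = "Good")  # make "Good" the baseline
contrasts(Carseats$ShelveLoc)  # dummy variables are now coded relative to "Good"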

Chapter 4: Lab Sessions

library(ISLR2) # Note: The second edition uses the ISLR2 package
library(MASS)
library(class)

4.6.1 The Stock Market Data

attach(Smarket)
names(Smarket)
#> [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
#> [7] "Volume"    "Today"     "Direction"
summary(Smarket)
#>       Year           Lag1                Lag2                Lag3          
#>  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
#>  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
#>  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
#>  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
#>  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
#>  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
#>       Lag4                Lag5              Volume           Today          
#>  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
#>  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
#>  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
#>  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
#>  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
#>  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
#>  Direction 
#>  Down:602  
#>  Up  :648  
#>            
#>            
#>            
#> 
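A natural next step is to examine pairwise correlations among the quantitative variables (‘Direction’, column 9, is excluded because cor() requires numeric input):

round(cor(Smarket[, -9]), 2)  # only Year and Volume are substantially correlated
plot(Volume)                  # trading volume increases over time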

4.6.2 Logistic Regression

Fit a logistic regression model to predict ‘Direction’ using ‘Lag1’ through ‘Lag5’ and ‘Volume’

glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial)
summary(glm.fits)
#> 
#> Call:
#> glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
#>     Volume, family = binomial, data = Smarket)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.126000   0.240736  -0.523    0.601
#> Lag1        -0.073074   0.050167  -1.457    0.145
#> Lag2        -0.042301   0.050086  -0.845    0.398
#> Lag3         0.011085   0.049939   0.222    0.824
#> Lag4         0.009359   0.049974   0.187    0.851
#> Lag5         0.010313   0.049511   0.208    0.835
#> Volume       0.135441   0.158360   0.855    0.392
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1731.2  on 1249  degrees of freedom
#> Residual deviance: 1727.6  on 1243  degrees of freedom
#> AIC: 1741.6
#> 
#> Number of Fisher Scoring iterations: 3
coef(glm.fits)
#>  (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5 
#> -0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938  0.010313068 
#>       Volume 
#>  0.135440659
glm.probs <- predict(glm.fits, type = "response")  # fitted P(Direction = "Up")
glm.pred <- rep("Down", 1250)                      # 1250 = nrow(Smarket)
glm.pred[glm.probs > 0.5] <- "Up"
table(glm.pred, Direction)
#>         Direction
#> glm.pred Down  Up
#>     Down  145 141
#>     Up    457 507
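Overall training accuracy follows directly from the confusion matrix; note that this is a training error rate, which tends to be optimistic:

mean(glm.pred == Direction)  # (507 + 145) / 1250 = 0.5216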

Create training (observations before 2005) and test (observations from 2005) sets

train <- Year < 2005
glm.fits.train <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial,
                      subset = train)
glm.probs.test <- predict(glm.fits.train, newdata = Smarket[!train, ],
                          type = "response")
glm.pred.test <- rep("Down", 252)  # 252 test observations (the year 2005)
glm.pred.test[glm.probs.test > 0.5] <- "Up"
table(glm.pred.test, Direction[!train])
#>              
#> glm.pred.test Down  Up
#>          Down   35  35
#>          Up     76 106
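The held-out accuracy for the smaller ‘Lag1’ + ‘Lag2’ model:

Direction.2005 <- Direction[!train]
mean(glm.pred.test == Direction.2005)  # (35 + 106) / 252 ≈ 0.56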

4.6.3 Linear Discriminant Analysis

Fit an LDA model using ‘Lag1’ and ‘Lag2’

lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit
#> Call:
#> lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
#> 
#> Prior probabilities of groups:
#>     Down       Up 
#> 0.491984 0.508016 
#> 
#> Group means:
#>             Lag1        Lag2
#> Down  0.04279022  0.03389409
#> Up   -0.03954635 -0.03132544
#> 
#> Coefficients of linear discriminants:
#>             LD1
#> Lag1 -0.6420190
#> Lag2 -0.5135293
plot(lda.fit)  # histograms of the linear discriminant values for each class

Get predictions on the test set

lda.pred <- predict(lda.fit, newdata = Smarket[!train, ])
names(lda.pred)
#> [1] "class"     "posterior" "x"
table(lda.pred$class, Direction[!train])
#>       
#>        Down  Up
#>   Down   35  35
#>   Up     76 106
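The class predictions come from a 50% threshold on the posterior probabilities stored in lda.pred$posterior; checking this reproduces the row totals of the table above:

sum(lda.pred$posterior[, "Down"] >= 0.5)  # 70 observations predicted Down (35 + 35)
sum(lda.pred$posterior[, "Down"] <  0.5)  # 182 observations predicted Up (76 + 106)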

4.6.4 Quadratic Discriminant Analysis

Fit a QDA model

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
#> Call:
#> qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
#> 
#> Prior probabilities of groups:
#>     Down       Up 
#> 0.491984 0.508016 
#> 
#> Group means:
#>             Lag1        Lag2
#> Down  0.04279022  0.03389409
#> Up   -0.03954635 -0.03132544

Get predictions on the test set

qda.pred <- predict(qda.fit, newdata = Smarket[!train, ])
table(qda.pred$class, Direction[!train])
#>       
#>        Down  Up
#>   Down   30  20
#>   Up     81 121
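QDA's test accuracy, (30 + 121) / 252, is just under 60%, the best of the methods tried so far on this data:

mean(qda.pred$class == Direction[!train])  # ≈ 0.599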

4.6.5 K-Nearest Neighbors

Create matrices of predictors for the training and test data

train.X <- as.matrix(Smarket[train, c("Lag1", "Lag2")])
test.X <- as.matrix(Smarket[!train, c("Lag1", "Lag2")])
train.Direction <- Smarket[train, "Direction"]

Run KNN with K=3

knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction[!train])
#>         
#> knn.pred Down Up
#>     Down   48 55
#>     Up     63 86

Run KNN with K=1


knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction[!train])
#>         
#> knn.pred Down Up
#>     Down   43 58
#>     Up     68 83
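Since knn() breaks ties at random, results are only reproducible with a seed. A sketch of scanning several values of K and comparing test accuracy:

set.seed(1)  # knn() breaks ties at random
ks <- c(1, 3, 5, 7, 9)
acc <- sapply(ks, function(k) {
  pred <- knn(train.X, test.X, train.Direction, k = k)
  mean(pred == Direction[!train])  # test accuracy for this k
})
data.frame(k = ks, accuracy = acc)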