Part A: Statistical Learning Regression and Classification Exercise A.3

Author

1601014

Chapter 3: Linear Regression

What it is: A simple but powerful tool used to predict a number (like price, sales, or temperature) based on the value of one or more other variables.

The Core Idea: It finds the best-fitting straight line (or a flat plane, with multiple variables) through your data points. This line is defined by the equation:

Y = β₀ + β₁X + ε

  • Y: The thing you want to predict (the “response”).

  • X: The thing you’re using to make the prediction (the “predictor”).

  • β₀ (Intercept): The predicted value of Y when X is zero.

  • β₁ (Slope): The average change in Y for a one-unit change in X.

  • ε (Error): The stuff the model can’t explain (random noise).

How it works: The “best” line is the one that minimizes the sum of squared differences between the actual data points and the points on the line. This is called the “least squares” method.
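
In symbols, least squares chooses the coefficient estimates that minimize the residual sum of squares (RSS); for a single predictor the minimizers have a well-known closed form:

\[
\mathrm{RSS} = \sum_{i=1}^{n} \bigl(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\bigr)^2,
\qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]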

What you get out of it:

  1. Predictions: Plug in a new X value, get a predicted Y value.

  2. Relationships: The slope (β₁) tells you the strength and direction of the relationship. A positive slope means Y increases with X. A negative slope means Y decreases with X.

  3. Significance: Statistical tests (p-values) tell you if the relationship you see is likely real or just a fluke.

In a nutshell: Linear regression is the go-to method for answering the question, “How much does Y change when X changes?” It’s the fundamental starting point for predicting numerical outcomes.
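
As a minimal sketch (simulated data, base R only, so the true slope of 2 is known in advance), the whole fit-and-check workflow looks like this:

# Simulate data with a known slope and fit a simple linear regression
set.seed(1)
x <- rnorm(100)                      # predictor
y <- 1 + 2 * x + rnorm(100)          # response: intercept 1, slope 2, plus noise
sim.fit <- lm(y ~ x)                 # least squares fit
summary(sim.fit)                     # estimated slope should be close to 2, with a very small p-value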

3.1 Simple Linear Regression

  • The model:
    Y = β₀ + β₁X + ε
  • Estimates coefficients using the least squares approach.
  • Key interpretations:
    • β₀ (intercept): the expected value of Y when X = 0
    • β₁ (slope): the average change in Y for a one-unit change in X

3.2 Multiple Linear Regression

What it is: An extension of simple linear regression that allows you to predict a single numerical outcome (Y) based on the values of two or more predictor variables (X₁, X₂, …, Xₚ).

The Core Idea: Instead of fitting a best-fit line, multiple linear regression fits a best-fit multi-dimensional hyperplane to the data. The model is represented by the equation:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

  • Y: The response variable you want to predict.

  • X₁, X₂, …, Xₚ: The predictor variables.

  • β₀: The intercept (the value of Y when all predictors are zero).

  • β₁, β₂, …, βₚ: The slope coefficients. Each one represents the average change in Y for a one-unit increase in its corresponding predictor, holding all other predictors constant. This is the most important concept (see the short sketch after this list).

  • ε: The error term (random noise).
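
As a brief illustration (a sketch using the Boston data from the MASS package, the same data as the lab below; the choice of lstat and rm as predictors is just for demonstration):

# Multiple regression of median home value on two predictors
library(MASS)
fit.two <- lm(medv ~ lstat + rm, data = Boston)
coef(fit.two)   # each slope is the average change in medv per one-unit change in that
                # predictor, holding the other predictor constant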

3.3 Other Considerations in the Regression Model

1. Qualitative Predictors (How to Use Categories)

You are not limited to numerical predictors. You can include categorical variables (e.g., Gender, Region, OwnsHouse).

  • How it’s done: You create dummy variables (0/1 indicators) for each category.

    • Example: For a predictor OwnsHouse with levels Yes/No, you create a new variable: X = 1 if Yes, 0 if No.
  • Interpretation: The coefficient for a dummy variable represents the average difference in the response between that category and the baseline (or reference) category (the one coded as 0).
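
A minimal sketch of this coding, using a small made-up data frame (the income values and the OwnsHouse factor below are hypothetical, purely to show the mechanics):

# R converts the factor into a 0/1 dummy variable inside lm()
toy <- data.frame(
  income    = c(40, 55, 48, 70, 62, 45),
  OwnsHouse = factor(c("No", "Yes", "No", "Yes", "Yes", "No"))
)
contrasts(toy$OwnsHouse)                      # dummy coding: No = 0 (baseline), Yes = 1
summary(lm(income ~ OwnsHouse, data = toy))   # the OwnsHouseYes coefficient is the average
                                              # income difference between Yes and No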

2. Extending the Model: Interactions & Non-Linearity

The standard model makes two big assumptions: additivity (the effect of each predictor on the response does not depend on the values of the other predictors) and linearity (the change in the response for a one-unit change in a predictor is the same regardless of the predictor's value). This section shows how to relax these assumptions when needed.

A. Interactions (Synergy)

  • What it is: When the effect of one predictor depends on the level of another predictor.

    • Example: The effectiveness of TV advertising on Sales might depend on the amount spent on Radio advertising. They work together synergistically.
  • How it’s done: Add a new predictor that is the product of the two original variables: X₁ * X₂.

  • Interpretation: The coefficient for the interaction term tells you how much the effect of one predictor changes for a one-unit increase in the other predictor.
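
To make this concrete, the two-predictor interaction model can be rewritten so that the slope on X₁ visibly depends on X₂:

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon
  = \beta_0 + (\beta_1 + \beta_3 X_2)\,X_1 + \beta_2 X_2 + \epsilon
\]

so a one-unit increase in X₂ shifts the effective slope on X₁ by β₃.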

B. Non-Linear Relationships

  • What it is: When the relationship between a predictor and the response is curved (e.g., diminishing returns).

    • Example: The relationship between Horsepower and MPG is curved: MPG drops steeply as horsepower increases at first, then the decline levels off, so a straight line fits poorly.
  • How it’s done: You can add polynomial terms (e.g., X², X³) or other transformations (e.g., log(X)) to the model.

  • Key Point: This is still linear regression because the model is still linear in the coefficients (β₀, β₁, β₂). You are just using non-linear predictors.
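
For example, a quadratic model is curved in X but still linear in the coefficients:

\[
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon
\]

Treating X² as just another predictor turns this back into ordinary least squares, which is exactly what the I(lstat^2) term does in the lab below.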

3. Potential Problems to Diagnose (The Watchlist)

A good data analyst doesn’t just fit a model; they check if it’s a good model. This section lists the top threats to a model’s validity.

  1. Non-linearity: The relationship isn’t a straight line.

    • Fix: Look at residual plots. Add polynomial or transformed terms.
  2. Correlation of Error Terms: The errors are not independent (common in time series data). This breaks the calculation of standard errors, making you overconfident in your results.

  3. Non-constant Variance (Heteroscedasticity): The spread of the residuals changes with the predicted value (e.g., a funnel shape in the residual plot).

    • Fix: Transform the response variable (e.g., use log(Y)).
  4. Outliers: Data points where the response Y is unusual given its predictors.

    • Fix: Identify them with studentized residuals. Investigate but don’t delete without cause.
  5. High-Leverage Points: Data points where the predictor X is unusual. These points can have an outsized influence on the model fit, pulling the regression line toward them.

    • Fix: Calculate leverage statistics to identify them.
  6. Collinearity: When two or more predictors are highly correlated with each other. This:

    • Makes it difficult to separate their individual effects on the response.

    • Inflates the standard errors of the coefficients, making it harder to detect statistically significant effects.

    • Fix: Calculate a VIF (Variance Inflation Factor). A VIF > 5 or 10 indicates a problem. Solutions include removing variables or combining them.
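
For reference, the VIF for the j-th coefficient is obtained by regressing Xⱼ on all of the other predictors:

\[
\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}
\]

where the R² in the denominator comes from that regression; a value near 1 (Xⱼ almost fully explained by the other predictors) pushes the VIF up sharply.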

3.6 Chapter 3 Lab: Linear Regression in R

# Load the MASS package which contains the Boston dataset
library(MASS)

# Attach the Boston dataset to the R search path
attach(Boston)

# View the first few rows of the dataset
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
# Fit a simple linear regression model: medv (median home value) as a function of lstat (lower status population)
lm.fit <- lm(medv ~ lstat, data = Boston)

# Display a detailed summary of the linear model
summary(lm.fit)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# View the names of the components in the lm.fit object
names(lm.fit)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
# Extract the estimated coefficients (intercept and slope)
coef(lm.fit)
(Intercept)       lstat 
 34.5538409  -0.9500494 
# Compute 95% confidence intervals for the model coefficients
confint(lm.fit)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505
# Predict medv for lstat values of 5, 10, and 15 with confidence intervals
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
# Predict medv for lstat values of 5, 10, and 15 with prediction intervals
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846
# Plot the data: lstat vs medv
plot(lstat, medv)

# Add the regression line to the plot
abline(lm.fit)

# Add the regression line with increased line width
abline(lm.fit, lwd = 3)

# Add the regression line in red color
abline(lm.fit, lwd = 3, col = "red")

# Plot the data points in red
plot(lstat, medv, col = "red")

# Plot the data points using solid dots
plot(lstat, medv, pch = 20)

# Plot the data points using plus signs
plot(lstat, medv, pch = "+")

# Display all available plotting symbols
plot(1:20, 1:20, pch = 1:20)

# Set up a 2x2 plotting area for diagnostic plots
par(mfrow = c(2, 2))

# Plot diagnostic plots for the linear model (residuals, leverage, etc.)
plot(lm.fit)

# Plot predicted values vs residuals
plot(predict(lm.fit), residuals(lm.fit))

# Plot predicted values vs studentized residuals
plot(predict(lm.fit), rstudent(lm.fit))

# Plot leverage values
plot(hatvalues(lm.fit))

# Identify the observation with the highest leverage
which.max(hatvalues(lm.fit))
375 
375 

library(MASS)
library(car)
Loading required package: carData
attach(Boston)
The following objects are masked from Boston (pos = 5):

    age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
    rm, tax, zn
# Fit the full model excluding 'age'
lm.fit1 <- lm(medv ~ . - age, data = Boston)

# View summary
summary(lm.fit1)

Call:
lm(formula = medv ~ . - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6054  -2.7313  -0.5188   1.7601  26.2243 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
crim         -0.108006   0.032832  -3.290 0.001075 ** 
zn            0.046334   0.013613   3.404 0.000719 ***
indus         0.020562   0.061433   0.335 0.737989    
chas          2.689026   0.859598   3.128 0.001863 ** 
nox         -17.713540   3.679308  -4.814 1.97e-06 ***
rm            3.814394   0.408480   9.338  < 2e-16 ***
dis          -1.478612   0.190611  -7.757 5.03e-14 ***
rad           0.305786   0.066089   4.627 4.75e-06 ***
tax          -0.012329   0.003755  -3.283 0.001099 ** 
ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
black         0.009321   0.002678   3.481 0.000544 ***
lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7343 
F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16
# Check multicollinearity
vif(lm.fit1)
    crim       zn    indus     chas      nox       rm      dis      rad 
1.792172 2.265290 3.991592 1.071227 4.084846 1.851068 3.620246 7.441492 
     tax  ptratio    black    lstat 
9.000474 1.788084 1.343044 2.599229 

3.6.4 Interaction Terms

# Interaction Terms
summary(lm(medv ~ lstat * age, data = Boston))

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

3.6.5 Non-linear Transformations of the Predictors

# Fit a second-degree polynomial regression model
# This models medv (median home value) as a quadratic function of lstat (lower status population)
lm.fit_poly <- lm(medv ~ lstat + I(lstat^2), data = Boston)

# Display a summary of the polynomial regression model
# This includes coefficients, standard errors, t-values, p-values, R-squared, and F-statistic
summary(lm.fit_poly)

Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
# Use the anova() function to quantify how much better the quadratic fit is than the simple linear fit
anova(lm.fit, lm.fit_poly)
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    504 19472                                 
2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Chapter 4: Classification – Summary of Key Lessons

4.1 Overview of Classification

  • Classification methods predict discrete class labels such as “Yes” or “No”, “Up” or “Down”.
  • Linear regression is a poor fit for such responses: coding the classes as numbers imposes an arbitrary ordering, and the fitted values can fall outside the 0 to 1 range needed for probabilities.
  • Evaluation focuses on classification accuracy (error rate) rather than mean squared error.


4.2 Logistic Regression

  • Logistic regression models the probability that Y = 1 through the logit (log-odds):
    log[p(X) / (1 − p(X))] = β₀ + β₁X₁ + … + βₚXₚ
    (the probability form is written out just after this list).
  • Predictions are interpreted as probabilities.
  • Logistic regression can be extended to:
    • Multiple logistic regression
    • Multinomial logistic regression (multi-class problems)
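
Solving the log-odds equation for the probability gives the logistic (sigmoid) form, which keeps every prediction between 0 and 1:

\[
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
\]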

4.4 Generative Models for Classification

4.4.1 Linear Discriminant Analysis (LDA)

  • Assumes predictors are normally distributed within each class.
  • Maximizes the separation between group means.
  • Effective when classes are well-separated and predictors are Gaussian.
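
In the single-predictor case with a shared within-class variance σ², LDA assigns an observation x to the class k with the largest discriminant score:

\[
\delta_k(x) = \frac{x\,\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)
\]

where μₖ is the class mean and πₖ the prior probability of class k.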

4.4.2 Quadratic Discriminant Analysis (QDA)

  • Allows different covariance matrices for each class.
  • More flexible but prone to overfitting with small data.

4.4.3 Naive Bayes

  • Assumes independence among predictors given the class.
  • Scales well to high-dimensional problems.
  • Simplified yet surprisingly effective.

4.5 Comparing Classification Methods

  • Analytical comparison based on assumptions and model flexibility.
  • Empirical comparison using metrics like accuracy, sensitivity, specificity, ROC curves.

4.6 Generalized Linear Models (GLMs)

  • Extends linear models to handle non-Gaussian responses.
  • Examples:
    • Poisson regression for count data (a brief simulated sketch follows this list)
    • Logistic regression as a special case of GLM with binomial outcome
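
A minimal sketch of a Poisson GLM on simulated count data (simulated so that no extra dataset is needed; the variable names are made up for illustration):

# Simulate counts whose log-rate is linear in x, then fit a Poisson regression (log link)
set.seed(2)
x <- runif(200)
counts <- rpois(200, lambda = exp(0.5 + 1.2 * x))
pois.fit <- glm(counts ~ x, family = poisson)
summary(pois.fit)   # coefficients are on the log scale; exp(coef(pois.fit)) gives rate ratios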

4.7 Lab: Classification Methods

  • Implements logistic regression, LDA, QDA, Naive Bayes, K-Nearest Neighbors (KNN), and Poisson regression in R, working mainly with the Smarket dataset.
  • Demonstrates:
    • Fitting models
    • Making predictions
    • Creating confusion matrices
    • Model comparison

Lab sessions

library(ISLR2) # Note: The second edition uses the ISLR2 package
Warning: package 'ISLR2' was built under R version 4.5.1

Attaching package: 'ISLR2'
The following object is masked from 'package:MASS':

    Boston
library(MASS)
library(class)
attach(Smarket)
names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
[7] "Volume"    "Today"     "Direction"
summary(Smarket)
      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648  
           
           
           
           

4.7.2 Logistic Regression

glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial)
summary(glm.fits)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
coef(glm.fits)
 (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5 
-0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938  0.010313068 
      Volume 
 0.135440659 
glm.probs <- predict(glm.fits, type = "response")

glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > 0.5] <- "Up"
table(glm.pred, Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
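
The overall training accuracy follows directly from this table (or can be computed in one line):

# Proportion of days classified correctly on the training data
mean(glm.pred == Direction)   # (145 + 507) / 1250, about 0.52, barely better than guessing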

Create training (pre-2005) and test (year 2005) sets, then refit the logistic model with only Lag1 and Lag2 on the training data

train <- Year < 2005
glm.fits.train <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial,
                      subset = train)
glm.probs.test <- predict(glm.fits.train, newdata = Smarket[!train, ],
                          type = "response")
glm.pred.test <- rep("Down", 252)
glm.pred.test[glm.probs.test > 0.5] <- "Up"
table(glm.pred.test, Direction[!train])
             
glm.pred.test Down  Up
         Down   35  35
         Up     76 106
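
The corresponding test-set accuracy, again readable from the table:

# Proportion of 2005 days classified correctly by the logistic model
mean(glm.pred.test == Direction[!train])   # (35 + 106) / 252, about 0.56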

4.7.3 Linear Discriminant Analysis

# Fit an LDA model using Lag1 and Lag2 on the training (pre-2005) data
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
plot(lda.fit)

#Get predictions on the test set
lda.pred <- predict(lda.fit, newdata = Smarket[!train, ])
names(lda.pred)
[1] "class"     "posterior" "x"        
table(lda.pred$class, Direction[!train])
      
       Down  Up
  Down   35  35
  Up     76 106
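
This confusion matrix matches the logistic regression one above, so the test accuracy is identical:

# LDA test-set accuracy
mean(lda.pred$class == Direction[!train])   # also (35 + 106) / 252, about 0.56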

4.7.4 Quadratic Discriminant Analysis

#Fit a QDA model

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544
qda.pred <- predict(qda.fit, newdata = Smarket[!train, ])
table(qda.pred$class, Direction[!train])
      
       Down  Up
  Down   30  20
  Up     81 121
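
QDA does somewhat better on the held-out 2005 data:

# QDA test-set accuracy
mean(qda.pred$class == Direction[!train])   # (30 + 121) / 252, about 0.60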

4.7.6 K-Nearest Neighbors

# Create matrices of predictors for the training and test data

train.X <- as.matrix(Smarket[train, c("Lag1", "Lag2")])
test.X <- as.matrix(Smarket[!train, c("Lag1", "Lag2")])
train.Direction <- Smarket[train, "Direction"]
#Run KNN with K=3
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction[!train])
        
knn.pred Down Up
    Down   48 56
    Up     63 85
#Run KNN with K=1
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction[!train])
        
knn.pred Down Up
    Down   43 58
    Up     68 83
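
To compare the two KNN fits side by side, both can be refit and scored in one place (knn() breaks distance ties at random, so a seed is set; the counts may differ very slightly from the tables above):

# Test accuracy for K = 1 and K = 3
set.seed(1)
knn.pred1 <- knn(train.X, test.X, train.Direction, k = 1)
knn.pred3 <- knn(train.X, test.X, train.Direction, k = 3)
mean(knn.pred1 == Direction[!train])   # roughly (43 + 83) / 252 = 0.50 from the table above
mean(knn.pred3 == Direction[!train])   # roughly (48 + 85) / 252, about 0.53 from the table above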