AI and Machine Learning for Public Health: Regression and Classification : Exercise 3

Author

Bongani Ncube

Published

September 8, 2025

Who’s that guy?!

Bongani Ncube

  • MSc student, Wits
  • Training as a statistician
  • Machine learning and Mathematics fanatic

This Class

Slides, R code, data and practicals

Introduction

Definition

  • This chapter is about linear regression, a very simple approach for supervised learning.
  • In particular, linear regression is a useful tool for predicting a quantitative response.

Example problems solved by linear regression

  1. Is there a relationship between advertising budget and sales?

Our first goal should be to determine whether the data provide evidence of an association between advertising expenditure and sales. If the evidence is weak, then one might argue that no money should be spent on advertising!

  2. How strong is the relationship between advertising budget and sales?

Assuming that there is a relationship between advertising and sales, we would like to know the strength of this relationship. In other words, given a certain advertising budget, can we predict sales with a high level of accuracy? This would be a strong relationship. Or is a prediction of sales based on advertising expenditure only slightly better than a random guess? This would be a weak relationship.

  3. Which media contribute to sales?

Do all three media—TV, radio, and newspaper—contribute to sales, or do just one or two of the media contribute? To answer this question, we must find a way to separate out the individual effects of each medium when we have spent money on all three media.

  4. How accurately can we estimate the effect of each medium on sales?

For every dollar spent on advertising in a particular medium, by what amount will sales increase? How accurately can we predict this amount of increase?

  5. How accurately can we predict future sales?

For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?

  6. Is the relationship linear?

If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.

Regression

Simple Linear Regression

A straightforward approach for predicting a quantitative response Y from a single predictor X, assuming an approximately linear relationship. Mathematically, we can write this linear relationship as

Y \approx \beta_0 + \beta_1 X

Estimating the Coefficients

  • In the above equation, \beta_0 and \beta_1 are two unknown constants that represent the intercept and slope terms in the linear model.
  • Our goal is to obtain estimates of the coefficients \hat{\beta}_0 and \hat{\beta}_1 such that the linear model fits the data well.
  • There are a number of ways of evaluating fit to data, but by far the most common approach is the least squares criterion. We define the residual sum of squares (RSS) as

\text{RSS} = e_1^2 + e_2^2 + \dots + e_n^2

where e_i = y_i - \hat{y}_i is the ith (of n) residual. The least squares approach chooses \hat{\beta}_0 and \hat{\beta}_1 to minimize the RSS. Using some calculus, one can show that

\begin{align} \hat{\beta}_1 &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align}
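As a quick check of these formulas, here is a minimal sketch on simulated data that computes \hat{\beta}_0 and \hat{\beta}_1 by hand and compares them with lm():

# Simulated data, for illustration only
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

# Least squares estimates from the formulas above
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))  # should agree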

Illustration

Strategy

  • The same derivation gives the simple linear regression fit to the Advertising data, where \hat{\beta}_0 = 7.03 and \hat{\beta}_1 = 0.048.

Assessing the Accuracy of the Coefficient Estimates

  • If we use the sample mean \hat{\mu} to estimate \mu, this estimate is unbiased, in the sense that on average we expect \hat{\mu} to equal \mu.
  • It means that on the basis of one particular set of observations y_1,\dots, y_n, \hat{\mu} might overestimate \mu, and on the basis of another set of observations, \hat{\mu} might underestimate \mu. But if we could average a huge number of estimates of \mu obtained from a huge number of sets of observations, then this average would exactly equal \mu.
  • An unbiased estimator does not systematically over- or under-estimate the true parameter.
  • We have established that the average of \hat{\mu}’s over many data sets will be very close to \mu, but that a single estimate \hat{\mu} may be a substantial underestimate or overestimate of \mu.
  • How far off will that single estimate of \hat{\mu} be? In general, we answer this question by computing the standard error of \hat{\mu}, written as SE(\hat{\mu}).

\text{Var}(\hat{\mu}) = \text{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}

To compute the standard errors associated with \hat{\beta}_0 and \hat{\beta}_1, we use the following formulas:

\begin{align} \text{SE}(\hat{\beta}_0)^2 &= \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right] \\ \text{SE}(\hat{\beta}_1)^2 &= \frac{\sigma^2}{\sum (x_i - \bar{x})^2} \end{align}

where \sigma^2 = \text{Var}(\epsilon).

In general, \sigma^2 is not known, but can be estimated from the data. This estimate of \sigma is known as the residual standard error, and is given by the formula \text{RSE} = \sqrt{\text{RSS}/(n-2)}.

For linear regression, we get approximate 95% confidence intervals for the coefficients as:

\begin{align} \hat{\beta}_1 \pm 2 \text{SE}(\hat{\beta}_1) \\ \hat{\beta}_0 \pm 2 \text{SE}(\hat{\beta}_0). \end{align}
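As an illustration (a sketch on simulated data, not the Advertising fit), the RSE and the approximate 95% intervals can be computed by hand and compared with sigma() and confint():

set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

n   <- length(y)
rss <- sum(residuals(fit)^2)
sqrt(rss / (n - 2))   # RSE by hand
sigma(fit)            # same quantity from the fitted model

# Approximate 95% CIs: estimate +/- 2 * SE, vs. the exact t-based confint()
se <- summary(fit)$coefficients[, "Std. Error"]
cbind(coef(fit) - 2 * se, coef(fit) + 2 * se)
confint(fit)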

Assessing the Accuracy of the Model

The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R^2 statistic.

Residual Standard Error

In the output of broom::glance(), the residual standard error (RSE) is sigma, the variance explained R^2 is r.squared, and the F-statistic is statistic.

library(gt)
library(broom)   # glance()
library(dplyr)   # transmute(), mutate(), across()
library(tidyr)   # pivot_longer()

# lm_sales_tv is assumed to be the simple regression of sales on TV from the
# Advertising data, fit earlier (not shown in this chunk)
glance(lm_sales_tv) %>%
  transmute(`Residual standard error` = round(sigma, 2),
            `R2` = round(r.squared, 3), `F-statistic` = round(statistic, 1)) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "Quantity", values_to = "Value") %>%
  gt()
Quantity Value
Residual standard error 3.26
R2 0.612
F-statistic 312.1

R^2 Statistic

The R^2 statistic provides an alternative measure of fit. It always takes on a value between 0 and 1, and is independent of the scale of Y.

R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}

where \text{TSS} = \sum (y_i - \bar{y})^2 is the total sum of squares.
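The same quantity is easy to verify by hand; the following sketch (simulated data, for illustration) computes R^2 from the RSS and TSS and checks it against summary():

set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)
rss <- sum(residuals(fit)^2)
1 - rss / tss              # R^2 from the definition
summary(fit)$r.squared     # should match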

Multiple Linear Regression

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor.

The model with p predictors takes the form:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon.

We interpret the slope \beta_j as the average effect on Y of a one-unit increase in X_j, holding all other predictors fixed.

Estimating the Regression Coefficients

The parameters are estimated using the same least squares approach as simple linear regression: choose \beta_0 \dots \beta_p to minimize RSS.

Some Important Questions

One: Is There a Relationship Between the Response and Predictors?

We begin by testing whether the predictors are related to the response. Formally:

H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0 \quad \text{vs.} \quad H_a: \text{at least one } \beta_j \neq 0

This is assessed using the F-statistic:

F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)}

If F \approx 1, there is little evidence against H_0. Large values of F indicate that the predictors collectively improve model fit and that at least one is associated with the response.
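The F-statistic can likewise be reproduced from the RSS and TSS. A sketch on simulated data with three predictors, two of which are truly related to the response:

set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- 1 + 0.5 * X[, "x1"] - 0.3 * X[, "x2"] + rnorm(n)
dat <- data.frame(y, X)

fit <- lm(y ~ ., data = dat)
tss <- sum((y - mean(y))^2)
rss <- sum(residuals(fit)^2)
((tss - rss) / p) / (rss / (n - p - 1))   # F-statistic from the formula
summary(fit)$fstatistic["value"]          # should agree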


Two: Deciding on Important Variables

The next question is which predictors matter most. This is known as variable selection.
Common criteria include:

  • Adjusted R^2
  • Akaike Information Criterion (AIC)
  • Bayesian Information Criterion (BIC)
  • Mallow’s C_p

Stepwise procedures (forward, backward, mixed) exist but can be unstable and lead to overfitting. More robust methods include penalized regression (e.g., LASSO, ridge) or cross-validation approaches.
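For concreteness, here is a hedged sketch of backward stepwise selection by AIC using step(); the data and variable names are simulated for illustration only:

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 0.8 * dat$x1 - 0.5 * dat$x2 + rnorm(n)   # x3 and x4 are pure noise

full_fit <- lm(y ~ ., data = dat)
step_fit <- step(full_fit, direction = "backward", trace = FALSE)  # AIC-based
formula(step_fit)   # the noise predictors should usually be dropped

# Criteria for comparing candidate models
c(adj_R2 = summary(step_fit)$adj.r.squared,
  AIC = AIC(step_fit), BIC = BIC(step_fit))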


Three: Model Fit

Model fit is typically quantified by:

  • Residual Standard Error (RSE) – the average size of the residuals.
  • R^2 – proportion of variance in the response explained by the predictors.

In multiple regression:

R^2 = \text{Cor}(Y, \hat{Y})^2

However, R^2 always increases as more variables are added, even if they are uninformative. Thus, adjusted R^2 or information criteria are preferred when comparing models.


Four: Predictions

With a fitted regression model, predictions of the response can be made. The uncertainty in predictions comes from three main sources:

  1. Sampling variability in the estimated coefficients (\hat{\beta}).
  2. Model bias if the assumed linearity does not hold.
  3. Irreducible error \epsilon.

Prediction intervals are always wider than confidence intervals since they incorporate both reducible and irreducible error.


Other Considerations in Regression

Qualitative Predictors

Categorical variables can also be included:

  • For two levels, a single indicator variable is sufficient.
  • For more than two levels, multiple dummy variables are used.

The significance of such variables is often tested with an F-test, comparing models with and without the categorical predictors.


Extensions of the Linear Model

The linear model assumes additivity and linearity. These assumptions can be relaxed.

  • Interaction effects allow the effect of one predictor to depend on the level of another.

For example:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 X_2) + \epsilon

Including interactions may reveal synergies (combined effects of predictors). By the hierarchical principle, if an interaction is included, the corresponding main effects should also remain in the model.
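In R's formula syntax, x1 * x2 expands to x1 + x2 + x1:x2, so the hierarchical principle is respected automatically. A small sketch on simulated predictors (the Boston lab below does the same with lstat and age):

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + 0.8 * x1 * x2 + rnorm(n)

coef(lm(y ~ x1 * x2))          # main effects plus interaction
coef(lm(y ~ x1 + x2 + x1:x2))  # equivalent model, written out explicitly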

Comparison of Linear Regression with K-Nearest Neighbors

Parametric methods are often easy to fit and easy to interpret, but they rely on a strong assumption about the form of f(X).

Non-parametric methods do not explicitly assume a form for f(X) and therefore provide an alternative and more flexible approach to regression. One of the simplest and best-known methods is K-nearest neighbors regression (closely related to the KNN classifier from Chapter 2).

Given a value for K and a prediction point x_0, KNN regression first identifies the K training observations closest to x_0, represented by the set \mathcal{N}_0. It then estimates f(x_0) as the average of the training responses in \mathcal{N}_0:

\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i.

In general, the optimal value for K will depend on the bias-variance tradeoff. A small value for K provides the most flexible fit, which will have low bias but high variance. This variance is due to the fact that the prediction in a given region is entirely dependent on just one observation. In contrast, larger values of K provide a smoother and less variable fit; the prediction in a region is an average of several points, and so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in f(X). In Chapter 5, we introduce several approaches for estimating test error rates. These methods can be used to identify the optimal value of K in KNN regression.
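To make the idea concrete, here is a minimal hand-rolled KNN regression sketch with one predictor and simulated data; note how K = 1 tracks individual points while a large K smooths heavily:

# Average the responses of the K training points nearest to each x0
knn_reg <- function(x_train, y_train, x0, K = 5) {
  sapply(x0, function(x) {
    nn <- order(abs(x_train - x))[seq_len(K)]   # indices of the K nearest x's
    mean(y_train[nn])
  })
}

set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
x_grid <- seq(0, 10, length.out = 5)

knn_reg(x, y, x_grid, K = 1)    # very flexible: low bias, high variance
knn_reg(x, y, x_grid, K = 25)   # smoother: higher bias, lower variance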

Classification

Introduction

In many situations, the response variable is instead qualitative. For example, eye color is qualitative. Qualitative variables are often referred to as categorical; the terms are used interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process known as classification. The models covered include logistic regression (and Poisson regression), linear discriminant analysis, quadratic discriminant analysis, naive Bayes, and K-nearest neighbors.

Problems where classification is used

  1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?

  2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.

  3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

Overview of classification

Why Not Linear Regression?

  • Linear regression cannot predict un-ordered qualitative responses with more than two levels.
  • It is possible to use linear regression to predict a binary (two level) response. For example, if we code stroke and drug overdose as dummy variables:

Y = \begin{cases} 0 & \text{if stroke;} \\ 1 & \text{if drug overdose}\\ \end{cases}

Then we predict stroke if \hat{Y} \leq 0.5 and overdose if \hat{Y} > 0.5. It turns out that these probability estimates are not unreasonable, but there can be issues:

  • If we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities! Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates.

Illustration

Logistic Regression

In logistic regression, we model the probability of Y belonging to a class, rather than the response Y itself. The probability of default given balance can be written:

\text{Pr}(\text{default = Yes}|\text{balance}) = p(\text{balance}).

One might predict a default for an individual with p(\text{balance}) > 0.5. Alternatively, one may lower the threshold to be more conservative, e.g. predicting default when p(\text{balance}) > 0.1.

The Logistic Model

As previously discussed, we could model the probability as linear:

p(X) = \beta_0 + \beta_1 X

but this could give probabilities outside of the range 0 to 1. We must instead model p(X) using a function whose outputs lie between 0 and 1 for all values of X. Many functions meet this description, but logistic regression uses the logistic function:

p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1 X}}.
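A quick numerical sketch (the coefficient values below are made up for illustration, not fitted from the Default data): the logistic function always returns values in (0, 1), and base R's plogis() computes it directly:

beta0 <- -10; beta1 <- 0.005        # illustrative values only
balance <- c(1000, 2000, 3000)

z <- beta0 + beta1 * balance
exp(z) / (1 + exp(z))               # logistic function by hand
plogis(z)                           # same result with the built-in logistic CDF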

Estimating the Regression Coefficients

We fit logistic regression models with maximum likelihood, which seeks estimates for \beta_0 and \beta_1 such that the predicted probabilities \hat{p}(x_i) correspond as closely as possible to the observed values y_i. This idea is formalized using a likelihood function:

\ell (\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'})).

We find the estimates \hat{\beta}_0 and \hat{\beta}_1 by maximizing this likelihood function. Note that the least squares approach to linear regression is a special case of maximum likelihood.
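To see maximum likelihood in action, the following sketch (simulated data, illustration only) minimizes the negative log-likelihood with optim() and checks the estimates against glm():

set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))

# Negative log-likelihood of the Bernoulli/logistic model
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}
optim(c(0, 0), negloglik)$par           # ML estimates by direct optimization
coef(glm(y ~ x, family = binomial))     # should be very close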

Multiple Logistic Regression

The extension to p predictors is straightforward:

\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p.

Multinomial Logistic Regression

For predicting K > 2 classes, we can extend logistic regression in a method called multinomial logistic regression. To do this, we choose a single class K to serve as the baseline. Then the probability of another class k is:

\text{Pr}(Y = k|X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p}}

for k = 1, \dots, K - 1. Then for the baseline class K:

\text{Pr}(Y = K|X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p}}.

The log-odds of a class k is then linear in the predictors:

\log \left( \frac{\text{Pr} (Y = k| X = x)}{\text{Pr} (Y = K| X = x)}\right) = \beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p.
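In R, one way to fit this baseline-category model is nnet::multinom(); the sketch below uses the built-in iris data purely for illustration:

library(nnet)

fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
coef(fit)                            # one row of coefficients per non-baseline class
head(predict(fit, type = "probs"))   # fitted class probabilities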

The choice of class K as the baseline is arbitrary: a different baseline changes the coefficient estimates, but the predictions (fitted values) and model metrics remain the same.

When performing multinomial logistic regression, we will sometimes use an alternative to dummy coding called softmax coding. The softmax coding is equivalent to the coding just described in the sense that the fitted values, log odds between any pair of classes, and other key model outputs will remain the same, regardless of coding. But the softmax coding is used extensively in some areas of the machine learning literature (and will appear again in Chapter 10), so it is worth being aware of it. In the softmax coding, rather than selecting a baseline class, we treat all K classes symmetrically, and assume that for k = 1, \dots, K,

\text{Pr}(Y = k|X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p}}.

Rather than estimating coefficients for K − 1 classes, we actually estimate coefficients for all K classes. The log odds ratio between the kth and k′th classes equals

\log \left( \frac{\text{Pr} (Y = k| X = x)}{\text{Pr} (Y = k'| X = x)} \right) = (\beta_{k0} - \beta_{k'0}) + (\beta_{k1} - \beta_{k'1}) x_1 + \dots + (\beta_{kp} - \beta_{k'p}) x_p.

Note that in the case of K = 2, the numerator becomes p(X) and the denominator 1 - p(X), which is exactly the same as the two-class logistic regression formula (Equation 4.6).


Linear Discriminant Analysis for p = 1

For the case of one predictor, we start by assuming that f_k (x) is normal or Gaussian, which has the following density in one dimension:

f_k (x) = \frac{1}{\sqrt{2 \pi} \sigma_k} \exp \left( - \frac{1}{2\sigma_k^2} (x - \mu_k)^2\right)

where \mu_k and \sigma_k^2 are the mean and variance of the kth class. For now, assume that all classes share the same variance \sigma^2. Plugging the above into Bayes’ theorem, we have:

p_k (x) = \frac{\pi_k \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{1}{2\sigma^2} (x - \mu_k)^2\right)} {\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{1}{2\sigma^2} (x - \mu_l)^2\right)}.

The Bayes classifier assigns an observation X = x to the class for which the above is largest. Taking the log and rearranging, this is equivalent to choosing the class for which:

\delta_k (x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} + \log(\pi_k)

is largest.
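A short sketch (simulated one-predictor data, not the lab's Smarket example) that estimates \mu_k, \pi_k and the pooled \sigma^2, plugs them into \delta_k(x), and classifies a new point:

set.seed(1)
n <- 200
y <- sample(c(1, 2), n, replace = TRUE, prob = c(0.6, 0.4))
x <- rnorm(n, mean = ifelse(y == 1, 0, 2), sd = 1)

pi_hat <- table(y) / n
mu_hat <- tapply(x, y, mean)
sig2   <- sum((x - mu_hat[y])^2) / (n - 2)   # pooled variance estimate

delta <- function(x0, k) x0 * mu_hat[k] / sig2 - mu_hat[k]^2 / (2 * sig2) + log(pi_hat[k])
x0 <- 1
which.max(c(delta(x0, 1), delta(x0, 2)))     # predicted class for x0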

Linear Discriminant Analysis for p > 1

Extending the LDA classifier to multiple predictors involves a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix.

The LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(\mu_k, \Sigma). The Bayes classifier assigns an observation X = x to the class for which

\delta_k (x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k

is largest.

As in the univariate case, the LDA method involves estimating the unknown parameters \mu_k, \pi_k and \Sigma. The quantities \hat{\delta}_k (x) are then calculated, and each observation x is assigned to the class with the largest \hat{\delta}_k (x).

Quadratic Discriminant Analysis

  • LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes.
  • Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the kth class is of the form X \sim N(\mu_k, \Sigma_k), where \Sigma_k is a covariance matrix for the kth class.

An observation X = x is assigned to the class for which

\begin{align} \delta_k (x) = - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1}_k (x - \mu_k) - \frac{1}{2} \log |\Sigma_k|+ \log \pi_k. \end{align}

is largest. QDA gets its name from the fact that x appears as a quadratic function in the first term of the above equation.

In general, use LDA when there are relatively few training observations (estimating a separate covariance matrix for each class would be too variable), and use QDA when the training set is large. Also consider QDA if you have some intuition that the decision boundary is non-linear.

Naive Bayes

The naive Bayes classifier also estimates the conditional probability f_k (x) = \text{Pr}(X|Y = k). In LDA, we made the very strong assumption that f_k is the density function of a multivariate normal distribution with mean \mu_k and shared covariance \Sigma. In QDA, the covariance \Sigma_k is class-specific. The naive Bayes classifier instead makes a single assumption: within the kth class, the p predictors are independent.

Mathematically:

f_k (x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \dots \times f_{kp}(x_p).

where f_{kj} is the density function of the jth predictor among observations in the kth class.

Under the naive Bayes assumption, the posterior probability becomes:

\text{Pr}(Y = k|X = x) = \frac{\pi_k \times f_{k1}(x_1) \times \dots \times f_{kp} (x_p)}{\sum_{l=1}^K \pi_l \times f_{l1}(x_1) \times \dots \times f_{lp} (x_p)}

for k = 1, \dots, K.
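As a concrete (hypothetical) example of this formula, assuming Gaussian f_{kj} within each class (one of the estimation options discussed next), the posterior for a single test point can be computed by hand on simulated data:

set.seed(1)
n  <- 300
y  <- sample(c("A", "B"), n, replace = TRUE)
x1 <- rnorm(n, mean = ifelse(y == "A", 0, 1.5))
x2 <- rnorm(n, mean = ifelse(y == "A", 0, -1))

classes <- c("A", "B")
post_unnorm <- sapply(classes, function(k) {
  prior <- mean(y == k)                                 # pi_k
  f1 <- dnorm(0.5,  mean(x1[y == k]), sd(x1[y == k]))   # f_k1 at x1 = 0.5
  f2 <- dnorm(-0.2, mean(x2[y == k]), sd(x2[y == k]))   # f_k2 at x2 = -0.2
  prior * f1 * f2
})
post_unnorm / sum(post_unnorm)   # posterior Pr(Y = k | X = x)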

To estimate the one-dimensional f_{kj} from x_j, we have a few options:

  • Assume that the jth predictor is drawn from a univariate normal distribution.
    • X_j | Y = k \sim N(\mu_{jk}, \sigma_{jk}^2).
    • This is like QDA except the covariance matrix is diagonal because the predictors are independent.
  • Use a non-parametric estimate for f_{kj}.
    • A simple way: estimate f_{kj}(x_j) as the fraction of training observations in the kth class that fall in the same histogram bin as x_j.
    • Alternatively, use a kernel density estimator, which is essentially a smoothed version of a histogram.
  • For qualitative X_j, simply count the proportion of training observations for the jth predictor corresponding to each class.

Lab: Regression

Load the boston data rather than the full ISLR2 package:

boston <- ISLR2::Boston

Simple Linear Regression

Regress median value of owner-occupied homes medv on percentage of houses with lower socioeconomic status lstat:

lm.fit <- lm(medv ~ lstat, data = boston)
summary(lm.fit)
#> 
#> Call:
#> lm(formula = medv ~ lstat, data = boston)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -15.168  -3.990  -1.318   2.034  24.500 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
#> lstat       -0.95005    0.03873  -24.53   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.216 on 504 degrees of freedom
#> Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
#> F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

Compute confidence intervals for the coefficient estimates with confint() (confidence and prediction intervals for predictions at particular values of lstat are computed with predict() further below):

confint(lm.fit)
#>                 2.5 %     97.5 %
#> (Intercept) 33.448457 35.6592247
#> lstat       -1.026148 -0.8739505

Plot the relationship between medv and lstat:

plot(boston$lstat, boston$medv, pch = 20, col = "gray40")
abline(lm.fit, col = "blue", lwd = 2)


boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point() +
  geom_abline(slope = coef(lm.fit)["lstat"],
              intercept = coef(lm.fit)["(Intercept)"],
              size = 1.0, color = td_colors$nice$day9_yellow)

Model Performance

par(mfrow = c(2, 2))
plot(lm.fit) 

To display model diagnostics, we can call plot() on the model object as we have before, but I like the performance package because it uses ggplot2:

performance::check_model(lm.fit)

par(mfrow = c(1, 1))

predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
#>        fit      lwr      upr
#> 1 29.80359 29.00741 30.59978
#> 2 25.05335 24.47413 25.63256
#> 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
#>        fit       lwr      upr
#> 1 29.80359 17.565675 42.04151
#> 2 25.05335 12.827626 37.27907
#> 3 20.30310  8.077742 32.52846

Multiple Linear Regression

Few Variables

lm.fit2 <- lm(medv ~ lstat + age, data = boston)
summary(lm.fit2)
#> 
#> Call:
#> lm(formula = medv ~ lstat + age, data = boston)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -15.981  -3.978  -1.283   1.968  23.158 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
#> lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
#> age          0.03454    0.01223   2.826  0.00491 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.173 on 503 degrees of freedom
#> Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
#> F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

All Predictors

Fit to all predictors and check VIF with the performance package:

lm_medv_all <- lm(medv ~ ., data = boston)
summary(lm_medv_all)
#> 
#> Call:
#> lm(formula = medv ~ ., data = boston)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.1304  -2.7673  -0.5814   1.9414  26.2526 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
#> crim         -0.121389   0.033000  -3.678 0.000261 ***
#> zn            0.046963   0.013879   3.384 0.000772 ***
#> indus         0.013468   0.062145   0.217 0.828520    
#> chas          2.839993   0.870007   3.264 0.001173 ** 
#> nox         -18.758022   3.851355  -4.870 1.50e-06 ***
#> rm            3.658119   0.420246   8.705  < 2e-16 ***
#> age           0.003611   0.013329   0.271 0.786595    
#> dis          -1.490754   0.201623  -7.394 6.17e-13 ***
#> rad           0.289405   0.066908   4.325 1.84e-05 ***
#> tax          -0.012682   0.003801  -3.337 0.000912 ***
#> ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
#> lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.798 on 493 degrees of freedom
#> Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
#> F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16
library(car)
vif(lm_medv_all)
#>     crim       zn    indus     chas      nox       rm      age      dis 
#> 1.767486 2.298459 3.987181 1.071168 4.369093 1.912532 3.088232 3.954037 
#>      rad      tax  ptratio    lstat 
#> 7.445301 9.002158 1.797060 2.870777
performance::check_collinearity(lm_medv_all)
#> # Check for Multicollinearity
#> 
#> Low Correlation
#> 
#>     Term  VIF    VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>     crim 1.77 [1.58,  2.01]         1.33      0.57     [0.50, 0.63]
#>       zn 2.30 [2.03,  2.64]         1.52      0.44     [0.38, 0.49]
#>    indus 3.99 [3.45,  4.64]         2.00      0.25     [0.22, 0.29]
#>     chas 1.07 [1.02,  1.28]         1.03      0.93     [0.78, 0.98]
#>      nox 4.37 [3.77,  5.09]         2.09      0.23     [0.20, 0.26]
#>       rm 1.91 [1.70,  2.19]         1.38      0.52     [0.46, 0.59]
#>      age 3.09 [2.69,  3.57]         1.76      0.32     [0.28, 0.37]
#>      dis 3.95 [3.42,  4.60]         1.99      0.25     [0.22, 0.29]
#>  ptratio 1.80 [1.61,  2.05]         1.34      0.56     [0.49, 0.62]
#>    lstat 2.87 [2.51,  3.32]         1.69      0.35     [0.30, 0.40]
#> 
#> Moderate Correlation
#> 
#>  Term  VIF    VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>   rad 7.45 [6.37,  8.73]         2.73      0.13     [0.11, 0.16]
#>   tax 9.00 [7.69, 10.58]         3.00      0.11     [0.09, 0.13]

The rad (accessibility to radial highways) and tax (property tax rate) variables have moderate VIF.

Interaction Terms

Interaction between lstat and age:

library(gt)
lm.fit_age <- lm(medv ~ lstat * age, data = boston)
tidy_custom(lm.fit_age) %>%
  gt()
term coefficient std.error t-statistic p-value
(Intercept) 36.089 1.4698 24.55 <0.001
lstat -1.392 0.1675 -8.31 <0.001
age -0.001 0.0199 -0.04 0.971
lstat:age 0.004 0.0019 2.24 0.025

Non-linear Transformations of the Predictors

Perform a regression of medv onto lstat and lstat^2, and compare fits with anova:

lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = boston)
summary(lm.fit2)
#> 
#> Call:
#> lm(formula = medv ~ lstat + I(lstat^2), data = boston)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.2834  -3.8313  -0.5295   2.3095  25.4148 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
#> lstat       -2.332821   0.123803  -18.84   <2e-16 ***
#> I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.524 on 503 degrees of freedom
#> Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
#> F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

Compare Fits with anova()

anova(lm.fit, lm.fit2)
#> Analysis of Variance Table
#> 
#> Model 1: medv ~ lstat
#> Model 2: medv ~ lstat + I(lstat^2)
#>   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
#> 1    504 19472                                 
#> 2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The quadratic model is superior by the F-test. Check the residuals:

performance::check_model(lm.fit2, check = "linearity")

The quadratic term is an obvious improvement, but there is still some non-linearity at large values of medv.

The poly() function is a quick way to include higher order terms:

# Orthogonalized predictors by default
lm.fit5 <- lm(medv ~ poly(lstat, 5), data = boston)
tidy_custom(lm.fit5)
#> # A tibble: 6 × 5
#>   term            coefficient std.error `t-statistic` `p-value`
#>   <chr>                 <dbl>     <dbl>         <dbl> <chr>    
#> 1 (Intercept)            22.5     0.232         97.2  <0.001   
#> 2 poly(lstat, 5)1      -152.      5.21         -29.2  <0.001   
#> 3 poly(lstat, 5)2        64.2     5.21          12.3  <0.001   
#> 4 poly(lstat, 5)3       -27.1     5.21          -5.19 <0.001   
#> 5 poly(lstat, 5)4        25.5     5.21           4.88 <0.001   
#> 6 poly(lstat, 5)5       -19.3     5.21          -3.69 <0.001
# Raw polynomials
lm.fit5_raw <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = boston)
tidy_custom(lm.fit5_raw)
#> # A tibble: 6 × 5
#>   term                        coefficient std.error `t-statistic` `p-value`
#>   <chr>                             <dbl>     <dbl>         <dbl> <chr>    
#> 1 (Intercept)                      67.7      3.60           18.8  <0.001   
#> 2 poly(lstat, 5, raw = TRUE)1     -12.0      1.53           -7.86 <0.001   
#> 3 poly(lstat, 5, raw = TRUE)2       1.27     0.223           5.7  <0.001   
#> 4 poly(lstat, 5, raw = TRUE)3      -0.068    0.0144         -4.75 <0.001   
#> 5 poly(lstat, 5, raw = TRUE)4       0.002    0.0004          4.14 <0.001   
#> 6 poly(lstat, 5, raw = TRUE)5       0        0              -3.69 <0.001

Qualitative Predictors

Load carseats:

carseats <- ISLR2::Carseats

The contrasts() function shows the dummy coding for the qualitative ShelveLoc variable:

contrasts(carseats$ShelveLoc)
#>        Good Medium
#> Bad       0      0
#> Good      1      0
#> Medium    0      1

Fit the model and print the coefficients related to ShelveLoc:

lm_sales <- lm(Sales ~ . + Income:Advertising + Price:Age,
               data = carseats)
tidy_custom(lm_sales) %>%
  filter(str_detect(term, "ShelveLoc|Intercept"))
#> # A tibble: 3 × 5
#>   term            coefficient std.error `t-statistic` `p-value`
#>   <chr>                 <dbl>     <dbl>         <dbl> <chr>    
#> 1 (Intercept)            6.58     1.01           6.52 <0.001   
#> 2 ShelveLocGood          4.85     0.153         31.7  <0.001   
#> 3 ShelveLocMedium        1.95     0.126         15.5  <0.001

Lab 2 : Classification

The Stock Market Data

This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date). Our goal is to predict Direction (a qualitative response) using the other features.

I like the skimr package for summarizing a data set:

smarket <- ISLR2::Smarket
attach(smarket)
skimr::skim(smarket)
Data summary
Name smarket
Number of rows 1250
Number of columns 9
_______________________
Column type frequency:
factor 1
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Direction 0 1 FALSE 2 Up: 648, Dow: 602

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2003.02 1.41 2001.00 2002.00 2003.00 2004.00 2005.00 ▇▇▇▇▇
Lag1 0 1 0.00 1.14 -4.92 -0.64 0.04 0.60 5.73 ▁▃▇▁▁
Lag2 0 1 0.00 1.14 -4.92 -0.64 0.04 0.60 5.73 ▁▃▇▁▁
Lag3 0 1 0.00 1.14 -4.92 -0.64 0.04 0.60 5.73 ▁▃▇▁▁
Lag4 0 1 0.00 1.14 -4.92 -0.64 0.04 0.60 5.73 ▁▃▇▁▁
Lag5 0 1 0.01 1.15 -4.92 -0.64 0.04 0.60 5.73 ▁▃▇▁▁
Volume 0 1 1.48 0.36 0.36 1.26 1.42 1.64 3.15 ▁▇▅▁▁
Today 0 1 0.00 1.14 -4.92 -0.64 0.04 0.60 5.73 ▁▃▇▁▁
smarket %>% select(-Direction) %>%
  corrr::correlate(method = "pearson", quiet = TRUE) %>%
  gt(rowname_col = "term") %>%
  gt::fmt_missing(columns = everything(), missing_text = "") %>%
  gt::data_color(
    columns = everything(),
    colors = scales::col_numeric(
      palette = td_pal("div5")(5),
      domain = c(-0.1, 0.6)
    )
  ) %>%
  gt::fmt_number(columns = everything(), decimals = 3)
         Year    Lag1    Lag2    Lag3    Lag4    Lag5  Volume   Today
Year             0.030   0.031   0.033   0.036   0.030   0.539   0.030
Lag1    0.030           −0.026  −0.011  −0.003  −0.006   0.041  −0.026
Lag2    0.031  −0.026           −0.026  −0.011  −0.004  −0.043  −0.010
Lag3    0.033  −0.011  −0.026           −0.024  −0.019  −0.042  −0.002
Lag4    0.036  −0.003  −0.011  −0.024           −0.027  −0.048  −0.007
Lag5    0.030  −0.006  −0.004  −0.019  −0.027           −0.022  −0.035
Volume  0.539   0.041  −0.043  −0.042  −0.048  −0.022            0.015
Today   0.030  −0.026  −0.010  −0.002  −0.007  −0.035   0.015
smarket %>%
  ggplot(aes(x = factor(Year), y = Volume)) +
  geom_jitter(width = 0.3, color = td_colors$nice$day9_yellow) +
  geom_boxplot(alpha = 0.3, outlier.shape = NA, width = 0.2)

Logistic Regression

glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = smarket, family = binomial)
summary(glm.fits)
#> 
#> Call:
#> glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
#>     Volume, family = binomial, data = smarket)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.126000   0.240736  -0.523    0.601
#> Lag1        -0.073074   0.050167  -1.457    0.145
#> Lag2        -0.042301   0.050086  -0.845    0.398
#> Lag3         0.011085   0.049939   0.222    0.824
#> Lag4         0.009359   0.049974   0.187    0.851
#> Lag5         0.010313   0.049511   0.208    0.835
#> Volume       0.135441   0.158360   0.855    0.392
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1731.2  on 1249  degrees of freedom
#> Residual deviance: 1727.6  on 1243  degrees of freedom
#> AIC: 1741.6
#> 
#> Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fits, type = "response")

glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > 0.5] <- "Up"
table(glm.pred, Direction)
#>         Direction
#> glm.pred Down  Up
#>     Down  145 141
#>     Up    457 507

Use tidymodels to fit the model and produce the confusion matrix:

library(tidymodels)

glm_direction_fit <-
  # Note that these options are the defaults
  #  (and mode can only be "classification" for logistic)
  logistic_reg(mode = "classification", 
               engine = "glm") %>%
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
      data = smarket)

glm_direction_conf_mat <-
  augment(glm_direction_fit, smarket) %>%
  yardstick::conf_mat(.pred_class, Direction)
glm_direction_conf_mat
#>           Truth
#> Prediction Down  Up
#>       Down  145 457
#>       Up    141 507

This returns an object with some helpful built-in functions. Summary metrics:

summary(glm_direction_conf_mat)
#> # A tibble: 13 × 3
#>    .metric              .estimator .estimate
#>    <chr>                <chr>          <dbl>
#>  1 accuracy             binary        0.522 
#>  2 kap                  binary        0.0237
#>  3 sens                 binary        0.507 
#>  4 spec                 binary        0.526 
#>  5 ppv                  binary        0.241 
#>  6 npv                  binary        0.782 
#>  7 mcc                  binary        0.0277
#>  8 j_index              binary        0.0329
#>  9 bal_accuracy         binary        0.516 
#> 10 detection_prevalence binary        0.482 
#> 11 precision            binary        0.241 
#> 12 recall               binary        0.507 
#> 13 f_meas               binary        0.327

We get the same accuracy (52.2%) as that in the text. There are also a couple of autoplot() options:

autoplot(glm_direction_conf_mat, type = "mosaic") |
  autoplot(glm_direction_conf_mat, type = "heatmap")

Create Train and Test set

train <- Year < 2005
glm.fits.train <- glm(Direction ~ Lag1 + Lag2, data = smarket, family = binomial,
                      subset = train)
glm.probs.test <- predict(glm.fits.train, newdata = smarket[!train, ],
                          type = "response")
glm.pred.test <- rep("Down", 252)
glm.pred.test[glm.probs.test > 0.5] <- "Up"
table(glm.pred.test, Direction[!train])
#>              
#> glm.pred.test Down  Up
#>          Down   35  35
#>          Up     76 106

We can do the same with tidymodels by constructing a custom split:

smarket_split <-
  make_splits(
    x = list(
      # Get row numbers for <2005
      "analysis" = which(smarket$Year < 2005),
      # Get row numbers for 2005
      "assessment" = which(smarket$Year == 2005)
    ),
    data = smarket
  )
smarket_split
#> <Analysis/Assess/Total>
#> <998/252/1250>

We then get the training and testing data with their matching functions:

smarket_train <- training(smarket_split)
smarket_test <- testing(smarket_split)
glm_direction_fit_simple <-
  logistic_reg(mode = "classification", engine = "glm") %>%
  fit(Direction ~ Lag1 + Lag2,
      data = smarket_train)
glm_direction_fit_simple %>%
  augment(smarket_test) %>%
  accuracy(truth = Direction, .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.560

Linear Discriminant Analysis

Perform LDA to predict Direction from Lag1 and Lag2:

library(MASS)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)
lda.fit
#> Call:
#> lda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)
#> 
#> Prior probabilities of groups:
#>     Down       Up 
#> 0.491984 0.508016 
#> 
#> Group means:
#>             Lag1        Lag2
#> Down  0.04279022  0.03389409
#> Up   -0.03954635 -0.03132544
#> 
#> Coefficients of linear discriminants:
#>             LD1
#> Lag1 -0.6420190
#> Lag2 -0.5135293
lda.pred <- predict(lda.fit, newdata = smarket[!train, ])
names(lda.pred)
#> [1] "class"     "posterior" "x"
#> [1] "class"     "posterior" "x"
table(lda.pred$class, Direction[!train])
#>       
#>        Down  Up
#>   Down   35  35
#>   Up     76 106

In Tidymodels

library(discrim)
lda_direction_fit <-
  # Note that this model definition does not come with parsnip, but the
  #  extension package discrim
  discrim_linear(mode = "classification", engine = "MASS") %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)
lda_direction_fit
#> parsnip model object
#> 
#> Call:
#> lda(Direction ~ Lag1 + Lag2, data = data)
#> 
#> Prior probabilities of groups:
#>     Down       Up 
#> 0.491984 0.508016 
#> 
#> Group means:
#>             Lag1        Lag2
#> Down  0.04279022  0.03389409
#> Up   -0.03954635 -0.03132544
#> 
#> Coefficients of linear discriminants:
#>             LD1
#> Lag1 -0.6420190
#> Lag2 -0.5135293

Performance on the testing data set:

lda_direction_test_pred <-
  lda_direction_fit %>%
  augment(new_data = smarket_test)
lda_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#>           Truth
#> Prediction Down  Up
#>       Down   35  35
#>       Up     76 106
lda_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.560

Quadratic Discriminant Analysis

Performing QDA will look very similar:

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)
qda.fit
#> Call:
#> qda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)
#> 
#> Prior probabilities of groups:
#>     Down       Up 
#> 0.491984 0.508016 
#> 
#> Group means:
#>             Lag1        Lag2
#> Down  0.04279022  0.03389409
#> Up   -0.03954635 -0.03132544

Predictions

qda.pred <- predict(qda.fit, newdata = smarket[!train, ])
table(qda.pred$class, Direction[!train])
#>       
#>        Down  Up
#>   Down   30  20
#>   Up     81 121

In Tidymodels

qda_direction_fit <-
  discrim_quad(mode = "classification", engine = "MASS") %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)
qda_direction_fit
#> parsnip model object
#> 
#> Call:
#> qda(Direction ~ Lag1 + Lag2, data = data)
#> 
#> Prior probabilities of groups:
#>     Down       Up 
#> 0.491984 0.508016 
#> 
#> Group means:
#>             Lag1        Lag2
#> Down  0.04279022  0.03389409
#> Up   -0.03954635 -0.03132544
qda_direction_test_pred <-
  qda_direction_fit %>%
  augment(new_data = smarket_test)
qda_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#>           Truth
#> Prediction Down  Up
#>       Down   30  20
#>       Up     81 121
qda_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.599

Naive Bayes

By default, the discrim::naive_Bayes() function uses the klaR package as the engine (e1071 is not an option).

nb_direction_fit <-
  naive_Bayes(mode = "classification", engine = "klaR") %>%
  # The klaR engine has an argument usekernel that is always TRUE
  # We have to set it to FALSE to not use KDE, and instead use Gaussian
  #  distributions, as in the text
  set_args(usekernel = FALSE) %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)
# The model output is quite long, so I won't print it here
# nb_direction_fit
nb_direction_test_pred <-
  nb_direction_fit %>%
  augment(new_data = smarket_test)
nb_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#>           Truth
#> Prediction Down  Up
#>       Down   28  20
#>       Up     83 121
nb_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.591

The naive Bayes approach was quite close to (slightly worse than) QDA. This makes sense: as we saw in the correlation matrix, Lag1 and Lag2 are essentially uncorrelated, so the independence assumption of naive Bayes is not badly violated.

K-Nearest Neighbors

library(class)
train.X <- as.matrix(smarket[train, c("Lag1", "Lag2")])
test.X <- as.matrix(smarket[!train, c("Lag1", "Lag2")])
train.Direction <- smarket[train, "Direction"]

k=3

knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction[!train])
#>         
#> knn.pred Down Up
#>     Down   48 55
#>     Up     63 86

k=1

knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction[!train])
#>         
#> knn.pred Down Up
#>     Down   43 58
#>     Up     68 83

In Tidymodels

I’ll skip right to the K = 3 neighbors fit:

knn_direction_fit <-
  nearest_neighbor(mode = "classification", engine = "kknn",
                   neighbors = 3) %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)
knn_direction_test_pred <-
  knn_direction_fit %>%
  augment(new_data = smarket_test)
knn_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#>           Truth
#> Prediction Down Up
#>       Down   43 58
#>       Up     68 83
knn_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary           0.5