AI and Machine Learning for Public Health: Regression and Classification : Exercise 3
Who’s that guy?!
Bongani Ncube
- MSc student, Wits
- Training as a statistician
- Machine learning and Mathematics fanatic
This Class
Slides, R code, data and practicals
- I used R and RStudio
- I also used R Markdown to write reproducible documents (slides/exercises)
- All material is available at https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
Introduction
Definition
- This chapter is about linear regression, a very simple approach for supervised learning.
- In particular, linear regression is a useful tool for predicting a quantitative response
Example Problems solved by linear Regression
- Is there a relationship between advertising budget and sales?
Our first goal should be to determine whether the data provide evidence of an association between advertising expenditure and sales. If the evidence is weak, then one might argue that no money should be spent on advertising!
- How strong is the relationship between advertising budget and sales?
Assuming that there is a relationship between advertising and sales, we would like to know the strength of this relationship. In other words, given a certain advertising budget, can we predict sales with a high level of accuracy? This would be a strong relationship. Or is a prediction of sales based on advertising expenditure only slightly better than a random guess? This would be a weak relationship.
- Which media contribute to sales?
Do all three media—TV, radio, and newspaper—contribute to sales, or do just one or two of the media contribute? To answer this question, we must find a way to separate out the individual effects of each medium when we have spent money on all three media.
- How accurately can we estimate the effect of each medium on sales?
For every dollar spent on advertising in a particular medium, by what amount will sales increase? How accurately can we predict this amount of increase?
- How accurately can we predict future sales?
For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?
- Is the relationship linear?
If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.
Regression
Simple Linear Regression
A straightforward approach for predicting a quantitative response Y from a single predictor X, assuming an approximately linear relationship between them. Mathematically, we can write this linear relationship as
Y \approx \beta_0 + \beta_1 X
Estimating the Coefficients
- In the above equation, \beta_0 and \beta_1 are two unknown constants that represent the intercept and slope terms in the linear model.
- Our goal is to obtain estimates of the coefficients \hat{\beta}_0 and \hat{\beta}_1 such that the linear model fits the data well.
- There are a number of ways of evaluating fit to data, but by far the most common approach is the least squares criterion. We define the residual sum of squares (RSS) as
\text{RSS} = e_1^2 + e_2^2 + \dots + e_n^2
where e_i = y_i - \hat{y}_i is the ith (out of n) residual. The least squares approach chooses \hat{\beta}_0 and \hat{\beta}_1 to minimize the RSS. Using some calculus, one can show that
\begin{align} \hat{\beta}_1 &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align}
Illustration
Strategy
- Applying these formulas to the Advertising data gives the simple linear regression fit, with \hat{\beta}_0 = 7.03 and \hat{\beta}_1 = 0.048.
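To see these formulas in action, here is a minimal sketch on simulated data (my own illustration, not the Advertising data itself): the hand-computed estimates should match those returned by lm().

set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)   # true intercept 2, true slope 0.5

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # agrees with the manual estimates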
Assessing the Accuracy of the Coefficient Estimates
- If we use the sample mean \hat{\mu} to estimate \mu, this estimate is unbiased, in the sense that on average we expect \hat{\mu} to equal \mu.
- It means that on the basis of one particular set of observations y_1,\dots, y_n, \hat{\mu} might overestimate \mu, and on the basis of another set of observations, \hat{\mu} might underestimate \mu. But if we could average a huge number of estimates of \mu obtained from a huge number of sets of observations, then this average would exactly equal \mu.
- An unbiased estimator does not systematically over- or under-estimate the true parameter.
- We have established that the average of \hat{\mu}’s over many data sets will be very close to \mu, but that a single estimate \hat{\mu} may be a substantial underestimate or overestimate of \mu.
- How far off will that single estimate of \hat{\mu} be? In general, we answer this question by computing the standard error of \hat{\mu}, written as SE(\hat{\mu}).
\text{Var}(\hat{\mu}) = \text{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}
To compute the standard errors associated with \hat{\beta}_0 and \hat{\beta}_1, we use the following formulas:
\begin{align} \text{SE}(\hat{\beta}_0)^2 &= \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right] \\ \text{SE}(\hat{\beta}_1)^2 &= \frac{\sigma^2}{\sum (x_i - \bar{x})^2} \end{align}
where \sigma^2 = \text{Var}(\epsilon).
In general, \sigma^2 is not known, but can be estimated from the data. This estimate of \sigma is known as the residual standard error, and is given by the formula \text{RSE} = \sqrt{\text{RSS}/(n-2)}.
For linear regression, we get approximate 95% confidence intervals for the coefficients as:
\begin{align} \hat{\beta}_1 \pm 2 \text{SE}(\hat{\beta}_1) \\ \hat{\beta}_0 \pm 2 \text{SE}(\hat{\beta}_0). \end{align}
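As a quick check of these formulas (a sketch on simulated data; the factor of 2 is the usual approximation to the t quantile), we can compute SE(\hat{\beta}_1) and the approximate 95% interval by hand and compare with confint():

set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

n <- length(y)
rse <- sqrt(sum(residuals(fit)^2) / (n - 2))   # estimate of sigma
se_b1 <- rse / sqrt(sum((x - mean(x))^2))      # SE(beta1_hat)

coef(fit)["x"] + c(-2, 2) * se_b1   # approximate 95% CI for the slope
confint(fit)["x", ]                 # exact t-based interval (very close)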
Assessing the Accuracy of the Model
The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R^2 statistic.
Residual Standard Error
In the output of broom::glance(), the residual standard error (RSE) is sigma, the variance explained R^2 is r.squared, and the F-statistic is statistic.
library(broom)
library(tidyverse)
library(gt)

# lm_sales_tv is assumed to be the simple regression of sales onto TV
# advertising from the Advertising data discussed above
glance(lm_sales_tv) %>%
  transmute(`Residual standard error` = round(sigma, 2),
            `R2` = round(r.squared, 3),
            `F-statistic` = round(statistic, 1)) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "Quantity", values_to = "Value") %>%
  gt()
Quantity | Value |
---|---|
Residual standard error | 3.26 |
R2 | 0.612 |
F-statistic | 312.1 |
R^2 Statistic
The R^2 statistic provides an alternative measure of fit. It always takes on a value between 0 and 1, and is independent of the scale of Y.
R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}
where \text{TSS} = \sum (y_i - \bar{y})^2 is the total sum of squares.
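To make the definition concrete, here is a small check (my own sketch on simulated data) that R^2 computed from TSS and RSS matches the value reported by summary():

set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)    # total sum of squares
rss <- sum(residuals(fit)^2)   # residual sum of squares

1 - rss / tss                  # R^2 from the definition
summary(fit)$r.squared         # same value from summary()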
Multiple Linear Regression
Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor.
The model with p predictors takes the form:
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon.
We interpret the slope \beta_j as the average effect on Y of a one-unit increase in X_j, holding all other predictors fixed.
Estimating the Regression Coefficients
The parameters are estimated using the same least squares approach as simple linear regression: choose \beta_0 \dots \beta_p to minimize RSS.
Some Important Questions
One: Is There a Relationship Between the Response and Predictors?
We begin by testing whether the predictors are related to the response. Formally:
H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0 \quad \text{vs.} \quad H_a: \text{at least one } \beta_j \neq 0
This is assessed using the F-statistic:
F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)}
If F \approx 1, there is little evidence against H_0. Large values of F indicate that the predictors collectively improve model fit and that at least one is associated with the response.
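A hand computation of the F-statistic (a sketch on simulated data with p = 3 predictors, one of which is truly unrelated to the response):

set.seed(1)
n <- 100
p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(1 + X %*% c(0.5, -0.3, 0) + rnorm(n))
fit <- lm(y ~ X)

tss <- sum((y - mean(y))^2)
rss <- sum(residuals(fit)^2)
((tss - rss) / p) / (rss / (n - p - 1))   # F-statistic by hand
summary(fit)$fstatistic["value"]          # same value from summary()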
Two: Deciding on Important Variables
The next question is which predictors matter most. This is known as variable selection.
Common criteria include:
- Adjusted R^2
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Mallow’s C_p
Stepwise procedures (forward, backward, mixed) exist but can be unstable and lead to overfitting. More robust methods include penalized regression (e.g., LASSO, ridge) or cross-validation approaches.
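As one concrete (and exploratory) option, backward stepwise selection by AIC can be run with step(); this is a sketch using the Boston data that appears later in the lab, and the selected formula should be treated with the caution noted above:

boston <- ISLR2::Boston
boston_full <- lm(medv ~ ., data = boston)

# Backward stepwise selection by AIC; trace = 0 suppresses the per-step log
boston_step <- step(boston_full, direction = "backward", trace = 0)
formula(boston_step)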
Three: Model Fit
Model fit is typically quantified by:
- Residual Standard Error (RSE) – the average size of the residuals.
- R^2 – proportion of variance in the response explained by the predictors.
In multiple regression:
R^2 = \text{Cor}(Y, \hat{Y})^2
However, R^2 always increases as more variables are added, even if they are uninformative. Thus, adjusted R^2 or information criteria are preferred when comparing models.
Four: Predictions
With a fitted regression model, predictions of the response can be made. The uncertainty in predictions comes from three main sources:
- Sampling variability in the estimated coefficients (\hat{\beta}).
- Model bias if the assumed linearity does not hold.
- Irreducible error \epsilon.
Prediction intervals are always wider than confidence intervals since they incorporate both reducible and irreducible error.
Other Considerations in Regression
Qualitative Predictors
Categorical variables can also be included:
- For two levels, a single indicator variable is sufficient.
- For more than two levels, multiple dummy variables are used.
The significance of such variables is often tested with an F-test, comparing models with and without the categorical predictors.
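For illustration (my own sketch using the Carseats data that appears later in the lab), such an F-test compares nested models with and without the categorical predictor:

carseats <- ISLR2::Carseats

fit_without <- lm(Sales ~ Price + Advertising, data = carseats)
fit_with    <- lm(Sales ~ Price + Advertising + ShelveLoc, data = carseats)

# F-test on the two dummy variables created from the 3-level ShelveLoc factor
anova(fit_without, fit_with)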
Extensions of the Linear Model
The linear model assumes additivity and linearity. These assumptions can be relaxed.
- Interaction effects allow the effect of one predictor to depend on the level of another.
For example:
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 X_2) + \epsilon
Including interactions may reveal synergies (combined effects of predictors). By the hierarchical principle, if an interaction is included, the corresponding main effects should also remain in the model.
Comparison of Linear Regression with K-Nearest Neighbors
Parametric methods are often easy to fit, and easy to interpret, but the disadvantage is the strong assumption about the form of f(X).
Non-parametric methods do not explicitly assume a form for f(X) and therefore provide an alternative and more flexible approach to regression. One of the simplest and best-known methods is K-nearest neighbors regression (closely related to the KNN classifier from Chapter 2).
Given a value for K and a prediction point x_0, KNN regression first identifies the K training observations closest to x_0 (represented by the set \mathcal{N}_0). It then estimates f(x_0) using the average:
\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i.
In general, the optimal value for K will depend on the bias-variance tradeoff. A small value of K provides the most flexible fit, which will have low bias but high variance. This variance arises because the prediction in a given region depends on just a handful of observations (a single observation when K = 1). In contrast, larger values of K provide a smoother and less variable fit; the prediction in a region is an average of several points, so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in f(X). In Chapter 5, we introduce several approaches for estimating test error rates, which can be used to identify the optimal value of K in KNN regression.
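A minimal KNN regression sketch (a toy implementation of the averaging formula above, not the class code) that also hints at the bias-variance tradeoff:

# Toy one-predictor KNN regression: average the K nearest training responses
knn_reg <- function(x0, x_train, y_train, K = 5) {
  nearest <- order(abs(x_train - x0))[1:K]   # indices of the K closest points
  mean(y_train[nearest])
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

knn_reg(5, x, y, K = 1)    # very flexible: low bias, high variance
knn_reg(5, x, y, K = 25)   # smoother: lower variance, potentially more bias
sin(5)                     # true f(5), for reference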
Classification
Introduction
In many situations, the response variable is instead qualitative. For example, eye color is qualitative. Qualitative variables are often referred to as categorical; the terms are used interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process known as classification. The models covered include logistic regression (and Poisson regression), linear discriminant analysis, quadratic discriminant analysis, naive Bayes, and K-nearest neighbors.
Problems where classification is used
A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
Overview of classification
Why Not Linear Regression?
- Linear regression cannot predict un-ordered qualitative responses with more than two levels.
- It is possible to use linear regression to predict a binary (two level) response. For example, if we code stroke and drug overdose as dummy variables:
Y = \begin{cases} 0 & \text{if stroke;} \\ 1 & \text{if drug overdose}\\ \end{cases}
Then we predict stroke if \hat{Y} \leq 0.5 and overdose if \hat{Y} > 0.5. It turns out that these probability estimates are not unreasonable, but there can be issues:
- Some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities. Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates.
Illustration
Logistic Regression
In logistic regression, we model the probability of Y belonging to a class, rather than the response Y itself. The probability of default given balance can be written:
\text{Pr}(\text{default = Yes}|\text{balance}) = p(\text{balance}).
One might predict a default for an individual with p(\text{balance}) > 0.5. Alternatively, a more conservative threshold could be used, e.g. predicting default when p(\text{balance}) > 0.1.
The Logistic Model
As previously discussed, we could model the probability as linear:
p(X) = \beta_0 + \beta_1 X
but this could give probabilities outside of the range 0-1. We must instead model p(X) using a function that gives outputs 0-1. Many functions meet this description, but logistic regression uses the logistic function:
p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1 X}}.
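A quick numerical check (my own sketch, with arbitrary illustrative coefficients) that the logistic function keeps its output strictly between 0 and 1:

# Logistic function with arbitrary illustrative coefficients
logistic <- function(x, beta0 = -10, beta1 = 0.005) {
  exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
}

# Outputs stay in (0, 1) even for extreme inputs,
# unlike the linear form beta0 + beta1 * x
logistic(c(0, 1000, 2000, 5000))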
Estimating the Regression Coefficients
We fit logistic regression models with maximum likelihood, which seeks estimates for \beta_0 and \beta_1 such that the predicted probabilities \hat{p}(x_i) correspond as closely as possible to the observed values y_i. This idea is formalized using a likelihood function:
\ell (\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'})).
We find the estimates \hat{\beta}_0 and \hat{\beta}_1 by maximizing this likelihood function. Note that the least squares approach to linear regression is a special case of maximum likelihood.
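To illustrate maximum likelihood numerically (a sketch on simulated data; glm() actually uses iteratively reweighted least squares, but it maximizes the same likelihood):

set.seed(1)
x <- rnorm(200)
true_p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))
y <- rbinom(200, size = 1, prob = true_p)

# Negative log-likelihood of the two-parameter logistic model
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

optim(c(0, 0), neg_loglik)$par        # numerical maximum likelihood estimates
coef(glm(y ~ x, family = binomial))   # essentially the same estimates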
Multiple Logistic Regression
The extension to p predictors is straightforward:
\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p.
Multinomial Logistic Regression
For predicting K > 2 classes, we can extend logistic regression in a method called multinomial logistic regression. To do this, we choose a single class K to serve as the baseline. Then the probability of another class k is:
\text{Pr}(Y = k|X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p}}
for k = 1, \dots, K - 1. Then for the baseline class K:
\text{Pr}(Y = K|X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p}}.
The log-odds of a class k is then linear in the predictors:
\log \left( \frac{\text{Pr} (Y = k| X = x)}{\text{Pr} (Y = K| X = x)}\right) = \beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p.
Note that in the case of K = 2, the numerator of this ratio becomes p(X) and the denominator 1 - p(X), which is exactly the two-class logistic regression formula (Equation 4.6).
The choice of class K as the baseline is arbitrary. The only thing that changes with a different baseline is the coefficient estimates; the predictions (fitted values) and model metrics remain the same.
When performing multinomial logistic regression, we will sometimes use an alternative to dummy coding called softmax coding. The softmax coding is equivalent to the baseline coding just described, in the sense that the fitted values, the log odds between any pair of classes, and other key model outputs remain the same regardless of coding. But the softmax coding is used extensively in some areas of the machine learning literature (and will appear again in Chapter 10), so it is worth being aware of it. In the softmax coding, rather than selecting a baseline class, we treat all K classes symmetrically and assume that for k = 1, \dots, K,
\text{Pr}(Y = k|X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p}}.
Rather than estimating coefficients for K - 1 classes, we now estimate coefficients for all K classes. The log odds ratio between the kth and k'th classes then equals
\log \left( \frac{\text{Pr} (Y = k| X = x)}{\text{Pr} (Y = k'| X = x)} \right) = (\beta_{k0} - \beta_{k'0}) + (\beta_{k1} - \beta_{k'1}) x_1 + \dots + (\beta_{kp} - \beta_{k'p}) x_p.
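For a concrete fit (a sketch; nnet::multinom() is one common implementation of baseline-category multinomial logistic regression, used here on the built-in iris data purely for illustration):

library(nnet)

# Three-class response; the first factor level serves as the baseline class
fit_multi <- multinom(Species ~ Sepal.Length + Sepal.Width,
                      data = iris, trace = FALSE)
coef(fit_multi)                          # one row of coefficients per non-baseline class
head(predict(fit_multi, type = "probs")) # fitted class probabilities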
Linear Discriminant Analysis for p = 1
For the case of one predictor, we start by assuming that the class-conditional density f_k(x) = \text{Pr}(X = x|Y = k) is normal or Gaussian, which has the following density in one dimension:
f_k (x) = \frac{1}{\sqrt{2 \pi} \sigma_k} \exp \left( - \frac{1}{2\sigma_k^2} (x - \mu_k)^2\right)
where \mu_k and \sigma_k^2 are the mean and variance parameters for the kth class. For now, assume further that all classes share a common variance \sigma^2. Plugging this density into Bayes' theorem, with \pi_k the prior probability that an observation belongs to the kth class, we have:
p_k (x) = \frac{\pi_k \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{1}{2\sigma^2} (x - \mu_k)^2\right)} {\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{1}{2\sigma^2} (x - \mu_l)^2\right)}.
The Bayes classifier assigns an observation X = x to the class for which the above is largest. Taking the log and rearranging, this is equivalent to choosing the class for which:
\delta_k (x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} + \log(\pi_k)
is largest.
Linear Discriminant Analysis for p > 1
Extending the LDA classifier for multiple predictors involves a multi-variate Gaussian distribution with class-specific mean vector and a common covariance matrix.
The LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(\mu_k, \Sigma). The Bayes classifier assigns an observation X = x to the class for which
\delta_k (x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k
is largest.
As in the univariate case, the LDA method involves estimating the unknown parameters \mu_k, \pi_k, and \Sigma. The quantities \hat{\delta}_k(x) are then computed, and an observation X = x is assigned to the class for which \hat{\delta}_k(x) is largest.
Quadratic Discriminant Analysis
- LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes.
- Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the kth class is of the form X \sim N(\mu_k, \Sigma_k), where \Sigma_k is a covariance matrix for the kth class.
An observation X = x is assigned to the class for which
\begin{align} \delta_k (x) = - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1}_k (x - \mu_k) - \frac{1}{2} \log |\Sigma_k|+ \log \pi_k. \end{align}
is largest. QDA gets its name from the fact that the quantity x appears as a quadratic function in the first term of the above equation.
In general, use LDA when there are relatively few training observations (so reducing variance is crucial), and QDA when the training set is large. Also consider QDA if you have some intuition that the decision boundary is non-linear.
Naive Bayes
The naive Bayes classifier also estimates the conditional density f_k (x) = \text{Pr}(X|Y = k). In LDA, we made the very strong assumption that f_k is the density function of a multivariate normal distribution with mean \mu_k and shared covariance \Sigma. In QDA, the covariance \Sigma_k is class-specific. The naive Bayes classifier instead makes a single assumption: within the kth class, the p predictors are independent. Mathematically:
f_k (x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \dots \times f_{kp}(x_p).
where f_{kj} is the density function of the jth predictor among observations in the kth class.
Under the naive Bayes assumption, the posterior probability becomes:
\text{Pr}(Y = k|X = x) = \frac{\pi_k \times f_{k1}(x_1) \times \dots \times f_{kp} (x_p)}{\sum_{l=1}^K \pi_l \times f_{l1}(x_1) \times \dots \times f_{lp} (x_p)}
for k = 1, \dots, K.
To estimate the one-dimensional f_{kj} from x_j, we have a few options:
- Assume that the jth predictor is drawn from a univariate normal distribution.
- X_j | Y = k \sim N(\mu_{jk}, \sigma_{jk}^2).
- This is like QDA except the covariance matrix is diagonal because the predictors are independent.
- Use a non-parametric estimate for f_{kj}.
- A simple way: estimate f_{kj}(x_j) as the fraction of training observations in the kth class that fall in the same histogram bin as x_j.
- Alternatively, use a kernel density estimator, which is essentially a smoothed version of a histogram.
- For qualitative X_j, simply count the proportion of training observations for the jth predictor corresponding to each class.
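Putting the Gaussian option above together, here is a hand-rolled Gaussian naive Bayes sketch (my own toy implementation for two numeric predictors from the Smarket data used in the lab below; the lab itself uses the klaR engine):

smarket <- ISLR2::Smarket
train <- smarket$Year < 2005
x_tr <- smarket[train, c("Lag1", "Lag2")]
y_tr <- smarket$Direction[train]

nb_posterior <- function(x0) {
  unnorm <- sapply(levels(y_tr), function(k) {
    xk <- x_tr[y_tr == k, ]
    # prior * product of one-dimensional normal densities (independence)
    mean(y_tr == k) *
      dnorm(x0[1], mean(xk$Lag1), sd(xk$Lag1)) *
      dnorm(x0[2], mean(xk$Lag2), sd(xk$Lag2))
  })
  unnorm / sum(unnorm)   # normalize to get Pr(Y = k | X = x0)
}

nb_posterior(c(1.2, 1.1))   # posteriors for Down and Up at Lag1 = 1.2, Lag2 = 1.1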
Lab: Regression
Load the boston data rather than the full ISLR2 package:
boston <- ISLR2::Boston
Simple Linear Regression
Regress median value of owner-occupied homes medv on percentage of houses with lower socioeconomic status lstat:
lm.fit <- lm(medv ~ lstat, data = boston)
summary(lm.fit)
#>
#> Call:
#> lm(formula = medv ~ lstat, data = boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.168 -3.990 -1.318 2.034 24.500
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
#> lstat -0.95005 0.03873 -24.53 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.216 on 504 degrees of freedom
#> Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
#> F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Compute confidence intervals for the coefficient estimates:
confint(lm.fit)
#> 2.5 % 97.5 %
#> (Intercept) 33.448457 35.6592247
#> lstat -1.026148 -0.8739505
Plot the relationship between medv and lstat:
plot(boston$lstat, boston$medv, pch = 20, col = "gray40")
abline(lm.fit, col = "blue", lwd = 2)

boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point() +
  geom_abline(slope = coef(lm.fit)["lstat"],
              intercept = coef(lm.fit)["(Intercept)"],
              size = 1.0, color = td_colors$nice$day9_yellow)
Model Performance
par(mfrow = c(2, 2))
plot(lm.fit)
To display model diagnostics, we can call plot() on the model object as we have before, but I like the performance package because it uses ggplot2:

performance::check_model(lm.fit)
par(mfrow = c(1, 1))
Compute confidence and prediction intervals for medv at lstat = 5, 10, and 15:

predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
#> fit lwr upr
#> 1 29.80359 29.00741 30.59978
#> 2 25.05335 24.47413 25.63256
#> 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
#> fit lwr upr
#> 1 29.80359 17.565675 42.04151
#> 2 25.05335 12.827626 37.27907
#> 3 20.30310 8.077742 32.52846
Multiple Linear Regression
Few Variables
lm.fit2 <- lm(medv ~ lstat + age, data = boston)
summary(lm.fit2)
#>
#> Call:
#> lm(formula = medv ~ lstat + age, data = boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.981 -3.978 -1.283 1.968 23.158
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
#> lstat -1.03207 0.04819 -21.416 < 2e-16 ***
#> age 0.03454 0.01223 2.826 0.00491 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.173 on 503 degrees of freedom
#> Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
#> F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
All Predictors
Fit to all predictors and check VIF with the performance package:
lm_medv_all <- lm(medv ~ ., data = boston)
summary(lm_medv_all)
#>
#> Call:
#> lm(formula = medv ~ ., data = boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.1304 -2.7673 -0.5814 1.9414 26.2526
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
#> crim -0.121389 0.033000 -3.678 0.000261 ***
#> zn 0.046963 0.013879 3.384 0.000772 ***
#> indus 0.013468 0.062145 0.217 0.828520
#> chas 2.839993 0.870007 3.264 0.001173 **
#> nox -18.758022 3.851355 -4.870 1.50e-06 ***
#> rm 3.658119 0.420246 8.705 < 2e-16 ***
#> age 0.003611 0.013329 0.271 0.786595
#> dis -1.490754 0.201623 -7.394 6.17e-13 ***
#> rad 0.289405 0.066908 4.325 1.84e-05 ***
#> tax -0.012682 0.003801 -3.337 0.000912 ***
#> ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
#> lstat -0.552019 0.050659 -10.897 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.798 on 493 degrees of freedom
#> Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
#> F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
library(car)
vif(lm_medv_all)
#> crim zn indus chas nox rm age dis
#> 1.767486 2.298459 3.987181 1.071168 4.369093 1.912532 3.088232 3.954037
#> rad tax ptratio lstat
#> 7.445301 9.002158 1.797060 2.870777
performance::check_collinearity(lm_medv_all)
#> # Check for Multicollinearity
#>
#> Low Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> crim 1.77 [1.58, 2.01] 1.33 0.57 [0.50, 0.63]
#> zn 2.30 [2.03, 2.64] 1.52 0.44 [0.38, 0.49]
#> indus 3.99 [3.45, 4.64] 2.00 0.25 [0.22, 0.29]
#> chas 1.07 [1.02, 1.28] 1.03 0.93 [0.78, 0.98]
#> nox 4.37 [3.77, 5.09] 2.09 0.23 [0.20, 0.26]
#> rm 1.91 [1.70, 2.19] 1.38 0.52 [0.46, 0.59]
#> age 3.09 [2.69, 3.57] 1.76 0.32 [0.28, 0.37]
#> dis 3.95 [3.42, 4.60] 1.99 0.25 [0.22, 0.29]
#> ptratio 1.80 [1.61, 2.05] 1.34 0.56 [0.49, 0.62]
#> lstat 2.87 [2.51, 3.32] 1.69 0.35 [0.30, 0.40]
#>
#> Moderate Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> rad 7.45 [6.37, 8.73] 2.73 0.13 [0.11, 0.16]
#> tax 9.00 [7.69, 10.58] 3.00 0.11 [0.09, 0.13]
The rad (accessibility to radial highways) and tax (property tax rate) variables have moderate VIF.
Interaction Terms
Interaction between lstat and age:
library(gt)
lm.fit_age <- lm(medv ~ lstat * age, data = boston)
tidy_custom(lm.fit_age) %>%
  gt()
term | coefficient | std.error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | 36.089 | 1.4698 | 24.55 | <0.001 |
lstat | -1.392 | 0.1675 | -8.31 | <0.001 |
age | -0.001 | 0.0199 | -0.04 | 0.971 |
lstat:age | 0.004 | 0.0019 | 2.24 | 0.025 |
Non-linear Transformations of the Predictors
Perform a regression of medv onto lstat and lstat^2, and compare fits with anova:
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = boston)
summary(lm.fit2)
#>
#> Call:
#> lm(formula = medv ~ lstat + I(lstat^2), data = boston)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.2834 -3.8313 -0.5295 2.3095 25.4148
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
#> lstat -2.332821 0.123803 -18.84 <2e-16 ***
#> I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.524 on 503 degrees of freedom
#> Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
#> F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
Compare the two fits with anova():
anova(lm.fit, lm.fit2)
#> Analysis of Variance Table
#>
#> Model 1: medv ~ lstat
#> Model 2: medv ~ lstat + I(lstat^2)
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 504 19472
#> 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The quadratic model is superior by the F-test. Check the residuals:
performance::check_model(lm.fit2, check = "linearity")
The quadratic term is an obvious improvement, but there is still some non-linearity at large values of medv.
The poly() function is a quick way to include higher-order terms:
# Orthogonalized predictors by default
lm.fit5 <- lm(medv ~ poly(lstat, 5), data = boston)
tidy_custom(lm.fit5)
#> # A tibble: 6 × 5
#> term coefficient std.error `t-statistic` `p-value`
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 (Intercept) 22.5 0.232 97.2 <0.001
#> 2 poly(lstat, 5)1 -152. 5.21 -29.2 <0.001
#> 3 poly(lstat, 5)2 64.2 5.21 12.3 <0.001
#> 4 poly(lstat, 5)3 -27.1 5.21 -5.19 <0.001
#> 5 poly(lstat, 5)4 25.5 5.21 4.88 <0.001
#> 6 poly(lstat, 5)5 -19.3 5.21 -3.69 <0.001
# Raw polynomials
lm.fit5_raw <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = boston)
tidy_custom(lm.fit5_raw)
#> # A tibble: 6 × 5
#> term coefficient std.error `t-statistic` `p-value`
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 (Intercept) 67.7 3.60 18.8 <0.001
#> 2 poly(lstat, 5, raw = TRUE)1 -12.0 1.53 -7.86 <0.001
#> 3 poly(lstat, 5, raw = TRUE)2 1.27 0.223 5.7 <0.001
#> 4 poly(lstat, 5, raw = TRUE)3 -0.068 0.0144 -4.75 <0.001
#> 5 poly(lstat, 5, raw = TRUE)4 0.002 0.0004 4.14 <0.001
#> 6 poly(lstat, 5, raw = TRUE)5 0 0 -3.69 <0.001
Qualitative Predictors
Load carseats:
carseats <- ISLR2::Carseats
The contrasts() function shows the dummy coding for the qualitative ShelveLoc variable:
contrasts(carseats$ShelveLoc)
#> Good Medium
#> Bad 0 0
#> Good 1 0
#> Medium 0 1
Fit the model and print the coefficients related to ShelveLoc:
lm_sales <- lm(Sales ~ . + Income:Advertising + Price:Age,
               data = carseats)
tidy_custom(lm_sales) %>%
  filter(str_detect(term, "ShelveLoc|Intercept"))
#> # A tibble: 3 × 5
#> term coefficient std.error `t-statistic` `p-value`
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 (Intercept) 6.58 1.01 6.52 <0.001
#> 2 ShelveLocGood 4.85 0.153 31.7 <0.001
#> 3 ShelveLocMedium 1.95 0.126 15.5 <0.001
Lab 2: Classification
The Stock Market Data
This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date). Our goal is to predict Direction (a qualitative response) using the other features.
I like the skimr package for summarizing a data set:
smarket <- ISLR2::Smarket
attach(smarket)
skimr::skim(smarket)
Name | smarket |
Number of rows | 1250 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Direction | 0 | 1 | FALSE | 2 | Up: 648, Dow: 602 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 2003.02 | 1.41 | 2001.00 | 2002.00 | 2003.00 | 2004.00 | 2005.00 | ▇▇▇▇▇ |
Lag1 | 0 | 1 | 0.00 | 1.14 | -4.92 | -0.64 | 0.04 | 0.60 | 5.73 | ▁▃▇▁▁ |
Lag2 | 0 | 1 | 0.00 | 1.14 | -4.92 | -0.64 | 0.04 | 0.60 | 5.73 | ▁▃▇▁▁ |
Lag3 | 0 | 1 | 0.00 | 1.14 | -4.92 | -0.64 | 0.04 | 0.60 | 5.73 | ▁▃▇▁▁ |
Lag4 | 0 | 1 | 0.00 | 1.14 | -4.92 | -0.64 | 0.04 | 0.60 | 5.73 | ▁▃▇▁▁ |
Lag5 | 0 | 1 | 0.01 | 1.15 | -4.92 | -0.64 | 0.04 | 0.60 | 5.73 | ▁▃▇▁▁ |
Volume | 0 | 1 | 1.48 | 0.36 | 0.36 | 1.26 | 1.42 | 1.64 | 3.15 | ▁▇▅▁▁ |
Today | 0 | 1 | 0.00 | 1.14 | -4.92 | -0.64 | 0.04 | 0.60 | 5.73 | ▁▃▇▁▁ |
smarket %>%
  select(-Direction) %>%
  corrr::correlate(method = "pearson", quiet = TRUE) %>%
  gt(rowname_col = "term") %>%
  gt::fmt_missing(columns = everything(), missing_text = "") %>%
  gt::data_color(
    columns = everything(),
    colors = scales::col_numeric(
      palette = td_pal("div5")(5),
      domain = c(-0.1, 0.6)
    )
  ) %>%
  gt::fmt_number(columns = everything(), decimals = 3)
Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | |
---|---|---|---|---|---|---|---|---|
Year | 0.030 | 0.031 | 0.033 | 0.036 | 0.030 | 0.539 | 0.030 | |
Lag1 | 0.030 | −0.026 | −0.011 | −0.003 | −0.006 | 0.041 | −0.026 | |
Lag2 | 0.031 | −0.026 | −0.026 | −0.011 | −0.004 | −0.043 | −0.010 | |
Lag3 | 0.033 | −0.011 | −0.026 | −0.024 | −0.019 | −0.042 | −0.002 | |
Lag4 | 0.036 | −0.003 | −0.011 | −0.024 | −0.027 | −0.048 | −0.007 | |
Lag5 | 0.030 | −0.006 | −0.004 | −0.019 | −0.027 | −0.022 | −0.035 | |
Volume | 0.539 | 0.041 | −0.043 | −0.042 | −0.048 | −0.022 | 0.015 | |
Today | 0.030 | −0.026 | −0.010 | −0.002 | −0.007 | −0.035 | 0.015 |
smarket %>%
  ggplot(aes(x = factor(Year), y = Volume)) +
  geom_jitter(width = 0.3, color = td_colors$nice$day9_yellow) +
  geom_boxplot(alpha = 0.3, outlier.shape = NA, width = 0.2)
Logistic Regression
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = smarket, family = binomial)
summary(glm.fits)
#>
#> Call:
#> glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
#> Volume, family = binomial, data = smarket)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.126000 0.240736 -0.523 0.601
#> Lag1 -0.073074 0.050167 -1.457 0.145
#> Lag2 -0.042301 0.050086 -0.845 0.398
#> Lag3 0.011085 0.049939 0.222 0.824
#> Lag4 0.009359 0.049974 0.187 0.851
#> Lag5 0.010313 0.049511 0.208 0.835
#> Volume 0.135441 0.158360 0.855 0.392
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1731.2 on 1249 degrees of freedom
#> Residual deviance: 1727.6 on 1243 degrees of freedom
#> AIC: 1741.6
#>
#> Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fits, type = "response")

glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > 0.5] <- "Up"
table(glm.pred, Direction)
#> Direction
#> glm.pred Down Up
#> Down 145 141
#> Up 457 507
Use tidymodels to fit the model and produce the confusion matrix:
library(tidymodels)
glm_direction_fit <-
  # Note that these options are the defaults
  # (and mode can only be "classification" for logistic)
  logistic_reg(mode = "classification",
               engine = "glm") %>%
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
      data = smarket)

glm_direction_conf_mat <-
  augment(glm_direction_fit, smarket) %>%
  yardstick::conf_mat(.pred_class, Direction)

glm_direction_conf_mat
#> Truth
#> Prediction Down Up
#> Down 145 457
#> Up 141 507
This returns an object with some helpful built-in functions. Summary metrics:
summary(glm_direction_conf_mat)
#> # A tibble: 13 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.522
#> 2 kap binary 0.0237
#> 3 sens binary 0.507
#> 4 spec binary 0.526
#> 5 ppv binary 0.241
#> 6 npv binary 0.782
#> 7 mcc binary 0.0277
#> 8 j_index binary 0.0329
#> 9 bal_accuracy binary 0.516
#> 10 detection_prevalence binary 0.482
#> 11 precision binary 0.241
#> 12 recall binary 0.507
#> 13 f_meas binary 0.327
We get the same accuracy (52.2%) as in the text. There are also a couple of autoplot() options:
autoplot(glm_direction_conf_mat, type = "mosaic") |
autoplot(glm_direction_conf_mat, type = "heatmap")
Create Train and Test set
train <- Year < 2005
glm.fits.train <- glm(Direction ~ Lag1 + Lag2, data = smarket, family = binomial,
                      subset = train)
glm.probs.test <- predict(glm.fits.train, newdata = smarket[!train, ],
                          type = "response")
glm.pred.test <- rep("Down", 252)
glm.pred.test[glm.probs.test > 0.5] <- "Up"
table(glm.pred.test, Direction[!train])
#>
#> glm.pred.test Down Up
#> Down 35 35
#> Up 76 106
We can use tidymodels to create the same train/test split:
smarket_split <-
  make_splits(
    x = list(
      # Get row numbers for <2005
      "analysis" = which(smarket$Year < 2005),
      # Get row numbers for 2005
      "assessment" = which(smarket$Year == 2005)
    ),
    data = smarket
  )

smarket_split
#> <Analysis/Assess/Total>
#> <998/252/1250>
We then get the training and testing data with their matching functions:
smarket_train <- training(smarket_split)
smarket_test <- testing(smarket_split)
glm_direction_fit_simple <-
  logistic_reg(mode = "classification", engine = "glm") %>%
  fit(Direction ~ Lag1 + Lag2,
      data = smarket_train)

glm_direction_fit_simple %>%
  augment(smarket_test) %>%
  accuracy(truth = Direction, .pred_class)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.560
Linear Discriminant Analysis
Perform LDA to predict Direction from Lag1 and Lag2:
library(MASS)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)

lda.fit
#> Call:
#> lda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)
#>
#> Prior probabilities of groups:
#> Down Up
#> 0.491984 0.508016
#>
#> Group means:
#> Lag1 Lag2
#> Down 0.04279022 0.03389409
#> Up -0.03954635 -0.03132544
#>
#> Coefficients of linear discriminants:
#> LD1
#> Lag1 -0.6420190
#> Lag2 -0.5135293
lda.pred <- predict(lda.fit, newdata = smarket[!train, ])
names(lda.pred)
#> [1] "class"     "posterior" "x"
table(lda.pred$class, Direction[!train])
#>
#> Down Up
#> Down 35 35
#> Up 76 106
In Tidymodels
library(discrim)
lda_direction_fit <-
  # Note that this model definition does not come with parsnip, but the
  # extension package discrim
  discrim_linear(mode = "classification", engine = "MASS") %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)

lda_direction_fit
#> parsnip model object
#>
#> Call:
#> lda(Direction ~ Lag1 + Lag2, data = data)
#>
#> Prior probabilities of groups:
#> Down Up
#> 0.491984 0.508016
#>
#> Group means:
#> Lag1 Lag2
#> Down 0.04279022 0.03389409
#> Up -0.03954635 -0.03132544
#>
#> Coefficients of linear discriminants:
#> LD1
#> Lag1 -0.6420190
#> Lag2 -0.5135293
Performance on the testing data set:
lda_direction_test_pred <-
  lda_direction_fit %>%
  augment(new_data = smarket_test)

lda_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#> Truth
#> Prediction Down Up
#> Down 35 35
#> Up 76 106
lda_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.560
Quadratic Discriminant Analysis
Performing QDA will look very similar:
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)

qda.fit
#> Call:
#> qda(Direction ~ Lag1 + Lag2, data = smarket, subset = train)
#>
#> Prior probabilities of groups:
#> Down Up
#> 0.491984 0.508016
#>
#> Group means:
#> Lag1 Lag2
#> Down 0.04279022 0.03389409
#> Up -0.03954635 -0.03132544
Predictions
qda.pred <- predict(qda.fit, newdata = smarket[!train, ])
table(qda.pred$class, Direction[!train])
#>
#> Down Up
#> Down 30 20
#> Up 81 121
In Tidymodels
qda_direction_fit <-
  discrim_quad(mode = "classification", engine = "MASS") %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)

qda_direction_fit
#> parsnip model object
#>
#> Call:
#> qda(Direction ~ Lag1 + Lag2, data = data)
#>
#> Prior probabilities of groups:
#> Down Up
#> 0.491984 0.508016
#>
#> Group means:
#> Lag1 Lag2
#> Down 0.04279022 0.03389409
#> Up -0.03954635 -0.03132544
qda_direction_test_pred <-
  qda_direction_fit %>%
  augment(new_data = smarket_test)

qda_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#> Truth
#> Prediction Down Up
#> Down 30 20
#> Up 81 121
qda_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.599
Naive Bayes
By default, the discrim::naive_Bayes() function uses the klaR package as the engine (e1071 is not an option).
nb_direction_fit <-
  naive_Bayes(mode = "classification", engine = "klaR") %>%
  # The klaR engine has an argument usekernel that is always TRUE
  # We have to set it to FALSE to not use KDE, and instead use Gaussian
  # distributions, as in the text
  set_args(usekernel = FALSE) %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)
# The model output is quite long, so I won't print it here
# nb_direction_fit

nb_direction_test_pred <-
  nb_direction_fit %>%
  augment(new_data = smarket_test)
nb_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#> Truth
#> Prediction Down Up
#> Down 28 20
#> Up 83 121
nb_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.591
The naive Bayes approach came quite close to (though a bit worse than) QDA. This makes sense: as we saw in the correlation matrix, Lag1 and Lag2 are essentially uncorrelated, so the independence assumption of naive Bayes is not badly violated.
K-Nearest Neighbors
library(class)
train.X <- as.matrix(smarket[train, c("Lag1", "Lag2")])
test.X <- as.matrix(smarket[!train, c("Lag1", "Lag2")])
train.Direction <- smarket[train, "Direction"]
K = 3
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction[!train])
#>
#> knn.pred Down Up
#> Down 48 55
#> Up 63 86
K = 1
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction[!train])
#>
#> knn.pred Down Up
#> Down 43 58
#> Up 68 83
In Tidymodels
I’ll skip right to the K = 3 neighbors fit:
knn_direction_fit <-
  nearest_neighbor(mode = "classification", engine = "kknn",
                   neighbors = 3) %>%
  fit(Direction ~ Lag1 + Lag2, data = smarket_train)

knn_direction_test_pred <-
  knn_direction_fit %>%
  augment(new_data = smarket_test)

knn_direction_test_pred %>%
  conf_mat(truth = Direction, estimate = .pred_class)
#> Truth
#> Prediction Down Up
#> Down 43 58
#> Up 68 83
knn_direction_test_pred %>%
  accuracy(truth = Direction, estimate = .pred_class)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.5