Introduction
Logistic regression is a probabilistic linear classification method that can be used to estimate the probability that an observation belongs to a particular class based on the feature values. Logistic regression can be adapted for use in multi-class classification problems, but we will begin by discussing the standard version of the algorithm, which is a binary classifier.
We will introduce this concept with a simple example.
Example 1: Exam Prep (6 Obs)
Assume that students in a certain field have to take a professional exam. We wish to determine the effect that time spent studying has on a students chances of passing the exam. We collect a dataset consisting of 6 students. For each student, we have recorded the number of hours that this student spent studying, as well as the results of the exam (0 for failing and 1 for passing).
The cell below creates a dataframe to store the student records.
hours <- c(50, 80, 110, 120, 140, 160)
passed <- c(0, 0, 1, 0, 1, 1)
df <- data.frame(hours, passed)
df
If we interpret passed
as a quantitative variable, then we can create a scatter plot of passed
against hours
. Such a plot is provided in the figure below.

Furthermore, if we are interpreting passed
as being quantitative, then there is nothing that would prevent us from creating a simple linear regression model by regressing passed
against hours
. We build such a model in the cell below.
slr_mod <- lm(passed ~ hours, df)
summary(slr_mod)
Call:
lm(formula = passed ~ hours, data = df)
Residuals:
1 2 3 4 5 6
1.000e-01 -2.000e-01 5.000e-01 -6.000e-01 2.000e-01 -6.939e-17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.600000 0.542083 -1.107 0.3304
hours 0.010000 0.004677 2.138 0.0993 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4183 on 4 degrees of freedom
Multiple R-squared: 0.5333, Adjusted R-squared: 0.4167
F-statistic: 4.571 on 1 and 4 DF, p-value: 0.0993
The fitted line for this data is shown in the figure below.

There are a number of serious concerns with this using a linear model such as this to describe the relationship between study hours and the ability to pass the exam. We will discuss some of these deficiencies in class.
A more appropriate model for this task is provided by a technique called logistic regression. Before discussing the details of logistic regression, we need to first introduce the sigmoid function.
The Sigmoid Function
The sigmoid function, also called the logistic function, is defined as follows:
\[\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}\]
A plot of the sigmoid function is provided below. Note that the domain of \(\sigma\) is the entire real line, and its outputs are confined between 0 and 1.

Since the range of the \(\sigma\) is the interval \(\left(0,1\right)\), we can interpret the output of the sigmoid function as representing a probability. We can imagine that the sigmoid function converts arbitrary real numbers into probabilities in a specific way.
Logistic Regression Model: Single Predictor
Let \(Y\) be a categorical response variable with two levels. Assume that the labels are encoded as 0 and 1. Let \(p(x)\) denote the probability that \(Y = 1\) given that \(X=x\). That is:
- \(p(x) = P\left[Y = 1 | X = x \right]\)
A logistic regression model assumes that \(p\) is defined in terms of \(X\) as follows:
Our goal is to approximate this hypothetical relationship by finding estimates \(\hat\beta_0\) and \(\hat\beta_1\) for the unknown model parameters. As with ordinary least squares regression, our goal is to find parameter estimates that result in a model that “best fits” the training data according to some metric.
We will now discuss how logistic regression models are scored.
Scoring Logistic Regression Models
Assume that two logistic regression models have been proposed for this dataset:
Model 1: \(Z = 3 - 0.03 X, \hspace{0.6 em} p(X) = \sigma(Z)\)
Model 2: \(Z = 10 - 0.1 X, \hspace{0.6 em} p(X) = \sigma(Z)\)
The figure below displays plots for both of these models.

Likelihood
We can calculate a likelihood score for each of these models. This score measures how likely it is that the observed data could have been generated by the model being scored. We prefer models with higher likelihood scores. Likelihood is calculated as follows:
Step 1: Calculate \(p_i = p(x_i)\) for each training observation.
Recall that this number is the model’s estimate of the probability that a student will pass the exam, given the number of hours that they have studied.
Step 2: Calculate \(\pi_i\) for each training observation.
For each training observation, define \(\pi_i\) as follows:
\(\pi_i =\left\{\begin{array}{ll} p_i & \text{if }~ y_i = 1 \\ 1 - p_i & \text{if } ~y_i = 0 \end{array}\right.\)
The value \(\pi_i\) can be interpreted as the probability that student \(i\) will have obtained the observed outcome, given the number of hours that they studied.
Step 3: Calculate Likelihood, \(L\).
The model’s likelihood score is given by: \(L = \pi_1 \cdot \pi_2 \cdot ... \cdot \pi_n = \prod_{i=1}^{n} \pi_i\).
This score is equal to the probability that we would have seen the outcomes that we have observed, assuming that the model is correct. We prefer models with higher likelihood scores.
The figure below contains bars representing the values of \(\pi_i\) for each observation according to each model.

In the cell below, we calculate the likelihood scores for both of the two models we are considering.
z1 <- 3 - 0.03*df$hours
p1 <- 1 / (1 + exp(-z1))
pi1 <- ifelse(df$passed == 1, p1, 1-p1)
lik1 <- prod(pi1)
z2 <- 10 - 0.1*df$hours
p2 <- 1 / (1 + exp(-z2))
pi2 <- ifelse(df$passed == 1, p2, 1-p2)
lik2 <- prod(pi2)
cat('Likelihood score for Model 1: ', lik1, '\n',
'Likelihood score for Model 2: ', lik2, sep='')
Likelihood score for Model 1: 0.0005831859
Likelihood score for Model 2: 8.404835e-09
Negative Log-Likelihood
Although we can, in theory, compare models using their likelihood score, \(L\), it is more common in practice to score models using negative log-likelihood, \(NLL\). By definition, \(NLL = -\ln(L)\), but it is actually calculated using the following steps:
Step 1: Calculate \(p_i = p(x_i)\) for each training observation.
Step 2: Calculate \(\pi_i\) for each training observation.
For each training observation, calculate:
\(\pi_i =\left\{\begin{array}{ll} p_i & \text{if }~ y_i = 1 \\ 1 - p_i & \text{if } ~y_i = 0 \end{array}\right.\)
Step 3: Calculate Negative Log-Likelihood, \(NLL\).
The model’s negative log-likelihood score is given by:
\(NLL = - \left[\ln(\pi_1) + \ln(\pi_2) + ... + \ln(\pi_n)\right] = - \sum_{i=1}^n \ln(\pi_i)\).
Since the natural logarithm is an increasing function, maximizing likelihood is equivalent to maximizing log-likelihood, and thus equivalent to minimizing negative log-likelihood. Therefore, we prefer models with low \(NLL\) scores.
We calculate the negative log-likelihood of each of our proposed models in the cell below.
nll1 = -sum(log(pi1)) # nll = -log(lik1)
nll2 = -sum(log(pi2)) # nll = -log(lik2)
cat('Negative Log-Likelihood score for Model 1: ', nll1, '\n',
'Negative Log-Likelihood score for Model 2: ', nll2, sep='')
Negative Log-Likelihood score for Model 1: 7.447005
Negative Log-Likelihood score for Model 2: 18.59446
Finding Optimal Logistic Regression Model
R provides a glm()
function for performing logistic regression. The syntax for using this function is illustrated in the cell below, where we create a logistic regression model for our exam prep data.
mod <- glm(passed ~ hours, data=df, family=binomial(link="logit"))
summary(mod)
Call:
glm(formula = passed ~ hours, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
1 2 3 4 5 6
-0.1047 -0.3528 1.3018 -1.4098 0.4650 0.2095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.29923 8.24186 -1.128 0.259
hours 0.08192 0.07006 1.169 0.242
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.3178 on 5 degrees of freedom
Residual deviance: 4.0778 on 4 degrees of freedom
AIC: 8.0778
Number of Fisher Scoring iterations: 6
We can see from the model summary that the optimal logistic regression model for this dataset is given by:
\[Z = -9.3 + 0.082 \cdot X\]
\[p(X) = \sigma(Z) = \frac{1}{1 + e^{9.3 - 0.083 \cdot X}}\]
The figure below shows a plot of the optimal logistic regression model for this dataset.

We can use the predict()
function along with the parameter type="response"
to see the estimated probabilities of passing for each student in our training set, according to our model
predict(mod, df, type="response")
1 2 3 4 5 6
0.005469362 0.060344876 0.428551145 0.629823359 0.897514421 0.978296836
We will now calculate the likelihood and log-likelihood scores for our optimal model.
p <- predict(mod, df, type="response")
pi <- ifelse(df$passed == 1, p, 1-p)
lik <- prod(pi)
nll <- -log(lik)
cat('Log-Likelihood for Optimal Model : ', lik, '\n',
'Negative Log-Likelihood for Optimal Model: ', nll, sep='')
Log-Likelihood for Optimal Model : 0.1301699
Negative Log-Likelihood for Optimal Model: 2.038915
Let’s now consider a similar example, but with more observations.
Example 2: Exam Prep
The data for this example is stored in the tab-separated file data/exam_prep.txt
. We load the data and view a summary of the data.
exam <- read.table("data/exam_prep.txt", sep="\t", header=TRUE)
exam$Passed <- factor(exam$Passed)
summary(exam)
Passed Hours
0:44 Min. : 50.0
1:56 1st Qu.:116.2
Median :189.0
Mean :178.2
3rd Qu.:241.5
Max. :299.0
We will explore the relationship between hours studied and exam outcome by creating a boxplot.
ggplot(exam, aes(x=Passed, y=Hours, fill=Passed)) + geom_boxplot()

We will use the glm()
function to create a logistic regression model.
exam_mod <- glm(Passed ~ Hours, exam, family=binomial(link="logit"))
summary(exam_mod)
Call:
glm(formula = Passed ~ Hours, family = binomial(link = "logit"),
data = exam)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1727 -0.7261 0.4391 0.7149 1.9716
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.185091 0.715405 -4.452 8.50e-06 ***
Hours 0.019662 0.003892 5.052 4.38e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.19 on 99 degrees of freedom
Residual deviance: 100.28 on 98 degrees of freedom
AIC: 104.28
Number of Fisher Scoring iterations: 4
We can see from the model summary that the optimal logistic regression model for this dataset is given by:
\[Z = -3.19 + 0.02 X\]
\[p(X) = \sigma(Z) = \frac{1}{1 + e^{3.19 - 0.02 X}}\]
In the cell below, we will use our model to estimate the probability that a student will pass the exam if they have studied for each of the following numbers of hours: 0, 50, 100, 150, 200
nd <- data.frame(Hours = c(0, 50, 100, 150, 200))
prob <- predict(exam_mod, newdata=nd, type="response")
prob
1 2 3 4 5
0.03973065 0.09956985 0.22812386 0.44130744 0.67857270
Training Accuracy
A common way to evaluate the performance of a classification model is to calculate its accuracy on some set. The accuracy on a particular set is equal to the proportion of observations in the set for which the model predicts the correct class.
In the cell below, we calculate our model’s accuracy on the training set.
prob <- predict(exam_mod, exam, type="response")
pred <- ifelse(prob < 0.5, 0, 1)
tr_acc <- mean(pred == exam$Passed)
cat('Training Accuracy: ', tr_acc, sep='')
Training Accuracy: 0.77
Interpreting Model Coefficients
Interpreting the coefficients in a logistic regression model is not as direct as in a linear regression model. However, it is possible to do. Because discussing how we would interpret these coefficients, we need to discuss the concept of odds ratios.
Odds Ratios
The odds of an event occurring is defined as the probability of an event occurring divided by the probability of the event not occurring. The probability of an event \(P_E\), and the odds of an event \(O_E\) are related by the following expressions:
\[O_E = \frac{P_E}{1 - P_E} \hspace{3 em} P_E = \frac{O_E}{1 + O_E}\]
You might be familiar with the concept of odds from sports or games of chance. Assume that an event is twice as likely to occur as not. Then we might say that the event has “2 to 1 odds”. The odds of such an event is often denoted by 2:1 or 2/1. This notion of odds is equivalent to the one that we defined above.
The table below shows several ways of expressing odds, as well as the probability associated with each odds value.
Odds
|
Odds
|
Odds
|
Prob
|
1:1
|
1/1
|
1.00
|
0.50
|
2:1
|
2/1
|
2.00
|
0.67
|
1:2
|
1/2
|
0.50
|
0.33
|
1:3
|
1/3
|
0.33
|
0.25
|
Note that odds are always positive, but can be arbitrarily large.
Coefficient Interpretation
The interpretation of \(\beta_1\) in a logistic regression model is as follows: If we increase the predictor \(X\) by a single unit, that will result in an increase of \(\beta_1\) units in log-odds, and approximately a \(\beta_1*100\%\) increase in the odds of \(Y=1\).
Returning to our exam prep example, we recall that our model was given by:
\[p(X) = \frac{1}{1 + e^{3.19 - 0.02 X}}\]
This can be rewritten as:
\[\ln\left[\frac{p(X)}{1 - p(X)} \right] = -3.19 + 0.02 X \]
It follows that each additional hour of studying will increase the odds (not the probability) of passing the exam by about 2% (or in other words, by a factor of 1.02).
Example 3: Pima Diabetes Dataset
For this example, we will be working with the Pima Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. All patients are females at least 21 years old of Pima Indian heritage.
The columns in this dataset are described below.
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
Loading the Data
We will read in the data, convert the Outcome column to a factor variable, and then print a summary of the dataset.
pima <- read.table("data/diabetes.csv", sep=",", header=TRUE)
pima$Outcome <- factor(pima$Outcome)
summary(pima)
Pregnancies Glucose BloodPressure SkinThickness Insulin
Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00 1st Qu.: 0.0
Median : 3.000 Median :117.0 Median : 72.00 Median :23.00 Median : 30.5
Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54 Mean : 79.8
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00 3rd Qu.:127.2
Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00 Max. :846.0
BMI DiabetesPedigreeFunction Age Outcome
Min. : 0.00 Min. :0.0780 Min. :21.00 0:500
1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00 1:268
Median :32.00 Median :0.3725 Median :29.00
Mean :31.99 Mean :0.4719 Mean :33.24
3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
Max. :67.10 Max. :2.4200 Max. :81.00
Visualizing the Data
The figure below displays density plots for each of our predictors, and for each value of our response variable.

Splitting the Data
We will consider multiple logistic regression models on this dataset. We will create training and validation sets. The models will be trained on the training set, and compared based on their accuracy on the validation set.
The caret
package provides a function called createDataPartition()
which can be used to create a train/validation split. By default, this function creates a stratified random split. As a result, the distribution of the response variable will be roughly the same in the training set and validation set.
In the cell below, we create a 70/30 split of our dataset.
library(caret)
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
set.seed(1)
train.index <- createDataPartition(pima$Outcome, p = .7, list=FALSE)
train <- pima[ train.index,]
valid <- pima[-train.index,]
summary(train$Outcome)
0 1
350 188
summary(valid$Outcome)
0 1
150 80
Create Models
We will now train our model on the training set.
pima_mod_1 <- glm(Outcome ~ ., train, family=binomial(link="logit"))
summary(pima_mod_1)
Call:
glm(formula = Outcome ~ ., family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4622 -0.6897 -0.3771 0.7121 2.8527
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.5166331 0.9353565 -10.174 < 2e-16 ***
Pregnancies 0.1142149 0.0382633 2.985 0.00284 **
Glucose 0.0328008 0.0044197 7.422 1.16e-13 ***
BloodPressure -0.0113590 0.0066079 -1.719 0.08561 .
SkinThickness -0.0100998 0.0081048 -1.246 0.21271
Insulin -0.0004845 0.0010896 -0.445 0.65660
BMI 0.1080848 0.0188664 5.729 1.01e-08 ***
DiabetesPedigreeFunction 1.5592974 0.3777531 4.128 3.66e-05 ***
Age 0.0312125 0.0112356 2.778 0.00547 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 696.28 on 537 degrees of freedom
Residual deviance: 487.97 on 529 degrees of freedom
AIC: 505.97
Number of Fisher Scoring iterations: 5
It appears that three of our predictors, Insulin
, SkinThickness
, and BloodPressure
do not have statistically significant effects on the prescience of diabetes. We will remove them one at a time.
pima_mod_2 <- glm(Outcome ~ . - Insulin, train, family=binomial(link="logit"))
summary(pima_mod_2)
Call:
glm(formula = Outcome ~ . - Insulin, family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4591 -0.6866 -0.3798 0.7075 2.8425
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.467154 0.927116 -10.211 < 2e-16 ***
Pregnancies 0.115198 0.038156 3.019 0.00253 **
Glucose 0.032150 0.004151 7.746 9.51e-15 ***
BloodPressure -0.011115 0.006567 -1.692 0.09056 .
SkinThickness -0.011555 0.007391 -1.563 0.11795
BMI 0.108058 0.018842 5.735 9.75e-09 ***
DiabetesPedigreeFunction 1.549068 0.377035 4.109 3.98e-05 ***
Age 0.031393 0.011228 2.796 0.00517 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 696.28 on 537 degrees of freedom
Residual deviance: 488.17 on 530 degrees of freedom
AIC: 504.17
Number of Fisher Scoring iterations: 5
pima_mod_3 <- glm(Outcome ~ . - Insulin - SkinThickness, train, family=binomial(link="logit"))
summary(pima_mod_3)
Call:
glm(formula = Outcome ~ . - Insulin - SkinThickness, family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4798 -0.6973 -0.3835 0.6820 2.8184
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.298915 0.910983 -10.208 < 2e-16 ***
Pregnancies 0.117716 0.038248 3.078 0.00209 **
Glucose 0.032033 0.004148 7.722 1.14e-14 ***
BloodPressure -0.012371 0.006476 -1.910 0.05609 .
BMI 0.098521 0.017582 5.603 2.10e-08 ***
DiabetesPedigreeFunction 1.462223 0.370645 3.945 7.98e-05 ***
Age 0.032487 0.011280 2.880 0.00398 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 696.28 on 537 degrees of freedom
Residual deviance: 490.62 on 531 degrees of freedom
AIC: 504.62
Number of Fisher Scoring iterations: 5
pima_mod_4 <- glm(Outcome ~ . - Insulin - SkinThickness - BloodPressure, train, family=binomial(link="logit"))
summary(pima_mod_4)
Call:
glm(formula = Outcome ~ . - Insulin - SkinThickness - BloodPressure,
family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5241 -0.6849 -0.3820 0.6919 2.7797
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.704090 0.893021 -10.867 < 2e-16 ***
Pregnancies 0.116444 0.037992 3.065 0.00218 **
Glucose 0.031563 0.004114 7.672 1.69e-14 ***
BMI 0.091049 0.016922 5.381 7.42e-08 ***
DiabetesPedigreeFunction 1.455119 0.367779 3.957 7.61e-05 ***
Age 0.028194 0.010960 2.572 0.01010 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 696.28 on 537 degrees of freedom
Residual deviance: 494.30 on 532 degrees of freedom
AIC: 506.3
Number of Fisher Scoring iterations: 5
Notice that all predictors are statistically significant in our fourth model.
Calculate Training and Validation Accuracies
We will now calculate the training and validation accuracies for each of our four models. The cell below shows the code for calculating these metrics using Model 1. The same values are also calculated for the other three models, but the code has been hidden in the output. It can be seen in the Rmd file for this lesson, however.
tr_prob_1 <- predict(pima_mod_1, train, type="response")
tr_pred_1 <- ifelse(tr_prob_1 < 0.5, 0, 1)
tr_acc_1 <- mean(tr_pred_1 == train$Outcome)
va_prob_1 <- predict(pima_mod_1, valid, type="response")
va_pred_1 <- ifelse(va_prob_1 < 0.5, 0, 1)
va_acc_1 <- mean(va_pred_1 == valid$Outcome)
We will now compare the accuracy scores for each of our models.
cat("Model 1 Training Accuracy: ", tr_acc_1, "\n",
"Model 1 Validation Accuracy: ", va_acc_1, "\n\n",
"Model 2 Training Accuracy: ", tr_acc_2, "\n",
"Model 2 Validation Accuracy: ", va_acc_2, "\n\n",
"Model 3 Training Accuracy: ", tr_acc_3, "\n",
"Model 3 Validation Accuracy: ", va_acc_3, "\n\n",
"Model 4 Training Accuracy: ", tr_acc_4, "\n",
"Model 4 Validation Accuracy: ", va_acc_4, sep=""
)
Model 1 Training Accuracy: 0.7843866
Model 1 Validation Accuracy: 0.7652174
Model 2 Training Accuracy: 0.7843866
Model 2 Validation Accuracy: 0.7695652
Model 3 Training Accuracy: 0.7750929
Model 3 Validation Accuracy: 0.7608696
Model 4 Training Accuracy: 0.7750929
Model 4 Validation Accuracy: 0.7695652
Notice that Models 2 and 4 are actually tied in their performance on the validation set, although the score is similar for all four models. In this case, we would likely prefer Model 4 since it does not contain any non-significant predictors.
---
title: "Lesson 4.1 - Logistic Regression"
author: "Robbie Beane"
output:
  html_notebook:
    theme: flatly
    toc: yes
    toc_depth: 4
---


### **Introduction**

Logistic regression is a probabilistic linear classification method that can be used to estimate the probability that an observation belongs to a particular class based on the feature values. Logistic regression can be adapted for use in multi-class classification problems, but we will begin by discussing the standard version of the algorithm, which is a binary classifier.

We will introduce this concept with a simple example.

### **Example 1: Exam Prep (6 Obs)**

Assume that students in a certain field have to take a professional exam. We wish to determine the effect that time spent studying has on a students chances of passing the exam. We collect a dataset consisting of 6 students. For each student, we have recorded the number of hours that this student spent studying, as well as the results of the exam (0 for failing and 1 for passing).

The cell below creates a dataframe to store the student records. 


```{r}
hours <- c(50, 80, 110, 120, 140, 160)
passed <- c(0, 0, 1, 0, 1, 1)
df <- data.frame(hours, passed)
df
```

If we interpret `passed` as a quantitative variable, then we can create a scatter plot of `passed` against `hours`. Such a plot is provided in the figure below. 

```{r, echo=FALSE}
plot(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue', 
     xlim=c(30,180), ylim=c(-0.2, 1.2), xlab='Study Hours', ylab='Passed')
lines(c(0,200), c(0,0), lty=2)
lines(c(0,200), c(1,1), lty=2)
points(passed[df$passed == 0] ~ hours[df$passed == 0], cex=2, pch=21, bg='salmon')
points(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue')
```

Furthermore, if we are interpreting `passed` as being quantitative, then there is nothing that would prevent us from creating a simple linear regression model by regressing `passed` against `hours`. We build such a model in the cell below. 

```{r}
slr_mod <- lm(passed ~ hours, df)
summary(slr_mod)
```

The fitted line for this data is shown in the figure below. 

```{r, echo=FALSE}
plot(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue', 
     xlim=c(30,180), ylim=c(-0.2, 1.2), xlab='Study Hours', ylab='Passed')
abline(slr_mod$coefficients, lwd=2)
lines(c(0,200), c(0,0), lty=2)
lines(c(0,200), c(1,1), lty=2)
points(passed[df$passed == 0] ~ hours[df$passed == 0], cex=2, pch=21, bg='salmon')
points(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue')
```

There are a number of serious concerns with this using a linear model such as this to describe the relationship between study hours and the ability to pass the exam. We will discuss some of these deficiencies in class. 

A more appropriate model for this task is provided by a technique called logistic regression. Before discussing the details of logistic regression, we need to first introduce the sigmoid function. 


### **The Sigmoid Function**

The **sigmoid function**, also called the **logistic function**, is defined as follows:

$$\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}$$

A plot of the sigmoid function is provided below. Note that the domain of $\sigma$ is the entire real line, and its outputs are confined between 0 and 1. 


```{r, echo=FALSE}
z_sig <- seq(-6, 6, 0.1)
y_sig <- 1 / (1 + exp(-z_sig))
plot(y_sig ~ z_sig, pch='.', col="white", xlab="Z", ylab="p", main="Plot of Sigmoid Function")

lines(c(0,0), c(-1,1), lty=2, lwd=1, col='grey50')
lines(c(-8,8), c(0,0), lty=2, lwd=1, col='grey50')
lines(c(-8,8), c(1,1), lty=2, lwd=1, col='grey50')

lines(y_sig ~ z_sig, lwd=3, col="steelblue")

```

Since the range of the $\sigma$ is the interval $\left(0,1\right)$, we can interpret the output of the sigmoid function as representing a probability. We can imagine that the sigmoid function converts arbitrary real numbers into probabilities in a specific way. 

### **Logistic Regression Model: Single Predictor**

Let $Y$ be a categorical response variable with two levels. Assume that the labels are encoded as 0 and 1. Let $p(x)$ denote the probability that $Y = 1$ given that $X=x$. That is:

* $p(x) = P\left[Y = 1 | X = x \right]$

A logistic regression model assumes that $p$ is defined in terms of $X$ as follows:

* $Z = \beta_0 + \beta_1  X$

* $p(X) = \sigma(Z) = \sigma(\beta_0 + \beta_1 X) = \frac{1}{1 + e^{-\left(\beta_0 + \beta_1  X \right)}}$

Our goal is to approximate this hypothetical relationship by finding estimates $\hat\beta_0$ and $\hat\beta_1$ for the unknown model parameters. As with ordinary least squares regression, our goal is to find parameter estimates that result in a model that "best fits" the training data according to some metric. 

We will now discuss how logistic regression models are scored. 


### **Scoring Logistic Regression Models**

Assume that two logistic regression models have been proposed for this dataset:

* **Model 1:** $Z = 3 - 0.03  X, \hspace{0.6 em} p(X) = \sigma(Z)$

* **Model 2:** $Z = 10 - 0.1  X, \hspace{0.6 em} p(X) = \sigma(Z)$

The figure below displays plots for both of these models. 


```{r, echo=FALSE}
x_grid <- seq(0, 200)


par(mfrow=c(1,2))

plot(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue', 
     xlim=c(30,180), ylim=c(-0.2, 1.2), xlab='Study Hours', ylab='Passed')

lines(x_grid, 1 / (1 + exp(3 - 0.03 * x_grid)), lwd=2)
lines(c(0,200), c(0,0), lty=2)
lines(c(0,200), c(1,1), lty=2)
points(passed[df$passed == 0] ~ hours[df$passed == 0], cex=2, pch=21, bg='salmon')
points(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue')

plot(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue', 
     xlim=c(30,180), ylim=c(-0.2, 1.2), xlab='Study Hours', ylab='Passed')

lines(x_grid, 1 / (1 + exp(10 - 0.10 * x_grid)), lwd=2)
lines(c(0,200), c(0,0), lty=2)
lines(c(0,200), c(1,1), lty=2)
points(passed[df$passed == 0] ~ hours[df$passed == 0], cex=2, pch=21, bg='salmon')
points(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue')

par(mfrow=c(1,1))
```

#### **Likelihood**

We can calculate a **likelihood** score for each of these models. This score measures how likely it is that the observed data could have been generated by the model being scored. We prefer models with higher likelihood scores. Likelihood is calculated as follows:

* **Step 1:** Calculate $p_i = p(x_i)$ for each training observation. 

  Recall that this number is the model's estimate of the probability that a  student will pass the exam, given the number of hours that they have studied.  

* **Step 2:** Calculate $\pi_i$ for each training observation.

  For each training observation, define $\pi_i$ as follows: 
  
  $\pi_i =\left\{\begin{array}{ll} p_i & \text{if }~ y_i = 1 \\ 1 - p_i & \text{if } ~y_i = 0 \end{array}\right.$
  
  The value $\pi_i$ can be interpreted as the probability that student $i$ will have obtained the observed outcome, given the number of hours that they studied. 
  
* **Step 3:** Calculate Likelihood, $L$.

  The model's **likelihood** score is given by: $L = \pi_1 \cdot \pi_2 \cdot ... \cdot \pi_n = \prod_{i=1}^{n} \pi_i$. 
  
  This score is equal to the probability that we would have seen the outcomes that we have observed, assuming that the model is correct. We prefer models with higher likelihood scores.  

The figure below contains bars representing the values of $\pi_i$ for each observation according to each model. 

```{r, echo=FALSE}
x_grid <- seq(0, 200)

p1 <- 1 / (1 + exp((3 - 0.03*df$hours)))
pi1 <- ifelse(df$passed == 1, p1, 1-p1)
p2 <- 1 / (1 + exp((10 - 0.1*df$hours)))
pi2 <- ifelse(df$passed == 1, p2, 1-p2)


par(mfrow=c(1,2))

plot(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue', 
     xlim=c(30,180), ylim=c(-0.2, 1.2), xlab='Study Hours', ylab='Passed')


segments(x0=hours[passed==0], y0=p1[passed==0], y1=1, col='salmon')
segments(x0=hours[passed==1], y0=0, y1=p1[passed==1], col='cornflowerblue')

lines(x_grid, 1 / (1 + exp(3 - 0.03 * x_grid)), lwd=2)
lines(c(0,200), c(0,0), lty=2)
lines(c(0,200), c(1,1), lty=2)
points(passed[df$passed == 0] ~ hours[df$passed == 0], cex=2, pch=21, bg='salmon')
points(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue')

plot(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue', 
     xlim=c(30,180), ylim=c(-0.2, 1.2), xlab='Study Hours', ylab='Passed')

segments(x0=hours[passed==0], y0=p2[passed==0], y1=1, col='salmon')
segments(x0=hours[passed==1], y0=0, y1=p2[passed==1], col='cornflowerblue')

lines(x_grid, 1 / (1 + exp(10 - 0.10 * x_grid)), lwd=2)
lines(c(0,200), c(0,0), lty=2)
lines(c(0,200), c(1,1), lty=2)
points(passed[df$passed == 0] ~ hours[df$passed == 0], cex=2, pch=21, bg='salmon')
points(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue')

par(mfrow=c(1,1))
```


In the cell below, we calculate the likelihood scores for both of the two models we are considering.

```{r}
z1 <- 3 - 0.03*df$hours
p1 <- 1 / (1 + exp(-z1))
pi1 <- ifelse(df$passed == 1, p1, 1-p1)
lik1 <- prod(pi1)

z2 <- 10 - 0.1*df$hours
p2 <- 1 / (1 + exp(-z2))
pi2 <- ifelse(df$passed == 1, p2, 1-p2)
lik2 <- prod(pi2)

cat('Likelihood score for Model 1: ', lik1, '\n', 
    'Likelihood score for Model 2: ', lik2, sep='')
```

#### **Negative Log-Likelihood**

Although we can, in theory, compare models using their likelihood score, $L$, it is more common in practice to score models using **negative log-likelihood**, $NLL$. By definition, $NLL = -\ln(L)$, but it is actually calculated using the following steps:


* **Step 1:** Calculate $p_i = p(x_i)$ for each training observation. 

* **Step 2:** Calculate $\pi_i$ for each training observation.

  For each training observation, calculate:
  
  $\pi_i =\left\{\begin{array}{ll} p_i & \text{if }~ y_i = 1 \\ 1 - p_i & \text{if } ~y_i = 0 \end{array}\right.$
  
* **Step 3:** Calculate Negative Log-Likelihood, $NLL$.

  The model's **negative log-likelihood** score is given by: 
  
  $NLL = - \left[\ln(\pi_1) + \ln(\pi_2) + ... + \ln(\pi_n)\right] = - \sum_{i=1}^n \ln(\pi_i)$. 
  
Since the natural logarithm is an increasing function, maximizing likelihood is equivalent to maximizing log-likelihood, and thus equivalent to minimizing negative log-likelihood. Therefore, we prefer models with **low** $NLL$ scores. 

We calculate the negative log-likelihood of each of our proposed models in the cell below. 

```{r}
nll1 = -sum(log(pi1))   # nll = -log(lik1)
nll2 = -sum(log(pi2))   # nll = -log(lik2)

cat('Negative Log-Likelihood score for Model 1: ', nll1, '\n', 
    'Negative Log-Likelihood score for Model 2: ', nll2, sep='')
```


### **Finding Optimal Logistic Regression Model**

R provides a `glm()` function for performing logistic regression. The syntax for using this function is illustrated in the cell below, where we create a logistic regression model for our exam prep data. 

```{r}
mod <- glm(passed ~ hours, data=df, family=binomial(link="logit"))
summary(mod)
```

We can see from the model summary that the optimal logistic regression model for this dataset is given by: 

$$Z = -9.3 + 0.082 \cdot X$$

$$p(X) = \sigma(Z) = \frac{1}{1 + e^{9.3 - 0.083 \cdot X}}$$

The figure below shows a plot of the optimal logistic regression model for this dataset. 

```{r, echo=FALSE}
plot(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue', 
     xlim=c(30,180), ylim=c(-0.2, 1.2), xlab='Study Hours', ylab='Passed')

p <- predict(mod, data.frame(hours=x_grid), type="response")
lines(x_grid, p, lwd=2)

boundary <- -mod$coefficients[1] / mod$coefficients[2]
lines(c(boundary, boundary), c(0,1), lty=2, col="red")
lines(c(0,200), c(0,0), lty=2)
lines(c(0,200), c(1,1), lty=2)
points(passed[df$passed == 0] ~ hours[df$passed == 0], cex=2, pch=21, bg='salmon')
points(passed[df$passed == 1] ~ hours[df$passed == 1], cex=2, pch=21, bg='cornflowerblue')
```

We can use the `predict()` function along with the parameter `type="response"` to see the estimated probabilities of passing for each student in our training set, according to our model 

```{r}
predict(mod, df, type="response")
```

We will now calculate the likelihood and log-likelihood scores for our optimal model. 

```{r}
p <- predict(mod, df, type="response")
pi <- ifelse(df$passed == 1, p, 1-p)
lik <- prod(pi)
nll <- -log(lik)

cat('Log-Likelihood for Optimal Model         : ', lik, '\n', 
    'Negative Log-Likelihood for Optimal Model: ', nll, sep='')

```

Let's now consider a similar example, but with more observations. 

### **Example 2: Exam Prep**

The data for this example is stored in the tab-separated file `data/exam_prep.txt`. We load the data and view a summary of the data. 

```{r}
exam <- read.table("data/exam_prep.txt", sep="\t", header=TRUE)
exam$Passed <- factor(exam$Passed)
summary(exam)
```

We will explore the relationship between hours studied and exam outcome by creating a boxplot. 

```{r, message=FALSE, echo=FALSE}
library(ggplot2)
```

```{r}
ggplot(exam, aes(x=Passed, y=Hours, fill=Passed)) + geom_boxplot()
```

We will use the `glm()` function to create a logistic regression model. 


```{r}
exam_mod <- glm(Passed ~ Hours, exam, family=binomial(link="logit"))
summary(exam_mod)
```

We can see from the model summary that the optimal logistic regression model for this dataset is given by: 

$$Z = -3.19 + 0.02 X$$

$$p(X) = \sigma(Z) = \frac{1}{1 + e^{3.19 - 0.02  X}}$$

In the cell below, we will use our model to estimate the probability that a student will pass the exam if they have studied for each of the following numbers of hours: 0, 50, 100, 150, 200

```{r}
nd <- data.frame(Hours = c(0, 50, 100, 150, 200))
prob <- predict(exam_mod, newdata=nd, type="response")

prob
```

### **Training Accuracy**

A common way to evaluate the performance of a classification model is to calculate its **accuracy** on some set. The accuracy on a particular set is equal to the proportion of observations in the set for which the model predicts the correct class. 

In the cell below, we calculate our model's accuracy on the training set. 

```{r}
prob <- predict(exam_mod, exam, type="response")
pred <- ifelse(prob < 0.5, 0, 1)
tr_acc <- mean(pred == exam$Passed)
cat('Training Accuracy: ', tr_acc, sep='')
```


### **Interpreting Model Coefficients**

Interpreting the coefficients in a logistic regression model is not as direct as in a linear regression model. However, it is possible to do. Because discussing how we would interpret these coefficients, we need to discuss the concept of odds ratios. 

#### **Odds Ratios**

The **odds** of an event occurring is defined as the probability of an event occurring divided by the probability of the event not occurring. The probability of an event $P_E$, and the odds of an event $O_E$ are related by the following expressions:

$$O_E = \frac{P_E}{1 - P_E} \hspace{3 em} P_E = \frac{O_E}{1 + O_E}$$

You might be familiar with the concept of odds from sports or games of chance. Assume that an event is twice as likely to occur as not. Then we might say that the event has "2 to 1 odds". The odds of such an event is often denoted by 2:1 or 2/1. This notion of odds is equivalent to the one that we defined above. 

The table below shows several ways of expressing odds, as well as the probability associated with each odds value. 

<center>
<table width=400>
<tr>
<td> **Odds** </td><td> **Odds** </td><td> **Odds** </td><td> **Prob** </td>
</tr>
<tr>
<td> 1:1 </td><td> 1/1 </td><td> 1.00 </td><td> 0.50 </td>
</tr>
<tr>
<td> 2:1 </td><td> 2/1 </td><td> 2.00 </td><td> 0.67 </td>
</tr>
<tr>
<td> 1:2 </td><td> 1/2 </td><td> 0.50 </td><td> 0.33 </td>
</tr>
<tr>
<td> 1:3 </td><td> 1/3 </td><td> 0.33 </td><td> 0.25 </td>
</tr>
</table>
</center>
</br>

Note that odds are always positive, but can be arbitrarily large. 

#### **Alternate Form For Logistic Regression Model**

Consider a logistic regression model of the form: 

$$p(X) = \frac{1}{1 + e^{-\left(\beta_0 + \beta_1  X \right)}}$$
After a little bit of algebraic manipulation, we can rewrite this expression in the following form:

$$\ln\left[\frac{p(X)}{1 - p(X)} \right] = \beta_0 + \beta_1  X $$

Notice that the left-hand side of this formula is the natural log of the odds that $Y=1$ given $X$. This quantity is often referred to as the **log-odds**. 

This shows that a logistic regression model really is a linear model for log-odds. 

#### **Coefficient Interpretation**

The interpretation of $\beta_1$ in a logistic regression model is as follows: If we increase the predictor $X$ by a single unit, that will result in an increase of $\beta_1$ units in log-odds, and approximately a $\beta_1*100\%$ increase in the odds of $Y=1$.

Returning to our exam prep example, we recall that our model was given by:

$$p(X) = \frac{1}{1 + e^{3.19 - 0.02  X}}$$

This can be rewritten as:


$$\ln\left[\frac{p(X)}{1 - p(X)} \right] = -3.19 + 0.02  X $$

It follows that each additional hour of studying will increase the odds (not the probability) of passing the exam by about 2% (or in other words, by a factor of 1.02).


### **Example 3: Pima Diabetes Dataset**

For this example, we will be working with the Pima Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. All patients are females at least 21 years old of Pima Indian heritage. 

The columns in this dataset are described below. 

* **Pregnancies**: Number of times pregnant
* **Glucose**: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
* **BloodPressure**: Diastolic blood pressure (mm Hg)
* **SkinThickness**: Triceps skin fold thickness (mm)
* **Insulin**: 2-Hour serum insulin (mu U/ml)
* **BMI**: Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction**: Diabetes pedigree function
* **Age**: Age (years)
* **Outcome**: Class variable (0 or 1)

#### **Loading the Data**

We will read in the data, convert the **Outcome** column to a factor variable, and then print a summary of the dataset. 

```{r}
pima <- read.table("data/diabetes.csv", sep=",", header=TRUE)
pima$Outcome <- factor(pima$Outcome)
summary(pima)
```
#### **Visualizing the Data**

The figure below displays density plots for each of our predictors, and for each value of our response variable. 

```{r, message=FALSE, echo=FALSE}
library(tidyr)
library(reshape2)
```


```{r, echo=FALSE}
melt(pima, id.vars = "Outcome") %>%
  ggplot(aes(value, fill=Outcome)) + facet_wrap(~ variable, scale="free") + 
  geom_density(alpha=0.4)
```

#### **Splitting the Data**

We will consider multiple logistic regression models on this dataset. We will create training and validation sets. The models will be trained on the training set, and compared based on their accuracy on the validation set. 

The `caret` package provides a function called `createDataPartition()` which can be used to create a train/validation split. By default, this function creates a stratified random split. As a result, the distribution of the response variable will be roughly the same in the training set and validation set. 

In the cell below, we create a 70/30 split of our dataset. 

```{r}
library(caret)
set.seed(1)
train.index <- createDataPartition(pima$Outcome, p = .7, list=FALSE)
train <- pima[ train.index,]
valid  <- pima[-train.index,]

summary(train$Outcome)
summary(valid$Outcome)
```

#### **Create Models**

We will now train our model on the training set. 

```{r}
pima_mod_1 <- glm(Outcome ~ ., train, family=binomial(link="logit"))
summary(pima_mod_1)
```

It appears that three of our predictors, `Insulin`, `SkinThickness`, and `BloodPressure` do not have statistically significant effects on the prescience of diabetes. We will remove them one at a time. 

```{r}
pima_mod_2 <- glm(Outcome ~ . - Insulin, train, family=binomial(link="logit"))
summary(pima_mod_2)
```

```{r}
pima_mod_3 <- glm(Outcome ~ . - Insulin - SkinThickness, train, family=binomial(link="logit"))
summary(pima_mod_3)
```

```{r}
pima_mod_4 <- glm(Outcome ~ . - Insulin - SkinThickness - BloodPressure, train, family=binomial(link="logit"))
summary(pima_mod_4)
```

Notice that all predictors are statistically significant in our fourth model. 

#### **Calculate Training and Validation Accuracies**

We will now calculate the training and validation accuracies for each of our four models. The cell below shows the code for calculating these metrics using Model 1. The same values are also calculated for the other three models, but the code has been hidden in the output. It can be seen in the Rmd file for this lesson, however. 


```{r}
tr_prob_1 <- predict(pima_mod_1, train, type="response")
tr_pred_1 <- ifelse(tr_prob_1 < 0.5, 0, 1)
tr_acc_1 <- mean(tr_pred_1 == train$Outcome)

va_prob_1 <- predict(pima_mod_1, valid, type="response")
va_pred_1 <- ifelse(va_prob_1 < 0.5, 0, 1)
va_acc_1 <- mean(va_pred_1 == valid$Outcome)
```

```{r, echo=FALSE}
tr_prob_2 <- predict(pima_mod_2, train, type="response")
tr_pred_2 <- ifelse(tr_prob_2 < 0.5, 0, 1)
tr_acc_2 <- mean(tr_pred_2 == train$Outcome)

va_prob_2 <- predict(pima_mod_2, valid, type="response")
va_pred_2 <- ifelse(va_prob_2 < 0.5, 0, 1)
va_acc_2 <- mean(va_pred_2 == valid$Outcome)

tr_prob_3 <- predict(pima_mod_3, train, type="response")
tr_pred_3 <- ifelse(tr_prob_3 < 0.5, 0, 1)
tr_acc_3 <- mean(tr_pred_3 == train$Outcome)

va_prob_3 <- predict(pima_mod_3, valid, type="response")
va_pred_3 <- ifelse(va_prob_3 < 0.5, 0, 1)
va_acc_3 <- mean(va_pred_3 == valid$Outcome)

tr_prob_4 <- predict(pima_mod_4, train, type="response")
tr_pred_4 <- ifelse(tr_prob_4 < 0.5, 0, 1)
tr_acc_4 <- mean(tr_pred_4 == train$Outcome)

va_prob_4 <- predict(pima_mod_4, valid, type="response")
va_pred_4 <- ifelse(va_prob_4 < 0.5, 0, 1)
va_acc_4 <- mean(va_pred_4 == valid$Outcome)
```

We will now compare the accuracy scores for each of our models. 

```{r}
cat("Model 1 Training Accuracy:   ", tr_acc_1, "\n",
    "Model 1 Validation Accuracy: ", va_acc_1, "\n\n",
    
    "Model 2 Training Accuracy:   ", tr_acc_2, "\n",
    "Model 2 Validation Accuracy: ", va_acc_2, "\n\n",
    
    "Model 3 Training Accuracy:   ", tr_acc_3, "\n",
    "Model 3 Validation Accuracy: ", va_acc_3, "\n\n",
    
    "Model 4 Training Accuracy:   ", tr_acc_4, "\n",
    "Model 4 Validation Accuracy: ", va_acc_4, sep=""
)

```

Notice that Models 2 and 4 are actually tied in their performance on the validation set, although the score is similar for all four models. In this case, we would likely prefer Model 4 since it does not contain any non-significant predictors. 

