Logistic regression is a regression model used to classify a binary dependent variable, one that can take only two values, “0” and “1”. These values can represent pass/fail, dead/alive, present/not present, below/above, false/true, no/yes, or any other two-level outcome you need. Because the output of logistic regression falls between 0 and 1, it is well suited to estimating the probability of an event occurring.
Key Point #1: The output of logistic regression is the probability of an event occurring
To see the full benefits of logistic regression, we will use it to predict whether Airmen will pass or fail their PT test. In our case, we will use a 1 to represent a passing score and a 0 to represent a failing score.
Logistic regression does not make many of the key assumptions of linear regression and general linear models based on ordinary least squares, particularly regarding linearity, normality, homoscedasticity, and measurement level. The logistic function is shown below. Because logistic regression works on the log-odds scale rather than a linear one, many of the diagnostics used for linear regression are neither needed nor meaningful.
\[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\]
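To make the formula concrete, here is a minimal R sketch of the logistic (inverse-logit) function; the coefficient values are made up for illustration and are not estimates from the PT data.
#Logistic function: maps the linear predictor b0 + b1*x onto a probability in (0, 1)
logistic <- function(x, b0 = -10, b1 = 0.3) {
  exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
}
logistic(20)            # small linear predictor -> probability near 0
logistic(50)            # large linear predictor -> probability near 1
plogis(-10 + 0.3 * 50)  # base R equivalent of the same calculation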
Logistic regression requires each observation to be independent; that is, the data points should not come from a dependent-samples design such as before-after measurements or matched pairs. The model should also have little or no multicollinearity, meaning the independent variables should be independent of each other. (There is still the option to include interaction effects of categorical variables in the analysis and the model.) Aside from multicollinearity and sample size, logistic regression is fairly robust.
Because there are only a limited number of metrics that tell us whether a logistic model fits poorly, it is best to run the regression twice. The first model is fit on a large sample of the dataset; we then test how well that model predicts the outcomes for a different sample. This process is performed below.
Key Point #2: Running logistic regression on a sample, and then testing it on a different sample is the best validation procedure
knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(warning = F)
knitr::opts_chunk$set(message = F)
library(knitr)
library(kableExtra)
library(tidyverse)
library(pROC)
library(GGally)
#Organizing package information for the table
packages <- c("tidyverse", "pROC", "knitr", "kableExtra", "GGally")
display <- c("Package","Title", "Maintainer", "Version", "URL")
table <- matrix(NA, 1, NROW(display), dimnames = list(1, display))
for(i in 1:NROW(packages)){
  pkg.info <- packageDescription(packages[i])
  table <- rbind(table, matrix(unlist(pkg.info[c(display)]), 1, NROW(display), byrow = T))
}
table[,NROW(display)] <- stringr::str_extract(table[,NROW(display)], ".+,")
#Table of packages
kable(table[-1,], format = "html", align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Package | Title | Maintainer | Version | URL |
|---|---|---|---|---|
| tidyverse | Easily Install and Load ‘Tidyverse’ Packages | Hadley Wickham <hadley@rstudio.com> | 1.1.1 | http://tidyverse.org, |
| pROC | Display and Analyze ROC Curves | Xavier Robin <robin@lindinglab.org> | 1.10.0 | NA |
| knitr | A General-Purpose Package for Dynamic Report Generation in R | Yihui Xie <xie@yihui.name> | 1.16 | NA |
| kableExtra | Construct Complex Table with ‘kable’ and Pipe Syntax | Hao Zhu <haozhu233@gmail.com> | 0.3.0 | http://haozhu233.github.io/kableExtra/, |
| GGally | Extension to ‘ggplot2’ | Barret Schloerke <schloerke@gmail.com> | 1.3.2 | https://ggobi.github.io/ggally, |
The data consists of nearly 7 million entries, each row representing an individual taking the PFA. No row is dependent on another row. Today, we will work with a very small random sample of that data, which can be downloaded here. As with all real-world data, there may be errors in this dataset; however, let’s start by getting a basic understanding of the data.
#Quick source of data
load("pt.rdata")
PT.Scores <- na.omit(PT.Scores)
#Split into Train/Test datasets - this can occur later in the process
train <- PT.Scores[sample(nrow(PT.Scores), nrow(PT.Scores) * .7), ]
test <- anti_join(PT.Scores, train, by = "PERSONNEL_RECORD_ID")
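One note on the split above: because the sampling is random, the exact rows landing in train and test will change each time the code is run, so the numbers reported later will vary slightly. If reproducible results are wanted, a seed can be fixed before sampling; a minimal sketch (the seed value here is arbitrary, not part of the original analysis):
#Fix the random seed so the same 70/30 split is drawn every run (arbitrary seed value)
set.seed(1234)
train <- PT.Scores[sample(nrow(PT.Scores), nrow(PT.Scores) * .7), ]
test <- anti_join(PT.Scores, train, by = "PERSONNEL_RECORD_ID")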
Below are three graphs/tables that give a general summary of our data. The first shows the spreadsheet view of the data. I always find this a good starting place; take the PERSONNEL_RECORD_ID column, for example. We can easily determine that this column is unique to each person and, although numeric, should have no influence on the outcome, so we will exclude it from the regression model. The second table shows summary statistics for each independent variable, giving a quick view of each variable's distribution and variability. For now, we exclude the factor variables from this chart; only numeric variables are shown.
#Show spreadsheet view of data
train
#Summary of (non factor) variables
train[sapply(train,is.numeric)] %>%
gather(1:9, key = "Variable", value = "Value") %>%
group_by(Variable) %>%
summarise_all(funs(min, mean, median, max, sd)) %>%
kable(digits = 2, align = "c", format = "html", caption = "Summary of Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| Variable | min | mean | median | max | sd |
|---|---|---|---|---|---|
| ABD_CIRC | 20 | 32.95 | 33.0 | 46 | 3.39 |
| AERO_SCORE | 457 | 747.19 | 738.0 | 1285 | 107.28 |
| AGE | 18 | 30.60 | 29.0 | 58 | 8.44 |
| CRUNCH_COUNT | 0 | 50.49 | 51.0 | 88 | 9.19 |
| HEIGHT | 54 | 69.15 | 69.5 | 79 | 3.58 |
| PASSFAIL | 0 | 0.88 | 1.0 | 1 | 0.32 |
| PERSONNEL_RECORD_ID | 555239 | 1131014.70 | 1190246.0 | 1710085 | 309070.53 |
| PUSHUP_COUNT | 0 | 47.93 | 48.0 | 100 | 13.45 |
| WEIGHT | 89 | 177.75 | 178.0 | 285 | 30.27 |
The last graph is a correlation plot, which shows the correlation between each pair of variables. The darker the green, the stronger the positive correlation; the darker the red, the stronger the negative correlation. A common practice is to further assess correlations above 0.7 or below -0.7. In the PT dataset, we have a high correlation between Weight and Abdominal Circumference, which logically makes sense. Rather than remove a column or merge the information, we will make a note to assess the validation results later. So, for now, we will leave both variables in.
ggcorr(PT.Scores[-c(1, 5, 6, 10)], nbreaks = 6, label = T, low = "red3", high = "green3",
label_round = 2, name = "Correlation Scale", label_alpha = T, hjust = 0.75) +
ggtitle(label = "Correlation Plot") +
theme(plot.title = element_text(hjust = 0.6))
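If a numeric check is preferred over reading colors off the plot, the same information can be computed with cor(); a quick sketch on the numeric columns (the column selection simply mirrors the variables plotted above):
#Numeric check of the pairwise correlations among the numeric variables
num.vars <- PT.Scores[sapply(PT.Scores, is.numeric)]
round(cor(num.vars), 2)
#The single pair flagged above
cor(num.vars$WEIGHT, num.vars$ABD_CIRC)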
To explore our data and later build models correctly, we needed to separate the data into train and test sets, which we did above. The test set simulates a real-world dataset and is not used in any way while building the model. We build the initial model from the train subset and validate it against the separate test subset. This separation helps ensure that our regression model is accurate and unbiased.
The general model function is glm, for Generalized Linear Models (GLM). There is a lot of power and flexibility behind this function; however, we will only use the basic syntax, which is as follows: \[glm(formula, family, data)\] To start, we need to identify our formula. As mentioned earlier, we want to predict whether someone will pass or fail their Physical Fitness Assessment (PFA) in the Air Force, so we will use the column PASSFAIL as the dependent variable. The structure of our data allows us to use every column except the ID as independent variables, so in R terms our formula is \(PASSFAIL\) ~ \(.\). The family argument is “binomial” since we are using a binomial distribution within the logistic model. The data argument is the data frame containing our data, in this case train.
Now we have the basic structure of the function call, which we will save as model1.
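Putting those arguments together, the bare call looks like the sketch below; the actual chunk further down first drops the ID and biased columns and wraps this same call in step() to perform stepwise variable selection.
#Bare logistic regression call: PASSFAIL against every other column of train
glm(PASSFAIL ~ ., family = binomial(link = "logit"), data = train)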
The first output we will assess is the summary of the coefficients for our variables. Ideally, we are looking for larger absolute z values (lower p-values). Looking at the table below, we can see that only Weight has a p-value above .05. Bear in mind that the estimates from logistic regression characterize the relationship between each predictor and the response on a log-odds scale. For example, this model suggests that for every one-unit increase in Age, the log-odds of an Airman passing increase by .06. Because this isn’t of much practical value, we’ll usually want to exponentiate the coefficients to obtain the odds ratio for each predictor. So for Age, we would expect the odds of an Airman passing to be multiplied by exp(.06) ≈ 1.06, an increase of roughly 6% for each additional year. This is easier to interpret quickly.
#Remove variables that would bias the model
train <- train[-c(5, 6, 7, 12)]
#Logistic model with stepwise selection in both directions
model1 <- step(glm(PASSFAIL ~ ., family = binomial(link = "logit"), data = train[-c(1)]), direction = "both", trace = F)
#Table of coefficients; the last column is the inverse logit of each coefficient
kable(cbind(
  summary(model1)$coefficients,
  "Exp Coef" = exp(model1$coefficients) / (1 + exp(model1$coefficients))),
  format = "html", align = "c", caption = "Summary of Coefficients", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | Estimate | Std. Error | z value | Pr(>|z|) | Exp Coef |
|---|---|---|---|---|---|
| (Intercept) | 3.24 | 2.52 | 1.28 | 0.20 | 0.96 |
| HEIGHT | 0.13 | 0.04 | 3.56 | 0.00 | 0.53 |
| WEIGHT | -0.01 | 0.01 | -1.49 | 0.14 | 0.50 |
| ABD_CIRC | -0.31 | 0.05 | -5.82 | 0.00 | 0.42 |
| ACTIVEG | -0.88 | 0.21 | -4.13 | 0.00 | 0.29 |
| ACTIVER | -0.75 | 0.24 | -3.20 | 0.00 | 0.32 |
| SEXM | 0.93 | 0.29 | 3.18 | 0.00 | 0.72 |
| AGE | 0.06 | 0.01 | 5.00 | 0.00 | 0.51 |
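One caution about the table: the "Exp Coef" column is the inverse logit of each coefficient, exp(b) / (1 + exp(b)), rather than the odds ratio described in the text. If the odds ratios themselves (and confidence intervals) are wanted, a short sketch using the fitted model1 (confint() on a glm uses profile likelihood via the MASS package):
#Odds ratios: exponentiate the log-odds coefficients of model1
exp(coef(model1))
#Profile-likelihood 95% confidence intervals on the odds-ratio scale
exp(confint(model1))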
First, we have to decide what to do with the output of our model. We already discussed that the output is a probability, but what does 52% mean? What does 48% mean? For our model, we will assume that if an Airman’s predicted probability of passing is over 50%, the model predicts a pass; otherwise, it predicts a fail. Our model’s predictions are displayed in a confusion matrix.
This matrix compares the model's predicted response with the actual response. For example, if the model predicted an Airman would pass and the Airman did pass, a tally (count) is placed in the 2nd row, 2nd column. If the model predicted an Airman would fail and the Airman did fail, a tally is placed in the 1st row, 1st column. Ideally, we want to maximize the number of times the model predicted the correct outcome. The incorrectly predicted outcomes fall in the other two cells; these represent the model's type I and type II errors, which we want to minimize.
#Predicted probabilities for the training data, stored for reuse below
train.predicted.m1 <- predict(model1, newdata = train, type = "response")
#Confusion Matrix
kable(table(train$PASSFAIL, train.predicted.m1 > 0.5),
      format = "html", align = "c", caption = "Confusion Matrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | FALSE | TRUE |
|---|---|---|
| 0 | 32 | 157 |
| 1 | 9 | 1402 |
#Confusion Percentages
right1 <- table(train$PASSFAIL, train.predicted.m1 > 0.5)[1,1]
right2 <- table(train$PASSFAIL, train.predicted.m1 > 0.5)[2,2]
total <- length(train$PASSFAIL)
default <- sum(train$PASSFAIL) / total
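These pieces combine into the accuracy figure quoted below; writing the arithmetic out:
#Overall accuracy on the training data
(right1 + right2) / total
#Base rate: accuracy of naively predicting that every Airman passes
default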
Quick math shows that the model predicted correctly for (32 + 1402) / 1600 = 89.6% of the training data, meaning we misclassified 10.4% of the Airmen. This matters because 88.2% of Airmen pass; naively predicting a pass for everyone would already be 88.2% accurate, so the model improves on that baseline, though only slightly.
The next two plots are not required for logistic regression the way they are for linear regression. They show the influence of each individual observation on the coefficients: the larger the value, the more that observation influences the coefficients. This is important to assess in linear regression, where one point can overly impact the rest of the results; a variable may even appear significant only because of a single large outlier.
The second graph is similar in that it also shows influence, plotting the studentized residuals against leverage. Leverage is a measure of how far an observation's independent-variable values are from those of the other observations. As long as no outliers fall outside the red dotted lines, we can more safely state that no single point overly influenced our regression.
#Cook's distance and residuals vs. leverage plots (shown for reference; not required for logistic regression)
plot(model1, which = 4:5, id.n = 5)
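If numbers are preferred over plots, the same influence measures can be extracted directly from model1; a brief sketch (the 4/n cutoff is a common rule of thumb, not a hard rule):
#Influence measures for each observation in the training data
infl <- data.frame(cooks = cooks.distance(model1),
                   leverage = hatvalues(model1),
                   studentized = rstudent(model1))
#Observations whose Cook's distance exceeds the 4/n rule of thumb
head(infl[infl$cooks > 4 / nrow(infl), ])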
#Logistic curve: actual outcome vs predicted probability, colored by classification
train %>%
  mutate(
    Predicted = predict(model1, type = "response"),
    Misclassified = ifelse(PASSFAIL > .5 & Predicted < .5 |
                             PASSFAIL < .5 & Predicted > .5,
                           "Misclassified", "Correctly Classified")) %>%
  ggplot(aes(Predicted, PASSFAIL)) +
  geom_point(aes(color = Misclassified), alpha = .2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = F) +
  scale_x_continuous(name = "Predicted Probability") +
  scale_y_continuous(name = "Actual Outcome") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank()) +
  ggtitle(label = "Actual vs Predicted with Logistic Curve")
R-squared is the usual statistic for describing the quality of a linear model: a single number stating how much of the total variance is explained by the model. Logistic regression has no such number. Instead of creating a pseudo R-squared, we will use the ROC curve's Area Under the Curve (AUC) and the simple predicted percentages.
Key Point #3: Area Under the Curve and Confusion Matrix percentages are THE key numbers to remember
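That said, for readers who still want a single R-squared-like number, McFadden's pseudo R-squared can be computed from the fitted model's deviances; a one-line sketch using model1:
#McFadden's pseudo R-squared: 1 - residual deviance / null deviance
1 - model1$deviance / model1$null.deviance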
#Convert PASSFAIL to factor for pROC
train$PASSFAIL <- as.character(train$PASSFAIL)
train$PASSFAIL <- as.factor(train$PASSFAIL)
#Plot ROC curve with confidence intervals
rocobj <- plot.roc(train$PASSFAIL, model1$fitted.values,
main="Confidence intervals", percent=TRUE, ci=TRUE, print.auc=TRUE)
ciobj <- ci.se(rocobj, specificities=seq(0, 100, 5), conf.level = .99, boot.n = 100)
plot(ciobj, type="shape", col="#1c61b6AA")
plot(ci(rocobj, of="thresholds", thresholds="best"))
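The AUC shown on the plot can also be pulled out as a plain number, which is convenient when comparing models; a short sketch using the rocobj created above:
#Numeric AUC and its confidence interval from the ROC object
auc(rocobj)
ci.auc(rocobj)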
It appears that the logistic model is fairly good and does not violate any of the model's assumptions. The next step is to run the model on the test dataset, using the coefficients and variables estimated from the train dataset. The two outputs we will compare are the ROC curve/AUC and the confusion matrix percentages.
#Convert PASSFAIL to factor for pROC
test$PASSFAIL <- as.character(test$PASSFAIL)
test$PASSFAIL <- as.factor(test$PASSFAIL)
test.fitted.values <- predict(model1, newdata = test, type = "response")
#Plot ROC curve with confidence intervals
rocobj <- plot.roc(test$PASSFAIL, test.fitted.values,
main="Confidence intervals", percent=TRUE, ci=TRUE, print.auc=TRUE)
ciobj <- ci.se(rocobj, specificities=seq(0, 100, 5), conf.level = .99, boot.n = 100)
plot(ciobj, type="shape", col="#1c61b6AA")
plot(ci(rocobj, of="thresholds", thresholds="best"))
#Confusion Matrix
kable(table(test$PASSFAIL, test.fitted.values > 0.5),
format = "html", align = "c", caption = "Confusion Matrix") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | FALSE | TRUE |
|---|---|---|
| 0 | 24 | 48 |
| 1 | 8 | 603 |
#Confusion Percentages
right1 <- table(test$PASSFAIL, test.fitted.values > 0.5)[1,1]
right2 <- table(test$PASSFAIL, test.fitted.values > 0.5)[2,2]
total <- length(test$PASSFAIL)
Again, quick math shows that the model predicted correctly for (24 + 603) / 683 = 91.8% of the test data, meaning we misclassified 8.2% of the Airmen. With the train and test results being roughly equal in predictive ability, we will rerun the model on the entire dataset. This updates the coefficients and establishes the final formula for future use.
#Final model refit on the entire dataset
final <- step(glm(PASSFAIL ~ ., data = PT.Scores[-1],
                  family = "binomial"), trace = F, direction = "both")
#Final coefficients (the second column is the inverse logit of each coefficient)
Final.Coefficients <- data.frame(Log = final$coefficients,
                                 "Exp of Log" = exp(final$coefficients) / (1 + exp(final$coefficients)))
kable(Final.Coefficients, format = "html", align = "c", caption = "Summary of Coefficients", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | Log | Exp.of.Log |
|---|---|---|
| (Intercept) | 16.10 | 1.00 |
| HEIGHT | 0.07 | 0.52 |
| ABD_CIRC | -0.30 | 0.43 |
| AERO_SCORE | -0.03 | 0.49 |
| PUSHUP_COUNT | 0.04 | 0.51 |
| CRUNCH_COUNT | 0.18 | 0.55 |
| SEXM | -5.55 | 0.00 |
| AGE | 0.22 | 0.56 |
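Finally, to actually use the final formula, predicted pass probabilities for new Airmen can be generated with predict(). The record below is entirely made up for illustration; it assumes SEX is coded "M"/"F" and that the columns supplied are the predictors retained in the table above.
#Hypothetical new Airman (illustrative values only)
new.airman <- data.frame(HEIGHT = 70, ABD_CIRC = 34, AERO_SCORE = 750,
                         PUSHUP_COUNT = 45, CRUNCH_COUNT = 50,
                         SEX = "M", AGE = 32)
#Predicted probability that this Airman passes the PFA
predict(final, newdata = new.airman, type = "response")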
What are the two key outputs that describe how good a logistic model is?
Do males have a higher probability of passing their PFA?
Create a model with just Height + Weight as X variables. What are their coefficients?