Logistic regression is a regression model used to classify a binary dependent variable, one that can take only two values, “0” and “1”. These values can represent pass/fail, dead/alive, present/not present, below/above, false/true, no/yes, or any other two-level outcome you need. Because the output of logistic regression falls between 0 and 1, it is well suited to estimating the probability of an event occurring.
Key Point #1: The output of logistic regression is the probability of an event occurring
To see the full benefits of logistic regression, we will use it to predict whether Airmen will pass or fail their PT test. In our case, we will use a 1 to represent a passing score and a 0 to represent a failing score.
Logistic regression does not make many of the key assumptions of linear regression and general linear models based on ordinary least squares, particularly regarding linearity, normality, homoscedasticity, and measurement level. The logistic function is shown below. Because logistic regression works on the log-odds scale rather than a linear one, many of the diagnostics used for linear regression are neither needed nor meaningful.
\[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\]
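To make the formula concrete, here is a minimal R sketch of the logistic (inverse-logit) function; the coefficient values are made up for illustration and are not estimates from the PT data.
#Logistic function: maps the linear predictor b0 + b1*x onto a probability in (0, 1)
logistic <- function(x, b0 = -10, b1 = 0.3) {
  exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
}
logistic(20)            # small linear predictor -> probability near 0
logistic(50)            # large linear predictor -> probability near 1
plogis(-10 + 0.3 * 50)  # base R equivalent of the same calculation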
Logistic regression requires each observation to be independent; that is, the data points should not come from a dependent-samples design such as before-after measurements or matched pairs. The model should also have little or no multicollinearity, meaning the independent variables should be independent of each other. (There is still the option to include interaction effects of categorical variables in the analysis and the model.) Aside from multicollinearity and sample size, logistic regression is fairly robust.
Because there are only a limited number of metrics that tell us whether a logistic model fits poorly, it is best to run the regression twice. The first model is fit on a large sample of the dataset; we then test how well that model predicts the outcomes for a different sample. This process is performed below.
Key Point #2: Running logistic regression on a sample, and then testing it on a different sample is the best validation procedure
knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(warning = F)
knitr::opts_chunk$set(message = F)
library(knitr)
library(kableExtra)
library(tidyverse)
library(pROC)
library(GGally)
#Organizing package information for the table
packages <- c("tidyverse", "pROC", "knitr", "kableExtra", "GGally")
display <- c("Package","Title", "Maintainer", "Version", "URL")
table <- matrix(NA, 1, NROW(display), dimnames = list(1, display))
for(i in 1:NROW(packages)){
  pkg.info <- packageDescription(packages[i])
  table <- rbind(table, matrix(unlist(pkg.info[c(display)]), 1, NROW(display), byrow = T))
}
table[,NROW(display)] <- stringr::str_extract(table[,NROW(display)], ".+,")
#Table of packages
kable(table[-1,], format = "html", align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Package | Title | Maintainer | Version | URL |
|---|---|---|---|---|
| tidyverse | Easily Install and Load ‘Tidyverse’ Packages | Hadley Wickham <hadley@rstudio.com> | 1.1.1 | http://tidyverse.org, |
| pROC | Display and Analyze ROC Curves | Xavier Robin <robin@lindinglab.org> | 1.10.0 | NA |
| knitr | A General-Purpose Package for Dynamic Report Generation in R | Yihui Xie <xie@yihui.name> | 1.16 | NA |
| kableExtra | Construct Complex Table with ‘kable’ and Pipe Syntax | Hao Zhu <haozhu233@gmail.com> | 0.3.0 | http://haozhu233.github.io/kableExtra/, |
| GGally | Extension to ‘ggplot2’ | Barret Schloerke <schloerke@gmail.com> | 1.3.2 | https://ggobi.github.io/ggally, |
The data consists of nearly 7 million entries, each row representing an individual taking the PFA. No row is dependent on another row. Today, we will work with a very small random sample of that data, which can be downloaded here. As with all real-world data, there may be errors in this dataset; however, let’s start by getting a basic understanding of the data.
#Quick source of data
load("pt.rdata")
PT.Scores <- na.omit(PT.Scores)
#Split into Train/Test datasets - this can occur later in the process
train <- PT.Scores[sample(nrow(PT.Scores), nrow(PT.Scores) * .7), ]
test <- anti_join(PT.Scores, train, by = "PERSONNEL_RECORD_ID")
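One note on the split above: because the sampling is random, the exact rows landing in train and test will change each time the code is run, so the numbers reported later will vary slightly. If reproducible results are wanted, a seed can be fixed before sampling; a minimal sketch (the seed value here is arbitrary, not part of the original analysis):
#Fix the random seed so the same 70/30 split is drawn every run (arbitrary seed value)
set.seed(1234)
train <- PT.Scores[sample(nrow(PT.Scores), nrow(PT.Scores) * .7), ]
test <- anti_join(PT.Scores, train, by = "PERSONNEL_RECORD_ID")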
Below are three graphs/tables that give a general summary of our data. The first shows the spreadsheet view of the data. I always find this a good starting place; take the PERSONNEL_RECORD_ID column, for example. We can easily determine that this column is unique to each person and, although numeric, should have no influence on the outcome, so we will exclude it from the regression model. The second table shows summary statistics for each independent variable, giving a quick view of each variable's distribution and variability. For now, we exclude the factor variables from this chart; only numeric variables are shown.
#Show spreadsheet view of data
train
#Summary of (non factor) variables
train[sapply(train,is.numeric)] %>%
gather(1:9, key = "Variable", value = "Value") %>%
group_by(Variable) %>%
summarise_all(funs(min, mean, median, max, sd)) %>%
kable(digits = 2, align = "c", format = "html", caption = "Summary of Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| Variable | min | mean | median | max | sd |
|---|---|---|---|---|---|
| ABD_CIRC | 20 | 32.95 | 33.0 | 46 | 3.39 |
| AERO_SCORE | 457 | 747.19 | 738.0 | 1285 | 107.28 |
| AGE | 18 | 30.60 | 29.0 | 58 | 8.44 |
| CRUNCH_COUNT | 0 | 50.49 | 51.0 | 88 | 9.19 |
| HEIGHT | 54 | 69.15 | 69.5 | 79 | 3.58 |
| PASSFAIL | 0 | 0.88 | 1.0 | 1 | 0.32 |
| PERSONNEL_RECORD_ID | 555239 | 1131014.70 | 1190246.0 | 1710085 | 309070.53 |
| PUSHUP_COUNT | 0 | 47.93 | 48.0 | 100 | 13.45 |
| WEIGHT | 89 | 177.75 | 178.0 | 285 | 30.27 |
The last graph is a correlation plot, which shows the correlation between each pair of variables. The darker the green, the stronger the positive correlation; the darker the red, the stronger the negative correlation. A common practice is to further assess correlations above 0.7 or below -0.7. In the PT dataset, we have a high correlation between Weight and Abdominal Circumference, which logically makes sense. Rather than remove a column or merge the information, we will make a note to assess the validation results later. So, for now, we will leave both variables in.
ggcorr(PT.Scores[-c(1, 5, 6, 10)], nbreaks = 6, label = T, low = "red3", high = "green3",
label_round = 2, name = "Correlation Scale", label_alpha = T, hjust = 0.75) +
ggtitle(label = "Correlation Plot") +
theme(plot.title = element_text(hjust = 0.6))
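If a numeric check is preferred over reading colors off the plot, the same information can be computed with cor(); a quick sketch on the numeric columns (the column selection simply mirrors the variables plotted above):
#Numeric check of the pairwise correlations among the numeric variables
num.vars <- PT.Scores[sapply(PT.Scores, is.numeric)]
round(cor(num.vars), 2)
#The single pair flagged above
cor(num.vars$WEIGHT, num.vars$ABD_CIRC)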
To explore our data and later build models correctly, we needed to separate the data into train and test sets, which we did above. The test set simulates a real-world dataset and is not used in any way while building the model. We build the initial model from the train subset and validate it against the separate test subset. This separation helps ensure that our regression model is accurate and unbiased.
The general model function is glm, for Generalized Linear Models (GLM). There is a lot of power and flexibility behind this function; however, we will only use the basic syntax, which is as follows: \[glm(formula, family, data)\] To start, we need to identify our formula. As mentioned earlier, we want to predict whether someone will pass or fail their Physical Fitness Assessment (PFA) in the Air Force, so we will use the column PASSFAIL as the dependent variable. The structure of our data allows us to use every column except the ID as independent variables, so in R terms our formula is \(PASSFAIL\) ~ \(.\). The family argument is “binomial” since we are using a binomial distribution within the logistic model. The data argument is the data frame containing our data, in this case train.
Now we have the basic structure of the function call, which we will save as model1.
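Putting those arguments together, the bare call looks like the sketch below; the actual chunk further down first drops the ID and biased columns and wraps this same call in step() to perform stepwise variable selection.
#Bare logistic regression call: PASSFAIL against every other column of train
glm(PASSFAIL ~ ., family = binomial(link = "logit"), data = train)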
The first output we will assess is the summary of the coefficients for our variables. Ideally, we are looking for larger absolute z values (lower p-values). Looking at the table below, we can see that only Weight has a p-value above .05. Bear in mind that the estimates from logistic regression characterize the relationship between each predictor and the response on a log-odds scale. For example, this model suggests that for every one-unit increase in Age, the log-odds of an Airman passing increase by .06. Because this isn’t of much practical value, we’ll usually want to exponentiate the coefficients to obtain the odds ratio for each predictor. So for Age, we would expect the odds of an Airman passing to be multiplied by exp(.06) ≈ 1.06, an increase of roughly 6% for each additional year. This is easier to interpret quickly.
#Remove variables that would bias the model
train <- train[-c(5, 6, 7, 12)]
#Logistic model with stepwise selection in both directions
model1 <- step(glm(PASSFAIL ~ ., family = binomial(link = "logit"), data = train[-c(1)]), direction = "both", trace = F)
#Table of coefficients; the last column is the inverse logit of each coefficient
kable(cbind(
  summary(model1)$coefficients,
  "Exp Coef" = exp(model1$coefficients) / (1 + exp(model1$coefficients))),
  format = "html", align = "c", caption = "Summary of Coefficients", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | Estimate | Std. Error | z value | Pr(>|z|) | Exp Coef |
|---|---|---|---|---|---|
| (Intercept) | 3.24 | 2.52 | 1.28 | 0.20 | 0.96 |
| HEIGHT | 0.13 | 0.04 | 3.56 | 0.00 | 0.53 |
| WEIGHT | -0.01 | 0.01 | -1.49 | 0.14 | 0.50 |
| ABD_CIRC | -0.31 | 0.05 | -5.82 | 0.00 | 0.42 |
| ACTIVEG | -0.88 | 0.21 | -4.13 | 0.00 | 0.29 |
| ACTIVER | -0.75 | 0.24 | -3.20 | 0.00 | 0.32 |
| SEXM | 0.93 | 0.29 | 3.18 | 0.00 | 0.72 |
| AGE | 0.06 | 0.01 | 5.00 | 0.00 | 0.51 |
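One caution about the table: the "Exp Coef" column is the inverse logit of each coefficient, exp(b) / (1 + exp(b)), rather than the odds ratio described in the text. If the odds ratios themselves (and confidence intervals) are wanted, a short sketch using the fitted model1 (confint() on a glm uses profile likelihood via the MASS package):
#Odds ratios: exponentiate the log-odds coefficients of model1
exp(coef(model1))
#Profile-likelihood 95% confidence intervals on the odds-ratio scale
exp(confint(model1))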
First, we have to decide what to do with the output of our model. We already discussed that the output is a probability, but what does 52% mean? What does 48% mean? For our model, we will assume that if an Airman’s predicted probability of passing is over 50%, the model predicts a pass; otherwise, it predicts a fail. Our model’s predictions are displayed in a confusion matrix.
This matrix compares the model's predicted response with the actual response. For example, if the model predicted an Airman would pass and the Airman did pass, a tally (count) is placed in the 2nd row, 2nd column. If the model predicted an Airman would fail and the Airman did fail, a tally is placed in the 1st row, 1st column. Ideally, we want to maximize the number of times the model predicted the correct outcome. The incorrectly predicted outcomes fall in the other two cells; these represent the model's type I and type II errors, which we want to minimize.
#Predicted probabilities for the training data, stored for reuse below
train.predicted.m1 <- predict(model1, newdata = train, type = "response")
#Confusion Matrix
kable(table(train$PASSFAIL, train.predicted.m1 > 0.5),
      format = "html", align = "c", caption = "Confusion Matrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | FALSE | TRUE |
|---|---|---|
| 0 | 32 | 157 |
| 1 | 9 | 1402 |
#Confusion Percentages
right1 <- table(train$PASSFAIL, train.predicted.m1 > 0.5)[1,1]
right2 <- table(train$PASSFAIL, train.predicted.m1 > 0.5)[2,2]
total <- length(train$PASSFAIL)
default <- sum(train$PASSFAIL) / total
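These pieces combine into the accuracy figure quoted below; writing the arithmetic out:
#Overall accuracy on the training data
(right1 + right2) / total
#Base rate: accuracy of naively predicting that every Airman passes
default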
Quick math shows that the model predicted correctly for (32 + 1402) / 1600 = 89.6% of the training data, meaning we misclassified 10.4% of the Airmen. This matters because 88.2% of Airmen pass; naively predicting a pass for everyone would already be 88.2% accurate, so the model improves on that baseline, though only slightly.
The next two plots are not required for logistic regression the way they are for linear regression. They show the influence of each individual observation on the coefficients: the larger the value, the more that observation influences the coefficients. This is important to assess in linear regression, where one point can overly impact the rest of the results; a variable may even appear significant only because of a single large outlier.
The second graph is similar in that it also shows influence, plotting the studentized residuals against leverage. Leverage is a measure of how far an observation's independent-variable values are from those of the other observations. As long as no outliers fall outside the red dotted lines, we can more safely state that no single point overly influenced our regression.
#Cook's distance and residuals vs. leverage plots (shown for reference; not required for logistic regression)
plot(model1, which = 4:5, id.n = 5)
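If numbers are preferred over plots, the same influence measures can be extracted directly from model1; a brief sketch (the 4/n cutoff is a common rule of thumb, not a hard rule):
#Influence measures for each observation in the training data
infl <- data.frame(cooks = cooks.distance(model1),
                   leverage = hatvalues(model1),
                   studentized = rstudent(model1))
#Observations whose Cook's distance exceeds the 4/n rule of thumb
head(infl[infl$cooks > 4 / nrow(infl), ])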
#Logistic curve: actual outcome vs predicted probability, colored by classification
train %>%
  mutate(
    Predicted = predict(model1, type = "response"),
    Misclassified = ifelse(PASSFAIL > .5 & Predicted < .5 |
                             PASSFAIL < .5 & Predicted > .5,
                           "Misclassified", "Correctly Classified")) %>%
  ggplot(aes(Predicted, PASSFAIL)) +
  geom_point(aes(color = Misclassified), alpha = .2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = F) +
  scale_x_continuous(name = "Predicted Probability") +
  scale_y_continuous(name = "Actual Outcome") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank()) +
  ggtitle(label = "Actual vs Predicted with Logistic Curve")
R-squared is the usual statistic for describing the quality of a linear model: a single number stating how much of the total variance is explained by the model. Logistic regression has no such number. Instead of creating a pseudo R-squared, we will use the ROC curve's Area Under the Curve (AUC) and the simple predicted percentages.
Key Point #3: Area Under the Curve and Confusion Matrix percentages are THE key numbers to remember
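That said, for readers who still want a single R-squared-like number, McFadden's pseudo R-squared can be computed from the fitted model's deviances; a one-line sketch using model1:
#McFadden's pseudo R-squared: 1 - residual deviance / null deviance
1 - model1$deviance / model1$null.deviance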
#Convert PASSFAIL to factor for pROC
train$PASSFAIL <- as.character(train$PASSFAIL)
train$PASSFAIL <- as.factor(train$PASSFAIL)
#Plot ROC curve with confidence intervals
rocobj <- plot.roc(train$PASSFAIL, model1$fitted.values,
main="Confidence intervals", percent=TRUE, ci=TRUE, print.auc=TRUE)
ciobj <- ci.se(rocobj, specificities=seq(0, 100, 5), conf.level = .99, boot.n = 100)
plot(ciobj, type="shape", col="#1c61b6AA")
plot(ci(rocobj, of="thresholds", thresholds="best"))
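The AUC shown on the plot can also be pulled out as a plain number, which is convenient when comparing models; a short sketch using the rocobj created above:
#Numeric AUC and its confidence interval from the ROC object
auc(rocobj)
ci.auc(rocobj)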
It appears that the logistic model is fairly good and does not violate any of the model's assumptions. The next step is to run the model on the test dataset, using the coefficients and variables estimated from the train dataset. The two outputs we will compare are the ROC curve/AUC and the confusion matrix percentages.
#Convert PASSFAIL to factor for pROC
test$PASSFAIL <- as.character(test$PASSFAIL)
test$PASSFAIL <- as.factor(test$PASSFAIL)
test.fitted.values <- predict(model1, newdata = test, type = "response")
#Plot ROC curve with confidence intervals
rocobj <- plot.roc(test$PASSFAIL, test.fitted.values,
main="Confidence intervals", percent=TRUE, ci=TRUE, print.auc=TRUE)
ciobj <- ci.se(rocobj, specificities=seq(0, 100, 5), conf.level = .99, boot.n = 100)
plot(ciobj, type="shape", col="#1c61b6AA")
plot(ci(rocobj, of="thresholds", thresholds="best"))
#Confusion Matrix
kable(table(test$PASSFAIL, test.fitted.values > 0.5),
format = "html", align = "c", caption = "Confusion Matrix") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | FALSE | TRUE |
|---|---|---|
| 0 | 24 | 48 |
| 1 | 8 | 603 |
#Confusion Percentages
right1 <- table(test$PASSFAIL, test.fitted.values > 0.5)[1,1]
right2 <- table(test$PASSFAIL, test.fitted.values > 0.5)[2,2]
total <- length(test$PASSFAIL)
Again, quick math shows that the model predicted correctly for (24 + 603) / 683 = 91.8% of the test data, meaning we misclassified 8.2% of the Airmen. With the train and test results being roughly equal in predictive ability, we will rerun the model on the entire dataset. This updates the coefficients and establishes the final formula for future use.
#Final model refit on the entire dataset
final <- step(glm(PASSFAIL ~ ., data = PT.Scores[-1],
                  family = "binomial"), trace = F, direction = "both")
#Final coefficients (the second column is the inverse logit of each coefficient)
Final.Coefficients <- data.frame(Log = final$coefficients,
                                 "Exp of Log" = exp(final$coefficients) / (1 + exp(final$coefficients)))
kable(Final.Coefficients, format = "html", align = "c", caption = "Summary of Coefficients", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "center")
| | Log | Exp.of.Log |
|---|---|---|
| (Intercept) | 16.10 | 1.00 |
| HEIGHT | 0.07 | 0.52 |
| ABD_CIRC | -0.30 | 0.43 |
| AERO_SCORE | -0.03 | 0.49 |
| PUSHUP_COUNT | 0.04 | 0.51 |
| CRUNCH_COUNT | 0.18 | 0.55 |
| SEXM | -5.55 | 0.00 |
| AGE | 0.22 | 0.56 |
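Finally, to actually use the final formula, predicted pass probabilities for new Airmen can be generated with predict(). The record below is entirely made up for illustration; it assumes SEX is coded "M"/"F" and that the columns supplied are the predictors retained in the table above.
#Hypothetical new Airman (illustrative values only)
new.airman <- data.frame(HEIGHT = 70, ABD_CIRC = 34, AERO_SCORE = 750,
                         PUSHUP_COUNT = 45, CRUNCH_COUNT = 50,
                         SEX = "M", AGE = 32)
#Predicted probability that this Airman passes the PFA
predict(final, newdata = new.airman, type = "response")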
What are the two key outputs that describe how good a logistic model is?
Do males have a higher probability of passing their PFA?
Create a model with just Height + Weight as X variables. What are their coefficients?