Logistic Regression

In logistic regression, we fit a regression curve, \(y = f(x)\), where y is a categorical variable such as True/False or 0/1. The predictors can be continuous, categorical, or a mix of both. The underlying technique is closely related to linear regression: the model is still linear in the predictors, but on the log-odds scale rather than on the response itself.

There are three different types of logistic regression models:

Binary Logistic Regression: Binary is the base case, where the dependent variable is a single binary response (True/False or 0/1)
Multinomial Logistic Regression: Multinomial regression is an extension of binary logistic regression in which the response variable has three or more unordered categories; it is used to handle multi-class classification problems
Ordinal Logistic Regression: In ordinal logistic regression, the response is an ordinal value where the order of the values matters, such as a review of a book from 1 to 5 stars (a sketch of how the latter two types are fit in R follows this list)
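
This tutorial covers the binary case, but as a minimal sketch, the other two types can be fit in R with nnet::multinom() and MASS::polr(). The data frame df and the columns category, stars, x1, and x2 below are hypothetical placeholders, not part of this tutorial's data:

library(nnet)  # provides multinom()
library(MASS)  # provides polr()

# Multinomial: response with three or more unordered classes
multi_m <- multinom(category ~ x1 + x2, data = df)

# Ordinal: response must be an ordered factor, e.g. a 1-5 star rating
df$stars <- factor(df$stars, ordered = TRUE)
ord_m <- polr(stars ~ x1 + x2, data = df, method = "logistic")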

Logistic regression does not directly assign a response to True/False or 0/1; rather, it estimates the probability (between 0 and 1) that the response falls in one class or the other.
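
Concretely, the model passes a linear combination of the predictors through the logistic (sigmoid) function, which is equivalent to modeling the log-odds as a linear function of the predictors:

\[
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\quad\Longleftrightarrow\quad
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]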

For this example, we will see how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is binary.

library(skimr)
library(dplyr)
library(tidyr)
library(ggplot2)

Load and Visualize Data

Skim the data to check for any NA values. In this case there are none.

admit <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
skim(admit)
## Data summary
##   Name                    admit
##   Number of rows          400
##   Number of columns       4
##   Column type frequency:
##     numeric               4
##   Group variables         None
## 
## Variable type: numeric
## 
## skim_variable n_missing complete_rate   mean     sd     p0    p25   p50    p75 p100 hist
## admit                 0             1   0.32   0.47   0.00   0.00   0.0   1.00    1 ▇▁▁▁▃
## gre                   0             1 587.70 115.52 220.00 520.00 580.0 660.00  800 ▁▂▇▇▅
## gpa                   0             1   3.39   0.38   2.26   3.13   3.4   3.67    4 ▁▃▆▇▆
## rank                  0             1   2.48   0.94   1.00   2.00   2.0   3.00    4 ▃▇▁▆▃

admit %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free", ncol = 3) +
  geom_histogram() +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To treat rank as a categorical variable, we will convert it to a factor.

admit$rank <- factor(admit$rank)
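
A quick check confirms the factor levels and R's default treatment coding, in which the first level (rank 1) serves as the baseline; this is why the model summary below reports coefficients for rank2, rank3, and rank4 only:

levels(admit$rank)     # "1" "2" "3" "4"
contrasts(admit$rank)  # treatment contrasts: rank 1 is the reference level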

Build Model

To build a logistic regression model, we'll use the glm() function. Logistic regression belongs to the class of models called Generalized Linear Models (GLMs), which glm() fits given a model formula and a family; here the family is binomial with a logit link.

admit_m <- glm(admit ~ ., family=binomial(link='logit'), data = admit)
summary(admit_m)
## 
## Call:
## glm(formula = admit ~ ., family = binomial(link = "logit"), data = admit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
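
With the model fit, predicted probabilities can be pulled out with predict() and type = "response". Turning those probabilities into hard 0/1 predictions requires choosing a cutoff; the 0.5 below is an illustrative assumption, not something the model dictates:

# Predicted admission probabilities for the first few applicants
head(predict(admit_m, type = "response"))

# Classify with an assumed 0.5 cutoff and compare to the actual outcomes
pred_class <- ifelse(predict(admit_m, type = "response") > 0.5, 1, 0)
table(predicted = pred_class, actual = admit$admit)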

Interpreting Model Results

To evaluate the performance of a logistic regression model, we consider the following from the model summary:

Coefficients: As with linear regression, these are the estimated beta coefficients and their statistical significance (a p-value below 0.05 is the usual threshold). Unlike linear regression, they are on the log-odds scale; see the sketch after this list for converting them to odds ratios
AIC (Akaike Information Criterion): Where linear regression uses \(R^2\) to measure overall model fit, logistic regression uses AIC. AIC penalizes the number of parameters, so among models fit to the same data, a lower AIC means a better model
Deviance: Deviance is also a measure of goodness of fit; higher numbers indicate worse fit. Deviance is reported in two forms, null deviance and residual deviance

Null Deviance shows how well the response variable is predicted with nothing but an intercept.
Residual Deviance shows how well the response variable is predicted with the independent variables.
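
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret. A minimal sketch using the fitted model:

# Odds ratios: e.g. each additional GPA point multiplies the odds of
# admission by about exp(0.804) ≈ 2.23, holding the other predictors fixed
exp(coef(admit_m))

# Profile-likelihood 95% confidence intervals, on the odds-ratio scale
exp(confint(admit_m))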

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better.
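
The gap can also be tested formally: the drop in deviance is approximately chi-squared distributed, with degrees of freedom equal to the number of parameters added. A small p-value indicates the model with predictors fits significantly better than the intercept-only model:

# p-value for the deviance drop: 499.98 - 458.52 = 41.46 on 399 - 394 = 5 df
with(admit_m, pchisq(null.deviance - deviance,
                     df.null - df.residual, lower.tail = FALSE))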