Binary Logistic Regression in R

Binary Logistic Regression

Logistic regression predicts a categorical outcome (y) from one or more predictor variables (x). Binary logistic regression models an outcome that can take only two values (e.g., 0 or 1, yes or no, sick or not sick).
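For reference, the model behind the steps below can be written in its standard form. The symbols are notation chosen here, not taken from the tutorial's code: p is the probability that y = 1, x_1, ..., x_k are the predictors, and the betas are the coefficients that glm() estimates.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The coefficients are on the log-odds scale, so exponentiating a coefficient gives an odds ratio.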

Here are the steps for running a binary logistic regression:

Set your working directory/folder.

setwd("C:/MyRData/Binary_Logistic_Regression")

Load the {tidyverse} package, which provides the pipe (%>%) and glimpse() used below.

# Load package
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.2     v purrr   1.0.1
## v tibble  3.2.1     v dplyr   1.1.2
## v tidyr   1.3.0     v stringr 1.5.0
## v readr   2.1.3     v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Load the school admissions dataset located in your working directory/folder.

# Load data
mydata <- read.csv("school_admission_binary.csv", fileEncoding = "UTF-8-BOM")
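If the CSV is not yet in the working directory, a small helper can fall back to a download. This is only a sketch: load_admissions() is a hypothetical name, and it assumes that the file linked in the note at the end of this tutorial contains the same data as the local file.

```r
# Hypothetical helper: read the local CSV if it exists, otherwise read from a URL.
# Assumption: the URL (taken from the note at the end of the tutorial) holds the same data.
load_admissions <- function(path = "school_admission_binary.csv",
                            url = "https://stats.idre.ucla.edu/stat/data/binary.csv") {
  if (file.exists(path)) {
    # Same call as in the tutorial; "UTF-8-BOM" drops a byte-order mark if one is present
    read.csv(path, fileEncoding = "UTF-8-BOM")
  } else {
    message("Local file not found; reading from ", url)
    read.csv(url)
  }
}

# Usage (commented out so nothing is read or downloaded when the code is sourced):
# mydata <- load_admissions()
```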

View the dataset structure using glimpse(), which lists every column with its type and first few values.
The binary dependent variable is named ‘admit’.

mydata %>% 
  glimpse()

## Rows: 400
## Columns: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ gre   <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 440, 760,~
## $ gpa   <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3.92, 4.00~
## $ rank  <int> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2, 1, 3, 2~

Convert the rank variable from integer to factor. This tells R to treat rank as a categorical variable rather than a number.

# Convert rank variable to a factor
mydata$rank <- factor(mydata$rank)

Verify that the rank variable was converted to a factor.

# View variable types
sapply(mydata, class)

##     admit       gre       gpa      rank 
## "integer" "integer" "numeric"  "factor"
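To see what the factor conversion changes for the model, the sketch below builds the indicator (dummy) columns R creates for a factor. It uses a short stand-in vector holding the first few rank values shown by glimpse(); the names rank_int, rank_fct and rank_ref4 are illustrative only.

```r
# Stand-in vector: the first few rank values shown in the glimpse() output
rank_int <- c(3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1)
rank_fct <- factor(rank_int)

# Levels are sorted; the first level ("1") becomes the reference category
levels(rank_fct)

# Indicator columns built for a factor predictor: one per non-reference level
dummies <- model.matrix(~ rank_fct)
colnames(dummies)

# Optional: pick a different reference category, here rank 4
rank_ref4 <- relevel(rank_fct, ref = "4")
levels(rank_ref4)
```

This is why the regression output later reports rank2, rank3 and rank4 but no rank1: each coefficient compares that rank with rank 1.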

Make sure there are no empty (zero-count) cells for any combination of the dependent variable (admit) and the categorical predictor (rank) by cross-tabulating admit by rank.

# Crosstab admit by rank
xtabs(~admit + rank, data = mydata)

##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12
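The same check can be done programmatically. The sketch below re-enters the counts printed above as a matrix so that it runs on its own; with the data loaded, tab could instead be created with xtabs(~admit + rank, data = mydata).

```r
# Counts copied from the cross-tabulation printed above (filled column by column)
tab <- matrix(c(28, 33, 97, 54, 93, 28, 55, 12), nrow = 2,
              dimnames = list(admit = c("0", "1"), rank = c("1", "2", "3", "4")))

# TRUE here would flag an empty cell
any(tab == 0)

# Share admitted and not admitted within each rank (column proportions)
round(prop.table(tab, margin = 2), 2)
```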

Create your logistic regression model. The code below estimates a logistic regression model using the glm() (generalized linear model) function from the {stats} package, which is loaded with base R.

# Create model
my_binomial_logit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

# Produce output from regression
summary(my_binomial_logit)

## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
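One way to read the two deviance lines is as a likelihood ratio test of the whole model against an intercept-only model. The sketch below uses the numbers printed above; the variable names are illustrative.

```r
# Values copied from the summary() output above
null_dev  <- 499.98; null_df  <- 399
resid_dev <- 458.52; resid_df <- 394

lr_stat <- null_dev - resid_dev      # drop in deviance due to the predictors
lr_df   <- null_df - resid_df        # number of estimated slopes
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p = p_value)

# With the fitted object, the same inputs are available as:
# with(my_binomial_logit, null.deviance - deviance)
# with(my_binomial_logit, df.null - df.residual)
```

A small p-value suggests the model with predictors fits better than the intercept-only model.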

Plot the odds ratios from the logistic regression model.

library(sjPlot)

# simple forest plot
plot_model(my_binomial_logit, title = "Odds Ratios For School Admissions Logistic Regression Model", show.values = TRUE)
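The odds ratios are the exponentiated model coefficients. As a minimal sketch, the coefficients from the summary() output are re-entered by hand here so the example runs on its own.

```r
# Coefficients copied from the summary() output (log-odds scale)
log_odds <- c(gre = 0.002264, gpa = 0.804038,
              rank2 = -0.675443, rank3 = -1.340204, rank4 = -1.551464)

# Exponentiate to move from log-odds to odds ratios
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)

# With the fitted object: exp(coef(my_binomial_logit))
```

An odds ratio above 1 means higher odds of admission as the predictor increases; below 1 means lower odds relative to the reference category (rank 1).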

Create a formatted table of the logistic regression model.

library(gtsummary)

my_binomial_logit %>% 
  tbl_regression(exponentiate = TRUE) %>% 
  bold_labels() %>% 
  bold_p(t = 0.1) %>%   # bold p-values below 0.10
  # Convert to a gt table and add a title and subtitle
  as_gt() %>%
  gt::tab_header(title = "Table 1. School Admission Logistic Regression Model",
                 subtitle = "Data source: Admissions Office")

Table 1. School Admission Logistic Regression Model
Data source: Admissions Office
Characteristic	OR¹	95% CI¹	p-value
gre	1.00	1.00, 1.00	0.038
gpa	2.23	1.17, 4.32	0.015
rank
1	—	—
2	0.51	0.27, 0.94	0.033
3	0.26	0.13, 0.51	<0.001
4	0.21	0.09, 0.47	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval
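Odds ratios can be hard to picture, so it may help to turn the model into predicted probabilities. The sketch below uses the coefficients from the summary() output and a hypothetical applicant (gre = 600, gpa = 3.5, values chosen only for illustration) at each of the four ranks.

```r
# Coefficients copied from the summary() output (log-odds scale)
b <- c(intercept = -3.989979, gre = 0.002264, gpa = 0.804038,
       rank2 = -0.675443, rank3 = -1.340204, rank4 = -1.551464)

# Hypothetical applicant profile (illustrative values, not from the tutorial)
gre <- 600
gpa <- 3.5

# Rank 1 is the reference category, so its shift on the log-odds scale is 0
rank_shift <- c(rank1 = 0, b[c("rank2", "rank3", "rank4")])

# Linear predictor (log-odds of admission) at each rank
eta <- b[["intercept"]] + b[["gre"]] * gre + b[["gpa"]] * gpa + rank_shift

# plogis() is the inverse logit: 1 / (1 + exp(-eta))
prob <- plogis(eta)
round(prob, 3)

# With the fitted object, something like this should give the same result:
# predict(my_binomial_logit,
#         newdata = data.frame(gre = 600, gpa = 3.5, rank = factor(1:4)),
#         type = "response")
```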

Finally, let's use the {report} package to automatically produce a written report of the model.

Load the {report} package.

# Load package
library(report)

## Warning: package 'report' was built under R version 4.1.3

Produce a written report of the logistic regression model.

# Produce a written report of the model
report(my_binomial_logit)

## We fitted a logistic model (estimated using ML) to predict admit with gre, gpa
## and rank (formula: admit ~ gre + gpa + rank). The model's explanatory power is
## weak (Tjur's R2 = 0.10). The model's intercept, corresponding to gre = 0, gpa =
## 0 and rank = 1, is at -3.99 (95% CI [-6.27, -1.79], p < .001). Within this
## model:
## 
##   - The effect of gre is statistically significant and positive (beta = 2.26e-03,
## 95% CI [1.38e-04, 4.44e-03], p = 0.038; Std. beta = 0.26, 95% CI [0.02, 0.51])
##   - The effect of gpa is statistically significant and positive (beta = 0.80, 95%
## CI [0.16, 1.46], p = 0.015; Std. beta = 0.31, 95% CI [0.06, 0.56])
##   - The effect of rank [2] is statistically significant and negative (beta =
## -0.68, 95% CI [-1.30, -0.06], p = 0.033; Std. beta = -0.68, 95% CI [-1.30,
## -0.06])
##   - The effect of rank [3] is statistically significant and negative (beta =
## -1.34, 95% CI [-2.03, -0.67], p < .001; Std. beta = -1.34, 95% CI [-2.03,
## -0.67])
##   - The effect of rank [4] is statistically significant and negative (beta =
## -1.55, 95% CI [-2.40, -0.75], p < .001; Std. beta = -1.55, 95% CI [-2.40,
## -0.75])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald z-distribution approximation.
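The report mentions Tjur's R2, which is the mean fitted probability among the observed 1s minus the mean among the observed 0s. The sketch below shows that calculation on simulated stand-in data so that it runs without the admissions file; tjur_r2() is a hypothetical helper name, not a function from {report}.

```r
# Simulated stand-in data (hypothetical; not the admissions data)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = "binomial")

# Tjur's R2: mean fitted probability among the 1s minus the mean among the 0s
tjur_r2 <- function(model) {
  p_hat <- fitted(model)   # fitted probabilities
  obs   <- model$y         # observed 0/1 outcome stored by glm()
  mean(p_hat[obs == 1]) - mean(p_hat[obs == 0])
}

tjur_r2(fit)

# For the tutorial's model: tjur_r2(my_binomial_logit)
```

Values near 0 mean the model barely separates the two groups, which is consistent with the report describing an R2 of 0.10 as weak.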

If you are new to the R programming language, don’t give up. Your R skills will get better with time.

Note. The data come from the ‘binary.csv’ file, which can be downloaded from https://stats.idre.ucla.edu/stat/data/binary.csv; this tutorial reads a local copy saved as ‘school_admission_binary.csv’.

Ramon Rodriguez-Santana, MBA, MPH

2023-10-24