Logistic regression predicts a categorical outcome variable (y) from one or more predictor variables (x). Binary logistic regression models an outcome that can take only two possible values (e.g., 0 or 1, yes or no, sick or not sick).
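For reference, the standard form of the binary logistic model can be written as below. The symbols are generic notation introduced here, not taken from the original text: p is the probability that y = 1, and the betas are the coefficients estimated later by `glm()`.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left-hand side is the log-odds (logit), which is why the coefficients are reported on the log-odds scale and are exponentiated to obtain odds ratios.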
Here are the steps for running a binary logistic regression:
# Set the working directory (adjust the path for your machine)
setwd("C:/MyRData/Binary_Logistic_Regression")
# Load package
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.2 v purrr 1.0.1
## v tibble 3.2.1 v dplyr 1.1.2
## v tidyr 1.3.0 v stringr 1.5.0
## v readr 2.1.3 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Load data
mydata <- read.csv("school_admission_binary.csv", fileEncoding = "UTF-8-BOM")
mydata %>%
  glimpse()
## Rows: 400
## Columns: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ gre <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 440, 760,~
## $ gpa <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3.92, 4.00~
## $ rank <int> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2, 1, 3, 2~
# Convert rank variable to a factor
mydata$rank <- factor(mydata$rank)
# view data variable types
sapply(mydata, class)
## admit gre gpa rank
## "integer" "integer" "numeric" "factor"
# Crosstab admit by rank
xtabs(~admit + rank, data = mydata)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
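As a rough check before fitting the model, the crosstab can be turned into admission proportions and unadjusted odds ratios by rank. The sketch below retypes the counts from the output above so that it runs on its own. The object names (`counts`, `prop_admitted`, `odds`, `crude_or`) are illustrative and not part of the original script.

```r
# Counts copied from the xtabs() output (rows: admit = 0/1, columns: rank 1 to 4)
counts <- matrix(c(28, 97, 93, 55,
                   33, 54, 28, 12),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(admit = c("0", "1"),
                                 rank  = c("1", "2", "3", "4")))

# Proportion admitted within each rank
prop_admitted <- counts["1", ] / colSums(counts)

# Odds of admission within each rank: admitted / not admitted
odds <- counts["1", ] / counts["0", ]

# Unadjusted (crude) odds ratios, relative to rank 1
crude_or <- odds / odds[["1"]]

round(prop_admitted, 2)
round(crude_or, 2)
```

These crude odds ratios ignore gre and gpa, so they will not match the adjusted odds ratios from the regression model exactly.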
# Create model
my_binomial_logit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
# Produce output from regression
summary(my_binomial_logit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
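The null and residual deviance lines in the summary can be combined into an overall likelihood-ratio test of the model against an intercept-only model. This is a minimal sketch that retypes the numbers from the output above. The variable names are illustrative, and the AIC check assumes ungrouped 0/1 data, as is the case here.

```r
# Values copied from the summary() output
null_deviance     <- 499.98
null_df           <- 399
residual_deviance <- 458.52
residual_df       <- 394

# Likelihood-ratio statistic: drop in deviance from adding the predictors
lr_stat <- null_deviance - residual_deviance
lr_df   <- null_df - residual_df

# Upper-tail chi-squared p-value for the overall test
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# For ungrouped binary data, AIC = residual deviance + 2 * number of coefficients
n_coef <- 6
aic    <- residual_deviance + 2 * n_coef

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value, aic = aic)
```

With the fitted object available, the same quantities can be read directly from its components, for example `my_binomial_logit$null.deviance` and `my_binomial_logit$deviance`.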
library(sjPlot)
# simple forest plot
plot_model(my_binomial_logit, title = "Odds Ratios For School Admissions Logistic Regression Model", show.values = TRUE)
library(gtsummary)
## #Uighur
my_binomial_logit %>%
  tbl_regression(exponentiate = TRUE) %>%
  bold_labels() %>%
  bold_p(t = .1) %>%
  # add table captions
  as_gt() %>%
  gt::tab_header(title = "Table 1. School Admission Logistic Regression Model",
                 subtitle = "Data source: Admissions Office")
Table 1. School Admission Logistic Regression Model
Data source: Admissions Office

| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| gre | 1.00 | 1.00, 1.00 | 0.038 |
| gpa | 2.23 | 1.17, 4.32 | 0.015 |
| rank | | | |
| 1 | — | — | |
| 2 | 0.51 | 0.27, 0.94 | 0.033 |
| 3 | 0.26 | 0.13, 0.51 | <0.001 |
| 4 | 0.21 | 0.09, 0.47 | <0.001 |

¹ OR = Odds Ratio, CI = Confidence Interval
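The odds ratios in the table are the exponentiated coefficients from the `summary()` output. The sketch below reproduces them from the reported estimates so that it runs without the data file. The names `coefs`, `odds_ratios`, and `or_gre_100` are illustrative.

```r
# Log-odds estimates copied from the summary() output
coefs <- c(gre   =  0.002264,
           gpa   =  0.804038,
           rank2 = -0.675443,
           rank3 = -1.340204,
           rank4 = -1.551464)

# Exponentiate to move from the log-odds scale to odds ratios
odds_ratios <- exp(coefs)
round(odds_ratios, 2)

# The per-point GRE odds ratio rounds to 1.00; a 100-point change is easier to read
or_gre_100 <- exp(100 * coefs[["gre"]])
round(or_gre_100, 2)
```

With the fitted object in memory, `exp(coef(my_binomial_logit))` gives the same odds ratios, plus the exponentiated intercept.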
Use the {report} package
# Load package
library(report)
## Warning: package 'report' was built under R version 4.1.3
Produce a written report of the logistic regression model
# Produce a written report of the model
report(my_binomial_logit)
## We fitted a logistic model (estimated using ML) to predict admit with gre, gpa
## and rank (formula: admit ~ gre + gpa + rank). The model's explanatory power is
## weak (Tjur's R2 = 0.10). The model's intercept, corresponding to gre = 0, gpa =
## 0 and rank = 1, is at -3.99 (95% CI [-6.27, -1.79], p < .001). Within this
## model:
##
## - The effect of gre is statistically significant and positive (beta = 2.26e-03,
## 95% CI [1.38e-04, 4.44e-03], p = 0.038; Std. beta = 0.26, 95% CI [0.02, 0.51])
## - The effect of gpa is statistically significant and positive (beta = 0.80, 95%
## CI [0.16, 1.46], p = 0.015; Std. beta = 0.31, 95% CI [0.06, 0.56])
## - The effect of rank [2] is statistically significant and negative (beta =
## -0.68, 95% CI [-1.30, -0.06], p = 0.033; Std. beta = -0.68, 95% CI [-1.30,
## -0.06])
## - The effect of rank [3] is statistically significant and negative (beta =
## -1.34, 95% CI [-2.03, -0.67], p < .001; Std. beta = -1.34, 95% CI [-2.03,
## -0.67])
## - The effect of rank [4] is statistically significant and negative (beta =
## -1.55, 95% CI [-2.40, -0.75], p < .001; Std. beta = -1.55, 95% CI [-2.40,
## -0.75])
##
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald z-distribution approximation.
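The report describes effects on the log-odds scale. To see what they mean as a probability, the coefficients can be applied to a single applicant by hand. The applicant below (gre = 600, gpa = 3.5, rank 2) is hypothetical and chosen only for illustration, and the coefficients are retyped from the model summary so the sketch is self-contained.

```r
# Estimates copied from the summary() output (rank 1 is the reference level)
coefs <- c(intercept = -3.989979,
           gre       =  0.002264,
           gpa       =  0.804038,
           rank2     = -0.675443)

# Hypothetical applicant: gre = 600, gpa = 3.5, from a rank 2 school
log_odds <- coefs[["intercept"]] +
  coefs[["gre"]] * 600 +
  coefs[["gpa"]] * 3.5 +
  coefs[["rank2"]]

# plogis() is the inverse logit: 1 / (1 + exp(-x))
prob_admit <- plogis(log_odds)
round(prob_admit, 2)
```

With the fitted model in memory, `predict(my_binomial_logit, newdata = data.frame(gre = 600, gpa = 3.5, rank = factor(2, levels = 1:4)), type = "response")` should give the same probability up to rounding of the retyped coefficients.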
If you are new to the R programming language, don’t give up. Your R skills will improve with practice.
Note: The data set is available as ‘binary.csv’ at https://stats.idre.ucla.edu/stat/data/binary.csv. The code above reads it from a local copy named ‘school_admission_binary.csv’.