`logiSense` - Interpreting Logistic Regression Results

Challenges and Solutions for Non-Statisticians

Anna (Jingxuan) He, Belina Jang, Vanessa Liao, Tina (Zhaoyu) Tan, & Victoria Truong

December 2, 2024

github.com/BelinaJang/logiSense

logiSense Logo

Introduction
Logistic Regression Review
Motivation
Purpose
logiSense Overview & Example
Existing Packages
Strengths
Challenges & Remaining Work
Future Directions
Contributions

Introduction

Logistic regression is widely used to evaluate the strength of association between a binary outcome and one or more independent variables (categorical or continuous).
Logistic regression models can be extended to include interaction terms.

Logistic Regression Formula

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{I} \beta_i X_i \] Where:

\(\text{logit}(p)\) is the log-odds of the probability \(p\)
\(p\) is the probability of the outcome occurring
\(\beta_0\) is the intercept
\(\beta_i\) is the coefficient of predictor \(X_i\)
\(X_i\) is the \(i\)-th predictor variable \((i = 1, 2, \dots, I)\)
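
To make the log-odds scale concrete, the short R sketch below converts between probability, odds, and log-odds with base R's `qlogis()` (the logit) and `plogis()` (its inverse). The coefficient values `beta0` and `beta1` are made up for illustration and do not come from any model in this presentation.

```r
# Probability -> odds -> log-odds
p        <- 0.2
odds     <- p / (1 - p)      # 0.25
log_odds <- log(odds)        # identical to qlogis(p)

# Hypothetical coefficients (illustration only)
beta0 <- -2
beta1 <- 0.5
x     <- 3

eta   <- beta0 + beta1 * x   # linear predictor on the log-odds scale: -0.5
p_hat <- plogis(eta)         # back to a probability: 1 / (1 + exp(0.5))
```

The model is linear only on the log-odds scale, which is why a coefficient cannot be read directly as a change in probability.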

Logistic Regression Formula (with One Two-Way Interaction)

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{I} \beta_i X_i \color{magenta}{+ \beta_{ij}(X_i \cdot X_j)} \] Where:

\(\text{logit}(p)\) is the log-odds of the probability \(p\)
\(p\) is the probability of the outcome occurring
\(\beta_0\) is the intercept
\(\beta_i\) is the coefficient of predictor \(X_i\)
\(X_i\) is the \(i\)-th predictor variable \((i = 1, 2, \dots, I)\)

For two-way interaction:

\(X_i, X_j\) are the predictor variables \((i \neq j)\)
\(\beta_{ij}\) is the coefficient for the interaction term between \(X_i\) and \(X_j\)
\(X_i \cdot X_j\) is the interaction term of predictors \(X_i\) and \(X_j\)
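
One consequence of the interaction term, written out as a short derivation in the notation above: the effect of \(X_i\) is no longer a single odds ratio but depends on the value of \(X_j\). Holding all other predictors fixed, increasing \(X_i\) by one unit changes the log-odds by

\[ \Delta\,\text{logit}(p) = \beta_i + \beta_{ij} X_j \quad\Rightarrow\quad \text{OR}_{X_i \mid X_j} = e^{\beta_i + \beta_{ij} X_j} \]

When \(X_j\) is a dummy variable for one level of a categorical predictor, this gives \(e^{\beta_i}\) at the reference level (\(X_j = 0\)) and \(e^{\beta_i + \beta_{ij}}\) at that level (\(X_j = 1\)).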

Motivation

Logistic regression output is challenging to interpret because the coefficients are on the log-odds scale.
We conducted a literature review and found a paper by Rajeev Kumar Malhotra (2020) that highlighted critical errors in interpreting the results of logistic regression.

Several occurrences of the misinterpretation of odds ratios (OR):

→ Treated ORs as direct probabilities or relative risk (RR)
→ Can lead to exaggerated conclusions about the strength of associations

Odds Ratio Misinterpreted as Relative Risk

Example 1: Rampure et al. (2019)

Misinterpretation:

“People with a high level of craving have 1.8 times the chance of relapse compared to low craving.”

The statement misinterpreted the OR (1.78, 95% CI: 1.25-2.54) as the RR.

Correct Interpretation:

“The odds of alcohol relapse among people with a high level of craving are 1.8 times the odds among people with a low level of craving.”

Odds Ratio Misinterpreted as Relative Risk:

Example 2: Madasu et al. (2019)

Misinterpretation:

“Female sex was found to be two times more associated with anxiety disorders than male sex.”

Correct Interpretation:

“The odds of having anxiety disorder are twice as high for females compared to males.”

Key Takeaway: Misinterpretation of Odds Ratios is More Pervasive than Expected

Interpreting odds ratios as probabilities or risks can lead to confusion and to overestimating the strength of associations.
Unless the outcome is rare, an OR lies further from 1 than the corresponding RR, so readers perceive the effect as stronger than it truly is.
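
A small worked example may help show the size of the gap between the two measures. The counts below are hypothetical (they are not taken from Rampure et al., Madasu et al., or any other cited study) and are chosen only so the arithmetic is easy to follow.

```r
# Hypothetical 2x2 table (illustration only)
events_exposed   <- 40; n_exposed   <- 100
events_unexposed <- 25; n_unexposed <- 100

# Risks (probabilities) in each group
risk_exposed   <- events_exposed / n_exposed        # 0.40
risk_unexposed <- events_unexposed / n_unexposed    # 0.25

# Relative risk: ratio of probabilities
relative_risk <- risk_exposed / risk_unexposed      # 1.6

# Odds ratio: ratio of odds
odds_exposed   <- risk_exposed / (1 - risk_exposed)
odds_unexposed <- risk_unexposed / (1 - risk_unexposed)
odds_ratio     <- odds_exposed / odds_unexposed     # 2.0
```

Reading the odds ratio of 2.0 as "twice the chance" would overstate the effect here, because the risk is only 1.6 times as high.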

Purpose

To address this, we developed an R package designed to help non-statisticians accurately interpret the logistic regression models they create, which will help to close the gap in understanding.

Our package is called…

`logiSense`

logiSense is an R package that helps users make sense of logistic regression results by turning model output into plain-language interpretations.

We have two functions:

Function 1: logis

Provides logistic regression interpretation for models without interaction terms.

Function 2: logint

Provides logistic regression interpretation for models with one interaction term.

Arguments

Table 1. Description of parameters for logis.

Parameter	Definition	Type
formula	Description of the model to be fitted	formula
data	Data set containing the model variables	data frame
variable_interest	Name of the variable to interpret	character
variable_type	Type of the variable of interest ("continuous" or "categorical")	character

Table 2. Description of parameters for logint.

Parameter	Definition	Type
formula	Description of the model to be fitted	formula
data	Data set containing the model variables	data frame
categorical_var	Name of the categorical variable in the interaction	character
continuous_var	Name of the continuous variable in the interaction	character

Using logiSense with Stroke Prediction Dataset

Used a dataset from Kaggle (McKinsey & Company’s healthcare hackathon).

Table 3. Description of variables from the stroke prediction dataset.

Variable Name	Description and Categories	Type
ID	Unique identifier for each patient	Numeric/Integer
Gender	Gender of the patient: “Male”, “Female”, or “Other”	Categorical
Age	Age of the patient	Numeric
Hypertension	Whether the patient has hypertension: 0 = No, 1 = Yes	Binary
Heart Disease	Whether the patient has any heart disease: 0 = No, 1 = Yes	Binary
Ever Married	Marital status of the patient: “No” or “Yes”	Categorical
Work Type	Employment type of the patient: “Children”, “Govt_job”, “Never_worked”, “Private”, or “Self-employed”	Categorical
Residence Type	Type of residence: “Rural” or “Urban”	Categorical
Avg Glucose Level	Average glucose level in blood	Numeric
BMI	Body mass index	Numeric
Smoking Status	Smoking status of the patient: “Formerly smoked”, “Never smoked”, “Smokes”, or “Unknown”	Categorical
Stroke	Whether the patient had a stroke: 0 = No, 1 = Yes	Binary

How to Use logiSense: Stroke Prediction Dataset

For logis (no interaction term):

Continuous variable:

library(logiSense)
library(here)

# Load the stroke data from the project's data/ folder
test_data <- read.csv(here("data/test_data.csv"))

# Interpret the continuous predictor "age", adjusting for the other covariates
result_con <- logis(formula = stroke ~ gender + age + hypertension + heart_disease + avg_glucose_level + smoking_status,
                    data = test_data,
                    variable_interest = "age",
                    variable_type = "continuous")

Outcome variable:  stroke 
For each one-unit increase in 'age,' the odds of 'stroke' are multiplied by 1.072 (95% CI: 1.061 - 1.083). This result is statistically significant at 5% significance level (p-value: 2.346e-40).

Categorical variable:

result_cat <- logis(formula = stroke ~ gender + age + hypertension + heart_disease + avg_glucose_level + smoking_status,
                    data = test_data,
                    variable_interest = "smoking_status",
                    variable_type = "categorical")

Outcome variable:  stroke 
Reference level:  formerly smoked 
Compared to the reference level 'formerly smoked' of 'smoking_status,' the odds of 'stroke' for the level 'never smoked' are multiplied by 0.8164 (95% CI: 0.5801 - 1.153). This result is not statistically significant at 5% significance level (p-value: 0.2466). 
Compared to the reference level 'formerly smoked' of 'smoking_status,' the odds of 'stroke' for the level 'smokes' are multiplied by 1.119 (95% CI: 0.7314 - 1.697). This result is not statistically significant at 5% significance level (p-value: 0.5985). 
Compared to the reference level 'formerly smoked' of 'smoking_status,' the odds of 'stroke' for the level 'Unknown' are multiplied by 0.9589 (95% CI: 0.6375 - 1.433). This result is not statistically significant at 5% significance level (p-value: 0.8388).
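
For readers who want to see where numbers like these come from, here is a minimal base-R sketch that reproduces the same kind of sentence from a `glm()` fit. It is not the logiSense implementation: the data are simulated, and the confidence interval is a Wald interval, which is an assumption because the slides do not state which interval logiSense reports.

```r
set.seed(1)

# Simulated data (illustration only, not the stroke dataset)
n   <- 500
sim <- data.frame(age = round(runif(n, 20, 80)))
sim$stroke <- rbinom(n, 1, plogis(-5 + 0.07 * sim$age))

fit <- glm(stroke ~ age, data = sim, family = binomial)

# Coefficient table row for 'age' (log-odds scale)
est <- coef(summary(fit))["age", ]

# Exponentiate to move from log-odds to the odds-ratio scale
or      <- exp(est[["Estimate"]])
ci      <- exp(est[["Estimate"]] + c(-1, 1) * qnorm(0.975) * est[["Std. Error"]])
p_value <- est[["Pr(>|z|)"]]

cat(sprintf(
  "For each one-unit increase in 'age', the odds of 'stroke' are multiplied by %.3f (95%% CI: %.3f - %.3f; p-value: %.3g).\n",
  or, ci[1], ci[2], p_value
))
```

The key step is the exponentiation: both the estimate and the interval limits are computed on the log-odds scale and only then converted to odds ratios.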

How to Use logiSense: Stroke Prediction Dataset

For logint (two-way interaction):

Continuous \(\times\) Categorical:

logint(formula = stroke ~ work_type * age,
       data = test_data,
       continuous_var = "age",
       categorical_var = "work_type")

The odds ratio of 'stroke' for increasing age by one unit in children = 1.04.
The odds ratio of 'stroke' for increasing age by one unit in Govt_job = 1.087.
The odds ratio of 'stroke' for increasing age by one unit in Never_worked = 1.
The odds ratio of 'stroke' for increasing age by one unit in Private = 1.087.
The odds ratio of 'stroke' for increasing age by one unit in Self-employed = 1.063.


For an observation with age=value, the odds ratio of 'stroke' for Govt_job vs children (reference level) is e^(-1.73 + 0.04404*(value)).
For an observation with age=value, the odds ratio of 'stroke' for Never_worked vs children (reference level) is e^(-9.444 + -0.03938*(value)).
For an observation with age=value, the odds ratio of 'stroke' for Private vs children (reference level) is e^(-1.561 + 0.04413*(value)).
For an observation with age=value, the odds ratio of 'stroke' for Self-employed vs children (reference level) is e^(-0.3749 + 0.02177*(value)).
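
The formulas in this output can be evaluated at any age of interest. The sketch below does so in R with the coefficients copied from the output above; `or_vs_children()` is a hypothetical helper written for this illustration and is not a logiSense function.

```r
# Coefficients copied from the logint() output above
# (log-odds scale, reference level: children)
b_level <- c("Govt_job" = -1.73, "Never_worked" = -9.444,
             "Private" = -1.561, "Self-employed" = -0.3749)
b_inter <- c("Govt_job" = 0.04404, "Never_worked" = -0.03938,
             "Private" = 0.04413, "Self-employed" = 0.02177)

# Hypothetical helper: odds ratio of stroke for a work type vs. children
# at a given age, i.e. e^(level coefficient + interaction coefficient * age)
or_vs_children <- function(level, age) {
  exp(b_level[[level]] + b_inter[[level]] * age)
}

or_vs_children("Govt_job", age = 40)
or_vs_children("Govt_job", age = 70)

# exp(interaction coefficient) is the factor by which the per-year odds ratio
# for age in each work type differs from the per-year odds ratio in children
exp(b_inter)
```

Because the interaction coefficient for Govt_job is positive, the odds ratio versus children grows with age, which is why the output cannot be summarized as a single number.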

Existing Packages

Overview of Similar Packages

modelsummary:

Presents coefficients in human-readable formats.
Can report logistic regression coefficients as odds ratios.

sjPlot:

Displays odds ratios, significance stars, and confidence intervals.
Outputs publication-ready tables and visualizations.

gtsummary:

Converts logistic regression coefficients to odds ratios.
Customizable labels, formats, and output styles.

performance:

Assesses model performance and diagnostics.

Strengths: What Sets Our Package Apart

Gap in Existing Packages:

While the existing packages we reviewed address model summaries, performance, and visualization, none of them is designed to produce a direct, plain-language interpretation of regression results.

logiSense focuses on direct interpretation of regression results.
Simplifies statistical output for non-technical users.
Bridges the gap between technical outputs and research insights.

Comparison Table

Table 4. Comparison of existing packages with logiSense.

Feature	`modelsummary`	`sjPlot`	`gtsummary`	`performance`	`logiSense`
Human-readable coefficients	✅	✅	✅	❌	✅
Odds ratio interpretation	✅	✅	✅	❌	✅
User-friendly outputs	❌	✅	✅	❌	✅
Model performance diagnostics	❌	❌	❌	✅	❌
Direct interpretation tools	❌	❌	❌	❌	✅

Challenges

We originally underestimated the complexity associated with generating different interpretations for different variable types.

Remaining Work

Implementation of interaction interpretations:
- Currently limited to continuous \(\times\) categorical interactions
- Need to add interpretations for:
  - Continuous \(\times\) continuous interactions
  - Categorical \(\times\) categorical interactions
Complete the documentation

Future Directions

Table 5. Current features and future goals.

Feature	Currently Supports	Future Goals
Automatic Variable Type Detection	Manual input of variable types	Automatic detection for improved usability
Model Complexity	Logistic regression with one two-way interaction	Higher-order interactions between variables
Type of Regression Model	Only logistic regression	Poisson regression and multinomial logistic regression
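
As a rough idea of what automatic variable type detection could look like, the sketch below infers the type from the column class. `detect_variable_type()` is a hypothetical helper, not part of logiSense, and the rule of treating two-valued numeric columns as categorical is an assumption that a real implementation might handle differently.

```r
# Hypothetical helper (not part of logiSense)
detect_variable_type <- function(data, variable) {
  x <- data[[variable]]
  if (is.null(x)) stop("Variable '", variable, "' not found in data.")

  if (is.factor(x) || is.character(x) || is.logical(x)) {
    "categorical"
  } else if (is.numeric(x)) {
    # Assumption: numeric 0/1 indicators (e.g. hypertension) are categories
    if (length(unique(stats::na.omit(x))) <= 2) "categorical" else "continuous"
  } else {
    stop("Unsupported column type for '", variable, "'.")
  }
}

# Toy data (illustration only)
toy <- data.frame(
  age            = c(34, 51, 67, 80),
  smoking_status = c("smokes", "never smoked", "Unknown", "smokes"),
  hypertension   = c(0, 1, 0, 1)
)

detect_variable_type(toy, "age")             # "continuous"
detect_variable_type(toy, "smoking_status")  # "categorical"
detect_variable_type(toy, "hypertension")    # "categorical"
```

A helper like this could supply the default for `variable_type` while still letting users override it, since integer-coded categories with more than two levels cannot be detected from the class alone.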

References

Analytics Vidhya. (n.d.). McKinsey analytics online hackathon. Retrieved November 29, 2024, from https://www.analyticsvidhya.com/datahack/contest/mckinsey-analytics-online-hackathon/

Madasu, S., Malhotra, S., Kant, S., Sager, R., Mishra, A. K., Misra, P., & Ahamed, F. (2019, December 17). Anxiety disorders among adolescents in a rural area of northern India using Screen for Child Anxiety-Related Emotional Disorders tool: A community-based study. Indian Journal of Community Medicine, 44(4), 317-321. https://doi.org/10.4103/ijcm.ijcm_359_18

Malhotra, R. K. (2020, October 28). Errors in the use of multivariable logistic regression analysis: An empirical analysis. Indian Journal of Community Medicine, 45(4), 560–562. https://doi.org/10.4103/ijcm.IJCM_16_20

Phil. (2021, October 27). Logistic regression output advice. Stack Overflow. https://stackoverflow.com/questions/69736261/logistic-regression-output-advice

Putra, L. (n.d.). Working during COVID-19 epidemic concept: Portrait of stressed man doctor cartoon character wearing working uniform feeling stressed with laptop and paper in front of him. Vecteezy. https://www.vecteezy.com/vector-art/15708435-working-during-covid-19-epidemic-concept-portrait-of-stressed-man-doctor-cartoon-character-wearing-working-uniform-feeling-stressed-with-laptop-and-paper-in-front-of-him

Rampure, R., Inbaraj, L. R., Elizabeth, C. G., & Norman, G. (2019, August 5). Factors contributing to alcohol relapse in a rural population: Lessons from a camp-based de-addiction model from rural Karnataka. Indian Journal of Community Medicine, 44(4), 307-312. https://doi.org/10.4103/ijcm.IJCM_321_18

Soriano, F. (n.d.). Stroke prediction dataset. Kaggle. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data

Contributions

Code:

Anna:

Wrote the function for the interpretation of continuous variables
Wrote the function for models with an interaction term (continuous \(\times\) categorical)
Helped with categorical variable interpretation

Belina:

Wrote the function for the interpretation of categorical variables
Helped with interaction term interpretation
Set up the R-package (roxygen, description, etc.)

Vanessa:

Wrote a draft of the interaction function
Helped set up the R-package (roxygen, etc.)

Presentation & Research:

Victoria:

Slides for title, introduction, logistic regression formula, motivation, two examples for odds ratios being misinterpreted as relative risk, key takeaway for misinterpretation, purpose, logiSense package, using logiSense with stroke prediction dataset, how to use logiSense with stroke prediction dataset, references (1-13, 15-17, 23, 25)
Logo for R package
Overall formatting and slide transitions for the Quarto presentation

Tina:

Slides for arguments, existing packages, strengths, tables, challenges, remaining work, future directions (14, 18-22)

Thank you!
