logiSense - Interpreting Logistic Regression Results

Challenges and Solutions for Non-Statisticians


Anna (Jingxuan) He, Belina Jang, Vanessa Liao, Tina (Zhaoyu) Tan, & Victoria Truong

December 2, 2024


github.com/BelinaJang/logiSense

logiSense Logo

Table of Contents

  1. Introduction
  2. Logistic Regression Review
  3. Motivation
  4. Purpose
  5. logiSense Overview & Example
  6. Existing Packages
  7. Strengths
  8. Challenges & Remaining Work
  9. Future Directions
  10. Contributions

Introduction

  • Logistic regression is essential for evaluating the strength of association between a binary outcome and independent variables (categorical or continuous).

  • Logistic regression models can be extended to include interaction terms.

Logistic Regression Formula

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{I} \beta_i X_i \]

Where:

  • \(\text{logit}(p)\) is the log-odds of the probability \(p\)
  • \(p\) is the probability of the outcome occurring
  • \(\beta_0\) is the intercept
  • \(\beta_i\) is the coefficient of predictor \(X_i\)
  • \(X_i\) is the \(i\)-th predictor variable \((i = 1, 2, \dots, I)\)
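
A short derivation, using only the symbols defined above, may help show why exponentiated coefficients are read as odds ratios. Write \(\text{odds} = p / (1 - p)\), hold all other predictors fixed, and increase a single predictor \(X_k\) by one unit (the index \(k\) is introduced here only for illustration):

\[ \frac{\text{odds}(X_k + 1)}{\text{odds}(X_k)} = \frac{\exp\left(\beta_0 + \sum_{i \neq k} \beta_i X_i + \beta_k (X_k + 1)\right)}{\exp\left(\beta_0 + \sum_{i \neq k} \beta_i X_i + \beta_k X_k\right)} = e^{\beta_k} \]

So \(e^{\beta_k}\) is a ratio of odds, not a ratio of probabilities.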

    

Logistic Regression Formula (with One Two-Way Interaction)

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{I} \beta_i X_i \color{magenta}{+ \beta_{ij}(X_i \cdot X_j)} \]

Where:

  • \(\text{logit}(p)\) is the log-odds of the probability \(p\)
  • \(p\) is the probability of the outcome occurring
  • \(\beta_0\) is the intercept
  • \(\beta_i\) is the coefficient of predictor \(X_i\)
  • \(X_i\) is the \(i\)-th predictor variable \((i = 1, 2, \dots, I)\)

    

For the two-way interaction:

  • \(X_i, X_j\) are the two interacting predictor variables \((i \neq j)\)
  • \(\beta_{ij}\) is the coefficient for the interaction term between \(X_i\) and \(X_j\)
  • \(X_i \cdot X_j\) is the interaction term of predictors \(X_i\) and \(X_j\)
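
As a sketch of what the interaction term implies for interpretation: with the other predictors and \(X_j\) held fixed, a one-unit increase in \(X_i\) changes the log-odds by \(\beta_i + \beta_{ij} X_j\), so the odds ratio for \(X_i\) is no longer a single number:

\[ \text{OR}_{X_i \mid X_j} = \exp\left(\beta_i + \beta_{ij} X_j\right) \]

By symmetry, the odds ratio for a one-unit increase in \(X_j\) at a fixed \(X_i\) is \(\exp\left(\beta_j + \beta_{ij} X_i\right)\). Each odds ratio therefore has to be reported at a stated value (or level) of the other variable.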

Motivation

  • Logistic regression output is challenging to interpret because the coefficients are on the log-odds scale.

  • We conducted a literature review and found a paper by Rajeev Kumar Malhotra (2020) that highlights critical errors in interpreting the results of logistic regression.

Several occurrences of the misinterpretation of odds ratios (OR):

  • → Treated ORs as direct probabilities or relative risk (RR)
  • → Can lead to exaggerated conclusions about the strength of associations

Odds Ratio Misinterpreted as Relative Risk

Example 1: Rampure et al. (2019)



Misinterpretation:

“People with a high level of craving have 1.8 times the chance of relapse compared to low craving.”

    

The statement misinterpreted the OR (1.78, 95% CI: 1.25-2.54) as the RR.



Correct Interpretation:

“The odds of alcoholic relapse having a high level of craving is 1.8 times more than odds of alcoholic relapse having a low level of craving.”
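
To illustrate how far an odds ratio can sit from the relative risk, here is a minimal R sketch. It uses the reported OR of 1.78 together with baseline risks that are purely hypothetical (the source papers' actual risks are not given here); `or_to_rr` is a helper name invented for this example. It relies on the identity RR = OR / (1 - p0 + p0 × OR), where p0 is the risk in the reference group.

```r
# Convert an odds ratio to the relative risk it implies for a given
# baseline risk p0 (the outcome risk in the reference group).
or_to_rr <- function(or, p0) {
  stopifnot(or > 0, p0 > 0, p0 < 1)
  or / (1 - p0 + p0 * or)
}

or_reported    <- 1.78                   # OR quoted in the example above
baseline_risks <- c(0.05, 0.20, 0.50)    # hypothetical, for illustration only

comparison <- data.frame(
  baseline_risk = baseline_risks,
  odds_ratio    = or_reported,
  implied_rr    = round(or_to_rr(or_reported, baseline_risks), 2)
)
print(comparison)
```

Under these assumed baseline risks, the implied relative risk drops further below 1.78 as the outcome becomes more common, which is why reading the OR as "1.8 times the chance" can overstate the effect.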

Odds Ratio Misinterpreted as Relative Risk:

Example 2: Madasu et al. (2019)



Misinterpretation:

“Female sex was found to be two times more associated with anxiety disorders than male sex.”



Correct Interpretation:

“The odds of having anxiety disorder are twice as high for females compared to males.”

Key Takeaway: Misinterpretation of Odds Ratios is More Pervasive than Expected

  • Interpreting odds ratios as probabilities or risks causes confusion and can overstate the strength of associations.

  • When the outcome is common, an odds ratio lies further from 1 than the corresponding relative risk, so readers perceive the effect as stronger than it truly is.

Purpose

To address this, we developed an R package that helps non-statisticians accurately interpret the logistic regression models they fit, helping to close this gap in understanding.



Our package is called…

logiSense

    

logiSense is an R package that makes sense of logistic regression results by turning model output into plain-language interpretations.

We have two functions:

    

Function 1: logis

  • Provides logistic regression interpretation for models without interaction terms.

    

Function 2: logint

  • Provides logistic regression interpretation for models with one interaction term.

Arguments

Table 1. Description of parameters for logis.

| Parameter | Definition | Type |
|---|---|---|
| formula | Description of the model to be fitted | formula |
| data | Data set containing the variables in the formula | data frame |
| variable_interest | Name of the variable of interest for interpretation | character |
| variable_type | Type of the variable of interest ("continuous" or "categorical") | character |

    

Table 2. Description of parameters for logint.

| Parameter | Definition | Type |
|---|---|---|
| formula | Description of the model to be fitted | formula |
| data | Data set containing the variables in the formula | data frame |
| categorical_var | Name of the categorical variable of interest for interpretation | character |
| continuous_var | Name of the continuous variable of interest for interpretation | character |

Using logiSense with Stroke Prediction Dataset

  • Used a dataset from Kaggle (McKinsey & Company’s healthcare hackathon).

Table 3. Description of variables from the stroke prediction dataset.

| Variable Name | Description and Categories | Type |
|---|---|---|
| ID | Unique identifier for each patient | Numeric/Integer |
| Gender | Gender of the patient: “Male”, “Female”, or “Other” | Categorical |
| Age | Age of the patient | Numeric |
| Hypertension | Whether the patient has hypertension: 0 = No, 1 = Yes | Binary |
| Heart Disease | Whether the patient has any heart disease: 0 = No, 1 = Yes | Binary |
| Ever Married | Marital status of the patient: “No” or “Yes” | Categorical |
| Work Type | Employment type of the patient: “children”, “Govt_job”, “Never_worked”, “Private”, or “Self-employed” | Categorical |
| Residence Type | Type of residence: “Rural” or “Urban” | Categorical |
| Avg Glucose Level | Average glucose level in blood | Numeric |
| BMI | Body mass index | Numeric |
| Smoking Status | Smoking status of the patient: “formerly smoked”, “never smoked”, “smokes”, or “Unknown” | Categorical |
| Stroke | Whether the patient had a stroke: 0 = No, 1 = Yes | Binary |

How to Use logiSense: Stroke Prediction Dataset

For logis (no interaction term):

Continuous variable:

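library(logiSense)  # interpretation helpers: logis() and logint()
library(here)       # builds file paths relative to the project root

# Load the stroke prediction data
test_data <- read.csv(here("data/test_data.csv"))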

result_con <- logis(formula = stroke ~ gender + age + hypertension + heart_disease + avg_glucose_level + smoking_status,
                    data = test_data,
                    variable_interest = "age",
                    variable_type = "continuous")

Outcome variable:  stroke
For each one-unit increase in 'age,' the odds of 'stroke' are multiplied by 1.072 (95% CI: 1.061 - 1.083). This result is statistically significant at 5% significance level (p-value: 2.346e-40).

Categorical variable:

result_cat <- logis(formula = stroke ~ gender + age + hypertension + heart_disease + avg_glucose_level + smoking_status,
                    data = test_data,
                    variable_interest = "smoking_status",
                    variable_type = "categorical")

Outcome variable:  stroke
Reference level:  formerly smoked 
Compared to the reference level 'formerly smoked' of 'smoking_status,' the odds of 'stroke' for the level 'never smoked' are multiplied by 0.8164 (95% CI: 0.5801 - 1.153). This result is not statistically significant at 5% significance level (p-value: 0.2466). 
Compared to the reference level 'formerly smoked' of 'smoking_status,' the odds of 'stroke' for the level 'smokes' are multiplied by 1.119 (95% CI: 0.7314 - 1.697). This result is not statistically significant at 5% significance level (p-value: 0.5985). 
Compared to the reference level 'formerly smoked' of 'smoking_status,' the odds of 'stroke' for the level 'Unknown' are multiplied by 0.9589 (95% CI: 0.6375 - 1.433). This result is not statistically significant at 5% significance level (p-value: 0.8388). 
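
For readers who want to see where numbers like these come from, the following is a minimal base-R sketch, not logiSense's implementation. It uses simulated stand-in data (the object `sim` and its true coefficients are invented for illustration, not the stroke dataset) and assumes Wald confidence intervals; this presentation does not state which interval method `logis` uses.

```r
set.seed(2024)

# Simulated stand-in data; NOT the stroke dataset used above.
n <- 2000
sim <- data.frame(
  age            = round(runif(n, 20, 85)),
  smoking_status = factor(sample(c("formerly smoked", "never smoked", "smokes"),
                                 n, replace = TRUE))
)
# Assumed true model: log-odds rise by 0.07 per year of age.
sim$stroke <- rbinom(n, 1, plogis(-6 + 0.07 * sim$age))

fit <- glm(stroke ~ age + smoking_status, data = sim, family = binomial)

# Odds ratios with Wald 95% confidence intervals and p-values.
est <- coef(summary(fit))
z   <- qnorm(0.975)
or_table <- data.frame(
  odds_ratio = exp(est[, "Estimate"]),
  ci_lower   = exp(est[, "Estimate"] - z * est[, "Std. Error"]),
  ci_upper   = exp(est[, "Estimate"] + z * est[, "Std. Error"]),
  p_value    = est[, "Pr(>|z|)"]
)
print(round(or_table, 4))

# The reference level of a factor is its first level.
reference_level <- levels(sim$smoking_status)[1]
print(reference_level)

# A one-unit OR compounds multiplicatively: the OR for a 10-year
# increase in age is the one-year OR raised to the 10th power.
or_10_years <- or_table["age", "odds_ratio"]^10
print(or_10_years)
```

The same compounding applies to the reported result: an odds ratio of 1.072 per year of age corresponds to roughly a doubling of the odds over ten years (1.072^10 is about 2.0).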

How to Use logiSense: Stroke Prediction Dataset

For logint (two-way interaction):

Continuous \(\times\) Categorical:

logint(formula = stroke ~ work_type * age,
       data = test_data,
       continuous_var = "age",
       categorical_var = "work_type")

The odds ratio of 'stroke' for increasing age by one unit in children = 1.04.
The odds ratio of 'stroke' for increasing age by one unit in Govt_job = 1.087.
The odds ratio of 'stroke' for increasing age by one unit in Never_worked = 1.
The odds ratio of 'stroke' for increasing age by one unit in Private = 1.087.
The odds ratio of 'stroke' for increasing age by one unit in Self-employed = 1.063.


For an observation with age=value, the odds ratio of 'stroke' for Govt_job vs children (reference level) is e^(-1.73 + 0.04404*(value)).
For an observation with age=value, the odds ratio of 'stroke' for Never_worked vs children (reference level) is e^(-9.444 + -0.03938*(value)).
For an observation with age=value, the odds ratio of 'stroke' for Private vs children (reference level) is e^(-1.561 + 0.04413*(value)).
For an observation with age=value, the odds ratio of 'stroke' for Self-employed vs children (reference level) is e^(-0.3749 + 0.02177*(value)).
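
The printed formulas can be checked by hand. The sketch below copies the rounded coefficients from the output above and assumes that the age main effect equals log(1.04), the log of the odds ratio printed for the reference level `children`. The data frame `coefs`, its column names, and the example age of 60 are made up for this illustration.

```r
# Values copied from the logint output above (rounded as printed).
or_age_children <- 1.04                 # age OR in the reference level
beta_age <- log(or_age_children)        # approximate main effect of age

# Printed formulas: log(OR vs children) = intercept_shift + age_slope * age
coefs <- data.frame(
  level           = c("Govt_job", "Never_worked", "Private", "Self-employed"),
  intercept_shift = c(-1.73, -9.444, -1.561, -0.3749),
  age_slope       = c(0.04404, -0.03938, 0.04413, 0.02177)
)

# Age OR within each level: exp(main effect + interaction coefficient).
coefs$or_age <- round(exp(beta_age + coefs$age_slope), 3)

# OR of each level vs children at a chosen age (60 is an arbitrary example).
age_value <- 60
coefs$or_vs_children <- exp(coefs$intercept_shift + coefs$age_slope * age_value)

print(coefs)
```

The recomputed age odds ratios agree with the printed 1.087, 1, 1.087, and 1.063 after rounding, which suggests (but does not confirm) that `logint` combines the main effect and the interaction coefficient in this way. The last column shows why the level-versus-reference odds ratio has to be evaluated at a specific age.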

Existing Packages

Overview of Similar Packages

modelsummary:

  • Presents coefficients in human-readable formats.
  • Directly interprets logistic regression outputs, including odds ratios.

sjPlot:

  • Displays odds ratios, significance stars, and confidence intervals.
  • Outputs publication-ready tables and visualizations.

gtsummary:

  • Converts logistic regression coefficients to odds ratios.
  • Customizable labels, formats, and output styles.

performance:

  • Assesses model performance and diagnostics.

Strengths: What Sets Our Package Apart

Gap in Existing Packages:

  • Existing packages address model performance, tables, and visualization, but none is designed to give comprehensive, direct interpretations of regression results.

  • logiSense focuses on direct interpretation of regression results.
  • Simplifies statistical output for non-technical users.
  • Bridges the gap between technical outputs and research insights.

Comparison Table

Table 4. Comparison of existing packages with logiSense.

| Feature | modelsummary | sjPlot | gtsummary | performance | logiSense |
|---|---|---|---|---|---|
| Human-readable coefficients | | | | | |
| Odds ratio interpretation | | | | | |
| User-friendly outputs | | | | | |
| Model performance diagnostics | | | | | |
| Direct interpretation tools | | | | | |

Challenges

  • We originally underestimated the complexity associated with generating different interpretations for different variable types.

Remaining Work

  • Implementation of interaction interpretations:
    • Currently limited to continuous \(\times\) categorical interactions
    • Need to add interpretations for:
      • Continuous \(\times\) continuous interactions
      • Categorical \(\times\) categorical interactions
  • Complete the documentation

Future Directions

    

Table 5. Current features and future goals.

| Feature | Currently Supports | Future Goals |
|---|---|---|
| Automatic Variable Type Detection | Manual input of variable types | Automatic detection for improved usability |
| Model Complexity | Logistic regression with one two-way interaction | Higher-order interactions between variables |
| Type of Regression Model | Only logistic regression | Poisson regression and multinomial logistic regression |
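
As one possible direction for the automatic variable type detection goal, here is a rough R sketch. The helper `detect_variable_type` is hypothetical and not part of logiSense, and the rule that treats numeric columns with at most two distinct values as categorical is an assumption chosen for this example.

```r
# Hypothetical helper; not part of logiSense.
detect_variable_type <- function(data, variable) {
  stopifnot(is.data.frame(data), variable %in% names(data))
  x <- data[[variable]]
  if (is.factor(x) || is.character(x) || is.logical(x)) {
    "categorical"
  } else if (is.numeric(x)) {
    # Numeric 0/1 indicators (e.g. hypertension) behave like categories.
    if (length(unique(x[!is.na(x)])) <= 2) "categorical" else "continuous"
  } else {
    stop("Unsupported column type for '", variable, "'.")
  }
}

# Small made-up data frame to exercise the helper.
example <- data.frame(
  age            = c(34, 51, 67, 45),
  hypertension   = c(0, 1, 0, 1),
  smoking_status = c("smokes", "never smoked", "Unknown", "smokes")
)
detected <- sapply(names(example), detect_variable_type, data = example)
print(detected)
```

A result like this could be used as the default for `variable_type` while still letting the user override it, since a numeric code with many levels (for example, a region ID) would be misclassified by such a simple rule.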

References

Analytics Vidhya. (n.d.). McKinsey analytics online hackathon. Retrieved November 29, 2024, from https://www.analyticsvidhya.com/datahack/contest/mckinsey-analytics-online-hackathon/

Madasu, S., Malhotra, S., Kant, S., Sager, R., Mishra, A. K., Misra, P., & Ahamed, F. (2019, December 17). Anxiety disorders among adolescents in a rural area of northern India using Screen for Child Anxiety-Related Emotional Disorders tool: A Community-based study. Indian Journal of Community Medicine, 44(4), 317–321. https://doi.org/10.4103/ijcm.ijcm_359_18

Malhotra, R. K. (2020, October 28). Errors in the use of multivariable logistic regression analysis: An empirical analysis. Indian Journal of Community Medicine, 45(4), 560–562. https://doi.org/10.4103/ijcm.IJCM_16_20

Phil. (2021, October 27). Logistic regression output advice. Stack Overflow. https://stackoverflow.com/questions/69736261/logistic-regression-output-advice

Putra, L. (n.d.). Working during COVID-19 epidemic concept: Portrait of stressed man doctor cartoon character wearing working uniform feeling stressed with laptop and paper in front of him. Vecteezy. https://www.vecteezy.com/vector-art/15708435-working-during-covid-19-epidemic-concept-portrait-of-stressed-man-doctor-cartoon-character-wearing-working-uniform-feeling-stressed-with-laptop-and-paper-in-front-of-him

Rampure, R., Inbaraj, L. R., Elizabeth, C. G., & Norman, G. (2019, August 5). Factors contributing to alcohol relapse in a rural population: Lessons from a camp-based de-addiction model from rural Karnataka. Indian Journal of Community Medicine, 44(4), 307–312. https://doi.org/10.4103/ijcm.IJCM_321_18

Soriano, F. (n.d.). Stroke prediction dataset. Kaggle. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data

Contributions

Code:

Anna:

  • Wrote the function for the interpretation of continuous variables
  • Wrote the function for models with an interaction term (continuous \(\times\) categorical)
  • Helped with categorical variable interpretation

Belina:

  • Wrote the function for the interpretation of categorical variables
  • Helped with interaction term interpretation
  • Set up the R-package (roxygen, description, etc.)

Vanessa:

  • Wrote a draft of the interaction function
  • Helped set up the R package (roxygen, etc.)

Presentation & Research:

Victoria:

  • Slides for title, introduction, logistic regression formula, motivation, two examples for odds ratios being misinterpreted as relative risk, key takeaway for misinterpretation, purpose, logiSense package, using logiSense with stroke prediction dataset, how to use logiSense with stroke prediction dataset, references (1-13, 15-17, 23, 25)
  • Logo for R package
  • Overall formatting and slide transitions for the Quarto presentation

Tina:

  • Slides for arguments, existing packages, strengths, tables, challenges, remaining work, future directions (14, 18-22)


Thank you!

QA Image
