This notebook lecture will cover multivariable logistic regression in R, using the Titanic survival dataset as an example.
Univariable models are insufficient for understanding complex phenomena because they do not account for the interconnectedness of multiple factors. Multivariable logistic regression is a more realistic approach for understanding binary outcomes, such as survival.
This lecture will cover how to: - Visualize predicted survival probabilities - Validate a model by checking its assumptions - Extract and interpret odds ratios - Identify the most important factors influencing survival - Running a Logistic Regression in R
We’ll use the Survival of Passengers on the Titanic dataset which has information on the survival status, sex, age, and passenger class of 1309 passengers in the Titanic disaster of 1912.
# Load the Titanic dataset
data(TitanicSurvival)
Warning: data set ‘TitanicSurvival’ not found
It’s always a good idea to explore new datasets to get a feel for the number of observations, number of variable, and missing values
library(DataExplorer)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'rmarkdown':
method from
print.paged_df
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
## View basic description for TitanicSurvival data
t(introduce(TitanicSurvival))
[,1]
rows 1309
columns 4
discrete_columns 3
continuous_columns 1
all_missing_columns 0
total_missing_values 0
complete_rows 1309
total_observations 5236
memory_usage 29536
## Plot basic description for TitanicSurvival data
plot_intro(TitanicSurvival)
## View missing value distribution for TitanicSurvival data
plot_missing(TitanicSurvival)
## frequency distribution of all discrete variables
plot_bar(TitanicSurvival)
## View histogram of all continuous variables
plot_histogram(TitanicSurvival)
## View overall correlation heatmap
plot_correlation(TitanicSurvival)
## View bivariate continuous distribution based on `cut`
plot_boxplot(TitanicSurvival, by = "survived")
Those missing values for age
are a problem. The
glm
function will listwise delete these variables, but
let’s explore imputation.
# Impute missing values (for instance, using median or mean imputation):
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
# Replace NA values in the age column with the median age
TitanicSurvival <- TitanicSurvival %>%
mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
m <- glm(formula = survived ~ sex + age + passengerClass,
data = d,
family = binomial)
The glm()
function is used to run a logistic regression
in R. It stands for “generalized linear model.” The
family = binomial
argument is used to specify that the
outcome is binary. The formula for a logistic regression has two parts:
- The left side specifies the binary outcome variable. - The right side
specifies the predictor variables.
Finally, the data argument specifies the dataset to be used.
# Load necessary packages
library(carData)
Attaching package: ‘carData’
The following object is masked _by_ ‘.GlobalEnv’:
TitanicSurvival
library(car)
Attaching package: ‘car’
The following object is masked from ‘package:dplyr’:
recode
library(performance)
library(ggeffects)
library(sjPlot)
library(gtsummary)
library(equatiomatic)
library(vip)
Attaching package: ‘vip’
The following object is masked from ‘package:utils’:
vi
library(pROC)
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’:
cov, smooth, var
# Run the logistic regression
model <- glm(survived ~ sex + age + passengerClass, data = TitanicSurvival, family = binomial)
Numeric variables, such as age, can have nonlinear relationships with the outcome. It’s important to check for nonlinearity using methods such as generalized additive models.
The gam()
function can be used to fit a generalized
additive model.
# Fit a generalized additive model
library(mgcv)
Loading required package: nlme
Attaching package: ‘nlme’
The following object is masked from ‘package:dplyr’:
collapse
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
gam_model <- gam(survived ~ s(age) + sex + passengerClass, data = TitanicSurvival, family = binomial)
# Plot the generalized additive model
plot(gam_model)
If a nonlinear relationship is found, it needs to be addressed appropriately.
Checking model assumptions is crucial to ensure the validity of the
results. The check_model()
function from the performance
package automatically checks all relevant assumptions for the specified
model type.
It checks assumptions such as: - The distribution of residuals - Multicollinearity - Influential outliers
The independence of observations assumption needs to be checked separately, especially if the data includes repeated measures.
The check_model()
function can also handle mixed effects
logistic regression models.
# Check model assumptions
check_model(model)
The ggpredict()
function from the ggeffects package
can be used to obtain predicted probabilities for numeric variables. It
provides multiple values to illustrate trends, rather than just the
average. Averages are calculated across all categories of other
predictors. Averaging across all categories is important because many
predictive functions default to the reference levels for categorical
predictors. This can lead to misleading conclusions if not considered
carefully. The plot()
function can be used to visualize the
estimated survival probabilities for each predictor separately. The
plot_grid()
function from the sjPlot package can be used to
combine multiple plots into a single figure.
library(ggeffects)
library(cowplot)
Attaching package: ‘cowplot’
The following objects are masked from ‘package:sjPlot’:
plot_grid, save_plot
The following object is masked from ‘package:ggeffects’:
get_title
# Obtain predicted probabilities for each term separately
predictions_age <- ggpredict(model, terms = "age")
Data were 'prettified'. Consider using `terms="age [all]"` to get smooth plots.
predictions_sex <- ggpredict(model, terms = "sex")
predictions_passengerClass <- ggpredict(model, terms = "passengerClass")
# Plot each term individually
plot_age <- plot(predictions_age)
plot_sex <- plot(predictions_sex)
plot_passengerClass <- plot(predictions_passengerClass)
# Combine plots into a single figure
plot_grid(plot_age, plot_sex, plot_passengerClass)
# extract predictions
library(ggeffects)
ggeffect(model)
$sex
# Predicted probabilities of survived
sex | Predicted | 95% CI
-------------------------------
female | 0.72 | 0.68, 0.77
male | 0.18 | 0.15, 0.21
$age
# Predicted probabilities of survived
age | Predicted | 95% CI
----------------------------
0 | 0.58 | 0.48, 0.66
10 | 0.50 | 0.43, 0.56
20 | 0.42 | 0.37, 0.46
30 | 0.34 | 0.31, 0.37
40 | 0.27 | 0.24, 0.31
50 | 0.21 | 0.17, 0.27
60 | 0.16 | 0.12, 0.23
80 | 0.09 | 0.05, 0.16
Not all rows are shown in the output. Use `print(..., n = Inf)` to show all rows.
$passengerClass
# Predicted probabilities of survived
passengerClass | Predicted | 95% CI
---------------------------------------
1st | 0.68 | 0.62, 0.74
2nd | 0.40 | 0.33, 0.47
3rd | 0.20 | 0.17, 0.24
attr(,"class")
[1] "ggalleffects" "list"
attr(,"model.name")
[1] "model"
Odds ratios and their confidence intervals and p-values are
essential for confirming the statistical significance of differences in
survival rates. The tbl_regression()
function from the
gtsummary package can be used to create a table of odds ratios. The
exponentiate = TRUE
argument converts log odds ratios to
odds ratios. The add_pairwise_contrasts = TRUE
argument
compares all categories with each other. The table can be saved in
various formats, such as Microsoft Word or PNG.
# Create a table of odds ratios
fancy_table <- tbl_regression(
model,
exponentiate = TRUE,
add_pairwise_contrasts = TRUE,
# up your table game
contrasts_adjust = "bonferroni", # none
pairwise_reverse = FALSE,
pvalue_fun = ~style_pvalue(.x, digits = 3)) %>%
add_significance_stars(hide_p = F, hide_se = T,
hide_ci = F) %>%
bold_p()
fancy_table
Characteristic | OR1,2 | 95% CI2 | p-value |
---|---|---|---|
sex | |||
female / male | 12.2*** | 9.09, 16.3 | <0.001 |
age | 0.97*** | 0.96, 0.98 | <0.001 |
passengerClass | |||
1st / 2nd | 3.26*** | 1.97, 5.40 | <0.001 |
1st / 3rd | 8.65*** | 5.42, 13.8 | <0.001 |
2nd / 3rd | 2.65*** | 1.72, 4.09 | <0.001 |
1 *p<0.05; **p<0.01; ***p<0.001 | |||
2 OR = Odds Ratio, CI = Confidence Interval |
The extract_eq()
function from the equatiomatic
package converts a model to LaTeX format, making it easier to
include equations in documents.
# Extract the model equation
extract_eq(model)
$$
\log\left[ \frac { P( \operatorname{survived} = \operatorname{yes} ) }{ 1 - P( \operatorname{survived} = \operatorname{yes} ) } \right] = \alpha + \beta_{1}(\operatorname{sex}_{\operatorname{male}}) + \beta_{2}(\operatorname{age}) + \beta_{3}(\operatorname{passengerClass}_{\operatorname{2nd}}) + \beta_{4}(\operatorname{passengerClass}_{\operatorname{3rd}})
$$
summary(model)
Call:
glm(formula = survived ~ sex + age + passengerClass, family = binomial,
data = TitanicSurvival)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.332795 0.299152 11.141 < 2e-16 ***
sexmale -2.498210 0.148571 -16.815 < 2e-16 ***
age -0.032159 0.006107 -5.266 1.39e-07 ***
passengerClass2nd -1.183165 0.209954 -5.635 1.75e-08 ***
passengerClass3rd -2.157681 0.195160 -11.056 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1741.0 on 1308 degrees of freedom
Residual deviance: 1228.1 on 1304 degrees of freedom
AIC: 1238.1
Number of Fisher Scoring iterations: 4
The contrast_adjust
argument in the
tbl_regression()
function allows for different
p-value adjustments for multiple comparisons, such as
Bonferroni or Tukey. The Bonferroni correction is conservative and can
increase p-values, making it harder to reach significance.
Other corrections or no correction may be more appropriate depending on
the data and research question. The
pairwise_reverse = FALSE
argument in the
tbl_regression()
function can be used to interpret odds
ratios greater than one instead of less than one.
With large sample sizes, all p-values may be significant, making it difficult to determine which predictors are most important.
Two methods for ranking variable importance are: - The
Anova()
function from the car package, which provides
likelihood ratio chi-square values. Higher chi-square values indicate
greater importance. - Random forest classification, using the
vip()
package to create a variable importance plot.
# Assess variable importance using ANOVA
Anova(model)
Analysis of Deviance Table (Type II tests)
Response: survived
LR Chisq Df Pr(>Chisq)
sex 340.74 1 < 2.2e-16 ***
age 29.10 1 6.876e-08 ***
passengerClass 139.55 2 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Run the random forest model
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
combine
rf_model <- randomForest(survived ~ sex + age + passengerClass, data = TitanicSurvival)
# Create a variable importance plot
vip(rf_model)
# Obtain model performance metrics
performance(model)
# Indices of model performance
AIC | AICc | BIC | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
-----------------------------------------------------------------------------------------------------------
1238.123 | 1238.169 | 1264.008 | 0.362 | 0.387 | 1.000 | 0.469 | -Inf | 9.369e-04 | 0.699
# Interpret R-squared value
library(effectsize)
library(performance)
# Calculate R-squared
r2_value <- r2(model)$R2
# Interpret the R-squared value
interpret_r2(r2_value)
Tjur's R2
"substantial"
(Rules: cohen1988)
The ROC curve visualizes a model’s ability to distinguish between positive and negative cases. Area under the curve (AUC): A higher AUC indicates better model performance. The roc() function from the pROC package can be used to create an ROC curve. The confusion matrix provides additional insights into model performance, including metrics such as specificity, sensitivity, and accuracy.
# Create an ROC curve
roc_curve <- roc(survived ~ fitted.values(model), data=TitanicSurvival,
plot = TRUE, legacy.axes = TRUE,
print.auc = TRUE, ci = TRUE)
Setting levels: control = no, case = yes
Setting direction: controls < cases
# Plot the ROC curve
plot(roc_curve)
# Print the AUC
auc(roc_curve)
Area under the curve: 0.8339
Multivariable models provide a more realistic understanding of relationships than univariable models because they account for spurious correlations. Analyzing variables in isolation can lead to misleading conclusions.
This notebook lecture has covered the key concepts and techniques of multivariable logistic regression in R. You have learned how to run and interpret a model, check assumptions, visualize results, assess variable importance, evaluate model fit, and understand the importance of considering multiple variables simultaneously.install