This workflow guides you through a complete multinomial logistic regression analysis from data preparation to prediction and interpretation. Based on the structure presented in the lecture and practical materials.
Objective: Set up your R environment and load necessary libraries
# Start with a fresh project
library(here)
library(tidyverse)
library(haven)
library(gtsummary)
library(VGAM)
library(nnet)
Objective: Understand the structure and characteristics of your data
# Read data (example: Stata format)
dat <- read_dta('your_data.dta')
# Examine data structure
glimpse(dat)
head(dat)
summary(dat)
# Get summary statistics by group
dat %>%
tbl_summary(by = outcome_variable,
statistic = list(all_continuous() ~ "{mean} ({sd})"))
Questions to answer: - What is the outcome variable? Does it have 3+ categories? - Are all variables in the correct format (numeric, factor)? - Are there missing values? - What is the sample size?
Objective: Prepare variables for analysis
# Convert outcome to factor variable
dat <- dat %>%
mutate(outcome_factor = factor(outcome,
labels = c("Category_0", "Category_1", "Category_2")))
# Check the categories and counts
dat %>% count(outcome_factor)
# Set reference category (typically the most common or most meaningful)
# For VGAM, it uses the LAST level as reference
# For nnet, it uses the FIRST level as reference
dat <- dat %>%
mutate(outcome_ordered = fct_relevel(outcome_factor,
c("Category_2", "Category_1", "Category_0")))
# Recode other categorical variables as needed
dat <- dat %>%
mutate(predictor_factor = factor(predictor,
labels = c("No", "Yes")))
# Verify transformations
levels(dat$outcome_ordered)
table(dat$outcome_ordered, dat$predictor_factor)
Key decisions: - Which category should be the reference? (Usually the “baseline” or most common) - Do any variables need recoding or collapsing? - Are continuous variables appropriately scaled?
Objective: Examine association between outcome and each covariate
# Basic cross-tabulation
dat %>%
group_by(outcome_factor) %>%
count()
# Cross-tabulation with percentages
table(dat$outcome_factor, dat$predictor)
Objective: Screen variables for inclusion in multivariable model
# Fit simple multinomial logit with one covariate
simple_model1 <- vglm(outcome_ordered ~ predictor1,
family = multinomial, data = dat)
summary(simple_model1)
# Extract p-values and coefficients
coef(simple_model1)
confint(simple_model1)
# Calculate relative risk ratios
exp(coef(simple_model1))
Objective: Identify variables for multivariable model
Include in next step if: - p-value < 0.25 (Hosmer-Lemeshow criterion for screening) - Clinically or theoretically important - Known confounder
Exclude if: - p-value > 0.25 and not clinically important - Collinear with another variable - Sparse category (very few observations)
Objective: Build foundational model with key variables
# Fit model with main predictor(s) of interest
model1 <- vglm(outcome_ordered ~ main_predictor,
family = multinomial, data = dat)
summary(model1)
Objective: Build progressively more complex model
# Add more covariates
model2 <- vglm(outcome_ordered ~ main_predictor + covariate1 + covariate2,
family = multinomial, data = dat)
summary(model2)
# For categorical variables, convert to factor within formula
model3 <- vglm(outcome_ordered ~ main_predictor + factor(category_var) + continuous_var,
family = multinomial, data = dat)
summary(model3)
Objective: Identify variables that substantially change coefficients
# Compare coefficients between models
coef_model1 <- coef(model1)
coef_model2 <- coef(model2)
# Calculate percent change: (new - old) / old * 100
# If >20% change when removing a variable, it's likely a confounder
# Keep confounders in model even if not statistically significant
Objective: Test if effect of one variable depends on another
# Add interaction term (only if clinically meaningful)
model_interact <- vglm(outcome_ordered ~ predictor1 + predictor2 +
predictor1:predictor2,
family = multinomial, data = dat)
summary(model_interact)
Objective: Obtain comprehensive estimate tables
# Get log-odds (β) and confidence intervals
b <- coef(model)
ci_b <- confint(model)
b_ci <- cbind(b, ci_b)
# Show results
b_ci
Objective: Exponentiate coefficients for interpretation
# RRR = exp(β)
rrr <- exp(coef(model))
rrr_ci <- exp(confint(model))
# Combine into comprehensive table
results <- cbind(b, b_ci, rrr, rrr_ci)
colnames(results) <- c('b', 'lower_b', 'upper_b', 'rrr', 'lower_rrr', 'upper_rrr')
results
Objective: Test statistical significance of coefficients
Method 1: p-values (from summary output)
summary(model)
# Look for Pr(>|z|) column
# p-value < 0.05 indicates statistical significance
Method 2: Confidence Intervals
# For log-odds: If CI does not include 0, p < 0.05
# For RRR: If CI does not include 1, p < 0.05
# For nnet::multinom (which doesn't output p-values):
coefs <- summary(model)$coefficients
ses <- summary(model)$standard.errors
z_test <- coefs / ses
p_values <- (1 - pnorm(abs(z_test), 0, 1)) * 2
Objective: Communicate findings
For a binary predictor (coded 0/1):
"The odds of outcome=1 (vs reference) compared to outcome=0
for individuals with predictor=1 vs predictor=0 is RRR times higher"
Example: RRR = 2.5
"The odds of intermediate placement vs outpatient placement
for adolescents WITH history of violence is 2.5 times higher
than for adolescents WITHOUT history of violence"
For a continuous predictor:
"A one-unit increase in [predictor] is associated with
a change in odds of outcome=1 by a factor of RRR"
Example: RRR = 1.15, predictor = income
"A $1,000 increase in annual income increases the odds
of outcome=1 by 1.15 times (15% increase)"
Objective: Compare nested models (simpler vs more complex)
# Extract log-likelihoods
ll_simple <- logLik(model_simple)
ll_full <- logLik(model_full)
# Calculate test statistic
LR <- -2 * (ll_simple - ll_full)
# Degrees of freedom = number of parameters added
df <- length(coef(model_full)) - length(coef(model_simple))
# Calculate p-value
p_value <- 1 - pchisq(LR, df)
# Decision: if p < 0.05, prefer full model
Objective: Test individual coefficients (from summary output)
# From summary(model):
# z-value = β / SE(β)
# p-value already calculated
# Interpretation: p < 0.05 = statistically significant
Choose model based on: - Statistical significance (LR test, Wald test) - Clinical/practical significance - Model parsimony (simpler models preferred when similar fit) - Goodness of fit (next section)
Objective: Evaluate how well model fits the data
# Get model deviance
deviance(model)
logLik(model)
# Residual deviance should be approximately equal to degrees of freedom
# if model fits well
# Compare observed vs predicted frequencies
observed <- table(dat$outcome_ordered)
predicted <- colSums(fitted(model))
fit_table <- data.frame(
Category = names(observed),
Observed = as.numeric(observed),
Predicted = round(predicted, 1)
)
fit_table
Objective: Verify model assumptions
Linearity of logits (for continuous variables):
# Check if continuous variables are linear in logit
# Options:
# 1. Plot log odds vs variable
# 2. Use fractional polynomials to test
# 3. Categorize continuous variable and compare
# If linearity violated, consider:
# - Recategorize the variable
# - Use polynomial terms: I(predictor^2)
# - Use splines (advanced)
Check for influential observations:
# Calculate delta beta (change in coefficients if obs removed)
# R automatically reports this as "Hat matrix diagonal"
# in more detailed diagnostics
# Observations with very high influence should be investigated:
# - Check for data entry errors
# - Verify they're from same population
# - Consider if true outliers or errors
Objective: Formal tests of model fit
# For each comparison (logit 1, logit 2, etc.):
# Can perform separate binary logistic regression goodness-of-fit
# (This is mentioned as an option in the materials)
# Basic approach: Pearson chi-square for grouped data
# If available: Hosmer-Lemeshow test
Objective: Calculate predicted log-odds for new observations
# For all observations in data
log_odds <- predict(model, type = 'link')
head(log_odds)
# For specific covariate values
new_data <- data.frame(
predictor1 = c(0, 1),
predictor2 = c(10, 15)
)
log_odds_new <- predict(model, newdata = new_data, type = 'link')
Manual calculation:
# For categorical predictor with logit = α + β × X
alpha <- coef(model)[1] # intercept
beta <- coef(model)[2] # slope
# When X = 0:
log_odds_x0 <- alpha + beta * 0
# When X = 1:
log_odds_x1 <- alpha + beta * 1
Objective: Calculate probability for each outcome category
# For all observations
probabilities <- predict(model, type = 'response')
head(probabilities)
# For new data
probabilities_new <- predict(model, newdata = new_data, type = 'response')
Manual calculation (for 3-category outcome):
# If log_odds_1 = g1(x), log_odds_2 = g2(x):
exp_g1 <- exp(g1)
exp_g2 <- exp(g2)
P_category0 <- 1 / (1 + exp_g1 + exp_g2)
P_category1 <- exp_g1 / (1 + exp_g1 + exp_g2)
P_category2 <- exp_g2 / (1 + exp_g1 + exp_g2)
# Check: P_cat0 + P_cat1 + P_cat2 should equal 1
Objective: Make actionable predictions
# Example output:
# Observation 1: P(outcome=0)=0.60, P(outcome=1)=0.25, P(outcome=2)=0.15
# Most likely outcome: Category 0 (probability 60%)
# For clinical/practical decisions:
# - If probabilities close (e.g., 0.33, 0.33, 0.33) → uncertain
# - If one dominates (e.g., 0.80, 0.15, 0.05) → confident prediction
| Feature | VGAM::vglm | nnet::multinom |
|---|---|---|
| Reference category | Last level | First level |
| Output format | Detailed statistical output | Simpler output |
| P-values | Automatic | Manual calculation |
| CI for coefficients | confint() function | Manual calculation |
| Learning curve | Moderate | Moderate |
| Phase | Purpose | Main Output |
|---|---|---|
| 1 | Prepare and explore data | Clean dataset, understanding of structure |
| 2 | Screen variables | List of candidates for multivariable model |
| 3 | Build model | Estimated coefficients and fitted model |
| 4 | Inferences | Interpretable coefficients, RRR, confidence intervals |
| 5 | Compare models | Selection of best-fitting model |
| 6 | Assess fit | Confirmation model is adequate |
| 7 | Predict | Probabilities for new observations |
Q: How many variables can I include? A: Rule of thumb: at least 10-20 observations per predictor parameter. For outcome with k categories and p predictors, you have (k-1) × p parameters.
Q: What if my outcome has ordinal categories? A: Consider ordinal logistic regression instead (proportional odds model). Use this workflow as foundation but add tests for proportional odds assumption.
Q: How do I handle missing data? A: Options: (1) Listwise deletion (simplest, loses data), (2) Multiple imputation (recommended, complex), (3) Investigate missingness mechanism.
Q: What if I have interaction terms? A: Include only if clinically meaningful. Test with LR test. Interpretation becomes more complex. Document the specific population where interaction applies.