Multinomial Logistic Regression: Complete Data Analysis Workflow

Overview

This workflow guides you through a complete multinomial logistic regression analysis from data preparation to prediction and interpretation. Based on the structure presented in the lecture and practical materials.


PHASE 1: DATA PREPARATION & EXPLORATION

1.1 Prepare Environment

Objective: Set up your R environment and load necessary libraries

# Start with a fresh project
library(here)
library(tidyverse)
library(haven)
library(gtsummary)
library(VGAM)
library(nnet)

1.2 Read & Explore Data

Objective: Understand the structure and characteristics of your data

# Read data (example: Stata format)
dat <- read_dta('your_data.dta')

# Examine data structure
glimpse(dat)
head(dat)
summary(dat)

# Get summary statistics by group
dat %>%
  tbl_summary(by = outcome_variable,
              statistic = list(all_continuous() ~ "{mean} ({sd})"))

Questions to answer: - What is the outcome variable? Does it have 3+ categories? - Are all variables in the correct format (numeric, factor)? - Are there missing values? - What is the sample size?

1.3 Data Transformation

Objective: Prepare variables for analysis

# Convert outcome to factor variable
dat <- dat %>%
  mutate(outcome_factor = factor(outcome,
                                  labels = c("Category_0", "Category_1", "Category_2")))

# Check the categories and counts
dat %>% count(outcome_factor)

# Set reference category (typically the most common or most meaningful)
# For VGAM, it uses the LAST level as reference
# For nnet, it uses the FIRST level as reference
dat <- dat %>%
  mutate(outcome_ordered = fct_relevel(outcome_factor,
                                        c("Category_2", "Category_1", "Category_0")))

# Recode other categorical variables as needed
dat <- dat %>%
  mutate(predictor_factor = factor(predictor,
                                   labels = c("No", "Yes")))

# Verify transformations
levels(dat$outcome_ordered)
table(dat$outcome_ordered, dat$predictor_factor)

Key decisions: - Which category should be the reference? (Usually the “baseline” or most common) - Do any variables need recoding or collapsing? - Are continuous variables appropriately scaled?


PHASE 2: UNIVARIATE ANALYSIS

2.1 Cross-tabulation Analysis

Objective: Examine association between outcome and each covariate

# Basic cross-tabulation
dat %>%
  group_by(outcome_factor) %>%
  count()

# Cross-tabulation with percentages
table(dat$outcome_factor, dat$predictor)

2.2 Simple Models (One Covariate at a Time)

Objective: Screen variables for inclusion in multivariable model

# Fit simple multinomial logit with one covariate
simple_model1 <- vglm(outcome_ordered ~ predictor1,
                      family = multinomial, data = dat)
summary(simple_model1)

# Extract p-values and coefficients
coef(simple_model1)
confint(simple_model1)

# Calculate relative risk ratios
exp(coef(simple_model1))

2.3 Variable Selection Criteria

Objective: Identify variables for multivariable model

Include in next step if: - p-value < 0.25 (Hosmer-Lemeshow criterion for screening) - Clinically or theoretically important - Known confounder

Exclude if: - p-value > 0.25 and not clinically important - Collinear with another variable - Sparse category (very few observations)


PHASE 3: MODEL ESTIMATION

3.1 Start with Simple Multivariable Model

Objective: Build foundational model with key variables

# Fit model with main predictor(s) of interest
model1 <- vglm(outcome_ordered ~ main_predictor,
               family = multinomial, data = dat)
summary(model1)

3.2 Add Additional Covariates

Objective: Build progressively more complex model

# Add more covariates
model2 <- vglm(outcome_ordered ~ main_predictor + covariate1 + covariate2,
               family = multinomial, data = dat)
summary(model2)

# For categorical variables, convert to factor within formula
model3 <- vglm(outcome_ordered ~ main_predictor + factor(category_var) + continuous_var,
               family = multinomial, data = dat)
summary(model3)

3.3 Check for Confounders

Objective: Identify variables that substantially change coefficients

# Compare coefficients between models
coef_model1 <- coef(model1)
coef_model2 <- coef(model2)

# Calculate percent change: (new - old) / old * 100
# If >20% change when removing a variable, it's likely a confounder
# Keep confounders in model even if not statistically significant

3.4 Consider Interactions

Objective: Test if effect of one variable depends on another

# Add interaction term (only if clinically meaningful)
model_interact <- vglm(outcome_ordered ~ predictor1 + predictor2 + 
                                        predictor1:predictor2,
                       family = multinomial, data = dat)
summary(model_interact)

PHASE 4: MODEL INFERENCES

4.1 Extract and Organize Coefficients

Objective: Obtain comprehensive estimate tables

# Get log-odds (β) and confidence intervals
b <- coef(model)
ci_b <- confint(model)
b_ci <- cbind(b, ci_b)

# Show results
b_ci

4.2 Calculate Relative Risk Ratios (RRR)

Objective: Exponentiate coefficients for interpretation

# RRR = exp(β)
rrr <- exp(coef(model))
rrr_ci <- exp(confint(model))

# Combine into comprehensive table
results <- cbind(b, b_ci, rrr, rrr_ci)
colnames(results) <- c('b', 'lower_b', 'upper_b', 'rrr', 'lower_rrr', 'upper_rrr')

results

4.3 Hypothesis Testing

Objective: Test statistical significance of coefficients

Method 1: p-values (from summary output)

summary(model)
# Look for Pr(>|z|) column
# p-value < 0.05 indicates statistical significance

Method 2: Confidence Intervals

# For log-odds: If CI does not include 0, p < 0.05
# For RRR: If CI does not include 1, p < 0.05

# For nnet::multinom (which doesn't output p-values):
coefs <- summary(model)$coefficients
ses <- summary(model)$standard.errors
z_test <- coefs / ses
p_values <- (1 - pnorm(abs(z_test), 0, 1)) * 2

4.4 Interpretation

Objective: Communicate findings

For a binary predictor (coded 0/1):

"The odds of outcome=1 (vs reference) compared to outcome=0 
 for individuals with predictor=1 vs predictor=0 is RRR times higher"

Example: RRR = 2.5
"The odds of intermediate placement vs outpatient placement 
 for adolescents WITH history of violence is 2.5 times higher 
 than for adolescents WITHOUT history of violence"

For a continuous predictor:

"A one-unit increase in [predictor] is associated with 
 a change in odds of outcome=1 by a factor of RRR"

Example: RRR = 1.15, predictor = income
"A $1,000 increase in annual income increases the odds 
 of outcome=1 by 1.15 times (15% increase)"

PHASE 5: MODEL COMPARISON & SELECTION

5.1 Likelihood Ratio Test

Objective: Compare nested models (simpler vs more complex)

# Extract log-likelihoods
ll_simple <- logLik(model_simple)
ll_full <- logLik(model_full)

# Calculate test statistic
LR <- -2 * (ll_simple - ll_full)

# Degrees of freedom = number of parameters added
df <- length(coef(model_full)) - length(coef(model_simple))

# Calculate p-value
p_value <- 1 - pchisq(LR, df)

# Decision: if p < 0.05, prefer full model

5.2 Wald Test

Objective: Test individual coefficients (from summary output)

# From summary(model):
# z-value = β / SE(β)
# p-value already calculated

# Interpretation: p < 0.05 = statistically significant

5.3 Model Comparison Criteria

Choose model based on: - Statistical significance (LR test, Wald test) - Clinical/practical significance - Model parsimony (simpler models preferred when similar fit) - Goodness of fit (next section)


PHASE 6: MODEL ASSESSMENT

6.1 Assess Overall Model Fit

Objective: Evaluate how well model fits the data

# Get model deviance
deviance(model)
logLik(model)

# Residual deviance should be approximately equal to degrees of freedom
# if model fits well

# Compare observed vs predicted frequencies
observed <- table(dat$outcome_ordered)
predicted <- colSums(fitted(model))

fit_table <- data.frame(
  Category = names(observed),
  Observed = as.numeric(observed),
  Predicted = round(predicted, 1)
)
fit_table

6.2 Check Assumptions

Objective: Verify model assumptions

Linearity of logits (for continuous variables):

# Check if continuous variables are linear in logit
# Options:
# 1. Plot log odds vs variable
# 2. Use fractional polynomials to test
# 3. Categorize continuous variable and compare

# If linearity violated, consider:
# - Recategorize the variable
# - Use polynomial terms: I(predictor^2)
# - Use splines (advanced)

Check for influential observations:

# Calculate delta beta (change in coefficients if obs removed)
# R automatically reports this as "Hat matrix diagonal"
# in more detailed diagnostics

# Observations with very high influence should be investigated:
# - Check for data entry errors
# - Verify they're from same population
# - Consider if true outliers or errors

6.3 Goodness of Fit Tests

Objective: Formal tests of model fit

# For each comparison (logit 1, logit 2, etc.):
# Can perform separate binary logistic regression goodness-of-fit
# (This is mentioned as an option in the materials)

# Basic approach: Pearson chi-square for grouped data
# If available: Hosmer-Lemeshow test

6.4 Diagnostics Checklist


PHASE 7: PREDICTION

7.1 Predict Log-Odds

Objective: Calculate predicted log-odds for new observations

# For all observations in data
log_odds <- predict(model, type = 'link')
head(log_odds)

# For specific covariate values
new_data <- data.frame(
  predictor1 = c(0, 1),
  predictor2 = c(10, 15)
)
log_odds_new <- predict(model, newdata = new_data, type = 'link')

Manual calculation:

# For categorical predictor with logit = α + β × X
alpha <- coef(model)[1]  # intercept
beta <- coef(model)[2]   # slope

# When X = 0:
log_odds_x0 <- alpha + beta * 0

# When X = 1:
log_odds_x1 <- alpha + beta * 1

7.2 Predict Probabilities

Objective: Calculate probability for each outcome category

# For all observations
probabilities <- predict(model, type = 'response')
head(probabilities)

# For new data
probabilities_new <- predict(model, newdata = new_data, type = 'response')

Manual calculation (for 3-category outcome):

# If log_odds_1 = g1(x), log_odds_2 = g2(x):

exp_g1 <- exp(g1)
exp_g2 <- exp(g2)

P_category0 <- 1 / (1 + exp_g1 + exp_g2)
P_category1 <- exp_g1 / (1 + exp_g1 + exp_g2)
P_category2 <- exp_g2 / (1 + exp_g1 + exp_g2)

# Check: P_cat0 + P_cat1 + P_cat2 should equal 1

7.3 Interpretation of Predictions

Objective: Make actionable predictions

# Example output:
# Observation 1: P(outcome=0)=0.60, P(outcome=1)=0.25, P(outcome=2)=0.15
# Most likely outcome: Category 0 (probability 60%)

# For clinical/practical decisions:
# - If probabilities close (e.g., 0.33, 0.33, 0.33) → uncertain
# - If one dominates (e.g., 0.80, 0.15, 0.05) → confident prediction

IMPORTANT NOTES

Choice Between VGAM and nnet Package

Feature VGAM::vglm nnet::multinom
Reference category Last level First level
Output format Detailed statistical output Simpler output
P-values Automatic Manual calculation
CI for coefficients confint() function Manual calculation
Learning curve Moderate Moderate

When Model Doesn’t Fit Well

  1. Insufficient sample size → Collect more data
  2. Important variables missing → Add predictors
  3. Non-linear relationships → Test polynomial terms
  4. Incorrect reference category → Change reference
  5. Outcome categories too rare → Consider combining categories
  6. Violation of IIA assumption → Check with tests or use ordinal regression if ordering exists

Final Model Selection Criteria

  1. ✅ Statistically significant predictors (p < 0.05)
  2. ✅ Clinically/theoretically meaningful variables
  3. ✅ Good overall model fit (assessed in Phase 6)
  4. ✅ Stable, interpretable coefficients
  5. ✅ No multicollinearity among predictors
  6. ✅ Reasonable predictions on new data

WORKFLOW SUMMARY

Phase Purpose Main Output
1 Prepare and explore data Clean dataset, understanding of structure
2 Screen variables List of candidates for multivariable model
3 Build model Estimated coefficients and fitted model
4 Inferences Interpretable coefficients, RRR, confidence intervals
5 Compare models Selection of best-fitting model
6 Assess fit Confirmation model is adequate
7 Predict Probabilities for new observations

COMMON QUESTIONS

Q: How many variables can I include? A: Rule of thumb: at least 10-20 observations per predictor parameter. For outcome with k categories and p predictors, you have (k-1) × p parameters.

Q: What if my outcome has ordinal categories? A: Consider ordinal logistic regression instead (proportional odds model). Use this workflow as foundation but add tests for proportional odds assumption.

Q: How do I handle missing data? A: Options: (1) Listwise deletion (simplest, loses data), (2) Multiple imputation (recommended, complex), (3) Investigate missingness mechanism.

Q: What if I have interaction terms? A: Include only if clinically meaningful. Test with LR test. Interpretation becomes more complex. Document the specific population where interaction applies.


REFERENCES

  • Hosmer Jr., D. W., & Lemeshow, S. (2013). Applied logistic regression (3rd ed.).
  • Kleinbaum, D. G. (2010). Logistic regression: A self-learning text.
  • Materials from lecture: “Logistic Regression for Data with Polychotomous Outcome”
  • Practical materials: “Multinomial Logistic Regression Practicals”