This document presents a true multivariate logistic regression analysis, which models multiple correlated binary outcomes simultaneously. This approach is statistically superior to running separate logistic regressions for each outcome because it explicitly accounts for the covariance between the dependent variables. In this demonstration, we will analyze the factors influencing a voter’s approval of a candidate’s economic policy and their foreign policy. This is a common scenario in political science where various aspects of political opinion are often interrelated.
We begin by loading the necessary R libraries. The VGAM package is specifically designed for fitting vector generalized linear models, which is essential for this type of multivariate analysis. We also include mvtnorm to facilitate the creation of correlated outcomes for our synthetic dataset.
We create a synthetic dataset that includes two binary dependent variables: approve_economic_policy and approve_foreign_policy. These variables are designed to be correlated. The independent variables are age, education_level (ordinal), and party_identification (binary). This structure allows us to simulate realistic political data for analytical purposes.
set.seed(42) # For reproducibility
n <- 1000 # Number of observations
# Generate independent variables
voter_data <- data.frame(
age = rnorm(n, mean = 45, sd = 15), # Age, continuous
education_level = sample(1:5, n, replace = TRUE), # Education, ordinal scale
party_identification = rbinom(n, 1, prob = 0.5) # Party ID, binary (0/1)
)
# Generate correlated binary outcomes
# Create a 2x2 correlation matrix for the two outcomes
sigma <- matrix(c(1, 0.4, 0.4, 1), nrow = 2) # 0.4 indicates a moderate positive correlation
# Generate correlated latent variables using a multivariate normal distribution
latent_vars <- mvtnorm::rmvnorm(n, mean = c(0, 0), sigma = sigma)
# Add effects of predictors to the latent variables
# These coefficients (0.1, -0.3, 0.5, etc.) are chosen for demonstration
latent_vars[,1] <- latent_vars[,1] + 0.1 * voter_data$age - 0.3 * voter_data$education_level + 0.5 * voter_data$party_identification # Effects on economic policy approval
latent_vars[,2] <- latent_vars[,2] + 0.05 * voter_data$age - 0.2 * voter_data$education_level + 0.4 * voter_data$party_identification # Effects on foreign policy approval
# Convert latent variables to binary outcomes based on a threshold (e.g., 0)
voter_data$approve_economic_policy <- ifelse(latent_vars[,1] > 0, 1, 0)
voter_data$approve_foreign_policy <- ifelse(latent_vars[,2] > 0, 1, 0)
# Display the first few rows of the generated data
gt(head(voter_data))
# Check the correlation of the two binary outcomes
af_correlation_outcomes <- cor(voter_data$approve_economic_policy, voter_data$approve_foreign_policy)
cat("\nCorrelation between approve_economic_policy and approve_foreign_policy:", round(af_correlation_outcomes, 3), "\n")
Correlation between approve_economic_policy and approve_foreign_policy: 0.286
The observed correlation between the two approval variables (r=0.286) confirms that they are not independent, thus validating the use of a multivariate approach that models their joint distribution.
Before proceeding with the model, we ensure that the data satisfies the necessary assumptions for multivariate logistic regression.
Multiple Binary Outcome Variables: Both approve_economic_policy and approve_foreign_policy are dichotomous (0 or 1), meeting a fundamental requirement for this model.
Independence of Observations: We assume that each voter’s responses are independent of other voters. This is a standard assumption for cross-sectional survey data.
No Perfect Multicollinearity: The independent variables should not be perfectly correlated. We assess this using the Variance Inflation Factor (VIF). High VIF values (typically above 5 or 10) can indicate problematic multicollinearity.
# Check for multicollinearity among independent variables using VIF
# We create a temporary linear model to calculate VIF
temp_model_vif <- lm(approve_economic_policy ~ age + education_level + party_identification, data = voter_data)
cat("VIF values for independent variables:\n")
VIF values for independent variables:
age education_level party_identification 1.005970 1.006877 1.005195
# Also, examine the correlation matrix of the independent variables
cat("\nCorrelation matrix of independent variables:\n")
Correlation matrix of independent variables:
age education_level party_identification age 1.00000000 -0.06294495 -0.04790365 education_level -0.06294495 1.00000000 0.05651168 party_identification -0.04790365 0.05651168 1.00000000
The VIF values are well below the common thresholds (e.g., 5 or 10), and the pairwise correlations between independent variables are not excessively high. This indicates that multicollinearity is not a significant concern for our model.
For a true multivariate logistic regression with two binary outcomes, we utilize the vglm() function from the VGAM package with the binom2.or family. This family explicitly models two binary responses using logistic links and estimates their association through an odds ratio parameter. The formula cbind(approve_economic_policy, approve_foreign_policy) ~ age + education_level + party_identification specifies the two dependent variables on the left-hand side and the predictors on the right.
# Fit the multivariate logistic model using vglm()
# The binom2.or family models two binary outcomes with logistic links and their odds ratio
multivariate_logistic <- vglm(cbind(approve_economic_policy, approve_foreign_policy) ~ age + education_level + party_identification,
family = binom2.or,
data = voter_data)
# Print the model summary
af_cat(summary(multivariate_logistic))
Call: vglm(formula = cbind(approve_economic_policy, approve_foreign_policy) ~ age + education_level + party_identification, family = binom2.or, data = voter_data) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept):1 0.70269 0.99169 0.709 0.47858 (Intercept):2 0.41878 0.50537 0.829 0.40730 (Intercept):3 1.18587 0.60638 1.956 0.05050 . age:1 0.17537 0.02804 6.255 3.96e-10 *** age:2 0.09752 0.01137 8.574 < 2e-16 *** education_level:1 -0.61604 0.22115 -2.786 0.00534 ** education_level:2 -0.54717 0.10692 -5.117 3.10e-07 *** party_identification:1 0.58524 0.56637 1.033 0.30145 party_identification:2 0.52765 0.26299 2.006 0.04482 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Names of linear predictors: logitlink(mu1), logitlink(mu2), loglink(oratio) Residual deviance: 523.7143 on 2991 degrees of freedom Log-likelihood: -261.8572 on 2991 degrees of freedom Number of Fisher scoring iterations: 9 Warning: Hauck-Donner effect detected in the following estimate(s): 'age:1', 'age:2'
Family: binom2.or Informal classes: binom2.or, binom2 Bivariate binomial regression with an odds ratio Links: logitlink(mu1), logitlink(mu2); loglink(oratio)
To present the model results in a professional, publication-ready format suitable for a Q1 political science journal, we extract the coefficients and transform them to odds ratios. We also explicitly report the estimated odds ratio parameter that captures the association between the two outcomes and the overall model’s Log-Likelihood.
# Extract the odds ratio parameter for association between outcomes
or_parameter <- exp(coef(multivariate_logistic)["(Intercept):3"])
modelsummary(
multivariate_logistic,
output = "markdown",
title = "Multivariate Logistic Regression Results (Alternative Format)",
exponentiate = TRUE,
fmt = 3,
stars = TRUE,
coef_map = c(
"(Intercept):1" = "Intercept (Economic)",
"age:1" = "Age (Economic)",
"education_level:1" = "Education (Economic)",
"party_identification:1" = "Party ID (Economic)",
"(Intercept):2" = "Intercept (Foreign)",
"age:2" = "Age (Foreign)",
"education_level:2" = "Education (Foreign)",
"party_identification:2" = "Party ID (Foreign)"
),
coef_omit = "(Intercept):3", # Exclude the OR parameter from main table
notes = list(
"Odds ratios are presented. Standard errors in parentheses.",
paste("Association odds ratio between outcomes:", round(or_parameter, 3)),
"* p < 0.05, ** p < 0.01, *** p < 0.001."
)
)
(1) | |
---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
Odds ratios are presented. Standard errors in parentheses. | |
Association odds ratio between outcomes: 3.274 | |
* p < 0.05, ** p < 0.01, *** p < 0.001. | |
Intercept (Economic) | 2.019 |
(2.002) | |
Age (Economic) | 1.192*** |
(0.033) | |
Education (Economic) | 0.540** |
(0.119) | |
Party ID (Economic) | 1.795 |
(1.017) | |
Intercept (Foreign) | 1.520 |
(0.768) | |
Age (Foreign) | 1.102*** |
(0.013) | |
Education (Foreign) | 0.579*** |
(0.062) | |
Party ID (Foreign) | 1.695* |
(0.446) | |
Num.Obs. | 1000 |
AIC | 541.7 |
BIC | 585.9 |
RMSE | 4508310.71 |
The table above presents the odds ratios for each predictor on both policy approval outcomes. Unlike individual logit models, these estimates are derived simultaneously, reflecting the joint influence of the predictors while accounting for the association between economic and foreign policy approval.
Odds Ratios: For each unit increase in a predictor, the odds of approval are multiplied by the corresponding odds ratio, holding all other variables constant. For instance, an odds ratio greater than 1 suggests an increase in the odds of approval, while a value less than 1 suggests a decrease.
Association Odds Ratio: The estimated odds ratio parameter (3.274) quantifies the association between the two outcomes. This value represents how much more (or less) likely someone is to approve of foreign policy if they approve of economic policy, beyond what is explained by the measured predictors.
To enhance interpretability, especially for continuous predictors like age, visualizing the predicted probabilities is crucial. This helps to understand the substantive impact of a variable across its range. Below, we plot the predicted probabilities of both economic and foreign policy approval as a function of age, holding education_level and party_identification at their mean values.
# Function to plot predicted probabilities for multivariate VGLM
af_plot_probabilities <- function(model, data, x_var, x_label, title) {
# Calculate reference values for other variables
mean_age <- mean(data$age, na.rm = TRUE)
mean_education_level <- round(mean(data$education_level, na.rm = TRUE))
mode_party_identification <- as.numeric(names(sort(table(data$party_identification), decreasing = TRUE))[1])
# Create a sequence for the variable to be plotted
if (x_var == "age") {
x_var_values <- seq(from = min(data$age, na.rm = TRUE),
to = max(data$age, na.rm = TRUE),
length.out = 100)
new_data <- data.frame(
age = x_var_values,
education_level = mean_education_level,
party_identification = mode_party_identification
)
} else if (x_var == "education_level") {
x_var_values <- 1:5 # Since education is 1-5 scale
new_data <- data.frame(
age = mean_age,
education_level = x_var_values,
party_identification = mode_party_identification
)
} else if (x_var == "party_identification") {
x_var_values <- c(0, 1) # Binary variable
new_data <- data.frame(
age = mean_age,
education_level = mean_education_level,
party_identification = x_var_values
)
}
# Debug: Print some information
cat("Plotting variable:", x_var, "\n")
cat("Range of x values:", range(x_var_values), "\n")
cat("Sample of new_data:\n")
print(head(new_data))
# Predict probabilities for both outcomes
tryCatch({
predicted_probs <- predict(model, newdata = new_data, type = "response")
cat("Prediction successful. Dimensions:", dim(predicted_probs), "\n")
cat("Sample predictions:\n")
print(head(predicted_probs))
}, error = function(e) {
cat("Prediction error:", e$message, "\n")
return(NULL)
})
# Convert the predictions to a long format for plotting
plot_data <- data.frame(
x_var_val = x_var_values,
approve_economic_policy = predicted_probs[, 1],
approve_foreign_policy = predicted_probs[, 2]
) %>%
tidyr::pivot_longer(
cols = starts_with("approve_"),
names_to = "policy_type",
values_to = "predicted_prob"
)
# Create the plot using ggplot2 and the viridis color palette
p <- ggplot(plot_data, aes(x = x_var_val, y = predicted_prob, color = policy_type)) +
geom_line(size = 1.2) +
scale_color_viridis_d(
begin = 0.2, end = 0.8,
labels = c("Economic Policy Approval", "Foreign Policy Approval")
) +
labs(
title = title,
x = x_label,
y = "Predicted Probability",
color = "Policy Type"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "bottom"
) +
ylim(0, 1) # Ensure y-axis is on probability scale
return(p)
}
plot_age <- af_plot_probabilities(
model = multivariate_logistic,
data = voter_data,
x_var = "age",
x_label = "Age",
title = "Predicted Probabilities of Policy Approval by Age (Logistic Model)"
)
Plotting variable: age
Range of x values: -5.576086 97.42956
Sample of new_data:
age education_level party_identification
1 -5.5760862 3 1
2 -4.5356251 3 1
3 -3.4951640 3 1
4 -2.4547029 3 1
5 -1.4142418 3 1
6 -0.3737807 3 1
Prediction successful. Dimensions: 100 4
Sample predictions:
00 01 10 11
1 0.6731399 0.1500458 0.1022234 0.07459095
2 0.6420568 0.1529886 0.1151421 0.08981253
3 0.6090386 0.1546735 0.1290233 0.10726460
4 0.5742227 0.1549974 0.1437557 0.12702420
5 0.5378189 0.1539066 0.1591684 0.14910612
6 0.5001155 0.1514045 0.1750251 0.17345480
# Additional plots for other variables
plot_education <- af_plot_probabilities(
model = multivariate_logistic,
data = voter_data,
x_var = "education_level",
x_label = "Education Level",
title = "Predicted Probabilities of Policy Approval by Education Level"
)
Plotting variable: education_level
Range of x values: 1 5
Sample of new_data:
age education_level party_identification
1 44.61263 1 1
2 44.61263 2 1
3 44.61263 3 1
4 44.61263 4 1
5 44.61263 5 1
Prediction successful. Dimensions: 5 4
Sample predictions:
00 01 10 11
1 5.624746e-06 0.0001986515 0.008573628 0.9912221
2 1.763686e-05 0.0003605298 0.014718259 0.9849036
3 5.453534e-05 0.0006454426 0.025143840 0.9741562
4 1.648830e-04 0.0011304054 0.042601890 0.9561028
5 4.815829e-04 0.0019140960 0.071201612 0.9264027
plot_party <- af_plot_probabilities(
model = multivariate_logistic,
data = voter_data,
x_var = "party_identification",
x_label = "Party Identification",
title = "Predicted Probabilities of Policy Approval by Party ID"
)
Plotting variable: party_identification
Range of x values: 0 1
Sample of new_data:
age education_level party_identification
1 44.61263 3 0
2 44.61263 3 1
Prediction successful. Dimensions: 2 4
Sample predictions:
00 01 10 11
1 1.571947e-04 0.0010988592 0.04181742 0.9569265
2 5.453534e-05 0.0006454426 0.02514384 0.9741562