1 Introduction

This document presents a true multivariate logistic regression analysis, which models multiple correlated binary outcomes simultaneously. This approach is statistically superior to running separate logistic regressions for each outcome because it explicitly accounts for the covariance between the dependent variables. In this demonstration, we will analyze the factors influencing a voter’s approval of a candidate’s economic policy and their foreign policy. This is a common scenario in political science where various aspects of political opinion are often interrelated.

2 Setup and Data Creation

We begin by loading the necessary R libraries. The VGAM package is specifically designed for fitting vector generalized linear models, which is essential for this type of multivariate analysis. We also include mvtnorm to facilitate the creation of correlated outcomes for our synthetic dataset.

We create a synthetic dataset that includes two binary dependent variables: approve_economic_policy and approve_foreign_policy. These variables are designed to be correlated. The independent variables are age, education_level (ordinal), and party_identification (binary). This structure allows us to simulate realistic political data for analytical purposes.

set.seed(42) # For reproducibility
n <- 1000   # Number of observations

# Generate independent variables
voter_data <- data.frame(
  age = rnorm(n, mean = 45, sd = 15), # Age, continuous
  education_level = sample(1:5, n, replace = TRUE), # Education, ordinal scale
  party_identification = rbinom(n, 1, prob = 0.5) # Party ID, binary (0/1)
)

# Generate correlated binary outcomes
# Create a 2x2 correlation matrix for the two outcomes
sigma <- matrix(c(1, 0.4, 0.4, 1), nrow = 2) # 0.4 indicates a moderate positive correlation

# Generate correlated latent variables using a multivariate normal distribution
latent_vars <- mvtnorm::rmvnorm(n, mean = c(0, 0), sigma = sigma)

# Add effects of predictors to the latent variables
# These coefficients (0.1, -0.3, 0.5, etc.) are chosen for demonstration
latent_vars[,1] <- latent_vars[,1] + 0.1 * voter_data$age - 0.3 * voter_data$education_level + 0.5 * voter_data$party_identification # Effects on economic policy approval
latent_vars[,2] <- latent_vars[,2] + 0.05 * voter_data$age - 0.2 * voter_data$education_level + 0.4 * voter_data$party_identification # Effects on foreign policy approval

# Convert latent variables to binary outcomes based on a threshold (e.g., 0)
voter_data$approve_economic_policy <- ifelse(latent_vars[,1] > 0, 1, 0)
voter_data$approve_foreign_policy <- ifelse(latent_vars[,2] > 0, 1, 0)

# Display the first few rows of the generated data
gt(head(voter_data))

# Check the correlation of the two binary outcomes
af_correlation_outcomes <- cor(voter_data$approve_economic_policy, voter_data$approve_foreign_policy)
cat("\nCorrelation between approve_economic_policy and approve_foreign_policy:", round(af_correlation_outcomes, 3), "\n")

Correlation between approve_economic_policy and approve_foreign_policy: 0.286

The observed correlation between the two approval variables (r=0.286) confirms that they are not independent, thus validating the use of a multivariate approach that models their joint distribution.

3 Pre-requisite Tests

Before proceeding with the model, we ensure that the data satisfies the necessary assumptions for multivariate logistic regression.

Multiple Binary Outcome Variables: Both approve_economic_policy and approve_foreign_policy are dichotomous (0 or 1), meeting a fundamental requirement for this model.

Independence of Observations: We assume that each voter’s responses are independent of other voters. This is a standard assumption for cross-sectional survey data.

No Perfect Multicollinearity: The independent variables should not be perfectly correlated. We assess this using the Variance Inflation Factor (VIF). High VIF values (typically above 5 or 10) can indicate problematic multicollinearity.

# Check for multicollinearity among independent variables using VIF
# We create a temporary linear model to calculate VIF
temp_model_vif <- lm(approve_economic_policy ~ age + education_level + party_identification, data = voter_data)
cat("VIF values for independent variables:\n")

VIF values for independent variables:

af_cat(car::vif(temp_model_vif))
                 age      education_level party_identification 
            1.005970             1.006877             1.005195 

# Also, examine the correlation matrix of the independent variables
cat("\nCorrelation matrix of independent variables:\n")

Correlation matrix of independent variables:

af_cat(cor(voter_data[, c("age", "education_level", "party_identification")]))
                             age education_level party_identification
age                   1.00000000     -0.06294495          -0.04790365
education_level      -0.06294495      1.00000000           0.05651168
party_identification -0.04790365      0.05651168           1.00000000

The VIF values are well below the common thresholds (e.g., 5 or 10), and the pairwise correlations between independent variables are not excessively high. This indicates that multicollinearity is not a significant concern for our model.

4 Model Estimation and Results

For a true multivariate logistic regression with two binary outcomes, we utilize the vglm() function from the VGAM package with the binom2.or family. This family explicitly models two binary responses using logistic links and estimates their association through an odds ratio parameter. The formula cbind(approve_economic_policy, approve_foreign_policy) ~ age + education_level + party_identification specifies the two dependent variables on the left-hand side and the predictors on the right.

# Fit the multivariate logistic model using vglm()
# The binom2.or family models two binary outcomes with logistic links and their odds ratio
multivariate_logistic <- vglm(cbind(approve_economic_policy, approve_foreign_policy) ~ age + education_level + party_identification,
                              family = binom2.or,
                              data = voter_data)

# Print the model summary
af_cat(summary(multivariate_logistic))
Call:
vglm(formula = cbind(approve_economic_policy, approve_foreign_policy) ~ 
    age + education_level + party_identification, family = binom2.or, 
    data = voter_data)

Coefficients: 
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept):1           0.70269    0.99169   0.709  0.47858    
(Intercept):2           0.41878    0.50537   0.829  0.40730    
(Intercept):3           1.18587    0.60638   1.956  0.05050 .  
age:1                   0.17537    0.02804   6.255 3.96e-10 ***
age:2                   0.09752    0.01137   8.574  < 2e-16 ***
education_level:1      -0.61604    0.22115  -2.786  0.00534 ** 
education_level:2      -0.54717    0.10692  -5.117 3.10e-07 ***
party_identification:1  0.58524    0.56637   1.033  0.30145    
party_identification:2  0.52765    0.26299   2.006  0.04482 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(mu1), logitlink(mu2), loglink(oratio)

Residual deviance: 523.7143 on 2991 degrees of freedom

Log-likelihood: -261.8572 on 2991 degrees of freedom

Number of Fisher scoring iterations: 9 

Warning: Hauck-Donner effect detected in the following estimate(s):
'age:1', 'age:2'

# Check the model family to confirm it's logistic
af_cat(multivariate_logistic@family)
Family:  binom2.or 
Informal classes: binom2.or, binom2 

Bivariate binomial regression with an odds ratio
Links:    logitlink(mu1), logitlink(mu2); loglink(oratio)

5 Regression Table (Odds Ratios)

To present the model results in a professional, publication-ready format suitable for a Q1 political science journal, we extract the coefficients and transform them to odds ratios. We also explicitly report the estimated odds ratio parameter that captures the association between the two outcomes and the overall model’s Log-Likelihood.

# Extract the odds ratio parameter for association between outcomes
or_parameter <- exp(coef(multivariate_logistic)["(Intercept):3"])

modelsummary(
  multivariate_logistic,
  output = "markdown",
  title = "Multivariate Logistic Regression Results (Alternative Format)",
  exponentiate = TRUE,
  fmt = 3,
  stars = TRUE,
  coef_map = c(
    "(Intercept):1" = "Intercept (Economic)",
    "age:1" = "Age (Economic)",
    "education_level:1" = "Education (Economic)",
    "party_identification:1" = "Party ID (Economic)",
    "(Intercept):2" = "Intercept (Foreign)",
    "age:2" = "Age (Foreign)",
    "education_level:2" = "Education (Foreign)",
    "party_identification:2" = "Party ID (Foreign)"
  ),
  coef_omit = "(Intercept):3",  # Exclude the OR parameter from main table
  notes = list(
    "Odds ratios are presented. Standard errors in parentheses.",
    paste("Association odds ratio between outcomes:", round(or_parameter, 3)),
    "* p < 0.05, ** p < 0.01, *** p < 0.001."
  )
)
tinytable_hpjcraejin0jcq9em034
Multivariate Logistic Regression Results (Alternative Format)
(1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Odds ratios are presented. Standard errors in parentheses.
Association odds ratio between outcomes: 3.274
* p < 0.05, ** p < 0.01, *** p < 0.001.
Intercept (Economic) 2.019
(2.002)
Age (Economic) 1.192***
(0.033)
Education (Economic) 0.540**
(0.119)
Party ID (Economic) 1.795
(1.017)
Intercept (Foreign) 1.520
(0.768)
Age (Foreign) 1.102***
(0.013)
Education (Foreign) 0.579***
(0.062)
Party ID (Foreign) 1.695*
(0.446)
Num.Obs. 1000
AIC 541.7
BIC 585.9
RMSE 4508310.71

5.1 Interpretation of Results

The table above presents the odds ratios for each predictor on both policy approval outcomes. Unlike individual logit models, these estimates are derived simultaneously, reflecting the joint influence of the predictors while accounting for the association between economic and foreign policy approval.

Odds Ratios: For each unit increase in a predictor, the odds of approval are multiplied by the corresponding odds ratio, holding all other variables constant. For instance, an odds ratio greater than 1 suggests an increase in the odds of approval, while a value less than 1 suggests a decrease.

Association Odds Ratio: The estimated odds ratio parameter (3.274) quantifies the association between the two outcomes. This value represents how much more (or less) likely someone is to approve of foreign policy if they approve of economic policy, beyond what is explained by the measured predictors.

5.1.1 Visualization of Effects

To enhance interpretability, especially for continuous predictors like age, visualizing the predicted probabilities is crucial. This helps to understand the substantive impact of a variable across its range. Below, we plot the predicted probabilities of both economic and foreign policy approval as a function of age, holding education_level and party_identification at their mean values.


# Function to plot predicted probabilities for multivariate VGLM
af_plot_probabilities <- function(model, data, x_var, x_label, title) {
  # Calculate reference values for other variables
  mean_age <- mean(data$age, na.rm = TRUE)
  mean_education_level <- round(mean(data$education_level, na.rm = TRUE))
  mode_party_identification <- as.numeric(names(sort(table(data$party_identification), decreasing = TRUE))[1])
  
  # Create a sequence for the variable to be plotted
  if (x_var == "age") {
    x_var_values <- seq(from = min(data$age, na.rm = TRUE), 
                       to = max(data$age, na.rm = TRUE), 
                       length.out = 100)
    new_data <- data.frame(
      age = x_var_values,
      education_level = mean_education_level,
      party_identification = mode_party_identification
    )
  } else if (x_var == "education_level") {
    x_var_values <- 1:5  # Since education is 1-5 scale
    new_data <- data.frame(
      age = mean_age,
      education_level = x_var_values,
      party_identification = mode_party_identification
    )
  } else if (x_var == "party_identification") {
    x_var_values <- c(0, 1)  # Binary variable
    new_data <- data.frame(
      age = mean_age,
      education_level = mean_education_level,
      party_identification = x_var_values
    )
  }
  
  # Debug: Print some information
  cat("Plotting variable:", x_var, "\n")
  cat("Range of x values:", range(x_var_values), "\n")
  cat("Sample of new_data:\n")
  print(head(new_data))

  # Predict probabilities for both outcomes
  tryCatch({
    predicted_probs <- predict(model, newdata = new_data, type = "response")
    cat("Prediction successful. Dimensions:", dim(predicted_probs), "\n")
    cat("Sample predictions:\n")
    print(head(predicted_probs))
  }, error = function(e) {
    cat("Prediction error:", e$message, "\n")
    return(NULL)
  })

  # Convert the predictions to a long format for plotting
  plot_data <- data.frame(
    x_var_val = x_var_values,
    approve_economic_policy = predicted_probs[, 1],
    approve_foreign_policy = predicted_probs[, 2]
  ) %>%
    tidyr::pivot_longer(
      cols = starts_with("approve_"),
      names_to = "policy_type",
      values_to = "predicted_prob"
    )

  # Create the plot using ggplot2 and the viridis color palette
  p <- ggplot(plot_data, aes(x = x_var_val, y = predicted_prob, color = policy_type)) +
    geom_line(size = 1.2) +
    scale_color_viridis_d(
      begin = 0.2, end = 0.8,
      labels = c("Economic Policy Approval", "Foreign Policy Approval")
    ) +
    labs(
      title = title,
      x = x_label,
      y = "Predicted Probability",
      color = "Policy Type"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      legend.position = "bottom"
    ) +
    ylim(0, 1)  # Ensure y-axis is on probability scale

  return(p)
}

plot_age <- af_plot_probabilities(
  model = multivariate_logistic,
  data = voter_data,
  x_var = "age",
  x_label = "Age",
  title = "Predicted Probabilities of Policy Approval by Age (Logistic Model)"
)
Plotting variable: age 
Range of x values: -5.576086 97.42956 
Sample of new_data:
         age education_level party_identification
1 -5.5760862               3                    1
2 -4.5356251               3                    1
3 -3.4951640               3                    1
4 -2.4547029               3                    1
5 -1.4142418               3                    1
6 -0.3737807               3                    1
Prediction successful. Dimensions: 100 4 
Sample predictions:
         00        01        10         11
1 0.6731399 0.1500458 0.1022234 0.07459095
2 0.6420568 0.1529886 0.1151421 0.08981253
3 0.6090386 0.1546735 0.1290233 0.10726460
4 0.5742227 0.1549974 0.1437557 0.12702420
5 0.5378189 0.1539066 0.1591684 0.14910612
6 0.5001155 0.1514045 0.1750251 0.17345480

print(plot_age)


# Additional plots for other variables
plot_education <- af_plot_probabilities(
  model = multivariate_logistic,
  data = voter_data,
  x_var = "education_level",
  x_label = "Education Level",
  title = "Predicted Probabilities of Policy Approval by Education Level"
)
Plotting variable: education_level 
Range of x values: 1 5 
Sample of new_data:
       age education_level party_identification
1 44.61263               1                    1
2 44.61263               2                    1
3 44.61263               3                    1
4 44.61263               4                    1
5 44.61263               5                    1
Prediction successful. Dimensions: 5 4 
Sample predictions:
            00           01          10        11
1 5.624746e-06 0.0001986515 0.008573628 0.9912221
2 1.763686e-05 0.0003605298 0.014718259 0.9849036
3 5.453534e-05 0.0006454426 0.025143840 0.9741562
4 1.648830e-04 0.0011304054 0.042601890 0.9561028
5 4.815829e-04 0.0019140960 0.071201612 0.9264027

print(plot_education)


plot_party <- af_plot_probabilities(
  model = multivariate_logistic,
  data = voter_data,
  x_var = "party_identification",
  x_label = "Party Identification",
  title = "Predicted Probabilities of Policy Approval by Party ID"
)
Plotting variable: party_identification 
Range of x values: 0 1 
Sample of new_data:
       age education_level party_identification
1 44.61263               3                    0
2 44.61263               3                    1
Prediction successful. Dimensions: 2 4 
Sample predictions:
            00           01         10        11
1 1.571947e-04 0.0010988592 0.04181742 0.9569265
2 5.453534e-05 0.0006454426 0.02514384 0.9741562

print(plot_party)