“Theoretical” … kinda just conceptual

Part 1: Multiple Choice

Question 1:

In the context of binary response models, the latent variable approach assumes that the observed binary outcome is based on an unobserved continuous variable. Which of the following is true regarding this latent variable?

  1. It directly represents the probability of the binary outcome.
  2. It is modeled with the assumption that the error term follows a uniform distribution.
  3. The binary outcome is 1 if the latent variable exceeds a threshold (usually 0).
  4. It assumes the error terms are heteroskedastic, leading to unbiased but inefficient estimates.

Answer: (C)

Question 2:

What is the primary difference between the logit and probit models in terms of their functional form?

  1. Logit models assume a logistic distribution for the error terms, while probit models assume a normal distribution. => Correct: the logistic distribution has heavier tails than the normal.
  2. Logit models produce unbiased estimates, while probit models do not. => Not correct: the functional form does not directly determine bias.
  3. Probit models assume symmetric error terms, while logit models do not.
  4. Logit models are more computationally efficient because the cumulative distribution function has a closed form, unlike probit models.

Answer: (A) The assumed error distribution determines the functional form (link).

Question 3:

Which of the following is a key advantage of using a Linear Probability Model (LPM) over a logit or probit model for binary outcome estimation?

  1. The LPM can easily account for non-linearity in the relationship between covariates and the dependent variable. => Not correct: the LPM has difficulty capturing non-linear relationships.
  2. The LPM is simpler to estimate and interpret since it uses ordinary least squares (OLS) rather than iterative maximum likelihood estimation.
  3. The LPM guarantees that predicted probabilities will always lie within the [0,1] range. => Not correct: LPM predictions can fall below 0 or above 1.
  4. The LPM automatically corrects for heteroskedasticity in the error terms. => Not correct.

Answer: (B) The LPM is easy to estimate and interpret via OLS.

Question 4:

In the complementary log-log (cloglog) model, what key characteristic distinguishes its cumulative distribution function (CDF) from that of the logit and probit models?

  1. It is symmetrical around 0.5, like the logit and probit CDFs.
  2. It compresses differences near the tails, producing an S-shaped curve similar to the logit.
  3. It is asymmetrical and skews toward extreme probabilities, making it suitable for rare events.
  4. It models the log odds directly without needing an inverse link function.

Answer: (C) The complementary log-log (cloglog) model uses an asymmetric link function that is skewed in the tails. This structure comes from survival analysis, where we model the hazard of an event.

Question 5:

The cloglog link is preferred for discrete-time models derived from a continuous-time PH process because it:

  1. Maximizes ROC-AUC => Not correct: the cloglog link has nothing to do with ROC-AUC
  2. Preserves multiplicative effects on the hazard after discretization
  3. Forces a constant baseline
  4. Eliminates censoring

Answer: (B) The cloglog link arises naturally when a continuous-time proportional-hazards process is observed at discrete intervals, so multiplicative covariate effects on the hazard are preserved after discretization (assuming covariates are constant within intervals).

Question 6:

When using simulated predictions to interpret logit model results, which of the following best explains the purpose of drawing from the multivariate normal distribution of estimated coefficients?

  1. To estimate the uncertainty around the model’s fitted values.
  2. To ensure that the simulated probabilities are always between 0 and 1.
  3. To evaluate the goodness-of-fit of the logit model using the Wald test.
  4. To improve computational speed when calculating marginal effects.

Answer: (A) Drawing the parameters jointly (multivariate) captures: (1) individual parameter uncertainty (how each coefficient varies; confidence intervals), (2) correlations between parameters, and (3) the ability to make uncertainty statements about combinations of coefficients, and hence about fitted values.
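To illustrate the simulation logic, a minimal sketch (it assumes the fitted logit m1 and the debate data from the Applied section below; the scenario vector x_typ is a hypothetical example):

# Draw coefficient vectors from their estimated sampling distribution,
# then push each draw through the inverse logit link for one scenario
library(mvtnorm)
sims  <- rmvnorm(1000, mean = coef(m1), sigma = vcov(m1))
x_typ <- c(1, 50, 1, mean(debate$spend_millions, na.rm = TRUE))  # intercept, debate_perf, party_major, spend_millions
p_sim <- plogis(sims %*% x_typ)
quantile(p_sim, c(0.025, 0.5, 0.975))  # simulated predicted probability and 95% interval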

Part 2: True or False

  1. Consider the following non-linear model \(P(Y) = \Phi(\beta_1x_1+\beta_2x_2+\beta_{12}x_1x_2+ \mathbf{X}\beta)\). Ai and Norton indicate that the interaction effect could be nonzero, even if \(\beta_{12}=0\).

    • True: The marginal interaction effect includes both the direct effect of the interaction term (\(\beta_{12}\)) and an indirect effect operating through \(\beta_1\) and \(\beta_2\) (see the cross-partial derivative after this list).
  2. Considering the same non-linear model, Ai and Norton indicate that the significance of the interaction effect can be tested with a simple t-test on the coefficient of the interaction term \(\beta_{12}\).

    • False: Per Ai and Norton, the interaction effect is not the same as the interaction coefficient, so a t-test on \(\beta_{12}\) does not test the interaction effect.
  3. Considering the same non-linear model, Ai and Norton show that the sign of \(\beta_{12}\) indicates the sign of the interaction effect.

    • False: Same rationale as above; because the interaction effect is not the same as the interaction coefficient, its sign can differ from the sign of \(\beta_{12}\).
  4. In logit models, interaction effects are inherently imposed through the link function, even in the absence of an interaction term.

    • True: After taking the partial derivative, the link function remains, requiring use of the chain rule (p. 69).
  5. The Akaike information criterion selects a model that minimizes the negative likelihood penalized by the number of parameters.

    • True (see the formulas after this list).
  6. The Bayesian information criterion selects a model that minimizes the negative likelihood penalized by the number of parameters and the sample size.

    • True: The BIC penalty depends on both the number of parameters and the sample size, unlike the AIC penalty (p. 87).
  7. Right-censored observations should be dropped because the failure time is unknown.

    • False: Censored observations still tell us that the spell lasted at least as long as the observed time; they contribute to the likelihood through the survivor function rather than being discarded.
  8. A Weibull with shape \(p>1\) implies an increasing hazard over time.

    • True: Conversely, shape \(p<1\) means a decreasing hazard (i.e., negative duration dependence) (Section 4.3, The Weibull Model).
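For reference, the cross-partial derivative behind items 1–3, for \(P(Y) = \Phi(u)\) with \(u = \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \mathbf{X}\beta\), is

\[ \frac{\partial^2 \Phi(u)}{\partial x_1\, \partial x_2} = \beta_{12}\,\phi(u) + (\beta_1 + \beta_{12} x_2)(\beta_2 + \beta_{12} x_1)\,\phi'(u), \]

which is generally nonzero even when \(\beta_{12} = 0\), and whose sign and significance need not match those of \(\beta_{12}\). The information criteria in items 5–6 are

\[ \text{AIC} = -2\ln \hat{L} + 2k, \qquad \text{BIC} = -2\ln \hat{L} + k \ln(n), \]

where \(k\) is the number of parameters and \(n\) the sample size.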

Applied

Show all code but only relevant output.

Part 1

## load packages
library(mvtnorm)
library(ggplot2)
library(reshape2)
library(dplyr)
library(magrittr)

For this question you’ll be using the data in debate_election_data.csv. This dataset contains information about political candidates who participated in televised debates before an election. The dataset includes details on their debate performance, party affiliation, campaign spending, and whether they won or lost the election.

  • election_won: binary indicator of whether the candidate won (1) or lost (0) the election. It’s the dependent variable in the models.
  • debate_perf: continuous debate performance score, as evaluated by undecided voters. Scores range from 0 to 100, with higher scores indicating better performances. Pay attention to this variable and its relationship with the election outcome, especially given what we saw with Biden.
  • party_major: binary indicator of whether the candidate belongs to a major political party (1) or a minor party (0). Major party affiliation is expected to positively influence election outcomes.
  • spend_millions: continuous measure of total spending on the candidate’s campaign, in millions of dollars. Campaign spending can significantly influence a candidate’s chances of winning the election, with higher spending typically leading to better outcomes.

In answering the question, you will be using four models: logit, probit, complementary log-log (cloglog), and a linear probability model (LPM) (in the latter make sure to check for heteroskedasticity and correct for it if relevant).

  • Write out the stochastic and systematic components for each approach: logit, probit, cloglog, and LPM.

Let’s start with the binomial model with a logit link (p. 14 in the lecture slides):

\[ Y_i \sim \text{Bernoulli}(\pi_i) \]

\[ \pi_i = \frac{1}{1 + \exp(-X_i \beta)} \] Where \(\pi_i = \mathbb{P}(Y_i = 1)\) and \(X_i \beta\) is the linear predictor.

Let’s move on to the probit model:

\[ \pi_i = \Phi(X_i \beta) \]

Where \(\Phi\) is the cumulative distribution function of the standard normal distribution.
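The cloglog model keeps the same stochastic component but uses the asymmetric inverse link:

\[ Y_i \sim \text{Bernoulli}(\pi_i), \qquad \pi_i = 1 - \exp\{-\exp(X_i \beta)\} \]

Finally, the LPM models the probability directly as a linear function of the covariates:

\[ Y_i = X_i \beta + \varepsilon_i, \qquad \pi_i = \mathbb{P}(Y_i = 1) = X_i \beta \]

which is estimated by OLS and is heteroskedastic by construction, since \(\text{Var}(Y_i \mid X_i) = \pi_i(1 - \pi_i)\).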

  • Implement the models in R and present the model results in a clean table. Interpret the results.
#1. Load data
library(modelsummary)
setwd("C:/Users/Chaehwan/OneDrive - Michigan State University/25 Fall/PLS900/Assignment/week7")
rfseed <- 1931
set.seed(rfseed)
debate = read.csv('debate_election_data.csv')

##1.1 Summary of Data
vars <- c("election_won", "debate_perf", "party_major", "spend_millions")
summary(debate[vars])
##   election_won    debate_perf      party_major     spend_millions    
##  Min.   :0.000   Min.   :  0.00   Min.   :0.0000   Min.   : 0.05028  
##  1st Qu.:0.000   1st Qu.: 30.01   1st Qu.:0.0000   1st Qu.: 4.43004  
##  Median :0.000   Median : 39.88   Median :1.0000   Median : 5.66651  
##  Mean   :0.379   Mean   : 41.46   Mean   :0.5323   Mean   : 5.63427  
##  3rd Qu.:1.000   3rd Qu.: 51.85   3rd Qu.:1.0000   3rd Qu.: 6.99534  
##  Max.   :1.000   Max.   :100.00   Max.   :1.0000   Max.   :10.53042
##2. Set up our model formula
dv = 'election_won'
ivs = c( 'debate_perf','party_major','spend_millions' )
sysComp = paste(ivs, collapse='+')
form = paste0(dv, '~', sysComp)
modData = na.omit(debate[,c(dv, ivs)])

###3. Fit the binary models (logit, probit, cloglog) and the LPM

m1 = glm(
  election_won ~
    debate_perf + party_major + spend_millions,
  data=debate,
  family=binomial(link='logit')
)

m2 = glm(
  election_won ~
    debate_perf + party_major + spend_millions,
  data=debate,
  family=binomial(link='probit')
)

m3 = glm(
  election_won ~
    debate_perf + party_major + spend_millions,
  data=debate,
  family=binomial(link='cloglog')
)

m4 = glm(
  election_won ~
    debate_perf + party_major + spend_millions,
  data=debate  # default gaussian family, i.e., OLS for the LPM
)

modelsummary(
  list('Logit' = m1, 'Probit' = m2, 'Cloglog' = m3, 'LPM' = m4),
  stars=c('*'=0.1, '**'=0.05),
  coef_map = c(
    'debate_perf' = 'debate performance',
    'party_major' = 'major party',
    'spend_millions' = 'total spending'
    ),
  gof_omit = 'F|RMSE|Log.Lik.'
  )
|                     | Logit    | Probit   | Cloglog  | LPM      |
|---------------------|----------|----------|----------|----------|
| debate performance  | 0.064*   | 0.037*   | 0.038*   | 0.013*   |
|                     | (0.010)  | (0.006)  | (0.006)  | (0.002)  |
| major party         | 0.364**  | 0.236**  | 0.454*   | 0.070**  |
|                     | (0.297)  | (0.176)  | (0.220)  | (0.056)  |
| total spending      | -0.066** | -0.043** | -0.046** | -0.014** |
|                     | (0.077)  | (0.046)  | (0.057)  | (0.014)  |
| Num.Obs.            | 248      | 248      | 248      | 248      |
| R2                  |          |          |          | 0.203    |
| AIC                 | 282.7    | 283.3    | 289.8    | 298.8    |
| BIC                 | 296.7    | 297.4    | 303.8    | 316.4    |

** p < 0.95, * p < 0.1
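Because the LPM’s error variance is \(\pi_i(1-\pi_i)\) by construction, it is worth testing for heteroskedasticity and using robust standard errors before interpreting the LPM column. A minimal sketch, assuming the lmtest and sandwich packages are installed (the refit m4_lm is introduced here only for illustration):

# Breusch-Pagan test and HC3-robust standard errors for the LPM
library(lmtest)
library(sandwich)

m4_lm <- lm(election_won ~ debate_perf + party_major + spend_millions,
            data = debate)
bptest(m4_lm)                                        # test for heteroskedasticity
coeftest(m4_lm, vcov = vcovHC(m4_lm, type = "HC3"))  # heteroskedasticity-robust SEs

The same HC3 variance is what the plot_predictions() call for the LPM below requests via vcov = "HC3".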
  • Conduct a substantive effects analysis using a typical case approach to assess the effect of debate performance on winning the election (if correcting for heteroskedasticity in the LPM be sure to account for that in plot_predictions or your manual approach). Discuss how this effect differs across the logit, probit, cloglog, and LPM models. Be sure to use a package like patchwork to present the results in a clear and organized manner.
library(modelsummary)
library(marginaleffects)
library(patchwork)     
library(dplyr)
library(ggplot2)

debate_vals <- with(debate,
                    quantile(debate_perf, probs = c(0.1, 0.9), na.rm = TRUE))

# Plot predicted probabilities

p_logit <- plot_predictions(
  m1,
  condition = list(
    debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
    party_major = mean(debate$party_major, na.rm = TRUE),
    spend_millions = mean(debate$spend_millions, na.rm = TRUE)
  )) +
  labs(
    x = "Debate Performance",
    y = "Pr(Election Won)",
    title = "Logit Model"  ) 

p_probit <- plot_predictions(
  m2,
  condition = list(
    debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
    party_major = mean(debate$party_major, na.rm = TRUE),
    spend_millions = mean(debate$spend_millions, na.rm = TRUE)
  ) ) +
  labs(
    x = "Debate Performance",
    y = "Pr(Election Won)",
    title = "Probit Model"  ) 

p_cloglog <- plot_predictions(
  m3,
  condition = list(
    debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
    party_major = mean(debate$party_major, na.rm = TRUE),
    spend_millions = mean(debate$spend_millions, na.rm = TRUE)
  )) +
  labs(
    x = "Debate Performance",
    y = "Pr(Election Won)",
    title = "Cloglog Model"  )
 
p_lpm <- plot_predictions(
  m4,
  vcov = "HC3",
  condition = list(
    debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
    party_major = mean(debate$party_major, na.rm = TRUE),
    spend_millions = mean(debate$spend_millions, na.rm = TRUE)
  )) +
  labs(
    x = "Debate Performance",
    y = "Pr(Election Won)",
    title = "LPM"  ) 

# Combine plots using patchwork
(p_logit | p_probit) / (p_cloglog | p_lpm) & 
  theme(strip.text = element_blank(), legend.position = "none")

Although the logit and probit models assume different link functions between the linear predictor and the predicted probability, their predictions differ very little. In the cloglog model, the predicted probability of winning stays lower across most of the 10th–90th percentile range of debate performance; its asymmetric link makes it most useful when the event (winning) is rare. The LPM produces a straight line, whereas the other models produce non-linear (S-shaped) curves.
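Because the coefficients of the four models live on different scales, average marginal effects put them on a common probability scale. A short sketch using marginaleffects (loaded above), with HC3 standard errors for the LPM:

# Average marginal effect of debate performance on Pr(election_won = 1)
avg_slopes(m1, variables = "debate_perf")                # logit
avg_slopes(m2, variables = "debate_perf")                # probit
avg_slopes(m3, variables = "debate_perf")                # cloglog
avg_slopes(m4, variables = "debate_perf", vcov = "HC3")  # LPM with robust SEs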

  • Perform a model performance analysis using either cross-validation or information criteria (AIC/BIC) to evaluate model fit. What does this analysis tell you about the appropriateness of each model for the data?
AIC_BIC <- data.frame(
  Model = c("Logit", "Probit", "Cloglog", "LPM"),
  AIC = c(AIC(m1), AIC(m2), AIC(m3), AIC(m4)),
  BIC = c(BIC(m1), BIC(m2), BIC(m3), BIC(m4))
)

AIC_BIC
library(caret)
## Loading required package: lattice
set.seed(1931)
train_control <- trainControl(method = "cv", number = 10)

cv_logit <- train(
  factor(election_won) ~ debate_perf + party_major + spend_millions,
  data = debate,
  method = "glm",
  family = binomial(link = "logit"),
  trControl = train_control
)

cv_probit <- train(
  factor(election_won) ~ debate_perf + party_major + spend_millions,
  data = debate,
  method = "glm",
  family = binomial(link = "probit"),
  trControl = train_control
)

cv_cloglog <- train(
  factor(election_won) ~ debate_perf + party_major + spend_millions,
  data = debate,
  method = "glm",
  family = binomial(link = "cloglog"),
  trControl = train_control
)
cv_lpm <- train(
  election_won ~ debate_perf + party_major + spend_millions,
  data = debate,
  method = "lm",
  trControl = train_control
)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# Compare results: caret reports Accuracy/Kappa for the classification models
# (factor outcome) but RMSE/R-squared for the LPM (numeric outcome), so the
# metrics are not directly comparable across all four models.
cv_class <- data.frame(
  Model = c("Logit", "Probit", "Cloglog"),
  Accuracy = c(cv_logit$results$Accuracy,
               cv_probit$results$Accuracy,
               cv_cloglog$results$Accuracy),
  Kappa = c(cv_logit$results$Kappa,
            cv_probit$results$Kappa,
            cv_cloglog$results$Kappa)
)

cv_class
cv_lpm$results[, c("RMSE", "Rsquared")]

ANSWER: Lower AIC and BIC values indicate better model fit. The logit model has the lowest AIC (282.65) and BIC (296.71), which indicates that, among the four models, the logit model provides the best overall fit to the data.

  • Based on your analyses, which model is most appropriate for analyzing the effect of debate performance on election outcomes? Discuss the strengths and weaknesses of each model in the context of this data and research question (pay particular attention to assumptions about the type of relationship between debate performance and election outcomes).

ANSWER: Holding the other predictors constant at typical values, all models show that higher debate performance increases the probability of winning. The models differ in the shape of the relationship (logit and probit: symmetric S-shaped curves; cloglog: asymmetric curve; LPM: straight line).

In this dataset, election victories are not rare (mean = 0.379), and the relationship between debate performance and winning appears smooth, so the cloglog model is not the natural choice. Based on AIC and BIC, the LPM shows the poorest fit. The probit model yields nearly identical results to the logit model, but the logit model captures this relationship most accurately and efficiently according to AIC and BIC.

vars <- c("election_won", "debate_perf", "party_major", "spend_millions")
summary(debate[vars])
##   election_won    debate_perf      party_major     spend_millions    
##  Min.   :0.000   Min.   :  0.00   Min.   :0.0000   Min.   : 0.05028  
##  1st Qu.:0.000   1st Qu.: 30.01   1st Qu.:0.0000   1st Qu.: 4.43004  
##  Median :0.000   Median : 39.88   Median :1.0000   Median : 5.66651  
##  Mean   :0.379   Mean   : 41.46   Mean   :0.5323   Mean   : 5.63427  
##  3rd Qu.:1.000   3rd Qu.: 51.85   3rd Qu.:1.0000   3rd Qu.: 6.99534  
##  Max.   :1.000   Max.   :100.00   Max.   :1.0000   Max.   :10.53042

Part 2

In class, we’ve been working with data from Saleyhan et al. (2011). The data file is called saleyhan_etal_2011.rda and is included in the .zip you downloaded for the assignment. We’ve been working with only a subset of the model that they employed to study whether or not rebels receive foreign support. Their full model specification includes:

supp1 ~
    as.factor(muchweaker) + # rebels much weaker
    as.factor(strong) + # rebels strong
    cl + # strong central command
    gs + # government support
    tk + # transnational constituency
    tc + # territorial control
    moreactors + # more than one actor
    riv +  # rivalry
    log(rgdpch) + # ln gdp per capita
    cinc + # military capability
    pol6 # democracy
    
    
  • Estimate this model using a binary model of your choice and make a table of the results. Provide a brief interpretation.
#1. Load data
rebel <- get(load("saleyhan_etal_2011.rda"))

m1 = glm(
    supp1 ~
        muchweaker + # rebels much weaker
        strong + # rebels strong
        cl + # strong central command
        gs + # government support
        tk + # transnational constituency
        tc + # territorial control
        moreactors + # more than one actor
        riv +  # rivalry
        log(rgdpch) + # ln gdp per capita
        cinc + # military capability
        pol6, # democracy
data = rebel,
family = binomial(link = "logit")
)

modelsummary(
  list('Logit' = m1),
  stars=c('*'=0.1, '**'=0.05),
  coef_map = c(
    'muchweaker' = 'weaker rebels',
    'strong' = 'strong rebels',
    'cl' = 'strong central command',
    'gs' = 'government support',
    'tk' = 'transnational constituency',
    'tc' = 'territorial control',
    'moreactors' = 'more than one actor',
    'riv' = 'rivalry',
    'log(rgdpch)' = 'ln(gdp)',
    'cinc' = 'military capability',
    'pol6' = 'democracy'
    ),
  gof_omit = 'F|RMSE|Log.Lik.'
  )
|                            | Logit    |
|----------------------------|----------|
| weaker rebels              | -0.498** |
|                            | (0.305)  |
| strong rebels              | -1.096** |
|                            | (0.680)  |
| strong central command     | -0.415** |
|                            | (0.316)  |
| government support         | 0.841*   |
|                            | (0.279)  |
| transnational constituency | 1.025*   |
|                            | (0.300)  |
| territorial control        | 0.462**  |
|                            | (0.286)  |
| more than one actor        | 0.361**  |
|                            | (0.275)  |
| rivalry                    | 0.962*   |
|                            | (0.271)  |
| ln(gdp)                    | -0.179** |
|                            | (0.150)  |
| military capability        | 6.631**  |
|                            | (4.862)  |
| democracy                  | -0.404** |
|                            | (0.340)  |
| Num.Obs.                   | 308      |
| AIC                        | 386.5    |
| BIC                        | 431.2    |

** p < 0.95, * p < 0.1

In a logit model, a one-unit change in X changes the log-odds by the size of the coefficient, so for interpretation we exponentiate the coefficients. Holding other factors constant, when the government in the dyad receives foreign support, the odds of the rebels receiving foreign support are about 2.3 times higher (exp(0.841)). When a rivalry exists (1 vs. 0), the odds of rebels receiving foreign support are about 2.6 times higher. A transnational constituency increases the odds of foreign support by about 2.8 times (exp(1.025)). Rebel groups with a strong central command have about 34% lower odds of foreign support than those without one.
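The odds ratios referenced above can be reproduced directly from the fitted model; a minimal sketch with Wald confidence intervals:

# Odds ratios and 95% Wald CIs for the rebel-support logit
round(exp(cbind(OR = coef(m1), confint.default(m1))), 2)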

  • The key independent variables from Saleyhan et al. are: cinc, gs, tk, and riv. Each of these variables are 0 or 1. Manually or marginaleffects do the following and INTERPRET the results. Again make sure to organize the plots in an organized manner using something like patchwork.
# Average predicted probabilities of foreign support by each binary covariate
cl_margin = plot_predictions(
  m1,
  by='cl' ) +
  labs(
  x='1=With strong central command(CC), 0=without strong CC',
  y='Prob.(supp1=1)'
  )

gs_margin = plot_predictions(
  m1,
  by='gs' ) +
  labs(
  x='1=With gov support, 0=without gov support',
  y='Prob.(supp1=1)'
  )

tk_margin = plot_predictions(
  m1,
  by='tk' ) +
  labs(
  x='1=With transnational constituency(TC1), 0=without TC1',
  y='Prob.(supp1=1)'
  )

tc_margin = plot_predictions(
  m1,
  by='tc' ) +
  labs(
  x='1=With territorial control(TC2), 0=without TC2',
  y='Prob.(supp1=1)'
  )
(cl_margin | gs_margin) / (tk_margin | tc_margin) & 
  theme(strip.text = element_blank(), legend.position = "none")

(1) Centralized rebel organizations are less likely to attract foreign sponsors. (2) When the government receives foreign support, the probability that rebels also receive external support increases (3) Rebels with transnational constituencies have a higher probability of foreign support, compared to those without (4) Rebels that control territory show a higher probability of foreign support than those who do not

What is the effect of each of the following on the probability of a rebel group receiving foreign support using a typical case approach:

+ some change, to be defined by you, in the military capability score of the government (cinc)
+ the government in the rebel-gov dyad receiving foreign support (gs)
+ the rebel group having a transnational constituency (tk)
+ the government being engaged in an international rivalry (riv)
  • Now, do all that again using an observed value approach. Again organize.
library(modelsummary)
library(marginaleffects)
library(patchwork)   
library(ggplot2)

typical_grid <- datagrid(model = m1)  # typical case: numeric covariates at means, factors at modes

# 2) EFFECT OF cinc

q <- quantile(rebel$cinc, c(.25, .75), na.rm = TRUE)

pred_cinc <- predictions(
  m1,
  newdata = datagrid(model = m1, cinc = q),
  type = "response"
) %>%
  mutate(level = c("25th", "75th"))

p_cinc <- ggplot(pred_cinc, aes(x = level, y = estimate)) +
  geom_point(size = 2) +
  geom_line(aes(group = 1)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .1) +
  labs(
    x = "cinc",
    y = "Pr(Foreign Support)",
    title = "Effect of cinc"
  ) +
  theme_minimal()

# 3 gs: 0→1
ap_gs <- avg_predictions(m1, variables = "gs", type = "response")
# contrast (1 vs 0), averaged over the sample distribution of covariates
eff_gs <- comparisons(
  m1, variables = list(gs = c(0, 1)),
  type = "response"
)

p_gs <- plot_predictions(m1, condition = "gs", type = "response") +
  labs(
    x = "Government receives foreign support",
    y = "Pr(Foreign Support)",
    title = "Effect of government foreign support"
  ) +
  theme_minimal()

# 4) TRANSNATIONAL CONSTITUENCY (tk: 0→1)
ap_tk <- avg_predictions(m1, variables = "tk", type = "response")
eff_tk <- comparisons(
  m1, variables = list(tk = c(0, 1)),
  type = "response"
)

p_tk <- plot_predictions(m1, condition = "tk", type = "response") +
  labs(
    x = "transnational constituency",
    y = "Pr(Foreign Support)",
    title = "Effect of Transnational Constituency"
  ) +
  theme_minimal()

# 5) riv: 0→1
ap_riv <- avg_predictions(m1, variables = "riv", type = "response")
eff_riv <- comparisons(
  m1, variables = list(riv = c(0, 1)),
  type = "response"
)

p_riv <- plot_predictions(m1, condition = "riv", type = "response") +
  labs(
    x = "Government in Rivalry",
    y = "Pr(Foreign Support)",
    title = "Effect of Rivalry"
  ) +
  theme_minimal()

(p_cinc | p_gs) / (p_tk | p_riv) & 
  theme(strip.text = element_blank(), legend.position = "none")
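The typical-case quantities requested above (changes in cinc, gs, tk, and riv with the other covariates held at typical values) can be computed in the same framework. A minimal sketch, output not reproduced here; the 25th-to-75th percentile shift for cinc and the object name fd_typical are illustrative choices:

# Typical-case first differences: contrasts evaluated at a single "typical" row
fd_typical <- comparisons(
  m1,
  variables = list(
    gs   = c(0, 1),
    tk   = c(0, 1),
    riv  = c(0, 1),
    cinc = unname(quantile(rebel$cinc, c(0.25, 0.75), na.rm = TRUE))
  ),
  newdata = "mean",   # all other covariates held at their sample means
  type = "response"
)
fd_typical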

Part 3

  • Run one model on conflict with all the IVs specified above and then run another to test whether there is evidence of a conditional effect between postBoko and ngbrCount on conflict. Organize the results in a table.
#1. Load data
nigeria <- get(load("nigeria.rda"))

model1 = glm(
    conflict ~
        lagDV + # binary whether the sender actor initiated a conflict with the receiver in the previous year
        lagRecip + # binary the receiver actor initiated a conflict with the sender in the previous year
        govActor + # binary forces of the Nigerian government
        postBoko + # binary Boko Haram has entered the conflict
        elecYear + # binary election year
        groupSpread + # binary operate beyond multiple Nigerian administrative
        ngbrConfCount, # count conflicts in countries bordering Nigeria
data = nigeria,
family = binomial(link = "logit")
)

model2 = glm(
    conflict ~
        lagDV + # binary whether the sender actor initiated a conflict with the receiver in the previous year
        lagRecip + # binary the receiver actor initiated a conflict with the sender in the previous year
        govActor + # binary forces of the Nigerian government
        postBoko + # binary Boko Haram has entered the conflict
        elecYear + # binary election year
        groupSpread + # binary operate beyond multiple Nigerian administrative
        ngbrConfCount + # count conflicts in countries bordering Nigeria
        postBoko*ngbrConfCount,
data = nigeria,
     family = binomial(link = "logit")  )


modelsummary(
     list('Base Model'=model1, 'Interaction Model'=model2),
     stars=c('*'=0.1, '**'=0.05),
     coef_map = c(
          'postBoko' = 'Boko Haram',
          'ngbrConfCount' = 'neighbor conflicts',
          'postBoko:ngbrConfCount' = 'Boko Haram x neighbor conflicts',
          'lagDV' = 'conflict with the receiver in t-1',
          'lagRecip' = 'conflict with the sender in t-1',          
          'govActor' = 'forces of the Nigerian government',
          'elecYear' = 'election year',
          'groupSpread' = 'beyond multiple administrative'
     ),
     gof_omit = 'F|RMSE|Log.Lik.'
     )
|                                   | Base Model | Interaction Model |
|-----------------------------------|------------|-------------------|
| Boko Haram                        | 0.853**    | -0.357            |
|                                   | (0.155)    | (0.254)           |
| neighbor conflicts                | 0.007**    | -0.016**          |
|                                   | (0.002)    | (0.005)           |
| Boko Haram x neighbor conflicts   |            | 0.029**           |
|                                   |            | (0.005)           |
| conflict with the receiver in t-1 | 3.154**    | 3.162**           |
|                                   | (0.203)    | (0.203)           |
| conflict with the sender in t-1   | 1.893**    | 1.920**           |
|                                   | (0.241)    | (0.239)           |
| forces of the Nigerian government | 1.354**    | 1.430**           |
|                                   | (0.606)    | (0.611)           |
| election year                     | -0.358*    | -0.407**          |
|                                   | (0.190)    | (0.192)           |
| beyond multiple administrative    | 0.945**    | 0.810*            |
|                                   | (0.420)    | (0.427)           |
| Num.Obs.                          | 16154      | 16154             |
| AIC                               | 1862.0     | 1831.9            |
| BIC                               | 1923.6     | 1901.1            |

* p < 0.1, ** p < 0.05
  • Visualize the conditional relationship. Be explicit about what the visualization is showing and whether you are using the observed value approach or average case approach and also try your best to include the raw data or at least some way of highlighting that you have enough data to test the interaction. Then of course interpret, whether there is evidence of a conditional effect between postBoko and ngbrCount on conflict.
vars <- c("postBoko", "ngbrConfCount")
summary(nigeria[vars])
##     postBoko      ngbrConfCount   
##  Min.   :0.0000   Min.   :  5.00  
##  1st Qu.:0.0000   1st Qu.: 16.00  
##  Median :0.0000   Median : 27.00  
##  Mean   :0.3914   Mean   : 40.12  
##  3rd Qu.:1.0000   3rd Qu.: 81.00  
##  Max.   :1.0000   Max.   :101.00
plot_predictions(
     model2, 
     condition=list(
          ngbrConfCount=seq(20, 80, 1),
          postBoko=c(0, 1)
     ) ) + 
     labs(
          x='conflicts in countries bordering Nigeria',
          y='Prob.(conflict=1)', 
          color='postBoko', 
          fill='postBoko'
     )

CONDITIONAL EFFECT: The positive and statistically significant coefficient on the two-way interaction confirms a conditional effect: the conflict environment in neighboring countries matters much more after Boko Haram’s emergence. Using an average-case approach (other covariates held at typical values), the figure plots the predicted probability that a dyad experiences conflict as a function of the number of conflicts in countries bordering Nigeria under two conditions: the red line shows the relationship before Boko Haram’s entry and the blue line after Boko Haram’s entry. The shaded areas around the lines represent 95% confidence intervals.
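As a complementary numerical check (a sketch, output not shown), the average slope of neighboring conflicts can be computed separately for the pre- and post-Boko Haram periods:

# Average marginal effect of ngbrConfCount, by postBoko
avg_slopes(model2, variables = "ngbrConfCount", by = "postBoko")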

EVALUATION OF MODEL FIT: Both AIC and BIC decrease when the interaction term is added, suggesting that the model with the conditional effect provides a better fit to the data.

Part 4

Consider a hypothetical scenario where a research organization conducts a survey on the adoption of a new green energy initiative among the citizens of two towns - ‘EcoVille’ and ‘GrayTown’. The data for these results is stored in green_data.rda. The organization models the likelihood of a citizen supporting the green initiative (this is the dv and is labeled green) based on two covariates:

  • ecoville: (1 if EcoVille, 0 if GrayTown)

  • z_monthly_bill: Standardized monthly energy bill (continuous variable with higher values indicating higher energy consumption)

Upon analysis, the organization finds that every citizen of EcoVille supports the green energy initiative while no one from GrayTown does.

  • Explain the concept of separation as it applies to this scenario.

ANSWER: Because ecoville perfectly predicts the outcome (every EcoVille citizen supports the initiative and no GrayTown citizen does), this is a case of complete (perfect) separation.

  • If you were to run a logit model on this data, what issues would arise? Why?

The maximum likelihood estimate for ecoville does not exist: the algorithm fails to converge, the estimated coefficient drifts toward infinity, and the standard errors blow up, leading to unreliable inference.

eco <- get(load("green_data.rda"))

mod1 = glm(
    green ~
        ecoville + # 1 if EcoVille
        z_monthly_bill , # Standardized monthly energy bill - continuous variable
data = eco,
family = binomial(link = "logit")
)
## Warning: glm.fit: algorithm did not converge
modelsummary(
     list('Base Model'=mod1),
     stars=c('*'=0.1, '**'=0.05),
     coef_map = c(
          'ecoville' = 'EcoVille',
          'z_monthly_bill' = 'energy bill' ),
     gof_omit = 'F|RMSE|Log.Lik.'  )
|             | Base Model  |
|-------------|-------------|
| EcoVille    | 53.132      |
|             | (51908.478) |
| energy bill | 0.000       |
|             | (25959.327) |
| Num.Obs.    | 435         |
| AIC         | 6.0         |
| BIC         | 18.2        |

* p < 0.1, ** p < 0.05
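The separation is also visible directly in a cross-tabulation of the two variables (a quick check, output omitted):

# Every EcoVille respondent supports the initiative; no GrayTown respondent does
with(eco, table(ecoville, green))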
  • Run a logistic regression and a penalized logistic regression, and presents the results in a table. Discuss the results.

The standard logit fails because of complete separation: its ecoville coefficient is enormous and its standard error explodes. The penalized (Firth) logit in the second model instead yields a finite, statistically significant EcoVille coefficient by adding a small bias-reducing penalty to the likelihood.

library(logistf) # install.packages("logistf")
mod2 <- logistf(green ~ ecoville + z_monthly_bill, data = green_data)
## Warning in logistf(green ~ ecoville + z_monthly_bill, data = green_data):
## logistf.fit: Maximum number of iterations for full model exceeded. Try to
## increase the number of iterations or alter step size by passing
## 'logistf.control(maxit=..., maxstep=...)' to parameter control
modelsummary(
     list('Base Model'=mod1, 'Penalized Model'=mod2 ),
     stars=c('*'=0.1, '**'=0.05),
     coef_map = c(
          'ecoville' = 'EcoVille',
          'z_monthly_bill' = 'energy bill' ),
     gof_omit = 'F|RMSE|Log.Lik.')
|             | Base Model  | Penalized Model |
|-------------|-------------|-----------------|
| EcoVille    | 53.132      | 14.386**        |
|             | (51908.478) | (3.430)         |
| energy bill | 0.000       | -1.355          |
|             | (25959.327) | (0.977)         |
| Num.Obs.    | 435         | 435             |
| R2          |             | 1.000           |
| AIC         | 6.0         |                 |
| BIC         | 18.2        |                 |

* p < 0.1, ** p < 0.05
  • Conduct a substantive effects analysis of the effect of z_monthly_bill on the likelihood of supporting the green initiative in EcoVille. Discuss the results.
library(marginaleffects)

bill_vals <- with(
  subset(green_data, ecoville == 1), 
  quantile(z_monthly_bill, probs = c(0.1, 0.9), na.rm = TRUE)
)

p_logit <- plot_predictions(
  model = mod2,   
  condition = list(
    z_monthly_bill = seq(bill_vals[1], bill_vals[2], length.out = 100),
    ecoville = 1
  ),
  type = "response" 
) +
  labs(
    x = "Monthly Energy Bill",
    y = "Pr(Supporting Green Initiative)",
    title = "Penalized Logit Model"
  ) +
  theme_minimal()

p_logit

Among citizens of EcoVille, the predicted probability of supporting the green initiative is near 1.0. However, the line slopes slightly downward, suggesting that the penalized logit estimates a weak negative association between the monthly energy bill and the probability of support. In other words, citizens with higher energy consumption are slightly less likely to support the green energy initiative.

Part 5

Consider a (simulated) study of cabinet survival across a set of very fun polities—Weimaria, Linzonia, Przeworska, Svolikia, Leviathan, Selectorate, Median Voterland, Endogeneity Bay, Hegemonia, Credibilia, Coalitionberg, and Backslidistan. The dataset for this exercise is stored in cabinet.rda and includes two objects:

  • cabinet (continuous-time data; one row per cabinet):

    • country: polity name (see above),
    • time: observed duration in years (censored at study end),
    • event: 1 if cabinet failed, 0 if right-censored,
    • coalition: factor with levels Single / Coalition,
    • majority: factor with levels Minority / Majority,
    • econ_growth: GDP growth (%),
    • polarization: ideological polarization (0–1),
    • crisis: factor No / Yes.

5.1 Nonparametric foundations

  • Plot Kaplan–Meier survival curves for coalition vs Single.
  • In 2–3 sentences, describe the censoring pattern and what the curves suggest about duration dependence.
library(survival)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
library(survminer)
## Loading required package: ggpubr
## 
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
## 
##     myeloma
# Load in duration dataset
load('cabinet.rda')

# KM by coalition status
km_fit <- survfit(Surv(time, event) ~ coalition, data = cabinet)

# Plot with CIs, censoring marks, and numbers at risk
p_km <- ggsurvplot(
  km_fit,
  data       = cabinet,
  conf.int   = TRUE,
  censor     = TRUE,
  ggtheme    = theme_minimal(),
  legend.title = "Cabinet type",
  legend.labs  = levels(cabinet$coalition),
  xlab = "time since cabinet formation (years)",
  ylab = "survival probability"
)
p_km

  1. Both single-party and coalition cabinets show right-censoring: a number of cabinets survive to the end of the study window (around 12 years) without failing.
  2. The survival curves necessarily decline, so the decline by itself does not identify duration dependence; whether the hazard rises or falls over time is better judged from the hazard itself (the Weibull fit below, with shape greater than 1, points to an increasing hazard, i.e., positive duration dependence).
  3. Coalition cabinets have lower survival probabilities than single-party cabinets, suggesting they are less stable; a log-rank test (sketch below) can formalize this comparison.
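A log-rank test gives a formal comparison of the two survival curves; a minimal sketch using the survival package loaded above:

# Log-rank test: do coalition and single-party survival curves differ?
survdiff(Surv(time, event) ~ coalition, data = cabinet)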

5.2 Cox proportional hazards (continuous time)

  • Fit a Cox model with Efron ties:

    \[ h(t|X) = h_0(t)\exp(\beta_1 \text{Coalition} + \beta_2 \text{Majority} + \beta_3 \text{EconGrowth} + \beta_4 \text{Polarization} + \beta_5 \text{Crisis}) \]

  • Report hazard ratios with 95% CIs.

  • Briefly interpret the Coalition and Majority effects (direction + a meaningful magnitude).

load("cabinet.rda")
cabinet_data <- cabinet
surv_obj <- Surv(time = cabinet_data$time,
                 event = cabinet_data$event)
cox1 <- coxph(surv_obj ~ coalition + majority + econ_growth +
              polarization + crisis,
              data = cabinet_data, x = TRUE)  # Store design matrix for simPH

# Present results using modelsummary
models_list <- list(
  "Base Model" = cox1
)

modelsummary(models_list,
             exponentiate = TRUE,  # show hazard ratios
             title = "Table 2: Cox Proportional Hazards Models of Cabinet Duration",
             stars = TRUE,
             gof_map = c("nobs", "nevent", "logLik", "AIC", "BIC"),
             coef_rename = c(
               "coalitionCoalition" = "Coalition Government",
               "majorityMajority" = "Majority Government",
               "econ_growth" = "Economic Growth (%)",
               "polarization" = "Polarization Index",
               "crisisYes" = "External Crisis"
             ),
             notes = "Hazard ratios reported. Values > 1 indicate higher failure risk.")
Table 2: Cox Proportional Hazards Models of Cabinet Duration

|                      | Base Model |
|----------------------|------------|
| Coalition Government | 1.605***   |
|                      | (0.188)    |
| Majority Government  | 0.321***   |
|                      | (0.039)    |
| Economic Growth (%)  | 0.863***   |
|                      | (0.025)    |
| Polarization Index   | 4.861***   |
|                      | (1.066)    |
| External Crisis      | 2.570***   |
|                      | (0.353)    |
| Num.Obs.             | 450        |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Hazard ratios reported. Values > 1 indicate higher failure risk.
# Extract hazard ratios from full model
summary(cox1)$conf.int
##                    exp(coef) exp(-coef) lower .95 upper .95
## coalitionCoalition 1.6046268  0.6231979 1.2753524 2.0189143
## majorityMajority   0.3206228  3.1189297 0.2519175 0.4080661
## econ_growth        0.8627470  1.1590884 0.8157571 0.9124437
## polarization       4.8605101  0.2057397 3.1627560 7.4696115
## crisisYes          2.5698209  0.3891322 1.9631411 3.3639861
  1. Coalition: Coalition cabinets have about 60% higher risk of failure at any given time than single-party cabinets.

  2. Majority: Majority cabinets have roughly 68% lower hazard of failure than minority cabinets.
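Because the final part of this question asks whether proportional hazards looks plausible for Coalition, a standard diagnostic is the scaled Schoenfeld residual test; a minimal sketch (output not reproduced here):

# PH diagnostics: a small p-value for the coalition term would indicate
# a time-varying (non-proportional) coalition effect
cox.zph(cox1)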

5.3 Discrete-time hazard (cloglog) — two baseline options

Estimate both specifications with a cloglog link using pp:

  • (a) Time dummies (fully flexible baseline) event ~ factor(t) + coalition + majority + econ_growth + polarization + crisis

  • (b) Carter & Signorino cubic of time (smooth baseline) Create \(\tilde t = t / \max(t)\) and include \(\tilde t, \tilde t^2, \tilde t^3\): event ~ t_cs + t_cs2 + t_cs3 + coalition + majority + econ_growth + polarization + crisis

  • pp (person–period expansion for discrete-time modeling):

    • id, country,
    • t: time since the cabinet formed (integer periods; resets at the start of each spell),
    • event: period failure indicator (1 only in the failure period),
    • the same covariates as above.

Important: In discrete-time models, t indexes duration (time since entering the risk set), not calendar year. If you wanted calendar-year shocks, you’d add separate year indicators (not provided here).

load("cabinet.rda")
person_period_data <- pp
head(person_period_data, 15)
t_max  <- max(person_period_data$t, na.rm = TRUE)
person_period_data$t_cs  <- person_period_data$t / t_max
person_period_data$t_cs2 <- person_period_data$t_cs^2
person_period_data$t_cs3 <- person_period_data$t_cs^3

# base cloglog model 
a_cloglog <- glm(event ~ factor(t) + coalition + majority + econ_growth + polarization + crisis ,
                  data = person_period_data,
                  family = binomial(link = "cloglog"))

# (b) Carter & Signorino smooth baseline via cubic in scaled time
b_cloglog <- glm(
  event ~ t_cs + t_cs2 + t_cs3 + coalition + majority + econ_growth + polarization + crisis,
  data   = person_period_data,
  family = binomial(link = "cloglog")
)


# display results
modelsummary(
  list("Cloglog - Time Dummies" = a_cloglog,
       "Cloglog - Cubic" = b_cloglog),
  estimate = "{estimate} ({std.error})",
  statistic = NULL,
  coef_omit = "factor\\(time\\)",
  gof_map = c("nobs", "aic", "bic", "logLik"),
  title = "Discrete-Time Duration Model"
)
Discrete-Time Duration Model

|                    | Cloglog - Time Dummies | Cloglog - Cubic |
|--------------------|------------------------|-----------------|
| (Intercept)        | -4.027 (0.314)         | -4.349 (0.398)  |
| factor(t)2         | 0.891 (0.323)          |                 |
| factor(t)3         | 1.203 (0.316)          |                 |
| factor(t)4         | 1.376 (0.316)          |                 |
| factor(t)5         | 1.072 (0.341)          |                 |
| factor(t)6         | 1.655 (0.318)          |                 |
| factor(t)7         | 1.308 (0.348)          |                 |
| factor(t)8         | 1.554 (0.341)          |                 |
| factor(t)9         | 1.584 (0.352)          |                 |
| factor(t)10        | 1.832 (0.347)          |                 |
| factor(t)11        | 2.070 (0.344)          |                 |
| factor(t)12        | 1.688 (0.391)          |                 |
| coalitionCoalition | 0.488 (0.117)          | 0.491 (0.117)   |
| majorityMajority   | -1.141 (0.123)         | -1.143 (0.123)  |
| econ_growth        | -0.152 (0.029)         | -0.152 (0.029)  |
| polarization       | 1.600 (0.220)          | 1.598 (0.220)   |
| crisisYes          | 0.944 (0.138)          | 0.946 (0.138)   |
| t_cs               |                        | 8.310 (2.662)   |
| t_cs2              |                        | -12.658 (5.584) |
| t_cs3              |                        | 6.674 (3.424)   |
| Num.Obs.           | 3629                   | 3629            |
| AIC                | 1883.3                 | 1878.8          |
| BIC                | 1988.6                 | 1934.6          |
| Log.Lik.           | -924.625               | -930.403        |
exp(0.491)
## [1] 1.633949
exp(-1.143)
## [1] 0.318861

Then:

  • Compare signs/magnitudes for coalition and majority to the Cox model.
  • Explain (2–3 sentences) why cloglog coefficients retain a proportional-hazards interpretation for covariates while the baseline hazard is handled via either factor(t) or the C&S cubic.
  • State when you’d prefer (a) vs (b) given typical annual poli-sci data (ties, size of \(T\), overfitting concerns).

Cox model (hazard ratios): Coalition 1.605, Majority 0.321. Cloglog model (exponentiated coefficients): Coalition exp(0.491) = 1.63, Majority exp(-1.143) = 0.32. The directions match across the two models, and the magnitudes are nearly identical; the coalition effect is marginally stronger in the cloglog model, while the majority effect is essentially the same.

The cloglog coefficients retain a proportional-hazards interpretation because the link function implies that covariates shift the hazard multiplicatively, while the baseline hazard itself is captured flexibly either by the duration dummies or by the smooth cubic in time (see the derivation below).
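To make that concrete: if the continuous-time hazard is \(h(t \mid x) = h_0(t)\exp(x'\beta)\) and we only observe whether a failure occurred in interval \(t\), the per-period failure probability is

\[ h_{it} = 1 - \exp\!\big\{-\exp\big(x_{it}'\beta + \gamma_t\big)\big\}, \qquad \gamma_t = \log \int_{t-1}^{t} h_0(u)\,du, \]

so \(\log\{-\log(1 - h_{it})\} = x_{it}'\beta + \gamma_t\): the covariates enter exactly as in the continuous-time PH model, and the interval-specific \(\gamma_t\) terms are absorbed by either factor(t) or the C&S cubic.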

When the number of periods is moderate (e.g., T < 30) and the sample is large enough to estimate all of the time dummies, specification (a), the fully flexible factor(t) baseline, is preferred. When T is long, events are sparse in some periods, or overfitting is a concern, the more parsimonious cubic-in-time specification (b) is preferred.

Finally, using either your Cox model (5.2) or one of your discrete-time models (5.3), produce predicted survival curves (Cox) or period hazards (cloglog) for four scenarios:

(Coalition vs Single) × (Majority vs Minority), holding other covariates at typical values.

Provide two substantive takeaways (direction + a meaningful quantity like median survival or survival at 6/10 years), and comment on whether PH looks plausible for Coalition. If not, name one adjustment (e.g., time interaction, stratification, or preferring discrete-time) and justify briefly.

library(survival)

surv_obj <- Surv(time = cabinet_data$time,
                 event = cabinet_data$event)

weibull_fit <- survreg(surv_obj ~ coalition + majority,
                        data = cabinet_data,
                        dist = "weibull")

shape_param <- 1/weibull_fit$scale
aft_coefs <- coef(weibull_fit)[-1]
time_ratios <- exp(aft_coefs)

aft_se <- sqrt(diag(vcov(weibull_fit)))[2:3]  # extract SEs for coalition and majority only
weibull_summary <- data.frame(
  Variable = names(aft_coefs),
  AFT_Coef = as.numeric(aft_coefs),
  Time_Ratio = as.numeric(time_ratios),
  SE = as.numeric(aft_se),
  p_value = 2 * pnorm(-abs(aft_coefs/aft_se)),
  row.names = NULL
)
weibull_summary[, -1] <- round(weibull_summary[, -1], 3)
print(weibull_summary)
##             Variable AFT_Coef Time_Ratio    SE p_value
## 1 coalitionCoalition   -0.262       0.77 0.084   0.002
## 2   majorityMajority    0.636       1.89 0.086   0.000
# test for duration dependence: delta-method SE for the shape (shape = 1/scale)
se_log_scale <- sqrt(vcov(weibull_fit)["Log(scale)", "Log(scale)"])
shape_se <- shape_param * se_log_scale
z_shape <- (shape_param - 1) / shape_se
p_shape <- 2 * pnorm(-abs(z_shape))

# use marginaleffects for predictions
# create scenarios for prediction
scenarios_weibull <- datagrid(
  coalition = c("Single", "Coalition"),
  majority = c("Majority", "Minority"),
  model = weibull_fit
)

# get predicted survival times
pred_times <- predict(weibull_fit, newdata = scenarios_weibull, type = "response")
scenarios_weibull$predicted_duration <- pred_times

# display predicted durations
# Display predicted mean durations
print(scenarios_weibull[, c("coalition", "majority", "predicted_duration")])
##   coalition majority predicted_duration
## 1    Single Minority           8.410304
## 2    Single Majority          15.892558
## 3 Coalition Minority           6.474531
## 4 Coalition Majority          12.234619
# create survival curves for visualization
# note: survreg models need different approach for survival curves
time_grid <- seq(0.1, 12, 0.5)
plot_data <- list()

for(i in 1:nrow(scenarios_weibull)) {
  # linear predictor for this scenario; for a survreg Weibull model the
  # survival function is S(t) = exp(-(t / exp(lp))^(1/scale)),
  # which equals 1 - pweibull(t, shape = 1/scale, scale = exp(lp))
  lp <- predict(weibull_fit, newdata = scenarios_weibull[i,], type = "lp")
  surv_probs <- 1 - pweibull(time_grid,
                             shape = 1/weibull_fit$scale,
                             scale = exp(lp))

  plot_data[[i]] <- data.frame(
    time = time_grid,
    survival = surv_probs,
    coalition = scenarios_weibull$coalition[i],
    majority = scenarios_weibull$majority[i],
    group = paste0("Coalition=", scenarios_weibull$coalition[i],
                   ", majority=", scenarios_weibull$majority[i])
  )
}

plot_df <- do.call(rbind, plot_data)

# create enhanced visualization
ggplot(plot_df, aes(x = time, y = survival, color = group)) +
  geom_line(linewidth = 1.2) +
  labs(title = "Weibull Model: Predicted Survival Curves",
       subtitle = paste("Shape parameter =", round(1/weibull_fit$scale, 2),
                       "indicating", ifelse(1/weibull_fit$scale > 1,
                                          "increasing hazard",
                                          "decreasing hazard")),
       x = "Time (years)",
       y = "Survival Probability",
       color = "Scenario") +
  theme_minimal() +
  facet_wrap(~coalition)  # coalition is already a labelled factor (Single / Coalition)

In the Weibull (AFT) model, positive coefficients indicate longer survival. The coalition coefficient of -0.262 translates into a time ratio of 0.77, so coalition cabinets survive roughly 23% less long than single-party cabinets, other things equal. Averaging over majority status, the predicted durations show single-party cabinets lasting approximately 12 years under typical conditions, compared with about 9.3 years for coalition cabinets.
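For comparison with the Cox hazard ratios, Weibull AFT coefficients map to proportional-hazards quantities through the shape parameter:

\[ \text{HR} = \exp(-p\,\beta_{\text{AFT}}) = \text{TR}^{-p}, \]

where \(p\) is the shape and TR the time ratio, so a time ratio below 1 corresponds to a hazard ratio above 1.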

The shape parameter of 1.39 indicates positive duration dependence: the hazard increases over time, so cabinets become increasingly vulnerable the longer they survive. Duration dependence by itself does not tell us whether the Coalition hazard ratio is constant; PH for Coalition is questionable if the coalition and single-party survival curves converge or cross at longer durations, or if a Schoenfeld-residual test (such as cox.zph above) flags the coalition term.

Adding a coalition × time interaction allows the effect of being a coalition government to vary smoothly over the cabinet’s lifespan, accommodating a non-proportional (time-varying) effect of coalition status.
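One way to implement that adjustment in the Cox framework is a time-transform term; a minimal sketch reusing the 5.2 specification (the object cox_tt and the log(t) transform are illustrative choices):

# Allow the coalition effect to vary with log(time) via a tt() term;
# a significant tt(coalition) coefficient indicates a non-proportional effect
cox_tt <- coxph(
  Surv(time, event) ~ coalition + majority + econ_growth +
    polarization + crisis + tt(coalition),
  data = cabinet_data,
  tt = function(x, t, ...) (x == "Coalition") * log(t)
)
summary(cox_tt)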