In the context of binary response models, the latent variable approach assumes that the observed binary outcome is based on an unobserved continuous variable. Which of the following is true regarding this latent variable?
Answer: (C)
What is the primary difference between the logit and probit models in terms of their functional form?
Answer: (A) The choice of error distribution determines the functional form.
Which of the following is a key advantage of using a Linear Probability Model (LPM) over a logit or probit model for binary outcome estimation?
Answer: (B) The LPM is easy to estimate and interpret via OLS.
In the complementary log-log (cloglog) model, what key characteristic distinguishes its cumulative distribution function (CDF) from that of the logit and probit models?
Answer: (C) The complementary log-log (cloglog) model uses an asymmetric link function: its CDF is skewed, approaching 1 much more sharply than it departs from 0. This structure comes from survival analysis, where we model the hazard of an event.
The cloglog link is preferred for discrete-time models derived from a continuous-time PH process because it:
Answer: (B) The cloglog link arises naturally when you observe a continuous-time proportional-hazards process at discrete intervals: the baseline hazard within each interval is absorbed into an interval-specific intercept.
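To make the derivation concrete (the standard discrete-time PH result): if \(h(t\mid X) = h_0(t)\exp(X\beta)\), the probability of failure within the interval \([t_{j-1}, t_j)\), given survival to its start, is

\[ \pi_j = 1 - \exp\!\big(-\exp(X\beta + \gamma_j)\big), \qquad \gamma_j = \log \int_{t_{j-1}}^{t_j} h_0(u)\,du, \]

so \(\log(-\log(1-\pi_j)) = X\beta + \gamma_j\): the cloglog of the interval hazard is linear in \(X\beta\), and the coefficients keep their continuous-time PH interpretation.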
When using simulated predictions to interpret logit model results, which of the following best explains the purpose of drawing from the multivariate normal distribution of estimated coefficients?
Answer: (A) Drawing from the multivariate normal distribution of the estimated coefficients captures: (1) individual parameter uncertainty (how each coefficient varies, i.e., its confidence interval); (2) the correlations between parameters; and (3) uncertainty in combinations of coefficients, which lets us make probabilistic statements about quantities of interest.
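To see what those draws buy us, here is a minimal sketch of simulation-based prediction on hypothetical toy data (the variable names are illustrative, not from the assignment):

# simulation-based prediction: draw coefficient vectors from their estimated
# sampling distribution, push each draw through the inverse link, and
# summarize the resulting distribution of predicted probabilities
library(mvtnorm)
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-0.5 + 0.8*x))
fit <- glm(y ~ x, family = binomial(link = 'logit'))
betas <- rmvnorm(1000, mean = coef(fit), sigma = vcov(fit)) # 1,000 draws
p_draws <- plogis(betas %*% c(1, 1)) # predicted Pr(y = 1) at x = 1, per draw
quantile(p_draws, c(0.025, 0.5, 0.975)) # point estimate and 95% interval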
Consider the following non-linear model \(P(Y) = \Phi(\beta_1x_1+\beta_2x_2+\beta_{12}x_1x_2+ \mathbf{X}\beta)\). Ai and Norton indicate that the interaction effect could be nonzero, even if \(\beta_{12}=0\).
Considering the same non-linear model, Ai and Norton indicate that the significance of the interaction effect can be tested with a simple t-test on the coefficient of the interaction term \(\beta_{12}\).
Considering the same non-linear model, Ai and Norton show that the sign of \(\beta_{12}\) indicates the sign of the interaction effect.
In logit models, interaction effects are inherently imposed through the link function, even in the absence of an interaction term.
The Akaike information criterion selects the model that minimizes the negative log-likelihood penalized by the number of parameters: \(\text{AIC} = -2\log L + 2k\).
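A quick sanity check of that definition on a toy glm (mtcars, purely illustrative):

# AIC = -2*logLik + 2k, where k is the number of estimated parameters
fit <- glm(am ~ wt, data = mtcars, family = binomial)
k <- attr(logLik(fit), 'df')
c(manual = -2*as.numeric(logLik(fit)) + 2*k, builtin = AIC(fit))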
Right-censored observations should be dropped because the failure time is unknown.
A Weibull with shape \(p > 1\) implies an increasing hazard over time: True. Conversely, shape \(p < 1\) means a decreasing hazard (i.e., negative duration dependence; see 4.3, The Weibull Model).
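An illustrative sketch of the two hazard shapes, using \(h(t) = (p/\lambda)(t/\lambda)^{p-1}\) with \(\lambda = 1\) and hypothetical shape values:

# Weibull hazard with lambda = 1: h(t) = p * t^(p-1)
# increasing when p > 1, decreasing when p < 1
curve(1.5 * x^0.5, from = 0.05, to = 4, xlab = 't', ylab = 'h(t)', ylim = c(0, 4))
curve(0.7 * x^(-0.3), add = TRUE, lty = 2)
legend('topleft', c('p = 1.5 (increasing)', 'p = 0.7 (decreasing)'), lty = 1:2)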
Show all code but only relevant output.
## load packages
library(mvtnorm)
library(ggplot2)
library(reshape2)
library(dplyr)
library(magrittr)
For this question you'll be using the data in `debate_election_data.csv`. This dataset contains information about political candidates who participated in televised debates before an election, including details on their debate performance, party affiliation, campaign spending, and whether they won or lost the election.

- `election_won`: binary indicator of whether the candidate won (1) or lost (0) the election. It's the dependent variable in the models.
- `debate_perf`: continuous debate performance score, as evaluated by undecided voters. Scores range from 0 to 100, with higher scores indicating better performances. Pay attention to this variable and its relationship with the election outcome, especially given what we saw with Biden.
- `party_major`: binary indicator of whether the candidate belongs to a major political party (1) or a minor party (0). Major-party affiliation is expected to positively influence election outcomes.
- `spend_millions`: total spending on the candidate's campaign, in millions of dollars. Campaign spending can significantly influence a candidate's chances of winning the election, with higher spending typically leading to better outcomes.
In answering the question, you will be using four models: logit, probit, complementary log-log (cloglog), and a linear probability model (LPM) (in the latter make sure to check for heteroskedasticity and correct for it if relevant).
Let's start with the binomial model with a logit link (p. 14 in the lecture slides).
\[ Y_i \sim \text{Bernoulli}(\pi_i) \]
\[ \pi_i = \frac{1}{1 + \exp(-X_i \beta)} \] Where \(\pi_i = \mathbb{P}(Y_i = 1)\) and \(X_i \beta\) is the linear predictor.
Let's move on to the probit model:
\[ \pi_i = \Phi(X_i \beta) \]
Where \(\Phi\) is the cumulative distribution function of the standard normal distribution.
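The three binary links differ only in the inverse-link function mapping \(X_i\beta\) to \(\pi_i\); a quick numerical comparison at a few illustrative values of the linear predictor:

# inverse links side by side: logit, probit, cloglog
eta <- seq(-3, 3, length.out = 7)
round(data.frame(
  eta = eta,
  logit = plogis(eta),          # 1/(1 + exp(-eta))
  probit = pnorm(eta),          # Phi(eta)
  cloglog = 1 - exp(-exp(eta))  # asymmetric in the tails
), 3)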
#1. Load data
library(modelsummary)
setwd("C:/Users/Chaehwan/OneDrive - Michigan State University/25 Fall/PLS900/Assignment/week7")
rfseed <- 1931
set.seed(rfseed)
debate = read.csv('debate_election_data.csv')
##1.1 Summary of Data
vars <- c("election_won", "debate_perf", "party_major", "spend_millions")
summary(debate[vars])
## election_won debate_perf party_major spend_millions
## Min. :0.000 Min. : 0.00 Min. :0.0000 Min. : 0.05028
## 1st Qu.:0.000 1st Qu.: 30.01 1st Qu.:0.0000 1st Qu.: 4.43004
## Median :0.000 Median : 39.88 Median :1.0000 Median : 5.66651
## Mean :0.379 Mean : 41.46 Mean :0.5323 Mean : 5.63427
## 3rd Qu.:1.000 3rd Qu.: 51.85 3rd Qu.:1.0000 3rd Qu.: 6.99534
## Max. :1.000 Max. :100.00 Max. :1.0000 Max. :10.53042
##2. Set up our model formula
dv = 'election_won'
ivs = c( 'debate_perf','party_major','spend_millions' )
sysComp = paste(ivs, collapse='+')
form = paste0(dv, '~', sysComp)
modData = na.omit(debate[,c(dv, ivs)])
#3. Fit the four models: logit, probit, cloglog, and LPM
m1 = glm(
election_won ~
debate_perf + party_major + spend_millions,
data=debate,
family=binomial(link='logit')
)
m2 = glm(
election_won ~
debate_perf + party_major + spend_millions,
data=debate,
family=binomial(link='probit')
)
m3 = glm(
election_won ~
debate_perf + party_major + spend_millions,
data=debate,
family=binomial(link='cloglog')
)
m4 = glm(
election_won ~
debate_perf + party_major + spend_millions,
data=debate # family defaults to gaussian, so this is OLS: the LPM
)
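The prompt asks to check the LPM for heteroskedasticity; since the LPM's error variance is \(\pi_i(1-\pi_i)\) by construction, robust standard errors are usually warranted. A minimal sketch, assuming the lmtest and sandwich packages are installed:

# Breusch-Pagan test and HC3-robust SEs for the LPM (refit as lm)
library(lmtest)
library(sandwich)
lpm_lm <- lm(election_won ~ debate_perf + party_major + spend_millions,
             data = debate)
bptest(lpm_lm) # a small p-value indicates heteroskedasticity
coeftest(lpm_lm, vcov = vcovHC(lpm_lm, type = 'HC3')) # corrected inference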
modelsummary(
list('Logit' = m1, 'Probit' = m2, 'Cloglog' = m3, 'LPM' = m4),
stars=c('*'=0.1, '**'=0.05),
coef_map = c(
'debate_perf' = 'debate performance',
'party_major' = 'major party',
'spend_millions' = 'total spending'
),
gof_omit = 'F|RMSE|Log.Lik.'
)
|                    | Logit   | Probit  | Cloglog | LPM     |
|--------------------|---------|---------|---------|---------|
| debate performance | 0.064** | 0.037** | 0.038** | 0.013** |
|                    | (0.010) | (0.006) | (0.006) | (0.002) |
| major party        | 0.364   | 0.236   | 0.454** | 0.070   |
|                    | (0.297) | (0.176) | (0.220) | (0.056) |
| total spending     | -0.066  | -0.043  | -0.046  | -0.014  |
|                    | (0.077) | (0.046) | (0.057) | (0.014) |
| Num.Obs.           | 248     | 248     | 248     | 248     |
| R2                 | 0.203   |         |         |         |
| AIC                | 282.7   | 283.3   | 289.8   | 298.8   |
| BIC                | 296.7   | 297.4   | 303.8   | 316.4   |

* p < 0.1, ** p < 0.05
Plot the effect of debate performance on the predicted probability of winning (using `plot_predictions` or your manual approach). Discuss how this effect differs across the logit, probit, cloglog, and LPM models. Be sure to use a package like `patchwork` to present the results in a clear and organized manner.

library(modelsummary)
library(marginaleffects)
library(patchwork)
library(dplyr)
library(ggplot2)
debate_vals <- with(debate,
quantile(debate_perf, probs = c(0.1, 0.9), na.rm = TRUE))
# Plot predicted probabilities
p_logit <- plot_predictions(
m1,
condition = list(
debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
party_major = mean(debate$party_major, na.rm = TRUE),
spend_millions = mean(debate$spend_millions, na.rm = TRUE)
)) +
labs(
x = "Debate Performance",
y = "Pr(Election Won)",
title = "Logit Model" )
p_probit <- plot_predictions(
m2,
condition = list(
debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
party_major = mean(debate$party_major, na.rm = TRUE),
spend_millions = mean(debate$spend_millions, na.rm = TRUE)
) ) +
labs(
x = "Debate Performance",
y = "Pr(Election Won)",
title = "Probit Model" )
p_cloglog <- plot_predictions(
m3,
condition = list(
debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
party_major = mean(debate$party_major, na.rm = TRUE),
spend_millions = mean(debate$spend_millions, na.rm = TRUE)
)) +
labs(
x = "Debate Performance",
y = "Pr(Election Won)",
title = "Cloglog Model" )
p_lpm <- plot_predictions(
m4,
vcov = "HC3",
condition = list(
debate_perf = seq(debate_vals[1], debate_vals[2], length.out = 100),
party_major = mean(debate$party_major, na.rm = TRUE),
spend_millions = mean(debate$spend_millions, na.rm = TRUE)
)) +
labs(
x = "Debate Performance",
y = "Pr(Election Won)",
title = "LPM" )
# Combine plots using patchwork
(p_logit | p_probit) / (p_cloglog | p_lpm) &
theme(strip.text = element_blank(), legend.position = "none")
Even though the logit and probit models assume different link functions between the linear predictor and the predicted probability, their predictions barely differ. In the cloglog model, the predicted probability of winning stays comparatively low across most of the plotted (10th-90th percentile) range of debate performance and catches up only at high scores, reflecting its asymmetric link, which is most useful when the event (winning) occurs rarely. The LPM prediction is a straight line, whereas the other three are non-linear (S-shaped) curves.
AIC_BIC <- data.frame(
Model = c("Logit", "Probit", "Cloglog", "LPM"),
AIC = c(AIC(m1), AIC(m2), AIC(m3), AIC(m4)),
BIC = c(BIC(m1), BIC(m2), BIC(m3), BIC(m4))
)
AIC_BIC
library(caret)
set.seed(1931)
train_control <- trainControl(method = "cv", number = 10)
# keep the outcome numeric (0/1) so caret treats every model as regression and
# computes RMSE on the predicted probabilities, making all four comparable
cv_logit <- train(
election_won ~ debate_perf + party_major + spend_millions,
data = debate,
method = "glm",
family = binomial(link = "logit"),
trControl = train_control
)
cv_probit <- train(
election_won ~ debate_perf + party_major + spend_millions,
data = debate,
method = "glm",
family = binomial(link = "probit"),
trControl = train_control
)
cv_cloglog <- train(
election_won ~ debate_perf + party_major + spend_millions,
data = debate,
method = "glm",
family = binomial(link = "cloglog"),
trControl = train_control
)
cv_lpm <- train(
election_won ~ debate_perf + party_major + spend_millions,
data = debate,
method = "lm",
trControl = train_control
)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# Compare results
cv_results <- data.frame(
Model = c("Logit", "Probit", "Cloglog", "LPM"),
RMSE = c(cv_logit$results$RMSE,
cv_probit$results$RMSE,
cv_cloglog$results$RMSE,
cv_lpm$results$RMSE),
Rsquared = c(cv_logit$results$Rsquared,
cv_probit$results$Rsquared,
cv_cloglog$results$Rsquared,
cv_lpm$results$Rsquared)
)
cv_results
ANSWER: Lower values indicate better fit. The logit model has the lowest AIC (282.65) and BIC (296.71), indicating that, among all four models, the logit provides the best overall fit to the data.
ANSWER: Holding the other predictors constant at typical values, all models show that higher debate performance increases the probability of winning. The models differ in the shape of the relationship (logit and probit: symmetric S-shaped curves; cloglog: asymmetric curve; LPM: straight line).
In this dataset, election victories are not rare (mean = 0.379) and the relationship between debate performance and winning appears smooth, so the asymmetric cloglog link offers no advantage. Based on AIC and BIC, the LPM shows the poorest fit (though AIC comparisons across gaussian and binomial likelihoods should be read cautiously). The probit model yields nearly identical results to the logit, but the logit captures the relationship most accurately and efficiently according to AIC and BIC.
vars <- c("election_won", "debate_perf", "party_major", "spend_millions")
summary(debate[vars])
## election_won debate_perf party_major spend_millions
## Min. :0.000 Min. : 0.00 Min. :0.0000 Min. : 0.05028
## 1st Qu.:0.000 1st Qu.: 30.01 1st Qu.:0.0000 1st Qu.: 4.43004
## Median :0.000 Median : 39.88 Median :1.0000 Median : 5.66651
## Mean :0.379 Mean : 41.46 Mean :0.5323 Mean : 5.63427
## 3rd Qu.:1.000 3rd Qu.: 51.85 3rd Qu.:1.0000 3rd Qu.: 6.99534
## Max. :1.000 Max. :100.00 Max. :1.0000 Max. :10.53042
In class, we've been working with data from Salehyan et al. (2011). The data file is called `saleyhan_etal_2011.rda` and is included in the .zip you downloaded for the assignment. We've been working with only a subset of the model they employed to study whether or not rebels receive foreign support. Their full model specification includes:
supp1 ~
as.factor(muchweaker) + # rebels much weaker
as.factor(strong) + # rebels strong
cl + # strong central command
gs + # government support
tk + # transnational constituency
tc + # territorial control
moreactors + # more than one actor
riv + # rivalry
log(rgdpch) + # ln gdp per capita
cinc + # military capability
pol6 # democracy
#1. Load data
rebel <- get(load("saleyhan_etal_2011.rda"))
m1 = glm(
supp1 ~
muchweaker + # rebels much weaker
strong + # rebels strong
cl + # strong central command
gs + # government support
tk + # transnational constituency
tc + # territorial control
moreactors + # more than one actor
riv + # rivalry
log(rgdpch) + # ln gdp per capita
cinc + # military capability
pol6, # democracy
data = rebel,
family = binomial(link = "logit")
)
modelsummary(
list('Logit' = m1),
stars=c('*'=0.1, '**'=0.05),
coef_map = c(
'muchweaker' = 'weaker rebels',
'strong' = 'strong rebels',
'cl' = 'strong central command',
'gs' = 'government support',
'tk' = 'transnational constituency',
'tc' = 'territorial control',
'moreactors' = 'more than one actor',
'riv' = 'rivalry',
'log(rgdpch)' = 'ln(gdp)',
'cinc' = 'military capability',
'pol6' = 'democracy'
),
gof_omit = 'F|RMSE|Log.Lik.'
)
|                            | Logit   |
|----------------------------|---------|
| weaker rebels              | -0.498  |
|                            | (0.305) |
| strong rebels              | -1.096  |
|                            | (0.680) |
| strong central command     | -0.415  |
|                            | (0.316) |
| government support         | 0.841** |
|                            | (0.279) |
| transnational constituency | 1.025** |
|                            | (0.300) |
| territorial control        | 0.462   |
|                            | (0.286) |
| more than one actor        | 0.361   |
|                            | (0.275) |
| rivalry                    | 0.962** |
|                            | (0.271) |
| ln(gdp)                    | -0.179  |
|                            | (0.150) |
| military capability        | 6.631   |
|                            | (4.862) |
| democracy                  | -0.404  |
|                            | (0.340) |
| Num.Obs.                   | 308     |
| AIC                        | 386.5   |
| BIC                        | 431.2   |

* p < 0.1, ** p < 0.05
In a logit model, a one-unit change in X changes the log-odds by the size of the coefficient, so for interpretation we exponentiate. Holding other factors constant: if the government in the dyad receives foreign support, the odds of the rebels receiving foreign support are about 2.3 times higher (exp(0.841)). When a rivalry exists (1 vs. 0), the odds of rebels receiving foreign support are about 2.6 times higher (exp(0.962)). If a rebel group has a transnational constituency, the odds of receiving foreign support are about 2.8 times higher (exp(1.025)). Rebel groups with a strong central command have about 34% lower odds of foreign support than those without one (exp(-0.415) = 0.66), although this estimate is not statistically significant.
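A one-line sketch to reproduce these quantities directly from the fitted model:

# odds ratios with Wald 95% confidence intervals
round(exp(cbind(OR = coef(m1), confint.default(m1))), 2)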
Using `marginaleffects`, do the following and INTERPRET the results. Again make sure to organize the plots in an organized manner using something like `patchwork`.

cl_margin = plot_predictions(
m1,
by='cl',
newdata=rebel ) +
labs(
x='1=With strong central command(CC), 0=without strong CC',
y='Prob.(supp1=1)'
)
gs_margin = plot_predictions(
m1,
by='gs',
newdata=rebel ) +
labs(
x='1=With gov support, 0=without gov support ',
y='Prob.(supp1=1)'
)
tk_margin = plot_predictions(
m1,
by='tk',
newdata=rebel ) +
labs(
x='1=With transnational constituency(TC1), 0=without TC1',
y='Prob.(supp1=1)'
)
tc_margin = plot_predictions(
m1,
by='tc',
newdata=rebel ) +
labs(
x='1=With territorial control(TC2), 0=without TC2',
y='Prob.(supp1=1)'
)
(cl_margin | gs_margin) / (tk_margin | tc_margin) &
theme(strip.text = element_blank(), legend.position = "none")
(1) Centralized rebel organizations are less likely to attract foreign sponsors. (2) When the government receives foreign support, the probability that rebels also receive external support increases. (3) Rebels with transnational constituencies have a higher probability of foreign support than those without. (4) Rebels that control territory show a higher probability of foreign support than those that do not.
What is the effect of each of the following on the probability of a rebel group receiving foreign support using a typical case approach:
+ some change, to be defined by you, in the military capability score of the government (cinc)
+ the government in the rebel-gov dyad receiving foreign support (gs)
+ the rebel group having a transnational constituency (tk)
+ the government being engaged in an international rivalry (riv)
library(modelsummary)
library(marginaleffects)
library(patchwork)
library(ggplot2)
typical_grid <- datagrid(model = m1) # typical case: numeric covariates at their means, categorical at their modes
# 2) EFFECT OF cinc
q <- quantile(rebel$cinc, c(.25, .75), na.rm = TRUE)
pred_cinc <- predictions(
m1,
newdata = datagrid(model = m1, cinc = q),
type = "response"
) %>%
mutate(level = c("25th", "75th"))
p_cinc <- ggplot(pred_cinc, aes(x = level, y = estimate)) +
geom_point(size = 2) +
geom_line(aes(group = 1)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .1) +
labs(
x = "cinc",
y = "Pr(Foreign Support)",
title = "Effect of cinc"
) +
theme_minimal()
# 3 gs: 0→1
ap_gs <- avg_predictions(m1, variables = "gs", type = "response")
# contrast (1 vs 0), averaged over the sample distribution of covariates
eff_gs <- comparisons(
m1, variables = list(gs = c(0, 1)),
type = "response"
)
p_gs <- plot_predictions(m1, condition = "gs", type = "response") +
labs(
x = "Government receives foreign support",
y = "Pr(Foreign Support)",
title = "Effect of government foreign support"
) +
theme_minimal()
# 4) TRANSNATIONAL CONSTITUENCY (tk: 0→1)
ap_tk <- avg_predictions(m1, variables = "tk", type = "response")
eff_tk <- comparisons(
m1, variables = list(tk = c(0, 1)),
type = "response"
)
p_tk <- plot_predictions(m1, condition = "tk", type = "response") +
labs(
x = "transnational constituency",
y = "Pr(Foreign Support)",
title = "Effect of Transnational Constituency"
) +
theme_minimal()
# 5) riv: 0→1
ap_riv <- avg_predictions(m1, variables = "riv", type = "response")
eff_riv <- comparisons(
m1, variables = list(riv = c(0, 1)),
type = "response"
)
p_riv <- plot_predictions(m1, condition = "riv", type = "response") +
labs(
x = "Government in Rivalry",
y = "Pr(Foreign Support)",
title = "Effect of Rivalry"
) +
theme_minimal()
(p_cinc | p_gs) / (p_tk | p_riv) &
theme(strip.text = element_blank(), legend.position = "none")
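The blocks above average predictions over the observed covariate distribution; for a strict typical-case reading, the same 0-to-1 contrasts can be evaluated at a single row of typical values, as in this sketch:

# typical-case contrasts: each discrete change evaluated at one profile
# of typical covariate values (means/modes) rather than sample-averaged
comparisons(
  m1,
  variables = list(gs = c(0, 1), tk = c(0, 1), riv = c(0, 1)),
  newdata = datagrid(model = m1)
)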
Run a model of `conflict` with all the IVs specified above and then run another to test whether there is evidence of a conditional effect between `postBoko` and `ngbrConfCount` on `conflict`. Organize the results in a table.

#1. Load data
nigeria <- get(load("nigeria.rda"))
model1 = glm(
conflict ~
lagDV + # binary whether the sender actor initiated a conflict with the receiver in the previous year
lagRecip + # binary the receiver actor initiated a conflict with the sender in the previous year
govActor + # binary forces of the Nigerian government
postBoko + # binary Boko Haram has entered the conflict
elecYear + # binary election year
groupSpread + # binary: operates across multiple Nigerian administrative units
ngbrConfCount, # count conflicts in countries bordering Nigeria
data = nigeria,
family = binomial(link = "logit")
)
model2 = glm(
conflict ~
lagDV + # binary whether the sender actor initiated a conflict with the receiver in the previous year
lagRecip + # binary the receiver actor initiated a conflict with the sender in the previous year
govActor + # binary forces of the Nigerian government
postBoko + # binary Boko Haram has entered the conflict
elecYear + # binary election year
groupSpread + # binary: operates across multiple Nigerian administrative units
ngbrConfCount + # count conflicts in countries bordering Nigeria
postBoko*ngbrConfCount,
data = nigeria,
family = binomial(link = "logit") )
modelsummary(
list('Base Model'=model1, 'Interaction Model'=model2),
stars=c('*'=0.1, '**'=0.05),
coef_map = c(
'postBoko' = 'Boko Haram',
'ngbrConfCount' = 'neighbor conflicts',
'postBoko:ngbrConfCount' = 'Boko Haram x neighbor conflicts',
'lagDV' = 'conflict with the receiver in t-1',
'lagRecip' = 'conflict with the sender in t-1',
'govActor' = 'forces of the Nigerian government',
'elecYear' = 'election year',
'groupSpread' = 'beyond multiple administrative'
),
gof_omit = 'F|RMSE|Log.Lik.'
)
|                                   | Base Model | Interaction Model |
|-----------------------------------|------------|-------------------|
| Boko Haram                        | 0.853**    | -0.357            |
|                                   | (0.155)    | (0.254)           |
| neighbor conflicts                | 0.007**    | -0.016**          |
|                                   | (0.002)    | (0.005)           |
| Boko Haram x neighbor conflicts   |            | 0.029**           |
|                                   |            | (0.005)           |
| conflict with the receiver in t-1 | 3.154**    | 3.162**           |
|                                   | (0.203)    | (0.203)           |
| conflict with the sender in t-1   | 1.893**    | 1.920**           |
|                                   | (0.241)    | (0.239)           |
| forces of the Nigerian government | 1.354**    | 1.430**           |
|                                   | (0.606)    | (0.611)           |
| election year                     | -0.358*    | -0.407**          |
|                                   | (0.190)    | (0.192)           |
| beyond multiple administrative    | 0.945**    | 0.810*            |
|                                   | (0.420)    | (0.427)           |
| Num.Obs.                          | 16154      | 16154             |
| AIC                               | 1862.0     | 1831.9            |
| BIC                               | 1923.6     | 1901.1            |

* p < 0.1, ** p < 0.05
Assess the conditional effect between `postBoko` and `ngbrConfCount` on `conflict`.

vars <- c("postBoko", "ngbrConfCount")
summary(nigeria[vars])
## postBoko ngbrConfCount
## Min. :0.0000 Min. : 5.00
## 1st Qu.:0.0000 1st Qu.: 16.00
## Median :0.0000 Median : 27.00
## Mean :0.3914 Mean : 40.12
## 3rd Qu.:1.0000 3rd Qu.: 81.00
## Max. :1.0000 Max. :101.00
plot_predictions(
model2,
condition=list(
ngbrConfCount=seq(20, 80, 1),
postBoko=c(0, 1)
) ) +
labs(
x='conflicts in countries bordering Nigeria',
y='Prob.(conflict=1)',
color='postBoko',
fill='postBoko'
)
CONDITIONAL EFFECT: The positive and significant coefficient on the two-way interaction confirms a conditional effect: the neighboring countries' conflict environment matters much more after Boko Haram's emergence. Using the average-case approach, the figure plots the predicted probability that a dyad experiences conflict as a function of the number of conflicts in countries bordering Nigeria under two conditions: the red line before Boko Haram's entry and the blue line after. The shaded areas around the lines represent 95% confidence intervals.
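To quantify the conditional effect rather than eyeball it, one option (a sketch using marginaleffects) is to compare the average marginal effect of neighboring conflicts before and after Boko Haram's entry:

# AME of neighboring conflicts, separately for postBoko = 0 and 1
avg_slopes(model2, variables = 'ngbrConfCount', by = 'postBoko')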
EVALUATION OF MODEL: Both AIC and BIC decrease when the interaction term is added, suggesting that the model with the conditional effect provides a better fit to the data.
Consider a hypothetical scenario where a research organization conducts a survey on the adoption of a new green energy initiative among the citizens of two towns, 'EcoVille' and 'GrayTown'. The data for these results is stored in `green_data.rda`. The organization models the likelihood of a citizen supporting the green initiative (this is the dv and is labeled `green`) based on two covariates:

- `ecoville`: 1 if EcoVille, 0 if GrayTown
- `z_monthly_bill`: standardized monthly energy bill (continuous variable, with higher values indicating higher energy consumption)
Upon analysis, the organization finds that every citizen of EcoVille supports the green energy initiative while no one from GrayTown does.
ANSWER: Since one variable perfectly predicts the outcome, this scenario is a case of perfect (complete) separation. The coefficient on ecoville becomes extremely large and its standard error blows up, leading to unreliable inference.
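A tiny toy example (hypothetical data, not the assignment's) reproduces the pathology:

# complete separation in miniature: town perfectly predicts support, so the
# MLE diverges; glm() warns that fitted probabilities of 0 or 1 occurred
toy <- data.frame(town = rep(0:1, each = 20))
toy$support <- toy$town
sep_fit <- glm(support ~ town, data = toy, family = binomial)
coef(sep_fit) # enormous coefficient; summary(sep_fit) shows an enormous SE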
eco <- get(load("green_data.rda"))
mod1 = glm(
green ~
ecoville + # 1 if EcoVille
z_monthly_bill , # standardized monthly energy bill (continuous)
data = eco,
family = binomial(link = "logit")
)
## Warning: glm.fit: algorithm did not converge
modelsummary(
list('Base Model'=mod1),
stars=c('*'=0.1, '**'=0.05),
coef_map = c(
'ecoville' = 'EcoVille',
'z_monthly_bill' = 'energy bill' ),
gof_omit = 'F|RMSE|Log.Lik.' )
|             | Base Model  |
|-------------|-------------|
| EcoVille    | 53.132      |
|             | (51908.478) |
| energy bill | 0.000       |
|             | (25959.327) |
| Num.Obs.    | 435         |
| AIC         | 6.0         |
| BIC         | 18.2        |

* p < 0.1, ** p < 0.05
The standard logit fails because of complete separation. The penalized (Firth) logit adds a small bias-correction penalty to the likelihood, which yields finite, stable estimates: in Model 2 the EcoVille coefficient is large but comes with a finite standard error and is statistically significant.
library(logistf) # install.packages("logistf")
mod2 <- logistf(green ~ ecoville + z_monthly_bill, data = green_data)
## Warning in logistf(green ~ ecoville + z_monthly_bill, data = green_data):
## logistf.fit: Maximum number of iterations for full model exceeded. Try to
## increase the number of iterations or alter step size by passing
## 'logistf.control(maxit=..., maxstep=...)' to parameter control
modelsummary(
list('Base Model'=mod1, 'Penalized Model'=mod2 ),
stars=c('*'=0.1, '**'=0.05),
coef_map = c(
'ecoville' = 'EcoVille',
'z_monthly_bill' = 'energy bill' ),
gof_omit = 'F|RMSE|Log.Lik.')
|             | Base Model  | Penalized Model |
|-------------|-------------|-----------------|
| EcoVille    | 53.132      | 14.386**        |
|             | (51908.478) | (3.430)         |
| energy bill | 0.000       | -1.355          |
|             | (25959.327) | (0.977)         |
| Num.Obs.    | 435         | 435             |
| R2          | 1.000       |                 |
| AIC         | 6.0         |                 |
| BIC         | 18.2        |                 |

* p < 0.1, ** p < 0.05
Plot the effect of `z_monthly_bill` on the likelihood of supporting the green initiative in EcoVille. Discuss the results.

library(marginaleffects)
bill_vals <- with(
subset(green_data, ecoville == 1),
quantile(z_monthly_bill, probs = c(0.1, 0.9), na.rm = TRUE)
)
p_logit <- plot_predictions(
model = mod2,
condition = list(
z_monthly_bill = seq(bill_vals[1], bill_vals[2], length.out = 100),
ecoville = 1
),
type = "response"
) +
labs(
x = "Monthly Energy Bill",
y = "Pr(Supporting Green Initiative)",
title = "Penalized Logit Model"
) +
theme_minimal()
p_logit
Among citizens of EcoVille, the predicted probability of supporting the green initiative is near 1.0. However, the line slopes slightly downward, suggesting that the penalized logit estimates a weak negative association between monthly energy bills and the probability of support. In other words, citizens with higher energy consumption are slightly less likely to support the green energy initiative.
Consider a (simulated) study of cabinet survival across a set of very fun polities: Weimaria, Linzonia, Przeworska, Svolikia, Leviathan, Selectorate, Median Voterland, Endogeneity Bay, Hegemonia, Credibilia, Coalitionberg, and Backslidistan. The dataset for this exercise is stored in `cabinet.rda` and includes two objects.

`cabinet` (continuous-time data; one row per cabinet):

- `country`: polity name (see above),
- `time`: observed duration in years (censored at study end),
- `event`: 1 if the cabinet failed, 0 if right-censored,
- `coalition`: factor with levels Single / Coalition,
- `majority`: factor with levels Minority / Majority,
- `econ_growth`: GDP growth (%),
- `polarization`: ideological polarization (0–1),
- `crisis`: factor with levels No / Yes.

5.1 Nonparametric foundations
Plot Kaplan–Meier survival curves comparing Coalition vs Single cabinets.

library(survival)
library(survminer)
# Load in duration dataset
load('cabinet.rda')
# KM by coalition status
km_fit <- survfit(Surv(time, event) ~ coalition, data = cabinet)
# Plot with CIs, censoring marks, and numbers at risk
p_km <- ggsurvplot(
km_fit,
data = cabinet,
conf.int = TRUE,
censor = TRUE,
ggtheme = theme_minimal(),
legend.title = "Cabinet type",
legend.labs = levels(cabinet$coalition),
xlab = "time since cabinet formation (years)",
ylab = "survival probability"
)
p_km
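A log-rank test complements the KM plot by formally testing equality of the two survival curves:

# log-rank test for equality of survival functions across cabinet types
survdiff(Surv(time, event) ~ coalition, data = cabinet)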
5.2 Cox proportional hazards (continuous time)
Fit a Cox model with Efron ties:
\[ h(t|X) = h_0(t)\exp(\beta_1 \text{Coalition} + \beta_2 \text{Majority} + \beta_3 \text{EconGrowth} + \beta_4 \text{Polarization} + \beta_5 \text{Crisis}) \]
Report hazard ratios with 95% CIs.
Briefly interpret the `Coalition` and `Majority` effects (direction + a meaningful magnitude).
load("cabinet.rda")
cabinet_data <- cabinet
surv_obj <- Surv(time = cabinet_data$time,
event = cabinet_data$event)
cox1 <- coxph(surv_obj ~ coalition + majority + econ_growth +
polarization + crisis,
data = cabinet_data, ties = "efron", # Efron ties (coxph's default)
x = TRUE) # store design matrix
# Present results using modelsummary
models_list <- list(
"Base Model" = cox1
)
modelsummary(models_list,
exponentiate = TRUE, # show hazard ratios
title = "Table 2: Cox Proportional Hazards Models of Cabinet Duration",
stars = TRUE,
gof_map = c("nobs", "nevent", "logLik", "AIC", "BIC"),
coef_rename = c(
"coalitionCoalition" = "Coalition Government",
"majorityMajority" = "Majority Government",
"econ_growth" = "Economic Growth (%)",
"polarization" = "Polarization Index",
"crisisYes" = "External Crisis"
),
notes = "Hazard ratios reported. Values > 1 indicate higher failure risk.")
Table 2: Cox Proportional Hazards Models of Cabinet Duration

|                      | Base Model |
|----------------------|------------|
| Coalition Government | 1.605***   |
|                      | (0.188)    |
| Majority Government  | 0.321***   |
|                      | (0.039)    |
| Economic Growth (%)  | 0.863***   |
|                      | (0.025)    |
| Polarization Index   | 4.861***   |
|                      | (1.066)    |
| External Crisis      | 2.570***   |
|                      | (0.353)    |
| Num.Obs.             | 450        |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001. Hazard ratios reported; values > 1 indicate higher failure risk.
# Extract hazard ratios from full model
summary(cox1)$conf.int
## exp(coef) exp(-coef) lower .95 upper .95
## coalitionCoalition 1.6046268 0.6231979 1.2753524 2.0189143
## majorityMajority 0.3206228 3.1189297 0.2519175 0.4080661
## econ_growth 0.8627470 1.1590884 0.8157571 0.9124437
## polarization 4.8605101 0.2057397 3.1627560 7.4696115
## crisisYes 2.5698209 0.3891322 1.9631411 3.3639861
Coalition: Coalition cabinets have about 60% higher risk of failure at any given time than single-party cabinets.
Majority: Majority cabinets have roughly 68% lower hazard of failure than minority cabinets.
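Before treating these as constant effects, a quick proportional-hazards diagnostic based on scaled Schoenfeld residuals is worth running (a sketch using survival's built-in test):

# PH check: small p-values flag covariates whose effects drift over time
cox.zph(cox1)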
5.3 Discrete-time hazard (cloglog) — two baseline options
Estimate both specifications with a cloglog link using the person–period data `pp`:

(a) Time dummies (fully flexible baseline)

event ~ factor(t) + coalition + majority + econ_growth + polarization + crisis

(b) Carter & Signorino cubic of time (smooth baseline). Create \(\tilde t = t / \max(t)\) and include \(\tilde t, \tilde t^2, \tilde t^3\):

event ~ t_cs + t_cs2 + t_cs3 + coalition + majority + econ_growth + polarization + crisis

`pp` (person–period expansion for discrete-time modeling) contains:

- `id`, `country`,
- `t`: time since the cabinet formed (integer periods; resets at the start of each spell),
- `event`: period failure indicator (1 only in the failure period).

Important: In discrete-time models, `t` indexes duration (time since entering the risk set), not calendar year. If you wanted calendar-year shocks, you'd add separate year indicators (not provided here).
load("cabinet.rda")
person_period_data <- pp
head(person_period_data, 15)
t_max <- max(person_period_data$t, na.rm = TRUE)
person_period_data$t_cs <- person_period_data$t / t_max
person_period_data$t_cs2 <- person_period_data$t_cs^2
person_period_data$t_cs3 <- person_period_data$t_cs^3
# (a) fully flexible baseline via duration dummies
a_cloglog <- glm(event ~ factor(t) + coalition + majority + econ_growth + polarization + crisis ,
data = person_period_data,
family = binomial(link = "cloglog"))
# (b) Carter & Signorino smooth baseline via cubic in scaled time
b_cloglog <- glm(
event ~ t_cs + t_cs2 + t_cs3 + coalition + majority + econ_growth + polarization + crisis,
data = person_period_data,
family = binomial(link = "cloglog")
)
# display results
modelsummary(
list("Cloglog - Time Dummies" = a_cloglog,
"Cloglog - Cubic" = b_cloglog),
estimate = "{estimate} ({std.error})",
statistic = NULL,
coef_omit = "factor\\(time\\)",
gof_map = c("nobs", "aic", "bic", "logLik"),
title = "Discrete-Time Duration Model"
)
Discrete-Time Duration Model

|                    | Cloglog - Time Dummies | Cloglog - Cubic |
|--------------------|------------------------|-----------------|
| (Intercept)        | -4.027 (0.314)         | -4.349 (0.398)  |
| factor(t)2         | 0.891 (0.323)          |                 |
| factor(t)3         | 1.203 (0.316)          |                 |
| factor(t)4         | 1.376 (0.316)          |                 |
| factor(t)5         | 1.072 (0.341)          |                 |
| factor(t)6         | 1.655 (0.318)          |                 |
| factor(t)7         | 1.308 (0.348)          |                 |
| factor(t)8         | 1.554 (0.341)          |                 |
| factor(t)9         | 1.584 (0.352)          |                 |
| factor(t)10        | 1.832 (0.347)          |                 |
| factor(t)11        | 2.070 (0.344)          |                 |
| factor(t)12        | 1.688 (0.391)          |                 |
| coalitionCoalition | 0.488 (0.117)          | 0.491 (0.117)   |
| majorityMajority   | -1.141 (0.123)         | -1.143 (0.123)  |
| econ_growth        | -0.152 (0.029)         | -0.152 (0.029)  |
| polarization       | 1.600 (0.220)          | 1.598 (0.220)   |
| crisisYes          | 0.944 (0.138)          | 0.946 (0.138)   |
| t_cs               |                        | 8.310 (2.662)   |
| t_cs2              |                        | -12.658 (5.584) |
| t_cs3              |                        | 6.674 (3.424)   |
| Num.Obs.           | 3629                   | 3629            |
| AIC                | 1883.3                 | 1878.8          |
| BIC                | 1988.6                 | 1934.6          |
| Log.Lik.           | -924.625               | -930.403        |
exp(0.491)
## [1] 1.633949
exp(-1.143)
## [1] 0.318861
Then: compare the exponentiated coefficients on `coalition` and `majority` to the Cox model's hazard ratios, noting that the baseline is captured by `factor(t)` or the C&S cubic.

Cox model hazard ratios: Coalition 1.605, Majority 0.321. Cloglog exponentiated coefficients: Coalition 1.634, Majority 0.319. (1) Same direction; the cloglog coalition effect is slightly stronger. (2) Same direction; the cloglog majority effect is slightly smaller (i.e., slightly more protective).
The cloglog coefficients retain a proportional-hazards interpretation because the link function implies that covariates multiply the baseline hazard, while the baseline itself is captured flexibly, either by duration dummies or by a smooth cubic.
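A short sketch to pull those interval hazard ratios, with Wald 95% CIs, straight from the cubic-baseline fit:

# exponentiated cloglog coefficients are (interval) hazard ratios
hr_tab <- exp(cbind(HR = coef(b_cloglog), confint.default(b_cloglog)))
round(hr_tab[c('coalitionCoalition', 'majorityMajority'), ], 3)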
When the number of periods is moderate (e.g., T < 30) and the sample is large enough to estimate all the time dummies, specification (a), the factor(t) model, is preferred. When the time dimension is long or the sample is small, parsimony favors specification (b), the cubic model.
Finally, using either your Cox model (5.2) or one of your discrete-time models (5.3), produce predicted survival curves (Cox) or period hazards (cloglog) for four scenarios: (Coalition vs Single) × (Majority vs Minority), holding other covariates at typical values. Provide two substantive takeaways (direction + a meaningful quantity like median survival or survival at 6/10 years), and comment on whether PH looks plausible for `Coalition`. If not, name one adjustment (e.g., time interaction, stratification, or preferring discrete-time) and justify briefly.
library(survival)
surv_obj <- Surv(time = cabinet_data$time,
event = cabinet_data$event)
weibull_fit <- survreg(surv_obj ~ coalition + majority,
data = cabinet_data,
dist = "weibull")
shape_param <- 1/weibull_fit$scale
aft_coefs <- coef(weibull_fit)[-1]
time_ratios <- exp(aft_coefs)
aft_se <- sqrt(diag(vcov(weibull_fit)))[2:3] # SEs for coalition and majority (position 1 is the intercept)
weibull_summary <- data.frame(
Variable = names(aft_coefs),
AFT_Coef = as.numeric(aft_coefs),
Time_Ratio = as.numeric(time_ratios),
SE = as.numeric(aft_se),
p_value = 2 * pnorm(-abs(aft_coefs/aft_se)),
row.names = NULL
)
weibull_summary[, -1] <- round(weibull_summary[, -1], 3)
print(weibull_summary)
## Variable AFT_Coef Time_Ratio SE p_value
## 1 coalitionCoalition -0.262 0.77 0.084 0.002
## 2 majorityMajority 0.636 1.89 0.086 0.000
# test for duration dependence: shape = 1/scale = exp(-Log(scale)), so by the
# delta method SE(shape) ~ shape * SE(Log(scale))
log_scale_se <- sqrt(vcov(weibull_fit)["Log(scale)", "Log(scale)"])
shape_se <- shape_param * log_scale_se
z_shape <- (shape_param - 1) / shape_se
p_shape <- 2 * pnorm(-abs(z_shape))
# build the four prediction scenarios with marginaleffects::datagrid
scenarios_weibull <- datagrid(
coalition = c("Single", "Coalition"),
majority = c("Majority", "Minority"),
model = weibull_fit
)
# get predicted survival times
pred_times <- predict(weibull_fit, newdata = scenarios_weibull, type = "response")
scenarios_weibull$predicted_duration <- pred_times
# display predicted durations (exp(linear predictor), i.e., the Weibull scale parameter)
print(scenarios_weibull[, c("coalition", "majority", "predicted_duration")])
## coalition majority predicted_duration
## 1 Single Minority 8.410304
## 2 Single Majority 15.892558
## 3 Coalition Minority 6.474531
## 4 Coalition Majority 12.234619
# create survival curves for visualization
# note: survreg models need different approach for survival curves
time_grid <- seq(0.1, 12, 0.5)
plot_data <- list()
for(i in 1:nrow(scenarios_weibull)) {
# survival probabilities implied by the Weibull AFT fit:
# S(t) = 1 - pweibull(t, shape = 1/scale, scale = exp(linear predictor))
lp <- predict(weibull_fit, newdata = scenarios_weibull[i,], type = "lp")
surv_probs <- 1 - pweibull(time_grid,
shape = 1/weibull_fit$scale,
scale = exp(lp))
plot_data[[i]] <- data.frame(
time = time_grid,
survival = surv_probs,
coalition = scenarios_weibull$coalition[i],
majority = scenarios_weibull$majority[i],
group = paste0("Coalition=", scenarios_weibull$coalition[i],
", majority=", scenarios_weibull$majority[i])
)
}
plot_df <- do.call(rbind, plot_data)
# create enhanced visualization
ggplot(plot_df, aes(x = time, y = survival, color = group)) +
geom_line(linewidth = 1.2) +
labs(title = "Weibull Model: Predicted Survival Curves",
subtitle = paste("Shape parameter =", round(1/weibull_fit$scale, 2),
"indicating", ifelse(1/weibull_fit$scale > 1,
"increasing hazard",
"decreasing hazard")),
x = "Time (years)",
y = "Survival Probability",
color = "Scenario") +
theme_minimal() +
facet_wrap(~coalition) # coalition already carries 'Single'/'Coalition' labels
In the Weibull AFT model, positive coefficients indicate longer survival. The coalition coefficient of -0.262 translates to a time ratio of 0.77: coalition cabinets survive about 23% less long than single-party cabinets. Averaging across majority status, the predicted durations show single-party cabinets lasting approximately 12 years under typical conditions, compared to about 9.3 years for coalition cabinets.
The shape parameter of 1.39 indicates positive duration dependence: the hazard increases over time, so cabinets become increasingly vulnerable the longer they survive. To the extent this rising hazard operates differently across cabinet types, the hazard ratio for Coalition would not be constant, meaning the PH assumption may be violated.
Adding a coalition and time interaction allows the effect of being a coalition government to vary smoothly over the cabinet’s lifespan. This accommodates a non-proportional (time-varying) effect of coalition status.
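A minimal sketch of that adjustment in the discrete-time setup, assuming the person-period objects from 5.3 (person_period_data, t_cs, etc.) are still in memory:

# let the coalition effect vary with duration; a nonzero interaction
# indicates a non-proportional (time-varying) coalition effect
tv_cloglog <- glm(
  event ~ t_cs + t_cs2 + t_cs3 + coalition*t_cs + majority +
    econ_growth + polarization + crisis,
  data = person_period_data,
  family = binomial(link = 'cloglog')
)
# inspect the interaction row(s) of the coefficient table
coef(summary(tv_cloglog))[grep(':', rownames(coef(summary(tv_cloglog)))), , drop = FALSE]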