CSS_205_PS4

POLI 271 Problem set 4

1. Fearon & Laitin

You will need to download the file flmdw.csv and load it into your workspace. This is the data needed to replicate the analysis in Fearon & Laitin’s 2003 APSR paper. The data are country-year data, but we will be ignoring the time component for the purposes of this exercise.

You will analyze the Fearon & Laitin data on civil war onset, specifically, the binary onset variable.

Examine the distribution of onset. Is this a “rare event”? What options might you consider?

Based on the distribution of onset, I would say that this is a rare event, as the number of 1s is relatively small compared to the number of total observations n. Some options I might consider to mitigate this are to alter the assumed linked function by using an asymmetric function, such as the cloglog approach or generalized extreme value distribution approach.

library(ggplot2)
library(gt)
library(modelsummary)
library(stargazer)


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

# Read in dataset
df <- read.csv('flmdw-1.csv')

onset_hist <- ggplot(df, aes(x = onset)) +
  geom_bar() 
onset_hist

Fit at least four models that predict binary onset. Analyze only complete cases and make sure that all three models are fit to the same observations but be careful how you remove NAs

Model 1 should be a logistic regression including only an intercept, GDP per capita (gdpenl), population (lpopl1) and percent mountainous (lmtnest).

which(is.na(df))

integer(0)

model1 <- glm(onset ~ gdpenl + lpopl1 + lmtnest, family=binomial(link='logit'), data = df)

stargazer(model1, type="text",  out="model1.txt")


=============================================
                      Dependent variable:    
                  ---------------------------
                             onset           
---------------------------------------------
gdpenl                     -0.295***         
                            (0.062)          
                                             
lpopl1                     0.236***          
                            (0.062)          
                                             
lmtnest                     0.178**          
                            (0.080)          
                                             
Constant                   -6.042***         
                            (0.612)          
                                             
---------------------------------------------
Observations                 6,402           
Log Likelihood             -506.525          
Akaike Inf. Crit.          1,021.051         
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

Model 2 should be a probit regression but otherwise identical to model 1.

model2 <- glm(onset ~ gdpenl + lpopl1 + lmtnest, family=binomial(link='probit'), data = df)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

stargazer(model2, type="text",  out="model2.txt")


=============================================
                      Dependent variable:    
                  ---------------------------
                             onset           
---------------------------------------------
gdpenl                     -0.114***         
                            (0.023)          
                                             
lpopl1                     0.100***          
                            (0.027)          
                                             
lmtnest                     0.075**          
                            (0.032)          
                                             
Constant                   -2.966***         
                            (0.256)          
                                             
---------------------------------------------
Observations                 6,402           
Log Likelihood             -506.218          
Akaike Inf. Crit.          1,020.436         
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

Model 3 should be a probit regression that adds a dummy variable for whether a country is an oil exporter (Oil), democracy (polity2l), and religious fractionalization (relfrac).

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# add dummy variable

  #relfrac
#df_recoded <- df %>%
#  mutate(relfrac = case_when(
#    relfrac >= 0 & relfrac < 0.5 ~ 0,
#   relfrac >= 0.5 & relfrac <= 1 ~ 1,
#   TRUE ~ relfrac),
#  polity2l = case_when(
#    polity2l >= -10 & relfrac < 0 ~ 0,
#   polity2l >= 0 & relfrac <= 10 ~ 1,
#   TRUE ~ relfrac)
#   )

#df_recoded$polity2l <- as.numeric(df_recoded$polity2l)

# model 3 probit regression

## add predictors from model 2
model3<- glm(onset ~ gdpenl + lpopl1 + lmtnest + Oil + relfrac + polity2l, family=binomial(link='probit'), data = df)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

stargazer(model3, type="text",  out="model3.txt")


=============================================
                      Dependent variable:    
                  ---------------------------
                             onset           
---------------------------------------------
gdpenl                     -0.140***         
                            (0.027)          
                                             
lpopl1                     0.088***          
                            (0.027)          
                                             
lmtnest                     0.085**          
                            (0.033)          
                                             
Oil                        0.404***          
                            (0.117)          
                                             
relfrac                      0.222           
                            (0.199)          
                                             
polity2l                    0.013**          
                            (0.007)          
                                             
Constant                   -2.950***         
                            (0.274)          
                                             
---------------------------------------------
Observations                 6,402           
Log Likelihood             -499.506          
Akaike Inf. Crit.          1,013.011         
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

Model 4 should be a probit regression that includes an interaction between polity2l and relfrac.

# model 4 probit regression

model4<- glm(onset ~ gdpenl + lpopl1 + lmtnest + Oil + polity2l + relfrac + I(polity2l*relfrac), family=binomial(link='probit'), data = df)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

stargazer(model4, type="text",  out="model4.txt")


=================================================
                          Dependent variable:    
                      ---------------------------
                                 onset           
-------------------------------------------------
gdpenl                         -0.147***         
                                (0.027)          
                                                 
lpopl1                         0.094***          
                                (0.027)          
                                                 
lmtnest                        0.090***          
                                (0.033)          
                                                 
Oil                            0.403***          
                                (0.117)          
                                                 
polity2l                        -0.006           
                                (0.013)          
                                                 
relfrac                          0.318           
                                (0.202)          
                                                 
I(polity2l * relfrac)           0.054*           
                                (0.029)          
                                                 
Constant                       -3.038***         
                                (0.280)          
                                                 
-------------------------------------------------
Observations                     6,402           
Log Likelihood                 -497.828          
Akaike Inf. Crit.              1,011.656         
=================================================
Note:                 *p<0.1; **p<0.05; ***p<0.01

Feel free to fit additional models that explore different distributional assumptions or covariates. Present the results from all four models in a well-formatted, publication-quality regression table. Write a paragraph summarizing the information you put in the table. The gt, modelsummary and stargazer packages may help. Based on in-sample model fit, which model(s) appears to be the most promising?

The table presents logistic and probit regression results predicting onset. Based on the following table, we can see that GDP per capita (gdpenl) has a significant negative effect, while population (lpopl1), percent mountainous (lmtnest), and oil (Oil) increase the likelihood of onset. Religious fractionalization (relfrac) is not significant, but its interaction with polity2l in model (4) is weakly significant.

The model that seems to be the most promising is model 4. This is because it has the lowest AIC, which indicates that it best balances goodness of fit and model complexity. This suggests that it explains the data well without overfitting.

stargazer(model1, model2, model3, model4, type="text")


=============================================================
                                Dependent variable:          
                      ---------------------------------------
                                       onset                 
                      logistic             probit            
                         (1)       (2)       (3)       (4)   
-------------------------------------------------------------
gdpenl                -0.295*** -0.114*** -0.140*** -0.147***
                       (0.062)   (0.023)   (0.027)   (0.027) 
                                                             
lpopl1                0.236***  0.100***  0.088***  0.094*** 
                       (0.062)   (0.027)   (0.027)   (0.027) 
                                                             
lmtnest                0.178**   0.075**   0.085**  0.090*** 
                       (0.080)   (0.032)   (0.033)   (0.033) 
                                                             
Oil                                       0.404***  0.403*** 
                                           (0.117)   (0.117) 
                                                             
relfrac                                     0.222     0.318  
                                           (0.199)   (0.202) 
                                                             
I(polity2l * relfrac)                                0.054*  
                                                     (0.029) 
                                                             
polity2l                                   0.013**   -0.006  
                                           (0.007)   (0.013) 
                                                             
Constant              -6.042*** -2.966*** -2.950*** -3.038***
                       (0.612)   (0.256)   (0.274)   (0.280) 
                                                             
-------------------------------------------------------------
Observations            6,402     6,402     6,402     6,402  
Log Likelihood        -506.525  -506.218  -499.506  -497.828 
Akaike Inf. Crit.     1,021.051 1,020.436 1,013.011 1,011.656
=============================================================
Note:                             *p<0.1; **p<0.05; ***p<0.01

Develop ROC plot and separation plots comparing the in-sample fit of your estimated models. The ROCR library may help

library(ROCR)
library(separationplot)

Loading required package: RColorBrewer

Loading required package: Hmisc


Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following object is masked from 'package:modelsummary':

    Mean

The following object is masked from 'package:gt':

    html

The following objects are masked from 'package:base':

    format.pval, units

Loading required package: MASS


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

Loading required package: foreign

#model 1

#roc plot
pred1 <- prediction(predict(model1, type = "response"), df$onset)  
perf1 <- performance(pred1, measure = "tpr", x.measure = "fpr")
plot(perf1, col=rainbow(10))

#separation plot
separationplot(predict(model1, type = "response"), df$onset,
  heading = "Model 1 Separation Plot",
  line = TRUE)

#model 2

pred2 <- prediction(predict(model2, type = "response"), df$onset)  
perf2 <- performance(pred2, measure = "tpr", x.measure = "fpr")
plot(perf2, col=rainbow(10))

#separation plot
separationplot(predict(model2, type = "response"), df$onset,
  heading = "Model 2 Separation Plot",
  line = TRUE)

#model 3

pred3 <- prediction(predict(model3, type = "response"), df$onset)  
perf3 <- performance(pred3, measure = "tpr", x.measure = "fpr")
plot(perf3, col=rainbow(10))

#separation plot
separationplot(predict(model3, type = "response"), df$onset,
  heading = "Model 3 Separation Plot",
  line = TRUE)

#model 4

pred4 <- prediction(predict(model4, type = "response"), df$onset)  
perf4 <- performance(pred4, measure = "tpr", x.measure = "fpr")
plot(perf4, col=rainbow(10))

#separation plot
separationplot(predict(model4, type = "response"), df$onset,
  heading = "Model 4 Separation Plot",
  line = TRUE)

Using model 4 and a likelihood ratio test, what is the evidence that we can leave polity2l out of the model entirely? In other words, test the hypothesis that \[ (\beta*{dem} =* \beta{dem frac} = 0) \].

When conducting a likelihood ratio test with model 4 and model 4 without polity2l and its interaction, we can compare the log likelihoods of each to determine whether adding polity2l significantly improves the model. We can see that the original model 4 has a log likelihood closer to 0, which indicates that this model might be a better fit. Therefore, polity2l should be included in the model.

library(lmtest)

Loading required package: zoo


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

model4_nopolity<- glm(onset ~ gdpenl + lpopl1 + lmtnest + Oil + relfrac, family=binomial(link='probit'), data = df)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

lrtest(model4, model4_nopolity)

Likelihood ratio test

Model 1: onset ~ gdpenl + lpopl1 + lmtnest + Oil + polity2l + relfrac + 
    I(polity2l * relfrac)
Model 2: onset ~ gdpenl + lpopl1 + lmtnest + Oil + relfrac
  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   8 -497.83                       
2   6 -501.51 -2 7.3558    0.02528 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Undertake a 10-fold cross-validation of each of these models. Construct an ROC plot of the out of sample predictive performance of each of the models. To do this you can write code to create the 10 folds or you can try and work with the tools in many R libraries that implement cross-validation. These include: cvTools, caret, tidymodels/resampling. On the basis of this analysis, which model(s) do you prefer?

Based on the ROC plot, I would say that I prefer model 4, because it seems to have more area under the curve.

library(caret)

Loading required package: lattice

df$onset <- as.factor(df$onset)

#model 1

train_control <- trainControl(method = "cv", 
                              number = 10) 

model1_cv <- train(onset ~ gdpenl + lpopl1 + lmtnest, data = df,  
                 method = 'glm', 
                 family = binomial(link = "logit"),
                 trControl = train_control)

#roc plot
pred1 <- prediction(predict(model1_cv, df, type = "prob")[,2], df$onset)  
perf1 <- performance(pred1, measure = "tpr", x.measure = "fpr")
plot(perf1, col=rainbow(10))

#model 2

train_control <- trainControl(method = "cv", 
                              number = 10) 

model2_cv <- train(onset ~ gdpenl + lpopl1 + lmtnest, data = df,  
                 method = 'glm', 
                 family = binomial(link = "probit"),
                 trControl = train_control)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#roc plot
pred2 <- prediction(predict(model2_cv, df, type = "prob")[,2], df$onset)  
perf2 <- performance(pred2, measure = "tpr", x.measure = "fpr")
plot(perf2, col=rainbow(10))

#model 3

train_control <- trainControl(method = "cv", 
                              number = 10) 

model3_cv <- train(onset ~ gdpenl + lpopl1 + lmtnest + Oil + relfrac + polity2l, data = df,  
                 method = 'glm', 
                 family = binomial(link = "probit"),
                 trControl = train_control)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#roc plot
pred3 <- prediction(predict(model3_cv, df, type = "prob")[,2], df$onset)  
perf3 <- performance(pred3, measure = "tpr", x.measure = "fpr")
plot(perf3, col=rainbow(10))

#model 4

train_control <- trainControl(method = "cv", 
                              number = 10) 

model4_cv <- train(onset ~ gdpenl + lpopl1 + lmtnest + Oil + polity2l + relfrac + I(polity2l*relfrac), data = df,  
                 method = 'glm', 
                 family = binomial(link = "probit"),
                 trControl = train_control)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#roc plot
pred4 <- prediction(predict(model4_cv, df, type = "prob")[,2], df$onset)  
perf4 <- performance(pred4, measure = "tpr", x.measure = "fpr")
plot(perf4, col=rainbow(10))

Interpret the relationship between civil war onset and democracy in the preferred model. Use a visual presentation of the model predictions and be sure to display the estimation uncertainty around the expected values produced by the model. Be sure to clearly state the scenarios you chose and why. If the best performing model includes the interaction term then provide a plot that interprets that conditional relationship.

The preferred model is model 4, which includes the interaction term. Based on the plot, we can see that the predicted probabilities of onset are highly variable across different democracy types, as we can see in the spikes at polity2l score = -6, 0, and 3.

# Get predicted probabilities and standard errors
df$predicted_prob <- predict(model4, df, type = "response") 

# Compute confidence intervals for predictions
df$se_fit <- predict(model4, df, type = "response", se.fit = TRUE)$se.fit  # Standard error
df$upper_ci <- df$predicted_prob + 1.96 * df$se_fit  # 95% CI Upper
df$lower_ci <- df$predicted_prob - 1.96 * df$se_fit  # 95% CI Lower

# Ensure probabilities stay within valid range [0,1]
df <- df %>%
  mutate(upper_ci = pmin(upper_ci, 1), lower_ci = pmax(lower_ci, 0))

# Visualization: Line plot with confidence bands
ggplot(df, aes(x = polity2l, y = predicted_prob)) + 
  geom_line(color = "blue") + 
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha = 0.2, fill = "blue") +
  labs(title = "Predicted Probability of Onset with Confidence Intervals",
       x = "Polity2 Score",
       y = "Predicted Probability of Onset") +
  theme_minimal()

Appendix

I used Open AI’s Chat GPT tool in this assignment, specifically only the last question. I struggled with generating a visualization of the model, especially with computing the confidence intervals. Also, I used this tool to debug some of the errors I was running into, such as “Error in predict(model4_cv, df, type =”raw”, se.fit = TRUE)$se.fit : $ operator is invalid for atomic vectors”. I initially tried to Google these errors, but I struggled to find a solution that was applicable for the purposes of this assignment. In this case, I found the tool very helpful because it allowed me to generate a visualization for the model, and understand why I was receiving the errors. I’m relatively unfamiliar with R, so some of the error messages were unfamiliar and difficult to understand. Chat GPT helped me understand how to tweak the code it gave me to mitigate these errors.

Link to exchange: https://chatgpt.com/share/67a32d8e-36dc-800b-8296-8efc821ad5b9