You will need to download the file flmdw.csv and load it into your workspace. This is the data needed to replicate the analysis in Fearon & Laitin’s 2003 APSR paper. The data are country-year data, but we will be ignoring the time component for the purposes of this exercise.
You will analyze the Fearon & Laitin data on civil war onset, specifically, the binary onset variable.
Examine the distribution of onset. Is this a “rare event”? What options might you consider?
Based on the distribution of onset, I would say that this is a rare event, as the number of 1s is relatively small compared to the number of total observations n. Some options I might consider to mitigate this are to alter the assumed linked function by using an asymmetric function, such as the cloglog approach or generalized extreme value distribution approach.
Fit at least four models that predict binary onset. Analyze only complete cases and make sure that all three models are fit to the same observations but be careful how you remove NAs
Model 1 should be a logistic regression including only an intercept, GDP per capita (gdpenl), population (lpopl1) and percent mountainous (lmtnest).
Model 3 should be a probit regression that adds a dummy variable for whether a country is an oil exporter (Oil), democracy (polity2l), and religious fractionalization (relfrac).
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Feel free to fit additional models that explore different distributional assumptions or covariates. Present the results from all four models in a well-formatted, publication-quality regression table. Write a paragraph summarizing the information you put in the table. The gt, modelsummary and stargazer packages may help. Based on in-sample model fit, which model(s) appears to be the most promising?
The table presents logistic and probit regression results predicting onset. Based on the following table, we can see that GDP per capita (gdpenl) has a significant negative effect, while population (lpopl1), percent mountainous (lmtnest), and oil (Oil) increase the likelihood of onset. Religious fractionalization (relfrac) is not significant, but its interaction with polity2l in model (4) is weakly significant.
The model that seems to be the most promising is model 4. This is because it has the lowest AIC, which indicates that it best balances goodness of fit and model complexity. This suggests that it explains the data well without overfitting.
#separation plotseparationplot(predict(model4, type ="response"), df$onset,heading ="Model 4 Separation Plot",line =TRUE)
Using model 4 and a likelihood ratio test, what is the evidence that we can leave polity2l out of the model entirely? In other words, test the hypothesis that \[ (\beta*{dem} =* \beta{dem frac} = 0) \].
When conducting a likelihood ratio test with model 4 and model 4 without polity2l and its interaction, we can compare the log likelihoods of each to determine whether adding polity2l significantly improves the model. We can see that the original model 4 has a log likelihood closer to 0, which indicates that this model might be a better fit. Therefore, polity2l should be included in the model.
library(lmtest)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Undertake a 10-fold cross-validation of each of these models. Construct an ROC plot of the out of sample predictive performance of each of the models. To do this you can write code to create the 10 folds or you can try and work with the tools in many R libraries that implement cross-validation. These include: cvTools, caret, tidymodels/resampling. On the basis of this analysis, which model(s) do you prefer?
Based on the ROC plot, I would say that I prefer model 4, because it seems to have more area under the curve.
library(caret)
Loading required package: lattice
df$onset <-as.factor(df$onset)#model 1train_control <-trainControl(method ="cv", number =10) model1_cv <-train(onset ~ gdpenl + lpopl1 + lmtnest, data = df, method ='glm', family =binomial(link ="logit"),trControl = train_control)#roc plotpred1 <-prediction(predict(model1_cv, df, type ="prob")[,2], df$onset) perf1 <-performance(pred1, measure ="tpr", x.measure ="fpr")plot(perf1, col=rainbow(10))
#model 2train_control <-trainControl(method ="cv", number =10) model2_cv <-train(onset ~ gdpenl + lpopl1 + lmtnest, data = df, method ='glm', family =binomial(link ="probit"),trControl = train_control)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Interpret the relationship between civil war onset and democracy in the preferred model. Use a visual presentation of the model predictions and be sure to display the estimation uncertainty around the expected values produced by the model. Be sure to clearly state the scenarios you chose and why. If the best performing model includes the interaction term then provide a plot that interprets that conditional relationship.
The preferred model is model 4, which includes the interaction term. Based on the plot, we can see that the predicted probabilities of onset are highly variable across different democracy types, as we can see in the spikes at polity2l score = -6, 0, and 3.
# Get predicted probabilities and standard errorsdf$predicted_prob <-predict(model4, df, type ="response") # Compute confidence intervals for predictionsdf$se_fit <-predict(model4, df, type ="response", se.fit =TRUE)$se.fit # Standard errordf$upper_ci <- df$predicted_prob +1.96* df$se_fit # 95% CI Upperdf$lower_ci <- df$predicted_prob -1.96* df$se_fit # 95% CI Lower# Ensure probabilities stay within valid range [0,1]df <- df %>%mutate(upper_ci =pmin(upper_ci, 1), lower_ci =pmax(lower_ci, 0))# Visualization: Line plot with confidence bandsggplot(df, aes(x = polity2l, y = predicted_prob)) +geom_line(color ="blue") +geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha =0.2, fill ="blue") +labs(title ="Predicted Probability of Onset with Confidence Intervals",x ="Polity2 Score",y ="Predicted Probability of Onset") +theme_minimal()
Appendix
I used Open AI’s Chat GPT tool in this assignment, specifically only the last question. I struggled with generating a visualization of the model, especially with computing the confidence intervals. Also, I used this tool to debug some of the errors I was running into, such as “Error in predict(model4_cv, df, type =”raw”, se.fit = TRUE)$se.fit : $ operator is invalid for atomic vectors”. I initially tried to Google these errors, but I struggled to find a solution that was applicable for the purposes of this assignment. In this case, I found the tool very helpful because it allowed me to generate a visualization for the model, and understand why I was receiving the errors. I’m relatively unfamiliar with R, so some of the error messages were unfamiliar and difficult to understand. Chat GPT helped me understand how to tweak the code it gave me to mitigate these errors.