Major Episodes of Political Violence

The data is a table of armed conflicts around the world from 1946 to 2016. The variables are the year the conflict began, the year it ended (if it has), the type of conflict, the magnitude of its impact on the society or system, the location of the episode, and the estimated number of deaths.

The question addressed is whether there is a relationship between Magnitude and the estimated number of Deaths and, if so, whether that relationship is linear.

library(XML)
library(knitr)

# pull the first HTML table from the page, keeping text as character
war = "http://www.systemicpeace.org/warlist/warlist.htm"
war.table = readHTMLTable(war, header = TRUE, which = 1, stringsAsFactors = FALSE)
# the real column names sit in row 5; the data starts in row 6
names(war.table) = as.character(unlist(war.table[5, ]))
war.table = war.table[6:nrow(war.table), ]
# convert to numeric (entries that are not numbers become NA)
war.table$Mag = as.numeric(war.table$Mag)
war.table$Deaths = as.numeric(war.table$Deaths)

kable(head(war.table))
|   |Begin |End  |Type |Mag |States Directly Involved |Brief Description                                     | Deaths|References    |
|:--|:-----|:----|:----|---:|:------------------------|:-----------------------------------------------------|------:|:-------------|
|6  |1945  |1946 |IN   |   2|Indonesia                |Independence                                          |  10000|a b c f g h   |
|7  |1945  |1947 |EV   |   2|Iran                     |Azerbaijani and Kurd rebellions                       |   2000|c f g         |
|8  |1945  |1949 |CW   |   5|Greece                   |Greek civil war                                       | 150000|a b c f g h o |
|9  |1945  |1954 |IN   |   6|Vietnam                  |Indochina independence                                | 500000|a b c f g h   |
|10 |1946  |1954 |IW   |   2|France4                  |Indochina independence                                |  30000|a b c f g h   |
|11 |1946  |*    |CV   |   1|Bolivia                  |President Villarroel ousted by general armed uprising |   1000|c f o         |

Look at Data

# look at summary
summary(war.table)
##     Begin               End                Type                Mag       
##  Length:340         Length:340         Length:340         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :2.000  
##                                                           Mean   :2.345  
##                                                           3rd Qu.:3.000  
##                                                           Max.   :7.000  
##                                                           NA's   :1      
##  States Directly Involved Brief Description      Deaths       
##  Length:340               Length:340         Min.   :    500  
##  Class :character         Class :character   1st Qu.:   1500  
##  Mode  :character         Mode  :character   Median :   5000  
##                                              Mean   :  78011  
##                                              3rd Qu.:  26000  
##                                              Max.   :2500000  
##                                              NA's   :1        
##   References       
##  Length:340        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
# relationship between variables
hist(war.table$Deaths, main = "Histogram of Deaths")

hist(war.table$Mag, main = "Histogram of Magnitude")

plot(war.table$Deaths ~ war.table$Mag, main = "Deaths vs Magnitude")

Given how heavily right-skewed Deaths is, I don’t think a linear regression makes sense, but we’re going to fit one anyway as a baseline and see what happens.
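To put a rough number on the skew, here is a minimal sketch that uses only figures already printed above: the mean, median and maximum from summary(war.table), plus the six Deaths values shown by head(war.table). The variable names are mine, and the six rows are just a sample, not the full table.

```r
# Values copied from the summary(war.table) output above
deaths_mean   <- 78011
deaths_median <- 5000
deaths_max    <- 2500000

# A mean far above the median signals a long right tail
mean_to_median <- deaths_mean / deaths_median
print(mean_to_median)

# Deaths from the six rows printed by head(war.table); sample data only
sample_deaths <- c(10000, 2000, 150000, 500000, 30000, 1000)

# On a log10 scale the same values sit within a few units of each other
log_deaths <- log10(sample_deaths)
print(range(sample_deaths))
print(round(log_deaths, 2))
```

A mean more than fifteen times the median is a strong hint that a handful of very large conflicts dominate the raw scale, which is why a log scale looks like a more natural fit for Deaths.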

Linear Regression

# Regression
lm.war = lm(Deaths ~ Mag, data = war.table)
summary(lm.war)
## 
## Call:
## lm(formula = Deaths ~ Mag, data = war.table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -376591 -114948  -32733   60481 2150623 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -161696      21751  -7.434 8.79e-13 ***
## Mag           102215       7790  13.121  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 217400 on 337 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3381, Adjusted R-squared:  0.3362 
## F-statistic: 172.2 on 1 and 337 DF,  p-value: < 2.2e-16

With an R-squared of only 0.34, a straight line explains little of the variation in Deaths, even though the Mag coefficient is highly significant. A low R-squared does not prove on its own that the relationship is nonlinear, but I’m pretty sure the residuals will show it.
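One concrete symptom of the poor fit can be read straight off the coefficients. The sketch below plugs the reported intercept and slope into the fitted line for each magnitude from 1 to 7 (the range shown in the summary). It assumes the coefficients exactly as printed above; the variable names are mine.

```r
# Coefficients copied from the summary(lm.war) output above
intercept <- -161696
slope     <- 102215

# Mag runs from 1 to 7 according to summary(war.table)
mag <- 1:7

# Fitted line: Deaths = intercept + slope * Mag
predicted_deaths <- intercept + slope * mag
names(predicted_deaths) <- paste0("Mag", mag)
print(predicted_deaths)

# Magnitudes for which the line predicts a negative death toll
negative_mags <- mag[predicted_deaths < 0]
print(negative_mags)
```

The line predicts a negative number of deaths for the lowest magnitude, which is impossible, and Mag 1 and 2 are the most common values in the data (the median Mag is 2).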

Look at Residuals

# Residuals
hist(lm.war$residuals, main = "Regression Residuals")

qqnorm(lm.war$residuals)
qqline(lm.war$residuals)

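The residuals reflect that: they are strongly right-skewed, ranging from about -377,000 to 2,150,000 with a median near -33,000, so they are far from normally distributed.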

Data Transformation

From the earlier plots, the relationship between the two looks closer to logarithmic than linear, so we log-transform both variables and redo the regression.

# Log transformation
lm.war2 = lm(log(Deaths) ~ log(Mag), data = war.table)
summary(lm.war2)
## 
## Call:
## lm(formula = log(Deaths) ~ log(Mag), data = war.table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3096 -0.4930 -0.1364  0.5568  3.1726 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.04412    0.06751  104.34   <2e-16 ***
## log(Mag)     2.88844    0.07492   38.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.847 on 337 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8152, Adjusted R-squared:  0.8146 
## F-statistic:  1486 on 1 and 337 DF,  p-value: < 2.2e-16

This seems more appropriate: R-squared jumps from 0.34 in the linear regression to 0.82 here. The two values are computed on different scales of the response, so the comparison is only a rough guide.
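It may help to read the log-log fit back on the original scale. Since log(Deaths) = a + b * log(Mag) is the same as Deaths = exp(a) * Mag^b, the model describes a power law. The sketch below uses the coefficients as printed above; the function and variable names are mine, and the back-transformed values ignore any retransformation bias correction, so treat them as rough typical values rather than expected means.

```r
# Coefficients copied from the summary(lm.war2) output above
a <- 7.04412   # intercept on the log scale
b <- 2.88844   # slope on log(Mag)

# log(Deaths) = a + b * log(Mag)  is equivalent to  Deaths = exp(a) * Mag^b
predict_deaths <- function(mag) exp(a) * mag^b

# Typical death toll at Mag = 1, where log(Mag) = 0
baseline <- predict_deaths(1)
print(round(baseline))

# Multiplicative change in Deaths when Mag doubles
doubling_factor <- 2^b
print(round(doubling_factor, 2))

# Back-transformed predictions over the observed range of Mag
print(round(predict_deaths(1:7)))
```

Unlike the straight-line fit, these predictions are positive for every magnitude, and each doubling of Mag multiplies the predicted death toll by a bit more than seven.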

Look at Log Residuals

hist(lm.war2$residuals, main = "Regression Residuals")

qqnorm(lm.war2$residuals)
qqline(lm.war2$residuals)

The residuals suggest the transformation is the better approach, but I’m still wary of the Q-Q plot’s tails, which deviate from the normal line.
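One possible way to follow up on the tails, beyond eyeballing the Q-Q plot, is a Shapiro-Wilk test plus a look at standardized residuals. This is only a sketch of the idea, not something run in the analysis above: to stay self-contained it fits the model to the six rows printed by head(war.table), so the numbers it prints say nothing about the full table. With the real data, the calls would be applied to lm.war2 instead.

```r
# Sample data: the six rows printed by head(war.table), not the full table
sample_war <- data.frame(
  Mag    = c(2, 2, 5, 6, 2, 1),
  Deaths = c(10000, 2000, 150000, 500000, 30000, 1000)
)

# Same log-log specification as lm.war2
fit <- lm(log(Deaths) ~ log(Mag), data = sample_war)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(fit))
print(sw$p.value)

# Standardized residuals beyond +/-2 flag observations sitting in the tails
std_res <- rstandard(fit)
print(which(abs(std_res) > 2))
```

Listing the flagged observations would show which conflicts drive the tail deviation, which may be more informative than the test's p-value alone.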