The data are a table of armed conflicts worldwide over the period from 1946 to 2016. The variables include the year each conflict began, the year it ended (if it has), the type of conflict, the magnitude of its impact on the society/system, the location of the episode, and the estimated number of deaths.
The question addressed is whether there is a relationship between Magnitude and the estimated number of Deaths and, if so, whether that relationship is linear.
library(XML)
library(knitr)
# pull the first HTML table from the URL
war = "http://www.systemicpeace.org/warlist/warlist.htm"
war.table = readHTMLTable(war, header = TRUE, which = 1, stringsAsFactors = FALSE)
# row 5 holds the column names; the data start at row 6
names(war.table) = war.table[5,]
war.table = war.table[6:nrow(war.table),]
# convert to numeric (entries that are not numbers become NA)
war.table$Mag = as.numeric(war.table$Mag)
war.table$Deaths = as.numeric(war.table$Deaths)
kable(head(war.table))
|  | Begin | End | Type | Mag | States Directly Involved | Brief Description | Deaths | References |
|---|---|---|---|---|---|---|---|---|
| 6 | 1945 | 1946 | IN | 2 | Indonesia | Independence | 10000 | a b c f g h |
| 7 | 1945 | 1947 | EV | 2 | Iran | Azerbaijani and Kurd rebellions | 2000 | c f g |
| 8 | 1945 | 1949 | CW | 5 | Greece | Greek civil war | 150000 | a b c f g h o |
| 9 | 1945 | 1954 | IN | 6 | Vietnam | Indochina independence | 500000 | a b c f g h |
| 10 | 1946 | 1954 | IW | 2 | France4 | Indochina independence | 30000 | a b c f g h |
| 11 | 1946 | * | CV | 1 | Bolivia | President Villarroel ousted by general armed uprising | 1000 | c f o |
# look at summary
summary(war.table)
## Begin End Type Mag
## Length:340 Length:340 Length:340 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:1.000
## Mode :character Mode :character Mode :character Median :2.000
## Mean :2.345
## 3rd Qu.:3.000
## Max. :7.000
## NA's :1
## States Directly Involved Brief Description Deaths
## Length:340 Length:340 Min. : 500
## Class :character Class :character 1st Qu.: 1500
## Mode :character Mode :character Median : 5000
## Mean : 78011
## 3rd Qu.: 26000
## Max. :2500000
## NA's :1
## References
## Length:340
## Class :character
## Mode :character
##
##
##
##
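The single NA in both Mag and Deaths most likely comes from a cell that as.numeric() could not parse as a number. The sketch below uses a made-up three-row data frame (toy and its values are illustrative assumptions, not rows from the conflict table) to show one way of locating such rows before lm() silently drops them.

```r
# Toy stand-in for the scraped table; the values are invented for illustration.
toy <- data.frame(
  Mag    = c("2", "5", "n/a"),
  Deaths = c("10000", "150000", "n/a"),
  stringsAsFactors = FALSE
)

# as.numeric() turns unparseable strings into NA (with a warning, silenced here).
mag_num    <- suppressWarnings(as.numeric(toy$Mag))
deaths_num <- suppressWarnings(as.numeric(toy$Deaths))

# Indices of rows where either conversion failed.
bad_rows <- which(is.na(mag_num) | is.na(deaths_num))
print(toy[bad_rows, ])
```

Running the same check on war.table before the conversion would show which conflict is excluded from the regressions.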
# distribution of each variable, then the relationship between them
hist(war.table$Deaths, main = "Histogram of Deaths")
hist(war.table$Mag, main = "Histogram of Magnitude")
plot(Deaths ~ Mag, data = war.table, main = "Deaths vs Magnitude")
Given how right-skewed Deaths is (a median of 5,000 against a mean of 78,011), I don’t think a linear regression on the raw scale makes sense, but we’re going to try it anyway as a baseline and see what happens.
# Regression
lm.war = lm(Deaths ~ Mag, data = war.table)
summary(lm.war)
##
## Call:
## lm(formula = Deaths ~ Mag, data = war.table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -376591 -114948 -32733 60481 2150623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -161696 21751 -7.434 8.79e-13 ***
## Mag 102215 7790 13.121 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 217400 on 337 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3381, Adjusted R-squared: 0.3362
## F-statistic: 172.2 on 1 and 337 DF, p-value: < 2.2e-16
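Mag is highly significant, but the straight-line model explains only about a third of the variance in Deaths (R-squared of 0.34), and its negative intercept predicts roughly -59,000 deaths for a magnitude of 1, which is impossible. A low R-squared alone does not prove the relationship is nonlinear, so the residuals are the next thing to check; I’m pretty sure they will show the same problem.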
# Residuals
hist(lm.war$residuals, main = "Regression Residuals")
qqnorm(lm.war$residuals)
qqline(lm.war$residuals)
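The residuals reflect that: they are strongly right-skewed (a median of about -33,000 against a maximum of about 2.15 million) and stray far from the normal line in the Q-Q plot.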
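A residuals-versus-fitted plot and a skewness figure give a more direct check than the histogram alone. This is a minimal sketch on simulated data: the data frame sim, the helper skewness(), and the parameters used to generate Deaths are all assumptions chosen to mimic a right-skewed response, not the conflict table.

```r
set.seed(42)

# Simulated stand-in: Deaths grows as a power of Mag with multiplicative noise.
n   <- 300
sim <- data.frame(Mag = sample(1:7, n, replace = TRUE))
sim$Deaths <- exp(7 + 2.9 * log(sim$Mag) + rnorm(n, sd = 0.85))

# Straight-line fit on the raw scale, analogous to lm.war.
fit_raw <- lm(Deaths ~ Mag, data = sim)

# Hypothetical helper: sample skewness (near 0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

res_skew <- skewness(resid(fit_raw))
print(res_skew)

# Curvature or a funnel shape around the dashed line signals a poor linear fit.
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted (simulated)")
abline(h = 0, lty = 2)
```

On the real model the same two steps would use resid(lm.war) and fitted(lm.war).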
From the earlier plots, the relationship between the two looks nonlinear, with Deaths rising much faster than Mag, so we take the log of both variables and redo the regression.
# Log transformation
lm.war2 = lm(log(Deaths) ~ log(Mag), data = war.table)
summary(lm.war2)
##
## Call:
## lm(formula = log(Deaths) ~ log(Mag), data = war.table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3096 -0.4930 -0.1364 0.5568 3.1726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.04412 0.06751 104.34 <2e-16 ***
## log(Mag) 2.88844 0.07492 38.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.847 on 337 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8152, Adjusted R-squared: 0.8146
## F-statistic: 1486 on 1 and 337 DF, p-value: < 2.2e-16
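This seems more appropriate: the R-squared jumps from 0.34 in the linear regression to 0.82 here. The two values are not strictly comparable, because the response is now log(Deaths) rather than Deaths, but the improvement is large enough to be convincing.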
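Back-transforming the log-log fit makes the coefficients easier to read: log(Deaths) = a + b * log(Mag) is equivalent to Deaths = exp(a) * Mag^b. The sketch below plugs in the estimates from the summary above. predict_deaths() is a hypothetical helper, and the back-transformed value is better read as a typical (median-like) death count than as a mean, assuming roughly normal errors on the log scale.

```r
# Coefficients copied from the summary(lm.war2) output.
a <- 7.04412   # intercept
b <- 2.88844   # slope on log(Mag)

# Hypothetical helper: back-transform the log-log fit to the original scale.
predict_deaths <- function(mag) exp(a) * mag^b

baseline <- predict_deaths(1)   # fitted value at Mag = 1, equal to exp(a)
doubling <- 2^b                 # multiplicative change in Deaths when Mag doubles

print(round(baseline))          # about 1146
print(round(doubling, 1))       # about 7.4
```

Under this fit, each doubling of magnitude goes with roughly a sevenfold increase in the fitted death count.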
# Residuals of the log-log model
hist(lm.war2$residuals, main = "Log-Log Regression Residuals")
qqnorm(lm.war2$residuals)
qqline(lm.war2$residuals)
The residuals are far more symmetric after the transformation, which supports it as the better approach, but I’m still wary of the Q-Q plot’s tails deviating from the normal line.
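A Shapiro-Wilk test is one way to put a number on what the Q-Q plot shows. This is a sketch on simulated data (sim and its generating parameters are assumptions, not the conflict table); on the real fit the call would be shapiro.test(resid(lm.war2)). With a few hundred observations the test can flag even mild tail departures, so it complements the plot and should not be read as a verdict.

```r
set.seed(1)

# Simulated stand-in: a power-law relationship with multiplicative noise.
n   <- 300
sim <- data.frame(Mag = sample(1:7, n, replace = TRUE))
sim$Deaths <- exp(7 + 2.9 * log(sim$Mag) + rnorm(n, sd = 0.85))

fit_raw <- lm(Deaths ~ Mag, data = sim)            # raw-scale fit
fit_log <- lm(log(Deaths) ~ log(Mag), data = sim)  # log-log fit

# Shapiro-Wilk: W close to 1 is consistent with normal residuals;
# a small p-value is evidence against normality.
sw_raw <- shapiro.test(resid(fit_raw))
sw_log <- shapiro.test(resid(fit_log))

print(c(W_raw = unname(sw_raw$statistic), W_log = unname(sw_log$statistic)))
print(c(p_raw = sw_raw$p.value, p_log = sw_log$p.value))
```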