I’ve decided to start exploring my deepest interest: conflict resolution. To do so, I had to find relevant data online, preprocess it, and learn about causal relationships. The database used here was compiled by the RAND Corporation (see references) and is considered a benchmark in this field.
The terrorism data I found was poorly formatted, so I had to clean it first:
Data Preprocessing:
#clean environment:
rm(list = ls())
dat <- read.csv("/Users/oba2311/Desktop/Minerva/Junior/SS154/assignment1/date_clean.csv", header=T)
head(dat[,1:8])
## X Date City Country Perpetrator
## 1 0 1968-02-09 Buenos Aires Argentina Unknown
## 2 1 1968-02-12 Santo Domingo Dominican Republic Unknown
## 3 2 1968-02-13 Montevideo Uruguay Unknown
## 4 3 1968-02-20 Santiago Chile Unknown
## 5 4 1968-02-21 Washington, D.C. United States Unknown
## 6 5 1968-02-21 Neot Hakikar Israel Unknown
## Weapon Injuries Fatalities
## 1 Firearms 0 0
## 2 Explosives 0 0
## 3 Fire or Firebomb 0 0
## 4 Explosives 0 0
## 5 Explosives 0 0
## 6 Unknown 0 0
typeof(dat$Date)
## [1] "integer"
#Date was read in as a factor (hence typeof "integer"); convert it to a yyyymmdd integer:
dat$Date<-as.integer(format(as.Date(dat$Date), "%Y%m%d"))
head(dat[,1:8])
## X Date City Country Perpetrator
## 1 0 19680209 Buenos Aires Argentina Unknown
## 2 1 19680212 Santo Domingo Dominican Republic Unknown
## 3 2 19680213 Montevideo Uruguay Unknown
## 4 3 19680220 Santiago Chile Unknown
## 5 4 19680221 Washington, D.C. United States Unknown
## 6 5 19680221 Neot Hakikar Israel Unknown
## Weapon Injuries Fatalities
## 1 Firearms 0 0
## 2 Explosives 0 0
## 3 Fire or Firebomb 0 0
## 4 Explosives 0 0
## 5 Explosives 0 0
## 6 Unknown 0 0
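The integer encoding is compact, but it is no longer a date. As a minimal sketch on a few made-up values (the vector `raw` is hypothetical and only stands in for `dat$Date`), the conversion can be reversed and the calendar year extracted:

```r
# Hypothetical stand-in for dat$Date as read by read.csv (a factor of ISO dates).
raw <- factor(c("1968-02-09", "1968-02-12", "2006-08-23"))

# Same conversion as above: factor -> Date -> yyyymmdd integer.
as_int <- as.integer(format(as.Date(raw), "%Y%m%d"))

# Round trip back to Date, and extraction of the calendar year.
back  <- as.Date(as.character(as_int), format = "%Y%m%d")
years <- as_int %/% 10000L

print(as_int)
print(years)
```

Keeping a `Date` column alongside the integer one would make later grouping by year or month easier, though that is a design preference rather than a requirement.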
Let’s plot a histogram of the number of attacks over time, to learn about the trend:
date <- format(round(dat$Date, 4))
head(as.numeric(date))
## [1] 19680209 19680212 19680213 19680220 19680221 19680221
#draw once so that strheight() has an open plot to measure against:
his<- hist(as.numeric(date))
strh <- strheight('W')
#draw again with titles, axis labels and 41 suggested breaks:
his<- hist(as.numeric(date),border = "red", main="Frequency of Terror Attacks Over Time", sub=substitute(paste(italic("Notice the increase in incidents in recent years"))), ylab="Number of Attacks", xlab = "Time", breaks = 41)
#write each bin's count vertically above its bar:
text(his$mids, strh + his$counts, labels=his$counts, adj=c(0, 0.5), srt=90)
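One caveat about binning yyyymmdd integers directly: the values jump from, say, 19681231 to 19690101, so the horizontal axis is not a uniform time scale. A hedged alternative, sketched here on simulated dates (`sim_days` and `sim_dates` are invented for illustration and are not the RAND data), counts incidents per calendar year instead:

```r
set.seed(1)
# Simulated yyyymmdd integers; NOT the RAND data.
sim_days  <- sample(seq(as.Date("1968-01-01"), as.Date("2009-12-31"), by = "day"),
                    size = 500, replace = TRUE)
sim_dates <- as.integer(format(sim_days, "%Y%m%d"))

# Integer division drops the month and day, leaving the year.
yr <- sim_dates %/% 10000L

# One count per year; the factor levels keep years with zero incidents.
per_year <- table(factor(yr, levels = 1968:2009))

barplot(per_year, border = "red", las = 2,
        main = "Attacks per Year (simulated)",
        xlab = "Year", ylab = "Number of Attacks")
```

With the real data, replacing `sim_dates` with `dat$Date` should give one bar per year, assuming the column holds yyyymmdd integers as produced above.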
We see that the current millennium is much worse than the previous one. We should point out that this can also be an artifact of the data: as years go by, documentation and media coverage become more accurate and robust, so we should expect more recorded incidents even if there were no real growth. That said, the numbers shown are dramatic, and it is fair to assume that the number of attacks is indeed growing.
Before exploring further, we can compare these results with another source to validate the data. We learn that the trend is indeed the same, even when using different data sources. Let’s find the outlier:
his$mids
## [1] 19685000 19695000 19705000 19715000 19725000 19735000 19745000
## [8] 19755000 19765000 19775000 19785000 19795000 19805000 19815000
## [15] 19825000 19835000 19845000 19855000 19865000 19875000 19885000
## [22] 19895000 19905000 19915000 19925000 19935000 19945000 19955000
## [29] 19965000 19975000 19985000 19995000 20005000 20015000 20025000
## [36] 20035000 20045000 20055000 20065000 20075000 20085000 20095000
#label each bin's left break with that bin's count:
names(his$breaks) <- his$counts
outl<-max(his$counts)
names(outl)
## NULL
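`names(outl)` returns `NULL` because `max()` returns a bare number with no record of where it came from. A small sketch with made-up values (`x` and its breaks are hypothetical) shows how `which.max()` recovers the position of the fullest bin together with its range:

```r
# Made-up yyyymmdd values, only to illustrate the indexing.
x <- c(19680105, 19680210, 19690315, 19690401, 19690520, 19700101)
h <- hist(x, breaks = seq(19680000, 19710000, by = 10000), plot = FALSE)

i <- which.max(h$counts)          # index of the bin with the most observations
c(from  = h$breaks[i],            # left edge of that bin
  to    = h$breaks[i + 1],        # right edge of that bin
  count = h$counts[i])
```

Applied to `his`, the same three lines would report the fullest bin directly, without reading it off the plot.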
We see that 2006 has the highest number of attacks; it is the 39th bin. Let’s verify:
his$breaks[39]
## 6660
## 20060000
summary(dat[,1:8]) #Omit the description column.
## X Date City Country
## Min. : 0 Min. :19680209 : 4974 Iraq :10763
## 1st Qu.:10032 1st Qu.:19990806 Baghdad: 4103 West Bank/Gaza: 2038
## Median :20064 Median :20041125 Kirkuk : 853 Afghanistan : 2025
## Mean :20064 Mean :20004767 Mosul : 839 Thailand : 2009
## 3rd Qu.:30096 3rd Qu.:20060823 Baqubah: 630 Colombia : 1913
## Max. :40128 Max. :20091231 Athens : 435 Israel : 1687
## (Other):28295 (Other) :19694
## Perpetrator
## Unknown :26190
## Other : 2057
## Taliban : 1000
## Revolutionary Armed Forces of Colombia (FARC): 616
## Hamas (Islamic Resistance Movement) : 576
## Basque Fatherland and Freedom (ETA) : 418
## (Other) : 9272
## Weapon Injuries Fatalities
## Explosives :20523 Min. : 0.000 Min. : 0.000
## Firearms :11222 1st Qu.: 0.000 1st Qu.: 0.000
## Unknown : 3213 Median : 0.000 Median : 0.000
## Fire or Firebomb : 2778 Mean : 3.647 Mean : 1.601
## Remote-detonated explosive: 1593 3rd Qu.: 1.000 3rd Qu.: 1.000
## Knives & sharp objects : 418 Max. :5000.000 Max. :2749.000
## (Other) : 382
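We see that the deadliest attack killed 2,749 people (September 11).
We see that Iraq has by far the most recorded incidents (10,763), and that among identified groups the Taliban accounts for the most incidents (1,000). Note that these are counts of incidents, not measures of lethality, and that most incidents (26,190) have an unknown perpetrator. These figures are worth verifying directly against the data.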
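Because the rankings in `summary()` are incident counts, they can be cross-checked by tabulating the columns directly. A sketch on a tiny invented data frame (`toy` and all of its values are hypothetical; only the column names mirror the dataset):

```r
# Invented rows; column names follow the dataset, values do not.
toy <- data.frame(
  Country     = c("A", "A", "B", "A", "C", "B"),
  Perpetrator = c("Unknown", "G1", "G1", "Unknown", "G2", "G1"),
  Fatalities  = c(0, 2, 5, 1, 0, 3),
  stringsAsFactors = FALSE
)

# Incidents per country, most frequent first.
by_country <- sort(table(toy$Country), decreasing = TRUE)

# Total fatalities per perpetrator: a different ranking from incident counts.
deaths <- aggregate(Fatalities ~ Perpetrator, data = toy, FUN = sum)
deaths <- deaths[order(-deaths$Fatalities), ]

print(by_country)
print(deaths)
```

Ranking by total fatalities rather than by number of incidents may order the groups differently, which matters for any claim about which organization is the deadliest.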
We know that September 11 is a huge outlier, so let’s see how the model does when its influence is limited by capping fatalities at 501:
no_9_11<-ifelse(dat$Fatalities>=501,501,dat$Fatalities)
no_outliers <- data.frame(dat,no_9_11)
#Check that the max of the new column does not exceed 501:
summary(no_outliers[,10])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.545 1.000 501.000
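Note that `ifelse()` here caps large values at 501 rather than dropping the rows, so the extreme incidents stay in the data with a reduced count. A sketch on a made-up vector (`f` is hypothetical) contrasts capping, for which `pmin()` is an equivalent shorthand, with actually excluding the rows:

```r
f <- c(0, 1, 12, 2749, 501, 3)             # made-up fatality counts

capped_ifelse <- ifelse(f >= 501, 501, f)  # approach used above
capped_pmin   <- pmin(f, 501)              # same result, shorter

dropped <- f[f < 501]                      # alternative: exclude the extreme rows

print(capped_pmin)
print(dropped)
```

Which of the two is appropriate depends on the question: capping keeps the sample size intact, while dropping removes the incident from the fit entirely.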
#The model predicts the number of fatalities, based on the number of injuries:
mdl<-lm(no_outliers$no_9_11 ~no_outliers$Injuries)
plot(no_outliers$no_9_11 ~no_outliers$Injuries, main="Injuries as a Regressor for Fatalities", xlab="Injuries", ylab = "Fatalities", xlim=c(0,300), ylim=c(0,250))
abline(mdl, col = "red")
Because most incidents cluster at low numbers of both fatalities and injuries, while a few outliers stretch the axes, the plot is hard to read. Let’s take a look at the summary:
summary(mdl)
##
## Call:
## lm(formula = no_outliers$no_9_11 ~ no_outliers$Injuries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -427.59 -1.22 -1.22 -0.22 398.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2249463 0.0362104 33.83 <2e-16 ***
## no_outliers$Injuries 0.0876721 0.0008471 103.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.227 on 40127 degrees of freedom
## Multiple R-squared: 0.2107, Adjusted R-squared: 0.2107
## F-statistic: 1.071e+04 on 1 and 40127 DF, p-value: < 2.2e-16
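Writing the formula with `no_outliers$` prefixes works for fitting, but `predict()` then cannot be handed new data. A sketch of the more conventional `data =` form, on simulated numbers (the data frame `sim` and the coefficients used to generate it are invented, not estimates from the RAND data):

```r
set.seed(42)
# Simulated data, loosely shaped like the real columns.
sim <- data.frame(Injuries = rpois(200, lambda = 4))
sim$Fatalities <- 1 + 0.1 * sim$Injuries + rnorm(200, sd = 0.5)

fit <- lm(Fatalities ~ Injuries, data = sim)

# Coefficient table: estimate, standard error, t value, p-value.
print(coef(summary(fit)))

# Predicted fatalities for incidents with 0, 10 and 50 injuries.
predict(fit, newdata = data.frame(Injuries = c(0, 10, 50)))
```

The equivalent call on the real data would presumably be `lm(no_9_11 ~ Injuries, data = no_outliers)`, which should give the same estimates as the model above with shorter coefficient names.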
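We see prima facie that both the intercept and the number of injuries are statistically significant regressors. Significance is not the same as predictive power, though: the R-squared of 0.21 means that injuries explain only about a fifth of the variance in fatalities. Let’s perform a significance test:
As the p-value is far below 0.05 (\(p < 2 \times 10^{-16}\)), we reject the null hypothesis that \(\beta = 0\). The reported t value is simply the estimate divided by its standard error, \(0.0876721 / 0.0008471 \approx 103.5\). Hence there is a statistically significant relationship between the variables in the linear regression model for this dataset. It seems plausible that this relationship also holds outside the data (i.e. out of sample), but that is an assumption rather than something this test establishes.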
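Whether a relationship holds out of sample can be checked rather than assumed. One hedged way, sketched here on simulated data (`sim`, the 80/20 split and the seed are all illustrative choices, not part of the original analysis), is to fit on part of the data and measure the error on the rest:

```r
set.seed(7)
# Simulated data; NOT the RAND data.
sim <- data.frame(Injuries = rpois(500, lambda = 4))
sim$Fatalities <- 1 + 0.1 * sim$Injuries + rnorm(500, sd = 0.5)

# Hold out 20% of the rows for testing.
test_idx <- sample(nrow(sim), size = 100)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit <- lm(Fatalities ~ Injuries, data = train)

# Root mean squared error in and out of sample.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_train <- rmse(train$Fatalities, fitted(fit))
rmse_test  <- rmse(test$Fatalities, predict(fit, newdata = test))

c(train = rmse_train, test = rmse_test)
```

If the test error is close to the training error, the fitted relationship generalizes at least within the same dataset; it still says nothing about data collected under different conditions.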
“RAND Database of Worldwide Terrorism Incidents” - https://www.rand.org/nsrd/projects/terrorism-incidents.html