Molly Renaud
Rensselaer Polytechnic Institute
7 October 2016
The data used in this example can be found in the Global Terrorism Database. This data set contains information about terror attacks all over the world, from 1973 through 2015. For this specific example, we will be looking at terror attacks from 2012-2015, and examining several factors to see how they impact the number of deaths resulting from the attack.
Factor: RegionID Levels: 1 (North America), 2 (Central America & Caribbean), 3 (South America), 4 (East Asia), 5 (Southeast Asia), 6 (South Asia), 7 (Central Asia), 8 (Western Europe), 9 (Eastern Europe), 10 (Middle East & North Africa), 11 (Sub-Saharan Africa), 12 (Australia & Oceania)
Factor: Weapon Type
Levels: 1 (Biological), 2 (Chemical), 3 (Radiological), 4 (Nuclear), 5 (Firearms), 6 (Explosives/Bombs/Dynamite), 7 (Fake Weapons), 8 (Incendiary), 9 (Melee), 10 (Vehicle), 11 (Sabotage Equipment), 12 (Other)
Factor: Target Type Levels: 1 (Business), 2 (Government- General), 3 (Police), 4 (Military), 5 (Abortion-Related), 6 (Airports & Aircraft), 7 (Goverment-Diplomatic), 8 (Educational Institution), 9 (Food or Water Supply), 10 (Journalists & Media), 11 (Maritime), 12 (NGO), 13 (Other), 14 (Private Citizens & Property), 15 (Religious Figures & Institutions), 16 (Telecommunication), 17 (Terrorists & Non-State Militias), 18 (Tourists), 19 (Transportation), 20 (Unknown), 21 (Utilities), 22 (Violent Political Parties)
Factor: Attack Type Levels: 1 (Assassination), 2 (Armed Assult), 3 (Bombing/Explosion), 4 (Hijacking), 5 (Hostage Taking), 6 (Kidnapping), 7 (Facility/Infrastructure Attack), 8 (Unarmed Assult), 9 (Unknown)
This data set contains primarily categorical variables or qualitative variables, but there are a few continuous variables in this data set, such as latitude and longitude that the attack occurred at. Our response variable, number of fatalities, will be treated as a continous variable (see below). However, for this example, we will not be looking at any continous factors, only categorical factors.
For this example, we will be examining the number of fatalities, or number of deaths each attack resulted in. NOTE: Though this should be a discrete variable, we will examine it as a continous variable. This is because when a fatality was linked to multiple attacks, the organizers of the data set distributed the number of fatalities between the attacks to preserve statistical accuracy, resulting in a fractional number of fatalities for an attack.
setwd("~/Downloads")
terror = read.csv("gtd_12to15_0616dist.csv", header = T)
colnames(terror)
## [1] "eventid" "iyear" "imonth"
## [4] "iday" "approxdate" "extended"
## [7] "resolution" "country" "country_txt"
## [10] "region" "region_txt" "provstate"
## [13] "city" "latitude" "longitude"
## [16] "specificity" "vicinity" "location"
## [19] "summary" "crit1" "crit2"
## [22] "crit3" "doubtterr" "alternative"
## [25] "alternative_txt" "multiple" "success"
## [28] "suicide" "attacktype1" "attacktype1_txt"
## [31] "attacktype2" "attacktype2_txt" "attacktype3"
## [34] "attacktype3_txt" "targtype1" "targtype1_txt"
## [37] "targsubtype1" "targsubtype1_txt" "corp1"
## [40] "target1" "natlty1" "natlty1_txt"
## [43] "targtype2" "targtype2_txt" "targsubtype2"
## [46] "targsubtype2_txt" "corp2" "target2"
## [49] "natlty2" "natlty2_txt" "targtype3"
## [52] "targtype3_txt" "targsubtype3" "targsubtype3_txt"
## [55] "corp3" "target3" "natlty3"
## [58] "natlty3_txt" "gname" "gsubname"
## [61] "gname2" "gsubname2" "gname3"
## [64] "ingroup" "ingroup2" "ingroup3"
## [67] "gsubname3" "motive" "guncertain1"
## [70] "guncertain2" "guncertain3" "nperps"
## [73] "nperpcap" "claimed" "claimmode"
## [76] "claimmode_txt" "claim2" "claimmode2"
## [79] "claimmode2_txt" "claim3" "claimmode3"
## [82] "claimmode3_txt" "compclaim" "weaptype1"
## [85] "weaptype1_txt" "weapsubtype1" "weapsubtype1_txt"
## [88] "weaptype2" "weaptype2_txt" "weapsubtype2"
## [91] "weapsubtype2_txt" "weaptype3" "weaptype3_txt"
## [94] "weapsubtype3" "weapsubtype3_txt" "weaptype4"
## [97] "weaptype4_txt" "weapsubtype4" "weapsubtype4_txt"
## [100] "weapdetail" "nkill" "nkillus"
## [103] "nkillter" "nwound" "nwoundus"
## [106] "nwoundte" "property" "propextent"
## [109] "propextent_txt" "propvalue" "propcomment"
## [112] "ishostkid" "nhostkid" "nhostkidus"
## [115] "nhours" "ndays" "divert"
## [118] "kidhijcountry" "ransom" "ransomamt"
## [121] "ransomamtus" "ransompaid" "ransompaidus"
## [124] "ransomnote" "hostkidoutcome" "hostkidoutcome_txt"
## [127] "nreleased" "addnotes" "scite1"
## [130] "scite2" "scite3" "dbsource"
## [133] "INT_LOG" "INT_IDEO" "INT_MISC"
## [136] "INT_ANY" "related"
#As you can see, this data set it a little unruly. To make this data set more managable, we will create a new data frame that contains only the fields we would like to examine: death toll, region of attack, weapon type, target type, and attack type.
attacks <- data.frame(terror$region, terror$attacktype1, terror$targtype1, terror$weaptype1, terror$nkill)
colnames(attacks) <- c("region", "attacktype1", "targtype1", "weaptype1", "nkills")
head(attacks)
## region attacktype1 targtype1 weaptype1 nkills
## 1 6 3 1 6 0.0
## 2 11 2 1 6 2.5
## 3 6 3 4 6 4.0
## 4 11 2 1 6 2.5
## 5 6 6 14 5 0.0
## 6 6 3 17 6 1.0
# Much better- more managable.
This data was complied by the National Consortium for the study of Terrorism and Responses to Terrorism (START) to create the Global Terror Database. For this experiment we will examine the data collected from 2012-2015, looking at how the attack type, target type, weapon type, and attack location factor in to the number of fatalities caused by a terror attack.
This data is available to the public as a dowloadable file on their website with the hope that the data can be used to study terrorism, raise awareness, and provide transparency regarding terror attacks. By studying terror attacks, we can hope to be better prepared to both prevent and respond to them.
The data set is not deliberately randomized, as the data was collected as events occurred from public sources. The Global Terror Database includes all incidents that are considered ‘terrorist attacks’ and does not exclude anything to prevent bias, so we can assume randomization. However, because terror attacks can impact the occurance of other attacks, this data may not be truly random.
Obviously, there are no deliberate replicates in this data set, as that would be unethical and probably start (another) war. It is possible that there are similar attacks in the data set, in terms of region, weapon type, and target type, but those would be a coincidence.
This data is not blocked, but could be, in order to control nuisance factors (e.g. military presence in area of attack, training/preparation for terror attacks, urban/rural location, foreign/domestic attack).
First, we’ll look at the summary statistics for the number of fatalities per attack:
summary(attacks$nkills)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 2.405 2.000 1500.000 2387
As you can see, we have roughly 2300 N/A values (unknown number of casualties), which seems like a lot until you consider that we have 52,000 data points.
We can also look at the summary statistics for the number of fatalities per attack in a specific region, e.g. the Middle East & North Africa (region 10):
mideast <- attacks$region == "10"
summary(attacks[mideast, "nkills"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 1.00 2.75 2.00 1500.00 882
This code can be replicated for any region, weapon type, attack type, or target type to examine the summary statistics on the number of fatalities per attack dependent upon that factor.
We can look at the frequency plots for each factor:
par(mfrow = c(2,2)) #displays a grid of four plots
hist(attacks$region, main = "Frequency of Attacks by Region", xlab = "Location", ylab = "Number of Attacks")
hist(attacks$attacktype1, main = "Frequency of Attacks by Attack Type", xlab = "Attack Type", ylab = "Number of Attacks")
hist(attacks$targtype1, main = "Frequency of Attacks by Target Type", xlab = "Target Type", ylab = "Number of Attacks")
hist(attacks$weaptype1, main = "Frequency of Attacks by Weapon Type", xlab = "Weapon Type", ylab = "Number of Attacks")
We’ll also want to set the factors as factors in R, like so:
attacks$region <- as.factor(attacks$region)
attacks$attacktype1 <- as.factor(attacks$attacktype1)
attacks$targtype1 <- as.factor(attacks$targtype1)
attacks$weaptype1 <- as.factor(attacks$weaptype1)
We can also look at the boxplots for each factor:
par(mfrow = c(2,2)) #displays a grid of four plots
boxplot(attacks$nkills ~ attacks$region, xlab = "Region", ylab = "Fatalities")
boxplot(attacks$nkills ~ attacks$attacktype1, xlab = "Attack Type", ylab = "Fatalities")
boxplot(attacks$nkills ~ attacks$targtype1, xlab = "Target Type", ylab = "Fatalities")
boxplot(attacks$nkills ~ attacks$weaptype1, xlab = "Weapon Type", ylab = "Fatalities")
Calculating the analysis of variance for the main effects of the factors:
model_region <- aov(nkills~region, data = attacks)
anova(model_region)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## region 11 86606 7873.2 63.52 < 2.2e-16 ***
## Residuals 49735 6164583 123.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_attacktype <- aov(nkills~attacktype1, data = attacks)
anova(model_attacktype)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## attacktype1 8 65483 8185.3 65.817 < 2.2e-16 ***
## Residuals 49738 6185706 124.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_targettype <- aov(nkills~targtype1, data = attacks)
anova(model_targettype)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## targtype1 21 49146 2340.31 18.764 < 2.2e-16 ***
## Residuals 49725 6202042 124.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_weapontype <- aov(nkills~weaptype1, data = attacks)
anova(model_weapontype)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## weaptype1 10 19685 1968.54 15.712 < 2.2e-16 ***
## Residuals 49736 6231503 125.29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we will calculate the analysis of variance for the two-factor interactions:
#region-attack type interaction
model_region_attack <- aov(nkills~region*attacktype1, data = attacks)
anova(model_region_attack)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## region 11 86606 7873.2 64.4173 < 2.2e-16 ***
## attacktype1 8 50252 6281.5 51.3944 < 2.2e-16 ***
## region:attacktype1 70 45136 644.8 5.2757 < 2.2e-16 ***
## Residuals 49657 6069194 122.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#region-target type interaction
model_region_target <- aov(nkills~region*targtype1, data = attacks)
anova(model_region_target)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## region 11 86606 7873.2 64.3402 < 2.2e-16 ***
## targtype1 21 39348 1873.7 15.3122 < 2.2e-16 ***
## region:targtype1 152 60396 397.3 3.2471 < 2.2e-16 ***
## Residuals 49562 6064839 122.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#region-weapon type interaction
model_region_weapon <- aov(nkills~region*weaptype1, data = attacks)
anova(model_region_weapon)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## region 11 86606 7873.2 63.6620 <2e-16 ***
## weaptype1 10 13204 1320.4 10.6766 <2e-16 ***
## region:weaptype1 62 9437 152.2 1.2307 0.1047
## Residuals 49663 6141942 123.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#attack type - weapon type interaction
model_attack_weapon <- aov(nkills~attacktype1*weaptype1, data = attacks)
anova(model_attack_weapon)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## attacktype1 8 65483 8185.3 66.445 < 2.2e-16 ***
## weaptype1 10 21781 2178.1 17.681 < 2.2e-16 ***
## attacktype1:weaptype1 32 41864 1308.2 10.620 < 2.2e-16 ***
## Residuals 49696 6122061 123.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#attack type - target type interaction
model_attack_target <- aov(nkills~attacktype1*targtype1, data = attacks)
anova(model_attack_target)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## attacktype1 8 65483 8185.3 66.6070 < 2.2e-16 ***
## targtype1 21 35602 1695.3 13.7955 < 2.2e-16 ***
## attacktype1:targtype1 131 56478 431.1 3.5083 < 2.2e-16 ***
## Residuals 49586 6093626 122.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#target type - weapon type
model_target_weapon <- aov(nkills~targtype1*weaptype1, data = attacks)
anova(model_target_weapon)
## Analysis of Variance Table
##
## Response: nkills
## Df Sum Sq Mean Sq F value Pr(>F)
## targtype1 21 49146 2340.31 18.8019 <2e-16 ***
## weaptype1 10 14901 1490.09 11.9714 <2e-16 ***
## targtype1:weaptype1 101 11612 114.97 0.9236 0.6948
## Residuals 49614 6175529 124.47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results show that all four factors have a statistically significant effect on the number of fatalities resulting from a terror attacks: attack type, target type, weapon type, and location. The ANOVA tests produced a p-value of 2.2e-16, which is a strong indicator of a relationship between the factors and the response variable. The interaction ANOVA analysis shows that there is also some relationship between the factors themselves. Maybe a certain attack type is more common in a particular region, or a certain weapon is used to attack a specific type of target. The variation in the number of fatalities per attack cannot be attributed to randomization alone. Further analysis should be done to verify that the ANOVA test is the best one to run on this data set, and to further determine the relationship between the response variables and the factors looked at in this analysis.
While this data is does not follow the basic principles of experimental design (randomization, replication, blocking), that does not mean that they are not the basis for most statistical analyses.
Randomization refers to the process by which data is collects: collecting data in a random order ensures that there is no unnecessary bias incurred during data collection. This could be due to equipment drift in measurements, or exhaustion in human subjects. For example, if an experiment were to look at the golf swing of test subjects, you would not want to have the same test subject run through all of their experimental runs at once, as fatigue might become a nuisance factor.
Replication refers to repeating an experimental run more than once during an experiment. Continuing with the golf example, a test subject would use the same combination of factor levels to play more than once. This would ensure that if they happened to get really lucky on one swing, the effect of this anomaly would be mitigated by the replicates.
The third principle, blocking, is another way to mitigate nuisance factors. By grouping experimental runs in a certain way, it is possible to reduce the variability introduced into the data set by nuisance factors. For example, with this data, we could group terror attacks by the group who claimed responsibility. This could show differences in patterns of attack between separate terror groups.
Montgomery, Douglas C. Design and analysis of experiments. John Wiley & Sons, 2008.
The raw data can be found through the Global Terrorism Database website here.