Heading

Molly Renaud
Rensselaer Polytechnic Institute
7 October 2016

1. Setting

The data used in this example can be found in the Global Terrorism Database. This data set contains information about terror attacks all over the world, from 1973 through 2015. For this specific example, we will be looking at terror attacks from 2012-2015, and examining several factors to see how they impact the number of deaths resulting from the attack.

Factors and Levels

Factor: RegionID Levels: 1 (North America), 2 (Central America & Caribbean), 3 (South America), 4 (East Asia), 5 (Southeast Asia), 6 (South Asia), 7 (Central Asia), 8 (Western Europe), 9 (Eastern Europe), 10 (Middle East & North Africa), 11 (Sub-Saharan Africa), 12 (Australia & Oceania)

Factor: Weapon Type
Levels: 1 (Biological), 2 (Chemical), 3 (Radiological), 4 (Nuclear), 5 (Firearms), 6 (Explosives/Bombs/Dynamite), 7 (Fake Weapons), 8 (Incendiary), 9 (Melee), 10 (Vehicle), 11 (Sabotage Equipment), 12 (Other)

Factor: Target Type Levels: 1 (Business), 2 (Government- General), 3 (Police), 4 (Military), 5 (Abortion-Related), 6 (Airports & Aircraft), 7 (Goverment-Diplomatic), 8 (Educational Institution), 9 (Food or Water Supply), 10 (Journalists & Media), 11 (Maritime), 12 (NGO), 13 (Other), 14 (Private Citizens & Property), 15 (Religious Figures & Institutions), 16 (Telecommunication), 17 (Terrorists & Non-State Militias), 18 (Tourists), 19 (Transportation), 20 (Unknown), 21 (Utilities), 22 (Violent Political Parties)

Factor: Attack Type Levels: 1 (Assassination), 2 (Armed Assult), 3 (Bombing/Explosion), 4 (Hijacking), 5 (Hostage Taking), 6 (Kidnapping), 7 (Facility/Infrastructure Attack), 8 (Unarmed Assult), 9 (Unknown)

Continuous Variables

This data set contains primarily categorical variables or qualitative variables, but there are a few continuous variables in this data set, such as latitude and longitude that the attack occurred at. Our response variable, number of fatalities, will be treated as a continous variable (see below). However, for this example, we will not be looking at any continous factors, only categorical factors.

Response Variables

For this example, we will be examining the number of fatalities, or number of deaths each attack resulted in. NOTE: Though this should be a discrete variable, we will examine it as a continous variable. This is because when a fatality was linked to multiple attacks, the organizers of the data set distributed the number of fatalities between the attacks to preserve statistical accuracy, resulting in a fractional number of fatalities for an attack.

The Data: How is it organized and what does it look like?

setwd("~/Downloads")
terror = read.csv("gtd_12to15_0616dist.csv", header = T)
colnames(terror)

##   [1] "eventid"            "iyear"              "imonth"            
##   [4] "iday"               "approxdate"         "extended"          
##   [7] "resolution"         "country"            "country_txt"       
##  [10] "region"             "region_txt"         "provstate"         
##  [13] "city"               "latitude"           "longitude"         
##  [16] "specificity"        "vicinity"           "location"          
##  [19] "summary"            "crit1"              "crit2"             
##  [22] "crit3"              "doubtterr"          "alternative"       
##  [25] "alternative_txt"    "multiple"           "success"           
##  [28] "suicide"            "attacktype1"        "attacktype1_txt"   
##  [31] "attacktype2"        "attacktype2_txt"    "attacktype3"       
##  [34] "attacktype3_txt"    "targtype1"          "targtype1_txt"     
##  [37] "targsubtype1"       "targsubtype1_txt"   "corp1"             
##  [40] "target1"            "natlty1"            "natlty1_txt"       
##  [43] "targtype2"          "targtype2_txt"      "targsubtype2"      
##  [46] "targsubtype2_txt"   "corp2"              "target2"           
##  [49] "natlty2"            "natlty2_txt"        "targtype3"         
##  [52] "targtype3_txt"      "targsubtype3"       "targsubtype3_txt"  
##  [55] "corp3"              "target3"            "natlty3"           
##  [58] "natlty3_txt"        "gname"              "gsubname"          
##  [61] "gname2"             "gsubname2"          "gname3"            
##  [64] "ingroup"            "ingroup2"           "ingroup3"          
##  [67] "gsubname3"          "motive"             "guncertain1"       
##  [70] "guncertain2"        "guncertain3"        "nperps"            
##  [73] "nperpcap"           "claimed"            "claimmode"         
##  [76] "claimmode_txt"      "claim2"             "claimmode2"        
##  [79] "claimmode2_txt"     "claim3"             "claimmode3"        
##  [82] "claimmode3_txt"     "compclaim"          "weaptype1"         
##  [85] "weaptype1_txt"      "weapsubtype1"       "weapsubtype1_txt"  
##  [88] "weaptype2"          "weaptype2_txt"      "weapsubtype2"      
##  [91] "weapsubtype2_txt"   "weaptype3"          "weaptype3_txt"     
##  [94] "weapsubtype3"       "weapsubtype3_txt"   "weaptype4"         
##  [97] "weaptype4_txt"      "weapsubtype4"       "weapsubtype4_txt"  
## [100] "weapdetail"         "nkill"              "nkillus"           
## [103] "nkillter"           "nwound"             "nwoundus"          
## [106] "nwoundte"           "property"           "propextent"        
## [109] "propextent_txt"     "propvalue"          "propcomment"       
## [112] "ishostkid"          "nhostkid"           "nhostkidus"        
## [115] "nhours"             "ndays"              "divert"            
## [118] "kidhijcountry"      "ransom"             "ransomamt"         
## [121] "ransomamtus"        "ransompaid"         "ransompaidus"      
## [124] "ransomnote"         "hostkidoutcome"     "hostkidoutcome_txt"
## [127] "nreleased"          "addnotes"           "scite1"            
## [130] "scite2"             "scite3"             "dbsource"          
## [133] "INT_LOG"            "INT_IDEO"           "INT_MISC"          
## [136] "INT_ANY"            "related"

#As you can see, this data set it a little unruly. To make this data set more managable, we will create a new data frame that contains only the fields we would like to examine: death toll, region of attack, weapon type, target type, and attack type. 

attacks <- data.frame(terror$region, terror$attacktype1, terror$targtype1, terror$weaptype1, terror$nkill)
colnames(attacks) <- c("region", "attacktype1", "targtype1", "weaptype1", "nkills")
head(attacks)

##   region attacktype1 targtype1 weaptype1 nkills
## 1      6           3         1         6    0.0
## 2     11           2         1         6    2.5
## 3      6           3         4         6    4.0
## 4     11           2         1         6    2.5
## 5      6           6        14         5    0.0
## 6      6           3        17         6    1.0

# Much better- more managable.

2. Experimental Design

How will experiment be organized and conducted to test the hypothesis?

This data was complied by the National Consortium for the study of Terrorism and Responses to Terrorism (START) to create the Global Terror Database. For this experiment we will examine the data collected from 2012-2015, looking at how the attack type, target type, weapon type, and attack location factor in to the number of fatalities caused by a terror attack.

What is the rationale for this design?

This data is available to the public as a dowloadable file on their website with the hope that the data can be used to study terrorism, raise awareness, and provide transparency regarding terror attacks. By studying terror attacks, we can hope to be better prepared to both prevent and respond to them.

Randomization

The data set is not deliberately randomized, as the data was collected as events occurred from public sources. The Global Terror Database includes all incidents that are considered ‘terrorist attacks’ and does not exclude anything to prevent bias, so we can assume randomization. However, because terror attacks can impact the occurance of other attacks, this data may not be truly random.

Replication

Obviously, there are no deliberate replicates in this data set, as that would be unethical and probably start (another) war. It is possible that there are similar attacks in the data set, in terms of region, weapon type, and target type, but those would be a coincidence.

Blocking

This data is not blocked, but could be, in order to control nuisance factors (e.g. military presence in area of attack, training/preparation for terror attacks, urban/rural location, foreign/domestic attack).

3. Statistical Analysis

Exploratory Data Analysis

First, we’ll look at the summary statistics for the number of fatalities per attack:

summary(attacks$nkills)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.000    1.000    2.405    2.000 1500.000     2387

As you can see, we have roughly 2300 N/A values (unknown number of casualties), which seems like a lot until you consider that we have 52,000 data points.

We can also look at the summary statistics for the number of fatalities per attack in a specific region, e.g. the Middle East & North Africa (region 10):

mideast <- attacks$region == "10"
summary(attacks[mideast, "nkills"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    1.00    2.75    2.00 1500.00     882

This code can be replicated for any region, weapon type, attack type, or target type to examine the summary statistics on the number of fatalities per attack dependent upon that factor.

We can look at the frequency plots for each factor:

par(mfrow = c(2,2)) #displays a grid of four plots

hist(attacks$region, main = "Frequency of Attacks by Region", xlab = "Location", ylab = "Number of Attacks")

hist(attacks$attacktype1, main = "Frequency of Attacks by Attack Type", xlab = "Attack Type", ylab = "Number of Attacks")

hist(attacks$targtype1, main = "Frequency of Attacks by Target Type", xlab = "Target Type", ylab = "Number of Attacks")

hist(attacks$weaptype1, main = "Frequency of Attacks by Weapon Type", xlab = "Weapon Type", ylab = "Number of Attacks")

We’ll also want to set the factors as factors in R, like so:

attacks$region <- as.factor(attacks$region)
attacks$attacktype1 <- as.factor(attacks$attacktype1)
attacks$targtype1 <- as.factor(attacks$targtype1)
attacks$weaptype1 <- as.factor(attacks$weaptype1)

We can also look at the boxplots for each factor:

par(mfrow = c(2,2)) #displays a grid of four plots

boxplot(attacks$nkills ~ attacks$region, xlab = "Region", ylab = "Fatalities")
boxplot(attacks$nkills ~ attacks$attacktype1, xlab = "Attack Type", ylab = "Fatalities")
boxplot(attacks$nkills ~ attacks$targtype1, xlab = "Target Type", ylab = "Fatalities")
boxplot(attacks$nkills ~ attacks$weaptype1, xlab = "Weapon Type", ylab = "Fatalities")

Testing

Calculating the analysis of variance for the main effects of the factors:

model_region <- aov(nkills~region, data = attacks)
anova(model_region)

## Analysis of Variance Table
## 
## Response: nkills
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## region       11   86606  7873.2   63.52 < 2.2e-16 ***
## Residuals 49735 6164583   123.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_attacktype <- aov(nkills~attacktype1, data = attacks)
anova(model_attacktype)

## Analysis of Variance Table
## 
## Response: nkills
##                Df  Sum Sq Mean Sq F value    Pr(>F)    
## attacktype1     8   65483  8185.3  65.817 < 2.2e-16 ***
## Residuals   49738 6185706   124.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_targettype <- aov(nkills~targtype1, data = attacks)
anova(model_targettype)

## Analysis of Variance Table
## 
## Response: nkills
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## targtype1    21   49146 2340.31  18.764 < 2.2e-16 ***
## Residuals 49725 6202042  124.73                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_weapontype <- aov(nkills~weaptype1, data = attacks)
anova(model_weapontype)

## Analysis of Variance Table
## 
## Response: nkills
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## weaptype1    10   19685 1968.54  15.712 < 2.2e-16 ***
## Residuals 49736 6231503  125.29                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we will calculate the analysis of variance for the two-factor interactions:

#region-attack type interaction
model_region_attack <- aov(nkills~region*attacktype1, data = attacks)
anova(model_region_attack)

## Analysis of Variance Table
## 
## Response: nkills
##                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## region                11   86606  7873.2 64.4173 < 2.2e-16 ***
## attacktype1            8   50252  6281.5 51.3944 < 2.2e-16 ***
## region:attacktype1    70   45136   644.8  5.2757 < 2.2e-16 ***
## Residuals          49657 6069194   122.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#region-target type interaction
model_region_target <- aov(nkills~region*targtype1, data = attacks)
anova(model_region_target)

## Analysis of Variance Table
## 
## Response: nkills
##                     Df  Sum Sq Mean Sq F value    Pr(>F)    
## region              11   86606  7873.2 64.3402 < 2.2e-16 ***
## targtype1           21   39348  1873.7 15.3122 < 2.2e-16 ***
## region:targtype1   152   60396   397.3  3.2471 < 2.2e-16 ***
## Residuals        49562 6064839   122.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#region-weapon type interaction
model_region_weapon <- aov(nkills~region*weaptype1, data = attacks)
anova(model_region_weapon)

## Analysis of Variance Table
## 
## Response: nkills
##                     Df  Sum Sq Mean Sq F value Pr(>F)    
## region              11   86606  7873.2 63.6620 <2e-16 ***
## weaptype1           10   13204  1320.4 10.6766 <2e-16 ***
## region:weaptype1    62    9437   152.2  1.2307 0.1047    
## Residuals        49663 6141942   123.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#attack type - weapon type interaction
model_attack_weapon <- aov(nkills~attacktype1*weaptype1, data = attacks)
anova(model_attack_weapon)

## Analysis of Variance Table
## 
## Response: nkills
##                          Df  Sum Sq Mean Sq F value    Pr(>F)    
## attacktype1               8   65483  8185.3  66.445 < 2.2e-16 ***
## weaptype1                10   21781  2178.1  17.681 < 2.2e-16 ***
## attacktype1:weaptype1    32   41864  1308.2  10.620 < 2.2e-16 ***
## Residuals             49696 6122061   123.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#attack type - target type interaction
model_attack_target <- aov(nkills~attacktype1*targtype1, data = attacks)
anova(model_attack_target)

## Analysis of Variance Table
## 
## Response: nkills
##                          Df  Sum Sq Mean Sq F value    Pr(>F)    
## attacktype1               8   65483  8185.3 66.6070 < 2.2e-16 ***
## targtype1                21   35602  1695.3 13.7955 < 2.2e-16 ***
## attacktype1:targtype1   131   56478   431.1  3.5083 < 2.2e-16 ***
## Residuals             49586 6093626   122.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#target type - weapon type
model_target_weapon <- aov(nkills~targtype1*weaptype1, data = attacks)
anova(model_target_weapon)

## Analysis of Variance Table
## 
## Response: nkills
##                        Df  Sum Sq Mean Sq F value Pr(>F)    
## targtype1              21   49146 2340.31 18.8019 <2e-16 ***
## weaptype1              10   14901 1490.09 11.9714 <2e-16 ***
## targtype1:weaptype1   101   11612  114.97  0.9236 0.6948    
## Residuals           49614 6175529  124.47                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results show that all four factors have a statistically significant effect on the number of fatalities resulting from a terror attacks: attack type, target type, weapon type, and location. The ANOVA tests produced a p-value of 2.2e-16, which is a strong indicator of a relationship between the factors and the response variable. The interaction ANOVA analysis shows that there is also some relationship between the factors themselves. Maybe a certain attack type is more common in a particular region, or a certain weapon is used to attack a specific type of target. The variation in the number of fatalities per attack cannot be attributed to randomization alone. Further analysis should be done to verify that the ANOVA test is the best one to run on this data set, and to further determine the relationship between the response variables and the factors looked at in this analysis.

4. Relevant Theory

While this data is does not follow the basic principles of experimental design (randomization, replication, blocking), that does not mean that they are not the basis for most statistical analyses.

Randomization refers to the process by which data is collects: collecting data in a random order ensures that there is no unnecessary bias incurred during data collection. This could be due to equipment drift in measurements, or exhaustion in human subjects. For example, if an experiment were to look at the golf swing of test subjects, you would not want to have the same test subject run through all of their experimental runs at once, as fatigue might become a nuisance factor.

Replication refers to repeating an experimental run more than once during an experiment. Continuing with the golf example, a test subject would use the same combination of factor levels to play more than once. This would ensure that if they happened to get really lucky on one swing, the effect of this anomaly would be mitigated by the replicates.

The third principle, blocking, is another way to mitigate nuisance factors. By grouping experimental runs in a certain way, it is possible to reduce the variability introduced into the data set by nuisance factors. For example, with this data, we could group terror attacks by the group who claimed responsibility. This could show differences in patterns of attack between separate terror groups.

5. References to Literature

Montgomery, Douglas C. Design and analysis of experiments. John Wiley & Sons, 2008.

6. Appendices

The raw data can be found through the Global Terrorism Database website here.

DOE_project1.3_renaud

Molly Renaud

10/8/2016