Introduction
This report contains an exploratory data analysis (EDA) of the Global Terrorism Database.
I chose this dataset because my father is a retired Marine veteran who spent
a great deal of my life protecting our country from the kinds of events
described in this dataset.
Throughout this report I will refer to the dataset as GTD.
The GTD consists of 135 variables (columns) and 170,350 records.
I will not use all 135 variables in this analysis.
Instead, I will explain each variable I select and why it is valuable to
the EDA process.
Univariate Summaries
colnames(gtData)
## [1] "eventid" "iyear" "imonth"
## [4] "iday" "approxdate" "extended"
## [7] "resolution" "country" "country_txt"
## [10] "region" "region_txt" "provstate"
## [13] "city" "latitude" "longitude"
## [16] "specificity" "vicinity" "location"
## [19] "summary" "crit1" "crit2"
## [22] "crit3" "doubtterr" "alternative"
## [25] "alternative_txt" "multiple" "success"
## [28] "suicide" "attacktype1" "attacktype1_txt"
## [31] "attacktype2" "attacktype2_txt" "attacktype3"
## [34] "attacktype3_txt" "targtype1" "targtype1_txt"
## [37] "targsubtype1" "targsubtype1_txt" "corp1"
## [40] "target1" "natlty1" "natlty1_txt"
## [43] "targtype2" "targtype2_txt" "targsubtype2"
## [46] "targsubtype2_txt" "corp2" "target2"
## [49] "natlty2" "natlty2_txt" "targtype3"
## [52] "targtype3_txt" "targsubtype3" "targsubtype3_txt"
## [55] "corp3" "target3" "natlty3"
## [58] "natlty3_txt" "gname" "gsubname"
## [61] "gname2" "gsubname2" "gname3"
## [64] "gsubname3" "motive" "guncertain1"
## [67] "guncertain2" "guncertain3" "individual"
## [70] "nperps" "nperpcap" "claimed"
## [73] "claimmode" "claimmode_txt" "claim2"
## [76] "claimmode2" "claimmode2_txt" "claim3"
## [79] "claimmode3" "claimmode3_txt" "compclaim"
## [82] "weaptype1" "weaptype1_txt" "weapsubtype1"
## [85] "weapsubtype1_txt" "weaptype2" "weaptype2_txt"
## [88] "weapsubtype2" "weapsubtype2_txt" "weaptype3"
## [91] "weaptype3_txt" "weapsubtype3" "weapsubtype3_txt"
## [94] "weaptype4" "weaptype4_txt" "weapsubtype4"
## [97] "weapsubtype4_txt" "weapdetail" "nkill"
## [100] "nkillus" "nkillter" "nwound"
## [103] "nwoundus" "nwoundte" "property"
## [106] "propextent" "propextent_txt" "propvalue"
## [109] "propcomment" "ishostkid" "nhostkid"
## [112] "nhostkidus" "nhours" "ndays"
## [115] "divert" "kidhijcountry" "ransom"
## [118] "ransomamt" "ransomamtus" "ransompaid"
## [121] "ransompaidus" "ransomnote" "hostkidoutcome"
## [124] "hostkidoutcome_txt" "nreleased" "addnotes"
## [127] "scite1" "scite2" "scite3"
## [130] "dbsource" "INT_LOG" "INT_IDEO"
## [133] "INT_MISC" "INT_ANY" "related"
Now that I've had a chance to review the column names, I have identified a few variables that appear to have significant value.
success: Used to measure the percentage of successful vs. unsuccessful attacks
nkill: Shows the impact on life as it relates to each attack
country: Used to compare attacks across countries
lat/lon: Geographic coordinates for pinpointing where attacks occurred
weapon: Used to compare attacks across weapon types
count: Introduced variable used to compare counts across multiple variables
summary(gtData$success)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.8964 1.0000 1.0000
Success Summary: About 90% of terrorist attacks (mean 0.8964) were recorded as successful.
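The percentage can be read off the mean because success is coded 0/1, so its mean is the share of 1s. A minimal sketch of that arithmetic, assuming a made-up vector of ten outcomes (the values are hypothetical and are not rows from gtData):

```r
# Hypothetical 0/1 outcomes for ten attacks (illustrative only, not GTD data)
success <- c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1)

# The mean of a 0/1 indicator equals the proportion of 1s
success_rate <- mean(success)

# The same figure computed explicitly as successes divided by total attacks
success_rate_check <- sum(success == 1) / length(success)

print(success_rate)
```

On the real data the same idea would be mean(gtData$success), which is the 0.8964 reported by summary().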
summary(gtData$suicide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03387 0.00000 1.00000
Suicide Summary: About 3.4% of all attacks (mean 0.03387) were recorded as suicide attacks.
summary(gtData$nkill)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.387 2.000 1500.000 9682
NKill Summary: The mean is 2.387 deaths per attack, with a maximum of 1,500; 9,682 records have no recorded value (NA).
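Because nkill contains missing values, the mean above is taken over the records that do have a count. A small sketch of how missing values affect these statistics, assuming a made-up vector (the values are hypothetical and are not rows from gtData):

```r
# Hypothetical fatality counts with missing values (illustrative only)
nkill <- c(0, 0, 2, NA, 5, 0, NA, 1)

# Number of records with no fatality count
n_missing <- sum(is.na(nkill))

# Without na.rm, a single NA makes the mean NA
mean_all <- mean(nkill)

# With na.rm = TRUE, the statistics use only the known counts
mean_known <- mean(nkill, na.rm = TRUE)
median_known <- median(nkill, na.rm = TRUE)

print(c(missing = n_missing, mean = mean_known, median = median_known))
```

summary() does this exclusion automatically and reports the number of missing values in its NA's column.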
summary(gtData$city)
## Unknown Baghdad Karachi
## 9162 7206 2609
## Lima Belfast Mosul
## 2358 2140 1775
## Santiago San Salvador Mogadishu
## 1618 1547 1351
## Istanbul Athens Bogota
## 1037 987 981
## Beirut Kirkuk Medellin
## 919 887 846
## Peshawar Benghazi Guatemala City
## 798 791 754
## Quetta Baqubah Kabul
## 752 725 644
## Srinagar Jerusalem Paris
## 637 605 598
## Fallujah Dhaka Rome
## 561 549 548
## Tripoli Ramadi Manila
## 537 482 481
## Aleppo Buenos Aires Ayacucho
## 469 460 459
## New York City Sanaa
## 449 447 446
## Madrid Algiers Arish
## 414 406 400
## Tikrit Imphal London
## 398 392 383
## Maiduguri Londonderry Damascus
## 376 349 344
## Kandahar Bilbao Gaza
## 343 335 333
## Colombo Cali Ajaccio
## 321 311 305
## Abu Ghraib Ankara Tehran
## 290 289 276
## Aden Donostia-San Sebastian Tuz Khormato
## 275 272 268
## Samarra Johannesburg Baiji
## 259 255 240
## Rafah Jalalabad Madain
## 240 236 231
## Sheikh Zuweid Taizz Bujumbura
## 231 229 227
## Bastia Grozny Barcelona
## 225 222 221
## Lahore Bangkok Muqdadiyah
## 218 212 211
## Mahmudiyah Cairo Milan
## 205 203 203
## Managua Hebron Kismayo
## 201 200 194
## Makhachkala Sidon Tarmiyah
## 194 193 193
## Donetsk Basra Santa Ana
## 192 191 191
## Jaffna San Miguel Taji
## 189 182 182
## Huancayo La Paz Sirte
## 181 180 175
## Lashkar Gah Tokyo Khost
## 174 172 171
## Bara Nablus Batticaloa
## 170 166 165
## Tegucigalpa Ghazni Jamrud
## 164 162 157
## (Other)
## 107248
City Summary: Excluding Unknown, the top 3 cities are Baghdad, Karachi, and Lima. [1]
summary(gtData$natlty1_txt)
## Iraq Pakistan
## 21625 13168
## India Afghanistan
## 11110 9669
## Colombia Philippines
## 7783 5997
## Peru El Salvador
## 5832 5212
## United States Turkey
## 4976 4436
## Israel Thailand
## 3940 3623
## Northern Ireland Nigeria
## 3288 3270
## Spain Yemen
## 3091 2908
## France Sri Lanka
## 2867 2811
## Algeria Somalia
## 2648 2636
## International Russia
## 2428 2282
## Chile Egypt
## 2227 2209
## Great Britain South Africa
## 2104 2040
## Nicaragua Syria
## 1960 1954
## Libya Guatemala
## 1952 1893
## Ukraine Bangladesh
## 1605 1593
## Italy
## 1458 1394
## Lebanon Nepal
## 1381 950
## Greece Germany
## 917 898
## West Bank and Gaza Strip Iran
## 889 812
## Sudan Indonesia
## 750 697
## Argentina Democratic Republic of the Congo
## 641 569
## Burundi Kenya
## 556 554
## Mexico Japan
## 462 457
## Myanmar Angola
## 440 423
## China Saudi Arabia
## 369 360
## Uganda Ireland
## 354 336
## Mozambique Multinational
## 310 289
## Bolivia Mali
## 283 279
## Venezuela Serbia-Montenegro
## 271 256
## Honduras Brazil
## 253 244
## Cameroon Cambodia
## 220 203
## Haiti Soviet Union
## 200 199
## Georgia Ecuador
## 187 182
## Bahrain Central African Republic
## 170 168
## Ethiopia Netherlands
## 167 150
## Switzerland Rwanda
## 143 141
## Portugal Canada
## 140 136
## Tajikistan South Sudan
## 135 134
## Namibia Australia
## 131 130
## Yugoslavia Niger
## 125 124
## Tunisia Senegal
## 117 116
## Sweden Panama
## 115 113
## Belgium Jordan
## 110 109
## Bosnia-Herzegovina Macedonia
## 106 106
## Paraguay West Germany (FRG)
## 105 104
## Albania Cuba
## 96 94
## Zimbabwe Cyprus
## 90 89
## Dominican Republic Malaysia
## 86 86
## Austria (Other)
## 82 2452
Nationality Summary: The top 5 nationalities are Iraq, Pakistan, India, Afghanistan, and Colombia.
summary(gtData$weaptype1_txt)
## Biological
## 35
## Chemical
## 293
## Explosives/Bombs/Dynamite
## 86704
## Fake Weapons
## 33
## Firearms
## 55273
## Incendiary
## 10459
## Melee
## 3338
## Other
## 104
## Radiological
## 13
## Sabotage Equipment
## 130
## Unknown
## 13852
## Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)
## 116
Weapon Summary: Explosives/Bombs/Dynamite and Firearms are the top weapons of choice.
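To put a number on that statement, the counts printed above can be converted to percentage shares. The sketch below copies the counts from the summary output; the variable names are my own, and the long Vehicle label is shortened for readability:

```r
# Counts copied from the summary(gtData$weaptype1_txt) output above
weapon_counts <- c(
  "Biological" = 35, "Chemical" = 293,
  "Explosives/Bombs/Dynamite" = 86704, "Fake Weapons" = 33,
  "Firearms" = 55273, "Incendiary" = 10459, "Melee" = 3338,
  "Other" = 104, "Radiological" = 13, "Sabotage Equipment" = 130,
  "Unknown" = 13852, "Vehicle" = 116
)

# The counts should add up to the total number of records
total_attacks <- sum(weapon_counts)

# Percentage share of each weapon type, rounded to one decimal place
weapon_share <- round(100 * weapon_counts / total_attacks, 1)

# The two most common weapon types
top_two <- sort(weapon_share, decreasing = TRUE)[1:2]
print(top_two)
```

Under these counts, explosives account for roughly half of all attacks and firearms for roughly a third.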
summary(gtData$iyear)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1970 1990 2007 2002 2014 2016
Year Summary: Dates range from 1970 to 2016, nearly 50 years of data.
summary(gtData$imonth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 6.474 9.000 12.000
Month Summary: The median month is June (mean 6.474). The minimum of 0 is not a calendar month, so it appears to mark records where the month is unknown.
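Since a month is a label rather than a quantity, a frequency table may describe it better than a mean. A rough sketch, assuming made-up month codes in which 0 stands for an unknown month (the values are hypothetical and are not rows from gtData):

```r
# Hypothetical month codes; 0 is treated here as "month unknown"
imonth <- c(0, 1, 5, 5, 6, 7, 7, 7, 12, 0)

# Keep only real calendar months before counting
known_months <- imonth[imonth >= 1 & imonth <= 12]

# Count attacks per month, keeping all 12 months even if some have none
attacks_per_month <- table(factor(known_months, levels = 1:12))

# Name of the month with the most attacks in this toy sample
busiest_month <- month.name[which.max(attacks_per_month)]
print(busiest_month)
```

The same steps applied to gtData$imonth would give the most frequent month rather than an average of month numbers.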
summary(gtData$region_txt)
## Australasia & Oceania Central America & Caribbean
## 264 10340
## Central Asia East Asia
## 554 794
## Eastern Europe Middle East & North Africa
## 5031 46511
## North America South America
## 3346 18762
## South Asia Southeast Asia
## 41497 11453
## Sub-Saharan Africa Western Europe
## 15491 16307
Region Summary: Middle East & North Africa (46,511) and South Asia (41,497) have the most attacks.
Given the above summaries, approximately 90% of the 170,350 total attacks were successful.
Univariate Plots
Note: Analysis of each plot is provided in the subtitle of the plot.
hchart(factor(gtData$success), name = "Success") %>%
hc_title(text = "Success Univ. Plot") %>%
hc_subtitle(text = "Majority of the attacks are denoted as success.")
hchart(factor(gtData$suicide), name = "Suicide") %>%
hc_title(text = "Suicide Univ. Plot") %>%
hc_subtitle(text = "Only a small amount of the attacks were suicide attacks.")
hchart(factor(gtData$iyear), name = "Year") %>%
hc_title(text = "Year Univ. Plot") %>%
hc_subtitle(text = "Shows attacks by year.")
uni_weap <- gtData %>%
group_by(weaptype1_txt) %>%
summarise(count = n())
## Warning in grouped_df_impl(data, unname(vars), drop): '.Random.seed' is not
## an integer vector but of type 'NULL', so ignored
hchart(uni_weap$count, name = "Weapon Type 1") %>%
hc_title(text = "Weapon Type 1 Univ. Plot") %>%
hc_subtitle(text = "Histogram of attacks by Weapon Type 1")
uni_country <- gtData %>%
group_by(country_txt) %>%
summarise(count = n(), unique = length(unique(country_txt))) %>%
arrange(-count, -unique)
hchart(uni_country, "treemap",
hcaes(x = country_txt, value = count, color = unique)) %>%
hc_title(text = "Country Univ. Plot") %>%
hc_subtitle(text = "Attacks by Country")
uni_nkill <- gtData %>%
  # keep only records where both fatality counts are present
  filter(!is.na(nkill), !is.na(nkillus)) %>%
  group_by(nkill) %>%
  summarise(t_nkill = sum(nkill + nkillus))
hcboxplot(uni_nkill$t_nkill, name = "Nkill") %>%
hc_title(text = "Number of Kills Univ. Plot") %>%
hc_subtitle(text = "Boxplot Showing Attacks by Number of Kills")
hchart(factor(gtData$imonth), name = "Month") %>%
hc_title(text = "Month Univ. Plot") %>%
hc_subtitle(text = "Shows attacks by month")
hchart(factor(gtData$iday), name = "Day") %>%
hc_title(text = "Day Univ. Plot") %>%
hc_subtitle(text = "Shows attacks by Day")
data_hours <- gtData %>%
filter(nhours != '', nhours > 0) %>%
group_by(nhours) %>%
summarise(count = n())
hchart(data_hours$count, name = "Hours") %>%
hc_title(text = "Hours Univ. Plot") %>%
hc_subtitle(text = "Shows hostage/kidnapping incidents by duration in hours")
hchart(factor(gtData$region_txt), name = "Region") %>%
hc_title(text = "Region Univ. Plot")
hchart(factor(gtData$attacktype1_txt), name = "Attack Type 1") %>%
hc_title(text = "Attack Type 1 Univ. Plot")
The dataset contains 170,350 records of terrorist attacks with a total of 135 variables.
It also contains geographic coordinates for each attack.
One important point of interest in the dataset is the weapon details.
The dataset records the type of weapon used, which can be analyzed to determine weapon choice within an attack.
Variables such as hostage, resolution, year, month, day, vicinity, and summary
are all great points of interest in this dataset.
These additional variables would be extremely critical if we were to take this
analysis a step further into predictive analysis.
The dataset could then be used to cross-examine potential threats and correlate them
to a potential outcome based on historical attack data.
In later exploration I will introduce new variables such as count.
The count variable will be used in statistical comparisons and plots.
I have used the dplyr functions summarise, group_by, mutate, and filter to subset, organize, and prepare the data.
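To illustrate what the count variable is, here is a base R sketch of the group-and-count step, written without dplyr so it runs on its own. The data frame gt_small is a hypothetical miniature of gtData: the column names match the report, but the rows are made up.

```r
# Hypothetical miniature of gtData (real column names, made-up rows)
gt_small <- data.frame(
  iyear = c(1970, 1970, 1971, 1971, 1971),
  weaptype1_txt = c("Firearms", "Explosives/Bombs/Dynamite", "Firearms",
                    "Firearms", "Explosives/Bombs/Dynamite"),
  stringsAsFactors = FALSE
)

# Give every attack a weight of 1 so that summing yields a count
gt_small$count <- 1

# Base R counterpart of group_by(iyear, weaptype1_txt) %>% summarise(count = n())
weapon_by_year <- aggregate(count ~ iyear + weaptype1_txt,
                            data = gt_small, FUN = sum)

# Sort by year, then weapon type, for easier reading
weapon_by_year <- weapon_by_year[order(weapon_by_year$iyear,
                                       weapon_by_year$weaptype1_txt), ]
print(weapon_by_year)
```

Each row of the result is one year/weapon combination with the number of attacks in that combination, which is the shape the plots in this report expect.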
# plot weapons by year
weaponByYear <- gtData %>%
group_by(iyear, weaptype1_txt) %>%
summarise(count = n())
bp1 <- hchart(weaponByYear, "scatter",
hcaes(x = iyear, y = count, group = weaptype1_txt)) %>%
hc_title(text = "Bivariate Weapon By Year") %>%
hc_subtitle(text = "Firearms are used consistently, while Explosives/Bombs/Dynamite appear in an increasing number of attacks over time.")
bp1
# plot attack count by country (top 20)
bivarCountry <- gtData %>%
group_by(country_txt) %>%
summarise(count = n()) %>%
arrange(desc(.data$count)) %>%
head(20)
bp2 <- hchart(bivarCountry, "column",
hcaes(x = country_txt, y = count, group = country_txt)) %>%
hc_title(text = "Count By Country") %>%
hc_subtitle(text = "Shows the top 20 countries by total number of attacks" )
bp2
# plot weapons by n of kills
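bivarWeapXNKills <- gtData %>%
  # drop records with no fatality count
  filter(!is.na(nkill)) %>%
  group_by(weaptype1_txt, nkill) %>%
  arrange(desc(nkill)) %>%
  # summarise() with no arguments keeps one row per distinct weapon/nkill pair
  summarise()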
hchart(bivarWeapXNKills, "bar",
hcaes(x = weaptype1_txt, y = nkill, group = weaptype1_txt)) %>%
hc_title(text = "Weapons by Kills") %>%
hc_subtitle(text = "Shows weapons by the number of recorded kills. Firearms and Explosives appear to be the primary weapons of choice for terrorist attacks.")
I examined how weapon type relates to the number of deaths
attributed to attacks with that weapon.
I also analyzed this data by year and found that as the years progress, and more
specifically during the Global War on Terrorism, explosives have become
increasingly popular.
This may be common knowledge to anyone who follows the news, but this
analysis puts factual statistics behind the topic.
The relationship that stood out is between the nkill and weaptype1_txt variables.
Higher-impact weapons such as explosives tend to be associated with larger
nkill values, whereas lower-impact weapons tend to be associated with lower ones.
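Because weapon type is a category and nkill is a number, one way to examine this relationship is to compare total and mean deaths per weapon type. The sketch below is only an illustration under assumptions: gt_small is a hypothetical data frame with made-up rows, not a sample of gtData.

```r
# Hypothetical miniature of gtData (real column names, made-up rows)
gt_small <- data.frame(
  weaptype1_txt = c("Explosives/Bombs/Dynamite", "Explosives/Bombs/Dynamite",
                    "Firearms", "Firearms", "Melee", "Melee"),
  nkill = c(10, 4, 3, NA, 1, 0),
  stringsAsFactors = FALSE
)

# Drop records with no fatality count before aggregating
known <- gt_small[!is.na(gt_small$nkill), ]

# Total and mean deaths per weapon type
total_kills <- tapply(known$nkill, known$weaptype1_txt, sum)
mean_kills <- tapply(known$nkill, known$weaptype1_txt, mean)

print(total_kills)
print(mean_kills)
```

Comparing the mean alongside the total helps separate weapons that are deadly per attack from weapons that are simply used often.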
Weapon type also varies with the year of the attack, which shows the slight
evolution of weapon choice over time.
I observed the relationship between the country and the weapon type used,
compared with the number of attacks for the given country and weapon.
I found it interesting that Iraq, in the pie chart, accounted for a higher
percentage of the attacks, along with a higher count of explosives.
No effort was focused on analyzing the correlation between the weapon type
used and the number of attacks by country.
The Bivariate Weapon By Year plot shows the usage of each weapon type over time.
This plot showed Explosives/Bombs/Dynamite to be an increasingly
popular weapon of choice over the years.
The Count By Country plot shows a breakdown of the top 20 countries
based on the total number of attacks.
After analyzing this plot I was able to identify three countries,
Iraq, Afghanistan, and Pakistan, as the top countries in the dataset where terrorist
attacks have taken place.
The plot contains the comparison between weapon, count, and country.
I was able to find a significant correlation between the type of weapon used
and the country where the attack took place.
Iraq led in both explosives use and number of attacks.
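A comparison of weapon, count, and country like this one can be tabulated as a cross-tabulation. The following is a minimal sketch, assuming a hypothetical data frame gt_small with made-up rows; it is not the code used to build the plot.

```r
# Hypothetical miniature of gtData (real column names, made-up rows)
gt_small <- data.frame(
  country_txt = c("Iraq", "Iraq", "Iraq", "Pakistan", "Pakistan", "Afghanistan"),
  weaptype1_txt = c("Explosives/Bombs/Dynamite", "Explosives/Bombs/Dynamite",
                    "Firearms", "Firearms", "Explosives/Bombs/Dynamite",
                    "Firearms"),
  stringsAsFactors = FALSE
)

# Number of attacks for every country/weapon combination
weapon_by_country <- table(gt_small$country_txt, gt_small$weaptype1_txt)

# Share of each weapon type within a country (each row sums to 1)
row_share <- prop.table(weapon_by_country, margin = 1)

print(weapon_by_country)
print(round(row_share, 2))
```

The row shares make countries with very different attack totals comparable on weapon mix.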
Exploring the Global Terrorism Database was very informative and challenging.
While exploring this dataset I experienced several different challenges
due to its size of over 170,000 records.
I learned that cleaning and subsetting the data to explore only the points of
interest is extremely important in data exploration.
I also learned that quick explorations, used to decide which path to
take for deeper exploration, require knowing when to filter the data
and which data points to filter.
For example, I started out by subsetting the data to 100 records. I quickly realized
that the results of my exploration were not consistent with what I was expecting.
After taking a closer look I realized that I had already grouped my data by year.
Because of that, when I subset the data I only saw records for the year 1970,
which would have limited my exploration to a very narrow piece of the data.
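The pitfall can be reproduced on a small scale. The sketch below assumes the subset was taken as the first 100 rows of data ordered by year, which is my reading of the description rather than the exact code used. gt_small is a hypothetical stand-in for gtData.

```r
# Hypothetical stand-in for gtData: 300 rows per year, already ordered by year
gt_small <- data.frame(iyear = rep(c(1970, 1990, 2016), each = 300))

# The first 100 rows of year-ordered data only reach the earliest year
first_100 <- head(gt_small, 100)
years_in_head <- unique(first_100$iyear)

# A random sample of 100 rows draws from the whole table instead
set.seed(42)  # fixed seed so the sketch is reproducible
sampled_100 <- gt_small[sample(nrow(gt_small), 100), , drop = FALSE]
years_in_sample <- sort(unique(sampled_100$iyear))

print(years_in_head)
print(years_in_sample)
```

Sampling rows at random, or subsetting before grouping and sorting, keeps a quick exploration representative of the full date range.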
This dataset opens a world of exploration opportunities.
By analyzing and storing the variables of a threat, and cross-referencing the threat against the data collected in this dataset, it may be possible to estimate the potential outcome of a terrorist threat and possibly prevent it from taking place.
While exploring this dataset I have learned a great deal about the Global War on Terrorism, along with the geographic impact terrorism can have on the world.
I decided to render a map of the terrorist attacks.
d1 <- gtData %>%
mutate(name = country_txt) %>%
group_by(name) %>%
summarise(z = n())
hcmap("custom/world", data = d1, value = "z", borderColor = "#FAFAFA",
borderWidth = 0.1) %>%
hc_mapNavigation(enabled = TRUE, enableDoubleClickZoomTo = TRUE) %>%
hc_title(text = "Global Terrorism Attacks World Map")
[1] Tripoli is one of the historical battlegrounds of the United
States Marine Corps.