Rebecca Johnston
I have chosen to explore the Global Terrorism Database (GTD), available for download here. To do this, I will use the ggplot2 package in R, to produce “elegant graphics for data analysis” (one can only hope). I will also use plyr to complete data aggregation tasks where necessary.
library(ggplot2)
library(plyr)
terr <- read.table("globalterrorismdb_1012dist.txt", sep = "\t", quote = "\"",
header = TRUE, fill = TRUE)
str(terr)
## 'data.frame': 104457 obs. of 123 variables:
## $ eventid : Factor w/ 104438 levels " "," January 19, 2009.",..: 62 63 64 65 66 67 68 69 70 71 ...
## $ iyear : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
## $ imonth : int 0 0 1 1 1 1 1 1 1 1 ...
## $ iday : int 0 0 0 0 0 1 2 2 2 3 ...
## $ approxdate : Factor w/ 275 levels "","01/04/2000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ extended : int 0 0 0 0 0 0 0 0 0 0 ...
## $ resolution : Factor w/ 1836 levels "","1/01/1978",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ country : int 58 130 160 78 101 217 218 217 217 217 ...
## $ country_txt : Factor w/ 210 levels "","Afghanistan",..: 53 123 147 75 97 195 196 195 195 195 ...
## $ region : int 2 1 5 8 4 1 3 1 1 1 ...
## $ region_txt : Factor w/ 14 levels "","Australasia & Oceania",..: 3 8 12 14 5 8 10 8 8 8 ...
## $ provstate : Factor w/ 3581 levels ""," Azad Kashmir",..: 1 1 1 1 1 1391 1 687 3491 3491 ...
## $ city : Factor w/ 24688 levels "","'A'Ishah Bakkar",..: 20133 14517 22055 1762 7868 4207 14875 16310 13417 13417 ...
## $ latitude : num 18.5 19.4 NA 38 33.6 ...
## $ longitude : num -70 -99.1 NA 23.7 130.4 ...
## $ specificity : int 1 1 NA 1 1 1 NA 1 1 1 ...
## $ vicinity : int 0 0 0 0 0 0 0 0 0 0 ...
## $ location : Factor w/ 21943 levels "","-9","\"Colony 39\" settlement in uncleared areas of Ampara district, in eastern Sri Lanka",..: 1 1 1 1 1 1 1 2426 21792 1 ...
## $ summary : Factor w/ 13865 levels "","01/00/2006: American soldiers discovered three decapitated bodies on a soccer field west of Baghdad, Iraq. It is unknown when t"| __truncated__,..: 1 1 1 1 1 1 1 9964 1 1 ...
## $ crit1 : int 1 1 1 1 1 NA 1 1 NA NA ...
## $ crit2 : int 1 1 1 1 1 NA 1 1 NA NA ...
## $ crit3 : int 1 1 1 1 1 NA 1 1 NA NA ...
## $ doubtterr : int -9 -9 -9 -9 -9 NA -9 1 NA NA ...
## $ alternative : int NA NA NA NA NA NA NA 2 NA NA ...
## $ alternative_txt : Factor w/ 6 levels "",".","Insurgency/Guerilla Action",..: 2 2 2 2 2 1 2 6 1 1 ...
## $ multiple : int 0 0 0 0 0 NA 0 0 NA NA ...
## $ success : int 1 1 1 1 1 NA 0 1 NA NA ...
## $ suicide : int 0 0 0 0 0 NA 0 0 NA NA ...
## $ attacktype1 : int 1 6 1 3 7 NA 1 3 NA NA ...
## $ attacktype1_txt : Factor w/ 10 levels "","Armed Assault",..: 3 8 3 4 5 1 3 4 1 1 ...
## $ attacktype2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ attacktype2_txt : Factor w/ 8 levels "",".","Armed Assault",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ attacktype3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ attacktype3_txt : Factor w/ 6 levels "",".","Armed Assault",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ targtype1 : int 14 7 10 7 7 NA 3 21 NA NA ...
## $ targtype1_txt : Factor w/ 23 levels "","Abortion Related",..: 15 7 9 7 7 1 14 22 1 1 ...
## $ corp1 : Factor w/ 14672 levels "","'Private Network' column",..: 1 1670 14375 1 1 1 14138 9624 1 1 ...
## $ target1 : Factor w/ 47777 levels "","'Add al-Fattah Isma'il, secretery general",..: 23089 28153 14829 44363 44343 1 22964 14020 1 1 ...
## $ natlty1 : int 58 21 217 217 217 NA 218 217 NA NA ...
## $ natlty1_txt : Factor w/ 212 levels "",".","Afghanistan",..: 55 21 197 197 197 1 199 197 1 1 ...
## $ targtype2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ targtype2_txt : Factor w/ 24 levels "",".","Abortion Related",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ corp2 : Factor w/ 735 levels "","\"L'evenement Du Jeudi\", a magazine",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ target2 : Factor w/ 1305 levels "","(District) Survey Office",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ natlty2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ natlty2_txt : Factor w/ 122 levels "",".","Afghanistan",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ targtype3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ targtype3_txt : Factor w/ 19 levels "",".","Airports & Airlines",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ corp3 : Factor w/ 103 levels "","African National Congress (ANC)",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ target3 : Factor w/ 173 levels "","*","1 Citroen dealership",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ natlty3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ natlty3_txt : Factor w/ 64 levels "",".","Afghanistan",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ gname : Factor w/ 2473 levels "","1 May","14th of December Command",..: 1333 11 2389 2389 2389 1 2295 2389 1 1 ...
## $ gsubname : Factor w/ 433 levels "","10th Front",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gname2 : Factor w/ 91 levels "","1 May","Abu Baker Martyr Group",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gsubname2 : Factor w/ 6 levels "","al-Quds Brigades",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gname3 : Factor w/ 8 levels "","Armed Revolutionary Nuclei (NAR)",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gsubname3 : logi NA NA NA NA NA NA ...
## $ motive : Factor w/ 1346 levels "","\"Disgruntled tribesmen have frequently targeted the pipeline in the past few years to try to force the government to improve s"| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ guncertain1 : int 0 0 0 0 0 NA 0 0 NA NA ...
## $ guncertain2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ guncertain3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ nperps : int NA 7 NA NA NA NA 3 -99 NA NA ...
## $ nperpcap : int NA NA NA NA NA NA NA -99 NA NA ...
## $ claimed : int NA NA NA NA NA NA NA 0 NA NA ...
## $ claimmode : int NA NA NA NA NA NA NA NA NA NA ...
## $ claimmode_txt : Factor w/ 12 levels "",".","Call (post-incident)",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ claim2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ claimmode2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ claimmode2_txt : Factor w/ 6 levels "",".","Call (post-incident)",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ claim3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ claimmode3 : logi NA NA NA NA NA NA ...
## $ claimmode3_txt : Factor w/ 2 levels "",".": 2 2 2 2 2 1 2 2 1 1 ...
## $ compclaim : int NA NA NA NA NA NA NA NA NA NA ...
## $ weaptype1 : int 13 13 13 6 8 NA 5 6 NA NA ...
## $ weaptype1_txt : Factor w/ 13 levels "","Biological",..: 12 12 12 4 7 1 6 4 1 1 ...
## $ weapsubtype1 : int NA NA NA 16 NA NA 2 16 NA NA ...
## $ weapsubtype1_txt : Factor w/ 30 levels "",".","Arson/Fire",..: 2 2 2 27 2 1 4 27 1 1 ...
## $ weaptype2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ weaptype2_txt : Factor w/ 12 levels "",".","Chemical",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ weapsubtype2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ weapsubtype2_txt : Factor w/ 26 levels "",".","Arson/Fire",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ weaptype3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ weaptype3_txt : Factor w/ 11 levels "",".","Chemical",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ weapsubtype3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ weapsubtype3_txt : Factor w/ 21 levels "",".","Arson/Fire",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ weaptype4 : int NA NA NA NA NA NA NA NA NA NA ...
## $ weaptype4_txt : Factor w/ 8 levels "",".","Explosives/Bombs/Dynamite",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ weapsubtype4 : int NA NA NA NA NA NA NA NA NA NA ...
## $ weapsubtype4_txt : Factor w/ 17 levels "",".","Arson/Fire",..: 2 2 2 2 2 1 2 2 1 1 ...
## $ weapdetail : Factor w/ 3925 levels "","\"Necklacing\" - burning tire around neck",..: 1 1 1 1319 1911 1 739 1 1 1 ...
## $ nkill : num 1 0 1 NA NA NA 0 0 NA NA ...
## $ nkillus : int NA NA NA NA NA NA NA 0 NA NA ...
## $ nkillter : num NA NA NA NA NA NA NA 0 NA NA ...
## $ nwound : num 0 0 0 NA NA NA 0 0 NA NA ...
## $ nwoundus : int NA NA NA NA NA NA NA 0 NA NA ...
## $ nwoundte : int NA NA NA NA NA NA NA 0 NA NA ...
## $ property : int 0 0 0 1 1 NA 0 1 NA NA ...
## $ propextent : int NA NA NA NA NA NA NA 3 NA NA ...
## [list output truncated]
This is a huge dataset, over 100 000 rows. It contains both quantitative and qualitative data to work with. Also note, there are quite a few blank and NA entries which may become an issue when it comes to plotting data.
ggplot(df, aes(x, y,
Arguments:
ggplot(terr, aes(x = iyear, y = nkill)) + geom_point()
## Warning: Removed 30819 rows containing missing values (geom_point).
Can already see outliers, the most extreme being in year 1994.
ggplot(terr, aes(x = iyear, y = nkill, colour = region_txt)) + geom_point()
## Warning: Removed 30819 rows containing missing values (geom_point).
Comments on the above graph:
- The blank entries for region are an issue. There is a blank entry for region in the legend. We should remove entries withour region data.
- There are no data for 1993!
- I would like the increments/tick marks to be more frequent on the x axis. Every 2 years perhaps, instead of every decade.
which(terr$region_txt == "")
## [1] 73869 82898 83627 85255 90081 90082 90083 90084 90156 90157
## [11] 90180 90182 90184 90185 90197 90198 90199 90200 90202 90211
## [21] 90212 90213 90217 90218 90219 90220 90221 90240 90242 90247
## [31] 90272 90273 90274 90291 90357 90358 90360 90361 90365 90366
## [41] 90392 90439 90449 90466 90467 90468 90469 90470 90471 90472
## [51] 90473 90546 90548 90549 90600 90601 90647 90648 90650 90651
## [61] 90652 90653 90655 90656 90685 90688 90689 90754 90755 90790
## [71] 90791 90792 90793 90794 90795 90877 90886 90904 90905 90906
## [81] 90940 90953 90954 90990 92266 103378 103396 103399 103401 103459
## [91] 103499 103566 103578 103693 103705 103740 103813 103910 103987 103988
## [101] 103991 103992 103993 104097 104107 104241 104424 104432
terr[73869, c(1:10)] # Double check we truly are picking up blank entries
## eventid iyear imonth iday approxdate extended resolution
## 73869 200203250004 2002 3 0 March 18-25, 2002 0
## country country_txt region
## 73869 NA NA
terr <- terr[!(terr$region_txt == ""), ] # Remove rows without region data
dim(terr)
## [1] 104349 123
killPerYr <- ggplot(terr, aes(x = iyear, y = nkill, colour = region_txt)) +
geom_point()
killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25)) + scale_x_continuous(breaks = seq(min(terr$iyear),
max(terr$iyear), 10))
## Warning: Removed 30711 rows containing missing values (geom_point).
This graph appears much less cramped in R Studio, but in the transition from .rmd to HTML with knitr the graph became cramped along the x axis! Is it possible to change the appearance? Also, if I wanted to be really finicky (and had more time up my sleeve) I would change the names of the x and y axis and the legend from the defaults.
killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25)) + facet_wrap(~region_txt)
## Warning: Removed 31 rows containing missing values (geom_point).
## Warning: Removed 2461 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 254 rows containing missing values (geom_point).
## Warning: Removed 7287 rows containing missing values (geom_point).
## Warning: Removed 1067 rows containing missing values (geom_point).
## Warning: Removed 1104 rows containing missing values (geom_point).
## Warning: Removed 2844 rows containing missing values (geom_point).
## Warning: Removed 8680 rows containing missing values (geom_point).
## Warning: Removed 2586 rows containing missing values (geom_point).
## Warning: Removed 2075 rows containing missing values (geom_point).
## Warning: Removed 2162 rows containing missing values (geom_point).
Graph is super cramped, all thanks to the legend. And really, in this case the legend is redundant since the facets are named!
killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25), legend.position = "none") +
facet_wrap(~region_txt)
## Warning: Removed 31 rows containing missing values (geom_point).
## Warning: Removed 2461 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 254 rows containing missing values (geom_point).
## Warning: Removed 7287 rows containing missing values (geom_point).
## Warning: Removed 1067 rows containing missing values (geom_point).
## Warning: Removed 1104 rows containing missing values (geom_point).
## Warning: Removed 2844 rows containing missing values (geom_point).
## Warning: Removed 8680 rows containing missing values (geom_point).
## Warning: Removed 2586 rows containing missing values (geom_point).
## Warning: Removed 2075 rows containing missing values (geom_point).
## Warning: Removed 2162 rows containing missing values (geom_point).
levels(terr$region_txt)
## [1] ""
## [2] "Australasia & Oceania"
## [3] "Central America & Caribbean"
## [4] "Central Asia"
## [5] "East Asia"
## [6] "Eastern Europe"
## [7] "Middle East & North Africa"
## [8] "North America"
## [9] "Russia & the Newly Independent States (NIS)"
## [10] "South America"
## [11] "South Asia"
## [12] "Southeast Asia"
## [13] "Sub-Saharan Africa"
## [14] "Western Europe"
levels(terr$region_txt) <- c("", "Oceania", "Central America", "Central Asia",
"East Asia", "Eastern Europe", "MENA", "North America", "Russia", "South America",
"South Asia", "Southeast Asia", "Sub-Saharan Africa", "Western Europe")
killPerYr <- ggplot(terr, aes(x = iyear, y = nkill, colour = region_txt)) +
geom_point()
killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25), legend.position = "none") +
facet_wrap(~region_txt)
## Warning: Removed 31 rows containing missing values (geom_point).
## Warning: Removed 2461 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 254 rows containing missing values (geom_point).
## Warning: Removed 7287 rows containing missing values (geom_point).
## Warning: Removed 1067 rows containing missing values (geom_point).
## Warning: Removed 1104 rows containing missing values (geom_point).
## Warning: Removed 2844 rows containing missing values (geom_point).
## Warning: Removed 8680 rows containing missing values (geom_point).
## Warning: Removed 2586 rows containing missing values (geom_point).
## Warning: Removed 2075 rows containing missing values (geom_point).
## Warning: Removed 2162 rows containing missing values (geom_point).
ddply(terr, ~region_txt, summarize, maxKillPerYr = max(nkill))
## region_txt maxKillPerYr
## 1 Oceania NA
## 2 Central America NA
## 3 Central Asia NA
## 4 East Asia NA
## 5 Eastern Europe NA
## 6 MENA NA
## 7 North America NA
## 8 Russia NA
## 9 South America NA
## 10 South Asia NA
## 11 Southeast Asia NA
## 12 Sub-Saharan Africa NA
## 13 Western Europe NA
What?!?! Lies.
max(terr$nkill)
## [1] NA
Returns NA!? We know this isn't true… hmm… Turns out there is an argument in max to ignore NA entries!
max(terr$nkill, na.rm = TRUE)
## [1] 1180
ddply(terr, ~region_txt, summarize, maxKillPerYr = max(nkill, na.rm = TRUE))
## region_txt maxKillPerYr
## 1 Oceania 17
## 2 Central America 300
## 3 Central Asia 22
## 4 East Asia 25
## 5 Eastern Europe 180
## 6 MENA 422
## 7 North America 329
## 8 Russia 121
## 9 South America 107
## 10 South Asia 200
## 11 Southeast Asia 114
## 12 Sub-Saharan Africa 1180
## 13 Western Europe 270
Success!
maxKillReg <- ddply(terr, ~region_txt, summarize, maxKillPerYr = max(nkill,
na.rm = TRUE))
ggplot(maxKillReg, aes(x = region_txt, y = maxKillPerYr)) + geom_histogram() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.25))
## Mapping a variable to y and also using stat="bin".
## With stat="bin", it will attempt to set the y value to the count of cases in each group.
## This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
## If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
## If you want y to represent values in the data, use stat="identity".
## See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
Returns a massive warning about mapping a variable to y without specifying bin…
ggplot(maxKillReg, aes(x = region_txt, y = maxKillPerYr)) + geom_histogram(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.25))
maxKillReg <- within(maxKillReg, region_txt <- reorder(region_txt, as.integer(maxKillPerYr),
FUN = min))
ggplot(maxKillReg, aes(x = region_txt, y = maxKillPerYr)) + geom_histogram(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.25, hjust = 1))
ggplot(terr, aes(x = iyear, y = nkill)) + geom_point() + theme(axis.text.x = element_text(angle = 90,
vjust = 0.25)) + facet_wrap(~attacktype1_txt)
## Warning: Removed 24216 rows containing missing values (geom_point).
## Warning: Removed 1774 rows containing missing values (geom_point).
## Warning: Removed 72 rows containing missing values (geom_point).
## Warning: Removed 3208 rows containing missing values (geom_point).
## Warning: Removed 529 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing missing values (geom_point).
## Warning: Removed 83 rows containing missing values (geom_point).
## Warning: Removed 588 rows containing missing values (geom_point).
## Warning: Removed 15 rows containing missing values (geom_point).
## Warning: Removed 213 rows containing missing values (geom_point).
How do you deal with an empty facet???
killByAtt <- ddply(terr, ~attacktype1_txt, summarize, totalKill = sum(nkill,
na.rm = TRUE))
killByAtt
## attacktype1_txt totalKill
## 1 0
## 2 Armed Assault 89930
## 3 Assassination 16854
## 4 Bombing/Explosion 40823
## 5 Facility/Infrastructure Attack 2107
## 6 Hijacking 467
## 7 Hostage Taking (Barricade Incident) 911
## 8 Hostage Taking (Kidnapping) 2146
## 9 Unarmed Assault 629
## 10 Unknown 9213
# Reorder levels based on minimum of totalKill
killByAtt <- within(killByAtt, attacktype1_txt <- reorder(attacktype1_txt, as.integer(totalKill),
FUN = min))
# Plot result
ggplot(killByAtt, aes(x = attacktype1_txt, y = totalKill)) + geom_histogram(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
maxKillSumm <- ddply(terr, ~region_txt, function(x) {
indMax <- which.max(x$nkill)
x[indMax, c("nkill", "attacktype1_txt", "weaptype1_txt", "gname")]
})
maxKillSUmm <- within(maxKillSumm, region_txt <- reorder(region_txt, as.integer(nkill),
FUN = min))
ggplot(maxKillSUmm, aes(x = region_txt, y = nkill, fill = attacktype1_txt)) +
geom_histogram(stat = "identity") + theme(axis.text.x = element_text(angle = 45,
hjust = 1))
Issues:
- How to deal with blank entries and NA values??? I had a huge amount of “Warnings” from R about removing rows with missing values each time I called on the ggplot function. And I had graphs that displayed empty facets.
- Best practice renaming long entries that are displayed? (e.g. “Russia” here). Rename levels?