STAT545A hw05: Visualize anything with ggplot2

Rebecca Johnston

I have chosen to explore the Global Terrorism Database (GTD), available for download here. To do this, I will use the ggplot2 package in R, to produce “elegant graphics for data analysis” (one can only hope). I will also use plyr to complete data aggregation tasks where necessary.


Load required R packages and GTD data

library(ggplot2)
library(plyr)
terr <- read.table("globalterrorismdb_1012dist.txt", sep = "\t", quote = "\"", 
    header = TRUE, fill = TRUE)
str(terr)
## 'data.frame':    104457 obs. of  123 variables:
##  $ eventid           : Factor w/ 104438 levels " ","  January 19, 2009.",..: 62 63 64 65 66 67 68 69 70 71 ...
##  $ iyear             : int  1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ imonth            : int  0 0 1 1 1 1 1 1 1 1 ...
##  $ iday              : int  0 0 0 0 0 1 2 2 2 3 ...
##  $ approxdate        : Factor w/ 275 levels "","01/04/2000",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ extended          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ resolution        : Factor w/ 1836 levels "","1/01/1978",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ country           : int  58 130 160 78 101 217 218 217 217 217 ...
##  $ country_txt       : Factor w/ 210 levels "","Afghanistan",..: 53 123 147 75 97 195 196 195 195 195 ...
##  $ region            : int  2 1 5 8 4 1 3 1 1 1 ...
##  $ region_txt        : Factor w/ 14 levels "","Australasia & Oceania",..: 3 8 12 14 5 8 10 8 8 8 ...
##  $ provstate         : Factor w/ 3581 levels ""," Azad Kashmir",..: 1 1 1 1 1 1391 1 687 3491 3491 ...
##  $ city              : Factor w/ 24688 levels "","'A'Ishah Bakkar",..: 20133 14517 22055 1762 7868 4207 14875 16310 13417 13417 ...
##  $ latitude          : num  18.5 19.4 NA 38 33.6 ...
##  $ longitude         : num  -70 -99.1 NA 23.7 130.4 ...
##  $ specificity       : int  1 1 NA 1 1 1 NA 1 1 1 ...
##  $ vicinity          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ location          : Factor w/ 21943 levels "","-9","\"Colony 39\" settlement in uncleared areas of Ampara district, in eastern Sri Lanka",..: 1 1 1 1 1 1 1 2426 21792 1 ...
##  $ summary           : Factor w/ 13865 levels "","01/00/2006: American soldiers discovered three decapitated bodies on a soccer field west of Baghdad, Iraq. It is unknown when t"| __truncated__,..: 1 1 1 1 1 1 1 9964 1 1 ...
##  $ crit1             : int  1 1 1 1 1 NA 1 1 NA NA ...
##  $ crit2             : int  1 1 1 1 1 NA 1 1 NA NA ...
##  $ crit3             : int  1 1 1 1 1 NA 1 1 NA NA ...
##  $ doubtterr         : int  -9 -9 -9 -9 -9 NA -9 1 NA NA ...
##  $ alternative       : int  NA NA NA NA NA NA NA 2 NA NA ...
##  $ alternative_txt   : Factor w/ 6 levels "",".","Insurgency/Guerilla Action",..: 2 2 2 2 2 1 2 6 1 1 ...
##  $ multiple          : int  0 0 0 0 0 NA 0 0 NA NA ...
##  $ success           : int  1 1 1 1 1 NA 0 1 NA NA ...
##  $ suicide           : int  0 0 0 0 0 NA 0 0 NA NA ...
##  $ attacktype1       : int  1 6 1 3 7 NA 1 3 NA NA ...
##  $ attacktype1_txt   : Factor w/ 10 levels "","Armed Assault",..: 3 8 3 4 5 1 3 4 1 1 ...
##  $ attacktype2       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ attacktype2_txt   : Factor w/ 8 levels "",".","Armed Assault",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ attacktype3       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ attacktype3_txt   : Factor w/ 6 levels "",".","Armed Assault",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ targtype1         : int  14 7 10 7 7 NA 3 21 NA NA ...
##  $ targtype1_txt     : Factor w/ 23 levels "","Abortion Related",..: 15 7 9 7 7 1 14 22 1 1 ...
##  $ corp1             : Factor w/ 14672 levels "","'Private Network' column",..: 1 1670 14375 1 1 1 14138 9624 1 1 ...
##  $ target1           : Factor w/ 47777 levels "","'Add al-Fattah Isma'il, secretery general",..: 23089 28153 14829 44363 44343 1 22964 14020 1 1 ...
##  $ natlty1           : int  58 21 217 217 217 NA 218 217 NA NA ...
##  $ natlty1_txt       : Factor w/ 212 levels "",".","Afghanistan",..: 55 21 197 197 197 1 199 197 1 1 ...
##  $ targtype2         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ targtype2_txt     : Factor w/ 24 levels "",".","Abortion Related",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ corp2             : Factor w/ 735 levels "","\"L'evenement Du Jeudi\", a magazine",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ target2           : Factor w/ 1305 levels "","(District) Survey Office",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ natlty2           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ natlty2_txt       : Factor w/ 122 levels "",".","Afghanistan",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ targtype3         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ targtype3_txt     : Factor w/ 19 levels "",".","Airports & Airlines",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ corp3             : Factor w/ 103 levels "","African National Congress (ANC)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ target3           : Factor w/ 173 levels "","*","1 Citroen dealership",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ natlty3           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ natlty3_txt       : Factor w/ 64 levels "",".","Afghanistan",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ gname             : Factor w/ 2473 levels "","1 May","14th of December Command",..: 1333 11 2389 2389 2389 1 2295 2389 1 1 ...
##  $ gsubname          : Factor w/ 433 levels "","10th Front",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gname2            : Factor w/ 91 levels "","1 May","Abu Baker Martyr Group",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gsubname2         : Factor w/ 6 levels "","al-Quds Brigades",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gname3            : Factor w/ 8 levels "","Armed Revolutionary Nuclei (NAR)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gsubname3         : logi  NA NA NA NA NA NA ...
##  $ motive            : Factor w/ 1346 levels "","\"Disgruntled tribesmen have frequently targeted the pipeline in the past few years to try to force the government to improve s"| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ guncertain1       : int  0 0 0 0 0 NA 0 0 NA NA ...
##  $ guncertain2       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ guncertain3       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ nperps            : int  NA 7 NA NA NA NA 3 -99 NA NA ...
##  $ nperpcap          : int  NA NA NA NA NA NA NA -99 NA NA ...
##  $ claimed           : int  NA NA NA NA NA NA NA 0 NA NA ...
##  $ claimmode         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ claimmode_txt     : Factor w/ 12 levels "",".","Call (post-incident)",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ claim2            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ claimmode2        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ claimmode2_txt    : Factor w/ 6 levels "",".","Call (post-incident)",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ claim3            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ claimmode3        : logi  NA NA NA NA NA NA ...
##  $ claimmode3_txt    : Factor w/ 2 levels "",".": 2 2 2 2 2 1 2 2 1 1 ...
##  $ compclaim         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weaptype1         : int  13 13 13 6 8 NA 5 6 NA NA ...
##  $ weaptype1_txt     : Factor w/ 13 levels "","Biological",..: 12 12 12 4 7 1 6 4 1 1 ...
##  $ weapsubtype1      : int  NA NA NA 16 NA NA 2 16 NA NA ...
##  $ weapsubtype1_txt  : Factor w/ 30 levels "",".","Arson/Fire",..: 2 2 2 27 2 1 4 27 1 1 ...
##  $ weaptype2         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weaptype2_txt     : Factor w/ 12 levels "",".","Chemical",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ weapsubtype2      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weapsubtype2_txt  : Factor w/ 26 levels "",".","Arson/Fire",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ weaptype3         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weaptype3_txt     : Factor w/ 11 levels "",".","Chemical",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ weapsubtype3      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weapsubtype3_txt  : Factor w/ 21 levels "",".","Arson/Fire",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ weaptype4         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weaptype4_txt     : Factor w/ 8 levels "",".","Explosives/Bombs/Dynamite",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ weapsubtype4      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weapsubtype4_txt  : Factor w/ 17 levels "",".","Arson/Fire",..: 2 2 2 2 2 1 2 2 1 1 ...
##  $ weapdetail        : Factor w/ 3925 levels "","\"Necklacing\" - burning tire around neck",..: 1 1 1 1319 1911 1 739 1 1 1 ...
##  $ nkill             : num  1 0 1 NA NA NA 0 0 NA NA ...
##  $ nkillus           : int  NA NA NA NA NA NA NA 0 NA NA ...
##  $ nkillter          : num  NA NA NA NA NA NA NA 0 NA NA ...
##  $ nwound            : num  0 0 0 NA NA NA 0 0 NA NA ...
##  $ nwoundus          : int  NA NA NA NA NA NA NA 0 NA NA ...
##  $ nwoundte          : int  NA NA NA NA NA NA NA 0 NA NA ...
##  $ property          : int  0 0 0 1 1 NA 0 1 NA NA ...
##  $ propextent        : int  NA NA NA NA NA NA NA 3 NA NA ...
##   [list output truncated]

This is a huge dataset, over 100 000 rows. It contains both quantitative and qualitative data to work with. Also note, there are quite a few blank and NA entries which may become an issue when it comes to plotting data.


Short note on ggplot() function

ggplot(df, aes(x, y, )) + …

Arguments:


“geom point”

ggplot(terr, aes(x = iyear, y = nkill)) + geom_point()
## Warning: Removed 30819 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-3

Can already see outliers, the most extreme being in year 1994.

ggplot(terr, aes(x = iyear, y = nkill, colour = region_txt)) + geom_point()
## Warning: Removed 30819 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-4

Comments on the above graph:

which(terr$region_txt == "")
##   [1]  73869  82898  83627  85255  90081  90082  90083  90084  90156  90157
##  [11]  90180  90182  90184  90185  90197  90198  90199  90200  90202  90211
##  [21]  90212  90213  90217  90218  90219  90220  90221  90240  90242  90247
##  [31]  90272  90273  90274  90291  90357  90358  90360  90361  90365  90366
##  [41]  90392  90439  90449  90466  90467  90468  90469  90470  90471  90472
##  [51]  90473  90546  90548  90549  90600  90601  90647  90648  90650  90651
##  [61]  90652  90653  90655  90656  90685  90688  90689  90754  90755  90790
##  [71]  90791  90792  90793  90794  90795  90877  90886  90904  90905  90906
##  [81]  90940  90953  90954  90990  92266 103378 103396 103399 103401 103459
##  [91] 103499 103566 103578 103693 103705 103740 103813 103910 103987 103988
## [101] 103991 103992 103993 104097 104107 104241 104424 104432
terr[73869, c(1:10)]  # Double check we truly are picking up blank entries
##            eventid iyear imonth iday        approxdate extended resolution
## 73869 200203250004  2002      3    0 March 18-25, 2002        0           
##       country country_txt region
## 73869      NA                 NA
terr <- terr[!(terr$region_txt == ""), ]  # Remove rows without region data
dim(terr)
## [1] 104349    123
killPerYr <- ggplot(terr, aes(x = iyear, y = nkill, colour = region_txt)) + 
    geom_point()
killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25)) + scale_x_continuous(breaks = seq(min(terr$iyear), 
    max(terr$iyear), 10))
## Warning: Removed 30711 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-6

This graph appears much less cramped in R Studio, but in the transition from .rmd to HTML with knitr the graph became cramped along the x axis! Is it possible to change the appearance? Also, if I wanted to be really finicky (and had more time up my sleeve) I would change the names of the x and y axis and the legend from the defaults.


“geom point” + “facet wraps”

killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25)) + facet_wrap(~region_txt)
## Warning: Removed 31 rows containing missing values (geom_point).
## Warning: Removed 2461 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 254 rows containing missing values (geom_point).
## Warning: Removed 7287 rows containing missing values (geom_point).
## Warning: Removed 1067 rows containing missing values (geom_point).
## Warning: Removed 1104 rows containing missing values (geom_point).
## Warning: Removed 2844 rows containing missing values (geom_point).
## Warning: Removed 8680 rows containing missing values (geom_point).
## Warning: Removed 2586 rows containing missing values (geom_point).
## Warning: Removed 2075 rows containing missing values (geom_point).
## Warning: Removed 2162 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-7

Graph is super cramped, all thanks to the legend. And really, in this case the legend is redundant since the facets are named!

killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25), legend.position = "none") + 
    facet_wrap(~region_txt)
## Warning: Removed 31 rows containing missing values (geom_point).
## Warning: Removed 2461 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 254 rows containing missing values (geom_point).
## Warning: Removed 7287 rows containing missing values (geom_point).
## Warning: Removed 1067 rows containing missing values (geom_point).
## Warning: Removed 1104 rows containing missing values (geom_point).
## Warning: Removed 2844 rows containing missing values (geom_point).
## Warning: Removed 8680 rows containing missing values (geom_point).
## Warning: Removed 2586 rows containing missing values (geom_point).
## Warning: Removed 2075 rows containing missing values (geom_point).
## Warning: Removed 2162 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-8

levels(terr$region_txt)
##  [1] ""                                           
##  [2] "Australasia & Oceania"                      
##  [3] "Central America & Caribbean"                
##  [4] "Central Asia"                               
##  [5] "East Asia"                                  
##  [6] "Eastern Europe"                             
##  [7] "Middle East & North Africa"                 
##  [8] "North America"                              
##  [9] "Russia & the Newly Independent States (NIS)"
## [10] "South America"                              
## [11] "South Asia"                                 
## [12] "Southeast Asia"                             
## [13] "Sub-Saharan Africa"                         
## [14] "Western Europe"
levels(terr$region_txt) <- c("", "Oceania", "Central America", "Central Asia", 
    "East Asia", "Eastern Europe", "MENA", "North America", "Russia", "South America", 
    "South Asia", "Southeast Asia", "Sub-Saharan Africa", "Western Europe")
killPerYr <- ggplot(terr, aes(x = iyear, y = nkill, colour = region_txt)) + 
    geom_point()
killPerYr + theme(axis.text.x = element_text(angle = 90, vjust = 0.25), legend.position = "none") + 
    facet_wrap(~region_txt)
## Warning: Removed 31 rows containing missing values (geom_point).
## Warning: Removed 2461 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 254 rows containing missing values (geom_point).
## Warning: Removed 7287 rows containing missing values (geom_point).
## Warning: Removed 1067 rows containing missing values (geom_point).
## Warning: Removed 1104 rows containing missing values (geom_point).
## Warning: Removed 2844 rows containing missing values (geom_point).
## Warning: Removed 8680 rows containing missing values (geom_point).
## Warning: Removed 2586 rows containing missing values (geom_point).
## Warning: Removed 2075 rows containing missing values (geom_point).
## Warning: Removed 2162 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-10


“geom histogram” and plyr

ddply(terr, ~region_txt, summarize, maxKillPerYr = max(nkill))
##            region_txt maxKillPerYr
## 1             Oceania           NA
## 2     Central America           NA
## 3        Central Asia           NA
## 4           East Asia           NA
## 5      Eastern Europe           NA
## 6                MENA           NA
## 7       North America           NA
## 8              Russia           NA
## 9       South America           NA
## 10         South Asia           NA
## 11     Southeast Asia           NA
## 12 Sub-Saharan Africa           NA
## 13     Western Europe           NA

What?!?! Lies.

max(terr$nkill)
## [1] NA

Returns NA!? We know this isn't true… hmm… Turns out there is an argument in max to ignore NA entries!

max(terr$nkill, na.rm = TRUE)
## [1] 1180
ddply(terr, ~region_txt, summarize, maxKillPerYr = max(nkill, na.rm = TRUE))
##            region_txt maxKillPerYr
## 1             Oceania           17
## 2     Central America          300
## 3        Central Asia           22
## 4           East Asia           25
## 5      Eastern Europe          180
## 6                MENA          422
## 7       North America          329
## 8              Russia          121
## 9       South America          107
## 10         South Asia          200
## 11     Southeast Asia          114
## 12 Sub-Saharan Africa         1180
## 13     Western Europe          270

Success!

maxKillReg <- ddply(terr, ~region_txt, summarize, maxKillPerYr = max(nkill, 
    na.rm = TRUE))
ggplot(maxKillReg, aes(x = region_txt, y = maxKillPerYr)) + geom_histogram() + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.25))
## Mapping a variable to y and also using stat="bin".
##   With stat="bin", it will attempt to set the y value to the count of cases in each group.
##   This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
##   If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
##   If you want y to represent values in the data, use stat="identity".
##   See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)

plot of chunk unnamed-chunk-14

Returns a massive warning about mapping a variable to y without specifying bin…

ggplot(maxKillReg, aes(x = region_txt, y = maxKillPerYr)) + geom_histogram(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.25))

plot of chunk unnamed-chunk-15

maxKillReg <- within(maxKillReg, region_txt <- reorder(region_txt, as.integer(maxKillPerYr), 
    FUN = min))
ggplot(maxKillReg, aes(x = region_txt, y = maxKillPerYr)) + geom_histogram(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.25, hjust = 1))

plot of chunk unnamed-chunk-16


Other variables in GTD data

ggplot(terr, aes(x = iyear, y = nkill)) + geom_point() + theme(axis.text.x = element_text(angle = 90, 
    vjust = 0.25)) + facet_wrap(~attacktype1_txt)
## Warning: Removed 24216 rows containing missing values (geom_point).
## Warning: Removed 1774 rows containing missing values (geom_point).
## Warning: Removed 72 rows containing missing values (geom_point).
## Warning: Removed 3208 rows containing missing values (geom_point).
## Warning: Removed 529 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing missing values (geom_point).
## Warning: Removed 83 rows containing missing values (geom_point).
## Warning: Removed 588 rows containing missing values (geom_point).
## Warning: Removed 15 rows containing missing values (geom_point).
## Warning: Removed 213 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-17

How do you deal with an empty facet???

killByAtt <- ddply(terr, ~attacktype1_txt, summarize, totalKill = sum(nkill, 
    na.rm = TRUE))
killByAtt
##                        attacktype1_txt totalKill
## 1                                              0
## 2                        Armed Assault     89930
## 3                        Assassination     16854
## 4                    Bombing/Explosion     40823
## 5       Facility/Infrastructure Attack      2107
## 6                            Hijacking       467
## 7  Hostage Taking (Barricade Incident)       911
## 8          Hostage Taking (Kidnapping)      2146
## 9                      Unarmed Assault       629
## 10                             Unknown      9213
# Reorder levels based on minimum of totalKill
killByAtt <- within(killByAtt, attacktype1_txt <- reorder(attacktype1_txt, as.integer(totalKill), 
    FUN = min))
# Plot result
ggplot(killByAtt, aes(x = attacktype1_txt, y = totalKill)) + geom_histogram(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot of chunk unnamed-chunk-18

maxKillSumm <- ddply(terr, ~region_txt, function(x) {
    indMax <- which.max(x$nkill)
    x[indMax, c("nkill", "attacktype1_txt", "weaptype1_txt", "gname")]
})
maxKillSUmm <- within(maxKillSumm, region_txt <- reorder(region_txt, as.integer(nkill), 
    FUN = min))
ggplot(maxKillSUmm, aes(x = region_txt, y = nkill, fill = attacktype1_txt)) + 
    geom_histogram(stat = "identity") + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1))

plot of chunk unnamed-chunk-20

Issues: