Wikileaks Data

Data Understanding and Preparation

The Guardian has stored the data in a fusion table. You can pull the data into your working directory as a .csv like this.

# y <- download.file("http://www.google.com/fusiontables/exporttable?query=select+col0%2Ccol1%2Ccol2%2Ccol3%2Ccol4%2Ccol5%2Ccol6%2Ccol7%2Ccol8%2Ccol9%2Ccol10%2Ccol11%2Ccol12%2Ccol13%2Ccol14%2Ccol15%2Ccol16%2Ccol17%2Ccol18+from+273326",destfile = "iraq.csv")

# file iraq.csv will be stored in your working directory
# iraq <- read.csv("iraq.csv", stringsAsFactors = FALSE)
# head(iraq)

I have saved this file already, so I load it accordingly and begin to explore.

iraq <- read.csv("iraqdeaths.csv", stringsAsFactors = FALSE)
dim(iraq)

## [1] 52048    19

names(iraq)

##  [1] "Report.key"               "Date.and.time"           
##  [3] "Type"                     "Category"                
##  [5] "Title"                    "Region"                  
##  [7] "Attack.on"                "Coalition.forces.wounded"
##  [9] "Coalition.forces.killed"  "Iraq.forces.wounded"     
## [11] "Iraq.forces.killed"       "Civilian.wia"            
## [13] "Civilian.kia"             "Enemy.wia"               
## [15] "Enemy.kia"                "Enemy.detained"          
## [17] "Total.deaths"             "Latitude"                
## [19] "Longitude"

str(iraq)

## 'data.frame':    52048 obs. of  19 variables:
##  $ Report.key              : chr  "9488" "9507" "9543" "9549" ...
##  $ Date.and.time           : chr  "01/01/2004 03:00" "01/01/2004 09:26" "01/01/2004 12:00" "01/01/2004 16:30" ...
##  $ Type                    : chr  "Non-Combat Event" "Non-Combat Event" "Criminal Event" "Friendly Action" ...
##  $ Category                : chr  "Accident" "Accident" "Murder" "Other" ...
##  $ Title                   : chr  "FATAL VEHICLE ACCIDENT IN BAGHDAD (ZONE 1) - 2 KIA" "1/1 ID 5-TON TRUCK ACCIDENT S. OF AR RAMADI: 1 FKIA, 6 FWIA" "UNIVERSITY DEAN FOUND DEAD" "GAS STATION SET ON FIRE; 1 X EN KIA, 2 X EN WIA" ...
##  $ Region                  : chr  "MND-BAGHDAD" "MNF-W" "MND-N" "MND-SE" ...
##  $ Attack.on               : chr  "NEUTRAL" "NEUTRAL" "ENEMY" "FRIEND" ...
##  $ Coalition.forces.wounded: int  0 6 0 0 0 0 1 3 1 5 ...
##  $ Coalition.forces.killed : int  2 1 0 0 0 0 1 2 1 1 ...
##  $ Iraq.forces.wounded     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Iraq.forces.killed      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Civilian.wia            : int  0 0 0 0 2 0 0 0 0 0 ...
##  $ Civilian.kia            : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ Enemy.wia               : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ Enemy.kia               : int  0 0 0 1 0 4 0 0 0 0 ...
##  $ Enemy.detained          : int  0 0 0 0 0 6 0 0 0 3 ...
##  $ Total.deaths            : int  2 1 1 1 1 4 1 2 1 1 ...
##  $ Latitude                : num  33.3 33.2 36.4 31.1 31.1 ...
##  $ Longitude               : num  44.4 43.4 43.2 46.2 46.2 ...

iraq[11:12, 2]

## [1] "03/01/2004 00:48" "03/01/2004 03:33"

Let’s put Date.and.time into something simpler as well as making it of class “Date”.

names(iraq)[2] <- "DTG"
iraq$DTG <- as.Date(iraq$DTG, "%d/%m/%Y")

Notice within as.Date() that after specifying the object to be converted, it was necessary to put in quotation marks exactly how the date is formatted. The default for as.Date() is to try year first, so if we had not specified it we would have received an unintelligible result. By the way, the capital Y in our specification means year with century, whereas a small y (%y) would yield year without century (00-99). You can see all of the format options available by calling up the R help for “strptime”. We have other data issues to deal with when we examine Type and Category. Let’s have a look at them and also include univariate tables for Region and Attack.on.

table(iraq$Type)

## 
##      criminal event      Criminal Event      CRIMINAL EVENT 
##                   1               24356                   3 
##        Enemy Action    Explosive Hazard    EXPLOSIVE HAZARD 
##               11681                9878                   1 
##     Friendly Action       Friendly Fire    Non-Combat Event 
##                4636                 184                1107 
##               Other Suspicious Incident       Threat Report 
##                 146                  32                  23

length(unique(iraq$Category))

## [1] 86

table(iraq$Region)

## 
##             MND-BAGHDAD       MND-C       MND-N      MND-NE       MND-S 
##          49       23783        4678       15944          64         378 
##      MND-SE       MNF-W 
##        2680        4472

table(iraq$Attack.on)

## 
##   ENEMY  FRIEND NEUTRAL 
##   45975    4820    1253

The tolower() function will take care of the levels in Type, but what of the 86 different levels of Category? In this instance I will not include it and leave it to the enterprising reader to use it for their own analysis and reporting. Much the same could be said about Title, which has over 39,000 different entries and can be rich in text data. At any rate, the forthcoming clustering algorithm will give us our own categories. The other thing I want to do is rename the quantitative variables that describe the killed and wounded, giving them shorter and simpler naming conventions.

iraq$Type <- tolower(iraq$Type)
table(iraq$Type)

## 
##      criminal event        enemy action    explosive hazard 
##               24360               11681                9879 
##     friendly action       friendly fire    non-combat event 
##                4636                 184                1107 
##               other suspicious incident       threat report 
##                 146                  32                  23

names(iraq)[8] = "CF.W"
names(iraq)[9] = "CF.KIA"
names(iraq)[10] = "I.W"
names(iraq)[11] = "I.KIA"
names(iraq)[12] = "Civ.W"
names(iraq)[13] = "Civ.KIA"
names(iraq)[14] = "EN.W"
names(iraq)[15] = "EN.KIA"
names(iraq)[16] = "EN.Detained"
names(iraq)[17] = "Total.KIA"
names(iraq)

##  [1] "Report.key"  "DTG"         "Type"        "Category"    "Title"      
##  [6] "Region"      "Attack.on"   "CF.W"        "CF.KIA"      "I.W"        
## [11] "I.KIA"       "Civ.W"       "Civ.KIA"     "EN.W"        "EN.KIA"     
## [16] "EN.Detained" "Total.KIA"   "Latitude"    "Longitude"

I also want to include variables of “year” and “month” by themselves. Then, we can create a subset of the data. This will only be for Mult-National Division - Baghdad (MND-BAGHDAD) and for the year 2009.

iraq$year = as.numeric(format(iraq$DTG, "%Y"))
iraq$month = as.numeric(format(iraq$DTG, "%m"))
baghdad = subset(iraq, Region == "MND-BAGHDAD" & year == 2009)
dim(baghdad)

## [1] 708  21

There are quite a few univariate and multivariate visualizations we could perform now, but in the interest of time and space, I will forgo them. As stated before, please conduct and publish your own analysis.

Modeling and Evaluation

We will now cluster the data. It is mixed, which is to say that we have quantitative and qualitative variables, or more specifically, nominal, ordinal and interval/ratio variables. There are a number of ways we could approach this for clustering, such as doing Principal Components Analysis (PCA) first to create latent variables then using them as the input into clustering, or using different dissimilarity calculations. I am personally fond of using the Gower metric along with Partitioning Around Medoids (PAM). PAM is very similar to k-means, but offers a couple of advantages. First, PAM accepts a dissimilarity matrix, which allows the inclusion of mixed data. Second, it is more robust to outliers and skewed data because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances (Reynolds, 1992). I won’t bore you with the details as I cover them in detail in my book, Chapter 8 specifically.

Additional Data Preparation

The Gower metric/coefficient requires scaled data. I also want to include Type and Attack.on in the analysis. These need to be factors. Once those two tasks are done it is then a simple matter of putting them together for analysis.

quant <- scale(baghdad[,8:17]) #numeric variables
factors <- as.data.frame(lapply(baghdad[,c(3,7)] , factor))
dat <- cbind(quant, factors)

Clustering

This will consist of creating the Gower coefficient/dissimilarity matrix and then using that as the input for PAM. You need to specify the number of clusters (k=?). So how many in this case? After trial and error, I chose four. Why four you ask? What metric did you use to evaluate your clusters you ask? Great questions. Simple answer…it works well. In the real world my experience has been that the stakeholders of any clustering project, e.g. Medical Oncologist segmentation, need to agree on what is best for the business. When I presented this analysis at the Indianapolis R-User Group a couple of months ago, we agreed, to a person, that this was the right way to do it. I welcome any dissenters. Now to the code! By the way, we will need the cluster package for the gower metric and PAM.

library(cluster) 
dat.dis <- daisy(dat, metric = "gower")  
set.seed(123)
fit.dis <- pam(dat.dis, k = 4)

table(fit.dis$clustering)

## 
##   1   2   3   4 
##  57 107 263 281

dat$cluster <- as.factor(fit.dis$clustering)
baghdad$cluster <- as.factor(fit.dis$clustering) #for the map

What we are going to do now is take advantage of the compareGroups library. The first step is to create an object of the descriptive statistics by cluster with the compareGroups() function. Then, using createTable(), we will turn the statistics into an easy to export table as a .csv, .pdf, html or the LaTex format.

library(compareGroups) #build descriptive statistic tables
group <- compareGroups(cluster~., data = dat) 
clustab <- createTable(group, type = 1) #type 1 relative, 2 absolute and relative, 3 absolute
clustab

## 
## --------Summary descriptives table by 'cluster'---------
## 
## ____________________________________________________________________________________ 
##                              1            2            3           4       p.overall 
##                             N=57        N=107        N=263       N=281               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## CF.W                    0.00 (0.77)  -0.09 (0.54) 0.21 (1.54) -0.16 (0.00)  <0.001   
## CF.KIA                  0.82 (2.12)  -0.14 (0.37) 0.05 (1.16) -0.16 (0.37)  <0.001   
## I.W                     -0.07 (0.42) -0.03 (0.44) 0.22 (1.58) -0.18 (0.06)  <0.001   
## I.KIA                   -0.18 (0.80) 0.41 (0.97)  0.08 (1.37) -0.20 (0.42)  <0.001   
## Civ.W                   -0.13 (0.03) -0.13 (0.03) 0.23 (1.62) -0.14 (0.01)  <0.001   
## Civ.KIA                 -0.20 (0.12) -0.18 (0.18) 0.24 (1.60) -0.11 (0.19)  <0.001   
## EN.W                    0.01 (0.81)  0.02 (0.83)  0.02 (1.25) -0.03 (0.82)   0.937   
## EN.KIA                  0.33 (1.54)  0.02 (0.98)  0.18 (1.22) -0.24 (0.43)  <0.001   
## EN.Detained             0.16 (0.98)  0.02 (1.19)  0.03 (0.93) -0.07 (0.99)   0.360   
## Total.KIA               -0.15 (0.14) -0.12 (0.16) 0.26 (1.60) -0.16 (0.17)  <0.001   
## Type:                                                                       <0.001   
##     criminal event         0.00%        0.00%        0.00%        100%               
##     enemy action           0.00%        95.3%        0.00%       0.00%               
##     explosive hazard       0.00%        0.00%        99.6%       0.00%               
##     friendly action        31.6%        2.80%        0.00%       0.00%               
##     friendly fire          1.75%        1.87%        0.00%       0.00%               
##     non-combat event       54.4%        0.00%        0.00%       0.00%               
##     other                  12.3%        0.00%        0.00%       0.00%               
##     suspicious incident    0.00%        0.00%        0.38%       0.00%               
## Attack.on:                                                                  <0.001   
##     ENEMY                  0.00%        95.3%        100%         100%               
##     FRIEND                 33.3%        4.67%        0.00%       0.00%               
##     NEUTRAL                66.7%        0.00%        0.00%       0.00%               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

#export2csv(clustab, file="baghdad_clusters.csv") #allows view in a spreadsheet
#cluster 1 - Friendly Actions and Accidents
#cluster 2 - Attacks on Iraqi Forces
#cluster 3 - IEDs, VBIEDs, SVIEDs
#cluster 4 - Criminal Activity

To summarize, the 1st cluster is composed of friendly actions and what we could classify as “Accidents”. Cluster 2 focuses around attacks on Iraqi forces. Cluster 3 is primarily Improvised Explosive Devices (IEDs) and all their variants e.g. suicide vest, truck-borne etc. Notice that this is where the highest level of civilian casualties fall. This would include the horrible attacks on the 19th of August. https://en.wikipedia.org/wiki/August_2009_Baghdad_bombings. Finally cluster 4 is all about criminal activity. This is just one way to cluster this data. I hope to see others tackle it and present their results. Since I’m building this as a Markdown document, we can look at a kable table, this time at the non-scaled values (average).

library(dplyr)
df <- baghdad[,c(8:17,22)]
df2 <- group_by(df, cluster) 
df3 <- as.data.frame(summarise_each(df2, funs(mean)))
knitr::kable(df3, format = "markdown", digits = 2, align  = "c")

cluster	CF.W	CF.KIA	I.W	I.KIA	Civ.W	Civ.KIA	EN.W	EN.KIA	EN.Detained	Total.KIA
1	0.09	0.37	0.18	0.16	0.23	0.56	0.02	0.21	0.32	1.30
2	0.04	0.02	0.22	0.68	0.38	0.65	0.02	0.10	0.19	1.46
3	0.21	0.09	0.59	0.39	12.25	3.07	0.02	0.16	0.19	3.71
4	0.00	0.01	0.01	0.14	0.08	1.07	0.01	0.01	0.10	1.23

Visualizing the Clusters in a Geo-Spatial Setting

The first thing is to use the function get_googlemap() to access the google map API, bringing in a static map of the area we want. However, we will need this function wrapped by ggmap() function, which tells R to print it. We don’t want to print it yet, so we will save it as an object called “map”. It goes without saying that you need to be connected to the internet to access the API. The other items we need to specify is the zoom level and the map type. This code snippet accomplishes all of the above. We will focus on the area around the Al Dora Oil Refinery.

library(ggmap)
map <- ggmap(
  get_googlemap(
    center = "Dora, Iraq", 
    zoom = 13,
    maptype = "hybrid")) #also hybrid/satellite/roadmap/terrain

Points are added using geom_point(). We will include latitude, longitude, the variable to fill the point, the plotting character and character size.

map

map + geom_point(
  data = baghdad,
  aes(x = Longitude, y = Latitude, fill = cluster), 
  pch = 21,  
  size = 7)

The area in the upper right of the map, between the Dora Expressway and The Two Stories Bridge is where the refinery is located. It looks free from attacks in 2009. Also, either of the two approaches from forward bases in eastern Baghdad, which would be either the expressway or the Two Stories Bridge via the Karada Peninsula, appear to have been free of activity. This is no surprise because of the high concentration of police forces in and around the refinery and those two roads. However, west of the refinery and along Masafi Street is an area that could have proven hazardous to your health.

Interactive Map

This is where it should now get interesting with the introduction of an interactive map. This will be a quick introduction into just one of many ways to utilize the leaflet package.

library(leaflet)
library(car) #recode variables
#marker identification colors
pal <- colorFactor(c("lightseagreen","blue2","red1","green3"), domain = c("Friendly Forces","Iraqi Forces","IEDs","Criminal Activity"))
baghdad <- arrange(baghdad, cluster)
baghdad$detail <- paste(baghdad$DTG, baghdad$Title, sep = ", ")
str(baghdad)

## 'data.frame':    708 obs. of  23 variables:
##  $ Report.key : chr  "94A84E9C-D373-78D5-43F7AB5F7FE4A3E1" "9E875B1C-DCE6-DCFE-B2F62370C43000EE" "A9000774-423D-4561-5BC0B3C487DF23A6" "B21A450E-E635-26D8-C0783B3EEA6E3ECD" ...
##  $ DTG        : Date, format: "2009-01-01" "2009-01-03" ...
##  $ Type       : chr  "friendly fire" "other" "friendly action" "friendly action" ...
##  $ Category   : chr  "Green-White" "Other" "Escalation of Force" "Escalation of Force" ...
##  $ Title      : chr  "(FRIENDLY FIRE) GREEN-WHITE RPT   BAGHDAD POLICE OPS : 1 CIV KIA" "(OTHER) OTHER RPT   BAGHDAD POLICE OPS : 1 ISF KIA 2 ISF WIA" "(FRIENDLY ACTION) ESCALATION OF FORCE RPT   3/A 5-73 CAV : 1 CIV KIA" "(FRIENDLY ACTION) ESCALATION OF FORCE RPT   BAGHDAD OPS/NCC : 1 CIV KIA" ...
##  $ Region     : chr  "MND-BAGHDAD" "MND-BAGHDAD" "MND-BAGHDAD" "MND-BAGHDAD" ...
##  $ Attack.on  : chr  "FRIEND" "NEUTRAL" "FRIEND" "FRIEND" ...
##  $ CF.W       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CF.KIA     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ I.W        : int  0 2 0 0 0 0 0 0 2 2 ...
##  $ I.KIA      : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Civ.W      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Civ.KIA    : int  1 0 1 1 1 1 1 0 0 0 ...
##  $ EN.W       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ EN.KIA     : int  0 0 0 0 0 0 0 1 1 1 ...
##  $ EN.Detained: int  0 0 0 0 0 0 0 2 5 0 ...
##  $ Total.KIA  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Latitude   : num  33.4 33.4 33.3 33.3 33.3 ...
##  $ Longitude  : num  44.4 44.3 44.4 44.3 44.5 ...
##  $ year       : num  2009 2009 2009 2009 2009 ...
##  $ month      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cluster    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ detail     : chr  "2009-01-01, (FRIENDLY FIRE) GREEN-WHITE RPT   BAGHDAD POLICE OPS : 1 CIV KIA" "2009-01-03, (OTHER) OTHER RPT   BAGHDAD POLICE OPS : 1 ISF KIA 2 ISF WIA" "2009-01-05, (FRIENDLY ACTION) ESCALATION OF FORCE RPT   3/A 5-73 CAV : 1 CIV KIA" "2009-01-06, (FRIENDLY ACTION) ESCALATION OF FORCE RPT   BAGHDAD OPS/NCC : 1 CIV KIA" ...

baghdad$cluster <- recode(baghdad$cluster, '"1"="Friendly Forces";"2"="Iraqi Forces";"3"="IEDs"; "4"="Criminal Activity";')
m <- leaflet(baghdad) %>% 
  addProviderTiles("OpenStreetMap.HOT", group = "Street Map") %>%
  addProviderTiles("Esri.WorldImagery", group = "Image")
  #stationary markers
m %>% addCircleMarkers(~Longitude, ~Latitude, popup = ~detail, 
  #weight = 2, 
  radius = 10,
  color = ~pal(cluster),stroke = FALSE, #formula to link colors to row
  fillOpacity = 0.9) %>% 
  #interactive control
  addLayersControl(
   baseGroups = c("Street Map", "Image"),
   options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addLegend("bottomright",
  pal = pal, values = ~ cluster, #notice color determined above
  title = "Clusters", opacity = 0.9)

Wikileaks Data - Iraq

Cory Lesmeister

June 14, 2016