This analysis looks at data from all recorded attacks which took place in Iraq during the years 2000 - 2015.
#install.packages("grid")
#install.packages("ggmap")
#install.packages("mapproj")
#install.packages("viridis")
#install.packages("RColorBrewer")
Load libraries:
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(purrr)
library(tidyr)
library(magrittr)
library(ggmap)
library(mapproj)
library(viridis)
library(RColorBrewer)
library(grid)
setwd("C:/Users/Ana/Desktop/Data Analytics/CSV Files")
raw_data <- read_csv("Iraq_attacks_csv.csv")
Parsed with column specification:
cols(
.default = col_double(),
approxdate = col_character(),
resolution = col_character(),
country_txt = col_character(),
region_txt = col_character(),
provstate = col_character(),
city = col_character(),
location = col_character(),
summary = col_character(),
alternative_txt = col_character(),
attacktype1_txt = col_character(),
attacktype2_txt = col_character(),
attacktype3_txt = col_character(),
targtype1_txt = col_character(),
targsubtype1_txt = col_character(),
corp1 = col_character(),
target1 = col_character(),
natlty1_txt = col_character(),
targtype2_txt = col_character(),
targsubtype2_txt = col_character(),
corp2 = col_character()
# ... with 40 more columns
)
See spec(...) for full column specifications.
11 parsing failures.
row col expected actual file
4577 gsubname2 1/0/T/F/TRUE/FALSE Islamic Shiite Resistance in Iraq 'Iraq_attacks_csv.csv'
14853 gsubname2 1/0/T/F/TRUE/FALSE Intifadat Ahrar al-Iraq 'Iraq_attacks_csv.csv'
15018 gsubname2 1/0/T/F/TRUE/FALSE Intifadat Ahrar al-Iraq 'Iraq_attacks_csv.csv'
15414 gsubname2 1/0/T/F/TRUE/FALSE Intifadat Ahrar al-Iraq 'Iraq_attacks_csv.csv'
15609 gsubname2 1/0/T/F/TRUE/FALSE Intifadat Ahrar al-Iraq 'Iraq_attacks_csv.csv'
..... ......... .................. ................................. ......................
See problems(...) for more details.
#raw_data_syria <- read_csv("Syria_attacks_csv.csv")
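The parsing failures reported above are consistent with `read_csv` guessing the type of `gsubname2` from the first rows only: if the column is empty there, it is guessed as logical and later text values fail to parse. The following is a minimal sketch of declaring the column type up front. It uses a small inline CSV (hypothetical values) rather than the real file, so only the column name `gsubname2` and one group name come from the output above.

```r
library(readr)

# Toy CSV standing in for Iraq_attacks_csv.csv (hypothetical rows):
# gsubname2 is empty in the first rows and only later holds text.
csv_text <- "eventid,gsubname2\n1,\n2,\n3,Intifadat Ahrar al-Iraq\n"

# Declaring the column as character means no type has to be guessed for it.
toy <- read_csv(I(csv_text),
                col_types = cols(eventid = col_double(),
                                 gsubname2 = col_character()))

toy$gsubname2
```

For the real file, the same idea would probably look like `read_csv("Iraq_attacks_csv.csv", col_types = cols(.default = col_guess(), gsubname2 = col_character()))`, leaving every other column to be guessed as before.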
Data Cleaning
The first 10 rows of the raw data are shown below.
attack_data <- raw_data
head(attack_data, 10)
The data will be cleaned by carrying out the following steps:
1 - Count how many NA values are in each column. If more than 90% of a column's values are NA, the column should be considered for removal from the dataset.
2 - From looking at the first 10 rows, the author of the dataset sometimes puts a full stop ‘.’ in a cell where an NA value should be. Before counting the NA values, I will replace every ‘.’ cell in the dataframe with NA.
3 - Some columns contain codes for the text which is provided in a separate column. For example, the columns region and region_txt contain the same information, only region is the code for the full name provided in region_txt. Therefore, all columns which are number codes for adjacent text columns will be removed, as they duplicate information provided elsewhere.
4 - All columns which contain descriptions of items that are covered by data in other columns will be removed. Examples include location, since the longitude and latitude are provided in separate columns.
5 - String/text analysis will not be carried out in this piece of work, so columns containing descriptions, e.g. summary or scite, will be removed. Columns relating to media references will also be removed, as they do not form part of the required analysis.
6 - By this point there are 50 columns remaining, so now it’s time to look at each one individually. Most of them look like useful data, but a few have ambiguous titles such as crit1. I need to check with the data owner what data these columns contain and whether they are necessary for the analysis.
Ambiguous Columns:
-specificity
-vicinity
-crit1
-crit2
-crit3
-doubtterr
-alternative_txt
-multiple
-success (I presume this is if the attack achieved the terrorists’ aims)
-corp1
-ingroup
-guncertain1
-nperps
-nperpcap
-nkillus (US specific deaths?)
-nkillter
-property
-propextent_txt
-propvalue
-ishostkid
-INT_LOG
-INT_IDEO
-INT_MISC
-INT_ANY
-related
Having spoken to the data owner, it has been confirmed that these columns do not hold information relevant to the key questions, so they will be removed from the dataframe.
7 - By now the dataframe contains 25 columns. I will now check that all of them are necessary and whether any more are superfluous to the analysis we want to carry out.
These columns could potentially be removed:
-extended (contains 1 if the attack lasted more than 1 day, usually indicates hostage situation)
-claimmode_txt (method of notification of claim of attack)
-nwoundus (unsure)
-nwoundte (unsure)
I don’t think that, for now, the analysis will focus on extended versus non-extended attacks, although this could be reviewed later. For now, I’ll remove the column. approxdate will also be removed, as the year, month and day columns are populated. I am unsure what nwoundus and nwoundte contain, but we have the data for total wounded so, for now at least, these columns will be removed.
8 - Finally, we are left with 21 columns of data relevant to this analysis. Last step is to rename the columns to make them clearer.
All of these data cleaning steps are detailed in the code. The cleaned data set is provided below.
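As a compact illustration of steps 1 and 2, the sketch below replaces ‘.’ placeholders with NA and then drops columns whose NA share is over 90%. It is not the code used in this notebook: the toy data and the column name `sparse_col` are hypothetical, and it uses `across()` and `where()` from dplyr 1.0 or later.

```r
library(dplyr)

# Toy data standing in for the raw attack data (hypothetical values).
toy <- tibble(
  eventid    = 1:20,
  city       = rep(c("Baghdad", ".", "Mosul", "Basra"), 5),
  sparse_col = c(rep(NA, 19), "x")   # 95% NA
)

cleaned <- toy %>%
  # Step 2: turn '.' placeholders in text columns into NA
  mutate(across(where(is.character), ~ na_if(.x, "."))) %>%
  # Step 1: keep only columns whose share of NA values is at most 90%
  select(where(~ mean(is.na(.x)) <= 0.9))

colnames(cleaned)
```

Here `sparse_col` is dropped, while `city` is kept because only a quarter of its values become NA.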
#replace '.' values with NA
attack_data[attack_data == "."] <- NA
#head(attack_data, 10)
#flag the rows which contain at least one NA value
logic_na <- as.logical(rowSums(is.na(attack_data)))
#count the number of NA values in each column and output as dataframe
no_nas <- as.data.frame(colSums(is.na(attack_data)))
#build a dataframe of column names and NA counts (the counts stay numeric, so they are never converted to text or factors)
na_values <- data.frame(column_titles = colnames(attack_data), no_nas[, 1], stringsAsFactors = FALSE)
colnames(na_values)<- c("column_titles", "No. NA Values")
#na_values
#make sure the NA counts are stored as integers
na_values[2] <- lapply(na_values[2], as.integer)
#calculate 90% of total number of rows
pc_90 <- length(attack_data$eventid)*0.9
#pc_90
#filter the dataframe so that it only contains column names with over 90% NA values
na_values_90pc <- na_values %>%
filter(`No. NA Values` > pc_90)
#na_values_90pc
#create a vector containing the names of the columns which contain over 90% NA values
na_columns <- na_values_90pc$column_titles
#na_columns
#All columns which contain over 90% NA values will be removed from the dataframe for analysis. This corresponds to the 66 columns in the `na_columns` vector. The next step removes these columns.
reduced_data_1 <- attack_data %>%
select(-all_of(na_columns))
#head(reduced_data_1, 10)
#Some columns contain codes for the text which is provided in a separate column. For example, the columns `region` and `region_txt` contain the same information, only `region` is the code for the full name provided in `region_txt`. Therefore, all columns which are number codes for adjacent text columns will be removed, as they duplicate information provided elsewhere.
column_names <- colnames(reduced_data_1)
#column_names
reduced_data_2 <- reduced_data_1 %>%
select(-country, -region, -alternative, -attacktype1, -targtype1, -targsubtype1, -natlty1, -weaptype1, -weapsubtype1, -propextent)
#reduced_data_2
#All columns which contain descriptions of items that are covered by data in other columns will be removed. Examples include `location`, since the longitude and latitude are provided in separate columns.
#Additionally, string/text analysis will not be carried out in this piece of work, so columns containing descriptions, e.g. `summary` or `scite`, will be removed. Columns relating to media references are removed as well.
column_names_2 <- colnames(reduced_data_2)
#column_names_2
reduced_data_3 <- reduced_data_2 %>%
select(-location, -summary, -target1, -motive, -weapdetail, -propcomment, -addnotes, -scite1, -scite2, -scite3, -dbsource)
#head(reduced_data_3, 10)
#Getting there, but there are still 50 columns. Looking through these columns, most of them look like useful data. But there are a few columns which have ambiguous titles such as `crit1`. I need to check with the data owner what data these columns contain and see if they are necessary for the analysis. Data owner confirmed removal.
reduced_data_4 <- reduced_data_3 %>%
select(-specificity, -vicinity, -crit1, -crit2, -crit3, -doubtterr, -alternative_txt, -multiple, -success, -corp1, -ingroup, -guncertain1, -nperps, -nperpcap, -nkillus, -nkillter, -property, -propextent_txt, -propvalue, -ishostkid, -INT_LOG, -INT_IDEO, -INT_MISC, -INT_ANY, -related)
#head(reduced_data_4, 10)
#remove further columns which are not pertinent to the analysis.
reduced_data_5 <- reduced_data_4 %>%
select(-extended, -approxdate, -nwoundus, -nwoundte)
#head(reduced_data_5, 10)
#We are left with 21 columns of data relevant to this analysis. Next, rename the columns to make them more logical.
new_names <- c("eventid", "iyear", "imonth", "iday", "country", "region", "provstate", "city", "latitude", "longitude", "suicide", "attack_type_1", "target_type_1", "target_sub_type_1", "nationality", "group_name", "claimed", "weapon", "weapon_sub_type", "n_kill", "n_wounded")
colnames(reduced_data_5) <- new_names
head(reduced_data_5, 10)
Next, let’s look at the number of NA entries in the dataframe.
logic_na <- as.logical(rowSums(is.na(reduced_data_5)))
na_rows <- reduced_data_5[logic_na,]
df_na <- map_df(na_rows, function(x) as.numeric(is.na(x)))
df_na_heat <- df_na %>%
pivot_longer(cols = everything(),
names_to = "x") %>%
group_by(x) %>%
mutate(y = row_number())
plot_na_matrix <- function(df_na) {
# Preparing the dataframe for heatmaps
df_heat <- df_na %>%
pivot_longer(cols = everything(),
names_to = "x") %>%
group_by(x) %>%
mutate(y = row_number())
# Ensuring the order of columns is kept as it is
df_heat <- df_heat %>%
ungroup() %>%
mutate(x = factor(x,levels = colnames(df_na)))
# Plotting data
g <- ggplot(data = df_heat, aes(x=x, y=y, fill=value)) +
geom_tile() +
theme(legend.position = "none",
axis.title.y=element_blank(),
axis.text.y =element_blank(),
axis.ticks.y=element_blank(),
axis.title.x=element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))
# Returning the plot
g
}
plot_na_matrix(df_na)

From this plot, you can see that the cleaned dataset still contains some NA values. This is because only columns with over 90% NA values were removed. There are very few NA values in the city, longitude and latitude columns. The column target_sub_type_1 has quite a few NA values, presumably because there was only one target. The column weapon_sub_type has a number of NA values, presumably relating to attacks that only used one weapon, i.e. they didn’t have a sub-weapon. The columns n_kill and n_wounded also have a number of NA values. It would be good to know whether these NA values should be treated as 0 or as ‘unknown’.
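The choice between treating NA as 0 or as ‘unknown’ matters for some summaries and not for others. The sketch below uses made-up fatality counts (not values from the dataset) to show that the total is the same either way, but the average per attack is not.

```r
# Hypothetical fatality counts for five attacks; two are unrecorded (NA).
n_kill <- c(3, NA, 0, 5, NA)

# Treating NA as "unknown": unrecorded attacks are left out of the average.
mean_unknown <- mean(n_kill, na.rm = TRUE)

# Treating NA as zero: unrecorded attacks count as attacks with no deaths.
n_kill_zero <- ifelse(is.na(n_kill), 0, n_kill)
mean_zero <- mean(n_kill_zero)

# The totals agree, the averages do not.
c(total_unknown = sum(n_kill, na.rm = TRUE), total_zero = sum(n_kill_zero))
c(mean_unknown = mean_unknown, mean_zero = mean_zero)
```

With these numbers the total is 8 in both cases, while the average is about 2.67 when NA means ‘unknown’ and 1.6 when NA means 0.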
no_nas <- as.data.frame(colSums(is.na(na_rows)))
colnames(no_nas) <- "No. of NA Values"
#no_nas
#length(reduced_data_5$eventid)
Data Analysis Questions
From looking at the data, the following analysis will be carried out:
- Analysis of Type of Attack
1a - Number of attacks per year
1b - Analyse trends in types of attacks
1c - Analyse the number of suicide attacks per year.
- Analysis of Attack Organisation
2a - Number of attacks per group, per year
2b - Plot number of attacks for each group based on geography. Are some groups spreading geographically? Are some groups condensing geographically?
Analysis of Type of Target
Analysis of Weapon Usage
Analysis of Geography of attacks
Question 1a: Number of Attacks Vs Year
Track the number of attacks over the years 2000 - 2015.
attack_data_2 <- reduced_data_5
q1a_data <- attack_data_2 %>%
group_by(iyear) %>%
summarise(no_attacks = n())
#q1a_data
ggplot(data = q1a_data,
aes(x = iyear, y = no_attacks))+
geom_line(lwd = 1.5, color = "blue4") +
labs(title = "No. of Attacks vs Year in Iraq", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white"))

#Filter the data so that it contains only 2015 attacks and check that all months are present in the data.
q1a_data_checkmonth <- attack_data_2 %>%
group_by(iyear, imonth) %>%
summarise(no_attacks = n()) %>%
filter(iyear == 2015)
#q1a_data_checkmonth
Attacks in Iraq increased between 2000 and 2014, with a very large increase between 2012 and 2014. However, the number of attacks in 2015 dropped from the 2014 level. The data has been checked to ensure that it covers all months of 2015, which it does, so this decrease reflects a genuine reduction in recorded attacks in 2015 compared to 2014.
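A month-coverage check of this kind can also be summarised in one table per year. The sketch below is an illustration on toy data, not the notebook's own check: the toy 2015 deliberately stops in September to show what an incomplete year would look like, which is not the case in the real data.

```r
library(dplyr)

# Toy data (hypothetical): 2014 has all 12 months, 2015 stops in September.
toy <- tibble(
  iyear  = c(rep(2014, 12), rep(2015, 9)),
  imonth = c(1:12, 1:9)
)

month_coverage <- toy %>%
  group_by(iyear) %>%
  summarise(months_present = n_distinct(imonth),
            # list the calendar months with no recorded attack
            months_missing = paste(setdiff(1:12, imonth), collapse = ", "))

month_coverage
```

A complete year shows 12 months present and an empty `months_missing` entry.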
Question 1b: Analyse trends in types of attacks
q1b_data <- attack_data_2 %>%
group_by(iyear, attack_type_1) %>%
summarise(no_attacks = n())
#q1b_data
ggplot(data = q1b_data,
aes(x = iyear, y = no_attacks, color = attack_type_1)) +
facet_wrap(~attack_type_1) +
geom_line(lwd = 1, color = "blue4") +
labs(title = "No. of Attacks vs Year in Iraq, per Attack Type", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white"))

From these line graphs, you can see that there has been a vast increase in bombing/explosion attacks between 2000 and 2015. Additionally, there has been an increase (albeit smaller) in armed assault. Assassination attacks have remained constant. It appears there has been a small increase in hostage taking in 2014 and 2015.
#This next section looked at whether there was any seasonal trend. There was insufficient evidence for this, so it has been left out of the analysis.
q1c_data <- attack_data_2 %>%
group_by(imonth) %>%
summarise(no_attacks = n())
#q1c_data
#ggplot(data = q1c_data,
# aes(x = imonth, y = no_attacks)) +
# geom_line(lwd = 1.5, color = "blue") +
# labs(title = "No. of Attacks vs Month in Iraq", x = "Month", y = "No. of Attacks") +
# theme(panel.background = element_rect(fill = "white")) +
# scale_x_discrete(limits=month.abb)
#This data is taken across all years. Overall, September is the month with the lowest average number of attacks, whereas November has the highest average number of attacks. Let's repeat this graph and plot the years out individually and see whether the trends are seen in each year.
q1ci_data <- attack_data_2 %>%
mutate(iyear = as.character(iyear)) %>%
group_by(iyear, imonth) %>%
summarise(no_attacks = n())
year_check <- q1ci_data %>%
group_by(iyear) %>%
summarise(no_months_in_year_active = n())
#year_check
q1ci_data <- q1ci_data %>%
filter(iyear > 2003)
#q1ci_data
#ggplot(data = q1ci_data,
# aes(x = imonth, y = no_attacks, color = iyear, lty = iyear)) +
# geom_line() +
# labs(title = "No. of Attacks vs Month in Iraq", x = "Month", y = "No. of Attacks") +
# theme(panel.background = element_rect(fill = "white")) +
# scale_x_discrete(limits=month.abb) +
# geom_text(data = subset(q1ci_data, imonth == 12), aes(label = iyear, colour = iyear, x = 12, y = no_attacks), hjust = -.1) +
# theme(legend.position="none")
#From plotting the monthly number of attacks for each year, you can see that the main trends seen in the previous graph are actually down to the years 2013 and 2014 which have a large number of attacks and variation. Therefore, overall it does not seem likely that there is a monthly pattern of attacks.
Question 1c: Analyse the number of suicide attacks
q1d_data <- attack_data_2 %>%
group_by(iyear, suicide) %>%
summarise(no_attacks = n())
#q1d_data
ggplot(data = q1d_data,
aes(x = iyear, y = no_attacks, color = as.character(suicide), lty = as.character(suicide))) +
geom_line(lwd = 1) +
labs(title = "Attacks and Suicide Attacks per Year in Iraq", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white")) +
#theme(legend.title = element_blank()) +
scale_color_discrete(name ="Attack Type",
breaks=c("0", "1"),
labels=c("Not Suicide", "Suicide")) +
scale_linetype_discrete(name ="Attack Type",
breaks=c("0", "1"),
labels=c("Not Suicide", "Suicide"))

#q1d2_data <- q1d_data %>%
# pivot_wider(names_from = suicide, values_from = no_attacks)
#q1d2_data[is.na(q1d2_data)] = 0
The graph above shows that while there has been a large increase in the number of attacks from 2000 to 2015, the number of suicide attacks has not increased to the same extent. In fact, between 2008 and 2012, suicide attacks decreased whilst the overall number of attacks increased.
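The comparison between suicide attacks and all attacks can be made explicit by computing the share of suicide attacks per year. This is a sketch on toy data with made-up counts; it assumes, as in the plot labels above, that `suicide` is coded 1 for suicide attacks and 0 otherwise.

```r
library(dplyr)

# Toy data (hypothetical counts): 4 attacks in 2008, 8 attacks in 2012.
toy <- tibble(
  iyear   = c(rep(2008, 4), rep(2012, 8)),
  suicide = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)

suicide_share <- toy %>%
  group_by(iyear) %>%
  summarise(no_attacks  = n(),
            no_suicide  = sum(suicide == 1),
            pct_suicide = round(100 * mean(suicide == 1), 1))

suicide_share
```

A falling `pct_suicide` alongside a rising `no_attacks` is the pattern described in the paragraph above.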
Question 2a: Analysis of Attack Organisation
q2ai_data <- attack_data_2 %>%
group_by(iyear, group_name) %>%
summarise(no_attack = n()) %>%
pivot_wider(names_from = iyear, values_from = no_attack) %>%
#replace_na(list(`2000` = 0)) %>%
replace_na(list(`2000` = 0, `2001` = 0, `2002` = 0, `2003` = 0, `2004` = 0, `2005` = 0, `2006` = 0, `2007` = 0, `2008` = 0, `2009` = 0, `2010` = 0, `2011` = 0, `2012` = 0, `2013` = 0, `2014` = 0, `2015` = 0)) %>%
mutate(total_all_yrs = rowSums(select(., 2:17))) %>%
mutate(sum_2015 = rowSums(select(., 17)))
q2ai_data
This table shows the different terrorist groups in Iraq and the corresponding number of (known) attacks carried out between 2000 and 2015. There are 69 different group names in the data. For the purpose of this analysis, we will interrogate the data of groups which meet this criterion:
[1] Have carried out more than 20 attacks in total (across all years)
This removes small perpetrator groups from the analysis.
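The group-by-year table and its totals can also be built without listing every year by hand. The sketch below is an alternative shown on toy data with hypothetical group names; it relies on the `values_fill` argument of `pivot_wider()` to put 0 where a group has no attacks in a year.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical group names and years), one row per attack.
toy <- tibble(
  iyear      = c(2013, 2013, 2014, 2014, 2014, 2015),
  group_name = c("Group A", "Group B", "Group A", "Group A", "Group B", "Group A")
)

group_totals <- toy %>%
  count(group_name, iyear, name = "no_attack") %>%
  # values_fill = 0 fills group/year combinations that have no attacks
  pivot_wider(names_from = iyear, values_from = no_attack, values_fill = 0) %>%
  # sum across all the year columns
  mutate(total_all_yrs = rowSums(across(where(is.numeric))))

group_totals
```

The threshold described above would then be applied with `filter(total_all_yrs > 20)`.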
q2ai_data <- q2ai_data %>%
filter(total_all_yrs > 20)
#q2ai_data[,1]
By filtering out groups which have carried out 20 or fewer attacks in total, 10 groups are left. Unfortunately, one of these groups is ‘Unknown’. As can be seen from the data, for the large majority of attacks the perpetrators are not identified and these attacks are recorded as ‘Unknown’. There are also 2 other ambiguous group names, “Other” and “Gunmen”. Both have been allocated more than 20 attacks, however these attacks do not occur in the last 3 years. Given that these are not likely to be legitimate terrorist group names, they will be removed from the data set for analysis.
After this, there are 8 groups of interest for analysis. The activities of these groups will be analysed in more detail.
q2ai_data <- q2ai_data %>%
filter(group_name != "Other", group_name != "Gunmen")
#q2ai_data
#The 8 remaining groups of interest will be assigned to a vector `key_groups`
key_groups <- q2ai_data$group_name
#key_groups
q2a_data <- attack_data_2 %>%
filter(group_name %in% key_groups) %>%
group_by(iyear, group_name) %>%
summarise(no_attacks = n())
#q2a_data
g <- ggplot(data = q2a_data,
aes(x = iyear, y = no_attacks, color = group_name)) +
geom_line(lwd = 1.5) +
labs(title = "No. of Attacks per Group per year", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white"))
#geom_text(data = subset(q2a_data_iraq, iyear == 2013), aes(label = group_name, colour = group_name, x = 2013, y = no_attacks), hjust = -.1) +
#theme(legend.position="none")
gt <- ggplotGrob(g)
gt$layout$clip[gt$layout$name == "panel"] <- "off"
grid.draw(gt)
q2a_data_2 <- q2a_data %>%
group_by(group_name) %>%
summarise(no_attacks = sum(no_attacks))
#q2a_data_2
ggplot(data = q2a_data_2,
aes(x = group_name, y = no_attacks, fill = group_name)) +
geom_bar(stat = 'identity') +
labs(title = "No. of Attacks per Group", x = "Group Name", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white")) +
theme(axis.text.x = element_text(angle=60, hjust=1))+
geom_text(aes(label = no_attacks, y = no_attacks+1000)) +
theme(legend.position="none")


The bar chart shows the terrorist groups and the total number of attacks committed by each from 2000 to 2015. It is worth noting that the group_name for the vast majority of attacks is unknown. This could be because there are unknown groups active who do not claim attacks, or, perhaps more likely, because the attacks are carried out by known groups but are not claimed or cannot be attributed.
Of the groups that are identified, ISIL has by far carried out the most attacks, followed by Al-Qaida in Iraq, then ISI.
However, when looking at the line graph of number of attacks per year by different groups, it shows that the organisations are not consistent throughout the years. ISI is active between the years 2007 and 2010, Al-Qaida in Iraq is most active between 2011 and 2013 and after that, ISIL is by far the dominant organisation.
Do members of these groups switch allegiance depending on which group is most ‘popular’ at the time? i.e. are the attacks carried out by largely the same population of people regardless of the group name?
Or are these groups made up of categorically different people with different beliefs and motivations?
i.e. is Al-Qaida largely defeated or have the perpetrators simply jumped ship to ISIL?
Further work: Look at how many of the attacks are ‘claimed’ and is that the main way that groups are identified or is the group_name identified by other means?
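One possible starting point for this further work is the share of attacks per group that were claimed. The sketch below uses toy data with hypothetical group names, and it assumes that the `claimed` column is coded 1 when responsibility was claimed and 0 when it was not; that coding should be confirmed with the data owner before relying on it.

```r
library(dplyr)

# Toy data (hypothetical): claimed is assumed to be 1 = claimed, 0 = not claimed.
toy <- tibble(
  group_name = c("Group A", "Group A", "Group A", "Group A", "Unknown", "Unknown"),
  claimed    = c(1, 1, 0, 1, 0, 0)
)

claimed_share <- toy %>%
  group_by(group_name) %>%
  summarise(no_attacks  = n(),
            pct_claimed = round(100 * mean(claimed == 1, na.rm = TRUE), 1))

claimed_share
```

A group whose attacks are almost never claimed must have been identified by other means, which is the question raised above.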
Question 3a: Analysis of Target Type
The following graphs show the number of attacks on each type of target. The first set of graphs show attacks by all terrorist groups. The second shows just the attacks by ISIL as the data shows that this is the most active, current group.
q3a_data <- attack_data_2 %>%
group_by(iyear, target_type_1) %>%
summarise(no_attacks = n())
#q3a_data
ggplot(data = q3a_data,
aes(x = iyear, y = no_attacks))+
facet_wrap(~target_type_1)+
geom_line(lwd = 1, color = "blue4") +
labs(title = "No. of Attacks per Year in Iraq, split by Target", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white"))

q3a_data <- attack_data_2 %>%
filter(group_name == "Islamic State of Iraq and the Levant (ISIL)") %>%
group_by(iyear, target_type_1) %>%
summarise(no_attacks = n())
ggplot(data = q3a_data,
aes(x = iyear, y = no_attacks))+
facet_wrap(~target_type_1)+
geom_line(lwd = 1, color = "blue4") +
labs(title = "No. of Attacks by ISIL per year in Iraq, split by Target", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white"))

The number of attacks in Iraq has generally increased from 2000 to 2015. This is driven largely by attacks on private citizens and property, although there has also been an increase in attacks on the military and police. Attacks on businesses have increased too. The data shows that ISIL tends to attack Private Citizens & Property, Military and Police. Attacks on the Military by ISIL have increased year on year, which would suggest that this was a particular focus of the group.
#This next section of code looked at the nationality of victims. The vast majority of victims were Iraqi nationals. No further investigation will be done on victim nationality.
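A claim such as "increased year on year" can be checked directly from the yearly counts with `lag()`. The sketch below uses made-up counts for a single group and target type, so the numbers are illustrative only.

```r
library(dplyr)

# Toy yearly counts (hypothetical numbers) for one group and one target type.
toy <- tibble(
  iyear      = c(2013, 2014, 2015),
  no_attacks = c(10, 25, 40)
)

yoy <- toy %>%
  arrange(iyear) %>%
  # difference from the previous year; the first year has no predecessor (NA)
  mutate(change    = no_attacks - lag(no_attacks),
         increased = change > 0)

# TRUE only if every year with a predecessor shows an increase
all(yoy$increased, na.rm = TRUE)
```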
q3b_data <- attack_data_2 %>%
group_by(nationality) %>%
summarise(no_attacks = n()) %>%
filter(no_attacks > 5)
#q3b_data
#ggplot(data = q3b_data,
# aes(x = nationality, y = no_attacks, fill = nationality))+
# geom_bar(stat = 'identity') +
# theme(axis.text.x = element_text(angle=90, hjust=1)) +
# labs(title = "Nationality of Attack Victims in Iraq - All Yrs (no. attacks > 5)", x = "Nationality", y = "No. of Attacks") +
# theme(panel.background = element_rect(fill = "white"))
Question 4: Analysis on Weapon Use
q4a_data <- attack_data_2 %>%
group_by(weapon) %>%
summarise(no_attacks = n(), no_killed_real = sum(n_kill, na.rm = TRUE)) %>%
#summarise(no_killed = mean(claimed))
mutate(weapon = str_replace(weapon, "Explosives/Bombs/Dynamite", "Explosives")) %>%
mutate(weapon = str_replace(weapon, "not to include vehicle-borne explosives, i.e., car or truck bombs", "excl explosives"))
#q4a_data
pct<- round(100*q4a_data$no_attacks/sum(q4a_data$no_attacks), 1)
ggplot(data = q4a_data,
aes(x = weapon, y = no_attacks, fill = weapon))+
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle=50, hjust=1)) +
labs(title = "No of Attacks by Weapon Type - All Years", x = "Weapon", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white")) +
theme(legend.position = "none") +
geom_text(aes(label = no_attacks, y = no_attacks+500))

The vast majority of attacks are carried out using explosives, which account for 76% of all attacks (across all years). Attacks using firearms account for a further 20%.
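Percentages like these come from dividing each weapon's attack count by the total. The sketch below shows the calculation on toy data; the counts are made up and chosen only to give round numbers, and "Melee" is simply a placeholder third category.

```r
library(dplyr)

# Toy data (hypothetical counts), one row per attack.
toy <- tibble(
  weapon = c(rep("Explosives/Bombs/Dynamite", 15), rep("Firearms", 4), "Melee")
)

weapon_share <- toy %>%
  count(weapon, name = "no_attacks") %>%
  # share of all attacks, in percent, rounded to one decimal place
  mutate(pct = round(100 * no_attacks / sum(no_attacks), 1)) %>%
  arrange(desc(pct))

weapon_share
```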
As these are by far the methods used most for carrying out attacks, these two weapon types will be explored in more detail. Firstly how are the numbers of these attack types changing over the years?
q4b_data <- attack_data_2 %>%
group_by(iyear, weapon) %>%
summarise(no_attacks = n()) %>%
filter(weapon == "Explosives/Bombs/Dynamite" | weapon == "Firearms") %>%
pivot_wider(names_from = weapon, values_from = no_attacks) %>%
replace_na(list(Firearms = 0)) %>%
mutate(total = `Explosives/Bombs/Dynamite` + `Firearms`) %>%
mutate(expl_pct = round(`Explosives/Bombs/Dynamite`/total*100, 0)) %>%
pivot_longer(cols = c(`Explosives/Bombs/Dynamite`, `Firearms`), names_to = "weapon", values_to = "no_attacks")
#q4b_data
#ggplot(data = q4b_data,
# aes(x = iyear, y = no_attacks, fill = weapon))+
# geom_bar(stat = 'identity')+
# labs(title = "No. of Explosives and Firearms Attacks per Year (% = Explosive Attacks)", x = "Year", y = "No. of Attacks") +
# theme(panel.background = element_rect(fill = "white")) +
# geom_text(aes(label = paste(expl_pct,"%"), y = total+0.5))
ggplot(data = q4b_data, aes(x=iyear)) +
geom_line( aes(y=no_attacks, color = weapon), size=2) +
# geom_line( aes(y=expl_pct*40), lty = 2, size=0.5) +
#scale_y_continuous(name = "No. of attacks", sec.axis = sec_axis(~ . /40, name="% of Attacks that use Explosives")
#) +
#ggtitle("No. of Explosive and Firearms attacks per Year - Iraq")+
labs(title = "No. of Explosive and Firearms attacks per Year - Iraq", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white"))

As Explosives/Bombs/Dynamite account for such a large proportion of all attacks, it will be useful to understand more about the sub-type of weapon and if there are any trends in what attackers are using.
q4c_data <- attack_data_2 %>%
filter(weapon == "Explosives/Bombs/Dynamite") %>%
group_by(iyear, weapon_sub_type) %>%
summarise(no_attacks = n()) %>%
filter(weapon_sub_type != "Unknown Explosive Type") %>%
filter(weapon_sub_type != "Other Explosive Device")
#q4c_data
ggplot(data = q4c_data,
aes(x = iyear, y = no_attacks, lty = weapon_sub_type, color = weapon_sub_type), size = 2)+
geom_line(lwd = 1.25)+
# scale_colour_brewer("Weapon Sub-Type", palette="Set1") +
labs(title = "Types of Explosives Attacks - 2000 to 2015", x = "Year", y = "No. of Attacks") +
theme(panel.background = element_rect(fill = "white"))

There has been a large increase in vehicle explosives, i.e. car bombs. In fact, in recent years, these account for the majority of explosives attacks. Additionally, projectiles have been used more in recent years. Sticky bombs were used more around 2010-2011, but incidents using sticky bombs have declined in recent years. Overall, the two main explosives used are vehicle bombs and projectiles.
Question 5: Analysis on Geography of Attacks
q5a_data <- attack_data_2 %>%
select(iyear, latitude, longitude, group_name) %>%
filter(group_name %in% key_groups) %>%
filter(longitude < 50 & latitude < 39) %>% #removing 3 rogue locations outside of Iraq
mutate(iyear = as.integer(iyear)) %>%
mutate(group_name = str_replace(group_name, "Islamic State of Iraq \\(ISI\\)", "ISI")) %>%
mutate(group_name = str_replace(group_name, "Islamic State of Iraq and the Levant \\(ISIL\\)", "ISIL"))
q5a_data
fileName <- 'google_api.txt'
code <- readChar(fileName, file.info(fileName)$size)
#register_google(key = "XXX")
register_google(key = code)
#get_map("Iraq", zoom = 6) %>% ggmap()
#iraq <- c(left = 38.5, bottom = 29, right = 48.5, top = 37.5)
#get_stamenmap(iraq, zoom = 6) %>% ggmap()
qmplot(longitude, latitude, data = q5a_data, maptype = "toner-lite", darken = 0, color = group_name) + scale_colour_brewer("Group Name", palette="Set1")+
labs(title = "Attack Locations from 2000 to 2015")
Using zoom = 7...

#qmplot(longitude, latitude, data = q5a_data, maptype = "toner-lite", darken = 0, color = group_name) + scale_colour_brewer("Group Name", palette="Set1")+
# facet_wrap(~group_name)
This plot shows the locations of all attacks carried out by the 8 key groups, across all years.
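The data for this plot was restricted to Iraq by filtering on maximum longitude and latitude only. A stricter alternative, sketched below on toy coordinates, is to keep only points inside a full bounding box. The box values are taken from the commented-out `iraq` vector in the mapping code; the coordinates themselves are hypothetical.

```r
library(dplyr)

# Bounding box for Iraq, as in the commented-out `iraq` vector above.
iraq <- c(left = 38.5, bottom = 29, right = 48.5, top = 37.5)

# Toy coordinates (hypothetical): row 3 lies east of the box, row 4 has no longitude.
toy <- tibble(
  longitude = c(44.4, 43.1, 51.4, NA),
  latitude  = c(33.3, 36.3, 35.7, 33.0)
)

in_iraq <- toy %>%
  filter(between(longitude, iraq[["left"]], iraq[["right"]]),
         between(latitude, iraq[["bottom"]], iraq[["top"]]))

in_iraq
```

Rows with missing coordinates are dropped as well, since `filter()` discards rows where the condition is NA.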
qmplot(longitude, latitude, data = q5a_data, maptype = "toner-lite", darken = 0, color = -iyear) +
scale_color_viridis(option = "C")+
facet_wrap(~group_name) +
#scale_color_manual(values = brewer.pal(n=16, name = "RdBu"))+
labs(title = "Attack Locations by Group, 2000 to 2015") +
theme(legend.title = element_blank())
Using zoom = 7...

#ggsave("mapiraq.png")
This plot shows the attacks per group, colour coded for year. Yellow markers show attacks that occurred in the early 2000s. Dark blue attacks are recent.
From this you can see that the group Tawhid and Jihad is no longer active in carrying out attacks. ISI attacks were carried out predominantly between the years 2005 and 2010, whilst ISIL attacks are shown to be from 2010 onwards. This makes sense because ISI largely became ISIL.
The Al-Naqshabandiya Army appears to be a group carrying out attacks in Central Iraq since 2010. This is therefore a relatively new group. Al-Qaida in Iraq is a long-standing group which has carried out attacks in Iraq from the year 2000 and still carries out attacks in 2015. Ansar al-Islam is another long-standing group carrying out attacks across all years, however it has carried out far fewer attacks than Al-Qaida in Iraq. The group Muslim Fundamentalists has carried out attacks in recent years, mostly in central Iraq.
Unfortunately, the great majority of the attacks do not have an identified group.
---
title: "Analysis of Terrorist Attacks in Iraq, 2000 - 2015"
output: html_notebook
---
This analysis looks at data from all recorded attacks which took place in Iraq during the years 2000 - 2015.

```{r}
#install.packages("grid")
#install.packages("ggmap")
#install.packages("mapproj")
#install.packages("viridis")
#install.packages("RColorBrewer")
```

Load libraries:
```{r}
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(purrr)
library(tidyr)
library(magrittr)
library(ggmap)
library(mapproj)
library(viridis)
library(RColorBrewer)
library(grid)

setwd("C:/Users/Ana/Desktop/Data Analytics/CSV Files")
raw_data <- read_csv("Iraq_attacks_csv.csv")
#raw_data_syria <- read_csv("Syria_attacks_csv.csv")
```
```{r}
 
```
<font size="5"> Data Cleaning </font>

The first 10 rows of the raw data are shown below. 

```{r}
attack_data <- raw_data
head(attack_data, 10)

```
The data will be cleaned by carrying out the following steps:

1 - Count how many NA values are in each column. If more than 90% of a column's values are NA, the column should be considered for removal from the dataset.

2 - From looking at the first 10 rows, the author of the dataset sometimes puts a full stop '.' in a cell where an NA value should be. Before counting the NA values, I will replace every '.' cell in the dataframe with NA.

3 - Some columns contain codes for the text which is provided in a separate column. For example, the columns `region` and `region_txt` contain the same information, only `region` is the code for the full name provided in `region_txt`. Therefore, all columns which are number codes for adjacent text columns will be removed as these columns double count information provided elsewhere. 

4 - All columns which contain descriptions of items that are covered by data in other columns will be removed. Examples include `location`, since the longitude and latitude are provided in separate columns.

5 - String/text analysis will not be carried out in this piece of work, so columns containing descriptions, e.g. `summary` or `scite`, will be removed. Columns relating to media references will also be removed, as they do not form part of the required analysis.

6 - By this point there are 50 columns remaining, so now it's time to look at each one individually. Most of them look like useful data, but a few have ambiguous titles such as `crit1`. I need to check with the data owner what data these columns contain and whether they are necessary for the analysis.

Ambiguous columns:

- specificity
- vicinity
- crit1
- crit2
- crit3
- doubtterr
- alternative_txt
- multiple
- success (I presume this is if the attack achieved the terrorists' aims)
- corp1
- ingroup
- guncertain1
- nperps
- nperpcap
- nkillus (US specific deaths?)
- nkillter
- property
- propextent_txt
- propvalue
- ishostkid
- INT_LOG
- INT_IDEO
- INT_MISC
- INT_ANY
- related

Having spoken to the data owner, it has been confirmed that these columns do not hold information relevant to the key questions, so they will be removed from the dataframe.

7 - By now the dataframe contains 25 columns. I will now look at these columns to check that all of them are necessary and whether any more are superfluous to the analysis we want to carry out.

These columns could potentially be removed:

- extended (contains 1 if the attack lasted more than 1 day, usually indicates hostage situation)
- claimmode_txt (method of notification of claim of attack)
- nwoundus (unsure)
- nwoundte (unsure)

I don't think that, for now, the analysis will focus on whether attacks were extended or not, although this is something that could be reviewed later. For now, I'll remove the `extended` column.
`approxdate` will also be removed, as the year, month and day columns are populated.
The meaning of `nwoundus` and `nwoundte` is unclear, but the total number wounded is available, so these columns will be removed, for now at least.

8 - Finally, we are left with 21 columns of data relevant to this analysis. The last step is to rename the columns to make them clearer.

All of these data cleaning steps are detailed in the code. The cleaned data set is provided below. 


```{r}

#replace '.' values with NA

attack_data[attack_data == "."] <- NA
#head(attack_data, 10)

#flag the rows which contain at least one NA value (TRUE = row has an NA)

logic_na <- as.logical(rowSums(is.na(attack_data)))

#count the number of NA values in each column and output as dataframe

no_nas <- as.data.frame(colSums(is.na(attack_data)))

#build a dataframe of column names and NA counts
#(built directly rather than via cbind(), which would coerce the counts to character)

na_values <- data.frame(column_titles = colnames(attack_data),
                        `No. NA Values` = as.integer(no_nas[,1]),
                        check.names = FALSE,
                        stringsAsFactors = FALSE)
#na_values

#calculate 90% of total number of rows

pc_90 <- length(attack_data$eventid)*0.9
#pc_90

#filter the dataframe so that it only contains column names with over 90% NA values

na_values_90pc <- na_values %>%
  filter(`No. NA Values` > pc_90)
#na_values_90pc

#create a vector containing the names of the columns which contain over 90% NA values

na_columns <- na_values_90pc$column_titles
#na_columns
```

```{r}
#All columns which contain over 90% NA values will be removed from the dataframe for analysis. This corresponds to 66 columns in the `na_columns` vector. Next to remove these columns.

reduced_data_1 <- attack_data %>%
  select(-all_of(na_columns))

#head(reduced_data_1, 10)
```
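
As a compact cross-check of the 90% rule, the share of NA values per column can also be computed directly with `colMeans()`. The sketch below runs on a small made-up data frame (`toy`, with hypothetical columns), not on the attack data, and only illustrates the thresholding logic.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  eventid   = 1:10,
  mostly_na = c(1, 2, rep(NA, 8)),   # 80% NA: below the threshold, so it is kept
  all_na    = rep(NA_real_, 10),     # 100% NA: above the threshold, so it is removed
  city      = c("Baghdad", NA, rep("Mosul", 8))
)

# Proportion of NA values in each column
na_share <- colMeans(is.na(toy))

# Names of the columns where more than 90% of the values are NA
drop_cols <- names(na_share)[na_share > 0.9]

# Keep every other column
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
```

Working with a proportion rather than a row count means the threshold does not have to be recomputed if rows are added or removed.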


```{r}
#Some columns contain codes for the text which is provided in a separate column. For example, the columns `region` and `region_txt` contain the same information, only `region` is the code for the full name provided in `region_txt`. Therefore, all columns which are number codes for adjacent text columns will be removed as these columns double count information provided elsewhere. 

column_names <- colnames(reduced_data_1)
#column_names

reduced_data_2 <- reduced_data_1 %>%
  select(-country, -region, -alternative, -attacktype1, -targtype1, -targsubtype1, -natlty1, -weaptype1, -weapsubtype1, -propextent)

#reduced_data_2
```


```{r}
#All columns which contain descriptions of items that are covered by data in other columns will be removed. An example is `location`: the longitude and latitude are provided in separate columns.

#Additionally, string/text analysis will not be carried out in this piece of work, so columns containing descriptions, e.g. `summary` or the `scite` columns, will be removed. Columns relating to media references are removed as well.

column_names_2 <- colnames(reduced_data_2)
#column_names_2

reduced_data_3 <- reduced_data_2 %>%
  select(-location, -summary, -target1, -motive, -weapdetail, -propcomment, -addnotes, -scite1, -scite2, -scite3, -dbsource)

#head(reduced_data_3, 10)
```


```{r}
#Getting there, but there are still 50 columns. Looking through these columns, most of them look like useful data. But there are a few columns which have ambiguous titles such as `crit1`. I need to check with the data owner what data these columns contain and see if they are necessary for the analysis. Data owner confirmed removal.

reduced_data_4 <- reduced_data_3 %>%
  select(-specificity, -vicinity, -crit1, -crit2, -crit3, -doubtterr, -alternative_txt, -multiple, -success, -corp1, -ingroup, -guncertain1, -nperps, -nperpcap, -nkillus, -nkillter, -property, -propextent_txt, -propvalue, -ishostkid, -INT_LOG, -INT_IDEO, -INT_MISC, -INT_ANY, -related)

#head(reduced_data_4, 10)
```


```{r}
#remove further columns which are not pertinent to the analysis.

reduced_data_5 <- reduced_data_4 %>%
  select(-extended, -approxdate, -nwoundus, -nwoundte)

#head(reduced_data_5, 10)
```


```{r}
#We are left with 21 columns of data relevant to this analysis. Next, rename the columns to make them more logical.

new_names <- c("eventid", "iyear", "imonth", "iday", "country", "region", "provstate", "city", "latitude", "longitude", "suicide", "attack_type_1", "target_type_1", "target_sub_type_1", "nationality", "group_name", "claimed", "weapon", "weapon_sub_type", "n_kill", "n_wounded")

colnames(reduced_data_5) <- new_names

head(reduced_data_5, 10)
```
Next, let's look at the number of NA entries in the dataframe.

```{r}
logic_na <- as.logical(rowSums(is.na(reduced_data_5)))

na_rows <- reduced_data_5[logic_na,]

df_na <- map_df(na_rows, function(x) as.numeric(is.na(x)))
 df_na_heat <- df_na %>%
    pivot_longer(cols = everything(),
           names_to = "x") %>%
    group_by(x) %>%
    mutate(y = row_number())

plot_na_matrix <- function(df_na) {
     # Preparing the dataframe for heatmaps 
    df_heat <- df_na %>%
        pivot_longer(cols = everything(),
           names_to = "x") %>%
        group_by(x) %>%
        mutate(y = row_number())
     # Ensuring the order of columns is kept as it is
    df_heat <- df_heat %>%
        ungroup() %>%
        mutate(x = factor(x,levels = colnames(df_na)))
     # Plotting data
    g <- ggplot(data = df_heat, aes(x=x, y=y, fill=value)) + 
        geom_tile() + 
        theme(legend.position = "none",
              axis.title.y=element_blank(),
              axis.text.y =element_blank(),
              axis.ticks.y=element_blank(),
              axis.title.x=element_blank(),
              axis.text.x = element_text(angle = 90, hjust = 1))
     # Returning the plot
    g
 } 
 
plot_na_matrix(df_na)
```
From this plot, you can see that the cleaned dataset still contains some NA values. This is because only columns with over 90% NA values were removed. There are very few NA values in the `city`, `longitude` and `latitude` columns. The column `target_sub_type_1` has quite a few NA values, presumably because there was only one target.
The column `weapon_sub_type` has a number of NA values, presumably relating to attacks that only used one weapon, i.e. they didn't have a sub-weapon.
Finally, `n_kill` and `n_wounded` contain a number of NA values. It would be good to know whether these NA values should be treated as 0 or as 'unknown'.
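
The choice between treating NA as 0 or as 'unknown' matters for any later summary of casualties. The sketch below uses a short made-up vector (not values from the dataset) to show that the two readings give the same total but different averages.

```r
# Hypothetical casualty counts; NA means the figure was not recorded
n_kill <- c(3, NA, 0, 5, NA)

# Option 1: treat NA as unknown and leave those attacks out of the calculation
total_known <- sum(n_kill, na.rm = TRUE)
mean_known  <- mean(n_kill, na.rm = TRUE)   # averages over the 3 recorded values

# Option 2: treat NA as zero deaths
n_kill_zero <- ifelse(is.na(n_kill), 0, n_kill)
total_zero  <- sum(n_kill_zero)
mean_zero   <- mean(n_kill_zero)            # averages over all 5 attacks
```

Note that without `na.rm = TRUE`, `sum()` and `mean()` return NA as soon as a single value is missing.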

```{r}
no_nas <- as.data.frame(colSums(is.na(na_rows)))
colnames(no_nas) <- "No. of NA Values"
#no_nas

#length(reduced_data_5$eventid)
```
<font size="5"> Data Analysis Questions </font>

From looking at the data, the following analysis will be carried out:

1. Analysis of Type of Attack 

1a - Number of attacks per year
  
1b - Analyse trends in types of attacks
  
1c - Analyse the number of suicide attacks per year. 
    
2. Analysis of Attack Organisation

2a - Number of attacks per group, per year
  
2b - Plot the number of attacks for each group based on geography. Are some groups spreading geographically? Are some groups condensing geographically?

3. Analysis of Type of Target

4. Analysis of Weapon Usage

5. Analysis of Geography of attacks

<font size="5"> Question 1a: Number of Attacks Vs Year </font>

Track the number of attacks over the years 2000 - 2015.

```{r}
attack_data_2 <- reduced_data_5

q1a_data <- attack_data_2 %>%
  group_by(iyear) %>%
  summarise(no_attacks = n())

#q1a_data

ggplot(data = q1a_data,
       aes(x = iyear, y = no_attacks))+
      geom_line(lwd = 1.5, color = "blue4") +
      labs(title = "No. of Attacks vs Year in Iraq", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white"))

#Filter the data so that it contains only 2015 attacks and check that all months are present in the data.

q1a_data_checkmonth <- attack_data_2 %>%
  group_by(iyear, imonth) %>%
  summarise(no_attacks = n()) %>%
  filter(iyear == 2015)

#q1a_data_checkmonth
  
```
Attacks in Iraq increased between 2000 and 2014, with a very large increase between 2012 and 2014. However, the number of attacks in 2015 dropped from the 2014 level. The data has been checked to ensure that it covers all months of 2015, which it does, so this decrease reflects a genuine reduction in recorded attacks in 2015 compared to 2014.
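
The month-coverage check can also be expressed as a single summary per year. The following is a minimal sketch on made-up data (`toy_attacks` is hypothetical and deliberately has an incomplete final year), assuming one row per attack with `iyear` and `imonth` columns as in the cleaned dataset.

```r
library(dplyr)

# Hypothetical event table: 2014 has all 12 months, 2015 stops after June
toy_attacks <- data.frame(
  iyear  = c(rep(2014, 12), rep(2015, 6)),
  imonth = c(1:12, 1:6)
)

# Count the distinct months recorded in each year and flag complete years
month_coverage <- toy_attacks %>%
  group_by(iyear) %>%
  summarise(n_months = n_distinct(imonth)) %>%
  mutate(complete = n_months == 12)
```

A year flagged as incomplete would make a year-on-year comparison of totals misleading.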

<font size="5"> Question 1b: Analyse trends in types of attacks </font>


```{r}
q1b_data <- attack_data_2 %>%
    group_by(iyear, attack_type_1) %>%
    summarise(no_attacks = n())

#q1b_data

ggplot(data = q1b_data,
       aes(x = iyear, y = no_attacks, color = attack_type_1)) +
      facet_wrap(~attack_type_1) +
      geom_line(lwd = 1, color = "blue4") +
      labs(title = "No. of Attacks vs Year in Iraq, per Attack Type", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white"))

```

From these line graphs, you can see that there has been a vast increase in bombing/explosion attacks between 2000 and 2015. Additionally, there has been an increase (albeit smaller) in armed assault. Assassination attacks have remained constant. It appears there has been a small increase in hostage taking in 2014 and 2015. 

```{r}
#This next section looked at whether there was any seasonal trend. There was insufficient evidence for one, so it has been left out of the analysis. 

q1c_data <- attack_data_2 %>%
    group_by(imonth) %>%
    summarise(no_attacks = n())

#q1c_data

#ggplot(data = q1c_data,
#       aes(x = imonth, y = no_attacks)) +
#      geom_line(lwd = 1.5, color = "blue") +
#      labs(title = "No. of Attacks vs Month in Iraq", x = "Month", y = "No. of Attacks") +
#      theme(panel.background = element_rect(fill = "white")) +
#      scale_x_discrete(limits=month.abb) 

#This data is taken across all years. Overall, September is the month with the lowest average number of attacks, whereas November has the highest average number of attacks. Let's repeat this graph and plot the years out individually and see whether the trends are seen in each year.
```

```{r}
q1ci_data <- attack_data_2 %>%
    mutate(iyear = as.character(iyear)) %>%
    group_by(iyear, imonth) %>%
    summarise(no_attacks = n()) 

year_check <- q1ci_data %>%
  group_by(iyear) %>%
  summarise(no_months_in_year_active = n())

#year_check

#iyear was converted to character above, so convert it back for the numeric comparison
q1ci_data <- q1ci_data %>%
  filter(as.integer(iyear) > 2003) 

#q1ci_data

#ggplot(data = q1ci_data,
#       aes(x = imonth, y = no_attacks, color = iyear, lty = iyear)) +
#      geom_line() +
#      labs(title = "No. of Attacks vs Month in Iraq", x = "Month", y = "No. of Attacks") +
#      theme(panel.background = element_rect(fill = "white")) +
#      scale_x_discrete(limits=month.abb) +
#      geom_text(data = subset(q1ci_data, imonth == 12), aes(label = iyear, colour = iyear, x = 12, y = no_attacks), hjust = -.1) +
#   theme(legend.position="none")

#From plotting the monthly number of attacks for each year, you can see that the main trends seen in the previous graph are actually down to the years 2013 and 2014 which have a large number of attacks and variation. Therefore, overall it does not seem likely that there is a monthly pattern of attacks.
```
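
One way to stop high-volume years from dominating a seasonal comparison is to express each month as a share of that year's attacks. The sketch below uses made-up monthly counts (`toy_monthly` is hypothetical) in which two years differ tenfold in volume but have the same monthly shape.

```r
library(dplyr)

# Hypothetical monthly counts: 2014 has ten times the volume of 2013
toy_monthly <- data.frame(
  iyear      = rep(c(2013, 2014), each = 3),
  imonth     = rep(1:3, times = 2),
  no_attacks = c(10, 20, 10, 100, 200, 100)
)

# Share of each year's attacks that fall in each month
monthly_share <- toy_monthly %>%
  group_by(iyear) %>%
  mutate(share = no_attacks / sum(no_attacks)) %>%
  ungroup()
```

Plotting `share` rather than `no_attacks` against month would put every year on the same scale.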

<font size="5"> Question 1c: Analyse the number of suicide attacks </font>


```{r}
q1d_data <- attack_data_2 %>%
    group_by(iyear, suicide) %>%
    summarise(no_attacks = n())

#q1d_data

ggplot(data = q1d_data,
       aes(x = iyear, y = no_attacks, color = as.character(suicide), lty = as.character(suicide))) +
  
      geom_line(lwd = 1) +
      labs(title = "Attacks and Suicide Attacks per Year in Iraq", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) +
      #theme(legend.title = element_blank()) +
   scale_color_discrete(name  ="Attack Type",
                          breaks=c("0", "1"),
                          labels=c("Not Suicide", "Suicide")) +
       scale_linetype_discrete(name  ="Attack Type",
                          breaks=c("0", "1"),
                          labels=c("Not Suicide", "Suicide"))

#q1d2_data <- q1d_data %>%
#  pivot_wider(names_from = suicide, values_from = no_attacks)

#q1d2_data[is.na(q1d2_data)] = 0



```
The graph above shows that while there has been a large increase in the number of attacks from 2000 to 2015, the number of suicide attacks has not increased to the same extent. In fact, between 2008 and 2012, suicide attacks decreased whilst the overall number of attacks increased. 
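
The comparison between suicide attacks and all attacks can be made explicit by computing the suicide share per year. This is a sketch on made-up records (`toy_attacks` is hypothetical), assuming `suicide` is coded 0/1 as it is used in the plot above.

```r
library(dplyr)

# Hypothetical per-attack records; suicide is coded 0/1
toy_attacks <- data.frame(
  iyear   = c(rep(2008, 4), rep(2012, 8)),
  suicide = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)

# Attacks, suicide attacks and the suicide share for each year
suicide_share <- toy_attacks %>%
  group_by(iyear) %>%
  summarise(no_attacks  = n(),
            no_suicide  = sum(suicide),
            pct_suicide = round(100 * no_suicide / no_attacks, 1))
```

In this toy example the total number of attacks doubles while the suicide share falls, which is the kind of pattern described above.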

<font size="5"> Question 2a: Analysis of Attack Organisation </font>

```{r}
q2ai_data <- attack_data_2 %>%
  group_by(iyear, group_name) %>%
  summarise(no_attack = n()) %>%
  pivot_wider(names_from = iyear, values_from = no_attack) %>%
  #replace_na(list(`2000` = 0)) %>%
  replace_na(list(`2000` = 0, `2001` = 0, `2002` = 0, `2003` = 0, `2004` = 0, `2005` = 0, `2006` = 0, `2007` = 0, `2008` = 0, `2009` = 0, `2010` = 0, `2011` = 0, `2012` = 0, `2013` = 0, `2014` = 0, `2015` = 0)) %>%
   mutate(total_all_yrs = rowSums(select(., 2:17))) %>%
   mutate(sum_2015 = rowSums(select(., 17)))

q2ai_data

```
This table shows the different terrorist groups in Iraq and the corresponding number of (known) attacks carried out between 2000 and 2015. There are 69 different terrorist groups in Iraq. For the purpose of this analysis, we will interrogate the data of groups which meet this criterion:

[1] Have carried out more than 20 attacks in total (across all years) 

This removes small perpetrator groups from the analysis.
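
The reshape-and-filter logic can be sketched on a small made-up table. Here `toy_counts` and its group names are hypothetical, and the zero-filling is done with the `values_fill` argument of `pivot_wider()`, which is an alternative to listing every year in `replace_na()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical attack counts per group and year
toy_counts <- data.frame(
  iyear      = c(2013, 2014, 2014, 2015),
  group_name = c("Group A", "Group A", "Group B", "Group B"),
  no_attack  = c(15, 10, 3, 4)
)

toy_wide <- toy_counts %>%
  # one column per year; missing group/year combinations become 0
  pivot_wider(names_from = iyear, values_from = no_attack,
              values_fill = list(no_attack = 0)) %>%
  # total across all year columns (everything except the group name)
  mutate(total_all_yrs = rowSums(select(., -group_name))) %>%
  # keep groups with more than 20 attacks in total
  filter(total_all_yrs > 20)
```

Only "Group A" (25 attacks in total) passes the threshold in this toy table.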

```{r}
q2ai_data <- q2ai_data %>%
  filter(total_all_yrs > 20)

#q2ai_data[,1]

```
After filtering out groups which have carried out 20 or fewer attacks in total, 10 groups are left. Unfortunately, one of these groups is 'Unknown'. As can be seen from the data, for the large majority of attacks the perpetrators are not identified and these attacks are recorded as 'Unknown'.
There are also 2 other ambiguous group names: "Other" and "Gunmen". Both have been allocated more than 20 attacks, but none of those attacks occur in the last 3 years. Given that these are unlikely to be the names of genuine terrorist groups, they will be removed from the data set for analysis.

After this, there are 8 groups of interest for analysis. The activities of these groups will be analysed in more detail.

```{r}
q2ai_data <-  q2ai_data %>%
  filter(group_name != "Other", group_name != "Gunmen")

#q2ai_data

#The 8 remaining groups of interest will be assigned to a vector `key_groups`

key_groups <- q2ai_data$group_name
#key_groups


  
```


```{r}
q2a_data <- attack_data_2 %>%
    filter(group_name %in% key_groups) %>%
    group_by(iyear, group_name) %>%
    summarise(no_attacks = n()) 

#q2a_data

g <- ggplot(data = q2a_data,
       aes(x = iyear, y = no_attacks, color = group_name)) +
      geom_line(lwd = 1.5) +
      labs(title = "No. of Attacks per Group per year", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) 
      #geom_text(data = subset(q2a_data_iraq, iyear == 2013), aes(label = group_name, colour = group_name, x = 2013, y = no_attacks), hjust = -.1) +
   #theme(legend.position="none")

gt <- ggplotGrob(g)
gt$layout$clip[gt$layout$name == "panel"] <- "off"
grid.draw(gt)

q2a_data_2 <- q2a_data %>%
    group_by(group_name) %>%
    summarise(no_attacks = sum(no_attacks))
    
#q2a_data_2

ggplot(data = q2a_data_2,
       aes(x = group_name, y = no_attacks, fill = group_name)) +
      geom_bar(stat = 'identity') +
      labs(title = "No. of Attacks per Group", x = "Group Name", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) +
      theme(axis.text.x = element_text(angle=60, hjust=1))+
      geom_text(aes(label = no_attacks, y = no_attacks+1000)) +
      theme(legend.position="none")


```
The bar chart shows the terrorist groups and the total number of attacks committed by each from 2000 to 2015. It is worth noting that the `group_name` for the vast majority of attacks is unknown. This could be because there are unknown groups active who do not claim attacks or, perhaps more likely, because the attacks are carried out by known groups but are not claimed or cannot be attributed.

Of the groups that are identified, ISIL has by far carried out the most attacks, followed by Al-Qaida in Iraq, then ISI. 

However, when looking at the line graph of number of attacks per year by different groups, it shows that the organisations are not consistent throughout the years. ISI is active between the years 2007 and 2010, Al-Qaida in Iraq is most active between 2011 and 2013 and after that, ISIL is by far the dominant organisation.

- Do members of these groups switch allegiance depending on which group is most 'popular' at the time? i.e. are the attacks carried out by largely the same population of people regardless of the group name?

- Or are these groups made up of categorically different people with different beliefs and motivations?

- i.e. is Al-Qaida largely defeated or have the perpetrators simply jumped ship to ISIL?

Further work: Look at how many of the attacks are 'claimed' and is that the main way that groups are identified or is the group_name identified by other means?
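
A first step for this further work could be a per-group summary of the `claimed` column. The sketch below runs on made-up records (`toy_attacks` is hypothetical) and assumes `claimed` is coded 1 for a claimed attack and 0 otherwise, which would need to be confirmed against the dataset's codebook.

```r
library(dplyr)

# Hypothetical per-attack records; claimed assumed to be 1 = claimed, 0 = not claimed
toy_attacks <- data.frame(
  group_name = c("Group A", "Group A", "Group A", "Group A", "Unknown", "Unknown"),
  claimed    = c(1, 1, 0, NA, 0, 0)
)

# Number and percentage of claimed attacks for each group
claim_summary <- toy_attacks %>%
  group_by(group_name) %>%
  summarise(no_attacks  = n(),
            no_claimed  = sum(claimed == 1, na.rm = TRUE),
            pct_claimed = round(100 * no_claimed / no_attacks, 1))
```

A named group with a low claimed percentage would suggest that `group_name` is often assigned by means other than a claim of responsibility.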


<font size="5"> Question 3a: Analysis of Target Type </font>

The following graphs show the number of attacks on each type of target. The first set of graphs shows attacks by all terrorist groups. The second shows only the attacks by ISIL, as the data shows that this is currently the most active group.

```{r}
q3a_data <- attack_data_2 %>%
  group_by(iyear, target_type_1) %>%
  summarise(no_attacks = n())

#q3a_data

ggplot(data = q3a_data,
       aes(x = iyear, y = no_attacks))+
      facet_wrap(~target_type_1)+
      geom_line(lwd = 1, color = "blue4") +
      labs(title = "No. of Attacks per year in Iraq, split by Target", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) 

q3a_data <- attack_data_2 %>%
  filter(group_name == "Islamic State of Iraq and the Levant (ISIL)") %>%
  group_by(iyear, target_type_1) %>%
  summarise(no_attacks = n()) 
  
ggplot(data = q3a_data,
      aes(x = iyear, y = no_attacks))+
      facet_wrap(~target_type_1)+
      geom_line(lwd = 1, color = "blue4") +
      labs(title = "No. of Attacks by ISIL per year in Iraq, split by Target", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) 
```
The number of attacks in Iraq has generally increased from 2000 to 2015. This is driven largely by attacks on private citizens and property, although there has also been an increase in attacks on the military and police. Attacks on businesses have increased too. The data shows that ISIL tends to attack Private Citizens & Property, Military and Police. Attacks on the Military by ISIL have increased year on year, which would suggest that this was a particular focus of the group.



```{r}
#This next section of code looked at the nationality of victims. The vast majority of victims were Iraqi nationals. No further investigation will be done on victim nationality. 

q3b_data <- attack_data_2 %>%
  group_by(nationality) %>%
  summarise(no_attacks = n()) %>%
  filter(no_attacks > 5)

#q3b_data

#ggplot(data = q3b_data,
#       aes(x = nationality, y = no_attacks, fill = nationality))+
#      geom_bar(stat = 'identity') +
#      theme(axis.text.x = element_text(angle=90, hjust=1)) +
#      labs(title = "Nationality of Attack Victims in Iraq - All Yrs (no. attacks > 5)", x = "Nationality", y = "No. of Attacks") +
#      theme(panel.background = element_rect(fill = "white")) 
```

<font size="5"> Question 4: Analysis on Weapon Use </font>

```{r}
q4a_data <- attack_data_2 %>%
  group_by(weapon) %>%
  summarise(no_attacks = n(), no_killed_real = sum(n_kill, na.rm = TRUE)) %>% #n_kill contains NA values, so ignore them when summing
  #summarise(no_killed = mean(claimed))
  mutate(weapon = str_replace(weapon, "Explosives/Bombs/Dynamite", "Explosives")) %>%
  mutate(weapon = str_replace(weapon, "not to include vehicle-borne explosives, i.e., car or truck bombs", "excl explosives")) 

#q4a_data

pct<- round(100*q4a_data$no_attacks/sum(q4a_data$no_attacks), 1)

ggplot(data = q4a_data,
       aes(x = weapon, y = no_attacks, fill = weapon))+
      geom_bar(stat = 'identity') +
      theme(axis.text.x = element_text(angle=50, hjust=1)) +
      labs(title = "No. of Attacks by Weapon Type - All Years", x = "Weapon", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) +
      theme(legend.position = "none") +
      geom_text(aes(label = no_attacks, y = no_attacks+500))





```
The vast majority of attacks are carried out using explosives, which account for 76% of all attacks (across all years). Attacks using firearms account for a further 20%.
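
The percentage share per weapon type can be added as a column on the summary table. This is a sketch with made-up counts (`toy_weapons` is hypothetical and does not reproduce the figures above); it only shows the calculation.

```r
library(dplyr)

# Hypothetical attack counts per weapon type
toy_weapons <- data.frame(
  weapon     = c("Firearms", "Explosives", "Melee", "Incendiary"),
  no_attacks = c(40, 150, 2, 8)
)

# Percentage of all attacks per weapon type, largest first
weapon_share <- toy_weapons %>%
  mutate(pct = round(100 * no_attacks / sum(no_attacks), 1)) %>%
  arrange(desc(pct))
```

Keeping the percentage in the same table as the counts makes it straightforward to use as a label on the bar chart.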

As these are by far the methods used most for carrying out attacks, these two weapon types will be explored in more detail. Firstly how are the numbers of these attack types changing over the years?

```{r}
q4b_data <- attack_data_2 %>%
  group_by(iyear, weapon) %>%
  summarise(no_attacks = n()) %>%
  filter(weapon == "Explosives/Bombs/Dynamite" | weapon == "Firearms") %>%
  pivot_wider(names_from = weapon, values_from = no_attacks) %>%
  replace_na(list(Firearms = 0)) %>%
  mutate(total = `Explosives/Bombs/Dynamite` + `Firearms`) %>%
  mutate(expl_pct = round(`Explosives/Bombs/Dynamite`/total*100, 0)) %>%
  pivot_longer(cols = c(`Explosives/Bombs/Dynamite`, `Firearms`),  names_to = "weapon", values_to = "no_attacks") 
  
#q4b_data

#ggplot(data = q4b_data,
#      aes(x = iyear, y = no_attacks, fill = weapon))+
#      geom_bar(stat = 'identity')+
#      labs(title = "No. of Explosives and Firearms Attacks per Year (% = Explosive Attacks)", x = "Year", y = "No. of Attacks") +
#      theme(panel.background = element_rect(fill = "white")) +
#      geom_text(aes(label = paste(expl_pct,"%"), y = total+0.5))



```

```{r}
ggplot(data = q4b_data, aes(x=iyear)) +
  
  geom_line( aes(y=no_attacks, color = weapon), size=2) + 
 # geom_line( aes(y=expl_pct*40), lty = 2, size=0.5) +
  
  #scale_y_continuous(name = "No. of attacks", sec.axis = sec_axis(~ . /40, name="% of Attacks that use Explosives")
  #) + 
  
  #ggtitle("No. of Explosive and Firearms attacks per Year - Iraq")+
  labs(title = "No. of Explosive and Firearms attacks per Year - Iraq", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) 
```

As Explosives/Bombs/Dynamite account for such a large proportion of all attacks, it will be useful to understand more about the sub-type of weapon and if there are any trends in what attackers are using.


```{r}
q4c_data <- attack_data_2 %>%
  filter(weapon == "Explosives/Bombs/Dynamite") %>%
  group_by(iyear, weapon_sub_type) %>%
  summarise(no_attacks = n()) %>%
  filter(weapon_sub_type != "Unknown Explosive Type") %>%
  filter(weapon_sub_type != "Other Explosive Device")

#q4c_data

ggplot(data = q4c_data,
       aes(x = iyear, y = no_attacks, lty = weapon_sub_type, color = weapon_sub_type), size = 2)+
      geom_line(lwd = 1.25)+
     # scale_colour_brewer("Weapon Sub-Type", palette="Set1") +
      labs(title = "Types of Explosives Attacks - 2000 to 2015", x = "Year", y = "No. of Attacks") +
      theme(panel.background = element_rect(fill = "white")) 
  
```
There has been a large increase in attacks using vehicle-borne explosives, i.e. car bombs. In fact, in recent years this sub-type accounts for the majority of explosives attacks. Additionally, projectiles have been used more in recent years. Sticky bombs were used more around 2010-2011, but incidents using sticky bombs have declined since.
Overall, the two main explosives used are vehicle bombs and projectiles.

<font size="5"> Question 5: Analysis on Geography of Attacks </font>

```{r}
q5a_data <- attack_data_2 %>%
  select(iyear, latitude, longitude, group_name) %>%
  filter(group_name %in% key_groups) %>%
  filter(longitude < 50 & latitude < 39) %>% #removing 3no rogue locations outside of Iraq
  mutate(iyear = as.integer(iyear)) %>%
  mutate(group_name = str_replace(group_name, "Islamic State of Iraq \\(ISI\\)", "ISI")) %>%
  mutate(group_name = str_replace(group_name, "Islamic State of Iraq and the Levant \\(ISIL\\)", "ISIL"))

q5a_data

fileName <- 'google_api.txt'
code <- readChar(fileName, file.info(fileName)$size)

#register_google(key = "XXX")
register_google(key = code)

#get_map("Iraq", zoom = 6) %>% ggmap()

#iraq <- c(left = 38.5, bottom = 29, right = 48.5, top = 37.5)
#get_stamenmap(iraq, zoom = 6) %>% ggmap() 

qmplot(longitude, latitude, data = q5a_data, maptype = "toner-lite", darken = 0, color = group_name) + scale_colour_brewer("Group Name", palette="Set1")+
  labs(title = "Attack Locations from 2000 to 2015")

#qmplot(longitude, latitude, data = q5a_data, maptype = "toner-lite", darken = 0, color = group_name) + scale_colour_brewer("Group Name", palette="Set1")+
#  facet_wrap(~group_name)

```
This plot shows the locations of all attacks carried out by the 8 key groups, across all years.

```{r}
qmplot(longitude, latitude, data = q5a_data, maptype = "toner-lite", darken = 0, color = -iyear) + 
  scale_color_viridis(option = "C")+
  facet_wrap(~group_name) +
  #scale_color_manual(values = brewer.pal(n=16, name = "RdBu"))+
  labs(title = "Attack Locations by Group, 2000 to 2015") +
  theme(legend.title = element_blank()) 

#ggsave("mapiraq.png")
```
This plot shows the attacks per group, colour coded by year. Yellow markers show attacks that occurred in the early 2000s. Dark blue markers show recent attacks.

From this you can see that the group Tawhid and Jihad is no longer active in carrying out attacks. ISI attacks were carried out predominantly between the years 2005 and 2010, whilst ISIL attacks are shown to be from 2010 onwards. This makes sense because ISI largely became ISIL.

The Al-Naqshabandiya Army appears to be a group carrying out attacks in central Iraq since 2010. It is therefore a relatively new group. Al-Qaida in Iraq is a long-standing group which has carried out attacks in Iraq from the year 2000 and still carries out attacks in 2015. Ansar al-Islam is another long-standing group carrying out attacks across all years; however, it has carried out far fewer attacks than Al-Qaida in Iraq. The group Muslim Fundamentalists has carried out attacks in recent years, mostly in central Iraq.

Unfortunately, the great majority of the attacks do not have an identified group.








