library(tidyverse)
library(ggfortify)
library(RColorBrewer)
Suicide Attacks
Introduction
This data set covers suicide attacks from 1982 through October 2020. The database includes information about the location of attacks, the target type, the weapon used, and systematic information on the demographic and general biographical characteristics of suicide attackers. The current CPOST-SAD release contains the universe of suicide attacks from 1982 through September 2015, a total of 4814 attacks in over 40 countries.
Variables
date.year: the year in which the suicide attack occurred (numeric)
statistics.# wounded_high: the highest reported number of people wounded in an attack (numeric)
statistics.# killed_high: the highest reported number of people killed in an attack (numeric)
target.country: the country that was targeted in the attack (character)
Question
How do the number of wounded and the year of an attack affect the number of deaths in suicide attacks across different target countries?
Source
Chicago Project on Security and Terrorism (CPOST). 2020. Suicide Attack Database (October, 2020 Release)
Load the libraries
Load the data set
suicide_attacks <- read_csv("suicide_attacks.csv")
Just to look at the data type and first 6 rows
head(suicide_attacks)
# A tibble: 6 × 39
groups claim status statistics.sources date.year date.month date.day
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Islamic State Susp… Confi… 2 2015 6 2
2 Islamic State Susp… Possi… 3 2017 1 6
3 Islamic State Susp… Possi… 3 2017 1 6
4 Unknown Group Uncl… Confi… 4 2004 10 5
5 Taliban (IEA) Clai… Possi… 5 2017 7 4
6 Al-Jaysh al-Isl… Clai… Confi… 4 2012 10 3
# ℹ 32 more variables: `statistics.# wounded_low` <dbl>,
# `statistics.# wounded_high` <dbl>, `statistics.# killed_low` <dbl>,
# `statistics.# killed_high` <dbl>, `statistics.# killed_low_civilian` <dbl>,
# `statistics.# killed_high_civilian` <dbl>,
# `statistics.# killed_low_political` <dbl>,
# `statistics.# killed_high_political` <dbl>,
# `statistics.# killed_low_security` <dbl>, …
str(suicide_attacks)
spc_tbl_ [10,018 × 39] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ groups : chr [1:10018] "Islamic State" "Islamic State" "Islamic State" "Unknown Group" ...
$ claim : chr [1:10018] "Suspected" "Suspected" "Suspected" "Unclaimed" ...
$ status : chr [1:10018] "Confirmed Suicide" "Possible - Too Few Sources" "Possible - Too Few Sources" "Confirmed Suicide" ...
$ statistics.sources : num [1:10018] 2 3 3 4 5 4 4 4 4 4 ...
$ date.year : num [1:10018] 2015 2017 2017 2004 2017 ...
$ date.month : num [1:10018] 6 1 1 10 7 10 10 10 10 10 ...
$ date.day : num [1:10018] 2 6 6 5 4 3 3 3 3 3 ...
$ statistics.# wounded_low : num [1:10018] 8 0 0 10 2 100 100 100 100 100 ...
$ statistics.# wounded_high : num [1:10018] 8 0 0 15 2 120 120 120 120 120 ...
$ statistics.# killed_low : num [1:10018] 5 40 40 1 0 31 31 31 31 31 ...
$ statistics.# killed_high : num [1:10018] 5 40 40 10 0 40 40 40 40 40 ...
$ statistics.# killed_low_civilian : num [1:10018] 0 20 20 1 0 31 31 31 31 31 ...
$ statistics.# killed_high_civilian : num [1:10018] 0 20 20 10 0 40 40 40 40 40 ...
$ statistics.# killed_low_political : num [1:10018] 0 0 0 0 0 0 0 0 0 0 ...
$ statistics.# killed_high_political: num [1:10018] 0 0 0 0 0 0 0 0 0 0 ...
$ statistics.# killed_low_security : num [1:10018] 5 20 20 0 0 0 0 0 0 0 ...
$ statistics.# killed_high_security : num [1:10018] 5 20 20 0 0 0 0 0 0 0 ...
$ statistics.# belt_bomb : num [1:10018] 0 0 0 0 0 0 0 0 0 0 ...
$ statistics.# truck_bomb : num [1:10018] 0 0 0 0 1 0 0 0 0 0 ...
$ statistics.# car_bomb : num [1:10018] 1 0 0 1 0 1 1 1 1 1 ...
$ statistics.# weapon_oth : num [1:10018] 0 1 1 0 0 0 0 0 0 0 ...
$ statistics.# weapon_unk : num [1:10018] 0 0 0 0 0 0 0 0 0 0 ...
$ target.weapon : chr [1:10018] "Car bomb" "Unspecified" "Unspecified" "Car bomb" ...
$ target.region : chr [1:10018] "Asia" "Asia" "Asia" "Asia" ...
$ target.subregion : chr [1:10018] "Western Asia" "Western Asia" "Western Asia" "Western Asia" ...
$ target.country : chr [1:10018] "Syria" "Syria" "Syria" "Iraq" ...
$ target.province : chr [1:10018] "Hasaka (Al Haksa)" "Deir ez-Zor" "Deir ez-Zor" "Baghdad" ...
$ target.city : chr [1:10018] "Al Hasakah" "Deir ez-Zor" "Deir ez-Zor" "Baghdad" ...
$ target.location : chr [1:10018] "close to a children's hospital" "Route between City & Deir ez-Zor Airport" "Route between City & Deir ez-Zor Airport" "Al Dora neighborhood, near refinery and cathedral" ...
$ target.latitude : num [1:10018] 36.5 35.3 35.3 33.3 31.8 ...
$ target.longtitude : num [1:10018] 40.8 40.1 40.1 44.4 64.5 ...
$ target.desc : chr [1:10018] "Syrian Army checkpoint" "Syrian regime forces" "Syrian regime forces" "Iraqi Police patrol" ...
$ target.type : chr [1:10018] "Security" "Security" "Security" "Security" ...
$ target.nationality : chr [1:10018] "Syrian" "Syrian" "Syrian" "Iraqi" ...
$ statistics.# attackers : num [1:10018] 1 2 2 1 1 3 3 3 3 3 ...
$ statistics.# female_attackers : num [1:10018] 0 0 0 0 0 0 0 0 0 0 ...
$ statistics.# male_attackers : num [1:10018] 0 0 0 0 0 0 0 0 0 0 ...
$ statistics.# unknown_attackers : num [1:10018] 1 2 2 1 1 3 3 3 3 3 ...
$ attacker.gender : chr [1:10018] "Unknown" "Unknown" "Unknown" "Unknown" ...
- attr(*, "spec")=
.. cols(
.. groups = col_character(),
.. claim = col_character(),
.. status = col_character(),
.. statistics.sources = col_double(),
.. date.year = col_double(),
.. date.month = col_double(),
.. date.day = col_double(),
.. `statistics.# wounded_low` = col_double(),
.. `statistics.# wounded_high` = col_double(),
.. `statistics.# killed_low` = col_double(),
.. `statistics.# killed_high` = col_double(),
.. `statistics.# killed_low_civilian` = col_double(),
.. `statistics.# killed_high_civilian` = col_double(),
.. `statistics.# killed_low_political` = col_double(),
.. `statistics.# killed_high_political` = col_double(),
.. `statistics.# killed_low_security` = col_double(),
.. `statistics.# killed_high_security` = col_double(),
.. `statistics.# belt_bomb` = col_double(),
.. `statistics.# truck_bomb` = col_double(),
.. `statistics.# car_bomb` = col_double(),
.. `statistics.# weapon_oth` = col_double(),
.. `statistics.# weapon_unk` = col_double(),
.. target.weapon = col_character(),
.. target.region = col_character(),
.. target.subregion = col_character(),
.. target.country = col_character(),
.. target.province = col_character(),
.. target.city = col_character(),
.. target.location = col_character(),
.. target.latitude = col_double(),
.. target.longtitude = col_double(),
.. target.desc = col_character(),
.. target.type = col_character(),
.. target.nationality = col_character(),
.. `statistics.# attackers` = col_double(),
.. `statistics.# female_attackers` = col_double(),
.. `statistics.# male_attackers` = col_double(),
.. `statistics.# unknown_attackers` = col_double(),
.. attacker.gender = col_character()
.. )
- attr(*, "problems")=<externalptr>
Data cleaning
names(suicide_attacks) <- gsub("[#]", "_", names(suicide_attacks)) ## Replace "#" in the column names with an underscore
names(suicide_attacks) <- gsub("[.]", "", names(suicide_attacks)) ## Remove "." from the column names
head(suicide_attacks)
# A tibble: 6 × 39
groups claim status statisticssources dateyear datemonth dateday
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Islamic State Susp… Confi… 2 2015 6 2
2 Islamic State Susp… Possi… 3 2017 1 6
3 Islamic State Susp… Possi… 3 2017 1 6
4 Unknown Group Uncl… Confi… 4 2004 10 5
5 Taliban (IEA) Clai… Possi… 5 2017 7 4
6 Al-Jaysh al-Islami … Clai… Confi… 4 2012 10 3
# ℹ 32 more variables: `statistics_ wounded_low` <dbl>,
# `statistics_ wounded_high` <dbl>, `statistics_ killed_low` <dbl>,
# `statistics_ killed_high` <dbl>, `statistics_ killed_low_civilian` <dbl>,
# `statistics_ killed_high_civilian` <dbl>,
# `statistics_ killed_low_political` <dbl>,
# `statistics_ killed_high_political` <dbl>,
# `statistics_ killed_low_security` <dbl>, …
To look at the exact names for columns
names(suicide_attacks)
[1] "groups" "claim"
[3] "status" "statisticssources"
[5] "dateyear" "datemonth"
[7] "dateday" "statistics_ wounded_low"
[9] "statistics_ wounded_high" "statistics_ killed_low"
[11] "statistics_ killed_high" "statistics_ killed_low_civilian"
[13] "statistics_ killed_high_civilian" "statistics_ killed_low_political"
[15] "statistics_ killed_high_political" "statistics_ killed_low_security"
[17] "statistics_ killed_high_security" "statistics_ belt_bomb"
[19] "statistics_ truck_bomb" "statistics_ car_bomb"
[21] "statistics_ weapon_oth" "statistics_ weapon_unk"
[23] "targetweapon" "targetregion"
[25] "targetsubregion" "targetcountry"
[27] "targetprovince" "targetcity"
[29] "targetlocation" "targetlatitude"
[31] "targetlongtitude" "targetdesc"
[33] "targettype" "targetnationality"
[35] "statistics_ attackers" "statistics_ female_attackers"
[37] "statistics_ male_attackers" "statistics_ unknown_attackers"
[39] "attackergender"
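The names above still contain a space after the underscore (for example `statistics_ wounded_high`), which is why backticks are needed in the later code. The following is a minimal sketch of one alternative cleaning step that would avoid this, shown on a small hypothetical vector of names rather than on the real data. It is not what the analysis uses; the rest of this report keeps the names exactly as printed above.

```r
# Hypothetical subset of the raw column names (for illustration only)
raw_names <- c("date.year",
               "statistics.# wounded_high",
               "statistics.# killed_high",
               "target.country")

# Collapse the literal ".# " sequence into one underscore,
# then turn any remaining "." into "_" as well
clean_names <- gsub(".# ", "_", raw_names, fixed = TRUE)
clean_names <- gsub(".", "_", clean_names, fixed = TRUE)

clean_names
```

With names like these, columns could be referenced without backticks, at the cost of differing from the names used in the rest of this analysis.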
Removing NAs from the columns I need
suicide_country <- suicide_attacks |>
  filter(!is.na(dateyear) & !is.na(`statistics_ killed_high`) & !is.na(`statistics_ wounded_high`) & !is.na(targetcountry))
Selecting columns I need for the research question
suicide_country <- suicide_country |>
select(dateyear,`statistics_ killed_high`,`statistics_ wounded_high`,targetcountry) |>
group_by(dateyear,`statistics_ killed_high`,`statistics_ wounded_high`,targetcountry)
head(suicide_country)
# A tibble: 6 × 4
# Groups: dateyear, statistics_ killed_high, statistics_ wounded_high,
# targetcountry [5]
dateyear `statistics_ killed_high` `statistics_ wounded_high` targetcountry
<dbl> <dbl> <dbl> <chr>
1 2015 5 8 Syria
2 2017 40 0 Syria
3 2017 40 0 Syria
4 2004 10 15 Iraq
5 2017 0 2 Afghanistan
6 2012 40 120 Syria
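One way to check how much an NA filter actually removes is to count the missing values per column and compare row counts before and after. The sketch below does this in base R on a small hypothetical data frame; the column names `killed` and `country` are shortened stand-ins, not the real column names.

```r
# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  dateyear = c(2015, NA, 2017),
  killed   = c(5, 40, NA),
  country  = c("Syria", "Syria", NA)
)

# Number of NAs in each column
na_counts <- colSums(is.na(toy))

# Row counts before and after dropping incomplete rows
rows_before <- nrow(toy)
rows_after  <- nrow(toy[complete.cases(toy), ])

na_counts
c(before = rows_before, after = rows_after)
```

Running the same two checks on the four columns used here would show whether the filter changed the data at all.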
Linear Regression Model
fit1 <- lm(`statistics_ killed_high` ~ dateyear + `statistics_ wounded_high` + targetcountry, data = suicide_country)
autoplot(fit1, 1:4, nrow = 2, ncol = 2) ## Got this from correlation scatter plots and regressions tutorial, to see the diagnostic plots
summary(fit1)
Call:
lm(formula = `statistics_ killed_high` ~ dateyear + `statistics_ wounded_high` +
targetcountry, data = suicide_country)
Residuals:
Min 1Q Median 3Q Max
-1530.81 -4.12 -0.99 2.36 251.10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.123e+01 1.071e+02 0.198 0.842937
dateyear -1.043e-02 5.324e-02 -0.196 0.844746
`statistics_ wounded_high` 3.951e-01 1.286e-03 307.138 < 2e-16 ***
targetcountryAlgeria -8.934e+00 4.454e+00 -2.006 0.044883 *
targetcountryArgentina 5.537e+00 1.657e+01 0.334 0.738237
targetcountryBangladesh -5.190e-02 4.171e+00 -0.012 0.990072
targetcountryBelgium -3.432e+01 1.351e+01 -2.541 0.011082 *
targetcountryBolivia -2.284e+00 2.339e+01 -0.098 0.922218
targetcountryBosnia & Herzegovina -1.954e+00 2.339e+01 -0.084 0.933413
targetcountryBulgaria -6.893e+00 2.338e+01 -0.295 0.768160
targetcountryBurkina Faso -7.921e+00 1.351e+01 -0.586 0.557708
targetcountryCameroon -5.739e-01 2.238e+00 -0.256 0.797608
targetcountryChad -5.413e+00 4.453e+00 -1.216 0.224176
targetcountryChina -4.701e+00 4.450e+00 -1.056 0.290841
targetcountryColombia -6.045e+00 2.339e+01 -0.258 0.796039
targetcountryDjibouti -5.155e+00 1.654e+01 -0.312 0.755284
targetcountryEgypt -4.259e+00 2.637e+00 -1.615 0.106340
targetcountryFinland -1.851e+01 1.655e+01 -1.118 0.263456
targetcountryFrance 1.838e+01 8.284e+00 2.219 0.026514 *
targetcountryGeorgia -2.071e-01 2.338e+01 -0.009 0.992933
targetcountryGermany -6.134e+00 2.338e+01 -0.262 0.793078
targetcountryIndia 3.062e+00 5.141e+00 0.596 0.551483
targetcountryIndonesia -1.585e+00 3.934e+00 -0.403 0.686950
targetcountryIran -1.209e+01 6.507e+00 -1.858 0.063138 .
targetcountryIraq 1.686e+00 6.641e-01 2.539 0.011129 *
targetcountryIsrael -1.237e+01 2.095e+00 -5.905 3.65e-09 ***
targetcountryJordan 3.119e+00 8.852e+00 0.352 0.724569
targetcountryKazakhstan -1.544e-01 1.654e+01 -0.009 0.992552
targetcountryKenya -2.322e+02 8.338e+00 -27.856 < 2e-16 ***
targetcountryKuwait -2.795e+01 1.354e+01 -2.064 0.039051 *
targetcountryKyrgyzstan -1.393e+00 2.338e+01 -0.060 0.952514
targetcountryLebanon 5.028e+00 2.221e+00 2.263 0.023635 *
targetcountryLibya -7.774e-02 2.342e+00 -0.033 0.973517
targetcountryMali 2.108e+00 3.095e+00 0.681 0.495883
targetcountryMauritania -1.466e+00 2.338e+01 -0.063 0.950028
targetcountryMontenegro -1.863e-01 2.339e+01 -0.008 0.993644
targetcountryMorocco -2.815e+00 4.324e+00 -0.651 0.514994
targetcountryNiger 3.387e+00 3.942e+00 0.859 0.390268
targetcountryNigeria 3.149e+00 1.145e+00 2.750 0.005964 **
targetcountryPakistan -2.452e-01 9.883e-01 -0.248 0.804033
targetcountryPalestine -1.231e+00 2.519e+00 -0.489 0.625135
targetcountryPanama 2.056e+01 2.340e+01 0.879 0.379598
targetcountryPhilippines -9.753e+00 7.819e+00 -1.247 0.212314
targetcountryQatar -4.063e+00 2.339e+01 -0.174 0.862062
targetcountryRussia -2.784e+00 2.177e+00 -1.279 0.200949
targetcountrySaudi Arabia -9.791e+00 3.693e+00 -2.652 0.008025 **
targetcountrySerbia -6.023e-01 2.338e+01 -0.026 0.979453
targetcountrySomalia 5.227e+00 1.486e+00 3.518 0.000436 ***
targetcountrySouth Sudan -4.145e+00 2.338e+01 -0.177 0.859321
targetcountrySpain -5.176e-01 8.862e+00 -0.058 0.953424
targetcountrySri Lanka 1.106e+01 1.568e+00 7.052 1.87e-12 ***
targetcountrySweden -1.060e+00 2.338e+01 -0.045 0.963845
targetcountrySyria 5.408e+00 9.466e-01 5.713 1.14e-08 ***
targetcountryTajikistan -5.988e+00 1.170e+01 -0.512 0.608880
targetcountryTanzania -1.863e+01 2.340e+01 -0.797 0.425748
targetcountryTunisia 9.927e-01 6.773e+00 0.147 0.883489
targetcountryTurkey -1.263e+01 2.720e+00 -4.643 3.49e-06 ***
targetcountryUganda 5.202e+01 1.654e+01 3.145 0.001663 **
targetcountryUkraine -1.031e-01 1.170e+01 -0.009 0.992968
targetcountryUnited Kingdom -6.721e+01 1.047e+01 -6.417 1.45e-10 ***
targetcountryUnited States 2.300e+02 6.356e+00 36.182 < 2e-16 ***
targetcountryUzbekistan 9.591e-02 9.566e+00 0.010 0.992001
targetcountryYemen 4.178e+00 1.635e+00 2.555 0.010628 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.38 on 9955 degrees of freedom
Multiple R-squared: 0.9647, Adjusted R-squared: 0.9644
F-statistic: 4382 on 62 and 9955 DF, p-value: < 2.2e-16
Linear Regression Analysis
The multiple linear regression model predicts the number of people killed in suicide attacks based on the year, the number of wounded, and the target country.
Model Equation
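A simple linear regression has the form y = a + bx. Because I am using multiple regression, the equation becomes y = a + b1x1 + b2x2, plus one term for each target country. So, in my case, statistics_ killed_high = a + b1(dateyear) + b2(statistics_ wounded_high)
a = intercept; b1 = how deaths change when the year increases by one (slope for year); b2 = how deaths change for each additional person wounded (slope for wounded). Each country coefficient in the summary shifts the intercept relative to the reference country, Afghanistan, which is the level that does not appear in the coefficient table. So the final equation, for the reference country, will be,
statistics_ killed_high = 21.23 - 0.01043(dateyear) + 0.3951(statistics_ wounded_high)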
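As a worked example of reading this equation, the sketch below plugs values in by hand. The coefficients are copied from the summary output; the inputs (year 2015, 8 wounded) and the helper name `predict_killed` are hypothetical choices made for illustration. The `country_effect` argument defaults to 0, which corresponds to the country that has no coefficient of its own in the table.

```r
# Coefficients copied from summary(fit1)
intercept <- 21.23
b_year    <- -0.01043
b_wounded <- 0.3951

# Hypothetical helper: predicted deaths for one attack.
# country_effect is the targetcountry coefficient (0 for the baseline country).
predict_killed <- function(year, wounded, country_effect = 0) {
  intercept + b_year * year + b_wounded * wounded + country_effect
}

# Baseline country, year 2015, 8 wounded: about 3.37 predicted deaths
pred_reference <- predict_killed(2015, 8)

# Same attack with the Syria coefficient (5.408) added: about 8.78
pred_syria <- predict_killed(2015, 8, country_effect = 5.408)

c(reference = pred_reference, syria = pred_syria)
```

In practice `predict(fit1, newdata = ...)` does the same arithmetic with the unrounded coefficients, so its results may differ slightly from this hand calculation.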
P-value and Adjusted R-squared analysis
The p-value for statistics_ wounded_high (<2e-16) indicates that the number of wounded is a strong, statistically significant predictor of deaths. The variable dateyear has a high p-value (0.844746), suggesting that the year has little or no detectable effect on the number of deaths once the other variables are accounted for.
The adjusted R-squared = 0.9644, meaning the model explains 96.4% of the variation in the number of deaths. The overall model is statistically significant.
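The adjusted R-squared can be approximately reproduced from the other numbers in the summary. The sketch below uses the standard adjustment formula with the rounded values printed above, so the result is only expected to agree with the reported 0.9644 up to rounding.

```r
# Values read from the summary output (R-squared is already rounded there)
r_squared <- 0.9647
p <- 62               # number of predictors (numerator df of the F-statistic)
n <- 9955 + p + 1     # observations = residual df + predictors + intercept

# Standard adjustment for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

n                        # 10018, the number of rows in the data
round(adj_r_squared, 3)  # about 0.964
```

Because n is so much larger than p, the adjustment barely changes the value, which is why the multiple and adjusted R-squared are nearly identical here.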
Diagnostic plots
Residuals vs fitted plot: the blue line is slightly curved, suggesting that the relationship may not be perfectly linear and that a few points deviate from the pattern.
Normal Q-Q plot: most points follow the diagonal line, meaning the residuals are approximately normal, but a few outliers (such as observations 367, 4527, and 8801) deviate from normality.
Scale-Location plot: the spread of the residuals increases with the fitted values, indicating mild heteroscedasticity.
Cook’s Distance: a few influential points (367, 1609, and 8801) may affect the model’s results.
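Influential points can also be listed numerically instead of being read off the plot. The sketch below uses base R's `cooks.distance()` on a small hypothetical data set with one deliberately extreme value; the 4/n cutoff is a common rule of thumb and an assumption here, not something taken from the analysis above.

```r
# Hypothetical toy data: a nearly linear relationship with one extreme point
x <- 1:20
y <- 2 * x + rep(c(-0.5, 0.5), times = 10)
y[20] <- 200   # deliberately extreme observation

toy_fit <- lm(y ~ x)

# Cook's distance for every observation
cooks <- cooks.distance(toy_fit)

# Rule-of-thumb cutoff (assumed): flag points above 4 / n
threshold <- 4 / length(cooks)
flagged <- which(cooks > threshold)

flagged
```

Applied to the fitted model, the same two calls would give the row numbers of candidate influential observations, which could then be inspected individually.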
Citation
https://rpubs.com/rsaidi/950425
Grouping the data by country
suicide_country_grouped <- suicide_country |>
group_by(targetcountry) |>
summarize(avg_wounded = mean(`statistics_ wounded_high`),
avg_killed = mean(`statistics_ killed_high`)) |>
arrange(desc(avg_killed))
Top 5 by average deaths per attack
top5_countries <- suicide_country_grouped |>
slice_max(order_by = avg_killed, n=5)
top5_countries
# A tibble: 5 × 3
targetcountry avg_wounded avg_killed
<chr> <dbl> <dbl>
1 United States 3781 1724.
2 Argentina 200 85
3 Uganda 60 76
4 Kenya 686. 39
5 France 40.9 34.8
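A per-country average can be driven by a country with very few rows, since an average over a single attack is just that attack. The sketch below, on hypothetical toy data (not the CPOST values), adds the number of rows and the total next to the mean so the three can be compared. The column names `n_attacks` and `total_killed` are illustrative.

```r
library(dplyr)

# Hypothetical toy data using the same column names as suicide_country
toy <- tibble(
  targetcountry = c("A", "A", "A", "B"),
  `statistics_ killed_high` = c(2, 4, 6, 50)
)

toy_summary <- toy |>
  group_by(targetcountry) |>
  summarize(n_attacks    = n(),                              # rows per country
            avg_killed   = mean(`statistics_ killed_high`),  # mean per row
            total_killed = sum(`statistics_ killed_high`)) |>
  arrange(desc(avg_killed))

toy_summary
```

In this toy example country B ranks first by average even though it has a single row, which is the kind of pattern the count column makes visible.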
Filtering data for the top 5 countries
suicide_top5 <- suicide_country |>
  filter(targetcountry %in% top5_countries$targetcountry)
Plot 1
ggplot(suicide_top5, aes(x = factor(dateyear), ## I did factor(dateyear) so all years appear as separate categories on the x axis
                         y = `statistics_ killed_high`,
                         fill = targetcountry)) +
  geom_col(position = position_dodge(width = 0.9), width = 0.5) +
  labs(title = "Yearly Suicide Attack Fatalities (Top 5 Countries)",
       x = "Year",
       y = "Number of People Killed",
       fill = "Country", ## the legend comes from the fill aesthetic, so the label must be set with fill, not color
       caption = "Source: CPOST Suicide Attack Database (2020)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
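If a country has more than one row in the same year, the data passed to the plot contains several values for one bar position. One option is to sum the deaths per country and year first, so that each bar is a yearly total. The sketch below shows that aggregation on hypothetical toy data; the name `yearly_totals` and the values are illustrative.

```r
library(dplyr)

# Hypothetical toy data using the same column names as suicide_top5
toy_top <- tibble(
  dateyear = c(1998, 1998, 2002, 2002),
  targetcountry = c("Kenya", "Kenya", "Kenya", "Uganda"),
  `statistics_ killed_high` = c(10, 5, 3, 7)
)

# One row per country and year, holding the summed deaths
yearly_totals <- toy_top |>
  group_by(targetcountry, dateyear) |>
  summarize(total_killed = sum(`statistics_ killed_high`), .groups = "drop")

yearly_totals
```

A table shaped like this could then be passed to `ggplot()` with `y = total_killed` in place of the per-row column.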
Removing the U.S. because it is an outlier
suicide_no_us <- suicide_top5 |>
  filter(targetcountry != "United States")
Plot 2 - Final plot (for the grading)
ggplot(suicide_no_us, aes(x=factor(dateyear), y = `statistics_ killed_high`, fill=targetcountry))+
geom_col(position = position_dodge(width =0.5),width = 0.6)+
labs(title = "Yearly Suicide Attack Fatalities (Top 4 Countries, excluding U.S.)",
x = "Year",
y = "Number of People Killed",
fill = "Country",
caption = "Source: CPOST Suicide Attack Database (2020)") +
theme_light() +
  scale_fill_brewer(palette = "Set1")
Plot 3 - trivial one
ggplot(suicide_no_us,
aes(x = factor(dateyear), y = targetcountry, fill = `statistics_ killed_high`)) +
geom_tile() +
scale_fill_gradient(low = "#87CEEB", high = "#36648B") +
labs(title = "Heatmap of Deaths by Country and Year",
x = "Year",
y = "Country",
fill = "Deaths",
caption = "Source: CPOST Suicide Attack Database (2020)") +
  theme_light()
Citation
https://r-charts.com/colors/
Essay
a. Data Cleaning Process
To prepare my dataset for analysis, I first cleaned the column names with the gsub() function, replacing “#” with an underscore and removing “.”. This helped make the variable names more consistent and easier to reference in R. After that, I filtered out missing values (NAs) from the main variables needed for my analysis: dateyear, statistics_ killed_high, statistics_ wounded_high, and targetcountry.
Then, I selected only these columns because they were directly related to my research question. I also checked the data structure to confirm that the numeric and categorical variables were correctly formatted. Finally, I summarized and grouped the data by country to identify which countries had the highest average fatalities. These cleaning steps allowed me to create a clear, error-free dataset that was ready for both visualization and regression analysis.
b. Visualization Interpretation
My final visualization is a bar chart showing yearly suicide attack fatalities for the top 4 countries, excluding the United States because the U.S. is an extreme outlier. Each bar represents the number of people killed in a given year, and each color corresponds to a different country. The chart makes it easy to compare how suicide attacks vary across both time and location.
For example, Kenya shows a large spike in 1998, which aligns with the historical U.S. Embassy bombing in Nairobi that year. Argentina, France, and Uganda also show individual years with notable death counts, suggesting that these countries experienced fewer but highly impactful suicide attacks. The visualization highlights how these incidents are often concentrated in specific years rather than being evenly distributed over time.
Citation
https://www.fbi.gov/history/famous-cases/east-african-embassy-bombings
c. Challenges and Improvements
One challenge I faced was that many of the columns contained a lot of zeros, which likely represented attacks with no recorded casualties in that category. This made it harder to find strong patterns since most data points were zero. Another issue I ran into was that my first few visualizations were too cluttered. Since there were so many countries, the plots were messy and difficult to read. I solved that by focusing only on the top countries with the highest fatalities.
I also wanted to create a visualization comparing killed vs wounded, but it didn’t work out as I planned. I tried a few different chart types, including scatter plots and box plots, but they either looked confusing or didn’t show clear relationships. I also wanted to make a box plot of the top 10 countries’ deaths by year, but it didn’t display properly in R. If I had more time, I would explore improving those visualizations and use tools like Plotly to make the graphs more interactive and insightful. Despite these challenges, my final visualization successfully illustrates the main patterns and answers my research question.