This dataset shows political violence events and fatalities in Palestine by region, district, month, and year.
The main variables I use in this project are `admin1`, `admin2`, `month`, `year`, `events`, and `fatalities`. The file as loaded has 10 columns: 7 character (categorical) and 3 numeric (quantitative).
I want to explore whether places with more events also tend to have more fatalities, and to estimate how fatalities increase as the number of violent events rises.
Source
# Load libraries for plotting, cleaning and regression
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(ggfortify)
# Read the dataset into R
palestine <- read_csv("palestine_events.csv")
Rows: 1968 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Country, Admin1, Admin2, ISO3, Admin2 Pcode, Admin1 Pcode, Month
dbl (3): Year, Events, Fatalities
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first 6 rows using head()
head(palestine)
# A tibble: 6 × 10
Country Admin1 Admin2 ISO3 `Admin2 Pcode` `Admin1 Pcode` Month Year Events
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Palestine West … Al Qu… PSE PS0140 PS01 Janu… 2016 16
2 Palestine West … Tulka… PSE PS0110 PS01 Janu… 2016 3
3 Palestine West … Hebron PSE PS0150 PS01 Janu… 2016 14
4 Palestine West … Jenin PSE PS0101 PS01 Janu… 2016 2
5 Palestine West … Qalqi… PSE PS0120 PS01 Janu… 2016 6
6 Palestine Gaza … Gaza … PSE PS0260 PS02 Janu… 2016 6
# ℹ 1 more variable: Fatalities <dbl>
# A tibble: 6 × 10
country admin1 admin2 iso3 admin2_pcode admin1_pcode month year events
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Palestine West Bank Al Qu… PSE PS0140 PS01 Janu… 2016 16
2 Palestine West Bank Tulka… PSE PS0110 PS01 Janu… 2016 3
3 Palestine West Bank Hebron PSE PS0150 PS01 Janu… 2016 14
4 Palestine West Bank Jenin PSE PS0101 PS01 Janu… 2016 2
5 Palestine West Bank Qalqi… PSE PS0120 PS01 Janu… 2016 6
6 Palestine Gaza Strip Gaza … PSE PS0260 PS02 Janu… 2016 6
# ℹ 1 more variable: fatalities <dbl>
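The second table above has lowercase, snake_case column names, which is what `clean_names()` from janitor produces. The cleaning code itself is not shown here, so the following is only a minimal sketch of what such a step could look like, assuming `tidyr::drop_na()` is used for the missing-value filter. The tibble `raw` and its values are made up for illustration and are not rows from the real file.

```r
library(tibble)
library(janitor)
library(tidyr)

# Toy data, invented for illustration only (NOT rows from palestine_events.csv)
raw <- tibble(
  Admin1 = c("Region A", "Region B", "Region A"),
  `Admin2 Pcode` = c("X01", "X02", "X03"),
  Month = c("January", "January", NA),
  Year = c(2016, 2016, 2016),
  Events = c(16, 6, 14),
  Fatalities = c(1, NA, 0)
)

cleaned <- raw |>
  clean_names() |>                                  # "Admin2 Pcode" becomes "admin2_pcode"
  drop_na(events, fatalities, admin1, month, year)  # drop rows missing any key variable

names(cleaned)
nrow(cleaned)  # rows 2 and 3 each have a missing value, so 1 row remains
```

In the real data this filter appears to remove nothing: the regression further down reports 1966 residual degrees of freedom, which with two coefficients corresponds to 1968 rows, the same number `read_csv()` reported.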
# Create scatterplot of events vs fatalities
p1 <- ggplot(palestine, aes(x = events, y = fatalities, color = admin1)) +
  labs(
    title = "Fatalities versus Events in Palestine",
    caption = "Source: ACLED",
    x = "Number of Events",
    y = "Number of Fatalities",
    color = "Region"
  ) +
  theme_minimal(base_size = 12)
p1 + geom_point()
# Add the regression line
p2 <- p1 + geom_point() + geom_smooth(method = "lm", se = FALSE)
p2
`geom_smooth()` using formula = 'y ~ x'
# Final plot with better colors and theme
p3 <- ggplot(palestine, aes(x = events, y = fatalities, color = admin1)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Political Violence Events versus Fatalities in Palestine",
    caption = "Source: ACLED",
    x = "Number of Events",
    y = "Number of Fatalities",
    color = "Region"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_bw()
p3
# Fit linear regression model
fit1 <- lm(fatalities ~ events, data = palestine)
summary(fit1)
Call:
lm(formula = fatalities ~ events, data = palestine)
Residuals:
Min 1Q Median 3Q Max
-483.50 -38.86 21.06 44.38 3140.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -52.14854 3.64088 -14.32 <2e-16 ***
events 2.59075 0.05643 45.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 137 on 1966 degrees of freedom
Multiple R-squared: 0.5174, Adjusted R-squared: 0.5172
F-statistic: 2108 on 1 and 1966 DF, p-value: < 2.2e-16
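To make the fitted line concrete, here is a small sketch that plugs the coefficients printed above into the equation fatalities ≈ intercept + slope × events. The helper name `predict_fatalities` and the example event counts are my own choices for illustration and are not part of the original analysis.

```r
# Coefficients copied from the summary(fit1) output above
intercept <- -52.14854
slope <- 2.59075
slope_se <- 0.05643

# Hypothetical helper: predicted fatalities for a given number of events
predict_fatalities <- function(events) intercept + slope * events

# Example event counts, chosen only for illustration
predict_fatalities(c(0, 20, 100))

# Event count at which the fitted line crosses zero fatalities
-intercept / slope

# The reported t value is the estimate divided by its standard error
slope / slope_se

# With a single predictor, R-squared is the squared correlation,
# so its square root gives the size of the correlation
sqrt(0.5174)
```

Because the intercept is negative, the line predicts negative fatalities when there are fewer than about 20 events, which cannot happen in practice. That suggests reading the straight line as a rough summary of the trend rather than an exact model at low event counts.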
# Show regression diagnostic plots
autoplot(fit1)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
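The warnings above come from code inside ggfortify rather than from my own calls. As a possible alternative, base R's `plot()` method for `lm` objects draws the standard four diagnostic panels without any extra package. The sketch below uses simulated toy data so that it runs on its own; `toy` and `fit_toy` are names I made up, and with the real model `fit1` would take the place of `fit_toy`.

```r
# Simulated toy data, for illustration only (NOT the ACLED data)
set.seed(1)
toy <- data.frame(events = 1:30)
toy$fatalities <- 2 * toy$events + rnorm(30, sd = 5)

# Same model form as fit1, fitted to the toy data
fit_toy <- lm(fatalities ~ events, data = toy)

# Arrange the four default diagnostic panels in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit_toy)
par(op)  # restore the previous plotting layout
```

This avoids the deprecation messages because no ggplot2 code is involved, at the cost of losing the ggplot2 styling.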
Summary
I cleaned the data by first loading the dataset with `read_csv()`. Then I used `clean_names()` to make all column names lowercase and easier to use in R. After that, I removed rows that had missing values in important variables such as `events`, `fatalities`, region (`admin1`), `month`, and `year`. This made the dataset cleaner and ready for analysis.

For the visualization, each point represents an observation in the dataset, and the colors represent different regions. I also added linear regression lines, one per region, to show the general trend in the data.

The regression results show a positive relationship between events and fatalities. The slope is about 2.59, which means that each additional event is associated with about 2.59 more fatalities on average. The p-value is extremely small (below 2.2e-16), which indicates that this relationship is statistically significant, and the adjusted R-squared of about 0.517 shows that around 51.7% of the variation in fatalities is explained by the number of events.

I wish I could include how many people die in one family after each attack. I think this detail would open many eyes because it shows the severity of the situation and the level of impact that each event can have.