Palestine and events

Author

Sarah Abdela

Palestine

This dataset shows political violence events and fatalities in Palestine by region, district, month, and year.
The main variables I use in this project are `admin1`, `admin2`, `month`, `year`, `events`, and `fatalities`. The full dataset has 10 columns: 7 categorical (character) and 3 quantitative (numeric: `year`, `events`, and `fatalities`).
I want to explore whether places with more events also tend to have more fatalities, and how the number of fatalities changes as the number of violent events increases.
Source

https://data.humdata.org/dataset/political-violence-events-and-fatalities

# Load libraries for plotting, cleaning and regression
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(ggfortify)
# Read the dataset into R
palestine <- read_csv("palestine_events.csv")
Rows: 1968 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Country, Admin1, Admin2, ISO3, Admin2 Pcode, Admin1 Pcode, Month
dbl (3): Year, Events, Fatalities

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Viewing the first 6 rows using head()
head(palestine)
# A tibble: 6 × 10
  Country   Admin1 Admin2 ISO3  `Admin2 Pcode` `Admin1 Pcode` Month  Year Events
  <chr>     <chr>  <chr>  <chr> <chr>          <chr>          <chr> <dbl>  <dbl>
1 Palestine West … Al Qu… PSE   PS0140         PS01           Janu…  2016     16
2 Palestine West … Tulka… PSE   PS0110         PS01           Janu…  2016      3
3 Palestine West … Hebron PSE   PS0150         PS01           Janu…  2016     14
4 Palestine West … Jenin  PSE   PS0101         PS01           Janu…  2016      2
5 Palestine West … Qalqi… PSE   PS0120         PS01           Janu…  2016      6
6 Palestine Gaza … Gaza … PSE   PS0260         PS02           Janu…  2016      6
# ℹ 1 more variable: Fatalities <dbl>
# Clean the column names and drop rows with missing values in the key variables
palestine <- palestine %>%
  clean_names() %>%
  drop_na(events, fatalities, admin1, month, year)
# View again
head(palestine)
# A tibble: 6 × 10
  country   admin1     admin2 iso3  admin2_pcode admin1_pcode month  year events
  <chr>     <chr>      <chr>  <chr> <chr>        <chr>        <chr> <dbl>  <dbl>
1 Palestine West Bank  Al Qu… PSE   PS0140       PS01         Janu…  2016     16
2 Palestine West Bank  Tulka… PSE   PS0110       PS01         Janu…  2016      3
3 Palestine West Bank  Hebron PSE   PS0150       PS01         Janu…  2016     14
4 Palestine West Bank  Jenin  PSE   PS0101       PS01         Janu…  2016      2
5 Palestine West Bank  Qalqi… PSE   PS0120       PS01         Janu…  2016      6
6 Palestine Gaza Strip Gaza … PSE   PS0260       PS02         Janu…  2016      6
# ℹ 1 more variable: fatalities <dbl>
# Create scatterplot of events vs fatalities
p1 <- ggplot(palestine, aes(x = events, y = fatalities, color = admin1)) +
  labs(
    title = "Fatalities versus Events in Palestine",
    caption = "Source: ACLED",
    x = "Number of Events",
    y = "Number of Fatalities",
    color = "Region"
  ) +
  theme_minimal(base_size = 12)

p1 + geom_point()

# Add regression lines (color = admin1 means one line is fitted per region)
p2 <- p1 +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p2
`geom_smooth()` using formula = 'y ~ x'

# Final plot with better colors and theme
p3 <- ggplot(palestine, aes(x = events, y = fatalities, color = admin1)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Political Violence Events versus Fatalities in Palestine",
    caption = "Source: ACLED",
    x = "Number of Events",
    y = "Number of Fatalities",
    color = "Region"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_bw()

p3

# Calculate correlation
cor(palestine$events, palestine$fatalities)
[1] 0.7193295
# Fit linear regression model
fit1 <- lm(fatalities ~ events, data = palestine)

summary(fit1)

Call:
lm(formula = fatalities ~ events, data = palestine)

Residuals:
    Min      1Q  Median      3Q     Max 
-483.50  -38.86   21.06   44.38 3140.84 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -52.14854    3.64088  -14.32   <2e-16 ***
events        2.59075    0.05643   45.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 137 on 1966 degrees of freedom
Multiple R-squared:  0.5174,    Adjusted R-squared:  0.5172 
F-statistic:  2108 on 1 and 1966 DF,  p-value: < 2.2e-16
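
As a worked illustration of what these coefficients mean, the fitted line and two quick checks can be written out using rounded values from the output above. The value of 100 events is chosen only for illustration and is not a row from the dataset.

```latex
\begin{align*}
\widehat{\text{fatalities}} &= -52.149 + 2.5908 \times \text{events} \\
\text{events} = 100:\quad \widehat{\text{fatalities}} &\approx -52.149 + 259.08 \approx 206.9 \\
\widehat{\text{fatalities}} = 0:\quad \text{events} &= \frac{52.149}{2.5908} \approx 20.1 \\
r^2 &= 0.7193^2 \approx 0.5174 = R^2
\end{align*}
```

Below roughly 20 events the line predicts a negative number of fatalities, which is impossible, so the straight-line fit is best read as a rough overall trend rather than a precise prediction for low event counts. The last line confirms that, in a regression with a single predictor, the squared correlation equals the multiple R-squared.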
# Show regression diagnostic plots
autoplot(fit1)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Summary

I cleaned the data by first loading the dataset with read_csv(). Then I used clean_names() to convert the column names to lowercase snake_case, which makes them easier to use in R. After that, I used drop_na() to remove any rows with missing values in the important variables: events, fatalities, region (admin1), month, and year. This made the dataset cleaner and ready for analysis.

For the visualization, each point represents one observation in the dataset (one district in one month), and the colors represent the regions. I also added linear regression lines with geom_smooth() to show the general trend. Because color is mapped to region, a separate line is fitted for each region.

The regression results show a positive relationship between events and fatalities. The correlation is about 0.72, and the slope is about 2.59, which means that each additional event is associated with about 2.59 more fatalities on average. The p-value is extremely small (less than 2.2e-16), which indicates that this relationship is statistically significant, and the adjusted R-squared of about 0.517 shows that around 51.7% of the variation in fatalities is explained by the number of events. The residuals are widely spread, however (from about -484 to about 3,141), so a few observations have far more fatalities than the line predicts.

I wish I could include how many people in a single family are killed in each attack. I think this detail would open many eyes, because it shows the severity of the situation and the impact that each event can have.