Using ML to Identify Risk Factors and Predict Fatal Outcomes in Protests

Protests have played a major role throughout history in driving social movement and pursuing causes. Though not common, human casualties sometimes are unfortunate consequence to these events.

This document attempts to predict government’s response to mass mobilization protests- of which that may lead to lethal outcomes. “Lethal Outcome” hereafter is defined as “Shooting” or “Killing”.

Can we forecast the lethality of the protest and provide insight on how we can mitigate the danger?

Libraries

Show code

library(tidyverse)
library(tidymodels)
library(DT)
library(embed)
library(themis)
library(lubridate)
library(countrycode)
library(corrplot)
library(skimr)
library(GGally)
library(kaggler)
library(janitor)
library(wbstats)
library(vip)

Data Import

Four distinct datasets will be explored in this document, where first two will be used in the final model .

df_protest: Mass Mobillization Data of 166 countries, 1990-2020, tracking protests against governments. (Clark and Regan 2016)
df_hap: World Happiness Data of 164 countries, 2015-2022. (Helliwell, n.d.)
df_gdp: GDP Data of 217 countries, 1960-2021. (“GDP (1960-2021)],” n.d.)
df_cpi: Corruption Perception Index Data of 189 countries, 1998-2021. (“Corruption Perception Index (1998-2021)],” n.d.)

Show code

# Mass Mobilization Protest Data
df_protest <-
  read_csv("data/Mass Mobilization Protests/df_massmobil.csv")

# World Happiness Index Data
source("functions.R")

get_kaggle_data <- function() {
  # Authenticate to Kaggle API
  # This is Hidden in Github. Please Generate your own for reproducibility.
  kaggler::kgl_auth(creds_file = "kaggle.json")
  
  # Define starting point of download
  kaggle_start_date <- 2015
  
  # Create blank vector to store output from below logic
  kaggle_years <- c()
  
  # Get distinct years since 2015 up to current year
  while (kaggle_start_date <= Sys.Date() %>% year()) {
    kaggle_years <- append(kaggle_start_date, kaggle_years)
    kaggle_start_date <- kaggle_start_date + 1
  }
  
  # Get String values to use in download logic
  kaggle_years_csv <- kaggle_years %>% paste0(".csv")
  
  # Clear df_hp object in case it already exists
  if (exists("df_hp")) {
    rm(df_hp)
  }
  
  # Download data from Kaggle and store & append to df_hp object
  for (i in kaggle_years_csv) {
    # Download
    df_temp <- kgl_dataset(ref = "mathurinache/world-happiness-report",
                           file_name = i)
    
    # Data Cleansing
    df_temp <-
      lapply(df_temp, function(x)
        sub(",", ".", x)) %>%
      data.frame() %>%
      clean_names() %>%
      select(matches("country|score")) %>%
      rename_with(~ paste0("country"), matches("country")) %>%
      mutate(year = sub(".csv", "", i) %>% as.numeric())
    
    # Assign df_temp to df_hp if first loop, append if loop > 1
    ifelse(exists("df_hp"),
           df_hp <- df_hp %>% bind_rows(df_temp),
           df_hp <- df_temp)
  }
  
  # Additional Cleansing
  df_hp <- df_hp %>%
    
    # Concat different happiness score namings to one
    select(country, year, happiness_score, ladder_score, score) %>%
    mutate(
      final_score = case_when(
        !is.na(happiness_score) ~ happiness_score,!is.na(ladder_score) ~ ladder_score,!is.na(score) ~ score,
        TRUE ~ NA_character_
      )
    ) %>%
    mutate(final_score = as.numeric(final_score)) %>%
    
    # Get iso3 code from country names
    mutate(iso_3 = countrycode(
      country,
      "country.name",
      "iso3c",
      custom_match = c(
        "Kosovo" = "XXK",
        "Somaliland region" = "SOM",
        "Somaliland Region" = "SOM"
      )
    )) %>%
    select(year, iso_3, happiness_score = final_score)
  
}

df_hap <- get_kaggle_data()


# GDP Data (Not used in final model)
df_gdp <-
  wb_data("NY.GDP.MKTP.CD") %>% select(iso_3 = iso3c, year = date, gdp = NY.GDP.MKTP.CD)

# Corruption Perception Index Data (Not used in final model)
df_cpi <- read_csv("data/Mass Mobilization Protests/df_cpi.csv")

Data Overview

Summary

Show code

df_protest %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "lrtip"))

Sample Data

Show code

df_protest %>% select(-sources, -notes) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti"))

Summary

Show code

df_hap %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "rti"))

Sample Data

Show code

df_hap %>% drop_na(happiness_score) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti"))

Summary

Show code

df_cpi %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "rti"))

Sample Data

Show code

df_cpi %>% drop_na(cpi_score) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti"))

Summary

Show code

df_gdp %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "rti"))

Sample Data

Show code

df_gdp %>% drop_na(gdp) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti"))

Data Transformation

Here we’ll apply the first layer of data cleansing to edit dirty data, create target label, and remove what we don’t need. Additional transformation will be done in the Feature Engineering section down below.

Show code

df_cleansed <-
  df_protest %>%
  
  # Remove records (years) that did not have any protests
  filter(protest != 0) %>%
  
  # I'm particularly interested in [Participants] since it may indicate how major the protest is; it seems like the feature is in mixed numerical/character format.
  # Manually reformat [Participants] to numerical values.
  # Use the lower bound when the value is a range.
  mutate(participants = tolower(participants)) %>%
  mutate(
    participants = case_when(
      grepl("police put the turnout at 1,700", participants) ~ "1700",
      grepl("tens of", participants) ~ "",
      grepl("several dozen", participants) ~ "12",
      grepl("several hundred", participants) ~ "100",
      grepl("several thousand", participants) ~ "1000",
      grepl("thousand", participants) ~ "1000",
      grepl("two million", participants) ~ "2000000",
      grepl("million", participants) ~ "1000000",
      grepl("6000 to 8000", participants) ~ "6000",
      grepl("3000 to 5000", participants) ~ "3000",
      grepl("2000 to 3,000", participants) ~ "2000",
      grepl("2000 to 3000", participants) ~ "2000",
      grepl("2,000 to 3,000", participants) ~ "2000",
      grepl("2,000 to 4,000 people", participants) ~ "2000",
      grepl("about 100 activists", participants) ~ "100",
      grepl("between 11000 and 45000", participants) ~ "11000",
      grepl("between 35,000 and 50,000", participants) ~ "35000",
      grepl("Dennis Kwok, said more than 2,000", participants) ~ "2000",
      grepl("200 kamnans and village heads", participants) ~ "200",
      TRUE ~ participants
    )
  ) %>%
  mutate(participants = gsub("[-&;].*", "", participants)) %>%
  mutate(participants = gsub("[^0-9]", "", participants)) %>%
  mutate(participants = as.numeric(participants)) %>%
  
  # Government responses are dispersed into multiple features.
  # Concatenate all response columns into one.
  unite(
    col = 'concat_response',
    contains("stateresponse"),
    sep = "_",
    remove = FALSE
  ) %>%
  
  # Although I later determine to focus on the lethality of the response, here I try to explore the degree of danger level as a target label.
  # Identify danger level of government response toward protesters.
  mutate(danger_level = case_when(
    grepl("killing|shooting", concat_response) ~ 2,
    grepl("arrest|beating|crowd dispersal", concat_response) ~ 1,
    grepl("ignore|accomodation", concat_response) ~ 0,
    TRUE ~ 0
  ) %>% as.factor()) %>%
  
  # Create target label to be used in model.
  # Identify whether government response was lethal to protesters.
  mutate(lethal = case_when(
    grepl("killing|shooting", concat_response) ~ "Lethal",
    TRUE ~ "Non-Lethal"
  ) %>% factor(levels = c("Lethal", "Non-Lethal"))) %>%
  
  # Remove original government response cols since I don't need it anymore.
  select(-contains("response")) %>%
  
  # Manually One hot encode protester demand cols due to the way it's structured.
  mutate(n = 1) %>%
  pivot_longer(cols = contains("demand")) %>%
  mutate(value =
           case_when(is.na(value) ~ "None",
                     value == "." ~ "None",
                     TRUE ~ value)) %>%
  select(-name) %>%
  distinct() %>%
  pivot_wider(names_from = value,
              values_from = n,
              values_fill = 0) %>%
  select(-None) %>%
  
  # Concat Year Month Day cols together.
  mutate(startdate = as.Date(paste(startyear, startmonth, startday, sep =
                                     "-"), "%Y-%m-%d")) %>%
  mutate(enddate = as.Date(paste(endyear, endmonth, endday, sep = "-"), "%Y-%m-%d")) %>%
  
  # Calculate duration of protests (in days).
  mutate(days_in_protest = as.numeric(enddate - startdate)) %>%
  
  # Standardize col names.
  janitor::clean_names()

Let’s join all datasets together using iso_3 country code and year as mapping keys.

Show code

df_cleansed_joined <-
  df_cleansed %>%
  
  # Create iso_3 with "countrycode" package, which I'll use as a joining key.
  mutate(iso_3 = countrycode(
    country,
    "country.name",
    "iso3c",
    custom_match = c(
      "Czechoslovakia" = "CZE",
      "Germany East" = "DEU",
      "Kosovo" = "XXK",
      "Serbia and Montenegro" = "SRB",
      "Yugoslavia" = "YUG"
    )
  )) %>%
  
  # Join Happiness Score.
  left_join(df_hap, by = c("iso_3", "year")) %>%
  
  # Join GDP.
  left_join(df_gdp, by = c("iso_3", "year")) %>%
  
  # Join Corruption Perception Index Score.
  left_join(
    df_cpi %>%
      mutate(cpi_score = case_when(Year <= 2009 ~ cpi_score * 10,
                                   TRUE ~ cpi_score)) %>%
      filter(!is.na(cpi_score)),
    by = c("iso_3", "year" = "Year")
  )

EDA

Show code

df_eda <-
  df_cleansed_joined %>%
  select(
    lethal,
    year,
    protesterviolence,
    participants,
    days_in_protest,
    cpi_score,
    gdp,
    happiness_score
  )

df_eda %>%
  ggpairs(
    mapping =
      aes(
        alpha = 0.5,
        color =
          factor(
            lethal,
            levels = c("Lethal", "Non-Lethal"),
            labels = c("Lethal", "Non-Lethal")
          )
      ),
    legend = 1,
    upper = list(continuous = wrap("cor", alpha = 1, size = 2)),
    title = "EDA"
  ) +
  theme(
    strip.text.x = element_text(size = 7),
    strip.text.y = element_text(size = 7),
    axis.text = element_text(size = 6),
    legend.position = "bottom"
  ) +
  labs(fill = "Lethality")

There are a lot of information here. To point out few key insights:

Imbalanced, meaning there are unequal distribution of records between lethal vs. non-lethal records.
Signs of distinction between lethal vs non-lethal records observed in [protesterviolence], [cpi_score], and [happiness_score].

Correlation Analysis

Let’s check correlation between features to understand which features might be repetitive.

Show code

# Create correlation matrix & plot.
df_col <- cor(
  df_cleansed_joined %>%
    select_if(is.numeric) %>%
    select(-protest,-id,-ccode),
  method = "spearman",
  use = "complete.obs"
)
corrplot(df_col,
         "square",
         addrect = 5,
         order = 'hclust',
         diag = FALSE)

I’m particularly interested in [happiness score], [gdp], and [cpi_score] here. I’ll run some significance tests to validate correlation between these features.

Show code

# Quick Significance Test on correlated variables.
cor.test(df_cleansed_joined$gdp, df_cleansed_joined$happiness_score)


    Pearson's product-moment correlation

data:  df_cleansed_joined$gdp and df_cleansed_joined$happiness_score
t = 13.97, df = 3444, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1997375 0.2629401
sample estimates:
      cor 
0.2315832

Show code

cor.test(df_cleansed_joined$gdp, df_cleansed_joined$cpi_score)


    Pearson's product-moment correlation

data:  df_cleansed_joined$gdp and df_cleansed_joined$cpi_score
t = 33.838, df = 11024, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2897412 0.3235613
sample estimates:
      cor 
0.3067481

Show code

cor.test(df_cleansed_joined$happiness_score, df_cleansed_joined$cpi_score)


    Pearson's product-moment correlation

data:  df_cleansed_joined$happiness_score and df_cleansed_joined$cpi_score
t = 67.3, df = 3517, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7354577 0.7643606
sample estimates:
      cor 
0.7502674

Feature Selection

Test tells us that they’re significantly correlated. We’ll remove [cpi_score] and [gdp]. I’m choosing [happiness_score] over others based on our EDA (and comparing how all 3 impacts the final model). I’ll also remove the duplicate date features.

Show code

df_final <-
  df_cleansed_joined %>%
  select(
    lethal,
    id,
    year,
    startmonth,
    iso_3,
    protestnumber,
    protesterviolence,
    participants,
    days_in_protest,
    happiness_score,
    political_behavior_process,
    labor_wage_dispute,
    land_farm_issue,
    police_brutality,
    price_increases_tax_policy,
    social_restrictions,
    removal_of_politician
  ) %>%
  mutate_if(is.character, factor)

Train Test Split

Split the data to 75% Train and 25% Test, and create 25 partitions of K-Folds.

Show code

set.seed(49)
pt_split <- initial_split(df_final, strata = lethal)

pt_train <- training(pt_split)
pt_test <- testing(pt_split)

df_folds <- vfold_cv(pt_train, v = 25, strata = lethal)

Feature Engineering

We’ll use recipes package to create a recipe of feature engineering pipeline.

Show code

# DT seems to be overriding already existing functions. Detach in this code block.
if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded")

pt_rec <- recipe(lethal ~ ., data = pt_train) %>%
  update_role(id, new_role = "id") %>%

  # Impute & Bin Missing [Participants].
  step_impute_knn(participants) %>%
  step_discretize_xgb(participants, outcome = "lethal") %>%
  
  # Impute [Happiness Score].
  step_impute_knn(happiness_score) %>%
   
  # Convert [Participants] to ordinal factor.
  step_mutate(participants = factor(participants, levels = sort(unique(participants)), ordered = TRUE)) %>%

  # Convert [Participants] to numeric.
  step_ordinalscore(c("participants")) %>%
  
  # One Hot encode all nominal features.
  step_dummy(all_nominal(), -all_outcomes()) %>%
  
  # Remove features with no variability for safe measure.
  step_zv(all_numeric()) %>%
  
  # Normalize values of all predictors.
  step_normalize(all_predictors()) %>%
  
  # Downsample
  step_downsample(lethal) %>%
  
  prep()

Model Building

Predicting lethality of a protest is a classification problem. There are many algorithms to use, and we’ll try to use couple of them and pick the one with best outcome in the end. We’ll cover Logistic Regression, KNN, Random Forest, and XGBoost algorithms. We’ll also use Recall as the tuning metric, since we care more about capturing as many lethal scenarios as possible.

Here we create a specification for Logistic Regression Model, then use tune_grid to collect performance metrics of different hyper parameters.

Show code

lr_spec <-
  logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

lr_workflow <-
  workflow() %>%
  add_recipe(pt_rec) %>%
  add_model(lr_spec)

# Use the model if it already exists- if not, create the model.
lr_path <- file.path("data/Mass Mobilization Protests/lr_tune.rds")

if (file.exists(lr_path)) {
  lr_tune <- read_rds(file = lr_path)
} else {
  lr_tune <-
    tune_grid(
      lr_workflow,
      resamples = df_folds,
      metrics = metric_set(roc_auc, accuracy, recall),
      grid = 11
    )
  saveRDS(lr_tune, file = lr_path)
}

What are the best parameters according to our metrics collected from different folds?

Show code

# Reload DT
library(DT)

cols <- c("pendalty", "mixture", "mean", "std_err")

# Show best parameters
# show_best(lr_tune, metric = "recall") %>%
#   mutate(across(cols, round, 4)) %>%
#   datatable(options = list(scrollX = TRUE, dom = "rti"))

# Show all parameters from folds.
plt_lr_penalty <-
  lr_tune %>%
  collect_metrics() %>%
  filter(.metric == "recall") %>%
  select(mean, penalty) %>%
  ggplot(aes(penalty, mean)) +
  geom_point(show.legend = FALSE)

plt_lr_mixture <-
  lr_tune %>%
  collect_metrics() %>%
  filter(.metric == "recall") %>%
  select(mean, mixture) %>%
  ggplot(aes(mixture, mean)) +
  geom_point(show.legend = FALSE)

grid.arrange(plt_lr_penalty, plt_lr_mixture)

Lastly, fit the model

Show code

# Create a workflow to use a model with best hyper parameters based on recall.
final_lr_wf <- lr_workflow %>%
  finalize_workflow(select_best(lr_tune, "recall"))

# Fit the model.
protest_fit_lr <-
  last_fit(final_lr_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall))

We’ll follow the same steps below to create KNN, Random Forest, and XGBoost .

Create specification for KKNN Model and collect performance metrics from folds.

Show code

# DT seems to be overriding already existing functions. Detach in this code block.
if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded")

kknn_spec <-
  nearest_neighbor(neighbors = tune(), weight_func = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")

kknn_workflow <-
  workflow() %>%
  add_recipe(pt_rec) %>%
  add_model(kknn_spec)

# Use the model if it already exists- if not, create the model.
kknn_path <-
  file.path("data/Mass Mobilization Protests/kknn_tune.rds")

if (file.exists(kknn_path)) {
  kknn_tune <- read_rds(file = kknn_path)
} else {
  kknn_tune <-
    tune_grid(
      kknn_workflow,
      resamples = df_folds,
      metrics = metric_set(roc_auc, accuracy, recall),
      grid = 11
    )
  saveRDS(kknn_tune, file = kknn_path)
}

What are the best parameters according to our metrics collected from different folds?

Show code

# Reload DT
library(DT)

cols <- c("mean", "std_err")

# Show all parameters from folds.
plt_kknn_neighbors <-
  kknn_tune %>%
  collect_metrics() %>%
  filter(.metric == "recall") %>%
  select(mean, neighbors) %>%
  ggplot(aes(neighbors, mean)) +
  geom_point(show.legend = FALSE)

plt_kknn_weight_func <-
  kknn_tune %>%
  collect_metrics() %>%
  filter(.metric == "recall") %>%
  select(mean, weight_func) %>%
  ggplot(aes(weight_func, mean)) +
  geom_point(show.legend = FALSE)

grid.arrange(plt_kknn_neighbors, plt_kknn_weight_func)

Fit the model.

Show code

# Create a workflow to use a model with best hyper parameters based on recall.
final_kknn_wf <- kknn_workflow %>%
  finalize_workflow(select_best(kknn_tune, "recall"))

# Fit the model.
protest_fit_kknn <-
  last_fit(final_kknn_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall))

Create specification for Random Forest Model and collect performance metrics from folds.

Show code

# DT seems to be overriding already existing functions. Detach in this code block.
if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded")

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>%
  set_mode("classification") %>% 
  set_engine("ranger")

ranger_workflow <- 
  workflow() %>% 
  add_recipe(pt_rec) %>% 
  add_model(ranger_spec) 

# Use the model if it already exists- if not, create the model.
ranger_path <- file.path("data/Mass Mobilization Protests/rforest_tune.rds")

if (file.exists(ranger_path)) {
  ranger_tune <- read_rds(file = ranger_path)
  } else {
    ranger_tune <-
    tune_grid(ranger_workflow, resamples = df_folds, metrics = metric_set(roc_auc, accuracy, recall), grid = 11)
    saveRDS(ranger_tune, file = ranger_path)
    }

Show best parameters from folds.

Show code

library(DT)

cols <- c("mean", "std_err")

# Show Best Parameters.
# show_best(ranger_tune, metric = "recall") %>% 
#   mutate(across(cols, round, 4)) %>%
#   datatable(options = list(scrollX = TRUE, dom = "rti"))

# Show all parameters from folds.
ranger_tune %>%
  collect_metrics() %>%
  filter(.metric == "recall") %>%
  select(mean, min_n, mtry, trees) %>%
  pivot_longer(min_n:trees,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap( ~ parameter, scales = "free_x")

Fit the model.

Show code

# Create a workflow to use a model with best hyper parameters based on recall.
final_rf_wf <- ranger_workflow %>%
  finalize_workflow(select_best(ranger_tune, "recall"))

# Fit the model.
protest_fit_rf <- last_fit(final_rf_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall))

Create specification for XGBoost Model and collect performance metrics from folds.

Show code

# DT seems to be overriding already existing functions. Detach in this code block.
if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded")

xgboost_spec <-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune(),
    sample_size = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

xgboost_workflow <-
  workflow() %>%
  add_recipe(pt_rec) %>%
  add_model(xgboost_spec)

# Use the model if it already exists- if not, create the model.
xgboost_path <-
  file.path("data/Mass Mobilization Protests/xgb_tune.rds")

if (file.exists(xgboost_path)) {
  xgboost_tune <- read_rds(file = xgboost_path)
} else {
  xgboost_tune <-
    tune_grid(
      xgboost_workflow,
      resamples = df_folds,
      metrics = metric_set(roc_auc, accuracy, recall),
      grid = 11
    )
  saveRDS(xgboost_tune, file = xgboost_path)
}

Show best parameters from folds.

Show code

library(DT)

cols <- c("learn_rate", "loss_reduction", "sample_size", "mean", "std_err")

# Show all parameters from folds.
xgboost_tune %>%
  collect_metrics() %>%
  filter(.metric == "recall") %>%
  select(mean,
         trees,
         min_n,
         tree_depth,
         learn_rate,
         loss_reduction,
         sample_size) %>%
  pivot_longer(trees:sample_size,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~ parameter, scales = "free_x")

Fit the model.

Show code

# Create a workflow to use a model with best hyper parameters based on recall.
final_xgb_wf <- xgboost_workflow %>%
  finalize_workflow(select_best(xgboost_tune, metric = "recall"))

# Fit the model.
protest_fit_xgb <-
  last_fit(final_xgb_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall))

Model Results

collect_metrics automatically uses test dataset from the split to evaluate. Let’s take a look at accuracy, roc_auc, and most importantly, recall.

Show code

result_lr <- collect_metrics(protest_fit_lr) %>% mutate(model = "Logistic Regression")
result_kknn <- collect_metrics(protest_fit_kknn) %>% mutate(model = "KKNN")
result_rf <- collect_metrics(protest_fit_rf) %>% mutate(model = "Random Forest")
result_xgb <- collect_metrics(protest_fit_xgb) %>% mutate(model = "XGBoost")

bind_rows(result_lr, result_kknn, result_rf, result_xgb) %>% 
  pivot_wider(names_from = .metric, values_from = .estimate) %>% 
  select(-.estimator, -.config) %>% 
  mutate(accuracy = percent(accuracy, accuracy = 0.1)) %>%
  mutate(roc_auc = percent(roc_auc, accuracy = 0.1)) %>%
  mutate(recall = percent(recall, accuracy = 0.1)) %>%
  datatable(options = list(scrollX = TRUE, dom = "rti"))

Confusion Matrix

Show code

# Confusion Matrix
cm <-
  collect_predictions(protest_fit_lr) %>%
  mutate(model = "logreg") %>%
  bind_rows(collect_predictions(protest_fit_kknn) %>%
              mutate(model = "kknn")) %>%
  bind_rows(collect_predictions(protest_fit_rf) %>%
              mutate(model = "rforest")) %>%
  bind_rows(collect_predictions(protest_fit_xgb) %>%
              mutate(model = "xgboost")) %>%
  group_by(model) %>%
  conf_mat(lethal, .pred_class)

lr_cm <-
  cm %>%
  filter(model == "logreg") %>%
  .$conf_mat %>%
  .[[1]] %>%
  autoplot(type = "heatmap") +
  ggtitle("Logistic Regression") +
  theme(plot.title = element_text(hjust = 0.95))

kknn_cm <-
  cm %>%
  filter(model == "kknn") %>%
  .$conf_mat %>%
  .[[1]] %>%
  autoplot(type = "heatmap") +
  ggtitle("Weighted K-Nearest Neighbor") +
  theme(plot.title = element_text(hjust = 0.95))

rf_cm <-
  cm %>%
  filter(model == "rforest") %>%
  .$conf_mat %>%
  .[[1]] %>%
  autoplot(type = "heatmap") +
  ggtitle("Random Forest") +
  theme(plot.title = element_text(hjust = 0.95))

xgboost_cm <-
  cm %>%
  filter(model == "xgboost") %>%
  .$conf_mat %>%
  .[[1]] %>%
  autoplot(type = "heatmap") +
  ggtitle("XGBoost") +
  theme(plot.title = element_text(hjust = 0.95))

grid.arrange(lr_cm, kknn_cm, rf_cm, xgboost_cm)

ROC Curve

Show code

# ROC Curve
collect_predictions(protest_fit_lr) %>%
  mutate(model = "Logistic Regression") %>%
  bind_rows(collect_predictions(protest_fit_kknn) %>%
              mutate(model = "KKNN")) %>%
  bind_rows(collect_predictions(protest_fit_rf) %>%
              mutate(model = "Random Forest")) %>%
  bind_rows(collect_predictions(protest_fit_xgb) %>%
              mutate(model = "XGBoost")) %>%
  group_by(model) %>%
  roc_curve(lethal, .pred_Lethal) %>%
  autoplot() +
  ylab("True Positive Rate \n (Sensitivity)") +
  xlab("False Positive Rate \n (1 - Specificity)")

With an unbalanced data like ours, accuracy can be a poor predictor of model performance. We’ll use the roc curve instead as a measure. Both Logistic Regression and Random Forest Model performs well for our small dataset. Our models shows signs of good generalization and likely will do well at predicting with unseen data.

KNN seems to have the highest recall. Since we care about capturing as many positive labels as possible in this case, KNN may be a good choice. However, it has lower accuracy and roc_auc compared to other models, which lowers my confidence in its ability to remain consistent in distinguishing positive and negative labels. Also, while it’s possible to indirectly understand feature importance through KNN, it’s not as intuitive as tree or regression based model. For practicality reasons, I’d select Random Forest as a model of choice for real world scenarios.

Variable Importance

What can we learn from the model? What are the most important features to predict lethality according to the Logistic Regression model?

Show code

imp_spec_logreg <- lr_spec %>%
  finalize_model(select_best(lr_tune, metric = "roc_auc")) %>%
  set_engine("glmnet")

plt_imp_logreg_nocountry <- workflow() %>%
  add_recipe(pt_rec) %>%
  add_model(imp_spec_logreg) %>%
  fit(pt_train) %>%
  extract_fit_parsnip() %>%
  vi_model() %>%
  filter(!grepl("iso", Variable)) %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  ggtitle("LR: VI Without Country Dummy Variables")

plt_imp_logreg_wcountry <- workflow() %>%
  add_recipe(pt_rec) %>%
  add_model(imp_spec_logreg) %>%
  fit(pt_train) %>%
  extract_fit_parsnip() %>%
  vi_model() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  ggtitle("LR: VI With Country Dummy Variables")

gA <- ggplotGrob(plt_imp_logreg_nocountry)
gB <- ggplotGrob(plt_imp_logreg_wcountry)
grid::grid.newpage()
grid::grid.draw(rbind(gA, gB))

What are the most important features to predict lethality according to the Random Forest model?

Show code

imp_spec_rforest <- ranger_spec %>%
  finalize_model(select_best(ranger_tune, metric = "roc_auc")) %>%
  set_engine("ranger", importance = "permutation")

plt_imp_rforest_nocountry <- workflow() %>%
  add_recipe(pt_rec) %>%
  add_model(imp_spec_rforest) %>%
  fit(pt_train) %>%
  extract_fit_parsnip() %>%
  vi_model() %>%
  filter(!grepl("iso", Variable)) %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  ggtitle("RF: VI Without Country Dummy Variables")

plt_imp_rforest_wcountry <- workflow() %>%
  add_recipe(pt_rec) %>%
  add_model(imp_spec_rforest) %>%
  fit(pt_train) %>%
  extract_fit_parsnip() %>%
  vi_model() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  ggtitle("RF: VI With Country Dummy Variables")

gA <- ggplotGrob(plt_imp_rforest_nocountry)
gB <- ggplotGrob(plt_imp_rforest_wcountry)
grid::grid.newpage()
grid::grid.draw(rbind(gA, gB))

One Hot Encoded country dummy variables make this analysis a bit messy to interpret. We can summarise that countries with consistent track record of response heavily influences the model prediction. For example, France, Ireland, Germany, and South Korea were among the top of list when it comes to predicting “Non-Lethal” cases per Logistic Regression.

Note that two models will output different feature importance, since Random Forest importance is based on expected decrease in performance when said predictor is used in a tree, and GLM importance is based on the scale of coefficients. Combining these results will provide us with the most accurate interpretation.

Conclusion

We can use few key features to intuitively guess if protest will be life-threatening.

Protester Violence
Happiness Score of a Country
Days in Protest
Country

Show code

df_final %>%
  count(lethal, protesterviolence) %>%
  mutate(
    protesterviolence =
      case_when(
        protesterviolence == 0 ~ "No Protester violence",
        TRUE ~ "Protester Violence"
      )
  ) %>%
  ggplot(aes(x = lethal, y = n, fill = lethal)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap("protesterviolence") +
  theme_bw() +
  xlab("") +
  ylab("Count")

Show code

df_final %>%
  filter(!is.na(happiness_score)) %>%
  mutate(happiness_score_bin = cut(happiness_score, breaks = 5)) %>%
  count(lethal, happiness_score_bin) %>%
  ggplot(aes(x = lethal, y = n, fill = lethal)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap("happiness_score_bin") +
  theme_bw() +
  xlab("") +
  ylab("Count") +
  theme()

Show code

df_final %>% 
  filter(lethal == "Lethal") %>%
  mutate(days_in_protest_bin = 
           case_when(days_in_protest == 0 ~ "First Day of Protest", 
                     TRUE ~ "On Second Day or After")) %>% 
  count(days_in_protest_bin) %>% 
  ggplot(aes(x = days_in_protest_bin, y = n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme_bw() +
  xlab("") +
  ylab("Count") +
  theme() +
  ggtitle("Lethal Cases")

Show code

df_final %>%
  mutate(Country = countrycode(
    iso_3,
    "iso3c",
    "country.name",
    custom_match = c("XXK" = "Kosovo",
                     "YUG" = "Yugoslavia")
  )) %>%
  count(lethal, Country) %>%
  pivot_wider(names_from = lethal, values_from = n) %>%
  mutate(Lethal = case_when(is.na(Lethal) ~ 0, TRUE ~ as.numeric(Lethal))) %>%
  mutate(`Non-Lethal` =
           case_when(is.na(`Non-Lethal`) ~ 0, TRUE ~ as.numeric(`Non-Lethal`))) %>%
  mutate(Percent_Lethal =
           percent(Lethal / (Lethal + `Non-Lethal`), accuracy = 0.1)) %>%
  datatable(filter = "top",
            options = list(scrollX = TRUE, dom = "lrtip"))

Show code

df_final %>% 
  count(lethal, removal_of_politician) %>% 
  pivot_wider(names_from = removal_of_politician, values_from = n) %>%
  mutate(`1` = `1` / sum(`1`)) %>%
  mutate(`0` = `0` / sum(`0`)) %>%
  rename(`Protest Reason: Removal of Politician` = `1`, `Other Reason` = `0`) %>%
  pivot_longer(cols=c(`Protest Reason: Removal of Politician`, `Other Reason`)) %>%
  ggplot(aes(x = lethal, y = value, fill = lethal)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap("name") +
  theme_bw() +
  xlab("") +
  ylab("Percent Total") +
  theme() +
  scale_y_continuous(labels = scales::percent)

Our data does not indicate whether a protester violence was originally initiated by protesters, or was initiated as a reaction to government’s violence. However, we can safely assume that provocation by either side may escalate the situation.

Citizens’ happiness score (and innate government track record impacting that happiness) is a great predictor of how the government would respond. Lower the score, higher the likelihood of lethal response to occur.

Most lethal events happen on the first day of the protest, and countries with consistent track record are likely to continue do what they have done in the past.

Lastly, if the protest is around removal of politician, it likely will yield higher chance of lethality.

Though some of these seem like a no-brainer now that we look at it, as my mentor used to put it, good analysis often leads to obvious results.

As a closing comment, this analysis in no way makes any statement about how the protest should be conducted, nor which method is the most effective to invoke change. Many prominent studies claim that non-violent protest is the best method for social movements to pursue causes, but the reality may be more complicated.

You can find all resources as well as saved models in my Github Repo.

Note: Rpubs blocks external redirections. Please open any links in this document in a new tab.

References

Clark, David, and Patrick Regan. 2016. “Mass Mobilization Protest Data.” Harvard Dataverse. https://doi.org/10.7910/DVN/HTTWYL.

“Corruption Perception Index (1998-2021)].” n.d. Transparency International. https://www.transparency.org/en/cpi.

“GDP (1960-2021)].” n.d. Washington, D.C., United States: World Bank Group Archives. https://data.worldbank.org/indicator/NY.GDP.MKTP.CD.

Helliwell, Layard, J. F. n.d. “World Happiness Report.” New York, United States: Sustainable Development Solutions Network; mathurinache/Kaggle. https://worldhappiness.report.

--- title-block-style: default title-block-banner: "#191970" title-block-banner-color: "#FFFFFF" title: "Using ML to Identify Risk Factors and Predict Fatal Outcomes in Protests" author: "Alex Lee" date: 2022-11-20 format: html: code-fold: show code-summary: "Show code" code-overflow: scroll code-line-numbers: true code-copy: hover code-tools: true theme: default toc: true toc-location: left knitr: opts_chunk: warning: false message: false cache: true bibliography: refs.bib link-citations: true --- Protests have played a major role throughout history in driving social movement and pursuing causes. Though not common, human casualties sometimes are unfortunate consequence to these events. This document attempts to predict government's response to mass mobilization protests- of which that may lead to lethal outcomes. "Lethal Outcome" hereafter is defined as "Shooting" or "Killing". Can we forecast the lethality of the protest and provide insight on how we can mitigate the danger? ## Libraries ```{r setup} library(tidyverse) library(tidymodels) library(DT) library(embed) library(themis) library(lubridate) library(countrycode) library(corrplot) library(skimr) library(GGally) library(kaggler) library(janitor) library(wbstats) library(vip) ``` ## Data Import Four distinct datasets will be explored in this document, where **first two** will be used in the final model . 1. **df_protest**: Mass Mobillization Data of 166 countries, 1990-2020, tracking protests against governments. [@protest] 2. **df_hap**: World Happiness Data of 164 countries, 2015-2022. [@hap] 3. **df_gdp**: GDP Data of 217 countries, 1960-2021. [@gdp] 4. **df_cpi**: Corruption Perception Index Data of 189 countries, 1998-2021. [@cpi] ```{r Data Import, results='hide'} # Mass Mobilization Protest Data df_protest <- read_csv("data/Mass Mobilization Protests/df_massmobil.csv") # World Happiness Index Data source("functions.R") get_kaggle_data <- function() { # Authenticate to Kaggle API # This is Hidden in Github. Please Generate your own for reproducibility. kaggler::kgl_auth(creds_file = "kaggle.json") # Define starting point of download kaggle_start_date <- 2015 # Create blank vector to store output from below logic kaggle_years <- c() # Get distinct years since 2015 up to current year while (kaggle_start_date <= Sys.Date() %>% year()) { kaggle_years <- append(kaggle_start_date, kaggle_years) kaggle_start_date <- kaggle_start_date + 1 } # Get String values to use in download logic kaggle_years_csv <- kaggle_years %>% paste0(".csv") # Clear df_hp object in case it already exists if (exists("df_hp")) { rm(df_hp) } # Download data from Kaggle and store & append to df_hp object for (i in kaggle_years_csv) { # Download df_temp <- kgl_dataset(ref = "mathurinache/world-happiness-report", file_name = i) # Data Cleansing df_temp <- lapply(df_temp, function(x) sub(",", ".", x)) %>% data.frame() %>% clean_names() %>% select(matches("country|score")) %>% rename_with(~ paste0("country"), matches("country")) %>% mutate(year = sub(".csv", "", i) %>% as.numeric()) # Assign df_temp to df_hp if first loop, append if loop > 1 ifelse(exists("df_hp"), df_hp <- df_hp %>% bind_rows(df_temp), df_hp <- df_temp) } # Additional Cleansing df_hp <- df_hp %>% # Concat different happiness score namings to one select(country, year, happiness_score, ladder_score, score) %>% mutate( final_score = case_when( !is.na(happiness_score) ~ happiness_score,!is.na(ladder_score) ~ ladder_score,!is.na(score) ~ score, TRUE ~ NA_character_ ) ) %>% mutate(final_score = as.numeric(final_score)) %>% # Get iso3 code from country names mutate(iso_3 = countrycode( country, "country.name", "iso3c", custom_match = c( "Kosovo" = "XXK", "Somaliland region" = "SOM", "Somaliland Region" = "SOM" ) )) %>% select(year, iso_3, happiness_score = final_score) } df_hap <- get_kaggle_data() # GDP Data (Not used in final model) df_gdp <- wb_data("NY.GDP.MKTP.CD") %>% select(iso_3 = iso3c, year = date, gdp = NY.GDP.MKTP.CD) # Corruption Perception Index Data (Not used in final model) df_cpi <- read_csv("data/Mass Mobilization Protests/df_cpi.csv") ``` ## Data Overview ::: panel-tabset ### 1. Mass Mobillization Data #### Summary ```{r Data Overview Protest 1} df_protest %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "lrtip")) ``` #### Sample Data ```{r Data Overview Protest 2} df_protest %>% select(-sources, -notes) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` ### 2. World Happiness Data #### Summary ```{r Data Overview hap 1} df_hap %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` #### Sample Data ```{r Data Overview hap 2} df_hap %>% drop_na(happiness_score) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` ### 3. Corruption Perception Index Data #### Summary ```{r Data Overview cpi 1} df_cpi %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` #### Sample Data ```{r Data Overview cpi 2} df_cpi %>% drop_na(cpi_score) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` ### 4. GDP Data #### Summary ```{r Data Overview gdp 1} df_gdp %>% skim() %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` #### Sample Data ```{r Data Overview gdp 2} df_gdp %>% drop_na(gdp) %>% head() %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` ::: ## Data Transformation ::: panel-tabset Now that we have a good overview of how the data is structured, let's restructure & join datasets together in preparation for exploratory data analysis & modeling. ### Data Cleansing Here we'll apply the first layer of data cleansing to edit dirty data, create target label, and remove what we don't need. Additional transformation will be done in the Feature Engineering section down below. ```{r Data Cleansing} df_cleansed <- df_protest %>% # Remove records (years) that did not have any protests filter(protest != 0) %>% # I'm particularly interested in [Participants] since it may indicate how major the protest is; it seems like the feature is in mixed numerical/character format. # Manually reformat [Participants] to numerical values. # Use the lower bound when the value is a range. mutate(participants = tolower(participants)) %>% mutate( participants = case_when( grepl("police put the turnout at 1,700", participants) ~ "1700", grepl("tens of", participants) ~ "", grepl("several dozen", participants) ~ "12", grepl("several hundred", participants) ~ "100", grepl("several thousand", participants) ~ "1000", grepl("thousand", participants) ~ "1000", grepl("two million", participants) ~ "2000000", grepl("million", participants) ~ "1000000", grepl("6000 to 8000", participants) ~ "6000", grepl("3000 to 5000", participants) ~ "3000", grepl("2000 to 3,000", participants) ~ "2000", grepl("2000 to 3000", participants) ~ "2000", grepl("2,000 to 3,000", participants) ~ "2000", grepl("2,000 to 4,000 people", participants) ~ "2000", grepl("about 100 activists", participants) ~ "100", grepl("between 11000 and 45000", participants) ~ "11000", grepl("between 35,000 and 50,000", participants) ~ "35000", grepl("Dennis Kwok, said more than 2,000", participants) ~ "2000", grepl("200 kamnans and village heads", participants) ~ "200", TRUE ~ participants ) ) %>% mutate(participants = gsub("[-&;].*", "", participants)) %>% mutate(participants = gsub("[^0-9]", "", participants)) %>% mutate(participants = as.numeric(participants)) %>% # Government responses are dispersed into multiple features. # Concatenate all response columns into one. unite( col = 'concat_response', contains("stateresponse"), sep = "_", remove = FALSE ) %>% # Although I later determine to focus on the lethality of the response, here I try to explore the degree of danger level as a target label. # Identify danger level of government response toward protesters. mutate(danger_level = case_when( grepl("killing|shooting", concat_response) ~ 2, grepl("arrest|beating|crowd dispersal", concat_response) ~ 1, grepl("ignore|accomodation", concat_response) ~ 0, TRUE ~ 0 ) %>% as.factor()) %>% # Create target label to be used in model. # Identify whether government response was lethal to protesters. mutate(lethal = case_when( grepl("killing|shooting", concat_response) ~ "Lethal", TRUE ~ "Non-Lethal" ) %>% factor(levels = c("Lethal", "Non-Lethal"))) %>% # Remove original government response cols since I don't need it anymore. select(-contains("response")) %>% # Manually One hot encode protester demand cols due to the way it's structured. mutate(n = 1) %>% pivot_longer(cols = contains("demand")) %>% mutate(value = case_when(is.na(value) ~ "None", value == "." ~ "None", TRUE ~ value)) %>% select(-name) %>% distinct() %>% pivot_wider(names_from = value, values_from = n, values_fill = 0) %>% select(-None) %>% # Concat Year Month Day cols together. mutate(startdate = as.Date(paste(startyear, startmonth, startday, sep = "-"), "%Y-%m-%d")) %>% mutate(enddate = as.Date(paste(endyear, endmonth, endday, sep = "-"), "%Y-%m-%d")) %>% # Calculate duration of protests (in days). mutate(days_in_protest = as.numeric(enddate - startdate)) %>% # Standardize col names. janitor::clean_names() ``` ### Data Joining Let's join all datasets together using iso_3 country code and year as mapping keys. ```{r Data Joining} df_cleansed_joined <- df_cleansed %>% # Create iso_3 with "countrycode" package, which I'll use as a joining key. mutate(iso_3 = countrycode( country, "country.name", "iso3c", custom_match = c( "Czechoslovakia" = "CZE", "Germany East" = "DEU", "Kosovo" = "XXK", "Serbia and Montenegro" = "SRB", "Yugoslavia" = "YUG" ) )) %>% # Join Happiness Score. left_join(df_hap, by = c("iso_3", "year")) %>% # Join GDP. left_join(df_gdp, by = c("iso_3", "year")) %>% # Join Corruption Perception Index Score. left_join( df_cpi %>% mutate(cpi_score = case_when(Year <= 2009 ~ cpi_score * 10, TRUE ~ cpi_score)) %>% filter(!is.na(cpi_score)), by = c("iso_3", "year" = "Year") ) ``` ::: ## EDA ```{r Exploratory Analysis} #| column: body-outset-right df_eda <- df_cleansed_joined %>% select( lethal, year, protesterviolence, participants, days_in_protest, cpi_score, gdp, happiness_score ) df_eda %>% ggpairs( mapping = aes( alpha = 0.5, color = factor( lethal, levels = c("Lethal", "Non-Lethal"), labels = c("Lethal", "Non-Lethal") ) ), legend = 1, upper = list(continuous = wrap("cor", alpha = 1, size = 2)), title = "EDA" ) + theme( strip.text.x = element_text(size = 7), strip.text.y = element_text(size = 7), axis.text = element_text(size = 6), legend.position = "bottom" ) + labs(fill = "Lethality") ``` There are a lot of information here. To point out few key insights: 1. Imbalanced, meaning there are unequal distribution of records between lethal vs. non-lethal records. 2. Signs of distinction between lethal vs non-lethal records observed in \[protesterviolence\], \[cpi_score\], and \[happiness_score\]. ## Correlation Analysis Let's check correlation between features to understand which features might be repetitive. ```{r Corr Analysis 1} # Create correlation matrix & plot. df_col <- cor( df_cleansed_joined %>% select_if(is.numeric) %>% select(-protest,-id,-ccode), method = "spearman", use = "complete.obs" ) corrplot(df_col, "square", addrect = 5, order = 'hclust', diag = FALSE) ``` I'm particularly interested in \[happiness score\], \[gdp\], and \[cpi_score\] here. I'll run some significance tests to validate correlation between these features. ```{r Corr Analysis 2} # Quick Significance Test on correlated variables. cor.test(df_cleansed_joined$gdp, df_cleansed_joined$happiness_score) cor.test(df_cleansed_joined$gdp, df_cleansed_joined$cpi_score) cor.test(df_cleansed_joined$happiness_score, df_cleansed_joined$cpi_score) ``` ## Feature Selection Test tells us that they're significantly correlated. We'll remove \[cpi_score\] and \[gdp\]. I'm choosing \[happiness_score\] over others based on our EDA (and comparing how all 3 impacts the final model). I'll also remove the duplicate date features. ```{r Feature Selection} df_final <- df_cleansed_joined %>% select( lethal, id, year, startmonth, iso_3, protestnumber, protesterviolence, participants, days_in_protest, happiness_score, political_behavior_process, labor_wage_dispute, land_farm_issue, police_brutality, price_increases_tax_policy, social_restrictions, removal_of_politician ) %>% mutate_if(is.character, factor) ``` ## Train Test Split Split the data to 75% Train and 25% Test, and create 25 partitions of K-Folds. ```{r Train Test Split} set.seed(49) pt_split <- initial_split(df_final, strata = lethal) pt_train <- training(pt_split) pt_test <- testing(pt_split) df_folds <- vfold_cv(pt_train, v = 25, strata = lethal) ``` ## Feature Engineering We'll use recipes package to create a recipe of feature engineering pipeline. ```{r Feature Engineering} # DT seems to be overriding already existing functions. Detach in this code block. if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded") pt_rec <- recipe(lethal ~ ., data = pt_train) %>% update_role(id, new_role = "id") %>% # Impute & Bin Missing [Participants]. step_impute_knn(participants) %>% step_discretize_xgb(participants, outcome = "lethal") %>% # Impute [Happiness Score]. step_impute_knn(happiness_score) %>% # Convert [Participants] to ordinal factor. step_mutate(participants = factor(participants, levels = sort(unique(participants)), ordered = TRUE)) %>% # Convert [Participants] to numeric. step_ordinalscore(c("participants")) %>% # One Hot encode all nominal features. step_dummy(all_nominal(), -all_outcomes()) %>% # Remove features with no variability for safe measure. step_zv(all_numeric()) %>% # Normalize values of all predictors. step_normalize(all_predictors()) %>% # Downsample step_downsample(lethal) %>% prep() ``` ## Model Building Predicting lethality of a protest is a classification problem. There are many algorithms to use, and we'll try to use couple of them and pick the one with best outcome in the end. We'll cover Logistic Regression, KNN, Random Forest, and XGBoost algorithms. We'll also use Recall as the tuning metric, since we care more about capturing as many lethal scenarios as possible. ::: panel-tabset ### Model: Logistic Regression Here we create a specification for Logistic Regression Model, then use tune_grid to collect performance metrics of different hyper parameters. ```{r Logistic Regression 1} lr_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>% set_mode("classification") %>% set_engine("glmnet") lr_workflow <- workflow() %>% add_recipe(pt_rec) %>% add_model(lr_spec) # Use the model if it already exists- if not, create the model. lr_path <- file.path("data/Mass Mobilization Protests/lr_tune.rds") if (file.exists(lr_path)) { lr_tune <- read_rds(file = lr_path) } else { lr_tune <- tune_grid( lr_workflow, resamples = df_folds, metrics = metric_set(roc_auc, accuracy, recall), grid = 11 ) saveRDS(lr_tune, file = lr_path) } ``` What are the best parameters according to our metrics collected from different folds? ```{r Logistic Regression 2} # Reload DT library(DT) cols <- c("pendalty", "mixture", "mean", "std_err") # Show best parameters # show_best(lr_tune, metric = "recall") %>% # mutate(across(cols, round, 4)) %>% # datatable(options = list(scrollX = TRUE, dom = "rti")) # Show all parameters from folds. plt_lr_penalty <- lr_tune %>% collect_metrics() %>% filter(.metric == "recall") %>% select(mean, penalty) %>% ggplot(aes(penalty, mean)) + geom_point(show.legend = FALSE) plt_lr_mixture <- lr_tune %>% collect_metrics() %>% filter(.metric == "recall") %>% select(mean, mixture) %>% ggplot(aes(mixture, mean)) + geom_point(show.legend = FALSE) grid.arrange(plt_lr_penalty, plt_lr_mixture) ``` Lastly, fit the model ```{r Logistic Regression 3} # Create a workflow to use a model with best hyper parameters based on recall. final_lr_wf <- lr_workflow %>% finalize_workflow(select_best(lr_tune, "recall")) # Fit the model. protest_fit_lr <- last_fit(final_lr_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall)) ``` We'll follow the same steps below to create KNN, Random Forest, and XGBoost . ### Model: Weighted K-Nearest Neighbor Create specification for KKNN Model and collect performance metrics from folds. ```{r Weighted K-Nearest Neighbor 1} # DT seems to be overriding already existing functions. Detach in this code block. if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded") kknn_spec <- nearest_neighbor(neighbors = tune(), weight_func = tune()) %>% set_mode("classification") %>% set_engine("kknn") kknn_workflow <- workflow() %>% add_recipe(pt_rec) %>% add_model(kknn_spec) # Use the model if it already exists- if not, create the model. kknn_path <- file.path("data/Mass Mobilization Protests/kknn_tune.rds") if (file.exists(kknn_path)) { kknn_tune <- read_rds(file = kknn_path) } else { kknn_tune <- tune_grid( kknn_workflow, resamples = df_folds, metrics = metric_set(roc_auc, accuracy, recall), grid = 11 ) saveRDS(kknn_tune, file = kknn_path) } ``` What are the best parameters according to our metrics collected from different folds? ```{r Weighted K-Nearest Neighbor 2} # Reload DT library(DT) cols <- c("mean", "std_err") # Show all parameters from folds. plt_kknn_neighbors <- kknn_tune %>% collect_metrics() %>% filter(.metric == "recall") %>% select(mean, neighbors) %>% ggplot(aes(neighbors, mean)) + geom_point(show.legend = FALSE) plt_kknn_weight_func <- kknn_tune %>% collect_metrics() %>% filter(.metric == "recall") %>% select(mean, weight_func) %>% ggplot(aes(weight_func, mean)) + geom_point(show.legend = FALSE) grid.arrange(plt_kknn_neighbors, plt_kknn_weight_func) ``` Fit the model. ```{r Weighted K-Nearest Neighbor Fit} # Create a workflow to use a model with best hyper parameters based on recall. final_kknn_wf <- kknn_workflow %>% finalize_workflow(select_best(kknn_tune, "recall")) # Fit the model. protest_fit_kknn <- last_fit(final_kknn_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall)) ``` ### Model: Random Forest Create specification for Random Forest Model and collect performance metrics from folds. ```{r Random Forest 1} # DT seems to be overriding already existing functions. Detach in this code block. if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded") ranger_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>% set_mode("classification") %>% set_engine("ranger") ranger_workflow <- workflow() %>% add_recipe(pt_rec) %>% add_model(ranger_spec) # Use the model if it already exists- if not, create the model. ranger_path <- file.path("data/Mass Mobilization Protests/rforest_tune.rds") if (file.exists(ranger_path)) { ranger_tune <- read_rds(file = ranger_path) } else { ranger_tune <- tune_grid(ranger_workflow, resamples = df_folds, metrics = metric_set(roc_auc, accuracy, recall), grid = 11) saveRDS(ranger_tune, file = ranger_path) } ``` Show best parameters from folds. ```{r Random Forest 2} library(DT) cols <- c("mean", "std_err") # Show Best Parameters. # show_best(ranger_tune, metric = "recall") %>% # mutate(across(cols, round, 4)) %>% # datatable(options = list(scrollX = TRUE, dom = "rti")) # Show all parameters from folds. ranger_tune %>% collect_metrics() %>% filter(.metric == "recall") %>% select(mean, min_n, mtry, trees) %>% pivot_longer(min_n:trees, values_to = "value", names_to = "parameter") %>% ggplot(aes(value, mean, color = parameter)) + geom_point(show.legend = FALSE) + facet_wrap( ~ parameter, scales = "free_x") ``` Fit the model. ```{r Random Forest 3} # Create a workflow to use a model with best hyper parameters based on recall. final_rf_wf <- ranger_workflow %>% finalize_workflow(select_best(ranger_tune, "recall")) # Fit the model. protest_fit_rf <- last_fit(final_rf_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall)) ``` ### Model: XGBoost Create specification for XGBoost Model and collect performance metrics from folds. ```{r XGBoost 1} # DT seems to be overriding already existing functions. Detach in this code block. if(any(grepl("package:DT", search()))) detach("package:DT") else message("DT not loaded") xgboost_spec <- boost_tree( trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), sample_size = tune() ) %>% set_mode("classification") %>% set_engine("xgboost") xgboost_workflow <- workflow() %>% add_recipe(pt_rec) %>% add_model(xgboost_spec) # Use the model if it already exists- if not, create the model. xgboost_path <- file.path("data/Mass Mobilization Protests/xgb_tune.rds") if (file.exists(xgboost_path)) { xgboost_tune <- read_rds(file = xgboost_path) } else { xgboost_tune <- tune_grid( xgboost_workflow, resamples = df_folds, metrics = metric_set(roc_auc, accuracy, recall), grid = 11 ) saveRDS(xgboost_tune, file = xgboost_path) } ``` Show best parameters from folds. ```{r XGBoost 2} library(DT) cols <- c("learn_rate", "loss_reduction", "sample_size", "mean", "std_err") # Show all parameters from folds. xgboost_tune %>% collect_metrics() %>% filter(.metric == "recall") %>% select(mean, trees, min_n, tree_depth, learn_rate, loss_reduction, sample_size) %>% pivot_longer(trees:sample_size, values_to = "value", names_to = "parameter") %>% ggplot(aes(value, mean, color = parameter)) + geom_point(show.legend = FALSE) + facet_wrap(~ parameter, scales = "free_x") ``` Fit the model. ```{r XGBoost 3} # Create a workflow to use a model with best hyper parameters based on recall. final_xgb_wf <- xgboost_workflow %>% finalize_workflow(select_best(xgboost_tune, metric = "recall")) # Fit the model. protest_fit_xgb <- last_fit(final_xgb_wf, pt_split, metrics = metric_set(roc_auc, accuracy, recall)) ``` ::: ## Model Results collect_metrics automatically uses test dataset from the split to evaluate. Let's take a look at accuracy, roc_auc, and most importantly, recall. ```{r Results 1} result_lr <- collect_metrics(protest_fit_lr) %>% mutate(model = "Logistic Regression") result_kknn <- collect_metrics(protest_fit_kknn) %>% mutate(model = "KKNN") result_rf <- collect_metrics(protest_fit_rf) %>% mutate(model = "Random Forest") result_xgb <- collect_metrics(protest_fit_xgb) %>% mutate(model = "XGBoost") bind_rows(result_lr, result_kknn, result_rf, result_xgb) %>% pivot_wider(names_from = .metric, values_from = .estimate) %>% select(-.estimator, -.config) %>% mutate(accuracy = percent(accuracy, accuracy = 0.1)) %>% mutate(roc_auc = percent(roc_auc, accuracy = 0.1)) %>% mutate(recall = percent(recall, accuracy = 0.1)) %>% datatable(options = list(scrollX = TRUE, dom = "rti")) ``` ### Confusion Matrix ```{r Results 2} # Confusion Matrix cm <- collect_predictions(protest_fit_lr) %>% mutate(model = "logreg") %>% bind_rows(collect_predictions(protest_fit_kknn) %>% mutate(model = "kknn")) %>% bind_rows(collect_predictions(protest_fit_rf) %>% mutate(model = "rforest")) %>% bind_rows(collect_predictions(protest_fit_xgb) %>% mutate(model = "xgboost")) %>% group_by(model) %>% conf_mat(lethal, .pred_class) lr_cm <- cm %>% filter(model == "logreg") %>% .$conf_mat %>% .[[1]] %>% autoplot(type = "heatmap") + ggtitle("Logistic Regression") + theme(plot.title = element_text(hjust = 0.95)) kknn_cm <- cm %>% filter(model == "kknn") %>% .$conf_mat %>% .[[1]] %>% autoplot(type = "heatmap") + ggtitle("Weighted K-Nearest Neighbor") + theme(plot.title = element_text(hjust = 0.95)) rf_cm <- cm %>% filter(model == "rforest") %>% .$conf_mat %>% .[[1]] %>% autoplot(type = "heatmap") + ggtitle("Random Forest") + theme(plot.title = element_text(hjust = 0.95)) xgboost_cm <- cm %>% filter(model == "xgboost") %>% .$conf_mat %>% .[[1]] %>% autoplot(type = "heatmap") + ggtitle("XGBoost") + theme(plot.title = element_text(hjust = 0.95)) grid.arrange(lr_cm, kknn_cm, rf_cm, xgboost_cm) ``` ### ROC Curve ```{r Results 3} # ROC Curve collect_predictions(protest_fit_lr) %>% mutate(model = "Logistic Regression") %>% bind_rows(collect_predictions(protest_fit_kknn) %>% mutate(model = "KKNN")) %>% bind_rows(collect_predictions(protest_fit_rf) %>% mutate(model = "Random Forest")) %>% bind_rows(collect_predictions(protest_fit_xgb) %>% mutate(model = "XGBoost")) %>% group_by(model) %>% roc_curve(lethal, .pred_Lethal) %>% autoplot() + ylab("True Positive Rate \n (Sensitivity)") + xlab("False Positive Rate \n (1 - Specificity)") ``` With an unbalanced data like ours, accuracy can be a poor predictor of model performance. We'll use the roc curve instead as a measure. Both Logistic Regression and Random Forest Model performs well for our small dataset. Our models shows signs of good generalization and likely will do well at predicting with unseen data. KNN seems to have the highest recall. Since we care about capturing as many positive labels as possible in this case, KNN may be a good choice. However, it has lower accuracy and roc_auc compared to other models, which lowers my confidence in its ability to remain consistent in distinguishing positive and negative labels. Also, while it's possible to indirectly understand feature importance through KNN, it's not as intuitive as tree or regression based model. For practicality reasons, I'd select Random Forest as a model of choice for real world scenarios. ## Variable Importance ::: panel-tabset ### Logistic Regression What can we learn from the model? What are the most important features to predict lethality according to the Logistic Regression model? ```{r Variable Importance: LR} imp_spec_logreg <- lr_spec %>% finalize_model(select_best(lr_tune, metric = "roc_auc")) %>% set_engine("glmnet") plt_imp_logreg_nocountry <- workflow() %>% add_recipe(pt_rec) %>% add_model(imp_spec_logreg) %>% fit(pt_train) %>% extract_fit_parsnip() %>% vi_model() %>% filter(!grepl("iso", Variable)) %>% vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) + ggtitle("LR: VI Without Country Dummy Variables") plt_imp_logreg_wcountry <- workflow() %>% add_recipe(pt_rec) %>% add_model(imp_spec_logreg) %>% fit(pt_train) %>% extract_fit_parsnip() %>% vi_model() %>% vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) + ggtitle("LR: VI With Country Dummy Variables") gA <- ggplotGrob(plt_imp_logreg_nocountry) gB <- ggplotGrob(plt_imp_logreg_wcountry) grid::grid.newpage() grid::grid.draw(rbind(gA, gB)) ``` ### Random Forest What are the most important features to predict lethality according to the Random Forest model? ```{r Variable Importance: RF} imp_spec_rforest <- ranger_spec %>% finalize_model(select_best(ranger_tune, metric = "roc_auc")) %>% set_engine("ranger", importance = "permutation") plt_imp_rforest_nocountry <- workflow() %>% add_recipe(pt_rec) %>% add_model(imp_spec_rforest) %>% fit(pt_train) %>% extract_fit_parsnip() %>% vi_model() %>% filter(!grepl("iso", Variable)) %>% vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) + ggtitle("RF: VI Without Country Dummy Variables") plt_imp_rforest_wcountry <- workflow() %>% add_recipe(pt_rec) %>% add_model(imp_spec_rforest) %>% fit(pt_train) %>% extract_fit_parsnip() %>% vi_model() %>% vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) + ggtitle("RF: VI With Country Dummy Variables") gA <- ggplotGrob(plt_imp_rforest_nocountry) gB <- ggplotGrob(plt_imp_rforest_wcountry) grid::grid.newpage() grid::grid.draw(rbind(gA, gB)) ``` **One Hot Encoded country dummy variables** make this analysis a bit messy to interpret. We can summarise that countries with consistent track record of response heavily influences the model prediction. For example, France, Ireland, Germany, and South Korea were among the top of list when it comes to predicting "Non-Lethal" cases per Logistic Regression. Note that two models will output different feature importance, since Random Forest importance is based on expected decrease in performance when said predictor is used in a tree, and GLM importance is based on the scale of coefficients. Combining these results will provide us with the most accurate interpretation. ::: ## Conclusion We can use few key features to intuitively guess if protest will be life-threatening. 1. Protester Violence 2. Happiness Score of a Country 3. Days in Protest 4. Country ::: panel-tabset ### Protester Violence ```{r} df_final %>% count(lethal, protesterviolence) %>% mutate( protesterviolence = case_when( protesterviolence == 0 ~ "No Protester violence", TRUE ~ "Protester Violence" ) ) %>% ggplot(aes(x = lethal, y = n, fill = lethal)) + geom_bar(stat = "identity", show.legend = FALSE) + facet_wrap("protesterviolence") + theme_bw() + xlab("") + ylab("Count") ``` ### Happiness Score ```{r} df_final %>% filter(!is.na(happiness_score)) %>% mutate(happiness_score_bin = cut(happiness_score, breaks = 5)) %>% count(lethal, happiness_score_bin) %>% ggplot(aes(x = lethal, y = n, fill = lethal)) + geom_bar(stat = "identity", show.legend = FALSE) + facet_wrap("happiness_score_bin") + theme_bw() + xlab("") + ylab("Count") + theme() ``` ### Days in Protest ```{r} df_final %>% filter(lethal == "Lethal") %>% mutate(days_in_protest_bin = case_when(days_in_protest == 0 ~ "First Day of Protest", TRUE ~ "On Second Day or After")) %>% count(days_in_protest_bin) %>% ggplot(aes(x = days_in_protest_bin, y = n)) + geom_bar(stat = "identity", show.legend = FALSE) + theme_bw() + xlab("") + ylab("Count") + theme() + ggtitle("Lethal Cases") ``` ### Country ```{r} df_final %>% mutate(Country = countrycode( iso_3, "iso3c", "country.name", custom_match = c("XXK" = "Kosovo", "YUG" = "Yugoslavia") )) %>% count(lethal, Country) %>% pivot_wider(names_from = lethal, values_from = n) %>% mutate(Lethal = case_when(is.na(Lethal) ~ 0, TRUE ~ as.numeric(Lethal))) %>% mutate(`Non-Lethal` = case_when(is.na(`Non-Lethal`) ~ 0, TRUE ~ as.numeric(`Non-Lethal`))) %>% mutate(Percent_Lethal = percent(Lethal / (Lethal + `Non-Lethal`), accuracy = 0.1)) %>% datatable(filter = "top", options = list(scrollX = TRUE, dom = "lrtip")) ``` ### Removal of Politician ```{r} df_final %>% count(lethal, removal_of_politician) %>% pivot_wider(names_from = removal_of_politician, values_from = n) %>% mutate(`1` = `1` / sum(`1`)) %>% mutate(`0` = `0` / sum(`0`)) %>% rename(`Protest Reason: Removal of Politician` = `1`, `Other Reason` = `0`) %>% pivot_longer(cols=c(`Protest Reason: Removal of Politician`, `Other Reason`)) %>% ggplot(aes(x = lethal, y = value, fill = lethal)) + geom_bar(stat = "identity", show.legend = FALSE) + facet_wrap("name") + theme_bw() + xlab("") + ylab("Percent Total") + theme() + scale_y_continuous(labels = scales::percent) ``` ::: Our data does not indicate whether a protester violence was originally initiated by protesters, or was initiated as a reaction to government's violence. However, we can safely assume that provocation by either side may escalate the situation. Citizens' happiness score (and innate government track record impacting that happiness) is a great predictor of how the government would respond. Lower the score, higher the likelihood of lethal response to occur. Most lethal events happen on the first day of the protest, and countries with consistent track record are likely to continue do what they have done in the past. Lastly, if the protest is around removal of politician, it likely will yield higher chance of lethality. Though some of these seem like a no-brainer now that we look at it, as my mentor used to put it, good analysis often leads to obvious results. As a closing comment, this analysis in no way makes any statement about how the protest should be conducted, nor which method is the most effective to invoke change. Many prominent studies claim that non-violent protest is the best method for social movements to pursue causes, but the reality may be more complicated. You can find all resources as well as saved models in my [Github Repo](https://github.com/ALEE0001/pp_protest_lethality_prediction). Note: Rpubs blocks external redirections. Please open any links in this document in a new tab.