Introduction

Conflicts occur around the world in many forms, including militant activity, clashes between groups, and confrontations between civilians and government bodies. This project explores such conflicts in the countries of Africa over the past 20 years and applies clustering to find trends.

With this dataset, it is possible to perform exploratory data analysis and clustering to see which types of conflict affect different regions and to assess the political situation of a particular region.

The dataset I have selected for my project is “ACLED African Conflicts data for a duration of 1997-2017”. The dataset is not tidy and needs to be cleaned before it can be used for analysis. Several columns contain joined data that must be separated into different columns, and there are many missing and NA values that need to be addressed.

Packages Required

The packages required for this project are mentioned below:

library(dplyr)
library(xlsx)
library(leaflet)

The functionality of each package is described below:

dplyr - Used for data manipulation

xlsx - Used to import/export data from/to the xlsx file format

leaflet - Used to create interactive web maps

Data Preparation

The data comes from Kaggle: ACLED African Conflicts, 1997-2017.

This data was originally collected under the ACLED project, an acronym for ‘Armed Conflict Location and Event Data’. The project is directed by Prof. Clionadh Raleigh (University of Sussex) and operated by Senior Research Manager Andrea Carboni (University of Sussex) for Africa and Hillary Tanoff for South and South-East Asia. Its aim is to collate data on political violence in developing countries, with a focus on Africa. The dataset was first introduced by Raleigh and co-authors in a 2010 paper in the Journal of Peace Research. ACLED data is used by many researchers studying civil wars and political violence, and news organizations such as The New York Times, The Guardian and the BBC have referenced it when studying recent conflict trends.

After examining the data, it was observed that missing values are recorded as blanks in most columns and as NA in a few. For consistency, we convert blanks to NA while reading the data. Dates are in “DD/MM/YYYY” format.

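The data is imported using the read.csv() function. In R versions before 4.0.0, read.csv() converts string columns to factors by default, so we pass stringsAsFactors = FALSE to keep them as character strings. We also pass na.strings = c("", "NA") so that both blank fields and literal NA entries are read as NA; with na.strings = "" alone, a literal NA in a text column would be kept as the string "NA".

df <- read.csv("african_conflicts.csv", stringsAsFactors = FALSE, na.strings = c("", "NA"))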
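To illustrate how the na.strings argument behaves, here is a minimal sketch that reads a tiny CSV from an inline string instead of the real file. The three rows and the name csv_text are invented for illustration, and the sketch assumes that both blank fields and literal NA entries should count as missing.

```r
# Toy CSV: one blank ACTOR2 field and one literal NA. Rows are invented for illustration.
csv_text <- "ACTOR1,ACTOR2,FATALITIES
Rioters (Algeria),Police Forces of Algeria (1999-),0
Protesters (Algeria),,0
Rioters (Algeria),NA,2"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                na.strings = c("", "NA"))

# Both the blank field and the literal NA are read as missing values.
is.na(toy$ACTOR2)

# Text columns stay character vectors instead of becoming factors.
class(toy$ACTOR1)
```

Running the same check on the real data frame is a quick way to confirm that no empty strings survived the import.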

This dataset contains a total of 28 columns before any changes are made.

names(df)
 [1] "ACTOR1"           "ACTOR1_ID"        "ACTOR2"          
 [4] "ACTOR2_ID"        "ACTOR_DYAD_ID"    "ADMIN1"          
 [7] "ADMIN2"           "ADMIN3"           "ALLY_ACTOR_1"    
[10] "ALLY_ACTOR_2"     "COUNTRY"          "EVENT_DATE"      
[13] "EVENT_ID_CNTY"    "EVENT_ID_NO_CNTY" "EVENT_TYPE"      
[16] "FATALITIES"       "GEO_PRECISION"    "GWNO"            
[19] "INTER1"           "INTER2"           "INTERACTION"     
[22] "LATITUDE"         "LOCATION"         "LONGITUDE"       
[25] "NOTES"            "SOURCE"           "TIME_PRECISION"  
[28] "YEAR"            

Of these columns, ACTOR1_ID, ACTOR2_ID, ACTOR_DYAD_ID, EVENT_ID_CNTY and EVENT_ID_NO_CNTY are surrogate identifier columns, so we remove them. Most of the values in ALLY_ACTOR_1 and ALLY_ACTOR_2 are missing, so we remove these columns as well. GEO_PRECISION, GWNO and TIME_PRECISION are not required for our analysis and are also dropped. After removing these 10 columns, we are left with 18 columns.

df1 <- df[-c(2, 4, 5, 9, 10, 13, 14, 17, 18, 27)]
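Dropping columns by position depends on the column order never changing. As an alternative sketch (not the approach used above), the same columns can be removed by name. The toy data frame and the names drop_cols and toy_clean are made up for illustration.

```r
# Toy data frame with a few of the column names listed above; values are invented.
toy <- data.frame(
  ACTOR1 = c("A", "B"),
  ACTOR1_ID = c(1, 2),
  GWNO = c(615, 615),
  YEAR = c(1997, 1998),
  stringsAsFactors = FALSE
)

# Columns to discard, written out by name.
drop_cols <- c("ACTOR1_ID", "ACTOR2_ID", "ACTOR_DYAD_ID", "ALLY_ACTOR_1",
               "ALLY_ACTOR_2", "EVENT_ID_CNTY", "EVENT_ID_NO_CNTY",
               "GEO_PRECISION", "GWNO", "TIME_PRECISION")

# Keep every column whose name is not in drop_cols.
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```

Names listed in drop_cols that are absent from the data frame are simply ignored, so the same vector works on the full dataset.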

The table below gives details about these columns:

Variable Description
ACTOR1 Name of first actor
ACTOR2 Name of second actor
ADMIN1 The largest sub-national administrative region in which the event took place
ADMIN2 The second-largest sub-national administrative region in which the event took place
ADMIN3 The third-largest sub-national administrative region in which the event took place
COUNTRY Country of conflict
EVENT_DATE Date of conflict, DD/MM/YYYY
EVENT_TYPE Type of conflict event
FATALITIES Integer value of fatalities that occurred, as reported by source
INTER1 A numeric code indicating the type of ACTOR1
INTER2 A numeric code indicating the type of ACTOR2
INTERACTION A numeric code indicating the interaction between types of ACTOR1 and ACTOR2
LATITUDE The latitude of the location
LOCATION The location where event occurred
LONGITUDE The longitude of the location
NOTES Additional notes
SOURCE Source of conflict information
YEAR Year event occurred

Checking each column for missing values shows that ACTOR2, ADMIN2, ADMIN3, LOCATION, NOTES and SOURCE have some missing data.

apply(df1, 2, function(x) any(is.na(x)))
     ACTOR1      ACTOR2      ADMIN1      ADMIN2      ADMIN3 
      FALSE        TRUE       FALSE        TRUE        TRUE 
    COUNTRY  EVENT_DATE  EVENT_TYPE  FATALITIES      INTER1 
      FALSE       FALSE       FALSE       FALSE       FALSE 
     INTER2 INTERACTION    LATITUDE    LOCATION   LONGITUDE 
      FALSE       FALSE       FALSE        TRUE       FALSE 
      NOTES      SOURCE        YEAR 
       TRUE        TRUE       FALSE 
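The check above only says whether a column has any missing value. A small sketch of how the number of missing values per column could be counted is given here; the toy data frame and the name na_counts are invented for illustration.

```r
# Toy data frame with some missing values; contents are invented for illustration.
toy <- data.frame(
  ACTOR2 = c("Civilians (Algeria)", NA, NA),
  ADMIN1 = c("Alger", "Alger", "Oran"),
  NOTES = c(NA, "text", "text"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))

# Show only the columns that actually contain missing values.
na_counts[na_counts > 0]
```

Dividing na_counts by nrow(toy) would give the share of missing values, which can help decide whether a column is worth keeping.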

For column ACTOR2, NA means that there was no second actor, so we can replace it with the string “NONE”.

df1$ACTOR2[is.na(df1$ACTOR2)] <- "NONE"
head(df1$ACTOR2)
[1] "Civilians (Algeria)"             
[2] "Police Forces of Algeria (1999-)"
[3] "NONE"                            
[4] "Police Forces of Algeria (1999-)"
[5] "Police Forces of Algeria (1999-)"
[6] "Police Forces of Algeria (1999-)"

For the columns INTER1, INTER2 and INTERACTION, numeric codes are provided in place of category names. We will replace these codes with the actual values from the codebook.

Preview of Column INTER1

df1$INTER1 <- as.character(df1$INTER1)
lut1 <- c("1" = "Government or mutinous force", "2" = "Rebel force", "3" = "Political militia", "4" = "Ethnic militia", "5" = "Rioters", "6" = "Protesters", "7" = "Civilians", "8" = "Outside/external force")
df1$INTER1 <- lut1[df1$INTER1]
head(df1$INTER1)
[1] "Government or mutinous force" "Rioters"                     
[3] "Protesters"                   "Rioters"                     
[5] "Rioters"                      "Rioters"                     

Preview of Column INTER2

df1$INTER2 <- as.character(df1$INTER2)
lut2 <- c("0" = "NONE", "1" = "Government or mutinous force", "2" = "Rebel force", "3" = "Political militia", "4" = "Ethnic militia", "5" = "Rioters", "6" = "Protesters", "7" = "Civilians", "8" = "Outside/external force")
df1$INTER2 <- lut2[df1$INTER2]
head(df1$INTER2)
[1] "Civilians"                    "Government or mutinous force"
[3] "NONE"                         "Government or mutinous force"
[5] "Government or mutinous force" "Government or mutinous force"

Preview of Column INTERACTION

df1$INTERACTION <- as.character(df1$INTERACTION)
# Lookup table mapping each INTERACTION code to its label
lut3 <- c(
  "10" = "SOLE MILITARY ACTION",
  "11" = "MILITARY VERSUS MILITARY",
  "12" = "MILITARY VERSUS REBELS",
  "13" = "MILITARY VERSUS POLITICAL MILITIA",
  "14" = "MILITARY VERSUS COMMUNAL MILITIA",
  "15" = "MILITARY VERSUS RIOTERS",
  "16" = "MILITARY VERSUS PROTESTERS",
  "17" = "MILITARY VERSUS CIVILIANS",
  "18" = "MILITARY VERSUS OTHER",
  "20" = "SOLE REBEL ACTION",
  "22" = "REBELS VERSUS REBELS",
  "23" = "REBELS VERSUS POLITICAL MILITIA",
  "24" = "REBELS VERSUS COMMUNAL MILITIA",
  "25" = "REBELS VERSUS RIOTERS",
  "26" = "REBELS VERSUS PROTESTERS",
  "27" = "REBELS VERSUS CIVILIANS",
  "28" = "REBELS VERSUS OTHERS",
  "30" = "SOLE POLITICAL MILITIA ACTION",
  "33" = "POLITICAL MILITIA VERSUS POLITICAL MILITIA",
  "34" = "POLITICAL MILITIA VERSUS COMMUNAL MILITIA",
  "35" = "POLITICAL MILITIA VERSUS RIOTERS",
  "36" = "POLITICAL MILITIA VERSUS PROTESTERS",
  "37" = "POLITICAL MILITIA VERSUS CIVILIANS",
  "38" = "POLITICAL MILITIA VERSUS OTHERS",
  "40" = "SOLE COMMUNAL MILITIA ACTION",
  "44" = "COMMUNAL MILITIA VERSUS COMMUNAL MILITIA",
  "45" = "COMMUNAL MILITIA VERSUS RIOTERS",
  "46" = "COMMUNAL MILITIA VERSUS PROTESTERS",
  "47" = "COMMUNAL MILITIA VERSUS CIVILIANS",
  "48" = "COMMUNAL MILITIA VERSUS OTHER",
  "50" = "SOLE RIOTER ACTION",
  "55" = "RIOTERS VERSUS RIOTERS",
  "56" = "RIOTERS VERSUS PROTESTERS",
  "57" = "RIOTERS VERSUS CIVILIANS",
  "58" = "RIOTERS VERSUS OTHERS",
  "60" = "SOLE PROTESTER ACTION",
  "66" = "PROTESTERS VERSUS PROTESTERS",
  "67" = "PROTESTERS VERSUS CIVILIANS",
  "68" = "PROTESTERS VERSUS OTHER",
  "70" = "SOLE CIVILIANS",
  "77" = "CIVILIANS VERSUS CIVILIANS",
  "78" = "OTHER ACTOR VERSUS CIVILIANS",
  "80" = "SOLE OTHER ACTION",
  "88" = "OTHERS VERSUS OTHERS"
)
df1$INTERACTION <- lut3[df1$INTERACTION]
head(df1$INTERACTION)
[1] "MILITARY VERSUS CIVILIANS" "MILITARY VERSUS RIOTERS"  
[3] "SOLE PROTESTER ACTION"     "MILITARY VERSUS RIOTERS"  
[5] "MILITARY VERSUS RIOTERS"   "MILITARY VERSUS RIOTERS"  
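Indexing a named vector with a code that has no entry returns NA without any warning, so it may be worth checking for unmapped codes after each translation. The sketch below uses a deliberately incomplete lookup table; the names lut, codes, labels and unmapped, and the code 9, are made up for illustration.

```r
# A deliberately incomplete lookup table (illustration only).
lut <- c("1" = "Government or mutinous force",
         "5" = "Rioters",
         "6" = "Protesters")

# Code 9 is hypothetical and has no entry in lut.
codes <- c(1, 5, 9)

# Translate codes to labels; unname() drops the names carried over from lut.
labels <- unname(lut[as.character(codes)])
labels

# Codes that could not be translated show up as NA in labels.
unmapped <- unique(codes[is.na(labels)])
unmapped
```

An empty unmapped vector indicates that every code in the column was covered by the lookup table.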


Preview of data after cleaning is given below:

head(df1)

Looking at the summary of the data using the summary() function, we see the following:

summary(df1)
Variable Min Max Mean
Fatalities 0 25000 4.42
Year 1997 2017 -

From the summary above, the number of fatalities per conflict ranges from a minimum of 0 to a maximum of 25000, with a mean of 4.42.

hist(df1$YEAR, xlab = "YEAR", main = "Histogram of conflicts per year ")

Looking at the histogram above, it is observed that the highest number of conflicts happened in the year 2016. With this information, we can further explore the areas that were affected by these conflicts and the time of the year when most conflicts happened.

boxplot(df1$FATALITIES ~ df1$YEAR)

Looking at the boxplot above, it can be seen that there are a few outliers in the dataset, with the highest being in 1997.


Proposed Exploratory Data Analysis

With this dataset, we can perform EDA using the leaflet package to map the critical areas where most conflicts happen. We can further drill down to check details for the areas of a country where most conflicts take place. We can look for trends in where conflicts happen at a particular time of the year. The EVENT_DATE column can be separated into day, month and year to find trends on a monthly or seasonal basis. We can summarize the data to find new metrics such as total fatalities per year, total fatalities per country, total fatalities per region in a country, and frequency of conflicts per country. We can use bar plots to visualize this data and boxplots to find outliers.
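Separating EVENT_DATE into its parts could look like the following sketch, which relies on the DD/MM/YYYY format noted earlier. The three sample dates and the names event_date, parsed and parts are invented for illustration.

```r
# Sample dates in DD/MM/YYYY format (invented for illustration).
event_date <- c("01/01/1997", "15/06/2016", "31/12/2017")

# Parse the strings into Date objects using the day/month/year format.
parsed <- as.Date(event_date, format = "%d/%m/%Y")

# Extract each component as an integer column.
parts <- data.frame(
  DAY = as.integer(format(parsed, "%d")),
  MONTH = as.integer(format(parsed, "%m")),
  YEAR = as.integer(format(parsed, "%Y"))
)
parts
```

Any string that does not match the format is parsed as NA, so checking for NA in the parsed dates is a simple way to spot malformed entries.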

We can create summary information as below to get insights about the top metrics.

Below are tables showing the top 5 years with the highest fatalities and the top 10 countries with the highest fatalities.

df1 %>%
  group_by(YEAR) %>%
  summarise(total_fatalities = sum(FATALITIES)) %>%
  arrange(desc(total_fatalities)) %>%
  head(n = 5)

From the table above, we can see that the year 1999 had the highest fatalities due to conflicts.

df1 %>%
  group_by(COUNTRY) %>%
  summarise(total_fatalities = sum(FATALITIES)) %>%
  arrange(desc(total_fatalities)) %>%
  head(n = 10)

Here, we can see that Angola has the highest fatalities due to conflicts.

A cluster analysis can be performed on this data to get new insights. Text analysis on the NOTES column can be performed to find the most frequently used words.
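The proposal does not name a clustering method, so the following is only one possible starting point: a minimal k-means sketch on event coordinates using the kmeans() function from base R. The six coordinate pairs, the choice of two clusters and the names toy and fit are assumptions made for illustration.

```r
# Fix the random seed so the random starting centers are reproducible.
set.seed(42)

# Six invented event locations forming two well-separated groups.
toy <- data.frame(
  LATITUDE = c(36.7, 36.8, 36.6, -8.8, -8.9, -8.7),
  LONGITUDE = c(3.0, 3.1, 2.9, 13.2, 13.3, 13.1)
)

# k-means with two clusters; nstart = 10 tries several random starts
# and keeps the best solution.
fit <- kmeans(toy[, c("LATITUDE", "LONGITUDE")], centers = 2, nstart = 10)

# Attach the cluster label to each event and count events per cluster.
toy$CLUSTER <- fit$cluster
table(toy$CLUSTER)
```

On the real data, the number of clusters would need to be chosen more carefully, and other variables such as FATALITIES could be added after scaling.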




