Data Preparation
The data comes from Kaggle: ACLED African Conflicts, 1997-2017.
The data was originally collected by the ACLED project (‘Armed Conflict Location and Event Data’). The project is directed by Prof. Clionadh Raleigh (University of Sussex) and operated by Senior Research Manager Andrea Carboni (University of Sussex) for Africa and Hillary Tanoff for South and South-East Asia. Its aim is to collate data on political violence in developing countries, with a focus on Africa. The dataset was introduced by Raleigh and co-authors in a 2010 paper in the Journal of Peace Research. ACLED data is used by many researchers studying civil wars and political violence, and has been referenced by news organizations such as The New York Times, The Guardian and the BBC to study recent conflict trends.
On inspection, missing values are recorded as blanks in some columns and as NA in others. For consistency, we convert blanks to NA while reading the data. Dates are in “DD/MM/YYYY” format.
The data is imported with the read.csv() function. In R versions before 4.0, read.csv() converts string columns to factors by default, so we pass stringsAsFactors = FALSE to keep them as character strings. The na.strings argument converts blanks to NA.
df <- read.csv("african_conflicts.csv", stringsAsFactors = FALSE, na.strings = "")
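The effect of these two arguments can be checked on a small in-memory CSV. This is only a sketch: `csv_text` and `toy` are made-up stand-ins, not the actual Kaggle file.

```r
# Hypothetical three-column CSV; the first row has a blank ACTOR2
csv_text <- "ACTOR1,ACTOR2,FATALITIES
Rioters (Algeria),,0
Protesters (Algeria),Police Forces of Algeria (1999-),2
"

# Same arguments as the real import, but reading from a string
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "")

# String columns stay character, and the blank field becomes NA
class(toy$ACTOR1)
is.na(toy$ACTOR2)
```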
The dataset contains 28 columns in total before any changes are made.
names(df)
[1] "ACTOR1" "ACTOR1_ID" "ACTOR2"
[4] "ACTOR2_ID" "ACTOR_DYAD_ID" "ADMIN1"
[7] "ADMIN2" "ADMIN3" "ALLY_ACTOR_1"
[10] "ALLY_ACTOR_2" "COUNTRY" "EVENT_DATE"
[13] "EVENT_ID_CNTY" "EVENT_ID_NO_CNTY" "EVENT_TYPE"
[16] "FATALITIES" "GEO_PRECISION" "GWNO"
[19] "INTER1" "INTER2" "INTERACTION"
[22] "LATITUDE" "LOCATION" "LONGITUDE"
[25] "NOTES" "SOURCE" "TIME_PRECISION"
[28] "YEAR"
Of these columns, ACTOR1_ID, ACTOR2_ID, ACTOR_DYAD_ID, EVENT_ID_CNTY and EVENT_ID_NO_CNTY are surrogate keys, so we remove them. Most values in ALLY_ACTOR_1 and ALLY_ACTOR_2 are missing, so we remove these columns as well. GEO_PRECISION, GWNO and TIME_PRECISION are not required for our analysis and are also dropped. This removes 10 columns and leaves 18.
df1 <- df[-c(2, 4, 5, 9, 10, 13, 14, 17, 18, 27)]
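Dropping columns by position depends on the column order staying fixed. A possible alternative is to drop them by name. The sketch below uses a small stand-in data frame; `toy`, `toy_kept` and `drop_cols` are illustrative names and the values are made up.

```r
# Toy stand-in for df with a few of the original column names
# (values are invented for illustration)
toy <- data.frame(
  ACTOR1 = "Rioters (Algeria)",
  ACTOR1_ID = 1L,
  COUNTRY = "Algeria",
  GWNO = 615L,
  YEAR = 2001L,
  stringsAsFactors = FALSE
)

# The ten columns removed in the text, listed by name
drop_cols <- c("ACTOR1_ID", "ACTOR2_ID", "ACTOR_DYAD_ID",
               "ALLY_ACTOR_1", "ALLY_ACTOR_2",
               "EVENT_ID_CNTY", "EVENT_ID_NO_CNTY",
               "GEO_PRECISION", "GWNO", "TIME_PRECISION")

# Keep every column whose name is not in drop_cols
toy_kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_kept)
```

Names in `drop_cols` that are absent from the data frame are simply ignored, so the same vector works on the toy data and on the full dataset.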
The table below gives details about these columns:
| Column | Description |
| --- | --- |
| ACTOR1 | Name of first actor |
| ACTOR2 | Name of second actor |
| ADMIN1 | The largest sub-national administrative region in which the event took place |
| ADMIN2 | The second-largest sub-national administrative region in which the event took place |
| ADMIN3 | The third-largest sub-national administrative region in which the event took place |
| COUNTRY | Country of conflict |
| EVENT_DATE | Date of conflict, DD/MM/YYYY |
| FATALITIES | Integer value of fatalities that occurred, as reported by source |
| INTER1 | A numeric code indicating the type of ACTOR1 |
| INTER2 | A numeric code indicating the type of ACTOR2 |
| INTERACTION | A numeric code indicating the interaction between types of ACTOR1 and ACTOR2 |
| LATITUDE | The latitude of the location |
| LOCATION | The location where the event occurred |
| LONGITUDE | The longitude of the location |
| NOTES | Additional notes |
| SOURCE | Source of conflict information |
| YEAR | Year the event occurred |
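Because EVENT_DATE is stored as a DD/MM/YYYY string, it would need to be parsed before any date-based analysis. A minimal sketch, assuming made-up date values (`event_date` and `parsed` are illustrative names):

```r
# Hypothetical EVENT_DATE values in DD/MM/YYYY format
event_date <- c("01/01/1997", "15/03/2016", "31/12/2017")

# Parse into Date objects: %d = day, %m = month, %Y = four-digit year
parsed <- as.Date(event_date, format = "%d/%m/%Y")

# Extract month and year as strings for time-of-year summaries
event_month <- format(parsed, "%m")
event_year <- format(parsed, "%Y")
event_month
event_year
```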
The columns ACTOR2, ADMIN2, ADMIN3, LOCATION, NOTES and SOURCE contain some missing data, as the check below shows.
apply(df1, 2, function(x) any(is.na(x)))
ACTOR1 ACTOR2 ADMIN1 ADMIN2 ADMIN3
FALSE TRUE FALSE TRUE TRUE
COUNTRY EVENT_DATE EVENT_TYPE FATALITIES INTER1
FALSE FALSE FALSE FALSE FALSE
INTER2 INTERACTION LATITUDE LOCATION LONGITUDE
FALSE FALSE FALSE TRUE FALSE
NOTES SOURCE YEAR
TRUE TRUE FALSE
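The check above only reports whether a column has any missing value. To see how much is missing, one option is to count NAs per column. The sketch below uses a toy data frame with invented values:

```r
# Toy data frame with NAs in two of three columns
toy <- data.frame(
  ACTOR1 = c("A", "B", "C"),
  ACTOR2 = c(NA, "D", NA),
  ADMIN2 = c("X", NA, "Z"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# The same counts as a percentage of rows
na_pct <- round(100 * na_counts / nrow(toy), 1)
na_pct
```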
For column ACTOR2, NA indicates that there was no second actor, so we replace it with the string “NONE”.
df1$ACTOR2[is.na(df1$ACTOR2)] <- "NONE"
head(df1$ACTOR2)
[1] "Civilians (Algeria)"
[2] "Police Forces of Algeria (1999-)"
[3] "NONE"
[4] "Police Forces of Algeria (1999-)"
[5] "Police Forces of Algeria (1999-)"
[6] "Police Forces of Algeria (1999-)"
For the columns INTER1, INTER2 and INTERACTION, numeric codes stand in for categories. We replace these codes with the actual labels from the codebook.
Preview of Column INTER1
df1$INTER1 <- as.character(df1$INTER1)
lut1 <- c("1" = "Government or mutinous force", "2" = "Rebel force", "3" = "Political militia", "4" = "Ethnic militia", "5" = "Rioters", "6" = "Protesters", "7" = "Civilians", "8" = "Outside/external force")
df1$INTER1 <- lut1[df1$INTER1]
head(df1$INTER1)
[1] "Government or mutinous force" "Rioters"
[3] "Protesters" "Rioters"
[5] "Rioters" "Rioters"
Preview of Column INTER2
df1$INTER2 <- as.character(df1$INTER2)
lut2 <- c("0" = "NONE", "1" = "Government or mutinous force", "2" = "Rebel force", "3" = "Political militia", "4" = "Ethnic militia", "5" = "Rioters", "6" = "Protesters", "7" = "Civilians", "8" = "Outside/external force")
df1$INTER2 <- lut2[df1$INTER2]
head(df1$INTER2)
[1] "Civilians" "Government or mutinous force"
[3] "NONE" "Government or mutinous force"
[5] "Government or mutinous force" "Government or mutinous force"
Preview of Column INTERACTION
df1$INTERACTION <- as.character(df1$INTERACTION)
lut3 <- c(
  "10" = "SOLE MILITARY ACTION", "11" = "MILITARY VERSUS MILITARY",
  "12" = "MILITARY VERSUS REBELS", "13" = "MILITARY VERSUS POLITICAL MILITIA",
  "14" = "MILITARY VERSUS COMMUNAL MILITIA", "15" = "MILITARY VERSUS RIOTERS",
  "16" = "MILITARY VERSUS PROTESTERS", "17" = "MILITARY VERSUS CIVILIANS",
  "18" = "MILITARY VERSUS OTHER",
  "20" = "SOLE REBEL ACTION", "22" = "REBELS VERSUS REBELS",
  "23" = "REBELS VERSUS POLITICAL MILITIA", "24" = "REBELS VERSUS COMMUNAL MILITIA",
  "25" = "REBELS VERSUS RIOTERS", "26" = "REBELS VERSUS PROTESTERS",
  "27" = "REBELS VERSUS CIVILIANS", "28" = "REBELS VERSUS OTHERS",
  "30" = "SOLE POLITICAL MILITIA ACTION", "33" = "POLITICAL MILITIA VERSUS POLITICAL MILITIA",
  "34" = "POLITICAL MILITIA VERSUS COMMUNAL MILITIA", "35" = "POLITICAL MILITIA VERSUS RIOTERS",
  "36" = "POLITICAL MILITIA VERSUS PROTESTERS", "37" = "POLITICAL MILITIA VERSUS CIVILIANS",
  "38" = "POLITICAL MILITIA VERSUS OTHERS",
  "40" = "SOLE COMMUNAL MILITIA ACTION", "44" = "COMMUNAL MILITIA VERSUS COMMUNAL MILITIA",
  "45" = "COMMUNAL MILITIA VERSUS RIOTERS", "46" = "COMMUNAL MILITIA VERSUS PROTESTERS",
  "47" = "COMMUNAL MILITIA VERSUS CIVILIANS", "48" = "COMMUNAL MILITIA VERSUS OTHER",
  "50" = "SOLE RIOTER ACTION", "55" = "RIOTERS VERSUS RIOTERS",
  "56" = "RIOTERS VERSUS PROTESTERS", "57" = "RIOTERS VERSUS CIVILIANS",
  "58" = "RIOTERS VERSUS OTHERS",
  "60" = "SOLE PROTESTER ACTION", "66" = "PROTESTERS VERSUS PROTESTERS",
  "67" = "PROTESTERS VERSUS CIVILIANS", "68" = "PROTESTERS VERSUS OTHER",
  "70" = "SOLE CIVILIANS", "77" = "CIVILIANS VERSUS CIVILIANS",
  "78" = "OTHER ACTOR VERSUS CIVILIANS",
  "80" = "SOLE OTHER ACTION", "88" = "OTHERS VERSUS OTHERS"
)
df1$INTERACTION <- lut3[df1$INTERACTION]
head(df1$INTERACTION)
[1] "MILITARY VERSUS CIVILIANS" "MILITARY VERSUS RIOTERS"
[3] "SOLE PROTESTER ACTION" "MILITARY VERSUS RIOTERS"
[5] "MILITARY VERSUS RIOTERS" "MILITARY VERSUS RIOTERS"
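A named-vector lookup returns NA for any code that is missing from the table, so it may be worth checking that the recoding introduced no new NAs. A minimal sketch with a deliberately incomplete lookup table (`lut`, `codes`, `labels` and `unmapped` are illustrative names, not objects from the analysis):

```r
# Deliberately incomplete lookup table for illustration
lut <- c("1" = "Government or mutinous force", "2" = "Rebel force")

# Hypothetical codes; 9 has no entry in lut
codes <- c(1, 2, 9, 1)

# Index by the character form of the code; unname() drops the code names
labels <- unname(lut[as.character(codes)])

# Codes that failed to map show up as NA in labels
unmapped <- unique(codes[is.na(labels)])
unmapped
```

An empty `unmapped` vector would suggest that every code in the column has a label in the lookup table.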
A preview of the data after cleaning is given below:
head(df1)
Looking at the summary of the data produced by the summary() function, we note the following:
summary(df1)
| Variable | Min | Max | Mean |
| --- | --- | --- | --- |
| Fatalities | 0 | 25000 | 4.42 |
| Year | 1997 | 2017 | - |
From the summary above, fatalities per conflict event range from a minimum of 0 to a maximum of 25000, with a mean of 4.42.
hist(df1$YEAR, xlab = "YEAR", main = "Histogram of conflicts per year")

The histogram above shows that the highest number of conflicts occurred in 2016. With this information, we can further explore which areas were affected and at what time of year most conflicts happened.
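hist() chooses its own bin boundaries, so a bar may not correspond to exactly one year. To confirm the peak year numerically, the events can be counted per year with table(). The sketch below uses a made-up `year` vector rather than df1$YEAR:

```r
# Hypothetical YEAR values
year <- c(2015, 2016, 2016, 2017, 2016, 2015)

# Count events per year
events_per_year <- table(year)
events_per_year

# Year with the highest count (returned as a string)
peak_year <- names(events_per_year)[which.max(events_per_year)]
peak_year

# barplot(events_per_year) would draw one bar per year
```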
boxplot(FATALITIES ~ YEAR, data = df1)

The boxplot above shows a few outliers in the dataset, with the most extreme value occurring in 1997.
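To inspect such outliers directly, one approach is to pull out the row with the largest fatality count and the per-year maxima. This sketch uses a toy data frame with invented values in place of df1:

```r
# Toy stand-in for df1 (values are invented)
toy <- data.frame(
  YEAR = c(1997, 1997, 2016, 2017),
  COUNTRY = c("A", "B", "C", "D"),
  FATALITIES = c(3, 500, 0, 12),
  stringsAsFactors = FALSE
)

# Row with the single largest fatality count
worst_event <- toy[which.max(toy$FATALITIES), ]
worst_event

# Largest fatality count recorded in each year
max_by_year <- tapply(toy$FATALITIES, toy$YEAR, max)
max_by_year
```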
