Introduction

Conflicts occur around the world in many forms: militant activity, clashes between groups, and confrontations between civilians and government bodies. This project explores such conflicts in different countries of Africa over the past 20 years and uses clustering to find trends.

With this dataset, exploratory data analysis and clustering can show which types of conflict affect different regions and help assess the political situation of a particular region.

The data set I have selected for my project is “ACLED African Conflicts data for a duration of 1997-2017”. The dataset is not tidy and needs to be cleaned before it can be used for analysis. Many columns hold joined data that needs to be separated into different columns, and there are many missing and NA values that need to be addressed.

Packages Required

The code below checks whether the packages required for this project are installed on the system and installs any that are missing.

packages <- c("dplyr","leaflet","htmltools","ggplot2","lubridate","qdap","tm","SnowballC","wordcloud","RColorBrewer")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

The packages required for this project are loaded below:

library(dplyr)      
library(leaflet)    
library(htmltools)  
library(ggplot2)    
library(lubridate)  
library(qdap)       
library(tm)     
library(SnowballC) 
library(wordcloud)  
library(RColorBrewer)    

These packages are described below:

Package Description
dplyr Used for data manipulation of the data set
leaflet Used to create interactive web maps
htmltools Tools for HTML
ggplot2 Used to create elegant data visualisations using the grammar of graphics
lubridate Used for easily dealing with dates
qdap Used for text analysis
tm Text mining package
SnowballC Used for text and word analysis
wordcloud Used to create a word cloud
RColorBrewer Used for colour palettes in R

Data Preparation

This data comes from Kaggle - ACLED African Conflicts, 1997-2017.

The data was originally collected under the ACLED project, an acronym for ‘Armed Conflict Location and Event Data’. The project is directed by Prof. Clionadh Raleigh (University of Sussex) and operated by Senior Research Manager Andrea Carboni (University of Sussex) for Africa and Hillary Tanoff for South and South-East Asia. Its aim is to collate data on political violence in developing countries, with a focus on Africa. The dataset was first introduced by Raleigh and co-authors in a 2010 paper in the Journal of Peace Research. ACLED data is used by researchers studying civil wars and political violence, and has been referenced by news organisations such as The New York Times, The Guardian and the BBC to study recent conflict trends.

After examining the data, it was observed that missing values are recorded as blanks in some columns and as NA in a few others. For consistency, we replace blanks with NA while reading the data. Dates are in “DD/MM/YYYY” format.

Data is imported using the read.csv() function. In R versions before 4.0.0, read.csv() converts string values to factors by default, so we pass stringsAsFactors = FALSE to keep them as strings. The argument na.strings = "" converts blanks to NA.

df <- read.csv("https://www.dropbox.com/s/0eeick4mo0zv4ug/african_conflicts.csv?dl=1", stringsAsFactors = FALSE, na.strings = "")

This dataset contains a total of 28 columns before any changes are made.

names(df)
 [1] "ACTOR1"           "ACTOR1_ID"        "ACTOR2"           "ACTOR2_ID"       
 [5] "ACTOR_DYAD_ID"    "ADMIN1"           "ADMIN2"           "ADMIN3"          
 [9] "ALLY_ACTOR_1"     "ALLY_ACTOR_2"     "COUNTRY"          "EVENT_DATE"      
[13] "EVENT_ID_CNTY"    "EVENT_ID_NO_CNTY" "EVENT_TYPE"       "FATALITIES"      
[17] "GEO_PRECISION"    "GWNO"             "INTER1"           "INTER2"          
[21] "INTERACTION"      "LATITUDE"         "LOCATION"         "LONGITUDE"       
[25] "NOTES"            "SOURCE"           "TIME_PRECISION"   "YEAR"            

Of these columns, ACTOR1_ID, ACTOR2_ID, ACTOR_DYAD_ID, EVENT_ID_CNTY and EVENT_ID_NO_CNTY are surrogate (identifier) columns, so we remove them. Most of the values in ALLY_ACTOR_1 and ALLY_ACTOR_2 are missing, so we remove these columns as well. GEO_PRECISION, GWNO and TIME_PRECISION are not required for our analysis. After removing these 10 columns, we are left with 18 columns.

df1 <- df[-c(2, 4, 5, 9, 10, 13, 14, 17, 18, 27)]

The table below gives details about these columns:

Variable Description
ACTOR1 Name of first actor
ACTOR2 Name of second actor
ADMIN1 The largest sub-national administrative region in which the event took place
ADMIN2 The second-largest sub-national administrative region in which the event took place
ADMIN3 The third-largest sub-national administrative region in which the event took place
COUNTRY Country of conflict
EVENT_DATE Date of conflict, DD/MM/YYYY
EVENT_TYPE Type of conflict event
FATALITIES Integer value of fatalities that occurred, as reported by source
INTER1 A numeric code indicating the type of ACTOR1
INTER2 A numeric code indicating the type of ACTOR2
INTERACTION A numeric code indicating the interaction between types of ACTOR1 and ACTOR2
LATITUDE The latitude of the location
LOCATION The location where event occurred
LONGITUDE The longitude of the location
NOTES Additional notes
SOURCE Source of conflict information
YEAR Year event occurred

Looking at columns ACTOR2, ADMIN2, ADMIN3, LOCATION, NOTES, SOURCE we see that they have some missing data.

apply(df1, 2, function(x) any(is.na(x)))
     ACTOR1      ACTOR2      ADMIN1      ADMIN2      ADMIN3     COUNTRY  EVENT_DATE 
      FALSE        TRUE       FALSE        TRUE        TRUE       FALSE       FALSE 
 EVENT_TYPE  FATALITIES      INTER1      INTER2 INTERACTION    LATITUDE    LOCATION 
      FALSE       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE 
  LONGITUDE       NOTES      SOURCE        YEAR 
      FALSE        TRUE        TRUE       FALSE 

For column ACTOR2, NA indicates that there was no second actor, so we replace it with the string “NONE”.

df1$ACTOR2[is.na(df1$ACTOR2)] <- "NONE"
head(df1$ACTOR2)
[1] "Civilians (Algeria)"              "Police Forces of Algeria (1999-)"
[3] "NONE"                             "Police Forces of Algeria (1999-)"
[5] "Police Forces of Algeria (1999-)" "Police Forces of Algeria (1999-)"

The columns INTER1, INTER2 and INTERACTION hold numeric codes that stand in for categories. We replace these codes with the actual values from the codebook.

Preview of Column INTER1

df1$INTER1 <- as.character(df1$INTER1)
lut1 <- c("1" = "Government or mutinous force", "2" = "Rebel force",
          "3" = "Political militia", "4" = "Ethnic militia", 
          "5" = "Rioters", "6" = "Protesters", "7" = "Civilians", 
          "8" = "Outside/external force")
df1$INTER1 <- lut1[df1$INTER1]
head(df1$INTER1)
[1] "Government or mutinous force" "Rioters"                     
[3] "Protesters"                   "Rioters"                     
[5] "Rioters"                      "Rioters"                     

Preview of Column INTER2

df1$INTER2 <- as.character(df1$INTER2)
lut2 <- c("0" = "NONE", "1" = "Government or mutinous force", 
          "2" = "Rebel force", "3" = "Political militia", 
          "4" = "Ethnic militia", "5" = "Rioters", "6" = "Protesters",
          "7" = "Civilians", "8" = "Outside/external force")
df1$INTER2 <- lut2[df1$INTER2]
head(df1$INTER2)
[1] "Civilians"                    "Government or mutinous force"
[3] "NONE"                         "Government or mutinous force"
[5] "Government or mutinous force" "Government or mutinous force"

Preview of Column INTERACTION

df1$INTERACTION<-as.character(df1$INTERACTION)
lut3<-c("10" = "SOLE MILITARY ACTION", "11" = "MILITARY VERSUS MILITARY",
        "12" = "MILITARY VERSUS REBELS", "13" = "MILITARY VERSUS POLITICAL MILITIA",
        "14" = "MILITARY VERSUS COMMUNAL MILITIA", "15" = "MILITARY VERSUS RIOTERS",
        "16" = "MILITARY VERSUS PROTESTERS", "17" = "MILITARY VERSUS CIVILIANS",
        "18" = "MILITARY VERSUS OTHER", "20" = "SOLE REBEL ACTION",
        "22" = "REBELS VERSUS REBELS", "23" = "REBELS VERSUS POLITICAL MILITIA",
        "24" = "REBELS VERSUS COMMUNAL MILITIA", "25" = "REBELS VERSUS RIOTERS",
        "26" = "REBELS VERSUS PROTESTERS", "27" = "REBELS VERSUS CIVILIANS",
        "28" = "REBELS VERSUS OTHERS", "30" = "SOLE POLITICAL MILITIA ACTION",
        "33" = "POLITICAL MILITIA VERSUS POLITICAL MILITIA", 
        "34" = "POLITICAL MILITIA VERSUS COMMUNAL MILITIA", 
        "35" = "POLITICAL MILITIA VERSUS RIOTERS", 
        "36" = "POLITICAL MILITIA VERSUS PROTESTERS", 
        "37" = "POLITICAL MILITIA VERSUS CIVILIANS",
        "38" = "POLITICAL MILITIA VERSUS OTHERS",
        "40" = "SOLE COMMUNAL MILITIA ACTION", 
        "44" = "COMMUNAL MILITIA VERSUS COMMUNAL MILITIA", 
        "45" = "COMMUNAL MILITIA VERSUS RIOTERS", 
        "46" = "COMMUNAL MILITIA VERSUS PROTESTERS", 
        "47" = "COMMUNAL MILITIA VERSUS CIVILIANS", 
        "48" = "COMMUNAL MILITIA VERSUS OTHER","50" = "SOLE RIOTER ACTION", 
        "55" = "RIOTERS VERSUS RIOTERS", "56" = "RIOTERS VERSUS PROTESTERS", 
        "57" = "RIOTERS VERSUS CIVILIANS", "58" = "RIOTERS VERSUS OTHERS", 
        "60" = "SOLE PROTESTER ACTION", "66" = "PROTESTERS VERSUS PROTESTERS", 
        "67" = "PROTESTERS VERSUS CIVILIANS", "68" = "PROTESTERS VERSUS OTHER", 
        "70" = "SOLE CIVILIANS", "77" = "CIVILIANS VERSUS CIVILIANS", 
        "78" = "OTHER ACTOR VERSUS CIVILIANS", "80" = "SOLE OTHER ACTION", 
        "88" = "OTHERS VERSUS OTHERS")
df1$INTERACTION <- lut3[df1$INTERACTION]
head(df1$INTERACTION)
[1] "MILITARY VERSUS CIVILIANS" "MILITARY VERSUS RIOTERS"   "SOLE PROTESTER ACTION"    
[4] "MILITARY VERSUS RIOTERS"   "MILITARY VERSUS RIOTERS"   "MILITARY VERSUS RIOTERS"  

Convert LONGITUDE and LATITUDE columns to Numeric

df1$LONGITUDE<-as.numeric(df1$LONGITUDE)
df1$LATITUDE<-as.numeric(df1$LATITUDE)

Examining the date column - EVENT_DATE

head(df1$EVENT_DATE)
[1] "18/04/2001" "19/04/2001" "20/04/2001" "21/04/2001" "21/04/2001" "21/04/2001"

The event date column is in DD/MM/YYYY format. We convert it to Date values using the dmy() function.

df1$EVENT_DATE<-dmy(df1$EVENT_DATE)

For further analysis and for aggregating data on a monthly basis, we mutate the dataset to add a column for the month.

df1<-df1%>%mutate(MONTH=month(EVENT_DATE))

Preview of data after cleaning is given below:

head(df1)

Exploratory Data Analysis

We start our exploratory data analysis by looking at the summary statistics of the numerical attributes of our data set.

summary(df1)
Variable Min Max Mean
Event Date 1997-01-01 2017-07-29 -
Fatalities 0 25000 4.42
Year 1997 2017 -

From the above, we can see that the minimum number of fatalities in a conflict is 0 and the maximum is 25000, with a mean of 4.42.

Exploring Dataset with number of conflicts

The chart below shows the number of conflicts that happened each year.

#creating data set for ggplot()
plot2<-df1%>%
  group_by(YEAR)%>%summarise(count=n())%>%
  arrange(desc(count))
#using ggplot for visualization
plot2%>%ggplot(aes(x = YEAR,y=count,fill=-count)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(name = "Year") +
  scale_y_continuous(name = "Total Number of Conflicts",labels = scales::comma) +
  ggtitle("Frequency Plot for Number of Conflicts in past 20 years",
  subtitle = "Data about number of conflicts which happened from 1997 to 2017")

It can be observed that the maximum number of conflicts happened in 2016.

Drilling down, we try to understand which countries were most involved in conflicts in that year.

#creating data set for ggplot()
plot1<-df1%>%
  filter(YEAR==2016)%>%
  group_by(COUNTRY)%>%summarise(count=n())%>%
  arrange(desc(count))%>%head(n=10)
#using ggplot for visualization
plot1%>%arrange(desc(count))%>%
  ggplot(aes(x = reorder(COUNTRY,-count), y=count,fill=-count)) +
  geom_bar(stat = "identity") + 
  scale_x_discrete(name = "Country") +
  scale_y_continuous(name = "Total Number of Conflicts",labels = scales::comma) +
  ggtitle("Frequency Plot for Number of Conflicts in 2016",
  subtitle = "Top 10 countries with highest number of conflicts in 2016")+
  coord_flip()

It can be seen that Somalia had the highest number of conflicts. We next look at the regions in Somalia where conflicts occurred.

From the map below, it can be seen that the events were spread all across Somalia.

#creating dataset with desired Latitude and Longitude
data = df1 %>%
  filter(YEAR==2016&COUNTRY=="Somalia") %>% 
  filter(!is.na(LATITUDE)) %>%
  filter(!is.na(LONGITUDE))
#defining median location
center_lon = median(data$LONGITUDE)
center_lat = median(data$LATITUDE)
#creating color palette
palette_color <- colorFactor(c("red","blue","green","yellow","orange"), data$COUNTRY)
#creating map
leaflet(data) %>% addTiles() %>%
  addCircles(lng = ~LONGITUDE, lat = ~LATITUDE,#radius = ~(), 
  color = ~palette_color(COUNTRY))%>%
#setting the location of the map  
  setView(lng=center_lon, lat=center_lat,zoom = 5) %>%
#adding a minimap  
  addMiniMap("bottomright", width = 150, height = 150,
  collapsedWidth = 19, collapsedHeight = 19, zoomLevelOffset = -5,
  zoomLevelFixed = FALSE, centerFixed = FALSE, zoomAnimation = FALSE,
  toggleDisplay = FALSE, autoToggleDisplay = FALSE, minimized = FALSE,
  aimingRectOptions = list(color = "#ff7800", weight = 1, clickable = FALSE),
  shadowRectOptions = list(color = "#000000", weight = 1, clickable = FALSE,
  opacity = 0, fillOpacity = 0), strings = list(hideText = "Hide MiniMap",
  showText = "Show MiniMap"), tiles = NULL, mapOptions = list())

Exploring Dataset considering fatalities

The bar plot below compares the fatalities that occurred each year due to conflicts.

df1%>%
  group_by(YEAR)%>%summarise(total=sum(FATALITIES))%>%
  ggplot(aes(x=YEAR,y=total,fill=-total))+geom_bar(stat = "identity")+
  scale_x_continuous(name = "Year") +
  scale_y_continuous(name = "Total Number of Fatalities",labels = scales::comma) +
  ggtitle("Fatalities over the years due to conflicts",
  subtitle = "Data about fatalities happening every year due to conflicts")

It can be seen from the plot above that the maximum fatalities occurred during conflicts in 1999.
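
As a rough cross-check of a plot like this, the yearly totals can also be computed with base R and the peak year read off directly. The sketch below uses a made-up toy data frame (`toy` and its values are illustrative assumptions, not figures from the ACLED data).

```r
# Toy data standing in for df1 (values are made up for illustration).
toy <- data.frame(
  YEAR       = c(1998, 1999, 1999, 2000),
  FATALITIES = c(5, 100, 20, 1)
)

# Sum fatalities within each year.
by_year <- aggregate(FATALITIES ~ YEAR, data = toy, FUN = sum)
print(by_year)

# Year with the largest total.
peak_year <- by_year$YEAR[which.max(by_year$FATALITIES)]
print(peak_year)
```

The same idea applies to the real data by replacing `toy` with the cleaned data frame.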

Digging further, we try to understand in which months of that year these conflicts led to the most fatalities.

  
df1%>%
  filter(YEAR==1999)%>%
  group_by(MONTH)%>%
  summarise(total=sum(FATALITIES))%>%
  ggplot(aes(x=as.factor(MONTH),y=total,fill=-total))+geom_bar(stat = "identity")+
  scale_x_discrete(name = "Month of 1999") +
  scale_y_continuous(name = "Total Fatalities",labels = scales::comma) +
  ggtitle("Number of fatalities in 1999",
  subtitle = "Data about fatalities which happened in 1999 per month")

It is observed that the maximum fatalities happened in the first four months.

With the plot below, we analyze which countries had the most fatalities in 1999.

df1%>%
  filter(YEAR==1999)%>%
  group_by(COUNTRY)%>%summarise(total=sum(FATALITIES))%>%
  arrange(desc(total))%>%head(n=10)%>%
  ggplot(aes(x = reorder(COUNTRY,-total), y=total,fill=-total)) +
  geom_bar(stat = "identity") + 
  scale_x_discrete(name = "Country") +
  scale_y_continuous(name = "Total fatalities",labels = scales::comma) +
  ggtitle("Fatalities analysis for the year 1999",
  subtitle = "Analysis of fatalities in each country in the year 1999")+coord_flip()

From the plot above it is evident that Eritrea and Angola had the maximum fatalities.


The tables below give an overall picture of the top 5 years and top 10 countries with the highest fatalities.

df1%>%
  group_by(YEAR)%>%
  summarise(sum(FATALITIES))%>%
  arrange(desc(`sum(FATALITIES)`))%>%
  head(n=5)

From the table above, we can see that year 1999 had the highest fatalities due to conflicts.

df1%>%
  group_by(COUNTRY)%>%
  summarise(sum(FATALITIES))%>%
  arrange(desc(`sum(FATALITIES)`))%>%
  head(n=10)

Here, we can see that Angola has the highest fatalities due to conflicts over all the years.




Text Analysis

A text analysis on the notes column can help us understand which words were used the most in the text.

#%in% keeps every row from these years; == would recycle the vector and silently drop rows
notes<-df1%>%filter(YEAR %in% c(2017,2016,2015,2014,2013))%>%select(NOTES)
#head(notes)
notes_source<-VectorSource(notes)
notes_corpus<-VCorpus(notes_source)
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  #Top200Words is the word list that is loaded with qdap
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), Top200Words))
  return(corpus)
}
notes_clean<-clean_corpus(notes_corpus)
notes_tdm<-TermDocumentMatrix(notes_clean)
notes_m<-as.matrix(notes_tdm)
notes_words<-rowSums(notes_m)
notes_words<-sort(notes_words,decreasing=TRUE)
notes_freqs<-data.frame(term = names(notes_words),num = notes_words)
wordcloud(notes_freqs$term, notes_freqs$num,max.words = 30,colors = c("chartreuse", "cornflowerblue", "darkorange"))

From the word cloud above, it can be seen that the most frequent words in the notes column are “killed”, “police” and “forces”.
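
When selecting rows for several years at once, it is worth noting how `==` and `%in%` differ on a vector of values. The toy example below (the `years` and `wanted` values are made up for illustration) sketches the difference: `==` compares position by position after recycling, while `%in%` tests membership.

```r
# Made-up year values standing in for a YEAR column.
years  <- c(2016, 2017, 2016, 2017)
wanted <- c(2017, 2016)

# == recycles `wanted` to 2017, 2016, 2017, 2016 and compares element-wise.
by_equals <- years == wanted

# %in% asks, for each year, whether it appears anywhere in `wanted`.
by_in <- years %in% wanted

print(sum(by_equals))
print(sum(by_in))
```

Here every year is one of the wanted years, yet the element-wise comparison matches none of them.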

Summary

With a large dataset like this, a lot of information can be extracted about events, such as which areas were affected the most and how much damage occurred due to conflicts.

All these insights and many others can be gathered using visual tools like plots, histograms, geospatial maps etc.

In this report, we found that 1999 was the worst year for conflicts, as it witnessed very high fatalities in Angola and Eritrea.

We also observed that the highest number of conflicts happened in 2016, but the number of fatalities was lower. This may indicate that police forces were able to control the conflicts at an early stage.

This kind of analysis can inform the end user about the current and historical political situation of an African country. The user can analyse further to find which groups were involved the most, which may provide useful information about the possible next location of a conflict in real time.

This project can be further developed to include interactive visualization with the help of the Shiny package, which would let the user analyze data for different years or regions interactively.

More analysis can be done to find which parties were involved in conflicts the greatest number of times.

---
title: "ACLED African Conflicts"
author: "Anurag Jain"
date: "April 22, 2018"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# {.tabset .tabset-fade .tabset-pills}
## Introduction
<!-- 
1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this? 
1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed) 
1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem. 
1.4 Explain how your analysis will help the consumer of your analysis.
-->
Conflicts occur around the world in many forms: militant activity, clashes between groups, and confrontations between civilians and government bodies. This project explores such conflicts in different countries of Africa over the past 20 years and uses clustering to find trends.

With this dataset, exploratory data analysis and clustering can show which types of conflict affect different regions and help assess the political situation of a particular region.

The data set I have selected for my project is "ACLED African Conflicts data for a duration of 1997-2017". The dataset is not tidy and needs to be cleaned before it can be used for analysis. Many columns hold joined data that needs to be separated into different columns, and there are many missing and NA values that need to be addressed.


## Packages Required
<!--
2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis. 
2.2 Messages and warnings resulting from loading the package are suppressed. 
2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don't assume that I know why you loaded each package).
-->

The code below checks whether the packages required for this project are installed on the system and installs any that are missing.

```{r}
packages <- c("dplyr","leaflet","htmltools","ggplot2","lubridate","qdap","tm","SnowballC","wordcloud","RColorBrewer")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}
```
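
As a hedged alternative sketch, the same check can be wrapped in a small helper. The function name `missing_packages` is hypothetical (it is not part of this notebook), and the sketch relies only on base R's `requireNamespace()`.

```r
# Hypothetical helper (not from the notebook): return the subset of `pkgs`
# that cannot be loaded on this system.
missing_packages <- function(pkgs) {
  available <- vapply(
    pkgs,
    function(p) requireNamespace(p, quietly = TRUE),
    logical(1)
  )
  pkgs[!available]
}

# Packages that ship with R are always available, so nothing is reported.
to_install <- missing_packages(c("stats", "utils"))
print(to_install)

# In the notebook one could then run, if length(to_install) > 0:
# install.packages(to_install)
```

Compared with `installed.packages()`, this only inspects the packages that are asked for, which can be quicker on systems with large libraries.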


The packages required for this project are loaded below:

```{r packages, warning=FALSE, error=FALSE, message=FALSE}
library(dplyr)      
library(leaflet)    
library(htmltools)  
library(ggplot2)    
library(lubridate)  
library(qdap)       
library(tm)     
library(SnowballC) 
library(wordcloud)  
library(RColorBrewer)    
```

These packages are described below:

Package    | Description |
--------------------- | --------------------------------------------------------------------|
dplyr | Used for data manipulation of the data set|
leaflet | Used to create interactive web maps|
htmltools | Tools for HTML|
ggplot2 | Used to create elegant data visualisations using the grammar of graphics|
lubridate | Used for easily dealing with dates|
qdap | Used for text analysis|
tm | Text mining package|
SnowballC | Used for text and word analysis|
wordcloud | Used to create a word cloud|
RColorBrewer | Used for colour palettes in R|


## Data Preparation
<!--
3.1 Original source where the data was obtained is cited and, if possible, hyperlinked. 
3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.). 
 
-->

This data comes from [Kaggle - ACLED African Conflicts, 1997-2017](https://www.kaggle.com/jboysen/african-conflicts).

The data was originally collected under the ACLED project, an acronym for 'Armed Conflict Location and Event Data'. The project is directed by Prof. Clionadh Raleigh (University of Sussex) and operated by Senior Research Manager Andrea Carboni (University of Sussex) for Africa and Hillary Tanoff for South and South-East Asia. Its aim is to collate data on political violence in developing countries, with a focus on Africa. The dataset was first introduced by Raleigh and co-authors in a 2010 paper in the [Journal of Peace Research](https://en.wikipedia.org/wiki/Journal_of_Peace_Research). ACLED data is used by researchers studying civil wars and political violence, and has been referenced by news organisations such as The New York Times, The Guardian and the BBC to study recent conflict trends.

After examining the data, it was observed that missing values are recorded as blanks in some columns and as `NA` in a few others. For consistency, we replace blanks with NA while reading the data. Dates are in "DD/MM/YYYY" format.

<!--
3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.
-->
Data is imported using the read.csv() function. In R versions before 4.0.0, read.csv() converts string values to factors by default, so we pass stringsAsFactors = FALSE to keep them as strings. The argument na.strings = "" converts blanks to NA.

```{r readcsv, echo=TRUE}
df <- read.csv("https://www.dropbox.com/s/0eeick4mo0zv4ug/african_conflicts.csv?dl=1", stringsAsFactors = FALSE, na.strings = "")

```
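
To illustrate how `na.strings` behaves, here is a small sketch on inline CSV text. The toy rows are made up, and passing both `""` and `"NA"` is an assumption for illustration (the call above passes only `""`); it may be useful if a text column contains the literal string "NA".

```r
# Toy CSV text standing in for the real file (values are made up).
csv_text <- "ACTOR1,ACTOR2,FATALITIES
Rioters,,3
Protesters,NA,
Rebels,Police,0"

# Treat both empty fields and the literal text "NA" as missing.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                na.strings = c("", "NA"))

print(toy)
print(is.na(toy$ACTOR2))
```

With `na.strings = ""` alone, the second ACTOR2 value would stay as the character string "NA" rather than a missing value.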

This dataset contains a total of 28 columns before any changes are made.

```{r names, echo=TRUE}
names(df)
```

Of these columns, ACTOR1_ID, ACTOR2_ID, ACTOR_DYAD_ID, EVENT_ID_CNTY and EVENT_ID_NO_CNTY are surrogate (identifier) columns, so we remove them. Most of the values in ALLY_ACTOR_1 and ALLY_ACTOR_2 are missing, so we remove these columns as well. GEO_PRECISION, GWNO and TIME_PRECISION are not required for our analysis.
After removing these 10 columns, we are left with 18 columns.

```{r remove_columns, echo=TRUE}
df1 <- df[-c(2, 4, 5, 9, 10, 13, 14, 17, 18, 27)]
```
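
Dropping columns by position depends on the column order staying fixed. A possible alternative, sketched below, drops them by name. The one-row data frame `toy` is a placeholder built from the column names printed by `names(df)`, and `drop_cols` is a hypothetical helper vector.

```r
# Column names as printed by names(df) in this report.
cols <- c("ACTOR1", "ACTOR1_ID", "ACTOR2", "ACTOR2_ID", "ACTOR_DYAD_ID",
          "ADMIN1", "ADMIN2", "ADMIN3", "ALLY_ACTOR_1", "ALLY_ACTOR_2",
          "COUNTRY", "EVENT_DATE", "EVENT_ID_CNTY", "EVENT_ID_NO_CNTY",
          "EVENT_TYPE", "FATALITIES", "GEO_PRECISION", "GWNO", "INTER1",
          "INTER2", "INTERACTION", "LATITUDE", "LOCATION", "LONGITUDE",
          "NOTES", "SOURCE", "TIME_PRECISION", "YEAR")

# One-row toy data frame with those columns (values are placeholders).
toy <- as.data.frame(setNames(replicate(length(cols), 0, simplify = FALSE), cols))

# Columns the text says are not needed.
drop_cols <- c("ACTOR1_ID", "ACTOR2_ID", "ACTOR_DYAD_ID",
               "EVENT_ID_CNTY", "EVENT_ID_NO_CNTY",
               "ALLY_ACTOR_1", "ALLY_ACTOR_2",
               "GEO_PRECISION", "GWNO", "TIME_PRECISION")

by_name     <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
by_position <- toy[-c(2, 4, 5, 9, 10, 13, 14, 17, 18, 27)]

print(ncol(by_name))
print(identical(names(by_name), names(by_position)))
```

Both approaches keep the same 18 columns here; the name-based version keeps working if the source file reorders its columns.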

The table below gives details about these columns:

  Variable    | Description |
--------------------- | -------------------------------------------|
ACTOR1 | Name of first actor|
ACTOR2  | Name of second actor|
ADMIN1 | The largest sub-national administrative region in which the event took place|
ADMIN2 | The second-largest sub-national administrative region in which the event took place|
ADMIN3 | The third-largest sub-national administrative region in which the event took place|
COUNTRY | Country of conflict|
EVENT_DATE | Date of conflict, DD/MM/YYYY|
EVENT_TYPE | Type of conflict event|
FATALITIES | Integer value of fatalities that occurred, as reported by source|
 INTER1 | A numeric code indicating the type of ACTOR1|
 INTER2 | A numeric code indicating the type of ACTOR2|
 INTERACTION | A numeric code indicating the interaction between types of ACTOR1 and ACTOR2|
LATITUDE | The latitude of the location|
LOCATION | The location where event occurred|
LONGITUDE | The longitude of the location|
NOTES | Additional notes|
SOURCE | Source of conflict information|
YEAR      |  Year event occurred|

```{r str, eval=FALSE, echo=FALSE}
str(df1)
```




<!--
Removing blank cells
-->

Looking at columns ACTOR2, ADMIN2, ADMIN3, LOCATION, NOTES, SOURCE we see that they have some missing data.
```{r Miss_check, echo=TRUE}
apply(df1, 2, function(x) any(is.na(x)))
```
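
The check above reports only whether a column has any missing value. A short sketch of counting how many values are missing per column, using a made-up toy data frame (`toy` is illustrative, not the ACLED data):

```r
# Toy data frame with some missing values (made up for illustration).
toy <- data.frame(
  ACTOR1 = c("A", "B", "C"),
  ACTOR2 = c("X", NA, NA),
  NOTES  = c(NA, "note", "note"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# The same counts as a percentage of rows.
print(round(100 * na_counts / nrow(toy), 1))
```

Counts like these help decide whether a column can be repaired or should be dropped.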
For column ACTOR2, NA indicates that there was no second actor, so we replace it with the string "NONE".

```{r blanks, echo=TRUE}
df1$ACTOR2[is.na(df1$ACTOR2)] <- "NONE"
head(df1$ACTOR2)
```
The columns INTER1, INTER2 and INTERACTION hold numeric codes that stand in for categories. We replace these codes with the actual values from the codebook.

Preview of Column INTER1
```{r replace_categorical, echo=TRUE}

df1$INTER1 <- as.character(df1$INTER1)
lut1 <- c("1" = "Government or mutinous force", "2" = "Rebel force",
          "3" = "Political militia", "4" = "Ethnic militia", 
          "5" = "Rioters", "6" = "Protesters", "7" = "Civilians", 
          "8" = "Outside/external force")
df1$INTER1 <- lut1[df1$INTER1]
head(df1$INTER1)
```
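
One property of a named-vector lookup like this is that any code missing from the table silently becomes NA. The sketch below shows a possible check on a shortened table; `lut`, `codes` and the code 9 are made up for illustration.

```r
# Shortened lookup table (two entries only, for illustration).
lut <- c("1" = "Government or mutinous force", "2" = "Rebel force")

# Toy codes; 9 is deliberately not in the table.
codes <- c(1, 2, 9, 1)

# Look up labels by name; unmatched codes come back as NA.
labels <- unname(lut[as.character(codes)])
print(labels)

# List the codes that had no entry in the lookup table.
unmatched <- unique(codes[is.na(labels)])
print(unmatched)
```

Running such a check after each recoding step would flag codes that the lookup table does not cover.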

Preview of Column INTER2

```{r replace_categorical_1, echo=TRUE}
df1$INTER2 <- as.character(df1$INTER2)
lut2 <- c("0" = "NONE", "1" = "Government or mutinous force", 
          "2" = "Rebel force", "3" = "Political militia", 
          "4" = "Ethnic militia", "5" = "Rioters", "6" = "Protesters",
          "7" = "Civilians", "8" = "Outside/external force")
df1$INTER2 <- lut2[df1$INTER2]
head(df1$INTER2)
```


Preview of Column INTERACTION

```{r replace_categorical_2, echo=TRUE}
df1$INTERACTION<-as.character(df1$INTERACTION)
lut3<-c("10" = "SOLE MILITARY ACTION", "11" = "MILITARY VERSUS MILITARY",
        "12" = "MILITARY VERSUS REBELS", "13" = "MILITARY VERSUS POLITICAL MILITIA",
        "14" = "MILITARY VERSUS COMMUNAL MILITIA", "15" = "MILITARY VERSUS RIOTERS",
        "16" = "MILITARY VERSUS PROTESTERS", "17" = "MILITARY VERSUS CIVILIANS",
        "18" = "MILITARY VERSUS OTHER", "20" = "SOLE REBEL ACTION",
        "22" = "REBELS VERSUS REBELS", "23" = "REBELS VERSUS POLITICAL MILITIA",
        "24" = "REBELS VERSUS COMMUNAL MILITIA", "25" = "REBELS VERSUS RIOTERS",
        "26" = "REBELS VERSUS PROTESTERS", "27" = "REBELS VERSUS CIVILIANS",
        "28" = "REBELS VERSUS OTHERS", "30" = "SOLE POLITICAL MILITIA ACTION",
        "33" = "POLITICAL MILITIA VERSUS POLITICAL MILITIA", 
        "34" = "POLITICAL MILITIA VERSUS COMMUNAL MILITIA", 
        "35" = "POLITICAL MILITIA VERSUS RIOTERS", 
        "36" = "POLITICAL MILITIA VERSUS PROTESTERS", 
        "37" = "POLITICAL MILITIA VERSUS CIVILIANS",
        "38" = "POLITICAL MILITIA VERSUS OTHERS",
        "40" = "SOLE COMMUNAL MILITIA ACTION", 
        "44" = "COMMUNAL MILITIA VERSUS COMMUNAL MILITIA", 
        "45" = "COMMUNAL MILITIA VERSUS RIOTERS", 
        "46" = "COMMUNAL MILITIA VERSUS PROTESTERS", 
        "47" = "COMMUNAL MILITIA VERSUS CIVILIANS", 
        "48" = "COMMUNAL MILITIA VERSUS OTHER","50" = "SOLE RIOTER ACTION", 
        "55" = "RIOTERS VERSUS RIOTERS", "56" = "RIOTERS VERSUS PROTESTERS", 
        "57" = "RIOTERS VERSUS CIVILIANS", "58" = "RIOTERS VERSUS OTHERS", 
        "60" = "SOLE PROTESTER ACTION", "66" = "PROTESTERS VERSUS PROTESTERS", 
        "67" = "PROTESTERS VERSUS CIVILIANS", "68" = "PROTESTERS VERSUS OTHER", 
        "70" = "SOLE CIVILIANS", "77" = "CIVILIANS VERSUS CIVILIANS", 
        "78" = "OTHER ACTOR VERSUS CIVILIANS", "80" = "SOLE OTHER ACTION", 
        "88" = "OTHERS VERSUS OTHERS")
df1$INTERACTION <- lut3[df1$INTERACTION]
head(df1$INTERACTION)
```


Convert LONGITUDE and LATITUDE columns to Numeric
```{r, message=FALSE, error=FALSE, warning=FALSE}
df1$LONGITUDE<-as.numeric(df1$LONGITUDE)
df1$LATITUDE<-as.numeric(df1$LATITUDE)

```

Examining the date column - EVENT_DATE

```{r}
head(df1$EVENT_DATE)

```

The event date column is in DD/MM/YYYY format. We convert it to Date values using the dmy() function.



```{r}

df1$EVENT_DATE<-dmy(df1$EVENT_DATE)

```


For further analysis and for aggregating data on a monthly basis, we mutate the dataset to add a column for the month.

```{r, warning=FALSE, error=FALSE, message=FALSE}
df1<-df1%>%mutate(MONTH=month(EVENT_DATE))
```
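
As a base-R cross-check of the date handling (a sketch that does not use lubridate), `as.Date()` with an explicit format parses DD/MM/YYYY strings and returns NA for values that do not match. The third toy value below is deliberately in the wrong format.

```r
# Toy date strings; the last one is not in DD/MM/YYYY format.
dates_raw <- c("18/04/2001", "19/04/2001", "2001-04-20")

# Parse with an explicit day/month/year format.
parsed <- as.Date(dates_raw, format = "%d/%m/%Y")
print(parsed)

# Count values that failed to parse.
print(sum(is.na(parsed)))

# Extract the month number, as the MONTH column does.
months <- as.integer(format(parsed, "%m"))
print(months)
```

Counting the NA values after parsing is a quick way to confirm that every date in the column followed the expected format.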


<!--
3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible. 
3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

-->

Preview of data after cleaning is given below:


```{r clean_data, echo=TRUE}

head(df1)
```
## Exploratory Data Analysis
We start our exploratory data analysis by looking at the summary statistics of the numerical attributes of our data set.

```{r summary, echo=TRUE,eval=FALSE}

summary(df1)
```


  Variable    | Min           |Max          |Mean     |
------------- | ------------- |-------------|---------|
Event Date | 1997-01-01 | 2017-07-29 | - |
Fatalities    | 0        |  25000           |    4.42     |
Year        | 1997        |   2017          |     -    |



From the above, we can see that the minimum number of fatalities in a conflict is 0 and the maximum is 25000, with a mean of 4.42.
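
A maximum this far above the mean suggests a heavily skewed distribution, in which case the median can be a useful companion to the mean. The toy vector below is made up (it is not taken from the dataset) and only illustrates how a single extreme value pulls the mean.

```r
# Made-up fatality counts with one extreme value.
fatalities <- c(0, 0, 0, 1, 2, 3, 25000)

# The mean is dominated by the extreme value.
print(mean(fatalities))

# The median reflects the typical event.
print(median(fatalities))
```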

**Exploring Dataset with number of conflicts**

The chart below shows the number of conflicts that happened each year.
```{r hist, echo=TRUE}
#creating data set for ggplot()
plot2<-df1%>%
  group_by(YEAR)%>%summarise(count=n())%>%
  arrange(desc(count))

#using ggplot for visualization
plot2%>%ggplot(aes(x = YEAR,y=count,fill=-count)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(name = "Year") +
  scale_y_continuous(name = "Total Number of Conflicts",labels = scales::comma) +
  ggtitle("Frequency Plot for Number of Conflicts in past 20 years",
  subtitle = "Data about number of conflicts which happened from 1997 to 2017")


```
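It can be observed that the maximum number of conflicts happened in 2016. Note that the data for 2017 only runs up to 29 July, so that year's count is not directly comparable with the others.

Drilling down, we try to understand which countries were most involved in the conflicts in 2016.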


```{r}
#creating data set for ggplot()
plot1<-df1%>%
  filter(YEAR==2016)%>%
  group_by(COUNTRY)%>%summarise(count=n())%>%
  arrange(desc(count))%>%head(n=10)

#using ggplot for visualization
plot1%>%arrange(desc(count))%>%
  ggplot(aes(x = reorder(COUNTRY,-count), y=count,fill=-count)) +
  geom_bar(stat = "identity") + 
  scale_x_discrete(name = "Country") +
  scale_y_continuous(name = "Total Number of Conflicts",labels = scales::comma) +
  ggtitle("Frequency Plot for Number of Conflicts in 2016",
  subtitle = "Top 10 countries with highest number of conflicts in 2016")+
  coord_flip()
```

It can be seen that Somalia had the highest number of conflicts in 2016. Next, we look at where in Somalia these conflicts occurred.

From the map below, it can be seen that the events were spread across the whole country.

```{r}
#creating dataset with desired Latitude and Longitude
data = df1 %>%
  filter(YEAR==2016&COUNTRY=="Somalia") %>% 
  filter(!is.na(LATITUDE)) %>%
  filter(!is.na(LONGITUDE))

#defining median location
center_lon = median(data$LONGITUDE)
center_lat = median(data$LATITUDE)

#creating color palette
palette_color <- colorFactor(c("red","blue","green","yellow","orange"), data$COUNTRY)

#creating map
leaflet(data) %>% addTiles() %>%
  addCircles(lng = ~LONGITUDE, lat = ~LATITUDE,
  color = ~palette_color(COUNTRY))%>%

#setting the location of the map  
  setView(lng=center_lon, lat=center_lat,zoom = 5) %>%

#adding a minimap  
  addMiniMap("bottomright", width = 150, height = 150,
  collapsedWidth = 19, collapsedHeight = 19, zoomLevelOffset = -5,
  zoomLevelFixed = FALSE, centerFixed = FALSE, zoomAnimation = FALSE,
  toggleDisplay = FALSE, autoToggleDisplay = FALSE, minimized = FALSE,
  aimingRectOptions = list(color = "#ff7800", weight = 1, clickable = FALSE),
  shadowRectOptions = list(color = "#000000", weight = 1, clickable = FALSE,
  opacity = 0, fillOpacity = 0), strings = list(hideText = "Hide MiniMap",
  showText = "Show MiniMap"), tiles = NULL, mapOptions = list())

```

**Exploring Dataset considering fatalities**

The bar plot below compares the total fatalities caused by conflicts in each year.

```{r hist_1, echo=TRUE}
df1%>%
  group_by(YEAR)%>%summarise(total=sum(FATALITIES))%>%
  ggplot(aes(x=YEAR,y=total,fill=-total))+geom_bar(stat = "identity")+
  scale_x_continuous(name = "Year") +
  scale_y_continuous(name = "Total Number of Fatalities",labels = scales::comma) +
  ggtitle("Fatalities over the years due to conflicts",
  subtitle = "Data about fatalities happening every year due to conflicts")

```

It can be seen from the plot above that the highest number of fatalities occurred in 1999.

Digging further, we try to understand in which months of 1999 these fatalities occurred.


```{r}

  
  df1%>%
  filter(YEAR==1999)%>%
  group_by(MONTH)%>%
  summarise(total=sum(FATALITIES))%>%
  ggplot(aes(x=as.factor(MONTH),y=total,fill=-total))+geom_bar(stat = "identity")+
  scale_x_discrete(name = "Month of 1999") +
  scale_y_continuous(name = "Total Fatalities",labels = scales::comma) +
  ggtitle("Number of fatalities in 1999",
  subtitle = "Data about fatalities which happened in 1999 per month")

```

It is observed that most of the fatalities in 1999 happened in the first four months of the year.


With the plot below, we try to analyze which countries had the most fatalities in 1999.

```{r}
df1%>%
  filter(YEAR==1999)%>%
  group_by(COUNTRY)%>%summarise(total=sum(FATALITIES))%>%
  arrange(desc(total))%>%head(n=10)%>%
  ggplot(aes(x = reorder(COUNTRY,-total), y=total,fill=-total)) +
  geom_bar(stat = "identity") + 
  scale_x_discrete(name = "Country") +
  scale_y_continuous(name = "Total fatalities",labels = scales::comma) +
  ggtitle("Fatalities analysis for the year 1999",
  subtitle = "Analysis of fatalities in each country in the year 1999")+coord_flip()
```

From the plot above, it is evident that Eritrea and Angola had the highest fatalities in 1999.





<br />



The tables below give an overall picture of the top 5 years and the top 10 countries with the highest fatalities.


```{r sum3}
# total fatalities per year, five deadliest years first
df1%>%
  group_by(YEAR)%>%
  summarise(total_fatalities=sum(FATALITIES))%>%
  arrange(desc(total_fatalities))%>%
  head(n=5)

```

From the table above, we can see that 1999 had the highest fatalities due to conflicts.

```{r sum4, echo=TRUE}

# total fatalities per country over all years, ten highest first
df1%>%
  group_by(COUNTRY)%>%
  summarise(total_fatalities=sum(FATALITIES))%>%
  arrange(desc(total_fatalities))%>%
  head(n=10)
```

Here, we can see that Angola had the highest fatalities due to conflicts across all the years.

<br />
<br />
<br />

## Text Analysis

A text analysis of the notes column can help us understand which words appear most often in the event descriptions. We restrict it to the events from 2013 to 2017.


```{r message=FALSE, warning=FALSE, error=FALSE}
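# %in% keeps every event from 2013-2017; `==` against a vector recycles it and silently drops rows
notes<-df1%>%filter(YEAR %in% 2013:2017)%>%select(NOTES)
#head(notes)
# notes is a one-column data frame, so the whole NOTES column is read as a single document
notes_source<-VectorSource(notes)
notes_corpus<-VCorpus(notes_source)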

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  # Top200Words is the word list that ships with qdap; in quotes it was treated as one literal word
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), Top200Words))
  return(corpus)
}

notes_clean<-clean_corpus(notes_corpus)
```

```{r}
notes_tdm<-TermDocumentMatrix(notes_clean)
notes_m<-as.matrix(notes_tdm)
notes_words<-rowSums(notes_m)
notes_words<-sort(notes_words,decreasing=TRUE)

notes_freqs<-data.frame(term = names(notes_words),num = notes_words)
wordcloud(notes_freqs$term, notes_freqs$num,max.words = 30,colors = c("chartreuse", "cornflowerblue", "darkorange"))


```

From the word cloud above, it can be seen that the most frequent words in the notes column are "killed", "police" and "forces".
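
The word cloud gives a visual impression only, so it can help to look at the underlying counts as well. The following is a minimal sketch on three made-up notes (not taken from the data set) that builds a term-document matrix and sums the term counts with `slam::row_sums()`, which avoids converting the matrix to a dense one. The `slam` package is installed as a dependency of `tm`; the toy notes and the object names are assumptions made for illustration.

```r
library(tm)

# Made-up example notes (hypothetical, not from the ACLED data)
toy_notes <- c(
  "Police forces clashed with protesters.",
  "Two civilians were killed by armed forces.",
  "Police killed a militant."
)

# One document per note
toy_corpus <- VCorpus(VectorSource(toy_notes))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removePunctuation)
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("en"))

# Sum term counts on the sparse matrix instead of calling as.matrix()
toy_tdm <- TermDocumentMatrix(toy_corpus)
term_counts <- sort(slam::row_sums(toy_tdm), decreasing = TRUE)

head(term_counts, 5)
```

The same two final steps could be applied to `notes_tdm` to list the exact frequencies behind the word cloud.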


## Summary

With a dataset this large, a lot of information can be extracted, such as which areas were affected the most and how much damage the conflicts caused.

These insights, and many others, can be gathered using visual tools such as plots, histograms and geospatial maps.

In this report, we found that 1999 was the deadliest year in the data set, with very high fatalities in Angola and Eritrea.

We also observed that the highest number of conflicts happened in 2016, although the number of fatalities was lower than in 1999. This may indicate that police forces were able to control the conflicts at an early stage, but this analysis does not test that explanation.
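
The contrast between the number of conflicts and the number of fatalities can be made explicit by computing fatalities per event for each year. The following is a minimal sketch, assuming the `YEAR` and `FATALITIES` columns used in the plots above. The helper `fatalities_per_event()` and the toy values are hypothetical.

```r
library(dplyr)

# Hypothetical helper: events, fatalities and fatalities per event by year
fatalities_per_event <- function(df) {
  df %>%
    group_by(YEAR) %>%
    summarise(
      events     = n(),
      fatalities = sum(FATALITIES, na.rm = TRUE)
    ) %>%
    mutate(per_event = fatalities / events) %>%
    arrange(YEAR)
}

# Toy data (made-up values): few deadly events in 1999, many small ones in 2016
toy <- data.frame(
  YEAR       = c(1999, 1999, 2016, 2016, 2016, 2016),
  FATALITIES = c(100, 20, 1, 0, 2, 1)
)

ratio <- fatalities_per_event(toy)
ratio
```

Applied to `df1`, the `per_event` column would show directly whether years with many conflicts also had deadlier conflicts.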

This kind of analysis can inform the end user about the current and historical political situation of an African country. The user can analyse the data further to find which groups were involved the most, which may help in anticipating where conflicts are likely to occur next.

This project can be developed further to include interactive visualization with the Shiny package, which would let the user explore the data for different years or regions interactively.

More analysis can also be done to find which parties were involved in conflicts most often.
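
As a starting point for that follow-up, the parties on both sides of each event could be stacked into one column and counted. The following is a minimal sketch, assuming the data frame has `ACTOR1` and `ACTOR2` columns as in the standard ACLED layout; these column names are not used elsewhere in this report, so they are an assumption, and `count_actor_involvement()` and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical helper: number of events each party appears in, on either side
count_actor_involvement <- function(df, n_top = 10) {
  bind_rows(
    df %>% transmute(actor = as.character(ACTOR1)),
    df %>% transmute(actor = as.character(ACTOR2))
  ) %>%
    filter(!is.na(actor), actor != "") %>%   # drop events with no second party
    count(actor, sort = TRUE, name = "events") %>%
    head(n_top)
}

# Toy data with made-up actor names
toy <- data.frame(
  ACTOR1 = c("Group A", "Group B", "Group A"),
  ACTOR2 = c("Group B", NA, "Civilians"),
  stringsAsFactors = FALSE
)

top_actors <- count_actor_involvement(toy)
top_actors
```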

