Loading the packages required to produce the report. Install any missing packages first.
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
# Install script, should installing packages be needed.
#install.packages("readr")
#install.packages("here")
#install.packages("knitr")
#install.packages("dplyr")
#install.packages("magrittr")
#install.packages("glue")
#install.packages("lubridate")
library(readr) # Useful for importing data
## Warning: package 'readr' was built under R version 4.1.2
#library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
#library(rvest) # Useful for scraping HTML data
library(knitr) # Useful for creating nice tables
library(here) # For specifying file paths
## here() starts at /Users/samuelklettnavarro/PY4E
library(dplyr) # For data wrangling
## Warning: package 'dplyr' was built under R version 4.1.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr) # For pipes
## Warning: package 'magrittr' was built under R version 4.1.2
library(glue) # For sticking strings together
## Warning: package 'glue' was built under R version 4.1.2
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
From one of the offered open source sites, we are going to use:
State Fragility CSV File From the CORGIS Dataset Project
By Ryan Whitcomb, Joung Min Choi, Bo Guan
Version 3.0.0, created 9/16/2021
Tags: world, countries, security, politics, economy, society, effectiveness, legitimacy
http://www.systemicpeace.org/inscrdata.html
The url assigned to this dataset can be found below.
url = "https://corgis-edu.github.io/corgis/datasets/csv/state_fragility/state_fragility.csv"
To import the table from the website, we are going to use read_csv() from the ‘readr’ package, as per the code below, since it is generally faster than the base R read.csv() function.
frag_index <- read_csv(url, show_col_types = FALSE)
# show_col_types = FALSE suppresses the column specification message, so this chunk prints no output.
The Center for Systemic Peace was founded in 1997 to engage in global systems analysis to minimize the effects of political violence in the world as a whole.
The following data set shows the state fragility for countries with a population greater than 500,000 in 2013.
The State Fragility Index scores countries on two main categories: effectiveness and legitimacy. These are then broken down into four dimensions: Security, Political, Economic, and Social.
The State Fragility Index score is the sum of a country’s effectiveness score and its legitimacy score. Each of these scores is the sum of the four dimensions within its category. The lower a country’s score, the more stable it is.
For example, Metrics.Effectiveness.Economic Effectiveness is a score for gross domestic product per capita (4 = less than $500.00; 3 = $500.00 to $1199.99; 2 = $1200.00 to $2999.99; 1 = $3000.00 to $7499.99; and 0 = greater than or equal to $7500), which then counts towards the effectiveness score.
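As a minimal sketch of how the banding described above could be reproduced, the hypothetical helper gdp_to_econ_score() below maps GDP per capita (in US dollars) to the 0 to 4 Economic Effectiveness score with findInterval(). The function name and the sample values are illustrative assumptions; they are not part of the CORGIS data set or its documentation.

```r
# Hypothetical helper: band GDP per capita (USD) into the 0 to 4 score described above.
gdp_to_econ_score <- function(gdp_per_capita) {
  # Lower bounds of the bands that score 3, 2, 1 and 0 respectively
  cuts <- c(500, 1200, 3000, 7500)
  # findInterval() counts how many cut points are <= each value (0 to 4)
  4 - findInterval(gdp_per_capita, cuts)
}

# Made-up sample values, one per band
gdp_to_econ_score(c(450, 800, 2500, 5000, 12000))
# 4 3 2 1 0
```

Because findInterval() treats each cut point as a closed lower bound, a value of exactly $500 scores 3 and a value of exactly $7500 scores 0, which matches the band definitions quoted above.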
With this example, we can see that we have the required qualitative (categorical) variable, although it might need some wrangling to get it up to standard. We are also going to generate one categorical variable in the steps below.
For more detail about the different scores (variables) visit the link below.
Let’s look at the structure of our data frame, in order to find out what operations are needed to perform our analysis.
dim(frag_index)
## [1] 3960 13
#str(frag_index) # the output is quite detailed and might take too much space in the report
#head(frag_index) # gives us the first 6 rows, headings and their types
# the head/tail output is too wide for the report
summary(frag_index) # preliminary statistical analysis of the df
## Country Year Metrics.State Fragility Index
## Length:3960 Min. :1995 Min. : 0.000
## Class :character 1st Qu.:2001 1st Qu.: 3.000
## Mode :character Median :2007 Median : 9.000
## Mean :2007 Mean : 9.158
## 3rd Qu.:2013 3rd Qu.:14.000
## Max. :2018 Max. :25.000
## Metrics.Effectiveness.Economic Effectiveness
## Min. :0.000
## 1st Qu.:0.000
## Median :2.000
## Mean :1.781
## 3rd Qu.:3.000
## Max. :4.000
## Metrics.Effectiveness.Effectiveness Score
## Min. : 0.000
## 1st Qu.: 1.000
## Median : 4.000
## Mean : 4.614
## 3rd Qu.: 8.000
## Max. :13.000
## Metrics.Effectiveness.Political Effectiveness
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :1.121
## 3rd Qu.:2.000
## Max. :3.000
## Metrics.Effectiveness.Security Effectiveness
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.5641
## 3rd Qu.:1.0000
## Max. :3.0000
## Metrics.Effectiveness.Social Effectiveness
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :1.148
## 3rd Qu.:2.000
## Max. :3.000
## Metrics.Legitimacy.Economic Legitimacy Metrics.Legitimacy.Legitimacy Score
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.000 1st Qu.: 2.000
## Median :1.000 Median : 4.000
## Mean :1.258 Mean : 4.544
## 3rd Qu.:3.000 3rd Qu.: 7.000
## Max. :3.000 Max. :12.000
## Metrics.Legitimacy.Political Legitimacy Metrics.Legitimacy.Security Legitimacy
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :1.000
## Mean :1.183 Mean :1.014
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :3.000 Max. :3.000
## Metrics.Legitimacy.Social Legitimacy
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :1.089
## 3rd Qu.:2.000
## Max. :3.000
#View(frag_index) # to view the entire df
dfnames <- colnames(frag_index) # taking the opportunity to create a vector with the names
# as we might have to make changes for easier understanding
We can see that we have 13 variables, of which only one is in the character class.
Among the numeric variables, those that range from 0 to 3 or from 0 to 4 could be turned into factors with similar levels, whilst the rest are numeric scores. Instead of converting them, we are going to create a new categorical variable, in order to comply with the assignment requirements.
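The conversion that we are not applying here could be sketched as follows. The vector econ_scores holds made-up values that stand in for one of the 0 to 4 columns; it is not taken from the data set.

```r
# Made-up scores on the 0 to 4 scale, standing in for a column such as Economic Effectiveness
econ_scores <- c(0, 2, 4, 1, 3, 2)

# Ordered factor with one level per possible score (a lower score means more stable)
econ_factor <- factor(econ_scores, levels = 0:4, ordered = TRUE)

levels(econ_factor)   # "0" "1" "2" "3" "4"
table(econ_factor)    # counts per level, including levels that never occur
econ_factor < "3"     # comparisons work because the factor is ordered
```

Declaring all five levels up front keeps the factor consistent across subsets of the data, even when a subset does not contain every score.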
We are going to make some changes to the data set, starting with the column names. To safeguard the original data frame during this process, we are going to work on a copy, so that we can always refer back to the original to check the results.
Changing the names is ideal, not only to understand the data better, but also to make the columns easier to reference in function calls. We are also going to arrange the columns in an order that makes the results easier to read.
# generating a data set copy
df <- frag_index
#change of column names
# fixed = TRUE matches the dots literally instead of as regex wildcards
dfnames_new <- gsub(pattern = "Metrics.Effectiveness.",
                    replacement = "", x = dfnames, fixed = TRUE)
dfnames_new <- gsub(pattern = "Metrics.Legitimacy.",
                    replacement = "", x = dfnames_new, fixed = TRUE)
dfnames_new <- gsub(pattern = "Metrics.", replacement = "", x = dfnames_new, fixed = TRUE)
dfnames_new <- gsub(pattern = " ", replacement = "_", x = dfnames_new, fixed = TRUE)
names(df) <- dfnames_new # passing the new column names to the df
df %<>% select(1:3, contains("Score"), everything())
#head(df) # to check
summary(df)
## Country Year State_Fragility_Index Effectiveness_Score
## Length:3960 Min. :1995 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2001 1st Qu.: 3.000 1st Qu.: 1.000
## Mode :character Median :2007 Median : 9.000 Median : 4.000
## Mean :2007 Mean : 9.158 Mean : 4.614
## 3rd Qu.:2013 3rd Qu.:14.000 3rd Qu.: 8.000
## Max. :2018 Max. :25.000 Max. :13.000
## Legitimacy_Score Economic_Effectiveness Political_Effectiveness
## Min. : 0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 2.000 1st Qu.:0.000 1st Qu.:0.000
## Median : 4.000 Median :2.000 Median :1.000
## Mean : 4.544 Mean :1.781 Mean :1.121
## 3rd Qu.: 7.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :12.000 Max. :4.000 Max. :3.000
## Security_Effectiveness Social_Effectiveness Economic_Legitimacy
## Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.0000 Median :1.000 Median :1.000
## Mean :0.5641 Mean :1.148 Mean :1.258
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :3.0000 Max. :3.000 Max. :3.000
## Political_Legitimacy Security_Legitimacy Social_Legitimacy
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :1.000 Median :1.000
## Mean :1.183 Mean :1.014 Mean :1.089
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :3.000 Max. :3.000 Max. :3.000
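As an alternative to the chained gsub() calls above, the prefixes could be removed in one step by deleting everything up to and including the last dot. This is only a sketch of a different technique, shown on three of the original column names; it is not the method used in this report.

```r
# Three of the original column names, copied from the first summary output
old_names <- c("Metrics.State Fragility Index",
               "Metrics.Effectiveness.Economic Effectiveness",
               "Metrics.Legitimacy.Legitimacy Score")

# "^.*\\." is greedy, so it matches from the start up to the last literal dot
new_names <- sub(pattern = "^.*\\.", replacement = "", x = old_names)
# Replace the remaining spaces with underscores
new_names <- gsub(pattern = " ", replacement = "_", x = new_names, fixed = TRUE)
new_names
# "State_Fragility_Index" "Economic_Effectiveness" "Legitimacy_Score"
```

The dot is escaped here because an unescaped dot in a regular expression matches any character, not just a full stop.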
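We are going to make the countries a categorical variable (a factor), even though it will have a large number of levels. This can reduce the memory the column uses, because a factor stores one integer code per row plus a single copy of each label.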
countries <- unique(df$Country)
#countries
df$Country %<>%
  factor(., levels = countries)
length(countries) # number of countries reported on
## [1] 169
class(df$Country)
## [1] "factor"
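The effect of the conversion on memory can be checked with object.size(). The sketch below uses a made-up vector of three repeated country names rather than the real column, so the exact sizes will differ from those of df$Country.

```r
# Made-up column: three country names repeated many times
country_chr <- rep(c("Afghanistan", "Egypt", "Syria"), times = 1000)
country_fct <- factor(country_chr)

# The character vector stores one pointer per row; the factor stores one
# integer code per row plus a single copy of each label
object.size(country_chr)
object.size(country_fct)
```

The saving grows with the number of rows and shrinks as the share of distinct values increases, so it is worth measuring rather than assuming.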
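We are going to generate a categorical variable that ranks the risk of violence in each country according to its State_Fragility_Index: LOW for scores below 10, MEDIUM for scores from 10 to 20, and HIGH for scores of 21 and above.
The case_when() function works like a chain of nested ifelse() calls written as a single function: the conditions are evaluated in order, and the first one that is true determines the result.
# checking that there are no missing values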
df$State_Fragility_Index %>%
is.na() %>%
sum()
## [1] 0
df %<>%
  mutate(Instability_Risk =
           case_when(State_Fragility_Index < 10 ~ "LOW",
                     State_Fragility_Index <= 20 ~ "MEDIUM",
                     State_Fragility_Index >= 21 ~ "HIGH",
                     TRUE ~ "Unknown"))
df$Instability_Risk %<>%
  factor(.,
         levels = c("LOW", "MEDIUM", "HIGH"),
         ordered = TRUE)
class(df$Instability_Risk) # check
## [1] "ordered" "factor"
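The same banding can also be sketched with the base R function cut(), which returns an ordered factor directly. This is an alternative to the case_when() approach used above, not a replacement for it, and it assumes whole-number scores; the values in scores are made up to sit on the band edges.

```r
# Made-up scores placed on the edges of each band
scores <- c(0, 9, 10, 20, 21, 25)

# Intervals are right-closed by default: (-Inf, 9], (9, 20], (20, Inf)
risk <- cut(scores,
            breaks = c(-Inf, 9, 20, Inf),
            labels = c("LOW", "MEDIUM", "HIGH"),
            ordered_result = TRUE)

data.frame(scores, risk)
```

For fractional scores such as 9.5 the two approaches would disagree, because case_when() tests for values below 10 whereas this cut() call places everything above 9 in the MEDIUM band.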
The three interrelated rules which make a data set tidy (Wickham & Grolemund, 2016) are:
1.- Each variable must have its own column.
2.- Each observation must have its own row.
3.- Each value must have its own cell.
In this case, I believe we have a tidy data set. For a basic analysis, however, we could drop the eight columns that hold the components of the Legitimacy and Effectiveness scores and keep only the final scores, which makes the data much easier to handle.
If the data set were to be used for machine learning, we would first check how these variables correlate with one another before dropping any of them, but in this case it should be fine to work with the overall score from each section.
df2 <- df %>%
select(1:5, 14)
head(df2)
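The correlation check mentioned above, which this report does not run, could look like the following minimal sketch. The data frame toy contains made-up values under three of the renamed column names; Spearman's rank correlation is chosen here as an assumption, on the grounds that the scores are ordinal.

```r
# Made-up values standing in for three of the score columns
toy <- data.frame(
  Effectiveness_Score    = c(2, 5, 8, 11, 4, 9),
  Economic_Effectiveness = c(0, 2, 3, 4, 1, 3),
  Security_Effectiveness = c(0, 0, 1, 3, 1, 2)
)

# Pairwise rank correlations between the columns
cor_matrix <- cor(toy, method = "spearman")
round(cor_matrix, 2)
```

Columns that correlate very strongly with an overall score add little extra information, which is the usual argument for dropping them.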
To gain some meaningful insights, we are going to focus on a handful of countries, so that we can compare their values instead of comparing all 169. The selected countries are of interest because they have had evolving conflict situations over the last few years.
subdf <- df2 %>%
filter(Country %in% c("Afghanistan", "Egypt", "Syria", "Ukraine", "Russia"))
sum_stats <- subdf %>%
  group_by(Country) %>%
  summarise(Min_Risk = min(Instability_Risk, na.rm = TRUE),
            Max_Risk = max(Instability_Risk, na.rm = TRUE),
            Median_Score = median(State_Fragility_Index, na.rm = TRUE),
            Mean_Score = mean(State_Fragility_Index, na.rm = TRUE),
            SD_Score = sd(State_Fragility_Index, na.rm = TRUE)
  )
sum_stats
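The min() and max() calls in the summary above rely on Instability_Risk being an ordered factor. The sketch below, which uses made-up values, shows the difference between an ordered and an unordered factor.

```r
risk_levels <- c("LOW", "MEDIUM", "HIGH")
toy_risk <- c("MEDIUM", "LOW", "HIGH", "MEDIUM")   # made-up values

# Ordered factor: min() and max() follow the order of the levels
ordered_risk <- factor(toy_risk, levels = risk_levels, ordered = TRUE)
min(ordered_risk)   # LOW
max(ordered_risk)   # HIGH

# Unordered factor: the same call raises an error, caught here for illustration
unordered_risk <- factor(toy_risk, levels = risk_levels)
tryCatch(max(unordered_risk), error = function(e) conditionMessage(e))
```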
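Here, we are creating a list that contains a numeric value for each response to the categorical variable. The approach is to relabel the factor levels as ‘1’, ‘2’ and ‘3’, convert the column to numeric and wrap the result in a list. Note that as.numeric() applied to a factor returns the underlying integer codes (1, 2, 3 in level order) rather than the labels, so the result is the same with or without the relabelling step.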
risklist <- df$Instability_Risk %>%
  factor(.,
         levels = c("LOW", "MEDIUM", "HIGH"),
         labels = c("1", "2", "3"),
         ordered = TRUE) %>%
  as.numeric() %>%
  list()
names(risklist) <- "Risk_Grade" # naming the list to convert it to a column
str(risklist)
## List of 1
## $ Risk_Grade: num [1:3960] 3 3 3 3 3 3 3 3 3 3 ...
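Converting a factor to numeric deserves some care, because as.numeric() returns the internal level codes and not the labels. The made-up factor below has labels that differ from its codes, which makes the distinction visible.

```r
# Made-up factor whose labels are numbers that differ from the internal codes
f <- factor(c("10", "30", "20", "10"), levels = c("10", "20", "30"))

as.numeric(f)                 # internal codes: 1 3 2 1
as.numeric(as.character(f))   # labels converted to numbers: 10 30 20 10
```

In the risk grading above the codes and the labels coincide (1, 2, 3), so either conversion gives the same values.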
To join this list to the data frame, we are going to use the ‘bind_cols’ function. A standard join would require converting the list to a data frame, creating an ID column in each table and performing an inner join. Column binding is simpler here: we know the list has the same length as the data frame, so only a column name is needed.
joineddf2 <- bind_cols(df2, risklist)
head(joineddf2)
joineddf <- bind_cols(df, risklist) # in case we need to use all the original variables
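When the new column is derived from a column that is already in the data frame, the same result can be obtained without a separate list by using mutate(). The sketch below is an alternative shown on a made-up tibble, toy_df, and is not the approach taken in this report.

```r
library(dplyr)

# Made-up data with the same ordered factor levels as Instability_Risk
toy_df <- tibble(
  Country = c("A", "B", "C"),
  Instability_Risk = factor(c("LOW", "HIGH", "MEDIUM"),
                            levels = c("LOW", "MEDIUM", "HIGH"),
                            ordered = TRUE)
)

# Add the numeric grade directly from the factor's integer codes
toy_joined <- toy_df %>%
  mutate(Risk_Grade = as.numeric(Instability_Risk))

toy_joined$Risk_Grade
# 1 3 2
```

Because mutate() works row by row on the same table, there is no risk of the new column being misaligned with the rows it describes.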
sub10 <- joineddf2[1:10, ]
matsub10 <- as.matrix(sub10)
str(matsub10)
## chr [1:10, 1:7] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:7] "Country" "Year" "State_Fragility_Index" "Effectiveness_Score" ...
dim(matsub10)
## [1] 10 7
We can see that the dimensions are 10 rows and 7 columns, as expected. Matrices are single-type data structures, so R automatically coerces all of the columns to a common type, in this case ‘character’.
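The coercion can be reproduced on a small made-up data frame. The sketch below compares a matrix built from mixed columns with one built from numeric columns only.

```r
# Made-up data frame with one character column and two numeric columns
toy <- data.frame(Country = c("A", "B"),
                  Year = c(1995, 1996),
                  Score = c(20, 18))

# Mixed column types: every value is coerced to character
m_all <- as.matrix(toy)
typeof(m_all)   # "character"

# Numeric columns only: the matrix stays numeric
m_num <- as.matrix(toy[, c("Year", "Score")])
typeof(m_num)   # "double"
```

Selecting only the numeric columns before the conversion is therefore the simplest way to keep a matrix usable for arithmetic.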
sub2 <- select(joineddf2, c(1,7))
#head(sub2) #check
save(sub2, file = "sub2.RData")
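A saved object can be checked by loading it back and comparing it with the original. The sketch below does this for a made-up data frame, writing to temporary files so that nothing in the project folder is touched, and also shows saveRDS() and readRDS() as an alternative for a single object.

```r
# Made-up data frame standing in for sub2
toy <- data.frame(Country = c("A", "B"), Risk_Grade = c(3, 1))

# RData format: load() restores the object under its original name
rdata_path <- tempfile(fileext = ".RData")
save(toy, file = rdata_path)
restored_env <- new.env()
load(rdata_path, envir = restored_env)
identical(restored_env$toy, toy)

# RDS format: readRDS() returns the object, so it can be given any name
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
toy_again <- readRDS(rds_path)
identical(toy_again, toy)
```

Loading into a separate environment avoids silently overwriting an object of the same name in the current session.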
- Whitcomb, R., Choi, J. M. & Guan, B. (2021), ‘State Fragility CSV File’, CORGIS Dataset Project. Accessed on 27 March 2022. https://corgis-edu.github.io/corgis/csv/state_fragility/
- Wickham, H. & Grolemund, G. (2016), ‘R for Data Science’, O’Reilly.
- STHDA (Unknown), ‘Saving Data into R Data Format: RDS and RDATA’, Statistical tools for high-throughput data analysis. Accessed on 27 March 2022. http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata