Assessment 1: Pre-processing data project

Setup

Load the packages required to produce the report, installing them first if needed.

knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE, 
                      message = FALSE)

# Install script, should installing packages be needed.
#install.packages("readr")
#install.packages("here")
#install.packages("knitr")
#install.packages("dplyr")
#install.packages("magrittr")
#install.packages("glue")
#install.packages("lubridate")

library(readr) # Useful for importing data

## Warning: package 'readr' was built under R version 4.1.2

#library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
#library(rvest) # Useful for scraping HTML data
library(knitr) # Useful for creating nice tables
library(here)   # For specifying file paths

## here() starts at /Users/samuelklettnavarro/PY4E

library(dplyr)        # For data wrangling

## Warning: package 'dplyr' was built under R version 4.1.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)     # For pipes

## Warning: package 'magrittr' was built under R version 4.1.2

library(glue)     # For sticking strings together

## Warning: package 'glue' was built under R version 4.1.2

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
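As an alternative to uncommenting the install lines one by one, the sketch below shows how missing packages could be detected before they are loaded. The object names `required_pkgs`, `is_installed` and `to_install` are our own, and the install call is left commented out on purpose, so the snippet only reports what is missing.

```r
# Packages used in this report (same list as the install script above)
required_pkgs <- c("readr", "here", "knitr", "dplyr",
                   "magrittr", "glue", "lubridate")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(required_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
to_install <- required_pkgs[!is_installed]

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

Keeping the install step opt-in avoids reinstalling packages every time the report is knitted.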

STEP 1 - Locate data

From one of the offered open source sites, we are going to use:

State Fragility CSV File From the CORGIS Dataset Project

By Ryan Whitcomb, Joung Min Choi, Bo Guan
Version 3.0.0, created 9/16/2021
Tags: world, countries, security, politics, economy, society, effectiveness, legitimacy
http://www.systemicpeace.org/inscrdata.html

The url assigned to this dataset can be found below.

url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_fragility/state_fragility.csv"

STEP 2 - Import Data

To import the table from the website, we use read_csv() from the “readr” package, as shown in the code below, because it is generally faster than the base R read.csv() function.

frag_index <- read_csv(url, show_col_types = FALSE)
# show_col_types = FALSE suppresses the column specification message,
# so this chunk runs without printing any output.

STEP 3 - Data Description

The Center for Systemic Peace was founded in 1997 to engage in global systems analysis to minimize the effects of political violence in the world as a whole.

The following data set shows the state fragility for countries with a population greater than 500,000 in 2013.

The State Fragility Index scores countries on two main categories: effectiveness and legitimacy. These are then broken down into four dimensions: Security, Political, Economic, and Social.

The State Fragility Index score is the sum of a country’s effectiveness score and its legitimacy score. Each of these scores is the sum of the four dimensions within the category. The lower a country’s score, the more stable it is.

For example, Metrics.Effectiveness.Economic Effectiveness is a score for gross domestic product per capita (4 = less than $500.00; 3 = $500.00 to $1199.99; 2 = $1200.00 to $2999.99; 1 = $3000.00 to $7499.99; and 0 = greater than or equal to $7500), which then counts towards the effectiveness score.
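To make the banding concrete, the sketch below maps GDP per capita values to the 0 to 4 score using the thresholds quoted above. The helper name `gdp_to_score` and the input values are hypothetical; the data set already contains the finished scores, so this is only an illustration of the rule.

```r
# Hypothetical helper: map GDP per capita (USD) to the 0-4 score
gdp_to_score <- function(gdp_per_capita) {
  # Lower bounds of the bands that score 3, 2, 1 and 0
  thresholds <- c(500, 1200, 3000, 7500)
  # findInterval() counts how many thresholds are <= each value
  4 - findInterval(gdp_per_capita, thresholds)
}

gdp_to_score(c(350, 500, 2500, 7499.99, 7500))
# returns 4 3 2 1 0
```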

With this example, we can see that the data set contains the required qualitative (categorical) variable, although it may need some wrangling to bring it up to standard. We will also generate a new categorical variable in the steps below.

For more detail about the different scores (variables) visit the link below.

https://corgis-edu.github.io/corgis/csv/state_fragility/

STEP 4 - Inspect data set and variables

Let’s look at the structure of our data frame, in order to find out what operations are needed to perform our analysis.

dim(frag_index)

## [1] 3960   13

#str(frag_index) # the output is quite detailed and might take too much space in the report

#head(frag_index) # gives us the first 6 rows, headings and their types
# the head/tail output is too wide for the report

summary(frag_index) # preliminary statistical analysis of the df

##    Country               Year      Metrics.State Fragility Index
##  Length:3960        Min.   :1995   Min.   : 0.000               
##  Class :character   1st Qu.:2001   1st Qu.: 3.000               
##  Mode  :character   Median :2007   Median : 9.000               
##                     Mean   :2007   Mean   : 9.158               
##                     3rd Qu.:2013   3rd Qu.:14.000               
##                     Max.   :2018   Max.   :25.000               
##  Metrics.Effectiveness.Economic Effectiveness
##  Min.   :0.000                               
##  1st Qu.:0.000                               
##  Median :2.000                               
##  Mean   :1.781                               
##  3rd Qu.:3.000                               
##  Max.   :4.000                               
##  Metrics.Effectiveness.Effectiveness Score
##  Min.   : 0.000                           
##  1st Qu.: 1.000                           
##  Median : 4.000                           
##  Mean   : 4.614                           
##  3rd Qu.: 8.000                           
##  Max.   :13.000                           
##  Metrics.Effectiveness.Political Effectiveness
##  Min.   :0.000                                
##  1st Qu.:0.000                                
##  Median :1.000                                
##  Mean   :1.121                                
##  3rd Qu.:2.000                                
##  Max.   :3.000                                
##  Metrics.Effectiveness.Security Effectiveness
##  Min.   :0.0000                              
##  1st Qu.:0.0000                              
##  Median :0.0000                              
##  Mean   :0.5641                              
##  3rd Qu.:1.0000                              
##  Max.   :3.0000                              
##  Metrics.Effectiveness.Social Effectiveness
##  Min.   :0.000                             
##  1st Qu.:0.000                             
##  Median :1.000                             
##  Mean   :1.148                             
##  3rd Qu.:2.000                             
##  Max.   :3.000                             
##  Metrics.Legitimacy.Economic Legitimacy Metrics.Legitimacy.Legitimacy Score
##  Min.   :0.000                          Min.   : 0.000                     
##  1st Qu.:0.000                          1st Qu.: 2.000                     
##  Median :1.000                          Median : 4.000                     
##  Mean   :1.258                          Mean   : 4.544                     
##  3rd Qu.:3.000                          3rd Qu.: 7.000                     
##  Max.   :3.000                          Max.   :12.000                     
##  Metrics.Legitimacy.Political Legitimacy Metrics.Legitimacy.Security Legitimacy
##  Min.   :0.000                           Min.   :0.000                         
##  1st Qu.:0.000                           1st Qu.:0.000                         
##  Median :1.000                           Median :1.000                         
##  Mean   :1.183                           Mean   :1.014                         
##  3rd Qu.:2.000                           3rd Qu.:2.000                         
##  Max.   :3.000                           Max.   :3.000                         
##  Metrics.Legitimacy.Social Legitimacy
##  Min.   :0.000                       
##  1st Qu.:0.000                       
##  Median :1.000                       
##  Mean   :1.089                       
##  3rd Qu.:2.000                       
##  Max.   :3.000

#View(frag_index) # to view the entire df

dfnames <- colnames(frag_index) # taking the opportunity to create a vector with the names
# as we might have to make changes for easier understanding

We can see that we have 13 variables, of which only one is of class character.

Among the numeric variables, those that range from 0 to at most 4 could be turned into factors with similar levels, whilst the rest represent numeric scores. Instead of converting them, we are going to create a new categorical variable, in order to comply with the assignment requirements.

We are going to make some changes to the data set, starting with the column names. To safeguard the original data frame during this process, we make a copy, so that we can always refer back to the original to check the results.

Changing the names not only helps us understand the data better, but also makes the columns easier to reference in function calls. We also rearrange the columns into an order that makes the results easier to read.
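As a small illustration of the renaming idea, the sketch below cleans a subset of the column names with base R. The vectors `old_names`, `prefixes` and `new_names` are our own names, and the last two lines show why treating the pattern as a literal string (fixed = TRUE) is the safer choice when the pattern contains a dot.

```r
old_names <- c("Country",
               "Metrics.State Fragility Index",
               "Metrics.Effectiveness.Economic Effectiveness",
               "Metrics.Legitimacy.Legitimacy Score")

# Strip the longest prefixes first, then the shared "Metrics." prefix
prefixes <- c("Metrics.Effectiveness.", "Metrics.Legitimacy.", "Metrics.")
new_names <- old_names
for (p in prefixes) {
  new_names <- gsub(p, "", new_names, fixed = TRUE)
}

# Replace spaces with underscores
new_names <- gsub(" ", "_", new_names, fixed = TRUE)
new_names
# "Country" "State_Fragility_Index" "Economic_Effectiveness" "Legitimacy_Score"

# As a regular expression, "." matches any character, not only a dot
gsub("Metrics.", "", "MetricsXScore")                # "Score"
gsub("Metrics.", "", "MetricsXScore", fixed = TRUE)  # "MetricsXScore"
```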

# generating a data set copy
df <- frag_index

# change of column names
# fixed = TRUE treats each pattern as a literal string, so "." is not a regex wildcard
dfnames_new <- gsub(pattern = "Metrics.Effectiveness.",
                    replacement = "", x = dfnames, fixed = TRUE)
dfnames_new <- gsub(pattern = "Metrics.Legitimacy.", 
                    replacement = "", x = dfnames_new, fixed = TRUE)
dfnames_new <- gsub(pattern = "Metrics.", replacement = "", x = dfnames_new,
                    fixed = TRUE)
dfnames_new <- gsub(pattern = " ", replacement = "_", x = dfnames_new,
                    fixed = TRUE)

names(df) <- dfnames_new # passing the new column names to the df

df %<>% select(1:3, contains("Score"), everything())
#head(df) # to check
summary(df)

##    Country               Year      State_Fragility_Index Effectiveness_Score
##  Length:3960        Min.   :1995   Min.   : 0.000        Min.   : 0.000     
##  Class :character   1st Qu.:2001   1st Qu.: 3.000        1st Qu.: 1.000     
##  Mode  :character   Median :2007   Median : 9.000        Median : 4.000     
##                     Mean   :2007   Mean   : 9.158        Mean   : 4.614     
##                     3rd Qu.:2013   3rd Qu.:14.000        3rd Qu.: 8.000     
##                     Max.   :2018   Max.   :25.000        Max.   :13.000     
##  Legitimacy_Score Economic_Effectiveness Political_Effectiveness
##  Min.   : 0.000   Min.   :0.000          Min.   :0.000          
##  1st Qu.: 2.000   1st Qu.:0.000          1st Qu.:0.000          
##  Median : 4.000   Median :2.000          Median :1.000          
##  Mean   : 4.544   Mean   :1.781          Mean   :1.121          
##  3rd Qu.: 7.000   3rd Qu.:3.000          3rd Qu.:2.000          
##  Max.   :12.000   Max.   :4.000          Max.   :3.000          
##  Security_Effectiveness Social_Effectiveness Economic_Legitimacy
##  Min.   :0.0000         Min.   :0.000        Min.   :0.000      
##  1st Qu.:0.0000         1st Qu.:0.000        1st Qu.:0.000      
##  Median :0.0000         Median :1.000        Median :1.000      
##  Mean   :0.5641         Mean   :1.148        Mean   :1.258      
##  3rd Qu.:1.0000         3rd Qu.:2.000        3rd Qu.:3.000      
##  Max.   :3.0000         Max.   :3.000        Max.   :3.000      
##  Political_Legitimacy Security_Legitimacy Social_Legitimacy
##  Min.   :0.000        Min.   :0.000       Min.   :0.000    
##  1st Qu.:0.000        1st Qu.:0.000       1st Qu.:0.000    
##  Median :1.000        Median :1.000       Median :1.000    
##  Mean   :1.183        Mean   :1.014       Mean   :1.089    
##  3rd Qu.:2.000        3rd Qu.:2.000       3rd Qu.:2.000    
##  Max.   :3.000        Max.   :3.000       Max.   :3.000
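With the shorter names in place, the additive structure described in Step 3 can be expressed compactly. The sketch below uses a made-up two-row data frame called `toy` (only the effectiveness components are included to keep it short); it is an assumption that the same check would hold for every row of the real data, as we have not run it here.

```r
# Made-up values that follow the scoring rules described in Step 3
toy <- data.frame(
  State_Fragility_Index   = c(5, 20),
  Effectiveness_Score     = c(2, 11),
  Legitimacy_Score        = c(3, 9),
  Economic_Effectiveness  = c(1, 4),
  Political_Effectiveness = c(0, 2),
  Security_Effectiveness  = c(0, 2),
  Social_Effectiveness    = c(1, 3)
)

eff_parts <- c("Economic_Effectiveness", "Political_Effectiveness",
               "Security_Effectiveness", "Social_Effectiveness")

# Effectiveness score = sum of its four dimensions
eff_ok <- rowSums(toy[eff_parts]) == toy$Effectiveness_Score

# Fragility index = effectiveness score + legitimacy score
index_ok <- toy$Effectiveness_Score + toy$Legitimacy_Score ==
  toy$State_Fragility_Index

all(eff_ok)    # TRUE
all(index_ok)  # TRUE
```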

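We are going to convert the countries into a categorical variable (factor), even though it has a large number of levels (169, as counted below). Because each country name is repeated across many rows, storing it as a factor should also reduce the memory used by the data frame.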

countries <- unique(df$Country) # distinct country names, in order of appearance

#countries

df$Country %<>% 
  factor(., levels = countries)

length(countries) # number of countries reported on

## [1] 169

class(df$Country)

## [1] "factor"
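The memory argument for factors can be checked on a small made-up example. In the sketch below, `x_chr` and `x_fct` are hypothetical vectors with three country names repeated many times; the exact sizes reported by object.size() depend on the platform, so no figures are quoted.

```r
# Three names repeated 10,000 times each (made-up data)
x_chr <- rep(c("Afghanistan", "Albania", "Algeria"), times = 10000)

# A factor stores integer codes plus one copy of each level
x_fct <- factor(x_chr, levels = unique(x_chr))

object.size(x_chr)  # character vector
object.size(x_fct)  # factor, expected to be smaller here

levels(x_fct)            # the three distinct names
as.integer(x_fct[1:3])   # underlying codes: 1 2 3
```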

We are going to generate a categorical variable that ranks the risk of violence in each country, based on the State_Fragility_Index falling within the ranges:

LOW = less than 10
MEDIUM = 10 to 20
HIGH = 21 and above

The case_when() function works like a chain of nested ifelse() calls written as a single function: the conditions are evaluated in order, and the first one that is TRUE determines the result.

# checking that there are no missing values
df$State_Fragility_Index %>%
  is.na() %>% 
  sum()

## [1] 0

df %<>% 
  mutate(Instability_Risk = 
           case_when(State_Fragility_Index < 10 ~ "LOW", 
                     State_Fragility_Index <= 20 ~ "MEDIUM", 
                     State_Fragility_Index >= 21 ~ "HIGH", 
                     TRUE ~ "Unknown")) 

df$Instability_Risk %<>% 
  factor(.,
         levels= c("LOW", "MEDIUM", "HIGH"), 
         ordered = TRUE)

class(df$Instability_Risk) # check

## [1] "ordered" "factor"
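For comparison, the same three bands can be produced with the base R function cut() instead of case_when(). This is an alternative sketch, not the method used in this report, and it assumes the index only takes whole-number values; the vector `scores` is made up.

```r
scores <- c(0, 9, 10, 20, 21, 25)  # made-up index values

risk <- cut(scores,
            breaks = c(-Inf, 10, 21, Inf),
            labels = c("LOW", "MEDIUM", "HIGH"),
            right = FALSE,          # intervals are [lower, upper)
            ordered_result = TRUE)  # return an ordered factor

as.character(risk)
# "LOW" "LOW" "MEDIUM" "MEDIUM" "HIGH" "HIGH"
```

One difference worth noting: cut() returns the ordered factor in a single step, whereas case_when() returns character values that are converted afterwards.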

STEP 5 - Tidy data

The three interrelated rules which make a data set tidy (Wickham & Grolemund, 2016) are:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

In this case, I believe we have a tidy data set. For a basic analysis, however, we could drop the eight columns that hold the components of the Legitimacy and Effectiveness scores and keep only the final scores, which makes the data much easier to handle.

If the data set were to be used for machine learning, we would first check how these variables correlate with one another before dropping them, but in this case it should be fine to work with the overall score from each section.

df2 <- df %>% 
  select(1:5, 14)

head(df2)
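The correlation check mentioned above could look something like the sketch below. The data frame `toy` contains made-up values, not figures from the data set; with the real data, the same cor() call could be applied to the numeric columns of the full data frame.

```r
# Made-up scores for five observations
toy <- data.frame(
  Effectiveness_Score     = c(2, 5, 8, 11, 13),
  Economic_Effectiveness  = c(0, 1, 2, 3, 4),
  Political_Effectiveness = c(1, 1, 2, 3, 3)
)

# Pairwise Pearson correlations between all columns
cor_mat <- cor(toy)
round(cor_mat, 2)

# Components ranked by their correlation with the overall score
sort(cor_mat["Effectiveness_Score", -1], decreasing = TRUE)
```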

STEP 6 - Summary statistics

To gain some meaningful insights, we are going to focus on a handful of countries, so that we can compare their values instead of comparing all 169. The selected countries are of interest because they have experienced evolving conflict situations over the last few years.

subdf <- df2 %>% 
  filter(Country %in% c("Afghanistan", "Egypt", "Syria", "Ukraine", "Russia"))

sum_stats <- subdf %>% 
  group_by(Country) %>%
  summarise(Min_Risk = min(Instability_Risk, na.rm = TRUE), 
            Max_Risk = max(Instability_Risk, na.rm = TRUE),
            Median_Score = median(State_Fragility_Index, na.rm = TRUE), 
            Mean_Score = mean(State_Fragility_Index, na.rm = TRUE), 
            SD_Score = sd(State_Fragility_Index, na.rm = TRUE)
            )
sum_stats
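The min() and max() calls on Instability_Risk work because it is an ordered factor. The sketch below illustrates this on a made-up vector; `risk` and `unordered` are hypothetical names.

```r
# Ordered factor: levels have a defined ranking
risk <- factor(c("MEDIUM", "HIGH", "LOW", "MEDIUM"),
               levels = c("LOW", "MEDIUM", "HIGH"),
               ordered = TRUE)

as.character(min(risk))  # "LOW"
as.character(max(risk))  # "HIGH"

# Unordered factor: min() is not meaningful and raises an error
unordered <- factor(c("MEDIUM", "HIGH", "LOW"))
result <- tryCatch(min(unordered), error = function(e) "error")
result  # "error"
```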

STEP 7 - Create a list

Here, we create a list that contains a numeric value for each observation of the categorical variable. The approach is to relabel the factor levels, convert the ‘column’ to numeric and wrap the result in a list.

risklist <- df$Instability_Risk %>% 
  factor(.,
         levels= c("LOW", "MEDIUM", "HIGH"), 
         labels = c("1", "2", "3"), 
         ordered = TRUE) %>% 
  as.numeric() %>% 
  list()

names(risklist) <- "Risk_Grade" # naming the list to convert it to a column

str(risklist)

## List of 1
##  $ Risk_Grade: num [1:3960] 3 3 3 3 3 3 3 3 3 3 ...
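One detail worth keeping in mind with this approach: as.numeric() on a factor returns the position of each level, not the label text. The labels "1", "2", "3" used above happen to coincide with those positions. The sketch below shows the difference on made-up data with labels chosen so that they do not coincide.

```r
risk <- factor(c("HIGH", "LOW", "MEDIUM"),
               levels = c("LOW", "MEDIUM", "HIGH"),
               ordered = TRUE)

as.numeric(risk)  # level positions: 3 1 2

# Relabel with values that differ from the level positions
relabelled <- factor(risk,
                     levels = c("LOW", "MEDIUM", "HIGH"),
                     labels = c("10", "20", "30"))

as.numeric(relabelled)                # still the positions: 3 1 2
as.numeric(as.character(relabelled))  # the label values: 30 10 20
```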

STEP 8 - Join the list

To join this list, we are going to use the ‘bind_cols’ function. A standard join is not directly available here, as it would require converting the list to a data frame, creating an ID column in each table and then performing an inner join. Since we know the list has the same length as the data frame, column binding is simpler and only needs a column name.

joineddf2 <- bind_cols(df2, risklist)

head(joineddf2)

joineddf <- bind_cols(df, risklist) # in case we need to use all the original variables
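To illustrate the trade-off described above, the sketch below compares column binding with the ID-plus-join route on a made-up three-row example. It uses the base R functions cbind() and merge() rather than dplyr, and all object names (`toy`, `risk`, `row_id` and so on) are hypothetical.

```r
toy  <- data.frame(Country = c("A", "B", "C"), Score = c(3, 12, 22))
risk <- list(Risk_Grade = c(1, 2, 3))

# Column binding: only valid when the lengths match row for row
stopifnot(length(risk$Risk_Grade) == nrow(toy))
bound <- cbind(toy, as.data.frame(risk))

# Join route: add an ID column to each table, then merge on it
toy_id  <- cbind(row_id = seq_len(nrow(toy)), toy)
risk_id <- data.frame(row_id = seq_along(risk$Risk_Grade),
                      Risk_Grade = risk$Risk_Grade)
joined <- merge(toy_id, risk_id, by = "row_id")

# Both routes attach the same values to the same rows
identical(joined$Risk_Grade, bound$Risk_Grade)  # TRUE
```

Column binding relies entirely on row order, so the length check is a cheap safeguard before binding.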

STEP 9 - Convert sub-set to a matrix

sub10 <- joineddf2[1:10, ]

matsub10 <- as.matrix(sub10)

str(matsub10)

##  chr [1:10, 1:7] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:7] "Country" "Year" "State_Fragility_Index" "Effectiveness_Score" ...

dim(matsub10)

## [1] 10  7

We can see that the dimensions are 10 rows and 7 columns, as expected. Matrices are single-type data structures, so R automatically coerces all of the different column types into a common one, in this case ‘character’.
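The coercion can be seen on a small made-up data frame: as soon as one column is non-numeric, the whole matrix becomes character, whereas a numeric-only subset stays numeric. The names `mixed`, `m_all` and `m_num` are our own.

```r
mixed <- data.frame(Country = c("A", "B"),
                    Year    = c(1995, 1996),
                    Score   = c(20, 21))

# One character column forces every cell to character
m_all <- as.matrix(mixed)
typeof(m_all)  # "character"

# Numeric columns only: the matrix keeps a numeric type
m_num <- as.matrix(mixed[, c("Year", "Score")])
typeof(m_num)  # "double"
```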

STEP 10 - Saving a sub-set as an R object file (.RData)

sub2 <- select(joineddf2, c(1,7))
#head(sub2) #check

save(sub2, file = "sub2.RData") # file named after the object it contains
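As a quick check that a saved object can be restored, the sketch below does a round trip with a made-up data frame written to a temporary folder. It also shows saveRDS()/readRDS(), the single-object alternative covered in the STHDA reference. All object and file names here are hypothetical.

```r
toy <- data.frame(Country = c("A", "B"), Risk_Grade = c(3, 1))

# .RData: save() stores the object together with its name
rdata_path <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_path)

# load() restores objects under their original names;
# loading into a new environment avoids overwriting anything
restored_env <- new.env()
loaded_names <- load(rdata_path, envir = restored_env)
loaded_names                          # "toy"
all.equal(restored_env$toy, toy)      # TRUE

# .rds: stores a single object, and the name is chosen on reading
rds_path <- file.path(tempdir(), "toy.rds")
saveRDS(toy, rds_path)
toy_copy <- readRDS(rds_path)
all.equal(toy_copy, toy)              # TRUE
```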

References

- Whitcomb, Choi & Guan (2021), ‘State Fragility CSV File’, CORGIS Dataset Project. Accessed on 27 March 2022. https://corgis-edu.github.io/corgis/csv/state_fragility/

- Wickham & Grolemund (2016), ‘R for Data Science’, O’Reilly.

- STHDA (Unknown), ‘Saving Data into R Data Format: RDS and RDATA’, Statistical tools for high-throughput data analysis. Accessed on 27 March 2022. http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata