Introduction
Discussion Board Title: World Government Indicators
Dataset: https://info.worldbank.org/governance/wgi/Home/downLoadFile?fileName=wgidataset.xlsx
Provided by: Thomas Buonora
Suggested Prompt: None provided, will perform self-exploration around these questions: What countries have the lowest Voice and Accountability ratings today? What countries had the greatest drop in Voice and Accountability ratings since they began tracking for that country? What rating types are most correlated?
Load Libraries
We will be using tidyverse and stringr packages for this exercise.
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(corrr)
library(corrplot)## corrplot 0.90 loaded
Load Data
This assignment will tidy and concatenate 6 global ratings datasets:
- Voice and Accountability Ratings
- Political Stability Ratings
- Control of Corruption Ratings
- Government Effectiveness Ratings
- Regulator Quality Ratings
- Rule of Law Ratings
These ratings are standardized between -2.5 (poor rating) and 2.5 (positive rating).
voice_ratings <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/free_speech_indicators.csv")stability_ratings <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/political_stability.csv")regulator_quality <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/regulator_quality.csv")control_of_corruption <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/control_of_corruption.csv")government_effectiveness <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/government_effectiveness.csv")rule_of_law <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/rule_of_law.csv")Viewing the raw data for voice_ratings, it is evident that this is in a troublesome “wide” format. The data is grouped horizontally by year, but includes multiple measurements per group (estimate, stderr, numsrc, rank, lower, upper). Because of this, a simple gather() will not suffice.
Because we will be analyzing multiple ratings per country, we will focus only on capturing the year data for “Estimate” column, which corresponds to the rating per year, based on the dataset.
head(voice_ratings)As a note, all other input dataframes are of the same structure as voice_ratings.
Based on the above data, it is apparent that both the column names, and the values in row 1 both contribute to the real column values. To be clear, the current column names (starting at column 3) provide information for the year, while row 1 (starting at column 3) classify the column’s datapoints as “Estimate”, “StdError”,“NumSrc”,“Rank”,“Lower”,“Upper”.
To remedy this, we will rename the columns to contain the concatenated strings of colnames() and row 1.
colnames(voice_ratings) <- str_c(colnames(voice_ratings),voice_ratings[1,])colnames(stability_ratings) <- str_c(colnames(stability_ratings),stability_ratings[1,])colnames(control_of_corruption) <- str_c(colnames(control_of_corruption),control_of_corruption[1,])colnames(rule_of_law) <- str_c(colnames(rule_of_law),rule_of_law[1,])colnames(regulator_quality) <- str_c(colnames(regulator_quality),regulator_quality[1,])colnames(government_effectiveness) <- str_c(colnames(government_effectiveness),government_effectiveness[1,])We can now drop row 1 from the tables.
control_of_corruption <- control_of_corruption[-1,]government_effectiveness <- government_effectiveness[-1,]regulator_quality <- regulator_quality[-1,]rule_of_law <- rule_of_law[-1,]stability_ratings <- stability_ratings[-1,]voice_ratings <- voice_ratings[-1,]Looking at the data now, we can see that each column (starting at column 3) contains both the year and the measurement type for that column.
head(voice_ratings)Selecr estimate column for each dataset
Here we will select the “estimate” column for each dataframe, which hold the relevant rankings for each dataset.
control_of_corruption <- control_of_corruption %>%
rename(
country = `...1Country/Territory`,
code = `...2Code`
) %>%
select(
country, code, contains("Estimate")
)government_effectiveness <- government_effectiveness %>%
rename(
country = `...1Country/Territory`,
code = `...2Code`
) %>%
select(
country, code, contains("Estimate")
)regulator_quality <- regulator_quality %>%
rename(
country = `...1Country/Territory`,
code = `...2Code`
) %>%
select(
country, code, contains("Estimate")
)rule_of_law <- rule_of_law %>%
rename(
country = `...1Country/Territory`,
code = `...2Code`
) %>%
select(
country, code, contains("Estimate")
)stability_ratings <- stability_ratings %>%
rename(
country = `...1Country/Territory`,
code = `...2Code`
) %>%
select(
country, code, contains("Estimate")
)voice_ratings <- voice_ratings %>%
rename(
country = `...1Country/Territory`,
code = `...2Code`
) %>%
select(
country, code, contains("Estimate")
)View the below table to get a visualization for what we’re doing here. The returned table includes the country and code columns, as well as every column which pertains the “Estimate” mesasurements.
head(voice_ratings)create function to rename year columns
Now that we’ve split the original data up, we need to begin cleaning up the column names so that we can use gather(). Function should take a string vector as input, and return a year if regex year is detected, otherwise return provided
parse_year <- function(col_value) {
regex <- "\\d{4}"
extracted <- str_extract(col_value, regex)
if (is.na(extracted)) {
return(col_value)
}
return(extracted)
}Rename all colnames for all tables
colnames(voice_ratings) <- colnames(voice_ratings) %>% lapply(parse_year) %>% unlist()Let’s take a look at the first table transformation to see what’s happening
head(voice_ratings)As we can see above, this is the same estimate_data table that we viewed earlier, only now the year columns contain only year values!
colnames(stability_ratings) <- colnames(stability_ratings) %>% lapply(parse_year) %>% unlist()colnames(control_of_corruption) <- colnames(control_of_corruption) %>% lapply(parse_year) %>% unlist()colnames(government_effectiveness) <- colnames(government_effectiveness) %>% lapply(parse_year) %>% unlist()colnames(regulator_quality) <- colnames(regulator_quality) %>% lapply(parse_year) %>% unlist()colnames(rule_of_law) <- colnames(rule_of_law) %>% lapply(parse_year) %>% unlist()Use gather() on all tibbles to convert wide format to long format
Now we are ready to use gather() to convert each of the previously created tibbles into long format.
voice_ratings <- voice_ratings %>%
gather("year","voice_rating",3:24)stability_ratings <- stability_ratings %>%
gather("year","stability_rating",3:24)control_of_corruption <- control_of_corruption %>%
gather("year","control_corr_rating",3:24)government_effectiveness <- government_effectiveness %>%
gather("year","gov_rating",3:24)regulator_quality <- regulator_quality %>%
gather("year","regulator_rating",3:24)rule_of_law <- rule_of_law %>%
gather("year","rule_of_law_rating",3:24)Add all gathered values into original dataframe
Finally, we can use a chained inner_join to merge all of the datasets back into one, “long” formatted dataframe.
final_data <- voice_ratings %>%
inner_join(stability_ratings,by=c("country","code","year")) %>%
inner_join(control_of_corruption,by=c("country","code","year")) %>%
inner_join(government_effectiveness,by=c("country","code","year")) %>%
inner_join(regulator_quality,by=c("country","code","year")) %>%
inner_join(rule_of_law,by=c("country","code","year"))And finally, we can convert all of the measurement columns to double.
final_data[,3:9] <- sapply(final_data[,3:9],as.double)Below is the original data in “long” format.
final_dataQuestions
What countries currently have the worst Voice and Accountability ratings
North Korea, Eritrea, Turkmenistan, Syria, Equatorial Guinea, South Sudan, Somalia, Lao PDR, Tajikistan and Yemen have the lowest Voice and Accountablilty ratings among all included countries.
final_data %>%
filter(year == 2020) %>%
arrange(voice_rating) %>%
select(country, voice_rating) %>%
head(10)Which countries had the greatest drops in free_speech scores across all years?
Create a new tibble containing unique country names
countries_data <- as_tibble(unique(final_data$country))Create function that collects all country estimate entries, then calculates the difference between the first non-na entry and the last non-na entry
get_total_change <- function(country_name){
values_vector <- final_data %>%
filter(country==country_name) %>%
pull(voice_rating) %>%
.[!is.na(.)]
return(tail(values_vector, n=1) - values_vector[1])
}Test out the function on Zimbabwe.
In 1996, Zimbabwe had a Voice and Accountability ranking of -0.61. In 2020, their ranking had dropped to -1.12 (decrease of 0.51)
final_data %>%
filter(country=="Zimbabwe" & year %in% c(1996, 2020))zimbabwe_1996_and_2020_estimates <- final_data %>%
filter(country=="Zimbabwe" & year %in% c(1996, 2020)) %>%
.$voice_rating
zimbabwe_1996_and_2020_estimates[2] - zimbabwe_1996_and_2020_estimates[1]## [1] -0.5099288
get_total_change("Zimbabwe")## [1] -0.5099288
Apply the function to the countries tibble
countries_data$ratings_difference <- countries_data$value %>% lapply(get_total_change) %>% unlist()countries_data %>%
ggplot() +
geom_boxplot(aes(x=ratings_difference))The above boxplot reveals a lot of interesting trends in the dataset:
Most of the countries in the dataset did not see their Voice and Accountablity ratings change much (+/- 0.3) across all years measured.
The median is below zero, which suggests that most countries actually decreased in ratings throughout the years!
Who are the worst countries in terms of ratings change over time?
countries_data %>%
arrange(ratings_difference)