Introduction

Discussion Board Title: World Government Indicators

Dataset: https://info.worldbank.org/governance/wgi/Home/downLoadFile?fileName=wgidataset.xlsx

Provided by: Thomas Buonora

Suggested Prompt: None provided, will perform self-exploration around these questions: What countries have the lowest Voice and Accountability ratings today? What countries had the greatest drop in Voice and Accountability ratings since they began tracking for that country? What rating types are most correlated?

Load Libraries

We will be using tidyverse and stringr packages for this exercise.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(stringr)
library(corrr)
library(corrplot)

## corrplot 0.90 loaded

Load Data

This assignment will tidy and concatenate 6 global ratings datasets:

Voice and Accountability Ratings
Political Stability Ratings
Control of Corruption Ratings
Government Effectiveness Ratings
Regulator Quality Ratings
Rule of Law Ratings

These ratings are standardized between -2.5 (poor rating) and 2.5 (positive rating).

voice_ratings <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/free_speech_indicators.csv")

stability_ratings <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/political_stability.csv")

regulator_quality <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/regulator_quality.csv")

control_of_corruption <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/control_of_corruption.csv")

government_effectiveness <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/government_effectiveness.csv")

rule_of_law <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/project_2/world_indicators/rule_of_law.csv")

Viewing the raw data for voice_ratings, it is evident that this is in a troublesome “wide” format. The data is grouped horizontally by year, but includes multiple measurements per group (estimate, stderr, numsrc, rank, lower, upper). Because of this, a simple gather() will not suffice.

Because we will be analyzing multiple ratings per country, we will focus only on capturing the year data for “Estimate” column, which corresponds to the rating per year, based on the dataset.

head(voice_ratings)

As a note, all other input dataframes are of the same structure as voice_ratings.

Based on the above data, it is apparent that both the column names, and the values in row 1 both contribute to the real column values. To be clear, the current column names (starting at column 3) provide information for the year, while row 1 (starting at column 3) classify the column’s datapoints as “Estimate”, “StdError”,“NumSrc”,“Rank”,“Lower”,“Upper”.

To remedy this, we will rename the columns to contain the concatenated strings of colnames() and row 1.

colnames(voice_ratings) <- str_c(colnames(voice_ratings),voice_ratings[1,])

colnames(stability_ratings) <- str_c(colnames(stability_ratings),stability_ratings[1,])

colnames(control_of_corruption) <- str_c(colnames(control_of_corruption),control_of_corruption[1,])

colnames(rule_of_law) <- str_c(colnames(rule_of_law),rule_of_law[1,])

colnames(regulator_quality) <- str_c(colnames(regulator_quality),regulator_quality[1,])

colnames(government_effectiveness) <- str_c(colnames(government_effectiveness),government_effectiveness[1,])

We can now drop row 1 from the tables.

control_of_corruption <- control_of_corruption[-1,]

government_effectiveness <- government_effectiveness[-1,]

regulator_quality <- regulator_quality[-1,]

rule_of_law <- rule_of_law[-1,]

stability_ratings <- stability_ratings[-1,]

voice_ratings <- voice_ratings[-1,]

Looking at the data now, we can see that each column (starting at column 3) contains both the year and the measurement type for that column.

head(voice_ratings)

Selecr estimate column for each dataset

Here we will select the “estimate” column for each dataframe, which hold the relevant rankings for each dataset.

control_of_corruption <- control_of_corruption %>%
  rename(
    country = `...1Country/Territory`,
    code = `...2Code`
    ) %>%
  select(
    country, code, contains("Estimate")
  )

government_effectiveness <- government_effectiveness %>%
  rename(
    country = `...1Country/Territory`,
    code = `...2Code`
    ) %>%
  select(
    country, code, contains("Estimate")
  )

regulator_quality <- regulator_quality %>%
  rename(
    country = `...1Country/Territory`,
    code = `...2Code`
    ) %>%
  select(
    country, code, contains("Estimate")
  )

rule_of_law <- rule_of_law %>%
  rename(
    country = `...1Country/Territory`,
    code = `...2Code`
    ) %>%
  select(
    country, code, contains("Estimate")
  )

stability_ratings <- stability_ratings %>%
  rename(
    country = `...1Country/Territory`,
    code = `...2Code`
    ) %>%
  select(
    country, code, contains("Estimate")
  )

voice_ratings <- voice_ratings %>%
  rename(
    country = `...1Country/Territory`,
    code = `...2Code`
    ) %>%
  select(
    country, code, contains("Estimate")
  )

View the below table to get a visualization for what we’re doing here. The returned table includes the country and code columns, as well as every column which pertains the “Estimate” mesasurements.

head(voice_ratings)

create function to rename year columns

Now that we’ve split the original data up, we need to begin cleaning up the column names so that we can use gather(). Function should take a string vector as input, and return a year if regex year is detected, otherwise return provided

parse_year <- function(col_value) {
  regex <- "\\d{4}"
  extracted <- str_extract(col_value, regex)
  
  if (is.na(extracted)) {
    return(col_value)
  }
  return(extracted)
}

Rename all colnames for all tables

colnames(voice_ratings) <- colnames(voice_ratings) %>% lapply(parse_year) %>% unlist()

Let’s take a look at the first table transformation to see what’s happening

head(voice_ratings)

As we can see above, this is the same estimate_data table that we viewed earlier, only now the year columns contain only year values!

colnames(stability_ratings) <- colnames(stability_ratings) %>% lapply(parse_year) %>% unlist()

colnames(control_of_corruption) <- colnames(control_of_corruption) %>% lapply(parse_year) %>% unlist()

colnames(government_effectiveness) <- colnames(government_effectiveness) %>% lapply(parse_year) %>% unlist()

colnames(regulator_quality) <- colnames(regulator_quality) %>% lapply(parse_year) %>% unlist()

colnames(rule_of_law) <- colnames(rule_of_law) %>% lapply(parse_year) %>% unlist()

Use gather() on all tibbles to convert wide format to long format

Now we are ready to use gather() to convert each of the previously created tibbles into long format.

voice_ratings <- voice_ratings %>%
  gather("year","voice_rating",3:24)

stability_ratings <- stability_ratings %>%
  gather("year","stability_rating",3:24)

control_of_corruption <- control_of_corruption %>%
  gather("year","control_corr_rating",3:24)

government_effectiveness <- government_effectiveness %>%
  gather("year","gov_rating",3:24)

regulator_quality <- regulator_quality %>%
  gather("year","regulator_rating",3:24)

rule_of_law <- rule_of_law %>%
  gather("year","rule_of_law_rating",3:24)

Add all gathered values into original dataframe

Finally, we can use a chained inner_join to merge all of the datasets back into one, “long” formatted dataframe.

final_data <- voice_ratings %>%
  inner_join(stability_ratings,by=c("country","code","year")) %>%
  inner_join(control_of_corruption,by=c("country","code","year")) %>%
  inner_join(government_effectiveness,by=c("country","code","year")) %>%
  inner_join(regulator_quality,by=c("country","code","year")) %>%
  inner_join(rule_of_law,by=c("country","code","year"))

And finally, we can convert all of the measurement columns to double.

final_data[,3:9] <- sapply(final_data[,3:9],as.double)

Below is the original data in “long” format.

final_data

Questions

What countries currently have the worst Voice and Accountability ratings

North Korea, Eritrea, Turkmenistan, Syria, Equatorial Guinea, South Sudan, Somalia, Lao PDR, Tajikistan and Yemen have the lowest Voice and Accountablilty ratings among all included countries.

final_data %>%
  filter(year == 2020) %>%
  arrange(voice_rating) %>%
  select(country, voice_rating) %>%
  head(10)

Which countries had the greatest drops in free_speech scores across all years?

Create a new tibble containing unique country names

countries_data <- as_tibble(unique(final_data$country))

Create function that collects all country estimate entries, then calculates the difference between the first non-na entry and the last non-na entry

get_total_change <- function(country_name){
  values_vector <- final_data %>%
                    filter(country==country_name) %>%
                    pull(voice_rating) %>%
                    .[!is.na(.)]
  
  return(tail(values_vector, n=1) - values_vector[1])
}

Test out the function on Zimbabwe.

In 1996, Zimbabwe had a Voice and Accountability ranking of -0.61. In 2020, their ranking had dropped to -1.12 (decrease of 0.51)

final_data %>%
  filter(country=="Zimbabwe" & year %in% c(1996, 2020))

zimbabwe_1996_and_2020_estimates <- final_data %>%
  filter(country=="Zimbabwe" & year %in% c(1996, 2020)) %>%
  .$voice_rating

zimbabwe_1996_and_2020_estimates[2] - zimbabwe_1996_and_2020_estimates[1]

## [1] -0.5099288

get_total_change("Zimbabwe")

## [1] -0.5099288

Apply the function to the countries tibble

countries_data$ratings_difference <- countries_data$value %>% lapply(get_total_change) %>% unlist()

countries_data %>%
  ggplot() +
  geom_boxplot(aes(x=ratings_difference))

The above boxplot reveals a lot of interesting trends in the dataset:

Most of the countries in the dataset did not see their Voice and Accountablity ratings change much (+/- 0.3) across all years measured.
The median is below zero, which suggests that most countries actually decreased in ratings throughout the years!

Who are the worst countries in terms of ratings change over time?

countries_data %>%
  arrange(ratings_difference)

Are any of the Indicators correlated?

All of these ratings are highly correlated. In other words, it is likely that a country with high ratings for one category have higher ratings in other categories as well.

Notably, control_of_corruption and rule_of_law are the most highly correlated.

final_data %>%
  select(is.numeric, -year) %>%
  cor(use="complete.obs")

##                     voice_rating stability_rating control_corr_rating
## voice_rating           1.0000000        0.6824166           0.7777878
## stability_rating       0.6824166        1.0000000           0.7430841
## control_corr_rating    0.7777878        0.7430841           1.0000000
## gov_rating             0.7551116        0.7019244           0.9242728
## regulator_rating       0.7780338        0.6552770           0.8689652
## rule_of_law_rating     0.8206824        0.7786554           0.9426131
##                     gov_rating regulator_rating rule_of_law_rating
## voice_rating         0.7551116        0.7780338          0.8206824
## stability_rating     0.7019244        0.6552770          0.7786554
## control_corr_rating  0.9242728        0.8689652          0.9426131
## gov_rating           1.0000000        0.9358783          0.9348742
## regulator_rating     0.9358783        1.0000000          0.9069087
## rule_of_law_rating   0.9348742        0.9069087          1.0000000

final_data %>%
  select(is.numeric, -year) %>%
  cor(use="complete.obs") %>%
  corrplot()

World Health Indicators

Alec

10/3/2021