In our class we deal a lot with tidying international datasets where data is presented in wide format, with years as columns. Tidying these datasets into long format, where every column is a variable and every row is an observation, is a critical step that will make it much easier to analyze, visualize, and model your data.
We’ve done some basic examples in class. This short tutorial will teach you how to deal with some common challenges not covered previously.
First, let’s load some useful libraries.
library(tidyverse) # load the tidyverse - always helpful
library(janitor) # has helper functions like clean_names() that can be useful
library(countrycode) # our powerhouse for standardizing country names
In R, the easiest way to learn is to work with small reproducible datasets that contain the problems we face in much bigger real datasets.
The dataset below looks like a lot of international datasets we work with. On the left, there is metadata that describes a time series. It contains the country name, a subject code (often meant to be machine readable, not human readable), and other relevant information. On the right, observations are displayed horizontally under columns for each year.
fake_data <- tibble::tribble(
~Subject.Code, ~Country, ~`2021`, ~`2022`, ~`2023`,
"AB_XYZ", "Austria", "6,234.88", "4,493.55", "7,842.66",
"BC_WQT", "Austria", " 83.306", "82.878", "78.532",
"AB_XYZ", "CÙte d'Ivoire", "1,951.55", "1,842.39", "3,095.52",
"BC_WQT", "CÙte d'Ivoire", "47.583", "52.136", "56.028",
"AB_XYZ", "T¸rkiye", "9,532.14", "8,543.10", "4,966.44",
"BC_WQT", "T¸rkiye", "39.651", "41.804", "37.483"
)
fake_data
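As an aside, the janitor package we loaded above can standardize messy column headers. We won’t use it in this pipeline, since the steps below refer to the original column names, but here is a quick sketch:
fake_data %>%
  # converts headers to snake_case; names that start with a digit get an "x" prefix,
  # so Subject.Code becomes subject_code and 2021 becomes x2021
  clean_names()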
We will go through these challenges one-by-one, and then we will put it together in a data processing pipeline.
Standardizing country names allows you to combine datasets like Legos. It’s critical for sustainable finance research where you’ll be regularly combining datasets across disciplines.
We’ve done this before, but this time we’ll introduce a common challenge to this process that you may face.
As a reminder, we use the countrycode
R package to standardize names. While you can do it however you like, I
choose to convert all country names to ISO3C codes, and then back to
standardized country names from the countrycode package.
First, as we have done previously, we construct two helper functions.
The first one takes a country name and turns it into an iso3c code. It uses regular expressions to try a wide variety of different ways of writing the country name (for example: US, U.S., United States).
country_regex_to_iso3c <- function(country_string) {
country_string %>%
countrycode::countrycode(origin = "country.name", destination = "iso3c", origin_regex = TRUE)
}
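For example, it should resolve several common ways of writing the United States to the same code:
c("US", "U.S.", "United States") %>% country_regex_to_iso3c()
# all three should match to "USA"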
So even though our dataset mangles the letters with accents (a common problem), our function reads it correctly.
"CÙte d'Ivoire" %>% country_regex_to_iso3c()
## [1] "CIV"
And it will play nicely with other datasets that use the English language name for the country.
"Ivory Coast" %>% country_regex_to_iso3c()
## [1] "CIV"
The second function turns ISO3C codes into standardized country names
from the countrycode package. This is useful to make sure
that our country names are always the same between datasets.
iso3c_to_country_name <- function(country_string) {
country_string %>%
countrycode::countrycode(origin = "iso3c", destination = "country.name")
}
"CIV" %>% iso3c_to_country_name()
## [1] "Côte d’Ivoire"
Okay, now let’s try applying it to our dataset.
fake_data %>%
mutate(iso3c = country_regex_to_iso3c(Country))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: T¸rkiye
We get a warning:
Warning: Some values were not matched unambiguously: T¸rkiye
While countrycode is good at parsing a wide variety of
country names, it doesn’t get everything.
In June of 2022, Turkey changed its official name to the Turkish
version of its name, Türkiye.
While countrycode is updated fairly regularly, it hasn’t
registered this change yet.
"Türkiye" %>% country_regex_to_iso3c()
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Türkiye
## [1] NA
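As an aside, countrycode() also has a custom_match argument for hard-coding exceptions like this one directly in the conversion call. A minimal sketch:
"T¸rkiye" %>%
  countrycode::countrycode(origin = "country.name", destination = "iso3c",
                           origin_regex = TRUE,
                           # custom_match is a named vector: names are inputs, values are outputs
                           custom_match = c("T¸rkiye" = "TUR"))
Below, we’ll take a different route and simply rename the country before converting, which keeps the fix visible in the pipeline.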
While this will eventually change, for now we’ll change the name to
Turkey to keep it standardized with the other datasets. We use case_when(),
which works like an expandable if_else() function.
fake_data %>%
# when country equals "T¸rkiye", change it to "Turkey"
mutate(Country = case_when(Country == "T¸rkiye" ~ "Turkey",
# can add others if needed, for example:
# uncomment the line below to add "CÙte d'Ivoire"
#Country == "CÙte d'Ivoire" ~ "Côte d’Ivoire",
# otherwise, leave it as is
TRUE ~ Country))
Now, everything looks great, with a little thanks to our friend
case_when().
fake_data %>%
# when country equals "T¸rkiye", change it to "Turkey"
mutate(Country = case_when(Country == "T¸rkiye" ~ "Turkey",
# otherwise, leave it as is
TRUE ~ Country)) %>%
# standardize country names to iso3c and the countrycode country names
mutate(iso3c = country_regex_to_iso3c(Country),
country_name = iso3c_to_country_name(iso3c))
Problem solved!!!
case_when() is a very useful tool in your data cleaning
toolkit.
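To see how it expands beyond a single condition, here is a minimal standalone sketch (the animals and categories are made up for illustration):
tibble(animal = c("owl", "squirrel", "stegosaurus")) %>%
  mutate(kind = case_when(animal == "owl" ~ "bird",
                          animal == "squirrel" ~ "mammal",
                          # a catch-all for anything we didn't list
                          TRUE ~ "unknown"))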
Tidy data requires that every column is a variable, and every row is an observation.
Datasets of international time series are commonly stored in a wide format, where each year (an observation) is its own column. In order to work with this data using tidy data tools, we’ll want to transform this into a long format.
To do this, we’ll need to select all of the year columns.
We could do this by writing out each year manually:
fake_data %>%
select(`2021`, `2022`, `2023`)
But if you have a lot of years, this is going to take a long time. And if the dataset gets updated, you’ll have to update your code as well. In programming, laziness is a virtue. So we’re going to use the magic of regular expressions.
\\d{4} is a regular expression that will match a string
composed of four consecutive digits. This will find all of the columns whose names are years.
To break this into its component parts:
- \\d: match digits only
- {4}: four of them

You can test this using stringr::str_detect on a vector
containing years (four digits) amongst other text. It labels the years
as TRUE and non-years as FALSE.
c("Apple", "1746", "Banana", "2001", "Orangutan") %>% str_detect(pattern = "\\d{4}")
## [1] FALSE TRUE FALSE TRUE FALSE
fake_data %>%
# `matches` will select every column header that matches a regular expression
select(matches("\\d{4}"))
Now, we can use this same syntax to select the columns we want to pivot longer.
fake_data %>%
pivot_longer(cols = matches("\\d{4}"), names_to = "year")
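By default, pivot_longer() puts the cell contents into a column named value; you can make that explicit (or choose another name) with the values_to argument:
fake_data %>%
  pivot_longer(cols = matches("\\d{4}"), names_to = "year", values_to = "value")
Note that both the new year column and the value column come out as character strings, which brings us to the next challenge.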
As explained in more detail in another post, there are a few common challenges when coercing data from character format into numeric format.
Normally, you can coerce a number in character format into a numeric
using as.numeric()
"-7842.66" %>% as.numeric()
## [1] -7842.66
But if a number over 1,000 contains a comma, the comma will not parse and
as.numeric() will return an NA value.
"-7,842.66" %>% as.numeric()
## Warning in "-7,842.66" %>% as.numeric(): NAs introduced by coercion
## [1] NA
Extra spaces before or after the digits can also cause problems.
So, we create a simple helper function using functions from the stringr package
to help out:
character_num_to_numeric <- function(character_num) {
character_num %>%
# take out the commas
stringr::str_remove_all(pattern = ",") %>%
# take out any blank spaces before or after the number
stringr::str_trim() %>%
# coerce to numeric
as.numeric()
}
This works:
" -7,842.66 " %>% character_num_to_numeric()
## [1] -7842.66
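As an aside, readr::parse_number() (part of the tidyverse) also drops grouping commas and surrounding whitespace, so it can serve as a ready-made alternative to our helper:
" -7,842.66 " %>% readr::parse_number()
# should also return -7842.66
We’ll stick with our own helper below, since it makes each cleaning step explicit.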
Now, we can apply this function to all of the columns that should be numeric values.
fake_data %>%
pivot_longer(cols = matches("\\d{4}"), names_to = "year") %>%
# we will apply the function `character_num_to_numeric` that we created above
# to the columns `year` and `value`
mutate(across(.cols = c(year, value), .fns = character_num_to_numeric))
Nobody knows what AB_XYZ really means. Tell the world.
It means Owls per capita.
Create a tibble using the tribble()
function that has one column with the same name as your subject code (in
this case Subject.Code), and another column with the human readable
versions you’d like to use.
indicator_names <- tribble(~Subject.Code, ~indicator,
"AB_XYZ", "Owls per capita",
"BC_WQT", "Squirrels per capita",
"TY_LME", "Monkeys per capita",
"BH_VBI", "Dinosaurs per capita")
indicator_names
Use left_join() to attach the indicator names to our
dataset. Specify that you are joining by the shared column, in this case
Subject.Code.
fake_data %>%
left_join(indicator_names, by = c("Subject.Code"))
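Before trusting the join, it’s worth checking that every subject code found a match. anti_join() returns the rows of fake_data that have no match in indicator_names, so an empty result means full coverage:
fake_data %>%
  # rows whose Subject.Code has no match in indicator_names; zero rows is what we want
  anti_join(indicator_names, by = "Subject.Code")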
Now, we can put it all together.
processed_data <- fake_data %>%
# Step 1: Standardize Country Names
# when country equals "T¸rkiye", change it to "Turkey"
mutate(Country = case_when(Country == "T¸rkiye" ~ "Turkey",
# otherwise, leave it as is
TRUE ~ Country)) %>%
# standardize country names to iso3c and the countrycode country names
mutate(iso3c = country_regex_to_iso3c(Country),
country_name = iso3c_to_country_name(iso3c)) %>%
# Step 2: Pivot the data longer
# pivot all of the year columns into a column called "year"
pivot_longer(cols = matches("\\d{4}"), names_to = "year") %>%
# Step 3: Coerce all numbers into numeric form from character form
## for the year and value columns, apply the function character_num_to_numeric
## that we created above.
mutate(across(.cols = c(year, value), .fns = character_num_to_numeric)) %>%
# Step 4: Make Human Readable Subject Codes
# we attach our human readable indicator names we created above.
left_join(indicator_names, by = c("Subject.Code")) %>%
# Step 5: Select which variables we want to keep, and in which order
select(country_name, iso3c, indicator, year, value)
processed_data
Yay! Now our data set is clean and tidy. We can write it to file as a .csv, .rds, or any other file type we prefer. In the future we’ll be able to just read in that file and work with cleaned data.
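For example (the file names here are just placeholders):
# as a csv, readable by humans and by other programs
readr::write_csv(processed_data, "processed_data.csv")
# or as an rds, which preserves the R column types exactly
readr::write_rds(processed_data, "processed_data.rds")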