In our class we deal a lot with tidying international datasets where data is presented in wide format, with years as columns. Tidying these datasets into long format, where every column is a variable and every row is an observation, is a critical step that will make it much easier to analyze, visualize, and model your data.
We’ve done some basic examples in class. This short tutorial will teach you how to deal with some common challenges not covered previously.
First, let’s load some useful libraries.
library(tidyverse) # load the tidyverse - always helpful
library(janitor) # has helper functions like clean_names() that can be useful
library(countrycode) # our powerhouse for standardizing country names
In R, the easiest way to learn is to work with small reproducible datasets that contain the problems we face in much bigger real datasets.
The dataset below looks like a lot of international datasets we work with. On the left, there is metadata that describes a time series. It contains the country name, a subject code (often meant to be machine readable, not human readable), and other relevant information. On the right, observations are displayed horizontally under columns for each year.
fake_data <- tibble::tribble(
~Subject.Code, ~Country, ~`2021`, ~`2022`, ~`2023`,
"AB_XYZ", "Austria", "6,234.88", "4,493.55", "7,842.66",
"BC_WQT", "Austria", " 83.306", "82.878", "78.532",
"AB_XYZ", "CÙte d'Ivoire", "1,951.55", "1,842.39", "3,095.52",
"BC_WQT", "CÙte d'Ivoire", "47.583", "52.136", "56.028",
"AB_XYZ", "T¸rkiye", "9,532.14", "8,543.10", "4,966.44",
"BC_WQT", "T¸rkiye", "39.651", "41.804", "37.483"
)
fake_data
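As an aside, the janitor package we loaded above can standardize messy column headers. We won’t use it in this pipeline, since the steps below refer to the original column names, but here is a quick sketch:
fake_data %>%
  # converts headers to snake_case; names that start with a digit get an "x" prefix,
  # so Subject.Code becomes subject_code and 2021 becomes x2021
  clean_names()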
We will go through these challenges one-by-one, and then we will put it together in a data processing pipeline.
Standardizing country names allows you to combine datasets like Legos. It’s critical for sustainable finance research where you’ll be regularly combining datasets across disciplines.
We’ve done this before, but this time we’ll introduce a common challenge to this process that you may face.
As a reminder, we use the countrycode
R package to standardize names. While you can do it however you like, I
choose to convert all country names to ISO3C codes, and then back to
standardized country names from the countrycode package.
First, as we have done previously, we construct two helper functions.
The first one takes a country name and turns it into an iso3c code. It uses regular expressions to try a wide variety of different ways of writing the country name (for example: US, U.S., United States).
country_regex_to_iso3c <- function(country_string) {
country_string %>%
countrycode::countrycode(origin = "country.name", destination = "iso3c", origin_regex = TRUE)
}
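For example, it should resolve several common ways of writing the United States to the same code:
c("US", "U.S.", "United States") %>% country_regex_to_iso3c()
# all three should match to "USA"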
So even though our dataset mangles the letters with accents (a common problem), our function reads it correctly.
"CÙte d'Ivoire" %>% country_regex_to_iso3c()
## [1] "CIV"
And it will play nicely with other datasets that use the English language name for the country.
"Ivory Coast" %>% country_regex_to_iso3c()
## [1] "CIV"
The second function turns ISO3C codes into standardized country names
from the countrycode package. This is useful to make sure
that our country names are always the same between datasets.
iso3c_to_country_name <- function(country_string) {
country_string %>%
countrycode::countrycode(origin = "iso3c", destination = "country.name")
}
"CIV" %>% iso3c_to_country_name()
## [1] "Côte d’Ivoire"
Okay, now let’s try applying it to our dataset.
fake_data %>%
mutate(iso3c = country_regex_to_iso3c(Country))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: T¸rkiye
We get a warning:
Warning: Some values were not matched unambiguously: T¸rkiye
While countrycode is good at parsing a wide variety of
country names, it doesn’t get everything.
In June of 2022, Turkey changed its official name to the Turkish
version of its name, Türkiye.
While countrycode is updated fairly regularly, it hasn’t
registered this change yet.
"Türkiye" %>% country_regex_to_iso3c()
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Türkiye
## [1] NA
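As an aside, countrycode() also has a custom_match argument for hard-coding exceptions like this one directly in the conversion call. A minimal sketch:
"T¸rkiye" %>%
  countrycode::countrycode(origin = "country.name", destination = "iso3c",
                           origin_regex = TRUE,
                           # custom_match is a named vector: names are inputs, values are outputs
                           custom_match = c("T¸rkiye" = "TUR"))
Below, we’ll take a different route and simply rename the country before converting, which keeps the fix visible in the pipeline.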
While this will eventually change, for now we’ll change the name to
Turkey to keep it standardized with the other datasets. We use case_when(),
which works like an expandable if_else() function.
fake_data %>%
# when country equals "T¸rkiye", change it to "Turkey"
mutate(Country = case_when(Country == "T¸rkiye" ~ "Turkey",
# can add others if needed, for example:
# uncomment the line below to add "CÙte d'Ivoire"
#Country == "CÙte d'Ivoire" ~ "Côte d’Ivoire",
# otherwise, leave it as is
TRUE ~ Country))
Now, everything looks great, with a little thanks to our friend
case_when().
fake_data %>%
# when country equals "T¸rkiye", change it to "Turkey"
mutate(Country = case_when(Country == "T¸rkiye" ~ "Turkey",
# otherwise, leave it as is
TRUE ~ Country)) %>%
# standardize country names to iso3c and the countrycode country names
mutate(iso3c = country_regex_to_iso3c(Country),
country_name = iso3c_to_country_name(iso3c))
Problem solved!!!
case_when() is a very useful tool in your data cleaning
toolkit.
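To see how it expands beyond a single condition, here is a minimal standalone sketch (the animals and categories are made up for illustration):
tibble(animal = c("owl", "squirrel", "stegosaurus")) %>%
  mutate(kind = case_when(animal == "owl" ~ "bird",
                          animal == "squirrel" ~ "mammal",
                          # a catch-all for anything we didn't list
                          TRUE ~ "unknown"))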
Tidy data requires that every column is a variable, and every row is an observation.
Datasets of international time series are commonly stored in a wide format, where each year (an observation) is its own column. In order to work with this data using tidy data tools, we’ll want to transform this into a long format.
To do this, we’ll need to select all of the year columns.
We could do this by writing out each year manually:
fake_data %>%
select(`2021`, `2022`, `2023`)
But if you have a lot of years, this is going to take a long time. And if the dataset gets updated, you’ll have to update your code as well. In programming, laziness is a virtue. So we’re going to use the magic of regular expressions.
\\d{4} is a regular expression that will match a string
composed of four consecutive digits. This will find all of the columns whose names are years.
To break this into its component parts:
- \\d: match digits only
- {4}: four of them

You can test this using stringr::str_detect on a vector
containing years (four digits) amongst other text. It labels the years
as TRUE and non-years as FALSE.
c("Apple", "1746", "Banana", "2001", "Orangutan") %>% str_detect(pattern = "\\d{4}")
## [1] FALSE TRUE FALSE TRUE FALSE
fake_data %>%
# `matches` will select every column header that matches a regular expression
select(matches("\\d{4}"))
Now, we can use this same syntax to select the columns we want to pivot longer.
fake_data %>%
pivot_longer(cols = matches("\\d{4}"), names_to = "year")
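By default, pivot_longer() puts the cell contents into a column named value; you can make that explicit (or choose another name) with the values_to argument:
fake_data %>%
  pivot_longer(cols = matches("\\d{4}"), names_to = "year", values_to = "value")
Note that both the new year column and the value column come out as character strings, which brings us to the next challenge.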
As explained in more detail in another post, there are a few common challenges when coercing data from character format into numeric format.
Normally, you can coerce a number in character format into a numeric
using as.numeric()
"-7842.66" %>% as.numeric()
## [1] -7842.66
But if a number over 1,000 contains a comma, the comma will not parse and
as.numeric() will return an NA value.
"-7,842.66" %>% as.numeric()
## Warning in "-7,842.66" %>% as.numeric(): NAs introduced by coercion
## [1] NA
Extra spaces before or after the digits can also cause problems.
So, we create a simple helper function using functions from the stringr package
to help out:
character_num_to_numeric <- function(character_num) {
character_num %>%
# take out the commas
stringr::str_remove_all(pattern = ",") %>%
# take out any blank spaces before or after the number
stringr::str_trim() %>%
# coerce to numeric
as.numeric()
}
This works:
" -7,842.66 " %>% character_num_to_numeric()
## [1] -7842.66
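As an aside, readr::parse_number() (part of the tidyverse) also drops grouping commas and surrounding whitespace, so it can serve as a ready-made alternative to our helper:
" -7,842.66 " %>% readr::parse_number()
# should also return -7842.66
We’ll stick with our own helper below, since it makes each cleaning step explicit.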
Now, we can apply this function to all of the columns that should be numeric values.
fake_data %>%
pivot_longer(cols = matches("\\d{4}"), names_to = "year") %>%
# we will apply the function `character_num_to_numeric` that we created above
# to the columns `year` and `value`
mutate(across(.cols = c(year, value), .fns = character_num_to_numeric))
Nobody knows what AB_XYZ really means. Tell the world.
It means Owls per capita.
Create a tibble using the tribble()
function that has one column with the same name as your subject code (in
this case Subject.Code), and another column with the human readable
versions you’d like to use.
indicator_names <- tribble(~Subject.Code, ~indicator,
"AB_XYZ", "Owls per capita",
"BC_WQT", "Squirrels per capita",
"TY_LME", "Monkeys per capita",
"BH_VBI", "Dinosaurs per capita")
indicator_names
Use left_join() to attach the indicator names to our
dataset. Specify that you are joining by the shared column, in this case
Subject.Code.
fake_data %>%
left_join(indicator_names, by = c("Subject.Code"))
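Before trusting the join, it’s worth checking that every subject code found a match. anti_join() returns the rows of fake_data that have no match in indicator_names, so an empty result means full coverage:
fake_data %>%
  # rows whose Subject.Code has no match in indicator_names; zero rows is what we want
  anti_join(indicator_names, by = "Subject.Code")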
Now, we can put it all together.
processed_data <- fake_data %>%
# Step 1: Standardize Country Names
# when country equals "T¸rkiye", change it to "Turkey"
mutate(Country = case_when(Country == "T¸rkiye" ~ "Turkey",
# otherwise, leave it as is
TRUE ~ Country)) %>%
# standardize country names to iso3c and the countrycode country names
mutate(iso3c = country_regex_to_iso3c(Country),
country_name = iso3c_to_country_name(iso3c)) %>%
# Step 2: Pivot the data longer
# pivot all of the year columns into a column called "year"
pivot_longer(cols = matches("\\d{4}"), names_to = "year") %>%
# Step 3: Coerce all numbers into numeric form from character form
## for the year and value columns, apply the function character_num_to_numeric
## that we created above.
mutate(across(.cols = c(year, value), .fns = character_num_to_numeric)) %>%
# Step 4: Make Human Readable Subject Codes
# we attach our human readable indicator names we created above.
left_join(indicator_names, by = c("Subject.Code")) %>%
# Step 5: Select which variables we want to keep, and in which order
select(country_name, iso3c, indicator, year, value)
processed_data
Yay! Now our data set is clean and tidy. We can write it to file as a .csv, .rds, or any other file type we prefer. In the future we’ll be able to just read in that file and work with cleaned data.
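For example (the file names here are just placeholders):
# as a csv, readable by humans and by other programs
readr::write_csv(processed_data, "processed_data.csv")
# or as an rds, which preserves the R column types exactly
readr::write_rds(processed_data, "processed_data.rds")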