Toolkit For Working With Country-Level Data
Introduction
This week we are going to start building a toolkit for working with country-level data.
Let’s load our libraries.
library(tidyverse)
library(janitor)
library(here)
library(readxl)
library(rnaturalearth)
library(countrycode)
library(wbstats)
A Toolkit for Working with Sovereign Data
The countrycode Package Is Your Friend
A persistent challenge when working with sovereign-level data is dealing with country names. Countries are referred to by a variety of different names and coding schemes. For example:
- USA vs. US. vs. U.S. vs. United States
- Russia vs. Russian Federation.
The countrycode package in R will be your best friend in dealing with country names.
I’ll show you the workflow I use, which works well for me. I take the raw country names and codes and convert them into:
- ISO3C Codes. These are the three letter country codes. They are helpful because they are unambiguous, and because they are shorter than country names they can be useful for charts.
- Standardized country names provided by the countrycode package.
In the end, all datasets have the same first two columns: country_name (standardized country names), and iso3c. If there are regional groupings, I’ll use an if_else to create a column indicating whether an entity is a region or a country.
What is the advantage of doing this? If all of your datasets use standardized country names, it’s easy to combine datasets.
First, take a look at the help page for the countrycode() function:
?countrycode
Uncomment & run to view the codelist:
#View(codelist)
Now, let’s try it out.
"Russian Federation" |>
# The pipe |> feeds in "Russian Federation" as the first argument (sourcevar)
# origin - we are saying that the origin is the country name
countrycode(origin = "country.name",
# and we want to get the iso3c code
destination = "iso3c",
# and it should use regular expressions to search for different permutations of "Russia"
origin_regex = TRUE)
[1] "RUS"
This is what the regular expression searching for Russia looks like. Don’t worry about understanding it now. Just know that regular expressions are magical, and they are a tool worth learning eventually. This textbook chapter will get you started.
codelist |>
filter(iso3c == "RUS") |>
pull(country.name.en.regex)
[1] "\\brussia|soviet.?union|u\\.?s\\.?s\\.?r|socialist.?republics"
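If you want to see a regex in action, here is a minimal sketch (the test strings are my own, not from the lesson) using stringr’s str_detect(), which loads with the tidyverse:
# str_detect() returns TRUE wherever the pattern matches, however "Russia" is written
russia_regex <- "\\brussia|soviet.?union|u\\.?s\\.?s\\.?r|socialist.?republics"
c("Russia", "Russian Federation", "USSR", "Soviet Union", "Belarus") |>
  str_detect(regex(russia_regex, ignore_case = TRUE))
[1]  TRUE  TRUE  TRUE  TRUE FALSE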
Let’s work hard to be lazy, and let’s make a function so we don’t have to type this over and over.
country_name_regex_to_iso3c <- function(country_name) {
country_name |>
countrycode(origin = "country.name",
destination = "iso3c",
origin_regex = TRUE)
}
And let’s give it a try.
"Bolivarian Republic of Venezuela" |> country_name_regex_to_iso3c()
[1] "VEN"
Now let’s try it the other direction: From iso3c to country name.
"VEN" |>
countrycode(origin = "iso3c", destination = "country.name")
[1] "Venezuela"
And let’s make a function out of that.
iso3c_to_country_name <- function(iso3c) {
iso3c |>
countrycode(origin = "iso3c", destination = "country.name")
}
Does it work? Let’s test.
"VEN" |> iso3c_to_country_name()
[1] "Venezuela"
Now we can put these together:
"DRC" |>
country_name_regex_to_iso3c() |>
iso3c_to_country_name()
[1] "Congo - Kinshasa"
Of course, you can combine these however you find most effective. But I find it useful to have both the written name and the iso3c code.
From here, I use the iso3c code to convert to any other feature available in the countrycode package. Here are a few of the ones I find most useful; a quick example follows the list:
- continent: Continent as defined in the World Bank Development Indicators
- region: 7 Regions as defined in the World Bank Development Indicators
- region23: 23 regions as previously defined in the World Bank Development Indicators (legacy)
- currency: ISO 4217 currency name
- unicode.symbol: emoji flag. Nice for tables and charts
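For instance, here is a minimal sketch (the codes are just illustrative) of converting a vector of iso3c codes to one of these destinations directly with countrycode():
# any of the destinations listed above can be swapped in for "continent"
c("VEN", "DEU", "JPN") |>
  countrycode(origin = "iso3c", destination = "continent")
[1] "Americas" "Europe"   "Asia"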
purrr::partial() is an easy way to create functions where you want to pre-fill some of the variables. Below, we’re going to make a new function iso3c_to_x where we’ve told purrr::partial() to take the function countrycode() and fill in origin = "iso3c". We can just fill in the parts we want:
iso3c_to_x <- purrr::partial(countrycode, origin = "iso3c")
"VEN" |> iso3c_to_x(destination = "region")
[1] "Latin America & Caribbean"
Let’s get the flag:
"VEN" |> iso3c_to_x(destination = "unicode.symbol")
[1] "🇻🇪"
The Benefits of Standardizing Country Names – An Example
Let’s create a mini dataset with country names written in a variety of ways.
messy_country_dataset <- tibble(messy_country = c("The United States of America", "US", "United Mexican States", "Mexico", "The Russian Federation", "Russia", "Czechia", "The Czech Republic"), value = c(7, 2, 5, 2, 8, 4, 1, 4))
messy_country_dataset
# A tibble: 8 × 2
messy_country value
<chr> <dbl>
1 The United States of America 7
2 US 2
3 United Mexican States 5
4 Mexico 2
5 The Russian Federation 8
6 Russia 4
7 Czechia 1
8 The Czech Republic 4
We can then use the functions we’ve built to standardize names.
cleaned_names <- messy_country_dataset |>
mutate(iso3c = country_name_regex_to_iso3c(messy_country)) |>
mutate(country_name = iso3c_to_country_name(iso3c)) |>
mutate(flag = iso3c_to_x(iso3c, destination = "unicode.symbol"))
cleaned_names
# A tibble: 8 × 5
messy_country value iso3c country_name flag
<chr> <dbl> <chr> <chr> <chr>
1 The United States of America 7 USA United States 🇺🇸
2 US 2 USA United States 🇺🇸
3 United Mexican States 5 MEX Mexico 🇲🇽
4 Mexico 2 MEX Mexico 🇲🇽
5 The Russian Federation 8 RUS Russia 🇷🇺
6 Russia 4 RUS Russia 🇷🇺
7 Czechia 1 CZE Czechia 🇨🇿
8 The Czech Republic 4 CZE Czechia 🇨🇿
Now you can do cool stuff with the data easily. Here’s an example.
cleaned_names |>
group_by(flag, country_name, iso3c) |>
summarize(sum_of_value = sum(value)) |>
arrange(sum_of_value)
`summarise()` has grouped output by 'flag', 'country_name'. You can override
using the `.groups` argument.
# A tibble: 4 × 4
# Groups: flag, country_name [4]
flag country_name iso3c sum_of_value
<chr> <chr> <chr> <dbl>
1 🇨🇿 Czechia CZE 5
2 🇲🇽 Mexico MEX 7
3 🇺🇸 United States USA 9
4 🇷🇺 Russia RUS 12
World Bank Data Using The wbstats R Package
The wbstats package makes it easy to get data from the World Bank. Here is a tutorial for getting started.
Oftentimes we’ll want to get the most recent value. With sovereign-level data, there are commonly large data gaps and lags. The wbstats package has a convenient argument to give us the most recent value, which we use below: mrnev = “most recent non-empty value”.
gdp_capita <- wb_data("NY.GDP.PCAP.CD", mrnev = 1)
gdp_capita
# A tibble: 214 × 8
iso2c iso3c country date NY.GDP.PC…¹ obs_s…² footn…³ last_upd…⁴
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <date>
1 AW ABW Aruba 2021 29342. <NA> <NA> 2023-03-01
2 AF AFG Afghanistan 2021 369. <NA> <NA> 2023-03-01
3 AO AGO Angola 2021 1954. <NA> <NA> 2023-03-01
4 AL ALB Albania 2021 6493. <NA> <NA> 2023-03-01
5 AD AND Andorra 2021 42137. <NA> <NA> 2023-03-01
6 AE ARE United Arab Emirates 2021 44316. <NA> <NA> 2023-03-01
7 AR ARG Argentina 2021 10636. <NA> <NA> 2023-03-01
8 AM ARM Armenia 2021 4967. <NA> <NA> 2023-03-01
9 AS ASM American Samoa 2021 15743. <NA> <NA> 2023-03-01
10 AG ATG Antigua and Barbuda 2021 15781. <NA> <NA> 2023-03-01
# … with 204 more rows, and abbreviated variable names ¹NY.GDP.PCAP.CD,
# ²obs_status, ³footnote, ⁴last_updated
You can find interesting relevant datasets on the World Bank’s Sovereign ESG Data Portal.
From experience: the World Bank’s website isn’t the most user friendly. If you find an indicator that you want, google “World Bank” plus the name of the indicator; Google will do a better job than the World Bank’s own website at helping you locate the code.
Try googling: “World Bank Electricity production from coal sources (% of total)”. The indicator’s page should be the first result, and you’ll see the data code at the end of the URL. Should there be an easier way to find the codes? Yes. Is there? Let me know if you find one.
Also, sometimes the World Bank API throws an error. Try running it again 30 seconds later, and often it will work. Only after trying to re-run it once or twice should you move on to other debugging.
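If you want to automate that advice, here is a minimal sketch of a retry wrapper (my own helper, not part of wbstats) that waits and tries again before giving up:
# hypothetical helper: retry a wbstats call a few times before failing for good
wb_data_with_retry <- function(indicator, tries = 3, wait_seconds = 30, ...) {
  for (i in seq_len(tries)) {
    result <- tryCatch(wb_data(indicator, ...), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (i < tries) {
      message("World Bank API call failed (attempt ", i, "). Retrying in ", wait_seconds, " seconds...")
      Sys.sleep(wait_seconds)
    }
  }
  stop("World Bank API call failed after ", tries, " attempts.")
}
# usage: gdp_capita <- wb_data_with_retry("NY.GDP.PCAP.CD", mrnev = 1)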
Everybody Loves a Map
Everybody loves a map. They look cool. And they convey a lot of information about geographic trends in the data.
And, they are easy to make with ggplot2. Here’s an example:
# the World Bank indicator we want
wb_indicator <- "EG.ELC.ACCS.ZS"
# get the metadata from wbstats' wb_cachelist. Will use these for labels
indicator_info <- filter(wb_cachelist$indicators, indicator_id == wb_indicator)
# get the map data in sf format (google it)
world_map_data <- rnaturalearth::ne_countries(returnclass = "sf")
# get the data from the WB API. the code we want is assigned to the variable wb_indicator. Rename the column where the data will be to `value`
wb_data_tbl <- wb_data(c(value = wb_indicator), mrnev = 1)
world_map_data |>
# left join the WB data to the map data by the iso3c country codes. But the iso3c code is called iso_a3 in the map data, so we specify this using by = c("iso_a3" = "iso3c")
left_join(wb_data_tbl, by = c("iso_a3" = "iso3c")) |>
# remove Antarctica. It takes up visual space, and isn't important to this analysis
filter(iso_a3 != "ATA") |>
# start your ggplot!
ggplot(aes(fill = value)) +
# this makes a map using the sf data. So easy, right?
geom_sf() +
# uses the viridis color palette (option G). Looks good on maps, colorblind friendly.
scale_fill_viridis_c(option = "G", labels = scales::percent_format(scale = 1)) +
# move the legend to the bottom
theme(legend.position = "bottom") +
# use the metadata we got above for the labels! Nifty.
labs(
title = indicator_info$indicator,
fill = NULL,
caption = str_wrap(paste("Source:", indicator_info$source_org))
)
This map efficiently presents a clear insight.
Maps are useful both for presentation graphics and for data exploration. They are great for presenting sovereign-level data in papers and presentations, but they are also useful when you are first playing around with the data. You have data for 52 countries? Map it. It will show you where they all are.
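As a quick sketch of that exploratory use (assuming you already have a tibble called my_countries with an iso3c column; the name is made up for illustration), you can flag which countries appear in your data and map the flag:
# sketch: highlight which countries show up in a (hypothetical) dataset called my_countries
ne_countries(returnclass = "sf") |>
  filter(iso_a3 != "ATA") |>
  mutate(in_my_data = iso_a3 %in% my_countries$iso3c) |>
  ggplot(aes(fill = in_my_data)) +
  geom_sf() +
  labs(fill = "In my dataset?")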
Joining Datasets
When working with country-level data, we’ll often be joining together multiple datasets from different sources. You can read more about joins in the textbook. Here we’ll introduce the different joins available in dplyr and show how they can be useful in analyzing country-level data.
We’ll start by looking at two bite-sized datasets. The first contains country names and regions. The second contains country names and capital cities. Each dataset has three observations – two of which overlap with the other dataset.
region_tbl <- tibble(country = c("United States", "China", "Nigeria"), region = c("North America", "Asia", "Africa"))
region_tbl
# A tibble: 3 × 2
country region
<chr> <chr>
1 United States North America
2 China Asia
3 Nigeria Africa
capital_tbl <- tibble(country = c("Nigeria", "Pakistan", "China"), capital = c("Abuja", "Islamabad", "Beijing"))
capital_tbl
# A tibble: 3 × 2
country capital
<chr> <chr>
1 Nigeria Abuja
2 Pakistan Islamabad
3 China Beijing
We have observations for Nigeria and China in both data objects. But region_tbl has an observation for the United States, and capital_tbl has an observation for Pakistan.
Left Join (most common)
Most common join we will use: left_join()
It keeps all of the observations on the left (x), and will join only those observations on the right (y) that share the key (in this case, country).
Because there was no data about the capital of the US in capital_tbl, it shows up as NA.
region_tbl |>
left_join(capital_tbl, by = "country")
# A tibble: 3 × 3
country region capital
<chr> <chr> <chr>
1 United States North America <NA>
2 China Asia Beijing
3 Nigeria Africa Abuja
Right Join
As its name suggests, right_join() does the opposite.
It keeps all of the observations on the right (y), and will join only those observations on the left (x) that share the key (in this case, country).
Notice that the United States observation from region_tbl is no longer there, and that Pakistan, from capital_tbl, is now present, with an NA value for region.
region_tbl |>
right_join(capital_tbl, by = "country")
# A tibble: 3 × 3
country region capital
<chr> <chr> <chr>
1 China Asia Beijing
2 Nigeria Africa Abuja
3 Pakistan <NA> Islamabad
Inner Join
inner_join() keeps only observations that are in both the left (x) and the right (y).
It does not include Pakistan or the US, because they are not in both data structures.
This can be very useful when you’re looking for observations where you have full data.
region_tbl |>
inner_join(capital_tbl, by = "country")
# A tibble: 2 × 3
country region capital
<chr> <chr> <chr>
1 China Asia Beijing
2 Nigeria Africa Abuja
Full Join
full_join() joins together all observations from the left (x) and the right (y). It adds NA values where there is no overlapping data.
region_tbl |>
full_join(capital_tbl, by = "country")
# A tibble: 4 × 3
country region capital
<chr> <chr> <chr>
1 United States North America <NA>
2 China Asia Beijing
3 Nigeria Africa Abuja
4 Pakistan <NA> Islamabad
Anti Join
anti_join() identifies the observations on the left (x) that do not have observations on the right (y).
This is extremely useful for checking to see that you have complete data when you are joining datasets.
For example, here, we learn that we don’t have capital data for the United States, so we can go back and add it.
region_tbl |>
anti_join(capital_tbl, by = "country")
# A tibble: 1 × 2
country region
<chr> <chr>
1 United States North America
The Country Features Dataset
On our class GitHub data repository, there is a dataset called Country Features. It includes each country’s World Bank income group and World Bank region, along with key macro datapoints from the most recent IMF World Economic Outlook (WEO).
country_features <- read_csv("https://raw.githubusercontent.com/t-emery/sais-susfin_data/main/datasets/country_features_2022-10.csv")
Rows: 217 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): country_name, iso3c, wb_income_group, wb_region
dbl (4): debt_gross_percent_of_gdp, nominal_gdp_bn_ppp, nominal_gdp_per_capi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
country_features |> glimpse()
Rows: 217
Columns: 8
$ country_name <chr> "Aruba", "Afghanistan", "Angola", "Albania"…
$ iso3c <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARE", "…
$ wb_income_group <chr> "High", "Low", "Lower Middle", "Upper Middl…
$ wb_region <chr> "Latin America & Caribbean", "South Asia", …
$ debt_gross_percent_of_gdp <dbl> 95.042, 7.397, 56.564, 70.278, 43.050, 30.7…
$ nominal_gdp_bn_ppp <dbl> 5.066, 80.912, 245.442, 51.189, 5.301, 814.…
$ nominal_gdp_per_capita_ppp <dbl> 46308.66, 2456.29, 7455.47, 17858.42, 65371…
$ population_mn <dbl> 0.109, 32.941, 32.921, 2.866, 0.081, 10.544…
The first two columns are country_name and iso3c. They are standardized using the workflow from part 1, so you can attach these easily to other country-level datasets.
Let’s look at our capital_tbl
capital_tbl
# A tibble: 3 × 2
country capital
<chr> <chr>
1 Nigeria Abuja
2 Pakistan Islamabad
3 China Beijing
Now we can add on these useful features:
capital_tbl |>
left_join(country_features, by = c("country" = "country_name"))
# A tibble: 3 × 9
country capital iso3c wb_income_g…¹ wb_re…² debt_…³ nomin…⁴ nomin…⁵ popul…⁶
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Nigeria Abuja NGA Lower Middle Sub-Sa… 37.4 1275. 5884. 217.
2 Pakistan Islamabad PAK Lower Middle South … 77.8 1512. 6662. 227.
3 China Beijing CHN Upper Middle East A… 76.9 30074. 21291. 1413.
# … with abbreviated variable names ¹wb_income_group, ²wb_region,
# ³debt_gross_percent_of_gdp, ⁴nominal_gdp_bn_ppp,
# ⁵nominal_gdp_per_capita_ppp, ⁶population_mn
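The join above works because those three country names happen to match the standardized spellings in country_features exactly. A safer habit, sketched below with the helper function from part 1, is to convert to iso3c first and join on the code:
# sketch: standardize to iso3c first, then join on the code instead of the name
capital_tbl |>
  mutate(iso3c = country_name_regex_to_iso3c(country)) |>
  left_join(country_features, by = "iso3c")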
Data Cleaning: An Example with OECD Emissions Data
We’ll be cleaning up an Excel workbook exported from the OECD’s website. This will help you with your second paper, and it will give you an introduction to dealing with the challenges of real-world data.
The data comes from this table on the OECD’s Data Portal. Note that oftentimes Excel spreadsheets exported from websites are not actually Excel spreadsheets; they are written in XML or other formats. So the first step is to save the file as a .xlsx file. Here we’re going to download the file from the class GitHub data repo.
The file is housed at the URL below. Download the excel file and open it up so you can see what we’re working with.
file_url <- "https://github.com/t-emery/sais-susfin_data/raw/main/datasets/oecd_emissions_data_example.xlsx"
We can extract the file name from the URL using basename().
file_path <- basename(file_url)
file_path
[1] "oecd_emissions_data_example.xlsx"
Now we’ll download the file.
# mode = "wb" ensures binary files like .xlsx download correctly on Windows
download.file(url = file_url, destfile = file_path, mode = "wb")
Read the Excel file. The data starts on row 7, so we’ll use the argument skip to skip the first 6 rows.
oecd_emissions_data_raw <- read_excel(file_path, skip = 6)
New names:
• `` -> `...2`
• `` -> `...4`
• `` -> `...6`
• `` -> `...8`
• `` -> `...10`
• `` -> `...12`
• `` -> `...14`
• `` -> `...16`
• `` -> `...18`
• `` -> `...20`
• `` -> `...22`
• `` -> `...24`
• `` -> `...26`
• `` -> `...28`
• `` -> `...30`
• `` -> `...32`
• `` -> `...34`
• `` -> `...36`
• `` -> `...38`
• `` -> `...40`
• `` -> `...42`
• `` -> `...44`
• `` -> `...46`
• `` -> `...48`
• `` -> `...50`
oecd_emissions_data_raw
# A tibble: 86 × 50
Time ...2 `1995` ...4 `1996` ...6 `1997` ...8 `1998` ...10
<chr> <lgl> <lgl> <dbl> <lgl> <dbl> <lgl> <dbl> <lgl> <dbl>
1 Country NA NA NA NA NA NA NA NA NA
2 WLD: World NA NA 7531. NA 7815. NA 8022. NA 8197.
3 OECD: OECD… NA NA 4431. NA 4522. NA 4716. NA 4881.
4 AUS: Austr… NA NA 130. NA 133. NA 140. NA 158.
5 AUT: Austr… NA NA 21.1 NA 22.2 NA 21.3 NA 21.0
6 BEL: Belgi… NA NA 33.9 NA 32.2 NA 30.3 NA 32.6
7 CAN: Canada NA NA 115. NA 113. NA 127. NA 136.
8 CHL: Chile NA NA 8.23 NA 11.6 NA 14.0 NA 16.2
9 COL: Colom… NA NA 10.8 NA 8.78 NA 11.8 NA 12.1
10 CRI: Costa… NA NA 1.53 NA 1.23 NA 1.22 NA 1.46
# … with 76 more rows, and 40 more variables: `1999` <lgl>, ...12 <dbl>,
# `2000` <lgl>, ...14 <dbl>, `2001` <lgl>, ...16 <dbl>, `2002` <lgl>,
# ...18 <dbl>, `2003` <lgl>, ...20 <dbl>, `2004` <lgl>, ...22 <dbl>,
# `2005` <lgl>, ...24 <dbl>, `2006` <lgl>, ...26 <dbl>, `2007` <lgl>,
# ...28 <dbl>, `2008` <lgl>, ...30 <dbl>, `2009` <lgl>, ...32 <dbl>,
# `2010` <lgl>, ...34 <dbl>, `2011` <lgl>, ...36 <dbl>, `2012` <lgl>,
# ...38 <dbl>, `2013` <lgl>, ...40 <dbl>, `2014` <lgl>, ...42 <dbl>, …
Let’s look at the raw data.
View(oecd_emissions_data_raw)
It looks strange because of the Excel formatting. This is common. But we can fix it.
Data Cleaning, Step-by-Step:
Deal with Weird Excel Formatting
First, let’s grab all of the columns that actually have data:
oecd_emissions_actual_data <- oecd_emissions_data_raw |>
# all the column headers with data are formatted: ...[Number]
select(Time, contains("...")) |>
# except for one, so get rid of it.
select(-`...2`) |>
# get rid of the first row, which has no data
filter(Time != "Country")
oecd_emissions_actual_data
# A tibble: 85 × 25
Time ...4 ...6 ...8 ...10 ...12 ...14 ...16 ...18 ...20 ...22
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 WLD: W… 7.53e3 7.82e3 8.02e3 8.20e3 8.24e3 8.75e3 8.96e3 9.10e3 9.64e3 9.97e3
2 OECD: … 4.43e3 4.52e3 4.72e3 4.88e3 4.95e3 5.29e3 5.28e3 5.20e3 5.41e3 5.54e3
3 AUS: A… 1.30e2 1.33e2 1.40e2 1.58e2 1.57e2 1.62e2 1.63e2 1.82e2 1.75e2 1.89e2
4 AUT: A… 2.11e1 2.22e1 2.13e1 2.10e1 2.09e1 2.14e1 2.21e1 2.19e1 2.57e1 2.62e1
5 BEL: B… 3.39e1 3.22e1 3.03e1 3.26e1 3.15e1 3.34e1 3.13e1 3.08e1 3.28e1 3.49e1
6 CAN: C… 1.15e2 1.13e2 1.27e2 1.36e2 1.35e2 1.41e2 1.44e2 1.42e2 1.54e2 1.54e2
7 CHL: C… 8.23e0 1.16e1 1.40e1 1.62e1 1.74e1 1.51e1 1.28e1 1.32e1 1.43e1 1.66e1
8 COL: C… 1.08e1 8.78e0 1.18e1 1.21e1 7.80e0 9.83e0 9.96e0 9.69e0 9.83e0 9.33e0
9 CRI: C… 1.53e0 1.23e0 1.22e0 1.46e0 1.33e0 1.39e0 1.36e0 1.31e0 1.35e0 1.28e0
10 CZE: C… 5.24e1 5.45e1 5.36e1 5.17e1 5.02e1 5.41e1 5.45e1 5.36e1 5.48e1 5.28e1
# … with 75 more rows, and 14 more variables: ...24 <dbl>, ...26 <dbl>,
# ...28 <dbl>, ...30 <dbl>, ...32 <dbl>, ...34 <dbl>, ...36 <dbl>,
# ...38 <dbl>, ...40 <dbl>, ...42 <dbl>, ...44 <dbl>, ...46 <dbl>,
# ...48 <dbl>, ...50 <dbl>
Next, let’s find the correct column names:
correct_column_names <- oecd_emissions_data_raw |>
# select all columns that don't contain `...` in their name
select(!contains("...")) |>
# selects the column names as a vector
names()
correct_column_names
 [1] "Time" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003"
[11] "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
[21] "2014" "2015" "2016" "2017" "2018"
Now we rename our dataset with the correct column names.
oecd_emissions_actual_step_2 <- oecd_emissions_actual_data |>
rename_with(.f = function(x) correct_column_names) |>
# rename the column to something that makes sense
rename(geography = Time)
oecd_emissions_actual_step_2
# A tibble: 85 × 25
geogr…¹ `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 WLD: W… 7.53e3 7.82e3 8.02e3 8.20e3 8.24e3 8.75e3 8.96e3 9.10e3 9.64e3 9.97e3
2 OECD: … 4.43e3 4.52e3 4.72e3 4.88e3 4.95e3 5.29e3 5.28e3 5.20e3 5.41e3 5.54e3
3 AUS: A… 1.30e2 1.33e2 1.40e2 1.58e2 1.57e2 1.62e2 1.63e2 1.82e2 1.75e2 1.89e2
4 AUT: A… 2.11e1 2.22e1 2.13e1 2.10e1 2.09e1 2.14e1 2.21e1 2.19e1 2.57e1 2.62e1
5 BEL: B… 3.39e1 3.22e1 3.03e1 3.26e1 3.15e1 3.34e1 3.13e1 3.08e1 3.28e1 3.49e1
6 CAN: C… 1.15e2 1.13e2 1.27e2 1.36e2 1.35e2 1.41e2 1.44e2 1.42e2 1.54e2 1.54e2
7 CHL: C… 8.23e0 1.16e1 1.40e1 1.62e1 1.74e1 1.51e1 1.28e1 1.32e1 1.43e1 1.66e1
8 COL: C… 1.08e1 8.78e0 1.18e1 1.21e1 7.80e0 9.83e0 9.96e0 9.69e0 9.83e0 9.33e0
9 CRI: C… 1.53e0 1.23e0 1.22e0 1.46e0 1.33e0 1.39e0 1.36e0 1.31e0 1.35e0 1.28e0
10 CZE: C… 5.24e1 5.45e1 5.36e1 5.17e1 5.02e1 5.41e1 5.45e1 5.36e1 5.48e1 5.28e1
# … with 75 more rows, 14 more variables: `2005` <dbl>, `2006` <dbl>,
# `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
# `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
# `2017` <dbl>, `2018` <dbl>, and abbreviated variable name ¹geography
Standardize Country Names with countrycode()
Now, let’s standardize the country names, and differentiate between countries and regions.
geography_tbl <- oecd_emissions_actual_step_2 |>
select(geography)
geography_tbl
# A tibble: 85 × 1
geography
<chr>
1 WLD: World
2 OECD: OECD member countries
3 AUS: Australia
4 AUT: Austria
5 BEL: Belgium
6 CAN: Canada
7 CHL: Chile
8 COL: Colombia
9 CRI: Costa Rica
10 CZE: Czech Republic
# … with 75 more rows
geography_tbl_2 <- geography_tbl |>
# separate the geography column
# into geography_code and geography_name
# separate at the colon ":"
# don't remove the original column (by default it removes the original column)
separate(col = geography, into = c("geography_code", "geography_name"),
sep = ":", remove = FALSE) |>
# use str_trim() to make sure there are no extra spaces around the new names that might mess up other calculations
mutate(across(c("geography_code", "geography_name"), str_trim))
geography_tbl_2
# A tibble: 85 × 3
geography geography_code geography_name
<chr> <chr> <chr>
1 WLD: World WLD World
2 OECD: OECD member countries OECD OECD member countries
3 AUS: Australia AUS Australia
4 AUT: Austria AUT Austria
5 BEL: Belgium BEL Belgium
6 CAN: Canada CAN Canada
7 CHL: Chile CHL Chile
8 COL: Colombia COL Colombia
9 CRI: Costa Rica CRI Costa Rica
10 CZE: Czech Republic CZE Czech Republic
# … with 75 more rows
Great, now we have two separate columns with the geography codes and the geography names.
Next, let’s find standardized country names using countrycode(). The places where there are NA values should all be regions. Let’s check.
geography_tbl_2 |>
mutate(country_name = countrycode(sourcevar = geography_code,
origin = "iso3c",
destination = "country.name")) |>
filter(is.na(country_name))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
"iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
# A tibble: 19 × 4
geography geogr…¹ geogr…² count…³
<chr> <chr> <chr> <chr>
1 WLD: World WLD World <NA>
2 OECD: OECD member countries OECD OECD m… <NA>
3 NONOECD: Non-OECD economies and aggregates NONOECD Non-OE… <NA>
4 ROW: Rest of the World ROW Rest o… <NA>
5 APEC: Asia-Pacific Economic Cooperation APEC Asia-P… <NA>
6 ASEAN: Association of South East Asian Nations ASEAN Associ… <NA>
7 EASIA: Eastern Asia EASIA Easter… <NA>
8 EU27_2020: European Union (27 countries) EU27_2… Europe… <NA>
9 EU28: European Union (28 countries) EU28 Europe… <NA>
10 EU15: European Union (15 countries) EU15 Europe… <NA>
11 EU13: EU28 excluding EU15 EU13 EU28 e… <NA>
12 EA19: Euro area (19 countries) EA19 Euro a… <NA>
13 G20: Group of Twenty G20 Group … <NA>
14 ZEUR: Europe ZEUR Europe <NA>
15 ZASI: East and Southeastern Asia ZASI East a… <NA>
16 ZNAM: North America ZNAM North … <NA>
17 ZSCA: South and Central America ZSCA South … <NA>
18 ZOTH: Other regions ZOTH Other … <NA>
19 Data extracted on 26 Feb 2023 00:16 UTC (GMT) from O… Data e… 16 UTC… <NA>
# … with abbreviated variable names ¹geography_code, ²geography_name,
# ³country_name
Okay, all the places where country_name is missing are regions. We can use that to our advantage.
geography_processed <- geography_tbl_2 |>
# use countrycode() to give the countries standardized names
mutate(country_name = countrycode(sourcevar = geography_code,
origin = "iso3c",
destination = "country.name")) |>
# use if_else to make a column telling whether the entity is a region or a country
# if country_name is NA, then it is a region, if not it is a country
mutate(country_or_region = if_else(is.na(country_name),
true = "region",
false = "country")) |>
# fill in the region names into country_name using the same logic
mutate(country_name = if_else(is.na(country_name),
true = geography_name,
false = country_name)) |>
# select the columns we want.
# we'll rename geography_code into iso3c to match our other standardized datasets.
# we'll use the original geography column to join it to the original dataset.
select(geography, country_name, iso3c = geography_code, country_or_region)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
"iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
geography_processed
# A tibble: 85 × 4
geography country_name iso3c country_or_region
<chr> <chr> <chr> <chr>
1 WLD: World World WLD region
2 OECD: OECD member countries OECD member countries OECD region
3 AUS: Australia Australia AUS country
4 AUT: Austria Austria AUT country
5 BEL: Belgium Belgium BEL country
6 CAN: Canada Canada CAN country
7 CHL: Chile Chile CHL country
8 COL: Colombia Colombia COL country
9 CRI: Costa Rica Costa Rica CRI country
10 CZE: Czech Republic Czechia CZE country
# … with 75 more rows
Now we join our standardized country names + region/country flag back onto the main dataset. We attach it by the geography column. But once we’ve done that, we can get rid of that column.
step3 <- geography_processed |>
left_join(oecd_emissions_actual_step_2, by = "geography") |>
# now that we've joined this back to the dataset, we don't need the geography column anymore
select(-geography)
step3
# A tibble: 85 × 27
count…¹ iso3c count…² `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 World WLD region 7.53e3 7.82e3 8.02e3 8.20e3 8.24e3 8.75e3 8.96e3 9.10e3
2 OECD m… OECD region 4.43e3 4.52e3 4.72e3 4.88e3 4.95e3 5.29e3 5.28e3 5.20e3
3 Austra… AUS country 1.30e2 1.33e2 1.40e2 1.58e2 1.57e2 1.62e2 1.63e2 1.82e2
4 Austria AUT country 2.11e1 2.22e1 2.13e1 2.10e1 2.09e1 2.14e1 2.21e1 2.19e1
5 Belgium BEL country 3.39e1 3.22e1 3.03e1 3.26e1 3.15e1 3.34e1 3.13e1 3.08e1
6 Canada CAN country 1.15e2 1.13e2 1.27e2 1.36e2 1.35e2 1.41e2 1.44e2 1.42e2
7 Chile CHL country 8.23e0 1.16e1 1.40e1 1.62e1 1.74e1 1.51e1 1.28e1 1.32e1
8 Colomb… COL country 1.08e1 8.78e0 1.18e1 1.21e1 7.80e0 9.83e0 9.96e0 9.69e0
9 Costa … CRI country 1.53e0 1.23e0 1.22e0 1.46e0 1.33e0 1.39e0 1.36e0 1.31e0
10 Czechia CZE country 5.24e1 5.45e1 5.36e1 5.17e1 5.02e1 5.41e1 5.45e1 5.36e1
# … with 75 more rows, 16 more variables: `2003` <dbl>, `2004` <dbl>,
# `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
# `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
# `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, and abbreviated
# variable names ¹country_name, ²country_or_region
Pivot the Dataset into Long Format
Make it long form:
step4 <- step3 |>
# Use a `regular expression` (aka regex) to select any column name that is 4 digits (d{4})
pivot_longer(cols = matches("\\d{4}"),
# change the names column to `year`
names_to = "year",
# transform it into to integer data type (from character)
names_transform = as.integer)
step4
# A tibble: 2,040 × 5
country_name iso3c country_or_region year value
<chr> <chr> <chr> <int> <dbl>
1 World WLD region 1995 7531.
2 World WLD region 1996 7815.
3 World WLD region 1997 8022.
4 World WLD region 1998 8197.
5 World WLD region 1999 8238.
6 World WLD region 2000 8752.
7 World WLD region 2001 8958.
8 World WLD region 2002 9095.
9 World WLD region 2003 9637.
10 World WLD region 2004 9974.
# … with 2,030 more rows
Attach Metadata
Get the metadata for our series and attach it.
oecd_metadata <- read_excel(file_path, range = "a3:c6", col_names = c("metadata_parameter", "blank", "metadata_value")) |>
select(-blank) |>
pivot_wider(names_from = metadata_parameter, values_from = metadata_value)
oecd_metadata
# A tibble: 1 × 4
Indicator Indus…¹ Partner Unit
<chr> <chr> <chr> <chr>
1 FD_CO2: CO2 emissions embodied in domestic final demand… D35: E… WLD: W… Tonn…
# … with abbreviated variable name ¹Industry
Alternatively, you can make and attach your own shorter metadata description. This can be useful for analysis and charting.
my_oecd_metadata <- tribble(~my_indicator_name,
"CO2 from electricity, gas, and heating/cooling")
my_oecd_metadata
# A tibble: 1 × 1
my_indicator_name
<chr>
1 CO2 from electricity, gas, and heating/cooling
You can add that on in addition to, or instead of, the official metadata. It’s better to save both.
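For instance, a minimal sketch of attaching the short custom description, using the same bind_cols() pattern we use for the official metadata below:
# sketch: bind_cols() recycles the one-row tibble across all rows of step4
step4 |>
  bind_cols(my_oecd_metadata)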
The Tidy dataset
Now we have a complete tidy dataset that is ready for analysis.
What are the key elements?
- The first two columns are country_name and iso3c, standardized using the countrycode package. The third column flags whether the entity is a region or a country.
- All columns are the correct data type: year is integer, value is double (numeric).
- Metadata gives us a clear sense of exactly what our data is.
- Column names are in snake_case, so they are easy to work with in R.
step4 |>
bind_cols(oecd_metadata) |>
clean_names()
# A tibble: 2,040 × 9
country_name iso3c country_or_reg…¹ year value indic…² indus…³ partner unit
<chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <chr>
1 World WLD region 1995 7531. FD_CO2… D35: E… WLD: W… Tonn…
2 World WLD region 1996 7815. FD_CO2… D35: E… WLD: W… Tonn…
3 World WLD region 1997 8022. FD_CO2… D35: E… WLD: W… Tonn…
4 World WLD region 1998 8197. FD_CO2… D35: E… WLD: W… Tonn…
5 World WLD region 1999 8238. FD_CO2… D35: E… WLD: W… Tonn…
6 World WLD region 2000 8752. FD_CO2… D35: E… WLD: W… Tonn…
7 World WLD region 2001 8958. FD_CO2… D35: E… WLD: W… Tonn…
8 World WLD region 2002 9095. FD_CO2… D35: E… WLD: W… Tonn…
9 World WLD region 2003 9637. FD_CO2… D35: E… WLD: W… Tonn…
10 World WLD region 2004 9974. FD_CO2… D35: E… WLD: W… Tonn…
# … with 2,030 more rows, and abbreviated variable names ¹country_or_region,
# ²indicator, ³industry
Consolidate The Workflow
Okay, now let’s put this all into one workflow so we can use it for other OECD data exports.
file_path <- basename(file_url)
oecd_emissions_data_raw <- read_excel(file_path, skip = 6)
New names:
• `` -> `...2`
• `` -> `...4`
• `` -> `...6`
• `` -> `...8`
• `` -> `...10`
• `` -> `...12`
• `` -> `...14`
• `` -> `...16`
• `` -> `...18`
• `` -> `...20`
• `` -> `...22`
• `` -> `...24`
• `` -> `...26`
• `` -> `...28`
• `` -> `...30`
• `` -> `...32`
• `` -> `...34`
• `` -> `...36`
• `` -> `...38`
• `` -> `...40`
• `` -> `...42`
• `` -> `...44`
• `` -> `...46`
• `` -> `...48`
• `` -> `...50`
correct_column_names <- oecd_emissions_data_raw |>
select(!contains("...")) |>
names()
# grab the columns that actually contain data
oecd_emissions_actual_data <- oecd_emissions_data_raw |>
select(Time, contains("...")) |>
select(-`...2`) |>
filter(Time != "Country")
oecd_emissions_actual_step_2 <- oecd_emissions_actual_data |>
rename_with(.f = function(x) correct_column_names) |>
rename(geography = Time)
geography_tbl <- oecd_emissions_actual_step_2 |>
select(geography)
geography_processed <- geography_tbl |>
# separate the geography column
# into geography_code and geography_name
# separate at the colon ":"
# don't remove the original column (by default it removes the original column)
separate(col = geography, into = c("geography_code", "geography_name"),
sep = ":", remove = FALSE) |>
# use str_trim() to make sure there are no extra spaces around the new names that might mess up other calculations
mutate(across(c("geography_code", "geography_name"), str_trim)) |>
# use countrycode() to give the countries standardized names
mutate(country_name = countrycode(sourcevar = geography_code,
origin = "iso3c",
destination = "country.name")) |>
# use if_else to make a column telling whether the entity is a region or a country
# if country_name is NA, then it is a region, if not it is a country
mutate(country_or_region = if_else(is.na(country_name),
true = "region",
false = "country")) |>
# fill in the region names into country_name using the same logic
mutate(country_name = if_else(is.na(country_name),
true = geography_name,
false = country_name)) |>
# select the columns we want.
# we'll rename geography_code into iso3c to match our other standardized datasets.
# we'll use the original geography column to join it to the original dataset.
select(geography, country_name, iso3c = geography_code, country_or_region)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
"iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
data_processed <- geography_processed |>
left_join(oecd_emissions_actual_step_2, by = "geography") |>
# now that we've joined this back to the dataset, we don't need the geography column anymore
select(-geography) |>
# Use a `regular expression` (aka regex) to select any column name that is 4 digits (d{4})
pivot_longer(cols = matches("\\d{4}"),
# change the names column to `year`
names_to = "year",
# transform it into to integer data type (from character)
names_transform = as.integer)
oecd_metadata <- read_excel(file_path, range = "a3:c6", col_names = c("metadata_parameter", "blank", "metadata_value")) |>
select(-blank) |>
pivot_wider(names_from = metadata_parameter, values_from = metadata_value)
final_data <- data_processed |>
bind_cols(oecd_metadata) |>
janitor::clean_names()
final_data
# A tibble: 2,040 × 9
country_name iso3c country_or_reg…¹ year value indic…² indus…³ partner unit
<chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <chr>
1 World WLD region 1995 7531. FD_CO2… D35: E… WLD: W… Tonn…
2 World WLD region 1996 7815. FD_CO2… D35: E… WLD: W… Tonn…
3 World WLD region 1997 8022. FD_CO2… D35: E… WLD: W… Tonn…
4 World WLD region 1998 8197. FD_CO2… D35: E… WLD: W… Tonn…
5 World WLD region 1999 8238. FD_CO2… D35: E… WLD: W… Tonn…
6 World WLD region 2000 8752. FD_CO2… D35: E… WLD: W… Tonn…
7 World WLD region 2001 8958. FD_CO2… D35: E… WLD: W… Tonn…
8 World WLD region 2002 9095. FD_CO2… D35: E… WLD: W… Tonn…
9 World WLD region 2003 9637. FD_CO2… D35: E… WLD: W… Tonn…
10 World WLD region 2004 9974. FD_CO2… D35: E… WLD: W… Tonn…
# … with 2,030 more rows, and abbreviated variable names ¹country_or_region,
# ²indicator, ³industry
Make a Data Processing Function
Now we can make it into a function, process_oecd_emissions_xlsx():
process_oecd_emissions_xlsx <- function(file_path_xlsx) {
oecd_emissions_data_raw <- read_excel(file_path_xlsx, skip = 6)
correct_column_names <- oecd_emissions_data_raw |>
select(!contains("...")) |>
names()
# grab the columns that actually contain data
oecd_emissions_actual_data <- oecd_emissions_data_raw |>
select(Time, contains("...")) |>
select(-`...2`) |>
filter(Time != "Country")
oecd_emissions_actual_step_2 <- oecd_emissions_actual_data |>
rename_with(.f = function(x) correct_column_names) |>
rename(geography = Time)
geography_tbl <- oecd_emissions_actual_step_2 |>
select(geography)
geography_processed <- geography_tbl |>
# separate the geography column
# into geography_code and geography_name
# separate at the colon ":"
# don't remove the original column (by default it removes the original column)
separate(col = geography, into = c("geography_code", "geography_name"),
sep = ":", remove = FALSE) |>
# use str_trim() to make sure there are no extra spaces around the new names that might mess up other calculations
mutate(across(c("geography_code", "geography_name"), str_trim)) |>
# use countrycode() to give the countries standardized names
mutate(country_name = countrycode(sourcevar = geography_code,
origin = "iso3c",
destination = "country.name")) |>
# use if_else to make a column telling whether the entity is a region or a country
# if country_name is NA, then it is a region, if not it is a country
mutate(country_or_region = if_else(is.na(country_name),
true = "region",
false = "country")) |>
# fill in the region names into country_name using the same logic
mutate(country_name = if_else(is.na(country_name),
true = geography_name,
false = country_name)) |>
# select the columns we want.
# we'll rename geography_code into iso3c to match our other standardized datasets.
# we'll use the original geography column to join it to the original dataset.
select(geography, country_name, iso3c = geography_code, country_or_region)
data_processed <- geography_processed |>
left_join(oecd_emissions_actual_step_2, by = "geography") |>
# now that we've joined this back to the dataset, we don't need the geography column anymore
select(-geography) |>
# Use a `regular expression` (aka regex) to select any column name that is 4 digits (d{4})
pivot_longer(cols = matches("\\d{4}"),
# change the names column to `year`
names_to = "year",
# transform it into to integer data type (from character)
names_transform = as.integer)
oecd_metadata <- read_excel(file_path_xlsx, range = "a3:c6", col_names = c("metadata_parameter", "blank", "metadata_value")) |>
select(-blank) |>
pivot_wider(names_from = metadata_parameter, values_from = metadata_value)
final_data <- data_processed |> bind_cols(oecd_metadata) |>
janitor::clean_names()
final_data
}
processed_data <- process_oecd_emissions_xlsx(file_path_xlsx = "oecd_emissions_data_example.xlsx")
New names:
• `` -> `...2`
• `` -> `...4`
• `` -> `...6`
• `` -> `...8`
• `` -> `...10`
• `` -> `...12`
• `` -> `...14`
• `` -> `...16`
• `` -> `...18`
• `` -> `...20`
• `` -> `...22`
• `` -> `...24`
• `` -> `...26`
• `` -> `...28`
• `` -> `...30`
• `` -> `...32`
• `` -> `...34`
• `` -> `...36`
• `` -> `...38`
• `` -> `...40`
• `` -> `...42`
• `` -> `...44`
• `` -> `...46`
• `` -> `...48`
• `` -> `...50`
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
"iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
processed_data
# A tibble: 2,040 × 9
country_name iso3c country_or_reg…¹ year value indic…² indus…³ partner unit
<chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <chr>
1 World WLD region 1995 7531. FD_CO2… D35: E… WLD: W… Tonn…
2 World WLD region 1996 7815. FD_CO2… D35: E… WLD: W… Tonn…
3 World WLD region 1997 8022. FD_CO2… D35: E… WLD: W… Tonn…
4 World WLD region 1998 8197. FD_CO2… D35: E… WLD: W… Tonn…
5 World WLD region 1999 8238. FD_CO2… D35: E… WLD: W… Tonn…
6 World WLD region 2000 8752. FD_CO2… D35: E… WLD: W… Tonn…
7 World WLD region 2001 8958. FD_CO2… D35: E… WLD: W… Tonn…
8 World WLD region 2002 9095. FD_CO2… D35: E… WLD: W… Tonn…
9 World WLD region 2003 9637. FD_CO2… D35: E… WLD: W… Tonn…
10 World WLD region 2004 9974. FD_CO2… D35: E… WLD: W… Tonn…
# … with 2,030 more rows, and abbreviated variable names ¹country_or_region,
# ²indicator, ³industry
processed_data |> glimpse()
Rows: 2,040
Columns: 9
$ country_name <chr> "World", "World", "World", "World", "World", "World"…
$ iso3c <chr> "WLD", "WLD", "WLD", "WLD", "WLD", "WLD", "WLD", "WL…
$ country_or_region <chr> "region", "region", "region", "region", "region", "r…
$ year <int> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003…
$ value <dbl> 7530.733, 7815.068, 8022.339, 8197.269, 8238.322, 87…
$ indicator <chr> "FD_CO2: CO2 emissions embodied in domestic final de…
$ industry <chr> "D35: Electricity, gas, steam and air conditioning s…
$ partner <chr> "WLD: World", "WLD: World", "WLD: World", "WLD: Worl…
$ unit <chr> "Tonnes, Millions", "Tonnes, Millions", "Tonnes, Mil…
Save the Processed Data
Write this data to your `00_data_processed` folder.
# use here::here() to specify relative paths from your projects base directory
write_csv(processed_data, here("00_data_processed", "oecd_data_xyz_processed.csv"))
# if you have a really big file, or if you have data that is not rectangular (nested data, map data, etc..), writing to a .rds file is a good option.
write_rds(processed_data, here("00_data_processed", "oecd_data_xyz_processed.rds"), compress = "gz")
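Later, reading the processed file back in is a one-liner (a quick sketch, assuming the same project structure):
# read the processed data back in a later session
processed_data <- read_csv(here("00_data_processed", "oecd_data_xyz_processed.csv"))
# or, for the .rds version:
processed_data <- read_rds(here("00_data_processed", "oecd_data_xyz_processed.rds"))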