Toolkit For Working With Country-Level Data

Author

Teal Emery

Introduction

This week we are going to start building a toolkit for working with country-level data.

Let’s load our libraries

library(tidyverse) 
library(janitor)
library(here)
library(readxl)

library(rnaturalearth)
library(countrycode)
library(wbstats)

A Toolkit for working with sovereign data

The countrycode Package Is Your Friend

A persistent challenge for working with sovereign-level data is dealing with country names. Countries are referred to be a variety of different names and coding schemes. For example:

  • USA vs. US. vs. U.S. vs. United States
  • Russia vs. Russian Federation.

The countrycode package in R will be your best friend in dealing with country names.

I’ll show you the workflow I use, and which works well for me. I take the raw country names and codes and convert them into

  1. ISO3C Codes. These are the three letter country codes. They are helpful because they are unambiguous, and because they are shorter than country names they can be useful for charts.
  2. Standardized country names provided by the country code package.

In the end, all datasets have the same first two columns: country_name (standardized country names), and iso3c. If there are regional groupings, I’ll use an if_else to create a column indicating whether an entity is a region or a country.

What is the advantage of doing this? If all of your datasets use standardized country names, it’s easy to combine datasets.

First, take a look at the help page for the countrycode() function:

?countrycode

Uncomment & run to view the codelist

#View(codelist)

Now, let’s try it out.

"Russian Federation" |> 
  # The pipe |> feeds in "Russian Federation as the first argument (sourcevar)
  # origin - we are saying that the origin is the country name
  countrycode(origin = "country.name", 
              # and we want to get the iso3c code
              destination = "iso3c", 
              # and it should use regular expressions to search for different permutations of "Russia"
              origin_regex = TRUE)
[1] "RUS"

This is what the regular expression searching for Russia looks like. Don’t worry about understanding it now. Just know that regular expressions are magical, and they are a tool worth learning eventually. This textbook chapter will get you started.

codelist |> 
  filter(iso3c == "RUS") |> 
  pull(country.name.en.regex)
[1] "\\brussia|soviet.?union|u\\.?s\\.?s\\.?r|socialist.?republics"

Let’s work hard to be lazy, and lets make a function so we don’t have to type this over and over.

country_name_regex_to_iso3c <- function(country_name) {
  country_name |>
    countrycode(origin = "country.name", 
                                     destination = "iso3c",
                                     origin_regex = TRUE)
}

And let’s give it a try

"Bolivarian Republic of Venezuela" |> country_name_regex_to_iso3c()
[1] "VEN"

Now let’s try it the other direction: From iso3c to country name.

"VEN" |>
  countrycode(origin = "iso3c", destination = "country.name")
[1] "Venezuela"

And let’s make a function out of that.

iso3c_to_country_name <- function(iso3c) {
  iso3c |>
  countrycode(origin = "iso3c", destination = "country.name")
}

Does it work? Let’s test.

"VEN" |> iso3c_to_country_name()
[1] "Venezuela"

Now we can put these together:

"DRC" |>
  country_name_regex_to_iso3c() |>
  iso3c_to_country_name()
[1] "Congo - Kinshasa"

Of course, you can combine these together and do whatever you find most effective. But I find it useful both to have the written name and the iso3c code.

From here, I use the iso3c to convert to any other feature available in the countrycode package

Here are a few of the ones I find most useful

  • continent: Continent as defined in the World Bank Development Indicators
  • region: 7 Regions as defined in the World Bank Development Indicators
  • region23: 23 Regions as used to be in the World Bank Development Indicators (legacy)
  • currency: ISO 4217 currency name
  • unicode.symbol: emoji flag. Nice for tables and charts

purrr::partial() is an easy way to create functions where you want to pre-fill some of the variables. Below, we’re going to make a new function iso3c_to_x where we’ve told purrr::partial() to take the function countrycode() and fill in origin = "iso3c". We can just fill in the parts we want:

iso3c_to_x <- purrr::partial(countrycode, origin = "iso3c")

"VEN" |> iso3c_to_x(destination = "region")
[1] "Latin America & Caribbean"

Let’s get the flag:

"VEN" |> iso3c_to_x(destination = "unicode.symbol")
[1] "🇻🇪"

The Benefits of Standardizing Country Names – An Example

Let’s create a mini dataset with country names written in a variety of ways.

messy_country_dataset <- tibble(messy_country = c("The United States of America", "US", "United Mexican States", "Mexico", "The Russian Federation", "Russia", "Czechia", "The Czech Republic"), value = c(7, 2, 5, 2, 8, 4, 1, 4)) 

messy_country_dataset
# A tibble: 8 × 2
  messy_country                value
  <chr>                        <dbl>
1 The United States of America     7
2 US                               2
3 United Mexican States            5
4 Mexico                           2
5 The Russian Federation           8
6 Russia                           4
7 Czechia                          1
8 The Czech Republic               4

We can then use the functions we’ve built to standardize names.

cleaned_names <- messy_country_dataset |>
  mutate(iso3c = country_name_regex_to_iso3c(messy_country)) |>
  mutate(country_name = iso3c_to_country_name(iso3c)) |>
  mutate(flag = iso3c_to_x(iso3c, destination = "unicode.symbol"))

cleaned_names
# A tibble: 8 × 5
  messy_country                value iso3c country_name  flag 
  <chr>                        <dbl> <chr> <chr>         <chr>
1 The United States of America     7 USA   United States 🇺🇸   
2 US                               2 USA   United States 🇺🇸   
3 United Mexican States            5 MEX   Mexico        🇲🇽   
4 Mexico                           2 MEX   Mexico        🇲🇽   
5 The Russian Federation           8 RUS   Russia        🇷🇺   
6 Russia                           4 RUS   Russia        🇷🇺   
7 Czechia                          1 CZE   Czechia       🇨🇿   
8 The Czech Republic               4 CZE   Czechia       🇨🇿   

Now you can do cool stuff with the data easily. Here’s an example.

cleaned_names |>
  group_by(flag, country_name, iso3c) |>
  summarize(sum_of_value = sum(value)) |>
  arrange(sum_of_value) 
`summarise()` has grouped output by 'flag', 'country_name'. You can override
using the `.groups` argument.
# A tibble: 4 × 4
# Groups:   flag, country_name [4]
  flag  country_name  iso3c sum_of_value
  <chr> <chr>         <chr>        <dbl>
1 🇨🇿    Czechia       CZE              5
2 🇲🇽    Mexico        MEX              7
3 🇺🇸    United States USA              9
4 🇷🇺    Russia        RUS             12

World Bank Data Using The wbstats R Package

The wbstats package makes it easy to get data from the World Bank. Here is a tutorial for getting started.

Often times we’ll want to get the most recent value. With sovereign level data there are commonly large data gaps and lags. The wbstats package has a nice function to give us the most recent value, which we use below. mrnev = “Most recent non-empty value”.

gdp_capita <- wb_data("NY.GDP.PCAP.CD", mrnev = 1)

gdp_capita
# A tibble: 214 × 8
   iso2c iso3c country               date NY.GDP.PC…¹ obs_s…² footn…³ last_upd…⁴
   <chr> <chr> <chr>                <dbl>       <dbl> <chr>   <chr>   <date>    
 1 AW    ABW   Aruba                 2021      29342. <NA>    <NA>    2023-03-01
 2 AF    AFG   Afghanistan           2021        369. <NA>    <NA>    2023-03-01
 3 AO    AGO   Angola                2021       1954. <NA>    <NA>    2023-03-01
 4 AL    ALB   Albania               2021       6493. <NA>    <NA>    2023-03-01
 5 AD    AND   Andorra               2021      42137. <NA>    <NA>    2023-03-01
 6 AE    ARE   United Arab Emirates  2021      44316. <NA>    <NA>    2023-03-01
 7 AR    ARG   Argentina             2021      10636. <NA>    <NA>    2023-03-01
 8 AM    ARM   Armenia               2021       4967. <NA>    <NA>    2023-03-01
 9 AS    ASM   American Samoa        2021      15743. <NA>    <NA>    2023-03-01
10 AG    ATG   Antigua and Barbuda   2021      15781. <NA>    <NA>    2023-03-01
# … with 204 more rows, and abbreviated variable names ¹​NY.GDP.PCAP.CD,
#   ²​obs_status, ³​footnote, ⁴​last_updated

You can find interesting relevant datasets on the World Bank’s Sovereign ESG Data Portal.

From experience: The World Bank’s website isn’t the most user friendly. If you find an indicator that you want google “World Bank” and the name of the indicator, and google will do a better job than the World Bank’s own website at helping you locate the code.

Try googling: “World Bank Electricity production from coal sources (% of total))”. The following page should be the first page shown. You’ll see the data code in the end of the url. Should there be an easier way to find the codes? Yes. Is there? Let me know if you find one.

Also, sometimes the World Bank API throws an error. Try running it again 30 seconds later, and often it will work. Only after trying to re-run it once or twice should you move on to other de-bugging.

Everybody Loves a Map

Everybody loves a map. They look cool. And they convey a lot of information about geographic trends in the data.

And, they are easy to make with ggplot2. Here’s an example:

# the World Bank indicator we want
wb_indicator <- "EG.ELC.ACCS.ZS"

# get the metadata from wbstats' wb_cachelist. Will use these for labels
indicator_info <- filter(wb_cachelist$indicators, indicator_id == wb_indicator)

# get the map data in sf format (google it)
world_map_data <- rnaturalearth::ne_countries(returnclass = "sf") 

# get the data from the WB API. the code we want is assigned to the variable wb_indicator. Rename the column where the data will be to `value`
wb_data_tbl <- wb_data(c(value = wb_indicator), mrnev = 1)


world_map_data |> 
  # left join the WB data to the map data by the iso3c country codes. But the iso3c code is called the iso_a3 code in th emap data, so we specify this using by = c("iso_a3" = "iso3c")
  left_join(wb_data_tbl, by = c("iso_a3" = "iso3c")) |> 
  # remove Antarctica. It takes up visual space, and isn't important to this analysis
  filter(iso_a3 != "ATA") %>% 
  #start your ggplot!
  ggplot(aes(fill = value)) +
  # this makes a map using the sf data.  So easy, right?
  geom_sf() +
  # uses the viridis color pallete (option G).  Looks good on maps, colorblind friendly.
  scale_fill_viridis_c(option = "G", labels = scales::percent_format(scale = 1)) +
  # move the legend to the bottom
  theme(legend.position="bottom") +
  # use the metadata we got above for the labels!  Nifty.
  labs(
    title = indicator_info$indicator,
    fill = NULL,
    caption = str_wrap(paste("Source:", indicator_info$source_org))
  )

This map efficiently presents a clear insight.

Maps can be useful both for presentation graphics and for data exploration. Maps are great for presenting sovereign-level data in presentations and papers. But they are also useful when you are first playing around with the data. You have data for 52 countries? Map it. It will show you where they all are.

Joining Datasets

When working with country-level data, we’ll often be joining together multiple datasets from different sources. You can read more about Joins in the textbook. Here we’ll introduce the different joins availale in dplyr and show how they can be useful in analyzing country-level data.

We’ll start by looking at two bite-sized datasets. The first contains country names and regions. The second contains country names and capital cities. Each dataset has three observations – two of which overlap with the other dataset.

region_tbl <- tibble(country = c("United States", "China", "Nigeria"), region = c("North America", "Asia", "Africa"))

region_tbl
# A tibble: 3 × 2
  country       region       
  <chr>         <chr>        
1 United States North America
2 China         Asia         
3 Nigeria       Africa       
capital_tbl <- tibble(country = c("Nigeria", "Pakistan", "China"), capital = c("Abuja", "Islamabad", "Beijing"))

capital_tbl
# A tibble: 3 × 2
  country  capital  
  <chr>    <chr>    
1 Nigeria  Abuja    
2 Pakistan Islamabad
3 China    Beijing  

We have observations for Nigeria and China in both data objects. But regions_tbl has an observation for the United States, and capital_tbl has an observation for Pakistan.

Left Join (most common)

Most common join we will use: left_join()

It keeps all of the observations on the left (x), and will join only those observations on the right (y) that share the key (in this case, country).

Because there was no data about the capital of the US in capital_tbl, it shows up as NA.

region_tbl |> 
  left_join(capital_tbl, by = "country")
# A tibble: 3 × 3
  country       region        capital
  <chr>         <chr>         <chr>  
1 United States North America <NA>   
2 China         Asia          Beijing
3 Nigeria       Africa        Abuja  

Right Join

As its name suggest, right_join() does the opposite.

It keeps all of the observations on the right (y), and will join only those observations on the right (x) that share the key (in this case, country).

Notice that the United States observation from the region_tbl is no longer there, and that Pakistan, from the capitals_tbl is now present, with an NA value for region.

region_tbl |> 
  right_join(capital_tbl, by = "country")
# A tibble: 3 × 3
  country  region capital  
  <chr>    <chr>  <chr>    
1 China    Asia   Beijing  
2 Nigeria  Africa Abuja    
3 Pakistan <NA>   Islamabad

Inner Join

inner_join keeps only observations that are in both left (x) and (y).

It does not include Pakistan or the US, because they are not in both data structures.

This can be very useful when you’re looking for observations where you have full data.

region_tbl |> 
  inner_join(capital_tbl, by = "country")
# A tibble: 2 × 3
  country region capital
  <chr>   <chr>  <chr>  
1 China   Asia   Beijing
2 Nigeria Africa Abuja  

Full Join

Joins together all observations from left (x) and right (y). It adds NA values where there is not overlapping data.

region_tbl |> 
  full_join(capital_tbl, by = "country")
# A tibble: 4 × 3
  country       region        capital  
  <chr>         <chr>         <chr>    
1 United States North America <NA>     
2 China         Asia          Beijing  
3 Nigeria       Africa        Abuja    
4 Pakistan      <NA>          Islamabad

Anti Join

anti_join() identifies the observations on the left (x) that do not have observations on the right (y).

This is extremely useful for checking to see that you have complete data when you are joining datasets.

For example, here, we learn that we don’t have capital data for the United States, so we can go back and add it.

region_tbl |> 
  anti_join(capital_tbl, by = "country")
# A tibble: 1 × 2
  country       region       
  <chr>         <chr>        
1 United States North America

The Country Features Dataset

On our class GitHub data repository, there is a dataset called Country Features . It includes the country’s World Bank Income Group, and the World Bank Region. It includes key macro datapoints from the most recent IMF World Economic Outlook (WEO).

country_features <- read_csv("https://raw.githubusercontent.com/t-emery/sais-susfin_data/main/datasets/country_features_2022-10.csv")
Rows: 217 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): country_name, iso3c, wb_income_group, wb_region
dbl (4): debt_gross_percent_of_gdp, nominal_gdp_bn_ppp, nominal_gdp_per_capi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
country_features |> glimpse()
Rows: 217
Columns: 8
$ country_name               <chr> "Aruba", "Afghanistan", "Angola", "Albania"…
$ iso3c                      <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARE", "…
$ wb_income_group            <chr> "High", "Low", "Lower Middle", "Upper Middl…
$ wb_region                  <chr> "Latin America & Caribbean", "South Asia", …
$ debt_gross_percent_of_gdp  <dbl> 95.042, 7.397, 56.564, 70.278, 43.050, 30.7…
$ nominal_gdp_bn_ppp         <dbl> 5.066, 80.912, 245.442, 51.189, 5.301, 814.…
$ nominal_gdp_per_capita_ppp <dbl> 46308.66, 2456.29, 7455.47, 17858.42, 65371…
$ population_mn              <dbl> 0.109, 32.941, 32.921, 2.866, 0.081, 10.544…

The first two columns are country_name and iso3c. They are standardized using the workflow from part 1, so you can attach these easily to other country-level datasets.

Let’s look at our capital_tbl

capital_tbl
# A tibble: 3 × 2
  country  capital  
  <chr>    <chr>    
1 Nigeria  Abuja    
2 Pakistan Islamabad
3 China    Beijing  

Now we can add on these useful features

capital_tbl |> 
  left_join(country_features, by = c("country" = "country_name"))
# A tibble: 3 × 9
  country  capital   iso3c wb_income_g…¹ wb_re…² debt_…³ nomin…⁴ nomin…⁵ popul…⁶
  <chr>    <chr>     <chr> <chr>         <chr>     <dbl>   <dbl>   <dbl>   <dbl>
1 Nigeria  Abuja     NGA   Lower Middle  Sub-Sa…    37.4   1275.   5884.    217.
2 Pakistan Islamabad PAK   Lower Middle  South …    77.8   1512.   6662.    227.
3 China    Beijing   CHN   Upper Middle  East A…    76.9  30074.  21291.   1413.
# … with abbreviated variable names ¹​wb_income_group, ²​wb_region,
#   ³​debt_gross_percent_of_gdp, ⁴​nominal_gdp_bn_ppp,
#   ⁵​nominal_gdp_per_capita_ppp, ⁶​population_mn

Data Cleaning: An Example with OECD Emissions Data

We’ll be cleaning up an excel workbook exported from the OECD’s website. This will help you with your second paper, and will give you an introduction to dealing with the challenges of real-world data.

The data comes from this table on the OECD’s Data Portal. Note that often times Excel spreadsheets that are exported from websites are not actually Excel spreadsheets. They are actually written in XML or other formats. So the first step is to save the file as a .xlsx file. Here we’re going to download the file from the class GitHub data repo.

The file is housed at the URL below. Download the excel file and open it up so you can see what we’re working with.

file_url <- "https://github.com/t-emery/sais-susfin_data/raw/main/datasets/oecd_emissions_data_example.xlsx"

We can extract the file name from the URL using basename()

file_path <- basename(file_url)

file_path
[1] "oecd_emissions_data_example.xlsx"

Now we’ll download the file.

download.file(url = file_url, file_path)

Read the excel file. The data starts on row 7, so we’ll use the argument skip to skip the first 6 rows.

oecd_emissions_data_raw <- read_excel(file_path,skip = 6)
New names:
• `` -> `...2`
• `` -> `...4`
• `` -> `...6`
• `` -> `...8`
• `` -> `...10`
• `` -> `...12`
• `` -> `...14`
• `` -> `...16`
• `` -> `...18`
• `` -> `...20`
• `` -> `...22`
• `` -> `...24`
• `` -> `...26`
• `` -> `...28`
• `` -> `...30`
• `` -> `...32`
• `` -> `...34`
• `` -> `...36`
• `` -> `...38`
• `` -> `...40`
• `` -> `...42`
• `` -> `...44`
• `` -> `...46`
• `` -> `...48`
• `` -> `...50`
oecd_emissions_data_raw
# A tibble: 86 × 50
   Time        ...2  `1995`    ...4 `1996`    ...6 `1997`    ...8 `1998`   ...10
   <chr>       <lgl> <lgl>    <dbl> <lgl>    <dbl> <lgl>    <dbl> <lgl>    <dbl>
 1 Country     NA    NA       NA    NA       NA    NA       NA    NA       NA   
 2 WLD: World  NA    NA     7531.   NA     7815.   NA     8022.   NA     8197.  
 3 OECD: OECD… NA    NA     4431.   NA     4522.   NA     4716.   NA     4881.  
 4 AUS: Austr… NA    NA      130.   NA      133.   NA      140.   NA      158.  
 5 AUT: Austr… NA    NA       21.1  NA       22.2  NA       21.3  NA       21.0 
 6 BEL: Belgi… NA    NA       33.9  NA       32.2  NA       30.3  NA       32.6 
 7 CAN: Canada NA    NA      115.   NA      113.   NA      127.   NA      136.  
 8 CHL: Chile  NA    NA        8.23 NA       11.6  NA       14.0  NA       16.2 
 9 COL: Colom… NA    NA       10.8  NA        8.78 NA       11.8  NA       12.1 
10 CRI: Costa… NA    NA        1.53 NA        1.23 NA        1.22 NA        1.46
# … with 76 more rows, and 40 more variables: `1999` <lgl>, ...12 <dbl>,
#   `2000` <lgl>, ...14 <dbl>, `2001` <lgl>, ...16 <dbl>, `2002` <lgl>,
#   ...18 <dbl>, `2003` <lgl>, ...20 <dbl>, `2004` <lgl>, ...22 <dbl>,
#   `2005` <lgl>, ...24 <dbl>, `2006` <lgl>, ...26 <dbl>, `2007` <lgl>,
#   ...28 <dbl>, `2008` <lgl>, ...30 <dbl>, `2009` <lgl>, ...32 <dbl>,
#   `2010` <lgl>, ...34 <dbl>, `2011` <lgl>, ...36 <dbl>, `2012` <lgl>,
#   ...38 <dbl>, `2013` <lgl>, ...40 <dbl>, `2014` <lgl>, ...42 <dbl>, …

Let’s look at the raw data.

View(oecd_emissions_data_raw)

It looks strange because of the excel formatting. This is common. But we can fix it.

Data Cleaning, Step-by-Step:

Deal with Weird Excel Formatting

First, let’s grab all of the columns that actually have data:

oecd_emissions_actual_data <- oecd_emissions_data_raw |> 
  # all the column headers with data are formatted: ...[Number]
  select(Time, contains("...")) |> 
  # except for one, so get rid of it.
  select(-`...2`) |> 
  # get rid of the first row, which has no data
  filter(Time != "Country")

oecd_emissions_actual_data
# A tibble: 85 × 25
   Time      ...4   ...6   ...8  ...10  ...12  ...14  ...16  ...18  ...20  ...22
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 WLD: W… 7.53e3 7.82e3 8.02e3 8.20e3 8.24e3 8.75e3 8.96e3 9.10e3 9.64e3 9.97e3
 2 OECD: … 4.43e3 4.52e3 4.72e3 4.88e3 4.95e3 5.29e3 5.28e3 5.20e3 5.41e3 5.54e3
 3 AUS: A… 1.30e2 1.33e2 1.40e2 1.58e2 1.57e2 1.62e2 1.63e2 1.82e2 1.75e2 1.89e2
 4 AUT: A… 2.11e1 2.22e1 2.13e1 2.10e1 2.09e1 2.14e1 2.21e1 2.19e1 2.57e1 2.62e1
 5 BEL: B… 3.39e1 3.22e1 3.03e1 3.26e1 3.15e1 3.34e1 3.13e1 3.08e1 3.28e1 3.49e1
 6 CAN: C… 1.15e2 1.13e2 1.27e2 1.36e2 1.35e2 1.41e2 1.44e2 1.42e2 1.54e2 1.54e2
 7 CHL: C… 8.23e0 1.16e1 1.40e1 1.62e1 1.74e1 1.51e1 1.28e1 1.32e1 1.43e1 1.66e1
 8 COL: C… 1.08e1 8.78e0 1.18e1 1.21e1 7.80e0 9.83e0 9.96e0 9.69e0 9.83e0 9.33e0
 9 CRI: C… 1.53e0 1.23e0 1.22e0 1.46e0 1.33e0 1.39e0 1.36e0 1.31e0 1.35e0 1.28e0
10 CZE: C… 5.24e1 5.45e1 5.36e1 5.17e1 5.02e1 5.41e1 5.45e1 5.36e1 5.48e1 5.28e1
# … with 75 more rows, and 14 more variables: ...24 <dbl>, ...26 <dbl>,
#   ...28 <dbl>, ...30 <dbl>, ...32 <dbl>, ...34 <dbl>, ...36 <dbl>,
#   ...38 <dbl>, ...40 <dbl>, ...42 <dbl>, ...44 <dbl>, ...46 <dbl>,
#   ...48 <dbl>, ...50 <dbl>

Next, let’s find the correct column names:

correct_column_names <- oecd_emissions_data_raw |> 
  # select all columns that don't contain `...` in their name
  select(!contains("...")) |> 
  # selects the column names as a vector
  names()

correct_column_names
 [1] "Time" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003"
[11] "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
[21] "2014" "2015" "2016" "2017" "2018"

Now we rename our dataset with the correct column names.

oecd_emissions_actual_step_2 <- oecd_emissions_actual_data |> 
  rename_with(.f = function(x) correct_column_names) |> 
  # rename the column to something that makes sense
  rename(geography = Time)

oecd_emissions_actual_step_2
# A tibble: 85 × 25
   geogr…¹ `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 WLD: W… 7.53e3 7.82e3 8.02e3 8.20e3 8.24e3 8.75e3 8.96e3 9.10e3 9.64e3 9.97e3
 2 OECD: … 4.43e3 4.52e3 4.72e3 4.88e3 4.95e3 5.29e3 5.28e3 5.20e3 5.41e3 5.54e3
 3 AUS: A… 1.30e2 1.33e2 1.40e2 1.58e2 1.57e2 1.62e2 1.63e2 1.82e2 1.75e2 1.89e2
 4 AUT: A… 2.11e1 2.22e1 2.13e1 2.10e1 2.09e1 2.14e1 2.21e1 2.19e1 2.57e1 2.62e1
 5 BEL: B… 3.39e1 3.22e1 3.03e1 3.26e1 3.15e1 3.34e1 3.13e1 3.08e1 3.28e1 3.49e1
 6 CAN: C… 1.15e2 1.13e2 1.27e2 1.36e2 1.35e2 1.41e2 1.44e2 1.42e2 1.54e2 1.54e2
 7 CHL: C… 8.23e0 1.16e1 1.40e1 1.62e1 1.74e1 1.51e1 1.28e1 1.32e1 1.43e1 1.66e1
 8 COL: C… 1.08e1 8.78e0 1.18e1 1.21e1 7.80e0 9.83e0 9.96e0 9.69e0 9.83e0 9.33e0
 9 CRI: C… 1.53e0 1.23e0 1.22e0 1.46e0 1.33e0 1.39e0 1.36e0 1.31e0 1.35e0 1.28e0
10 CZE: C… 5.24e1 5.45e1 5.36e1 5.17e1 5.02e1 5.41e1 5.45e1 5.36e1 5.48e1 5.28e1
# … with 75 more rows, 14 more variables: `2005` <dbl>, `2006` <dbl>,
#   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
#   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
#   `2017` <dbl>, `2018` <dbl>, and abbreviated variable name ¹​geography

Standardize Country Names with countrycode()

Now, let’s standardize the country names, and differentiate between countries and regions.

geography_tbl <- oecd_emissions_actual_step_2 |> 
  select(geography)

geography_tbl
# A tibble: 85 × 1
   geography                  
   <chr>                      
 1 WLD: World                 
 2 OECD: OECD member countries
 3 AUS: Australia             
 4 AUT: Austria               
 5 BEL: Belgium               
 6 CAN: Canada                
 7 CHL: Chile                 
 8 COL: Colombia              
 9 CRI: Costa Rica            
10 CZE: Czech Republic        
# … with 75 more rows
geography_tbl_2 <- geography_tbl |> 
  # separate the geography column 
  # into geography_code and geography_name
  # separate at the colon ":"
  # don't remove the original column (by default it removes the original column)
  separate(col = geography, into = c("geography_code", "geography_name"), 
           sep = ":", remove = FALSE) |> 
  # use str_trim() to make sure there is no extra spaces around the new names that might mess up other calculations
  mutate(across(c("geography_code", "geography_name"), str_trim))

geography_tbl_2
# A tibble: 85 × 3
   geography                   geography_code geography_name       
   <chr>                       <chr>          <chr>                
 1 WLD: World                  WLD            World                
 2 OECD: OECD member countries OECD           OECD member countries
 3 AUS: Australia              AUS            Australia            
 4 AUT: Austria                AUT            Austria              
 5 BEL: Belgium                BEL            Belgium              
 6 CAN: Canada                 CAN            Canada               
 7 CHL: Chile                  CHL            Chile                
 8 COL: Colombia               COL            Colombia             
 9 CRI: Costa Rica             CRI            Costa Rica           
10 CZE: Czech Republic         CZE            Czech Republic       
# … with 75 more rows

Great, now we have two separate columns with the geography codes and the geography names.

Next, let’s find standardized country names using countrycode() . The places where there are NA values should all be regions. Let’s check.

geography_tbl_2 |> 
  mutate(country_name = countrycode(sourcevar = geography_code, 
                                    origin = "iso3c", 
                                    destination = "country.name")) |> 
  filter(is.na(country_name))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
  "iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
# A tibble: 19 × 4
   geography                                             geogr…¹ geogr…² count…³
   <chr>                                                 <chr>   <chr>   <chr>  
 1 WLD: World                                            WLD     World   <NA>   
 2 OECD: OECD member countries                           OECD    OECD m… <NA>   
 3 NONOECD: Non-OECD economies and aggregates            NONOECD Non-OE… <NA>   
 4 ROW: Rest of the World                                ROW     Rest o… <NA>   
 5 APEC: Asia-Pacific Economic Cooperation               APEC    Asia-P… <NA>   
 6 ASEAN: Association of South East Asian Nations        ASEAN   Associ… <NA>   
 7 EASIA: Eastern Asia                                   EASIA   Easter… <NA>   
 8 EU27_2020: European Union (27 countries)              EU27_2… Europe… <NA>   
 9 EU28: European Union (28 countries)                   EU28    Europe… <NA>   
10 EU15: European Union (15 countries)                   EU15    Europe… <NA>   
11 EU13: EU28 excluding EU15                             EU13    EU28 e… <NA>   
12 EA19: Euro area (19 countries)                        EA19    Euro a… <NA>   
13 G20: Group of Twenty                                  G20     Group … <NA>   
14 ZEUR: Europe                                          ZEUR    Europe  <NA>   
15 ZASI: East and Southeastern Asia                      ZASI    East a… <NA>   
16 ZNAM: North America                                   ZNAM    North … <NA>   
17 ZSCA: South and Central America                       ZSCA    South … <NA>   
18 ZOTH: Other regions                                   ZOTH    Other … <NA>   
19 Data extracted on 26 Feb 2023 00:16 UTC (GMT) from O… Data e… 16 UTC… <NA>   
# … with abbreviated variable names ¹​geography_code, ²​geography_name,
#   ³​country_name

Okay, all the places that are missing in country_name are regions. We can use that to our advantage

geography_processed <- geography_tbl_2 |> 
  # use countrycode() to give the countries standardized names
  mutate(country_name = countrycode(sourcevar = geography_code, 
                                    origin = "iso3c", 
                                    destination = "country.name")) |>
  # use if_else to make a column telling whether the entity is a region or a country
  # if country_name is NA, then it is a region, if not it is a country
  mutate(country_or_region = if_else(is.na(country_name), 
                                     true = "region",
                                     false = "country")) |> 
  # fill in the region names into country_name using the same logic
  mutate(country_name = if_else(is.na(country_name),
                                true = geography_name,
                                false = country_name)) |> 
  # select the columns we want. 
  # we'll rename geography_code into iso3c to match our other standardized datasets.
  # we'll use the original geography column to join it to the original dataset.
  select(geography, country_name, iso3c = geography_code, country_or_region)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
  "iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
geography_processed
# A tibble: 85 × 4
   geography                   country_name          iso3c country_or_region
   <chr>                       <chr>                 <chr> <chr>            
 1 WLD: World                  World                 WLD   region           
 2 OECD: OECD member countries OECD member countries OECD  region           
 3 AUS: Australia              Australia             AUS   country          
 4 AUT: Austria                Austria               AUT   country          
 5 BEL: Belgium                Belgium               BEL   country          
 6 CAN: Canada                 Canada                CAN   country          
 7 CHL: Chile                  Chile                 CHL   country          
 8 COL: Colombia               Colombia              COL   country          
 9 CRI: Costa Rica             Costa Rica            CRI   country          
10 CZE: Czech Republic         Czechia               CZE   country          
# … with 75 more rows

Now we join our standardized country names + region/country flag back onto the main dataset. We attach it by the geography column. But once we’ve done that, we can get rid of that column.

step3 <- geography_processed |> 
  left_join(oecd_emissions_actual_step_2, by = "geography") |> 
  # now that we've joined this back to the dataset, we don't need the geography column anymore
  select(-geography)

step3 
# A tibble: 85 × 27
   count…¹ iso3c count…² `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002`
   <chr>   <chr> <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 World   WLD   region  7.53e3 7.82e3 8.02e3 8.20e3 8.24e3 8.75e3 8.96e3 9.10e3
 2 OECD m… OECD  region  4.43e3 4.52e3 4.72e3 4.88e3 4.95e3 5.29e3 5.28e3 5.20e3
 3 Austra… AUS   country 1.30e2 1.33e2 1.40e2 1.58e2 1.57e2 1.62e2 1.63e2 1.82e2
 4 Austria AUT   country 2.11e1 2.22e1 2.13e1 2.10e1 2.09e1 2.14e1 2.21e1 2.19e1
 5 Belgium BEL   country 3.39e1 3.22e1 3.03e1 3.26e1 3.15e1 3.34e1 3.13e1 3.08e1
 6 Canada  CAN   country 1.15e2 1.13e2 1.27e2 1.36e2 1.35e2 1.41e2 1.44e2 1.42e2
 7 Chile   CHL   country 8.23e0 1.16e1 1.40e1 1.62e1 1.74e1 1.51e1 1.28e1 1.32e1
 8 Colomb… COL   country 1.08e1 8.78e0 1.18e1 1.21e1 7.80e0 9.83e0 9.96e0 9.69e0
 9 Costa … CRI   country 1.53e0 1.23e0 1.22e0 1.46e0 1.33e0 1.39e0 1.36e0 1.31e0
10 Czechia CZE   country 5.24e1 5.45e1 5.36e1 5.17e1 5.02e1 5.41e1 5.45e1 5.36e1
# … with 75 more rows, 16 more variables: `2003` <dbl>, `2004` <dbl>,
#   `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
#   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
#   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, and abbreviated
#   variable names ¹​country_name, ²​country_or_region

Pivot the Dataset into Long Format

Make it long form:

step4 <- step3 |> 
  # Use a `regular expression` (aka regex) to select any column name that is 4 digits (d{4})
  pivot_longer(cols = matches("\\d{4}"), 
               # change the names column to `year`
               names_to = "year", 
               # transform it into to integer data type (from character)
               names_transform = as.integer)

step4
# A tibble: 2,040 × 5
   country_name iso3c country_or_region  year value
   <chr>        <chr> <chr>             <int> <dbl>
 1 World        WLD   region             1995 7531.
 2 World        WLD   region             1996 7815.
 3 World        WLD   region             1997 8022.
 4 World        WLD   region             1998 8197.
 5 World        WLD   region             1999 8238.
 6 World        WLD   region             2000 8752.
 7 World        WLD   region             2001 8958.
 8 World        WLD   region             2002 9095.
 9 World        WLD   region             2003 9637.
10 World        WLD   region             2004 9974.
# … with 2,030 more rows

Attach Metadata

Get the metadata for our series and attach it.

oecd_metadata <- read_excel(basename(file_url),range = "a3:c6", col_names = c("metadata_parameter", "blank", "metadata_value")) |> 
  select(-blank) |> 
  pivot_wider(names_from = metadata_parameter, values_from = metadata_value)

oecd_metadata
# A tibble: 1 × 4
  Indicator                                                Indus…¹ Partner Unit 
  <chr>                                                    <chr>   <chr>   <chr>
1 FD_CO2: CO2 emissions embodied in domestic final demand… D35: E… WLD: W… Tonn…
# … with abbreviated variable name ¹​Industry

Alternatively, you can make and attach your own shorter meta data description. This can be useful for analysis and charting.

my_oecd_metadata <- tribble(~my_indicator_name,
                            "CO2 from electricity, gas, and heating/cooling")

my_oecd_metadata
# A tibble: 1 × 1
  my_indicator_name                             
  <chr>                                         
1 CO2 from electricity, gas, and heating/cooling

You can add that on in addition to, or instead of the official metadata. Better to save both.

The Tidy dataset

Now we have a complete tidy dataset that is ready for analysis.

What are the key elements?

  1. first two columns are country_name and iso3c standardized using the countrycode package. Third column flags whether it is a region or a country.

  2. All columns are the correct data type. year is integer, value is double (numeric)

  3. Metadata gives us a clear sense of exactly what our data is.

  4. Column names are in snake_case so they are easy to work with in R.

step4 |> 
  bind_cols(oecd_metadata) |> 
  clean_names()
# A tibble: 2,040 × 9
   country_name iso3c country_or_reg…¹  year value indic…² indus…³ partner unit 
   <chr>        <chr> <chr>            <int> <dbl> <chr>   <chr>   <chr>   <chr>
 1 World        WLD   region            1995 7531. FD_CO2… D35: E… WLD: W… Tonn…
 2 World        WLD   region            1996 7815. FD_CO2… D35: E… WLD: W… Tonn…
 3 World        WLD   region            1997 8022. FD_CO2… D35: E… WLD: W… Tonn…
 4 World        WLD   region            1998 8197. FD_CO2… D35: E… WLD: W… Tonn…
 5 World        WLD   region            1999 8238. FD_CO2… D35: E… WLD: W… Tonn…
 6 World        WLD   region            2000 8752. FD_CO2… D35: E… WLD: W… Tonn…
 7 World        WLD   region            2001 8958. FD_CO2… D35: E… WLD: W… Tonn…
 8 World        WLD   region            2002 9095. FD_CO2… D35: E… WLD: W… Tonn…
 9 World        WLD   region            2003 9637. FD_CO2… D35: E… WLD: W… Tonn…
10 World        WLD   region            2004 9974. FD_CO2… D35: E… WLD: W… Tonn…
# … with 2,030 more rows, and abbreviated variable names ¹​country_or_region,
#   ²​indicator, ³​industry

Consolidate The Workflow

Okay, now put this all into 1 workflow so we can use it for other OECD

file_path <- basename(file_url)

oecd_emissions_data_raw <- read_excel(file_path,skip = 6)
New names:
• `` -> `...2`
• `` -> `...4`
• `` -> `...6`
• `` -> `...8`
• `` -> `...10`
• `` -> `...12`
• `` -> `...14`
• `` -> `...16`
• `` -> `...18`
• `` -> `...20`
• `` -> `...22`
• `` -> `...24`
• `` -> `...26`
• `` -> `...28`
• `` -> `...30`
• `` -> `...32`
• `` -> `...34`
• `` -> `...36`
• `` -> `...38`
• `` -> `...40`
• `` -> `...42`
• `` -> `...44`
• `` -> `...46`
• `` -> `...48`
• `` -> `...50`
correct_column_names <- oecd_emissions_data_raw |> 
  select(!contains("...")) |> 
  names()

oecd_emissions_actual_step_2 <- oecd_emissions_actual_data |> 
  rename_with(.f = function(x) correct_column_names) |> 
  rename(geography = Time)

geography_tbl <- oecd_emissions_actual_step_2 |> 
  select(geography)

geography_processed <- geography_tbl |> 
  # separate the geography column 
  # into geography_code and geography_name
  # separate at the colon ":"
  # don't remove the original column (by default it removes the original column)
  separate(col = geography, into = c("geography_code", "geography_name"), 
           sep = ":", remove = FALSE) |> 
  # use str_trim() to make sure there is no extra spaces around the new names that might mess up other calculations
  mutate(across(c("geography_code", "geography_name"), str_trim)) |> 
  # use countrycode() to give the countries standardized names
  mutate(country_name = countrycode(sourcevar = geography_code, 
                                    origin = "iso3c", 
                                    destination = "country.name")) |>
  # use if_else to make a column telling whether the entity is a region or a country
  # if country_name is NA, then it is a region, if not it is a country
  mutate(country_or_region = if_else(is.na(country_name), 
                                     true = "region",
                                     false = "country")) |> 
  # fill in the region names into country_name using the same logic
  mutate(country_name = if_else(is.na(country_name),
                                true = geography_name,
                                false = country_name)) |> 
  # select the columns we want. 
  # we'll rename geography_code into iso3c to match our other standardized datasets.
  # we'll use the original geography column to join it to the original dataset.
  select(geography, country_name, iso3c = geography_code, country_or_region)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
  "iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
data_processed <- geography_processed |> 
  left_join(oecd_emissions_actual_step_2, by = "geography") |> 
  # now that we've joined this back to the dataset, we don't need the geography column anymore
  select(-geography) |> 
  # Use a `regular expression` (aka regex) to select any column name that is 4 digits (d{4})
  pivot_longer(cols = matches("\\d{4}"), 
               # change the names column to `year`
               names_to = "year", 
               # transform it into to integer data type (from character)
               names_transform = as.integer)

oecd_metadata <- read_excel(file_path,range = "a3:c6", col_names = c("metadata_parameter", "blank", "metadata_value")) |> 
  select(-blank) |> 
  pivot_wider(names_from = metadata_parameter, values_from = metadata_value)


final_data <- data_processed |> 
  bind_cols(oecd_metadata) |> 
  janitor::clean_names()
  

final_data 
# A tibble: 2,040 × 9
   country_name iso3c country_or_reg…¹  year value indic…² indus…³ partner unit 
   <chr>        <chr> <chr>            <int> <dbl> <chr>   <chr>   <chr>   <chr>
 1 World        WLD   region            1995 7531. FD_CO2… D35: E… WLD: W… Tonn…
 2 World        WLD   region            1996 7815. FD_CO2… D35: E… WLD: W… Tonn…
 3 World        WLD   region            1997 8022. FD_CO2… D35: E… WLD: W… Tonn…
 4 World        WLD   region            1998 8197. FD_CO2… D35: E… WLD: W… Tonn…
 5 World        WLD   region            1999 8238. FD_CO2… D35: E… WLD: W… Tonn…
 6 World        WLD   region            2000 8752. FD_CO2… D35: E… WLD: W… Tonn…
 7 World        WLD   region            2001 8958. FD_CO2… D35: E… WLD: W… Tonn…
 8 World        WLD   region            2002 9095. FD_CO2… D35: E… WLD: W… Tonn…
 9 World        WLD   region            2003 9637. FD_CO2… D35: E… WLD: W… Tonn…
10 World        WLD   region            2004 9974. FD_CO2… D35: E… WLD: W… Tonn…
# … with 2,030 more rows, and abbreviated variable names ¹​country_or_region,
#   ²​indicator, ³​industry

Make a Data Processing Function

Now we can make it into a function:

process_oecd_emissions_xlsx

process_oecd_emissions_xlsx <- function(file_path_xlsx) {
  
  oecd_emissions_data_raw <- read_excel(file_path_xlsx,skip = 6)

  correct_column_names <- oecd_emissions_data_raw |> 
    select(!contains("...")) |> 
    names()

  oecd_emissions_actual_step_2 <- oecd_emissions_actual_data |> 
    rename_with(.f = function(x) correct_column_names) |> 
    rename(geography = Time)

  geography_tbl <- oecd_emissions_actual_step_2 |> 
    select(geography)

  geography_processed <- geography_tbl |> 
    # separate the geography column 
    # into geography_code and geography_name
    # separate at the colon ":"
    # don't remove the original column (by default it removes the original column)
    separate(col = geography, into = c("geography_code", "geography_name"), 
             sep = ":", remove = FALSE) |> 
    # use str_trim() to make sure there is no extra spaces around the new names that might mess up other calculations
    mutate(across(c("geography_code", "geography_name"), str_trim)) |> 
    # use countrycode() to give the countries standardized names
    mutate(country_name = countrycode(sourcevar = geography_code, 
                                      origin = "iso3c", 
                                      destination = "country.name")) |>
    # use if_else to make a column telling whether the entity is a region or a country
    # if country_name is NA, then it is a region, if not it is a country
    mutate(country_or_region = if_else(is.na(country_name), 
                                       true = "region",
                                       false = "country")) |> 
    # fill in the region names into country_name using the same logic
    mutate(country_name = if_else(is.na(country_name),
                                  true = geography_name,
                                  false = country_name)) |> 
    # select the columns we want. 
    # we'll rename geography_code into iso3c to match our other standardized datasets.
    # we'll use the original geography column to join it to the original dataset.
    select(geography, country_name, iso3c = geography_code, country_or_region)

  data_processed <- geography_processed |> 
    left_join(oecd_emissions_actual_step_2, by = "geography") |> 
    # now that we've joined this back to the dataset, we don't need the geography column anymore
    select(-geography) |> 
    # Use a `regular expression` (aka regex) to select any column name that is 4 digits (d{4})
    pivot_longer(cols = matches("\\d{4}"), 
                 # change the names column to `year`
                 names_to = "year", 
                 # transform it into to integer data type (from character)
                 names_transform = as.integer)
  
  oecd_metadata <- read_excel(file_path,range = "a3:c6", col_names = c("metadata_parameter", "blank", "metadata_value")) |> 
    select(-blank) |> 
    pivot_wider(names_from = metadata_parameter, values_from = metadata_value)
  
  
  final_data <- data_processed |> bind_cols(oecd_metadata) |> 
    janitor::clean_names()
  
  final_data 
}

processed_data <- process_oecd_emissions_xlsx(file_path_xlsx = "oecd_emissions_data_example.xlsx")
New names:
• `` -> `...2`
• `` -> `...4`
• `` -> `...6`
• `` -> `...8`
• `` -> `...10`
• `` -> `...12`
• `` -> `...14`
• `` -> `...16`
• `` -> `...18`
• `` -> `...20`
• `` -> `...22`
• `` -> `...24`
• `` -> `...26`
• `` -> `...28`
• `` -> `...30`
• `` -> `...32`
• `` -> `...34`
• `` -> `...36`
• `` -> `...38`
• `` -> `...40`
• `` -> `...42`
• `` -> `...44`
• `` -> `...46`
• `` -> `...48`
• `` -> `...50`
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin =
  "iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 26 FEB 2023 00, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
processed_data
# A tibble: 2,040 × 9
   country_name iso3c country_or_reg…¹  year value indic…² indus…³ partner unit 
   <chr>        <chr> <chr>            <int> <dbl> <chr>   <chr>   <chr>   <chr>
 1 World        WLD   region            1995 7531. FD_CO2… D35: E… WLD: W… Tonn…
 2 World        WLD   region            1996 7815. FD_CO2… D35: E… WLD: W… Tonn…
 3 World        WLD   region            1997 8022. FD_CO2… D35: E… WLD: W… Tonn…
 4 World        WLD   region            1998 8197. FD_CO2… D35: E… WLD: W… Tonn…
 5 World        WLD   region            1999 8238. FD_CO2… D35: E… WLD: W… Tonn…
 6 World        WLD   region            2000 8752. FD_CO2… D35: E… WLD: W… Tonn…
 7 World        WLD   region            2001 8958. FD_CO2… D35: E… WLD: W… Tonn…
 8 World        WLD   region            2002 9095. FD_CO2… D35: E… WLD: W… Tonn…
 9 World        WLD   region            2003 9637. FD_CO2… D35: E… WLD: W… Tonn…
10 World        WLD   region            2004 9974. FD_CO2… D35: E… WLD: W… Tonn…
# … with 2,030 more rows, and abbreviated variable names ¹​country_or_region,
#   ²​indicator, ³​industry
processed_data |> glimpse()
Rows: 2,040
Columns: 9
$ country_name      <chr> "World", "World", "World", "World", "World", "World"…
$ iso3c             <chr> "WLD", "WLD", "WLD", "WLD", "WLD", "WLD", "WLD", "WL…
$ country_or_region <chr> "region", "region", "region", "region", "region", "r…
$ year              <int> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003…
$ value             <dbl> 7530.733, 7815.068, 8022.339, 8197.269, 8238.322, 87…
$ indicator         <chr> "FD_CO2: CO2 emissions embodied in domestic final de…
$ industry          <chr> "D35: Electricity, gas, steam and air conditioning s…
$ partner           <chr> "WLD: World", "WLD: World", "WLD: World", "WLD: Worl…
$ unit              <chr> "Tonnes, Millions", "Tonnes, Millions", "Tonnes, Mil…

Save the Processed Data

Write this data to your `00_processed1

# use here::here() to specify relative paths from your projects base directory
write_csv(processed_data, here("00_data_processed", "oecd_data_xyz_processed.csv"))

# if you have a really big file, or if you have data that is not rectangular (nested data, map data, etc..), writing to a .rds file is a good option.
write_rds(processed_data, here("00_data_processed", "oecd_data_xyz_processed.rds"), compress = "gz")