Introduction

Today’s lesson is based on an actual project that I recently worked on for a hedge fund client. Following COP26 and the rise of Mark Carney’s Glasgow Financial Alliance for Net Zero (GFANZ), clients were increasingly interested in measuring the carbon footprints of their portfolios.

To answer this, first I had to read. A lot. I read everything that I could on carbon footprinting generally, and the few pieces that had been written specifically about carbon footprinting sovereigns.

I also used past domain expertise. What matters the most in driving fixed income flows are the metrics that drive investment benchmark construction, and the incentives they provide.

So for the data analysis, I focused on understanding the distributional implications of different proposed carbon intensity metrics. I used publicly available data so that I could make the findings reproducible. Transparency builds credibility.

This lessson will aim to:

  1. Give you a basic toolkit for working with sovereign level data. You’ll learn to work with messy country names, pull data from the world bank, and connect data from different sources like a pro.
  2. Go through some of the basic analysis process.
  3. Give you a chance to work with the data and squeeze out meaningful insights.

Toolkit for working with sovereign data

This first section gives you a toolkit for working with sovereign level data. You are SAIS students. Being able to give a global perspective to issues is your calling card. So learning to work with sovereign level data is important for everyone, not just those working in sovereign fixed income

The countrycode package is your friend

A persistent challenge for working with sovereign-level data is dealing with country names. Countries are referred to be a variety of different names and coding schemes. For example:

  • USA vs. US. vs. U.S. vs. United States
  • Russia vs. Russian Federation.

The countrycode package in R will be your best friend in dealing with country names.

I’ll show you the workflow I use, and which works well for me. I take the raw country names and codes and convert them into

  1. ISO3C Codes. These are the three letter country codes. They are helpful because they are unambigous, and because they are shorter than country names they can be useful for charts.
  2. Standardized country names provided by the country code package.

What is the advantage of doing this? If all of your datasets use standardized country names, it’s easy to combine datasets. And being able to combine datasets from disparate source is a key ingredient of great analysis.

?countrycode
## starting httpd help server ... done
countrycode("Russian Federation",origin = "country.name", 
                                     destination = "iso3c",
                                     origin_regex = TRUE)
## [1] "RUS"
"Russian Federation" %>% countrycode(origin = "country.name", 
                                     destination = "iso3c",
                                     origin_regex = TRUE)
## [1] "RUS"

And, because we work hard to be lazy, we make a function:

country_name_regex_to_iso3c <- function(country_name) {
  country_name %>%
    countrycode(origin = "country.name", 
                                     destination = "iso3c",
                                     origin_regex = TRUE)
}
"Bolivarian Republic of Venezuela" %>% country_name_regex_to_iso3c()
## [1] "VEN"
"VEN" %>%
  countrycode(origin = "iso3c", destination = "country.name")
## [1] "Venezuela"

And we’ll make another function to convert from iso3c to a standardized country name from the countrycode package.

iso3c_to_country_name <- function(iso3c) {
  iso3c %>%
  countrycode(origin = "iso3c", destination = "country.name")
}
"VEN" %>% iso3c_to_country_name()
## [1] "Venezuela"

Now we can put these together:

"DRC" %>%
  country_name_regex_to_iso3c() %>%
  iso3c_to_country_name()
## [1] "Congo - Kinshasa"

Of course, you can combine these together and do whatever you find most effective. But I find it useful both to have the written name and the iso3c code.

From here, I use the iso3c to create any other feature available in the countrycode package

Here are a few of the ones I find most useful

  • continent: Continent as defined in the World Bank Development Indicators
  • region: 7 Regions as defined in the World Bank Development Indicators
  • region23: 23 Regions as used to be in the World Bank Development Indicators (legacy)
  • currency: ISO 4217 currency name
  • unicode.symbol: emoji flag. Nice for tables and charts
# look up ?purrr:partial().  It's an easy way to simplify functions.
iso3c_to_x <- purrr::partial(countrycode, origin = "iso3c")

"VEN" %>% iso3c_to_x(destination = "region")
## [1] "Latin America & Caribbean"
"VEN" %>% iso3c_to_x(destination = "unicode.symbol")
## [1] "🇻🇪"

A Brief Example

First let’s start with a mini dataset where we have four countries each referred to by two different common permutations of their name.

messy_country_dataset <- tibble(messy_country = c("The United States of America", "US", "United Mexican States", "Mexico", "The Russian Federation", "Russia", "Czechia", "The Czech Republic"), value = c(7, 2, 5, 2, 8, 4, 1, 4)) 

messy_country_dataset

Now we can use the functions we made to clean up the dataset, and add a pretty flag that could be helpful if we present the final data in a table.

cleaned_names <- messy_country_dataset %>%
  # using multiple mutate functions simply so we can run them one at a time.
  mutate(iso3c = country_name_regex_to_iso3c(messy_country)) %>%
  mutate(country_name = iso3c_to_country_name(iso3c)) %>%
  mutate(flag = iso3c_to_x(iso3c, destination = "unicode.symbol"))

cleaned_names

Now you can do cool stuff with the data easily. Here’s an example.

cleaned_names %>%
  group_by(flag, country_name, iso3c) %>%
  summarize(sum_of_value = sum(value)) %>%
  arrange(sum_of_value)
## `summarise()` has grouped output by 'flag', 'country_name'. You can override
## using the `.groups` argument.

World Bank Data using the wbstats package

The wbstats package makes it easy to get data from the World Bank. Here is a tutorial for getting started.

Often times we’ll want to get the most recent value. With sovereign level data there are commonly large data gaps and lags. The wbstats package has a nice function to give us the most recent value, which we use below. mrnev = “Most recent non-empty value”.

gdp_capita <- wb_data("NY.GDP.PCAP.CD", mrnev = 1)

gdp_capita %>% glimpse()
## Rows: 214
## Columns: 8
## $ iso2c          <chr> "AW", "AF", "AO", "AL", "AD", "AE", "AR", "AM", "AS", "…
## $ iso3c          <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARE", "ARG", "ARM",…
## $ country        <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra",…
## $ date           <dbl> 2020, 2020, 2021, 2021, 2021, 2020, 2021, 2021, 2020, 2…
## $ NY.GDP.PCAP.CD <dbl> 23384.2988, 516.7479, 2137.9094, 6494.3857, 43047.6863,…
## $ obs_status     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ footnote       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ last_updated   <date> 2022-09-16, 2022-09-16, 2022-09-16, 2022-09-16, 2022-0…

You can find interesting relevant datasets on the World Bank’s Sovereign ESG Data Portal.

From experience: The World Bank’s website isn’t the most user friendly. If you find an indicator that you want google “World Bank” and the name of the indicator, and google will do a better job than the World Bank’s own website at helping you locate the code.

Try googling: “World Bank Electricity production from coal sources (% of total))”. The following page should be the first page shown. You’ll see the data code in the end of the url. Should there be an easier way to find the codes? Yes. Is there? Let me know if you find one.

Joining datasets

Joining datasets together from different sources helps you unlock new insights from your data. Learning the different joins can be confusing at first, but it’s well worth your time. Let’s practice with a very small dataset first so its clear what’s happening. You can come back to these examples when you are joining much larger datasets later in the exercise.

The first dataset has three countries and the regions they are part of.

region_tbl <- tibble(country = c("United States", "China", "Nigeria"), region = c("North America", "Asia", "Africa"))

region_tbl

The second dataset has three countries and their capitals.

Only two of the three countries overlap. We don’t have Pakistan’s region in the region_tbl. And we don’t have the US capital in the capital_tbl

capital_tbl <- tibble(country = c("Nigeria", "Pakistan", "China"), capital = c("Abuja", "Islamabad", "Beijing"))

capital_tbl

Left Join (most common)

Most commmon join we will use: left_join()

It keeps all of the observations on the left (x), and will join only those observations on the right (y) that share the key (in this case, country).

Because there was no data about the capital of the US in capital_tbl, it shows up as NA.

region_tbl %>% left_join(capital_tbl, by = "country")

Right Join

As its name suggest, right_join() does the opposite.

It keeps all of the observations on the right (y), and will join only those observations on the right (x) that share the key (in this case, country).

Notice that the United States observation from the region_tbl is no longer there, and that Pakistan, from the capitals_tbl is now present, with an NA value for region.

region_tbl %>% right_join(capital_tbl, by = "country")

Inner Join

inner_join keeps only observations that are in both left (x) and (y).

It does not include Pakistan or the US, because they are not in both data structures.

This can be very useful when you’re looking for observations where you have full data.

region_tbl %>% inner_join(capital_tbl, by = "country")

Full Join

Joins together all observations from left (x) and right (y). It adds NA values where there is not overlapping data.

region_tbl %>% full_join(capital_tbl, by = "country")

Anti Join

anti_join() identifies the observations on the left (x) that do not have observations on the right (y).

This is extremely useful for checking to see that you have compelete data when you are joining datasets.

For example, here, we learn that we don’t have capital data for the United States, so we can go back and add it.

region_tbl %>% anti_join(capital_tbl, by = "country")

Reading in our data

These are a few tricks to avoid typing the same thing too many times:

folder_path <- partial(here, "00_data", "sov_debt_paris_alignment")

folder_path() %>% list.files()
##  [1] "country_list_bis_debt_securities.csv"  
##  [2] "country_list_combined_bond_indices.csv"
##  [3] "country_list_em_dm.csv"                
##  [4] "emissions_dataset.csv"                 
##  [5] "emissions_dataset.rds"                 
##  [6] "emissions_dataset_full.csv"            
##  [7] "emissions_dataset_full.rds"            
##  [8] "imf_wb_country_groups.csv"             
##  [9] "imf_wb_country_groups.rds"             
## [10] "wb_income_groups.csv"                  
## [11] "wb_income_groups.rds"                  
## [12] "weo_world_income_population.csv"       
## [13] "weo_world_income_population.rds"

Let’s read in all the datasets

emissions_dataset <- folder_path("emissions_dataset.rds") %>%
  read_rds()

emissions_dataset_full <- folder_path("emissions_dataset_full.rds") %>%
  read_rds()

bond_indices <- folder_path("country_list_combined_bond_indices.csv") %>%
  read_csv()
## Rows: 158 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): country_name, iso3c, index, em_dm
## dbl (1): value_pct
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bis_gov_securities <- folder_path("country_list_bis_debt_securities.csv") %>%
  read_csv()
## Rows: 55 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): country_name, iso3c, em_dm
## dbl (2): total_debt_securities, pct_of_total
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imf_wb_country_groups <- folder_path("imf_wb_country_groups.rds") %>%
  read_rds()

wb_income_groups <- folder_path("wb_income_groups.rds") %>%
  read_rds()

weo_world_income_population <- folder_path("weo_world_income_population.rds") %>%
  read_rds()

taking a look at our datasets

Emissions Datasets

The main emissions dataset is created from 2 other datasets.

  1. Our World In Data carbon emissions data
  2. IMF WEO data

Carbon data comes from OWID, and the population and macroeconomic data come from the IMF. Many of the variables you see below are “feature engineered” from the base data using the dply::mutate() function as we’ve done elsewhere. For example, consumption_co2_per_capita is caluclated by dividing the consumption carbon emissions figure from OWID by the population data provided by the IMF.

I’m a big fan of doing ones own calculation of figures like this from base data that you trust and know the source of. If you see differences in a per capita or per GDP figure, you can get a transparent sense of whether that difference is because of the numerator or the denominator.

Here are the numerators:

  • Territorial Emissions: the Co2 emissions created in a country’s physical territory.
  • Trade Emissions: Emissions imported or exported in trade with other countries.
  • Consumption Emissions: Emissions resulting from territorial emissions + Net trade emissions.
  • Cumulative Emissions: Cumulative territorial emissions.

Here are the denominators:

  • Population
  • GDP
  • Debt
  • Government Expenditure

Current GFANZ guidance for “Net Zero” investing in sovereign debt suggests using territorial emissions per capita.

This version of the emissions dataset has only the countries that are in the bond indices we’re using. We also have the other countries in emissions_dataset_full should you want to use them.

emissions_dataset %>% glimpse()
## Rows: 2,820
## Columns: 31
## $ country_name                         <chr> "Angola", "Angola", "Angola", "An…
## $ iso3c                                <chr> "AGO", "AGO", "AGO", "AGO", "AGO"…
## $ em_dm                                <chr> "Emerging Markets", "Emerging Mar…
## $ year                                 <dbl> 1990, 1991, 1992, 1993, 1994, 199…
## $ gdp_usd_current_prices               <dbl> 12.571, 12.186, 9.395, 6.819, 4.9…
## $ gdp_ppp_current_prices               <dbl> 19.681, 22.805, 25.984, 29.518, 3…
## $ gdp_pc_usd_current_prices            <dbl> 986.639, 928.506, 695.059, 489.75…
## $ gdp_pc_ppp_current_prices            <dbl> 1544.61, 1737.67, 1922.26, 2120.0…
## $ population                           <dbl> 12.742, 13.124, 13.518, 13.923, 1…
## $ govt_expenditure_pct_gdp             <dbl> NA, NA, NA, NA, NA, NA, 33.197, 3…
## $ debt_pct_gdp                         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2                      <dbl> 5.090, 5.064, 5.164, 5.748, 3.865…
## $ trade_co2                            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cumulative_co2                       <dbl> 115.663, 120.727, 125.891, 131.64…
## $ debt_usd                             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ govt_expenditure_usd                 <dbl> NA, NA, NA, NA, NA, NA, 265.3768,…
## $ territorial_co2_per_capita           <dbl> 0.3994663, 0.3858580, 0.3820092, …
## $ trade_co2_per_capita                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_per_capita           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2_per_gdp              <dbl> 0.2586251, 0.2220566, 0.1987377, …
## $ trade_co2_per_gdp                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_per_gdp              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2_per_debt             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ trade_co2_per_debt                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_per_debt             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2_per_govt_expenditure <dbl> NA, NA, NA, NA, NA, NA, 0.0392799…
## $ trade_co2_per_govt_expenditure       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_govt_expenditure     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ trade_pct_of_consumption_co2         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cumulative_co2_per_capita            <dbl> 9.077303, 9.198948, 9.312842, 9.4…

Bond Indices

This data comes from BlackRock government bond ETFs. It’s useful because it can help us clarify how different decisions would affect asset allocation decisions between asset classes.

bond_indices %>% glimpse()
## Rows: 158
## Columns: 5
## $ country_name <chr> "Australia", "Austria", "Belgium", "Canada", "Denmark", "…
## $ iso3c        <chr> "AUS", "AUT", "BEL", "CAN", "DNK", "FIN", "FRA", "DEU", "…
## $ value_pct    <dbl> 0.04692378933177, 0.04572291615032, 0.04605351394130, 0.0…
## $ index        <chr> "Developed Markets", "Developed Markets", "Developed Mark…
## $ em_dm        <chr> "Developed Markets", "Developed Markets", "Developed Mark…

One challenge with using real-world bond indices is that they generally don’t mix emerging markets and developed markets. And because part of what we’re looking at is the relative impact of different carbon intensity measures on EM and DM, it would be nice to have an index that covers both. Also, there are some countries, like China & India, for example, that are very large economies but have relatively low weights in government bond indices.

So we’re making a synthetic bond benchmark using the Bank for International Settlements (BIS) data on outstanding sovereign bonds. The logic is simple. Most investment benchmarks are based on market capitalization weighting (sometimes modified slightly to ensure diversification). We create an integrated EM/DM index using the total debt securities outstanding to determine the weights.

If the data you would like does not exist, figure out the best proxy. Keep it transparent and make sure you can justify to others why the assumptions are reasonable and the process adds value to the analysis.

bis_gov_securities %>% glimpse()
## Rows: 55
## Columns: 5
## $ country_name          <chr> "United States", "Japan", "China", "United Kingd…
## $ iso3c                 <chr> "USA", "JPN", "CHN", "GBR", "FRA", "ITA", "DEU",…
## $ total_debt_securities <dbl> 19115.4810, 9088.4047, 7056.0582, 2571.9160, 251…
## $ pct_of_total          <dbl> 0.326874485, 0.155411606, 0.120658507, 0.0439797…
## $ em_dm                 <chr> "Developed Markets", "Developed Markets", "Emerg…

Helpful Features

imf_wb_country_groups %>% glimpse()
## Rows: 2,587
## Columns: 3
## $ country_name  <chr> "Australia", "Austria", "Belgium", "Canada", "Switzerlan…
## $ country_group <chr> "Advanced Economies", "Advanced Economies", "Advanced Ec…
## $ group_type    <chr> "IMF", "IMF", "IMF", "IMF", "IMF", "IMF", "IMF", "IMF", …
imf_wb_country_groups %>%
  group_by(country_group, group_type) %>%
  count()
wb_income_groups %>% glimpse()
## Rows: 217
## Columns: 2
## $ country_name    <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"…
## $ wb_income_group <fct> High, Low, Lower Middle, Upper Middle, High, High, Upp…

This dataset provides helps provide global context. It provided the GDP per capita (PPP) and the population, as well as indicators of where these figures fall on the entire global scale (217 countries + territories).

weo_world_income_population %>% glimpse()
## Rows: 217
## Columns: 11
## $ country_name                                  <chr> "Afghanistan", "Albania"…
## $ iso3c                                         <chr> "AFG", "ALB", "DZA", "AN…
## $ wb_income_group                               <fct> Low, Upper Middle, Lower…
## $ gdp_pc_ppp_current_prices_value               <dbl> 2542.64, 14467.09, 11893…
## $ gdp_pc_ppp_current_prices_pct_of_world_total  <dbl> 0.0005905654, 0.00336019…
## $ gdp_pc_ppp_current_prices_percentile_of_world <dbl> 0.1256545, 0.5130890, 0.…
## $ gdp_pc_ppp_current_prices_decile_of_world     <dbl> 2, 6, 5, 10, 4, 7, 7, 5,…
## $ population_value                              <dbl> 32.200, 2.881, 43.424, 0…
## $ population_pct_of_world_total                 <dbl> 0.00426954287, 0.0003820…
## $ population_percentile_of_world                <dbl> 0.77486911, 0.30890052, …
## $ population_decile_of_world                    <dbl> 8, 4, 9, 1, 8, 1, 9, 4, …

The Latest Data

The latest data is from 2019. It’s very common for international environmental data to have significant time lags. Let’s create a dataset that just focus on the latest data by filtering for year==2019. We’re also going to select just the data we want to look at here, and pivot it longer into tidy format.

emissions_dataset_2019 <- emissions_dataset %>%
  filter(year == 2019) %>%
  select(-year) %>%
  pivot_longer(gdp_usd_current_prices:cumulative_co2_per_capita)
  

emissions_dataset_2019

We use our friend left_join() to connect these datasets like legos.

bond_indices_w_data <- bond_indices %>% 
  left_join(emissions_dataset_2019, by = c("country_name", "iso3c", "em_dm")) %>%
  rename(index_pct = value_pct, variable = name)

bond_indices_w_data

Calculated Benchmark Weighted Statistics

At the highest level, we’re trying to find the forest for the trees. Each investment benchmark has a lot of countries. How can we describe the characteristics of the entire benchmark? Most commonly, we do this using weighted averages. For an investment benchmark we use the sum of weight % multiplied by the figure in question (population, GDP, emissions, etc.).

First: Check The Data

bond_indices_w_data %>%
  # filter for NA values)
  filter(is.na(value)) %>%
  # filter for when the variable includes "trade" or "consumption"
  filter(str_detect(variable,pattern ="trade|consumption"))
missing_trade_data <- bond_indices_w_data %>%
  filter(is.na(value)) %>%
  filter(str_detect(variable,pattern ="trade|consumption")) %>%
  select(country_name, index_pct, index, em_dm) %>%
  unique() 

missing_trade_data

How much coverage do we have?

missing_trade_data %>%
  group_by(index) %>%
  summarize(pct_of_index_missing = sum(index_pct) * 100,
            pct_of_index_complete = 100 - pct_of_index_missing)

We’ve got 95%+ coverage. This is pretty good! We can work with that.

What other data is missing?

bond_indices_w_data %>%
  filter(is.na(value)) %>%
  # use the ! (not) sign to search for anythinge except what we searched for above
  filter(!str_detect(variable,pattern ="trade|consumption")) %>%
  select(country_name, index_pct, index, em_dm) %>%
  unique() 

And it looks like nothing else is missing.

Reweighting the indices

Do the current index weighst sum to 1? yep.

bond_indices_w_data %>%
  select(country_name, index_pct, index) %>%
  unique() %>%
  group_by(index) %>%
  summarize(index_total_weight = sum(index_pct))

Take out the missing data, and re-calculate the weights

bond_indices_w_data_reweighted <- bond_indices_w_data %>%
  filter(!is.na(value)) %>%
  group_by(em_dm, index, variable) %>%
  mutate(index_pct = index_pct/sum(index_pct)) %>%
  ungroup()

bond_indices_w_data_reweighted   

Calculated Benchmark Weighted Statistics

Now we can calculate the benchmark weighted for everything in just a few lines.

bond_indices_weighted_avg <- bond_indices_w_data_reweighted %>%
  mutate(contribution = index_pct * value) %>%
  group_by(index, variable) %>%
  summarize(weighted_avg = sum(contribution)) %>%
  ungroup() %>%
  arrange(variable, weighted_avg)
## `summarise()` has grouped output by 'index'. You can override using the
## `.groups` argument.
bond_indices_weighted_avg 
bond_indices_weighted_avg %>%
  summarize(n_index = n_distinct(index),
            n_variable = n_distinct(variable)) %>%
  mutate(total_n = n_index * n_variable)

Let’s just stop to think about this for a minute. Tidy datasets like the one above are not intuitive to work with at first. But there are some great reasons why we learn to love them. And this is one of them. We just calculated 108 different weighted averages in just a few lines of code. Doing this in Excel would be unfun.

Comparing Numerators

We’ll use a workflow similar to what we’ve used before. We’ll pivot tidy data into wide data so that we can compare two groups.

numerator_compare <- bond_indices_weighted_avg %>%
  filter(variable == "consumption_co2_per_capita"|
           variable == "territorial_co2_per_capita") %>%
  pivot_wider(names_from = variable, values_from = weighted_avg) %>%
  mutate(pct_delta_from_consumption = (territorial_co2_per_capita/consumption_co2_per_capita - 1) * 100)

numerator_compare
numerator_compare %>%
  ggplot(aes(x = territorial_co2_per_capita, y = consumption_co2_per_capita, size = abs(pct_delta_from_consumption), label = index, alpha = abs(pct_delta_from_consumption))) +
  geom_point() +
  geom_text_repel() +
  labs(size = "",
       alpha = "") +
  geom_abline(intercept = 0, lty = "dashed", alpha = .6) +
  scale_x_continuous(limits = c(0,10)) +
  scale_y_continuous(limits = c(0,10))

We’ve made the size aesthetic map to the absolute value of the percentage difference. What we can see is that

  • developed markets are much more carbon intensive from a consumption perspective than a territorial perspective
  • EM hard currency debt is the opposite.
  • EM local currency debt is about the same in both.

This kind of chart brings up a lot of questions that we can explore in the data later.

Squeezing Out More Insights

What percent of consumption emissions come from net imports?

Well, for Developed Markets its pretty high. This is relevant because the current policy proposals suggest using territorial emissions. We can see that the main policy implications of that decision is to make Developed Markets look a lot cleaner.

bond_indices_weighted_avg %>%
  filter(variable == "trade_pct_of_consumption_co2") %>%
  mutate(weighted_avg = weighted_avg * 100)

The DM benchmark is 2-2.5x richer than the EM benchmarks

bond_indices_weighted_avg %>%
  filter(variable == "gdp_pc_ppp_current_prices")

this matters because the only difference between per capita and per GDP calculations is that the denominator is multiplied by GDP per capita.

emission per capita = emissions/population

emission per GDP = emissions/GDP = emissions/(population * GDP per capita).

Think about what this means. Because it inflates the denominator by GDP per capita, it’s basically lowering the resulting emission intensity by a country’s level of income. Morally questionable? Hmmm.

BIS Securities

Now you’re going to play around with the BIS data we showed earlier.

bis_gov_securities %>% glimpse()
## Rows: 55
## Columns: 5
## $ country_name          <chr> "United States", "Japan", "China", "United Kingd…
## $ iso3c                 <chr> "USA", "JPN", "CHN", "GBR", "FRA", "ITA", "DEU",…
## $ total_debt_securities <dbl> 19115.4810, 9088.4047, 7056.0582, 2571.9160, 251…
## $ pct_of_total          <dbl> 0.326874485, 0.155411606, 0.120658507, 0.0439797…
## $ em_dm                 <chr> "Developed Markets", "Developed Markets", "Emerg…

Putting Together The Data

“Developed Markets” and “Emerging Markets” are ultimately marketing constructs to describe investment benchmarks. They are nebulous and somewhat arbitrary.

So let’s also look at the IMF’s classification of Advanced Economies and Emerging Market Economies. It overlaps mostly, but is less arbitrary. And it may be more meaningful to policymakers, who are another audience for this analysis.

imf_ae_eme <- imf_wb_country_groups %>%
  filter(country_group %in% c("Advanced Economies", "Emerging Market Economies")) %>%
  select(-group_type)

imf_ae_eme

We’ll add togeher a variety of datasets here so we have lots of features to play around with

bis_emissions <- bis_gov_securities %>%
  left_join(imf_ae_eme) %>%
  rename(imf_ae_eme = country_group) %>%
  mutate(region = countrycode(iso3c, origin = "iso3c", destination = "un.region.name"),
         sub_region = countrycode(iso3c, origin = "iso3c", destination = "un.regionsub.name")) %>%
  left_join(wb_income_groups) %>%
  left_join(emissions_dataset_2019, by = c("country_name", "iso3c", "em_dm")) %>%
  rename(characteristic = name)
## Joining, by = "country_name"
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: TWN

## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: TWN
## Joining, by = "country_name"
bis_emissions

We get the following warning: Some values were not matched unambiguously: TWN

Country and territory names and classifications are geopolitical. These datasets often come from the UN System, and China doesn’t want Taiwan included as a country.

So for our analysis, we’ll add the region data manually.

bis_emissions_clean <- bis_emissions %>%
  # case_when() is a more flexible if_else().  It's great.  learn it. 
  mutate(region = case_when(iso3c == "TWN" ~ "Asia",
                            TRUE ~ region),
         sub_region = case_when(iso3c == "TWN" ~ "Eastern Asia",
                            TRUE ~ sub_region))

bis_emissions_clean

Now instead of re-writing code. We can use the function we wrote in a previous exercise. Normally you might put this function in a functions.r file for easy future use. And eventually you can make your own R package.

add_rank_features <- function(data, value_var = value, rank_by_lowest = TRUE, 
                              quantile_n = 5) {
  

  # If rank_by_lowest is set to TRUE, as it is by default, then it will use this first code chunk 
if (rank_by_lowest)
  data %>%
    mutate("{{value_var}}_z_score" := ({{ value_var }}-mean({{ value_var }}))/
             sd({{ value_var }}),
         "{{value_var}}_rank" := rank({{ value_var }}), 
         "{{value_var}}_percentile" := percent_rank({{ value_var }}),
         "{{value_var}}_quantile_{{quantile_n}}" := ntile({{ value_var }}, 
                                                          {{ quantile_n }}),
         "{{value_var}}_pct_total" := {{ value_var }}/sum({{ value_var }}))
  
   # If rank_by_lowest is set to FALSE, as it is by default, then it will use this second code chunk
else
  
  data %>%
    mutate("{{value_var}}_z_score" := ({{ value_var }}-mean({{ value_var }}))/
             sd({{ value_var }}) * -1, # multiply by -1
           # put negative sign
           "{{value_var}}_rank" := rank(-{{ value_var }}), 
           # put negative sign
           "{{value_var}}_percentile" := percent_rank(-{{ value_var }}),
           # put negative sign
           "{{value_var}}_quantile_{{quantile_n}}" := ntile(-{{ value_var }}, 
                                                            {{ quantile_n }}), 
           # no change, not relevant for pct_total
           "{{value_var}}_pct_total" := {{ value_var }}/sum({{ value_var }})) 
    
}
bis_emissions_w_features <- bis_emissions_clean %>%
  group_by(characteristic) %>%
  add_rank_features(rank_by_lowest = FALSE) %>%
  ungroup()

bis_emissions_w_features
bis_emissions_w_features %>% glimpse()
## Rows: 1,485
## Columns: 16
## $ country_name          <chr> "United States", "United States", "United States…
## $ iso3c                 <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA",…
## $ total_debt_securities <dbl> 19115.48, 19115.48, 19115.48, 19115.48, 19115.48…
## $ pct_of_total          <dbl> 0.3268745, 0.3268745, 0.3268745, 0.3268745, 0.32…
## $ em_dm                 <chr> "Developed Markets", "Developed Markets", "Devel…
## $ imf_ae_eme            <chr> "Advanced Economies", "Advanced Economies", "Adv…
## $ region                <chr> "Americas", "Americas", "Americas", "Americas", …
## $ sub_region            <chr> "Northern America", "Northern America", "Norther…
## $ wb_income_group       <fct> High, High, High, High, High, High, High, High, …
## $ characteristic        <chr> "gdp_usd_current_prices", "gdp_ppp_current_price…
## $ value                 <dbl> 21372.6000000000, 21372.6000000000, 65051.880000…
## $ value_z_score         <dbl> -5.78563491, -4.45319392, -1.38198117, -1.027934…
## $ value_rank            <dbl> 1, 2, 6, 6, 3, 32, 6, 2, 1, 2, 1, 1, 1, 3, 23, 4…
## $ value_percentile      <dbl> 0.00000000, 0.01851852, 0.09259259, 0.09259259, …
## $ value_quantile_5      <int> 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ value_pct_total       <dbl> 0.2659056825, 0.1819383748, 0.0375239280, 0.0280…

Now it’s your turn

You have a lot of features to play around with.

Sorted bar charts

bond_indices_weighted_avg %>%
  filter(variable == "gdp_pc_ppp_current_prices") %>%
  ggplot(aes(x = weighted_avg, y = fct_reorder(.f = index, .x = weighted_avg, .desc = TRUE))) +
  geom_bar(stat = "identity")

bis_gov_securities %>%
  glimpse()
## Rows: 55
## Columns: 5
## $ country_name          <chr> "United States", "Japan", "China", "United Kingd…
## $ iso3c                 <chr> "USA", "JPN", "CHN", "GBR", "FRA", "ITA", "DEU",…
## $ total_debt_securities <dbl> 19115.4810, 9088.4047, 7056.0582, 2571.9160, 251…
## $ pct_of_total          <dbl> 0.326874485, 0.155411606, 0.120658507, 0.0439797…
## $ em_dm                 <chr> "Developed Markets", "Developed Markets", "Emerg…
em_debt <- bis_gov_securities %>%
  group_by(em_dm)%>%
  summarize(total_debt = sum(total_debt_securities)) %>%
  print()
## # A tibble: 2 × 2
##   em_dm             total_debt
##   <chr>                  <dbl>
## 1 Developed Markets     44964.
## 2 Emerging Markets      13516.
em_debt %>%
  ggplot(aes(x = em_dm, y = total_debt)) + 
  geom_col(fill = "brown3", colour = "black") + 
  labs(x = "Market Type", 
       y = "Total Debt",
       title = "Total Debt by Market Type",
       subtitle = "$mn",
       caption = "Source: World Bank Data") +
  scale_y_continuous(breaks = seq(10000,50000, by = 5000)) +
  theme_economist()

emissions_dataset_full %>%
  rename(Country = iso3c) %>%
  filter(Country %in% c("SGP","AUS","PHL","BGD","IND","IDN","MYS")) %>%
    ggplot(emissions_dataset_full, mapping = aes(x = year, y = cumulative_co2, group = Country)) +
    geom_line(aes(color = Country), size = 2) +
    labs(x = "Year", 
       y = "Cumulative CO2",
       title = "Cumulative CO2 by Year",
       caption = "Source: IMF") +
    scale_y_continuous(breaks = seq(10000,1500000, by = 10000)) +
    scale_x_continuous(breaks = seq(1990,2020, by = 5)) + 
    theme_bw()

sub_region_emissions <- bis_emissions_w_features %>%
  group_by(sub_region) %>%
  filter(characteristic == "cumulative_co2") %>%
  summarize(total_co2 = sum(value)) %>%
  arrange(desc(total_co2)) %>%
  print()
## # A tibble: 12 × 2
##    sub_region                      total_co2
##    <chr>                               <dbl>
##  1 Northern America                  445051.
##  2 Eastern Asia                      317591.
##  3 Eastern Europe                    174178.
##  4 Western Europe                    163685.
##  5 Northern Europe                    98878.
##  6 Southern Asia                      51975.
##  7 Latin America and the Caribbean    51792.
##  8 Southern Europe                    47678.
##  9 South-eastern Asia                 32082.
## 10 Western Asia                       28368.
## 11 Sub-Saharan Africa                 20712.
## 12 Australia and New Zealand          18244.
fix(sub_region_emissions)
sub_region_emissions %>%  
  ggplot(aes(x = reorder (sub_region, -total_co2), y = total_co2)) +
  geom_col(fill = "brown") +
   labs(x = "Sub Region", 
       y = "Total CO2",
       title = "Total CO2 Emissions by Sub-Region",
       caption = "Source: IMF") +
  scale_y_continuous(breaks = seq(1000,1000000, by = 50000)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0 , hjust=.5),
        panel.background = element_rect(fill = "lightblue",
                           colour = "lightblue",
                           size = 0.5, 
                           linetype = "solid")) 

emissions <- read.csv(“emissions_dataset.csv”)

emissions_AGO <- emissions %>% filter(country_name == “Angola”) %>% select(-iso3c) %>% select(-em_dm)