Today’s lesson is based on an actual project that I recently worked on for a hedge fund client. Following COP26 and the rise of Mark Carney’s Glasgow Financial Alliance for Net Zero (GFANZ), clients were increasingly interested in measuring the carbon footprints of their portfolios.
To answer this, first I had to read. A lot. I read everything that I could on carbon footprinting generally, and the few pieces that had been written specifically about carbon footprinting sovereigns.
I also used past domain expertise. What matters the most in driving fixed income flows are the metrics that drive investment benchmark construction, and the incentives they provide.
So for the data analysis, I focused on understanding the distributional implications of different proposed carbon intensity metrics. I used publicly available data so that I could make the findings reproducible. Transparency builds credibility.
This lessson will aim to:
This first section gives you a toolkit for working with sovereign level data. You are SAIS students. Being able to give a global perspective to issues is your calling card. So learning to work with sovereign level data is important for everyone, not just those working in sovereign fixed income
countrycode package is your friendA persistent challenge for working with sovereign-level data is dealing with country names. Countries are referred to be a variety of different names and coding schemes. For example:
The countrycode package in R will be your best friend in
dealing with country names.
I’ll show you the workflow I use, and which works well for me. I take the raw country names and codes and convert them into
What is the advantage of doing this? If all of your datasets use standardized country names, it’s easy to combine datasets. And being able to combine datasets from disparate source is a key ingredient of great analysis.
?countrycode
## starting httpd help server ... done
countrycode("Russian Federation",origin = "country.name",
destination = "iso3c",
origin_regex = TRUE)
## [1] "RUS"
"Russian Federation" %>% countrycode(origin = "country.name",
destination = "iso3c",
origin_regex = TRUE)
## [1] "RUS"
And, because we work hard to be lazy, we make a function:
country_name_regex_to_iso3c <- function(country_name) {
country_name %>%
countrycode(origin = "country.name",
destination = "iso3c",
origin_regex = TRUE)
}
"Bolivarian Republic of Venezuela" %>% country_name_regex_to_iso3c()
## [1] "VEN"
"VEN" %>%
countrycode(origin = "iso3c", destination = "country.name")
## [1] "Venezuela"
And we’ll make another function to convert from iso3c to a standardized country name from the countrycode package.
iso3c_to_country_name <- function(iso3c) {
iso3c %>%
countrycode(origin = "iso3c", destination = "country.name")
}
"VEN" %>% iso3c_to_country_name()
## [1] "Venezuela"
Now we can put these together:
"DRC" %>%
country_name_regex_to_iso3c() %>%
iso3c_to_country_name()
## [1] "Congo - Kinshasa"
Of course, you can combine these together and do whatever you find most effective. But I find it useful both to have the written name and the iso3c code.
From here, I use the iso3c to create any other feature available in the countrycode package
Here are a few of the ones I find most useful
# look up ?purrr:partial(). It's an easy way to simplify functions.
iso3c_to_x <- purrr::partial(countrycode, origin = "iso3c")
"VEN" %>% iso3c_to_x(destination = "region")
## [1] "Latin America & Caribbean"
"VEN" %>% iso3c_to_x(destination = "unicode.symbol")
## [1] "🇻🇪"
First let’s start with a mini dataset where we have four countries each referred to by two different common permutations of their name.
messy_country_dataset <- tibble(messy_country = c("The United States of America", "US", "United Mexican States", "Mexico", "The Russian Federation", "Russia", "Czechia", "The Czech Republic"), value = c(7, 2, 5, 2, 8, 4, 1, 4))
messy_country_dataset
Now we can use the functions we made to clean up the dataset, and add a pretty flag that could be helpful if we present the final data in a table.
cleaned_names <- messy_country_dataset %>%
# using multiple mutate functions simply so we can run them one at a time.
mutate(iso3c = country_name_regex_to_iso3c(messy_country)) %>%
mutate(country_name = iso3c_to_country_name(iso3c)) %>%
mutate(flag = iso3c_to_x(iso3c, destination = "unicode.symbol"))
cleaned_names
Now you can do cool stuff with the data easily. Here’s an example.
cleaned_names %>%
group_by(flag, country_name, iso3c) %>%
summarize(sum_of_value = sum(value)) %>%
arrange(sum_of_value)
## `summarise()` has grouped output by 'flag', 'country_name'. You can override
## using the `.groups` argument.
The wbstats package makes it easy to get data from the
World Bank. Here is a tutorial
for getting started.
Often times we’ll want to get the most recent value. With sovereign
level data there are commonly large data gaps and lags. The wbstats
package has a nice function to give us the most recent value, which we
use below. mrnev = “Most recent non-empty value”.
gdp_capita <- wb_data("NY.GDP.PCAP.CD", mrnev = 1)
gdp_capita %>% glimpse()
## Rows: 214
## Columns: 8
## $ iso2c <chr> "AW", "AF", "AO", "AL", "AD", "AE", "AR", "AM", "AS", "…
## $ iso3c <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARE", "ARG", "ARM",…
## $ country <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra",…
## $ date <dbl> 2020, 2020, 2021, 2021, 2021, 2020, 2021, 2021, 2020, 2…
## $ NY.GDP.PCAP.CD <dbl> 23384.2988, 516.7479, 2137.9094, 6494.3857, 43047.6863,…
## $ obs_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ footnote <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ last_updated <date> 2022-09-16, 2022-09-16, 2022-09-16, 2022-09-16, 2022-0…
You can find interesting relevant datasets on the World Bank’s Sovereign ESG Data Portal.
From experience: The World Bank’s website isn’t the most user friendly. If you find an indicator that you want google “World Bank” and the name of the indicator, and google will do a better job than the World Bank’s own website at helping you locate the code.
Try googling: “World Bank Electricity production from coal sources (% of total))”. The following page should be the first page shown. You’ll see the data code in the end of the url. Should there be an easier way to find the codes? Yes. Is there? Let me know if you find one.
Joining datasets together from different sources helps you unlock new insights from your data. Learning the different joins can be confusing at first, but it’s well worth your time. Let’s practice with a very small dataset first so its clear what’s happening. You can come back to these examples when you are joining much larger datasets later in the exercise.
The first dataset has three countries and the regions they are part of.
region_tbl <- tibble(country = c("United States", "China", "Nigeria"), region = c("North America", "Asia", "Africa"))
region_tbl
The second dataset has three countries and their capitals.
Only two of the three countries overlap. We don’t have Pakistan’s
region in the region_tbl. And we don’t have the US capital
in the capital_tbl
capital_tbl <- tibble(country = c("Nigeria", "Pakistan", "China"), capital = c("Abuja", "Islamabad", "Beijing"))
capital_tbl
Most commmon join we will use: left_join()
It keeps all of the observations on the left (x), and will join only those observations on the right (y) that share the key (in this case, country).
Because there was no data about the capital of the US in
capital_tbl, it shows up as NA.
region_tbl %>% left_join(capital_tbl, by = "country")
As its name suggest, right_join() does the opposite.
It keeps all of the observations on the right (y), and will join only those observations on the right (x) that share the key (in this case, country).
Notice that the United States observation from the
region_tbl is no longer there, and that Pakistan, from the
capitals_tbl is now present, with an NA value
for region.
region_tbl %>% right_join(capital_tbl, by = "country")
inner_join keeps only observations that are in
both left (x) and (y).
It does not include Pakistan or the US, because they are not in both data structures.
This can be very useful when you’re looking for observations where you have full data.
region_tbl %>% inner_join(capital_tbl, by = "country")
Joins together all observations from left (x) and right (y). It adds
NA values where there is not overlapping data.
region_tbl %>% full_join(capital_tbl, by = "country")
anti_join() identifies the observations on the left (x)
that do not have observations on the right (y).
This is extremely useful for checking to see that you have compelete data when you are joining datasets.
For example, here, we learn that we don’t have capital data for the United States, so we can go back and add it.
region_tbl %>% anti_join(capital_tbl, by = "country")
These are a few tricks to avoid typing the same thing too many times:
purrr::partial(), with the here()
function to create a filepathlist.files() to print out the files in that folder
so that you can copy and past them.folder_path <- partial(here, "00_data", "sov_debt_paris_alignment")
folder_path() %>% list.files()
## [1] "country_list_bis_debt_securities.csv"
## [2] "country_list_combined_bond_indices.csv"
## [3] "country_list_em_dm.csv"
## [4] "emissions_dataset.csv"
## [5] "emissions_dataset.rds"
## [6] "emissions_dataset_full.csv"
## [7] "emissions_dataset_full.rds"
## [8] "imf_wb_country_groups.csv"
## [9] "imf_wb_country_groups.rds"
## [10] "wb_income_groups.csv"
## [11] "wb_income_groups.rds"
## [12] "weo_world_income_population.csv"
## [13] "weo_world_income_population.rds"
Let’s read in all the datasets
emissions_dataset <- folder_path("emissions_dataset.rds") %>%
read_rds()
emissions_dataset_full <- folder_path("emissions_dataset_full.rds") %>%
read_rds()
bond_indices <- folder_path("country_list_combined_bond_indices.csv") %>%
read_csv()
## Rows: 158 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): country_name, iso3c, index, em_dm
## dbl (1): value_pct
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bis_gov_securities <- folder_path("country_list_bis_debt_securities.csv") %>%
read_csv()
## Rows: 55 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): country_name, iso3c, em_dm
## dbl (2): total_debt_securities, pct_of_total
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imf_wb_country_groups <- folder_path("imf_wb_country_groups.rds") %>%
read_rds()
wb_income_groups <- folder_path("wb_income_groups.rds") %>%
read_rds()
weo_world_income_population <- folder_path("weo_world_income_population.rds") %>%
read_rds()
The main emissions dataset is created from 2 other datasets.
Carbon data comes from OWID, and the population and macroeconomic
data come from the IMF. Many of the variables you see below are “feature
engineered” from the base data using the dply::mutate()
function as we’ve done elsewhere. For example,
consumption_co2_per_capita is caluclated by dividing the
consumption carbon emissions figure from OWID by the population data
provided by the IMF.
I’m a big fan of doing ones own calculation of figures like this from base data that you trust and know the source of. If you see differences in a per capita or per GDP figure, you can get a transparent sense of whether that difference is because of the numerator or the denominator.
Here are the numerators:
Here are the denominators:
Current GFANZ guidance for “Net Zero” investing in sovereign debt suggests using territorial emissions per capita.
This version of the emissions dataset has only the countries that are
in the bond indices we’re using. We also have the other countries in
emissions_dataset_full should you want to use them.
emissions_dataset %>% glimpse()
## Rows: 2,820
## Columns: 31
## $ country_name <chr> "Angola", "Angola", "Angola", "An…
## $ iso3c <chr> "AGO", "AGO", "AGO", "AGO", "AGO"…
## $ em_dm <chr> "Emerging Markets", "Emerging Mar…
## $ year <dbl> 1990, 1991, 1992, 1993, 1994, 199…
## $ gdp_usd_current_prices <dbl> 12.571, 12.186, 9.395, 6.819, 4.9…
## $ gdp_ppp_current_prices <dbl> 19.681, 22.805, 25.984, 29.518, 3…
## $ gdp_pc_usd_current_prices <dbl> 986.639, 928.506, 695.059, 489.75…
## $ gdp_pc_ppp_current_prices <dbl> 1544.61, 1737.67, 1922.26, 2120.0…
## $ population <dbl> 12.742, 13.124, 13.518, 13.923, 1…
## $ govt_expenditure_pct_gdp <dbl> NA, NA, NA, NA, NA, NA, 33.197, 3…
## $ debt_pct_gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2 <dbl> 5.090, 5.064, 5.164, 5.748, 3.865…
## $ trade_co2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cumulative_co2 <dbl> 115.663, 120.727, 125.891, 131.64…
## $ debt_usd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ govt_expenditure_usd <dbl> NA, NA, NA, NA, NA, NA, 265.3768,…
## $ territorial_co2_per_capita <dbl> 0.3994663, 0.3858580, 0.3820092, …
## $ trade_co2_per_capita <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_per_capita <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2_per_gdp <dbl> 0.2586251, 0.2220566, 0.1987377, …
## $ trade_co2_per_gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_per_gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2_per_debt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ trade_co2_per_debt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_per_debt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ territorial_co2_per_govt_expenditure <dbl> NA, NA, NA, NA, NA, NA, 0.0392799…
## $ trade_co2_per_govt_expenditure <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ consumption_co2_govt_expenditure <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ trade_pct_of_consumption_co2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cumulative_co2_per_capita <dbl> 9.077303, 9.198948, 9.312842, 9.4…
This data comes from BlackRock government bond ETFs. It’s useful because it can help us clarify how different decisions would affect asset allocation decisions between asset classes.
bond_indices %>% glimpse()
## Rows: 158
## Columns: 5
## $ country_name <chr> "Australia", "Austria", "Belgium", "Canada", "Denmark", "…
## $ iso3c <chr> "AUS", "AUT", "BEL", "CAN", "DNK", "FIN", "FRA", "DEU", "…
## $ value_pct <dbl> 0.04692378933177, 0.04572291615032, 0.04605351394130, 0.0…
## $ index <chr> "Developed Markets", "Developed Markets", "Developed Mark…
## $ em_dm <chr> "Developed Markets", "Developed Markets", "Developed Mark…
One challenge with using real-world bond indices is that they generally don’t mix emerging markets and developed markets. And because part of what we’re looking at is the relative impact of different carbon intensity measures on EM and DM, it would be nice to have an index that covers both. Also, there are some countries, like China & India, for example, that are very large economies but have relatively low weights in government bond indices.
So we’re making a synthetic bond benchmark using the Bank for International Settlements (BIS) data on outstanding sovereign bonds. The logic is simple. Most investment benchmarks are based on market capitalization weighting (sometimes modified slightly to ensure diversification). We create an integrated EM/DM index using the total debt securities outstanding to determine the weights.
If the data you would like does not exist, figure out the best proxy. Keep it transparent and make sure you can justify to others why the assumptions are reasonable and the process adds value to the analysis.
bis_gov_securities %>% glimpse()
## Rows: 55
## Columns: 5
## $ country_name <chr> "United States", "Japan", "China", "United Kingd…
## $ iso3c <chr> "USA", "JPN", "CHN", "GBR", "FRA", "ITA", "DEU",…
## $ total_debt_securities <dbl> 19115.4810, 9088.4047, 7056.0582, 2571.9160, 251…
## $ pct_of_total <dbl> 0.326874485, 0.155411606, 0.120658507, 0.0439797…
## $ em_dm <chr> "Developed Markets", "Developed Markets", "Emerg…
imf_wb_country_groups %>% glimpse()
## Rows: 2,587
## Columns: 3
## $ country_name <chr> "Australia", "Austria", "Belgium", "Canada", "Switzerlan…
## $ country_group <chr> "Advanced Economies", "Advanced Economies", "Advanced Ec…
## $ group_type <chr> "IMF", "IMF", "IMF", "IMF", "IMF", "IMF", "IMF", "IMF", …
imf_wb_country_groups %>%
group_by(country_group, group_type) %>%
count()
wb_income_groups %>% glimpse()
## Rows: 217
## Columns: 2
## $ country_name <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"…
## $ wb_income_group <fct> High, Low, Lower Middle, Upper Middle, High, High, Upp…
This dataset provides helps provide global context. It provided the GDP per capita (PPP) and the population, as well as indicators of where these figures fall on the entire global scale (217 countries + territories).
weo_world_income_population %>% glimpse()
## Rows: 217
## Columns: 11
## $ country_name <chr> "Afghanistan", "Albania"…
## $ iso3c <chr> "AFG", "ALB", "DZA", "AN…
## $ wb_income_group <fct> Low, Upper Middle, Lower…
## $ gdp_pc_ppp_current_prices_value <dbl> 2542.64, 14467.09, 11893…
## $ gdp_pc_ppp_current_prices_pct_of_world_total <dbl> 0.0005905654, 0.00336019…
## $ gdp_pc_ppp_current_prices_percentile_of_world <dbl> 0.1256545, 0.5130890, 0.…
## $ gdp_pc_ppp_current_prices_decile_of_world <dbl> 2, 6, 5, 10, 4, 7, 7, 5,…
## $ population_value <dbl> 32.200, 2.881, 43.424, 0…
## $ population_pct_of_world_total <dbl> 0.00426954287, 0.0003820…
## $ population_percentile_of_world <dbl> 0.77486911, 0.30890052, …
## $ population_decile_of_world <dbl> 8, 4, 9, 1, 8, 1, 9, 4, …
The latest data is from 2019. It’s very common for international
environmental data to have significant time lags. Let’s create a dataset
that just focus on the latest data by filtering for
year==2019. We’re also going to select just the data we
want to look at here, and pivot it longer into tidy format.
emissions_dataset_2019 <- emissions_dataset %>%
filter(year == 2019) %>%
select(-year) %>%
pivot_longer(gdp_usd_current_prices:cumulative_co2_per_capita)
emissions_dataset_2019
We use our friend left_join() to connect these datasets
like legos.
bond_indices_w_data <- bond_indices %>%
left_join(emissions_dataset_2019, by = c("country_name", "iso3c", "em_dm")) %>%
rename(index_pct = value_pct, variable = name)
bond_indices_w_data
At the highest level, we’re trying to find the forest for the trees. Each investment benchmark has a lot of countries. How can we describe the characteristics of the entire benchmark? Most commonly, we do this using weighted averages. For an investment benchmark we use the sum of weight % multiplied by the figure in question (population, GDP, emissions, etc.).
bond_indices_w_data %>%
# filter for NA values)
filter(is.na(value)) %>%
# filter for when the variable includes "trade" or "consumption"
filter(str_detect(variable,pattern ="trade|consumption"))
missing_trade_data <- bond_indices_w_data %>%
filter(is.na(value)) %>%
filter(str_detect(variable,pattern ="trade|consumption")) %>%
select(country_name, index_pct, index, em_dm) %>%
unique()
missing_trade_data
How much coverage do we have?
missing_trade_data %>%
group_by(index) %>%
summarize(pct_of_index_missing = sum(index_pct) * 100,
pct_of_index_complete = 100 - pct_of_index_missing)
We’ve got 95%+ coverage. This is pretty good! We can work with that.
What other data is missing?
bond_indices_w_data %>%
filter(is.na(value)) %>%
# use the ! (not) sign to search for anythinge except what we searched for above
filter(!str_detect(variable,pattern ="trade|consumption")) %>%
select(country_name, index_pct, index, em_dm) %>%
unique()
And it looks like nothing else is missing.
Do the current index weighst sum to 1? yep.
bond_indices_w_data %>%
select(country_name, index_pct, index) %>%
unique() %>%
group_by(index) %>%
summarize(index_total_weight = sum(index_pct))
Take out the missing data, and re-calculate the weights
bond_indices_w_data_reweighted <- bond_indices_w_data %>%
filter(!is.na(value)) %>%
group_by(em_dm, index, variable) %>%
mutate(index_pct = index_pct/sum(index_pct)) %>%
ungroup()
bond_indices_w_data_reweighted
Now we can calculate the benchmark weighted for everything in just a few lines.
bond_indices_weighted_avg <- bond_indices_w_data_reweighted %>%
mutate(contribution = index_pct * value) %>%
group_by(index, variable) %>%
summarize(weighted_avg = sum(contribution)) %>%
ungroup() %>%
arrange(variable, weighted_avg)
## `summarise()` has grouped output by 'index'. You can override using the
## `.groups` argument.
bond_indices_weighted_avg
bond_indices_weighted_avg %>%
summarize(n_index = n_distinct(index),
n_variable = n_distinct(variable)) %>%
mutate(total_n = n_index * n_variable)
Let’s just stop to think about this for a minute. Tidy datasets like the one above are not intuitive to work with at first. But there are some great reasons why we learn to love them. And this is one of them. We just calculated 108 different weighted averages in just a few lines of code. Doing this in Excel would be unfun.
We’ll use a workflow similar to what we’ve used before. We’ll pivot tidy data into wide data so that we can compare two groups.
numerator_compare <- bond_indices_weighted_avg %>%
filter(variable == "consumption_co2_per_capita"|
variable == "territorial_co2_per_capita") %>%
pivot_wider(names_from = variable, values_from = weighted_avg) %>%
mutate(pct_delta_from_consumption = (territorial_co2_per_capita/consumption_co2_per_capita - 1) * 100)
numerator_compare
numerator_compare %>%
ggplot(aes(x = territorial_co2_per_capita, y = consumption_co2_per_capita, size = abs(pct_delta_from_consumption), label = index, alpha = abs(pct_delta_from_consumption))) +
geom_point() +
geom_text_repel() +
labs(size = "",
alpha = "") +
geom_abline(intercept = 0, lty = "dashed", alpha = .6) +
scale_x_continuous(limits = c(0,10)) +
scale_y_continuous(limits = c(0,10))
We’ve made the size aesthetic map to the absolute value of the percentage difference. What we can see is that
This kind of chart brings up a lot of questions that we can explore in the data later.
What percent of consumption emissions come from net imports?
Well, for Developed Markets its pretty high. This is relevant because the current policy proposals suggest using territorial emissions. We can see that the main policy implications of that decision is to make Developed Markets look a lot cleaner.
bond_indices_weighted_avg %>%
filter(variable == "trade_pct_of_consumption_co2") %>%
mutate(weighted_avg = weighted_avg * 100)
The DM benchmark is 2-2.5x richer than the EM benchmarks
bond_indices_weighted_avg %>%
filter(variable == "gdp_pc_ppp_current_prices")
this matters because the only difference between per capita and per GDP calculations is that the denominator is multiplied by GDP per capita.
emission per capita = emissions/population
emission per GDP = emissions/GDP = emissions/(population * GDP per capita).
Think about what this means. Because it inflates the denominator by GDP per capita, it’s basically lowering the resulting emission intensity by a country’s level of income. Morally questionable? Hmmm.
Now you’re going to play around with the BIS data we showed earlier.
bis_gov_securities %>% glimpse()
## Rows: 55
## Columns: 5
## $ country_name <chr> "United States", "Japan", "China", "United Kingd…
## $ iso3c <chr> "USA", "JPN", "CHN", "GBR", "FRA", "ITA", "DEU",…
## $ total_debt_securities <dbl> 19115.4810, 9088.4047, 7056.0582, 2571.9160, 251…
## $ pct_of_total <dbl> 0.326874485, 0.155411606, 0.120658507, 0.0439797…
## $ em_dm <chr> "Developed Markets", "Developed Markets", "Emerg…
“Developed Markets” and “Emerging Markets” are ultimately marketing constructs to describe investment benchmarks. They are nebulous and somewhat arbitrary.
So let’s also look at the IMF’s classification of Advanced Economies and Emerging Market Economies. It overlaps mostly, but is less arbitrary. And it may be more meaningful to policymakers, who are another audience for this analysis.
imf_ae_eme <- imf_wb_country_groups %>%
filter(country_group %in% c("Advanced Economies", "Emerging Market Economies")) %>%
select(-group_type)
imf_ae_eme
We’ll add togeher a variety of datasets here so we have lots of features to play around with
bis_emissions <- bis_gov_securities %>%
left_join(imf_ae_eme) %>%
rename(imf_ae_eme = country_group) %>%
mutate(region = countrycode(iso3c, origin = "iso3c", destination = "un.region.name"),
sub_region = countrycode(iso3c, origin = "iso3c", destination = "un.regionsub.name")) %>%
left_join(wb_income_groups) %>%
left_join(emissions_dataset_2019, by = c("country_name", "iso3c", "em_dm")) %>%
rename(characteristic = name)
## Joining, by = "country_name"
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: TWN
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: TWN
## Joining, by = "country_name"
bis_emissions
We get the following warning:
Some values were not matched unambiguously: TWN
Country and territory names and classifications are geopolitical. These datasets often come from the UN System, and China doesn’t want Taiwan included as a country.
So for our analysis, we’ll add the region data manually.
bis_emissions_clean <- bis_emissions %>%
# case_when() is a more flexible if_else(). It's great. learn it.
mutate(region = case_when(iso3c == "TWN" ~ "Asia",
TRUE ~ region),
sub_region = case_when(iso3c == "TWN" ~ "Eastern Asia",
TRUE ~ sub_region))
bis_emissions_clean
Now instead of re-writing code. We can use the function we wrote in a previous exercise. Normally you might put this function in a functions.r file for easy future use. And eventually you can make your own R package.
add_rank_features <- function(data, value_var = value, rank_by_lowest = TRUE,
quantile_n = 5) {
# If rank_by_lowest is set to TRUE, as it is by default, then it will use this first code chunk
if (rank_by_lowest)
data %>%
mutate("{{value_var}}_z_score" := ({{ value_var }}-mean({{ value_var }}))/
sd({{ value_var }}),
"{{value_var}}_rank" := rank({{ value_var }}),
"{{value_var}}_percentile" := percent_rank({{ value_var }}),
"{{value_var}}_quantile_{{quantile_n}}" := ntile({{ value_var }},
{{ quantile_n }}),
"{{value_var}}_pct_total" := {{ value_var }}/sum({{ value_var }}))
# If rank_by_lowest is set to FALSE, as it is by default, then it will use this second code chunk
else
data %>%
mutate("{{value_var}}_z_score" := ({{ value_var }}-mean({{ value_var }}))/
sd({{ value_var }}) * -1, # multiply by -1
# put negative sign
"{{value_var}}_rank" := rank(-{{ value_var }}),
# put negative sign
"{{value_var}}_percentile" := percent_rank(-{{ value_var }}),
# put negative sign
"{{value_var}}_quantile_{{quantile_n}}" := ntile(-{{ value_var }},
{{ quantile_n }}),
# no change, not relevant for pct_total
"{{value_var}}_pct_total" := {{ value_var }}/sum({{ value_var }}))
}
bis_emissions_w_features <- bis_emissions_clean %>%
group_by(characteristic) %>%
add_rank_features(rank_by_lowest = FALSE) %>%
ungroup()
bis_emissions_w_features
bis_emissions_w_features %>% glimpse()
## Rows: 1,485
## Columns: 16
## $ country_name <chr> "United States", "United States", "United States…
## $ iso3c <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA",…
## $ total_debt_securities <dbl> 19115.48, 19115.48, 19115.48, 19115.48, 19115.48…
## $ pct_of_total <dbl> 0.3268745, 0.3268745, 0.3268745, 0.3268745, 0.32…
## $ em_dm <chr> "Developed Markets", "Developed Markets", "Devel…
## $ imf_ae_eme <chr> "Advanced Economies", "Advanced Economies", "Adv…
## $ region <chr> "Americas", "Americas", "Americas", "Americas", …
## $ sub_region <chr> "Northern America", "Northern America", "Norther…
## $ wb_income_group <fct> High, High, High, High, High, High, High, High, …
## $ characteristic <chr> "gdp_usd_current_prices", "gdp_ppp_current_price…
## $ value <dbl> 21372.6000000000, 21372.6000000000, 65051.880000…
## $ value_z_score <dbl> -5.78563491, -4.45319392, -1.38198117, -1.027934…
## $ value_rank <dbl> 1, 2, 6, 6, 3, 32, 6, 2, 1, 2, 1, 1, 1, 3, 23, 4…
## $ value_percentile <dbl> 0.00000000, 0.01851852, 0.09259259, 0.09259259, …
## $ value_quantile_5 <int> 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ value_pct_total <dbl> 0.2659056825, 0.1819383748, 0.0375239280, 0.0280…
You have a lot of features to play around with.
bond_indices_weighted_avg %>%
filter(variable == "gdp_pc_ppp_current_prices") %>%
ggplot(aes(x = weighted_avg, y = fct_reorder(.f = index, .x = weighted_avg, .desc = TRUE))) +
geom_bar(stat = "identity")
bis_gov_securities %>%
glimpse()
## Rows: 55
## Columns: 5
## $ country_name <chr> "United States", "Japan", "China", "United Kingd…
## $ iso3c <chr> "USA", "JPN", "CHN", "GBR", "FRA", "ITA", "DEU",…
## $ total_debt_securities <dbl> 19115.4810, 9088.4047, 7056.0582, 2571.9160, 251…
## $ pct_of_total <dbl> 0.326874485, 0.155411606, 0.120658507, 0.0439797…
## $ em_dm <chr> "Developed Markets", "Developed Markets", "Emerg…
em_debt <- bis_gov_securities %>%
group_by(em_dm)%>%
summarize(total_debt = sum(total_debt_securities)) %>%
print()
## # A tibble: 2 × 2
## em_dm total_debt
## <chr> <dbl>
## 1 Developed Markets 44964.
## 2 Emerging Markets 13516.
em_debt %>%
ggplot(aes(x = em_dm, y = total_debt)) +
geom_col(fill = "brown3", colour = "black") +
labs(x = "Market Type",
y = "Total Debt",
title = "Total Debt by Market Type",
subtitle = "$mn",
caption = "Source: World Bank Data") +
scale_y_continuous(breaks = seq(10000,50000, by = 5000)) +
theme_economist()
emissions_dataset_full %>%
rename(Country = iso3c) %>%
filter(Country %in% c("SGP","AUS","PHL","BGD","IND","IDN","MYS")) %>%
ggplot(emissions_dataset_full, mapping = aes(x = year, y = cumulative_co2, group = Country)) +
geom_line(aes(color = Country), size = 2) +
labs(x = "Year",
y = "Cumulative CO2",
title = "Cumulative CO2 by Year",
caption = "Source: IMF") +
scale_y_continuous(breaks = seq(10000,1500000, by = 10000)) +
scale_x_continuous(breaks = seq(1990,2020, by = 5)) +
theme_bw()
sub_region_emissions <- bis_emissions_w_features %>%
group_by(sub_region) %>%
filter(characteristic == "cumulative_co2") %>%
summarize(total_co2 = sum(value)) %>%
arrange(desc(total_co2)) %>%
print()
## # A tibble: 12 × 2
## sub_region total_co2
## <chr> <dbl>
## 1 Northern America 445051.
## 2 Eastern Asia 317591.
## 3 Eastern Europe 174178.
## 4 Western Europe 163685.
## 5 Northern Europe 98878.
## 6 Southern Asia 51975.
## 7 Latin America and the Caribbean 51792.
## 8 Southern Europe 47678.
## 9 South-eastern Asia 32082.
## 10 Western Asia 28368.
## 11 Sub-Saharan Africa 20712.
## 12 Australia and New Zealand 18244.
fix(sub_region_emissions)
sub_region_emissions %>%
ggplot(aes(x = reorder (sub_region, -total_co2), y = total_co2)) +
geom_col(fill = "brown") +
labs(x = "Sub Region",
y = "Total CO2",
title = "Total CO2 Emissions by Sub-Region",
caption = "Source: IMF") +
scale_y_continuous(breaks = seq(1000,1000000, by = 50000)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0 , hjust=.5),
panel.background = element_rect(fill = "lightblue",
colour = "lightblue",
size = 0.5,
linetype = "solid"))
emissions <- read.csv(“emissions_dataset.csv”)
emissions_AGO <- emissions %>% filter(country_name == “Angola”) %>% select(-iso3c) %>% select(-em_dm)