Final project data cleaning

Preliminary Results for Final Project

Since I’m analyzing the effect of green financing on carbon emission for each country, and the effectiveness of different type of financial instrument, I will start with finding data:

  1. Carbon emission data, by country, by region: carbon emissions data based on production

  2. Green financing data, by country, by region: Data on green bonds, loans, and investments

  3. Project Outcome Data: Information on the estimated and actual carbon reduction achieved through financed projects - skipped (can’t access Bloomberg/climate bond initiative database). So I will run regression to predict the result.

  4. Policy and Regulatory Data: Details on climate-related financial regulations (again, the database that contains such data is unavailable to students) so I will try to search for relevent results directly.

library(readxl)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(writexl)
library(dplyr)
library(countrycode)
library(here) # for using relative file paths, here() starts at /Users/ciel.wang/Desktop/JH/R/00_data_raw 
here() starts at /Users/ciel.wang/Desktop/JH/R
#read the excel file
setwd("/Users/ciel.wang/Desktop/JH/R/00_data_raw")

#Carbon emission data
path_to_sheet <- here("00_data_raw", "Book3.xlsx")
emission <- read_xlsx("Book3.xlsx", 
                               sheet = 'emission_by_country'  )

#green financing data
green_debt <- read_xlsx("Book3.xlsx", 
                               sheet = 'green_debt'  )

Before cleaning and analyzing the data, we first standardized the country names and codes to make sure that they are matched in both data frames:

#design a function to standardize country name and code 
country_regex_to_iso3c <- function(country_string) {
  country_string |>
    countrycode::countrycode(origin = "country.name", destination = "iso3c", origin_regex = TRUE)
}

iso3c_to_country_name <- function(country_string) {
  country_string |>
    countrycode::countrycode(origin = "iso3c", destination = "country.name")
} 

Next, pivot longer each data sets and select the relevant columns:

emission <- emission %>% 
  mutate(continent = countrycode(Country_Name, origin = "country.name", destination = "continent"))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `continent = countrycode(Country_Name, origin = "country.name",
  destination = "continent")`.
Caused by warning:
! Some values were not matched unambiguously: Int. Aviation, Int. Shipping, Serbia and Montenegro, Virgin Islands_USA
emission_long <- emission |> 
  pivot_longer(
    cols= starts_with('Y_'),
    names_to = 'year',
    values_to = 'emission'
  ) |>
  rename(ISO3=Country_code_A3) |>
  mutate(year=parse_number(year)) |>
    select('ISO3','Country_Name', 'year','emission')
emission_long
# A tibble: 11,819 × 4
   ISO3  Country_Name  year emission
   <chr> <chr>        <dbl>    <dbl>
 1 ABW   Aruba         1970     45.2
 2 ABW   Aruba         1971     50.1
 3 ABW   Aruba         1972     60.9
 4 ABW   Aruba         1973     65.8
 5 ABW   Aruba         1974     65.1
 6 ABW   Aruba         1975     79.1
 7 ABW   Aruba         1976     78.2
 8 ABW   Aruba         1977     88.9
 9 ABW   Aruba         1978     93.7
10 ABW   Aruba         1979     97.7
# ℹ 11,809 more rows
green_debt_long <- green_debt |> 
  pivot_longer(
    cols= starts_with('F'),
    names_to = 'year',
    values_to = 'debt_issuance',
    values_drop_na = TRUE
  ) |>
  mutate(year=parse_number(year)) |>
  select("ISO3",Country_Name, year, debt_issuance)
green_debt_long
# A tibble: 863 × 4
   ISO3  Country_Name  year debt_issuance
   <chr> <chr>        <dbl>         <dbl>
 1 ARG   Argentina     2017        0.974 
 2 ARG   Argentina     2020        0.0500
 3 ARG   Argentina     2021        0.916 
 4 ARG   Argentina     2022        0.207 
 5 AUS   Australia     2014        0.526 
 6 AUS   Australia     2015        0.413 
 7 AUS   Australia     2016        0.531 
 8 AUS   Australia     2017        2.53  
 9 AUS   Australia     2018        2.22  
10 AUS   Australia     2019        1.98  
# ℹ 853 more rows

Then we can try to standardize country variables so that the two data frames can be merged:

#### standardzied country names and codes
emission_long |>
  mutate(ISO3 = country_regex_to_iso3c(Country_Name))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `ISO3 = country_regex_to_iso3c(Country_Name)`.
Caused by warning:
! Some values were not matched unambiguously: Int. Aviation, Int. Shipping, Netherlands Antilles, Serbia and Montenegro, Virgin Islands_USA
# A tibble: 11,819 × 4
   ISO3  Country_Name  year emission
   <chr> <chr>        <dbl>    <dbl>
 1 ABW   Aruba         1970     45.2
 2 ABW   Aruba         1971     50.1
 3 ABW   Aruba         1972     60.9
 4 ABW   Aruba         1973     65.8
 5 ABW   Aruba         1974     65.1
 6 ABW   Aruba         1975     79.1
 7 ABW   Aruba         1976     78.2
 8 ABW   Aruba         1977     88.9
 9 ABW   Aruba         1978     93.7
10 ABW   Aruba         1979     97.7
# ℹ 11,809 more rows
green_debt_long |>
  mutate(ISO3 = country_regex_to_iso3c(Country_Name))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `ISO3 = country_regex_to_iso3c(Country_Name)`.
Caused by warning:
! Some values were not matched unambiguously: World
# A tibble: 863 × 4
   ISO3  Country_Name  year debt_issuance
   <chr> <chr>        <dbl>         <dbl>
 1 ARG   Argentina     2017        0.974 
 2 ARG   Argentina     2020        0.0500
 3 ARG   Argentina     2021        0.916 
 4 ARG   Argentina     2022        0.207 
 5 AUS   Australia     2014        0.526 
 6 AUS   Australia     2015        0.413 
 7 AUS   Australia     2016        0.531 
 8 AUS   Australia     2017        2.53  
 9 AUS   Australia     2018        2.22  
10 AUS   Australia     2019        1.98  
# ℹ 853 more rows

Join the data frames by country and year and calculate the financed emission based on the sovereign debt metric:

debt_emission <- left_join( green_debt_long, emission_long,
                                     by = c("ISO3" = "ISO3", "year" = "year", "Country_Name" = "Country_Name")) |>
  mutate(
    previous_year_emission = lag(emission),
    percentage_change_emission = (emission - previous_year_emission)/ previous_year_emission * 100
  ) |>
  select("ISO3", "Country_Name", "year", "debt_issuance","emission",
        "percentage_change_emission") 

debt_emission
# A tibble: 863 × 6
   ISO3  Country_Name  year debt_issuance emission percentage_change_emission
   <chr> <chr>        <dbl>         <dbl>    <dbl>                      <dbl>
 1 ARG   Argentina     2017        0.974   379169.                    NA     
 2 ARG   Argentina     2020        0.0500  359026.                    -5.31  
 3 ARG   Argentina     2021        0.916   378420.                     5.40  
 4 ARG   Argentina     2022        0.207   382992.                     1.21  
 5 AUS   Australia     2014        0.526   585442.                    52.9   
 6 AUS   Australia     2015        0.413   593561.                     1.39  
 7 AUS   Australia     2016        0.531   594024.                     0.0780
 8 AUS   Australia     2017        2.53    601065.                     1.19  
 9 AUS   Australia     2018        2.22    597774.                    -0.548 
10 AUS   Australia     2019        1.98    596032.                    -0.291 
# ℹ 853 more rows

Folder Structure: Create 3 folders (and as many sub-folders as you would like)

  • 00_data_raw:  put your raw files in here.

  • 01_data_cleaning: put your r notebooks in hear that you use to clean the data.

  • 03_data_processed:  write your processed files into this folder (as .csv, .rds, or whatever file format you prefer).

 

Include your data tidying plan in an initial section. Try to execute your plan from step 1.   Use headers and text to narrate what you are trying to do in code chunks.

When you’ve finished tidying your data, write it to the folder 03_data_processed

Step 4:  Publish your notebook to RPubs