01_data_cleaning

Step 1: Install packages

The first step is to install packages and load data in file raw_data in this data_cleaning file.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(dplyr)
library(here)

## Warning: package 'here' was built under R version 4.2.3

## here() starts at C:/Users/Runnan/Documents/sustainable finance/Final Project

library(readxl)

## Warning: package 'readxl' was built under R version 4.2.3

library(qmd)

## Warning: package 'qmd' was built under R version 4.2.3

library(readxl)

#this step is to load data_list I saved in the first file to this file so that I can directly use data sets without importing them again#

print("Reading data_list.rds file")

## [1] "Reading data_list.rds file"

data_list <- readRDS("C:/Users/Runnan/Documents/sustainable finance/Final Project/data_list.rds")

# To check if the names are correct#
names(data_list)

## [1] "Historical_GHG_Emission.xlsx"         
## [2] "Electricity.xlsx"                     
## [3] "Energy_use"                           
## [4] "Infrastructure_Economic_Accounts.xlsx"
## [5] "Population.xlsx"                      
## [6] "PPPGDP.xlsx"                          
## [7] "Supply_Demand_NG.xlsx"

data_list <- readRDS("C:/Users/Runnan/Documents/sustainable finance/Final Project/data_list.rds")
print(data_list)

## $Historical_GHG_Emission.xlsx
## # A tibble: 14 × 34
##    Country Source  Sector Gas   `1990` `1991` `1992` `1993` `1994` `1995` `1996`
##    <chr>   <chr>   <chr>  <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 CAN     Climat… Total… All … 540.   536.   551.   547.   565.   580.   598.  
##  2 CAN     Climat… Total… All … 607.   603.   617.   613.   631.   646.   668.  
##  3 CAN     Climat… Energy All … 446.   442.   456.   454.   469.   482.   497.  
##  4 CAN     Climat… Indus… All …  29.1   28.5   26.8   24.9   27.2   27.1   27.6 
##  5 CAN     Climat… Agric… All …  47.7   48.1   49.3   49.8   49.9   52.6   55.0 
##  6 CAN     Climat… Waste  All …  17.7   18.0   18.2   18.4   18.4   18.4   18.3 
##  7 CAN     Climat… Land-… All …  66.5   66.5   66.5   66.5   66.5   66.8   70.8 
##  8 CAN     Climat… Bunke… All …   5.68   5.53   5.82   5.27   5.60   5.86   6.11
##  9 CAN     Climat… Elect… All … 139.   139.   146.   138.   143.   148.   149.  
## 10 CAN     Climat… Build… All …  64.6   64.1   65.9   69.1   70.3   69.3   74.8 
## 11 CAN     Climat… Manuf… All …  77.0   75.0   74.0   73.9   77.2   78.3   81.8 
## 12 CAN     Climat… Trans… All … 129.   125.   129.   131.   137.   141.   144.  
## 13 CAN     Climat… Other… All …   7.38   7.43   8.03   8.38   8.47   8.98   9.54
## 14 CAN     Climat… Fugit… All …  29.7   31.2   33.0   33.1   33.2   35.9   36.8 
## # … with 23 more variables: `1997` <dbl>, `1998` <dbl>, `1999` <dbl>,
## #   `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>,
## #   `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>
## 
## $Electricity.xlsx
## # A tibble: 3,809 × 16
##    REF_DATE GEO    DGUID      Source Fuel …¹ UOM   UOM_ID SCALA…² SCALA…³ VECTOR
##       <dbl> <chr>  <chr>      <chr>  <chr>   <chr>  <dbl> <chr>     <dbl> <chr> 
##  1     1995 Canada 2016A0000… Total… Total … Giga…    146 units         0 v4171…
##  2     1995 Canada 2016A0000… Total… Coal    Giga…    146 units         0 v4171…
##  3     1995 Canada 2016A0000… Total… Natura… Giga…    146 units         0 v4171…
##  4     1995 Canada 2016A0000… Total… Diesel… Giga…    146 units         0 v4171…
##  5     1995 Canada 2016A0000… Total… Heavy … Giga…    146 units         0 v4171…
##  6     1995 Canada 2016A0000… Total… Wood a… Giga…    146 units         0 v4171…
##  7     1995 Canada 2016A0000… Total… Other … Giga…    146 units         0 v4171…
##  8     1995 Canada 2016A0000… Utili… Total … Giga…    146 units         0 v4171…
##  9     1995 Canada 2016A0000… Utili… Coal    Giga…    146 units         0 v4171…
## 10     1995 Canada 2016A0000… Utili… Natura… Giga…    146 units         0 v4171…
## # … with 3,799 more rows, 6 more variables: COORDINATE <chr>, VALUE <dbl>,
## #   STATUS <chr>, SYMBOL <lgl>, TERMINATED <lgl>, DECIMALS <dbl>, and
## #   abbreviated variable names ¹`Fuel type`, ²SCALAR_FACTOR, ³SCALAR_ID
## 
## $Energy_use
## # A tibble: 240 × 16
##    REF_DATE GEO    DGUID     Physi…¹ Goods…² UOM   UOM_ID SCALA…³ SCALA…⁴ VECTOR
##       <dbl> <chr>  <chr>     <chr>   <chr>   <chr>  <dbl> <chr>     <dbl> <chr> 
##  1     2012 Canada 2021A000… Direct… Total,… Tera…    283 units         0 v1456…
##  2     2012 Canada 2021A000… Direct… Total,… Tera…    283 units         0 v1456…
##  3     2012 Canada 2021A000… Direct… Total,… Tera…    283 units         0 v1456…
##  4     2012 Canada 2021A000… Direct… Clean … Tera…    283 units         0 v1456…
##  5     2012 Canada 2021A000… Direct… Electr… Tera…    283 units         0 v1456…
##  6     2012 Canada 2021A000… Direct… Electr… Tera…    283 units         0 v1456…
##  7     2012 Canada 2021A000… Direct… Biofue… Tera…    283 units         0 v1456…
##  8     2012 Canada 2021A000… Direct… Waste … Tera…    283 units         0 v1456…
##  9     2012 Canada 2021A000… Direct… Waste … Tera…    283 units         0 v1456…
## 10     2012 Canada 2021A000… Direct… Total,… Tera…    283 units         0 v1456…
## # … with 230 more rows, 6 more variables: COORDINATE <chr>, VALUE <dbl>,
## #   STATUS <lgl>, SYMBOL <lgl>, TERMINATED <lgl>, DECIMALS <dbl>, and
## #   abbreviated variable names ¹`Physical flow`,
## #   ²`Goods and services (products)`, ³SCALAR_FACTOR, ⁴SCALAR_ID
## 
## $Infrastructure_Economic_Accounts.xlsx
## # A tibble: 100,800 × 17
##    REF_DATE GEO   DGUID Estim…¹ Impact Asset UOM   UOM_ID SCALA…² SCALA…³ VECTOR
##       <dbl> <chr> <chr> <chr>   <chr>  <chr> <chr>  <dbl> <chr>     <dbl> <chr> 
##  1     2009 Cana… 2016… Greenh… Total… Tota… Tonn…    287 thousa…       3 v1273…
##  2     2009 Cana… 2016… Greenh… Total… Comm… Tonn…    287 thousa…       3 v1273…
##  3     2009 Cana… 2016… Greenh… Total… Spor… Tonn…    287 thousa…       3 v1273…
##  4     2009 Cana… 2016… Greenh… Total… Indo… Tonn…    287 thousa…       3 v1273…
##  5     2009 Cana… 2016… Greenh… Total… Stud… Tonn…    287 thousa…       3 v1273…
##  6     2009 Cana… 2016… Greenh… Total… Airp… Tonn…    287 thousa…       3 v1273…
##  7     2009 Cana… 2016… Greenh… Total… Comm… Tonn…    287 thousa…       3 v1273…
##  8     2009 Cana… 2016… Greenh… Total… Inst… Tonn…    287 thousa…       3 v1273…
##  9     2009 Cana… 2016… Greenh… Total… Scho… Tonn…    287 thousa…       3 v1273…
## 10     2009 Cana… 2016… Greenh… Total… Hosp… Tonn…    287 thousa…       3 v1273…
## # … with 100,790 more rows, 6 more variables: COORDINATE <chr>, VALUE <dbl>,
## #   STATUS <lgl>, SYMBOL <lgl>, TERMINATED <lgl>, DECIMALS <dbl>, and
## #   abbreviated variable names ¹Estimate, ²SCALAR_FACTOR, ³SCALAR_ID
## 
## $Population.xlsx
## # A tibble: 266 × 36
##    Country N…¹ Count…² Indic…³ Indic…⁴ `1990` `1991` `1992` `1993` `1994` `1995`
##    <chr>       <chr>   <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Aruba       ABW     Popula… SP.POP… 6.57e4 6.79e4 7.02e4 7.24e4 7.47e4 7.70e4
##  2 Africa Eas… AFE     Popula… SP.POP… 3.10e8 3.19e8 3.27e8 3.36e8 3.44e8 3.53e8
##  3 Afghanistan AFG     Popula… SP.POP… 1.07e7 1.07e7 1.21e7 1.40e7 1.55e7 1.64e7
##  4 Africa Wes… AFW     Popula… SP.POP… 2.07e8 2.12e8 2.18e8 2.24e8 2.30e8 2.36e8
##  5 Angola      AGO     Popula… SP.POP… 1.18e7 1.22e7 1.26e7 1.30e7 1.35e7 1.39e7
##  6 Albania     ALB     Popula… SP.POP… 3.29e6 3.27e6 3.25e6 3.23e6 3.21e6 3.19e6
##  7 Andorra     AND     Popula… SP.POP… 5.36e4 5.54e4 5.73e4 5.92e4 6.10e4 6.29e4
##  8 Arab World  ARB     Popula… SP.POP… 2.26e8 2.31e8 2.37e8 2.43e8 2.50e8 2.56e8
##  9 United Ara… ARE     Popula… SP.POP… 1.90e6 2.01e6 2.12e6 2.22e6 2.33e6 2.43e6
## 10 Argentina   ARG     Popula… SP.POP… 3.26e7 3.31e7 3.36e7 3.40e7 3.45e7 3.49e7
## # … with 256 more rows, 26 more variables: `1996` <dbl>, `1997` <dbl>,
## #   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
## #   `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
## #   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## #   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
## #   `2018` <dbl>, `2019` <dbl>, `2020` <dbl>, `2021` <dbl>, and abbreviated
## #   variable names ¹`Country Name`, ²`Country Code`, ³`Indicator Name`, …
## 
## $PPPGDP.xlsx
## # A tibble: 269 × 33
##    Data Sour…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11 ...12
##    <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 Last Updat… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  2 <NA>        <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  3 Country Na… 1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000 
##  4 Aruba       1363… 1522… 1648… 1810… 2001… 2095… 2158… 2350… 2424… 2489… 2739…
##  5 Africa Eas… 5624… 5821… 5843… 5973… 6217… 6632… 7125… 7562… 7789… 8123… 8596…
##  6 Afghanistan <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  7 Africa Wes… 3614… 3784… 3969… 4022… 4105… 4281… 4564… 4845… 5076… 5230… 5541…
##  8 Angola      3866… 4036… 3887… 3025… 3131… 3676… 4251… 4638… 4911… 5089… 5363…
##  9 Albania     8379… 6237… 5920… 6640… 7345… 8498… 9442… 8556… 9451… 1079… 1192…
## 10 Andorra     <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## # … with 259 more rows, 21 more variables: ...13 <chr>, ...14 <chr>,
## #   ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>,
## #   ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>,
## #   ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>, ...29 <chr>,
## #   ...30 <chr>, ...31 <chr>, ...32 <chr>, ...33 <chr>, and abbreviated
## #   variable name ¹`Data Source`
## 
## $Supply_Demand_NG.xlsx
## # A tibble: 52,927 × 17
##    REF_DATE GEO    DGUID     Fuel …¹ Sector Measu…² UOM   UOM_ID SCALA…³ SCALA…⁴
##       <dbl> <chr>  <chr>     <chr>   <chr>  <chr>   <chr>  <dbl> <chr>     <dbl>
##  1     1995 Canada 2016A000… Gas pl… Produ… Megali… Mega…    208 units         0
##  2     1995 Canada 2016A000… Gas pl… Produ… Terajo… Tera…    283 units         0
##  3     1995 Canada 2016A000… Gas pl… Expor… Megali… Mega…    208 units         0
##  4     1995 Canada 2016A000… Gas pl… Expor… Terajo… Tera…    283 units         0
##  5     1995 Canada 2016A000… Gas pl… Impor… Megali… Mega…    208 units         0
##  6     1995 Canada 2016A000… Gas pl… Impor… Terajo… Tera…    283 units         0
##  7     1995 Canada 2016A000… Gas pl… Inter… Megali… Mega…    208 units         0
##  8     1995 Canada 2016A000… Gas pl… Inter… Terajo… Tera…    283 units         0
##  9     1995 Canada 2016A000… Gas pl… Stock… Megali… Mega…    208 units         0
## 10     1995 Canada 2016A000… Gas pl… Stock… Terajo… Tera…    283 units         0
## # … with 52,917 more rows, 7 more variables: VECTOR <chr>, COORDINATE <chr>,
## #   VALUE <dbl>, STATUS <chr>, SYMBOL <lgl>, TERMINATED <lgl>, DECIMALS <dbl>,
## #   and abbreviated variable names ¹`Fuel type`, ²Measures, ³SCALAR_FACTOR,
## #   ⁴SCALAR_ID

# An example to show that the data in the first file has been correctly load in this one#
print(data_list$Historical_GHG_Emission)

## # A tibble: 14 × 34
##    Country Source  Sector Gas   `1990` `1991` `1992` `1993` `1994` `1995` `1996`
##    <chr>   <chr>   <chr>  <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 CAN     Climat… Total… All … 540.   536.   551.   547.   565.   580.   598.  
##  2 CAN     Climat… Total… All … 607.   603.   617.   613.   631.   646.   668.  
##  3 CAN     Climat… Energy All … 446.   442.   456.   454.   469.   482.   497.  
##  4 CAN     Climat… Indus… All …  29.1   28.5   26.8   24.9   27.2   27.1   27.6 
##  5 CAN     Climat… Agric… All …  47.7   48.1   49.3   49.8   49.9   52.6   55.0 
##  6 CAN     Climat… Waste  All …  17.7   18.0   18.2   18.4   18.4   18.4   18.3 
##  7 CAN     Climat… Land-… All …  66.5   66.5   66.5   66.5   66.5   66.8   70.8 
##  8 CAN     Climat… Bunke… All …   5.68   5.53   5.82   5.27   5.60   5.86   6.11
##  9 CAN     Climat… Elect… All … 139.   139.   146.   138.   143.   148.   149.  
## 10 CAN     Climat… Build… All …  64.6   64.1   65.9   69.1   70.3   69.3   74.8 
## 11 CAN     Climat… Manuf… All …  77.0   75.0   74.0   73.9   77.2   78.3   81.8 
## 12 CAN     Climat… Trans… All … 129.   125.   129.   131.   137.   141.   144.  
## 13 CAN     Climat… Other… All …   7.38   7.43   8.03   8.38   8.47   8.98   9.54
## 14 CAN     Climat… Fugit… All …  29.7   31.2   33.0   33.1   33.2   35.9   36.8 
## # … with 23 more variables: `1997` <dbl>, `1998` <dbl>, `1999` <dbl>,
## #   `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>,
## #   `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>

Step 2: Standardization

For this project, there are currently 7 data sets. Some of them are global data, others are country data only for Canada. The first step is to adjust different data sets into a similar format so that it is easier to compare or joint for future calculation.

Historical_GHG_Emission is a country level data, it contains the historical GHG emission in Canada from year 1990 to 2019. Right now, it is a width data, so it has to be converted to pivot long format. Also, the data set contains total GHG emission with and without LUCF, as the purpose for this project is to identify the energy use and emissions, I delete the inclusive portion of GHG emissions and other titles which are unnecessary. The table “Emissions_by_sector” shows the GHG emissions of different industries in Canada from 1990 to 2019. The table “Emissions_by_sector_long” is the data converted to pivot long format.

Emissions_by_sector<-data_list$Historical_GHG_Emission[-c(2),-c(1,2,4)]
Emissions_by_sector_long<-Emissions_by_sector%>%
  pivot_longer(cols=-c(Sector),names_to="Year", values_to="Emissions")
print(Emissions_by_sector_long)

## # A tibble: 390 × 3
##    Sector               Year  Emissions
##    <chr>                <chr>     <dbl>
##  1 Total excluding LUCF 1990       540.
##  2 Total excluding LUCF 1991       536.
##  3 Total excluding LUCF 1992       551.
##  4 Total excluding LUCF 1993       547.
##  5 Total excluding LUCF 1994       565.
##  6 Total excluding LUCF 1995       580.
##  7 Total excluding LUCF 1996       598.
##  8 Total excluding LUCF 1997       612.
##  9 Total excluding LUCF 1998       620.
## 10 Total excluding LUCF 1999       625.
## # … with 380 more rows

The table “Electricity” shows how much electricity generated from fossil fuels. The data is divided into different provinces from 1995 to 2002. The data is not up-to-date, so I might change or add any other data that is more relevant to the present. The data is already pivot long format, so I only need to delete columns that are providng unrelated information, such as “vector, ID” and change the column titles.

#delete the unrelated columns and rename"
Electricity_generated <- data_list$Electricity [-c(3,7,8,9,10,11,13,14,15,16)]%>%
  rename("Year"="REF_DATE")

print(Electricity_generated)

## # A tibble: 3,809 × 6
##     Year GEO    Source                       `Fuel type`            UOM    VALUE
##    <dbl> <chr>  <chr>                        <chr>                  <chr>  <dbl>
##  1  1995 Canada Total industry and utilities Total all fuels        Giga… 1.18e5
##  2  1995 Canada Total industry and utilities Coal                   Giga… 8.52e4
##  3  1995 Canada Total industry and utilities Natural gas            Giga… 1.86e4
##  4  1995 Canada Total industry and utilities Diesel and light fuel… Giga… 2.41e3
##  5  1995 Canada Total industry and utilities Heavy fuel oil         Giga… 3.45e3
##  6  1995 Canada Total industry and utilities Wood and spent pulpin… Giga… 2.69e3
##  7  1995 Canada Total industry and utilities Other fuels            Giga… 5.41e3
##  8  1995 Canada Utilities                    Total all fuels        Giga… 1.04e5
##  9  1995 Canada Utilities                    Coal                   Giga… 8.52e4
## 10  1995 Canada Utilities                    Natural gas            Giga… 1.37e4
## # … with 3,799 more rows

Because there are too many categories, for the convenience of further calculation, I create some sub-data sets based on the regions and choose Ontario as an example to illustrate.

#use filter to modify this code to create subsets for every province#

Canada <- Electricity_generated %>% filter(GEO=="Canada")
Newfoundland_and_Labrador <- Electricity_generated %>% filter(GEO== "Newfoundland and Labrador")
Prince_Edward_Island <- Electricity_generated %>% filter (GEO == "Prince Edward Island")
Nova_Scotia <- Electricity_generated %>% filter (GEO == "Nova Scotia")
New_Brunswick <- Electricity_generated %>% filter (GEO == "New Brunswick")
Quebec <- Electricity_generated %>% filter (GEO=="Quebec")
Ontario <- Electricity_generated %>% filter (GEO == "Ontario")
Manitoba <- Electricity_generated %>% filter (GEO == "Manitoba")
Saskatchewan <- Electricity_generated %>% filter (GEO == "Saskatchewan")
Alberta <- Electricity_generated %>% filter (GEO == "Alberta")
British_Columbia <- Electricity_generated %>% filter (GEO == "British Columbia")
Yukon_Northwest_Territories_and_Nunavut <- Electricity_generated %>% filter (GEO=="Yukon, Northwest Territories and Nunavut")
print(Ontario)

## # A tibble: 368 × 6
##     Year GEO     Source                       `Fuel type`           UOM    VALUE
##    <dbl> <chr>   <chr>                        <chr>                 <chr>  <dbl>
##  1  1995 Ontario Total industry and utilities Total all fuels       Giga… 26143.
##  2  1995 Ontario Total industry and utilities Coal                  Giga… 16708.
##  3  1995 Ontario Total industry and utilities Natural gas           Giga…  7836.
##  4  1995 Ontario Total industry and utilities Diesel and light fue… Giga…   177.
##  5  1995 Ontario Total industry and utilities Heavy fuel oil        Giga…   128.
##  6  1995 Ontario Total industry and utilities Wood and spent pulpi… Giga…   488 
##  7  1995 Ontario Total industry and utilities Other fuels           Giga…   807.
##  8  1995 Ontario Utilities                    Total all fuels       Giga… 22919.
##  9  1995 Ontario Utilities                    Coal                  Giga… 16692.
## 10  1995 Ontario Utilities                    Natural gas           Giga…  6039 
## # … with 358 more rows

In the future, if it is necessary to observe the contribution of different fuel types in each province to electricity generated, I create another type of sub data set . Taking Ontario as an example, I created sub-data set that shows Coal’s contribution to energy generated.

Ontario_fueltype <- Ontario %>% filter (`Fuel type`=="Coal")
print(Ontario_fueltype)

## # A tibble: 54 × 6
##     Year GEO     Source                       `Fuel type` UOM             VALUE
##    <dbl> <chr>   <chr>                        <chr>       <chr>           <dbl>
##  1  1995 Ontario Total industry and utilities Coal        Gigawatt hours 16708.
##  2  1995 Ontario Utilities                    Coal        Gigawatt hours 16692.
##  3  1996 Ontario Total industry and utilities Coal        Gigawatt hours 19001.
##  4  1996 Ontario Utilities                    Coal        Gigawatt hours 18988.
##  5  1997 Ontario Total industry and utilities Coal        Gigawatt hours 23654.
##  6  1997 Ontario Utilities                    Coal        Gigawatt hours 23622.
##  7  1998 Ontario Total industry and utilities Coal        Gigawatt hours 32845.
##  8  1998 Ontario Utilities                    Coal        Gigawatt hours 32804 
##  9  1999 Ontario Total industry and utilities Coal        Gigawatt hours 33630.
## 10  1999 Ontario Utilities                    Coal        Gigawatt hours 33618.
## # … with 44 more rows

Similar, for Table Energy_use, I delete the unrelated columns and change the REF_DATE to Year, and divide data into two separate data sets based on Physical flow. For future use, it would be easy to change the filter.

Besides, the table Energy_use shows the physical flows into different goods and services in Canada from 2012 to 2019, the physical flows are flows from direct energy use and direct GHG emissions.

Energy_use_CAN <- data_list$Energy_use[-c(2, 3, 6:11, 13:16)] %>%
  rename(Year = REF_DATE, `Good and Services` = `Goods and services (products)`)
Energy_use_CAN_FLow <- Energy_use_CAN %>% filter(`Physical flow` == "Direct energy use")
print(Energy_use_CAN_FLow)

## # A tibble: 120 × 4
##     Year `Physical flow`   `Good and Services`                             VALUE
##    <dbl> <chr>             <chr>                                           <dbl>
##  1  2012 Direct energy use Total, environmental and clean technology pro… 146990
##  2  2012 Direct energy use Total, environmental products                   84670
##  3  2012 Direct energy use Total, environmental goods                      49529
##  4  2012 Direct energy use Clean electricity                               41635
##  5  2012 Direct energy use Electricity from nuclear                        12029
##  6  2012 Direct energy use Electricity from renewable sources              29607
##  7  2012 Direct energy use Biofuels and primary goods                       2131
##  8  2012 Direct energy use Waste and scrap goods                            5763
##  9  2012 Direct energy use Waste management and remediation services       35141
## 10  2012 Direct energy use Total, clean technology products                62320
## # … with 110 more rows

In the table Infrastructure Economic Account, main indicators include GHG emissions as a result of production of infrastructure assets and greenhouses value per emis input proportion, focusing on the environmental aspect to aid in understanding the relationship between investment in infrastructure and the environment. Similar, delete the unrelated columns and rename.

Infrastructure_Economic_Accounts_CAN <- data_list$Infrastructure_Economic_Accounts[-c(3,7:12,14:17)]%>%
  rename("Year"="REF_DATE")
print(Infrastructure_Economic_Accounts_CAN)

## # A tibble: 100,800 × 6
##     Year GEO    Estimate                 Impact       Asset                VALUE
##    <dbl> <chr>  <chr>                    <chr>        <chr>                <dbl>
##  1  2009 Canada Greenhouse gas emissions Total impact Total assets         21276
##  2  2009 Canada Greenhouse gas emissions Total impact Commercial buildings   752
##  3  2009 Canada Greenhouse gas emissions Total impact Sports facilities w…   181
##  4  2009 Canada Greenhouse gas emissions Total impact Indoor recreational…   506
##  5  2009 Canada Greenhouse gas emissions Total impact Student residences       1
##  6  2009 Canada Greenhouse gas emissions Total impact Airports and other …    55
##  7  2009 Canada Greenhouse gas emissions Total impact Communications buil…     9
##  8  2009 Canada Greenhouse gas emissions Total impact Institutional build…  3978
##  9  2009 Canada Greenhouse gas emissions Total impact Schools, colleges, …  1987
## 10  2009 Canada Greenhouse gas emissions Total impact Hospitals             1199
## # … with 100,790 more rows

Because the table Infrastructure Economic Accounts are divided by regions and assets as well as different ways of impact, I create sub data sets to display different regions and impact in different ways on different assets.

Ontario_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter (GEO=="Ontario")
BC_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "British Columbia")
Quebec_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Quebec")
Alberta_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Alberta")
Manitoba_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Manitoba")
NB_Impact <- Infrastructure_Economic_Accounts_CAN %>%filter(GEO == "New Brunswick")
NL_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Newfoundland and Labrador")
 
NWT_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Yukon, Northwest Territories and Nunavut")
NS_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Nova Scotia")
PEI_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Prince Edward Island")
Saskatchewan_Impact <- Infrastructure_Economic_Accounts_CAN %>% filter(GEO == "Saskatchewan")

For table Population, it is a global level, therefore, I delete other countries and only focus in Canada. Meanwhile, I convert this into pivot long format.

Population_CAN <- data_list$Population%>%
  filter(`Country Name` == "Canada")%>%
  select(-c(2:4))
Population_CAN_long <- Population_CAN %>%
  pivot_longer(cols = -c('Country Name'), 
               names_to = "Year", 
               values_to = "Population") %>%
  mutate(Year = as.numeric(str_replace(Year, "\\D", "")))

print(Population_CAN_long)

## # A tibble: 32 × 3
##    `Country Name`  Year Population
##    <chr>          <dbl>      <dbl>
##  1 Canada          1990   27691138
##  2 Canada          1991   28037420
##  3 Canada          1992   28371264
##  4 Canada          1993   28684764
##  5 Canada          1994   29000663
##  6 Canada          1995   29302311
##  7 Canada          1996   29610218
##  8 Canada          1997   29905948
##  9 Canada          1998   30155173
## 10 Canada          1999   30401286
## # … with 22 more rows

PPPGDP table is also the global level. But this table does not have columns names, and the country name isn’t shown as a title. So I will need to delete the first two rows and then create names for each column and then convert them into pivot long format.

PPPGDP_CAN <- data_list$PPPGDP %>%
  slice(-1:-2) %>%
  setNames(.[1,]) %>%
  slice(-1)

PPPGDP_CAN_long <- PPPGDP_CAN %>%
  pivot_longer(cols = -c('Country Name'), 
               names_to = "Year",
               values_to = "PPPGDP")

print(PPPGDP_CAN_long)

## # A tibble: 8,512 × 3
##    `Country Name` Year  PPPGDP            
##    <chr>          <chr> <chr>             
##  1 Aruba          1990  1363755808.7800636
##  2 Aruba          1991  1522141298.8258934
##  3 Aruba          1992  1648407757.6416767
##  4 Aruba          1993  1810796526.6005981
##  5 Aruba          1994  2001192963.822479 
##  6 Aruba          1995  2095198112.5692353
##  7 Aruba          1996  2158860032.2951546
##  8 Aruba          1997  2350843673.1923804
##  9 Aruba          1998  2424658706.9520659
## 10 Aruba          1999  2489269108.5688787
## # … with 8,502 more rows

Finally, the Supply and Demand NG shows the supply and demand of natural gas liquids annually. There are various terms in the table, for example, production means the capture, extraction or manufacture of fuels or energy in forms that are ready for general use. Other detailed information can be found in here https://www150.statcan.gc.ca/n1/pub/57-003-x/57-003-x2023001-eng.htm Megalitres and Terajoules are two different units of measurement, and it is possible to convert between the two units. Therefore, it’s not meaningful to compare megalitres and only focus one unit. Here I choose megalitres.

Supply_Demand_NG_CAN <- data_list$Supply_Demand_NG[-c(3,7:12,14:17)] %>%
  filter(Measures == "Megalitres") %>%
  rename("Year"="REF_DATE")

print(Supply_Demand_NG_CAN)

## # A tibble: 26,485 × 6
##     Year GEO    `Fuel type`                           Sector      Measu…¹  VALUE
##    <dbl> <chr>  <chr>                                 <chr>       <chr>    <dbl>
##  1  1995 Canada Gas plant natural gas liquids (NGL's) Production  Megali… 25121.
##  2  1995 Canada Gas plant natural gas liquids (NGL's) Exports     Megali…  9958.
##  3  1995 Canada Gas plant natural gas liquids (NGL's) Imports     Megali…   463 
##  4  1995 Canada Gas plant natural gas liquids (NGL's) Inter-regi… Megali…     0 
##  5  1995 Canada Gas plant natural gas liquids (NGL's) Stock vari… Megali…   665.
##  6  1995 Canada Gas plant natural gas liquids (NGL's) Inter-prod… Megali…    NA 
##  7  1995 Canada Gas plant natural gas liquids (NGL's) Other adju… Megali…  4343.
##  8  1995 Canada Gas plant natural gas liquids (NGL's) Availabili… Megali… 19304.
##  9  1995 Canada Gas plant natural gas liquids (NGL's) Transforme… Megali…  2962.
## 10  1995 Canada Gas plant natural gas liquids (NGL's) Net refine… Megali…  1701.
## # … with 26,475 more rows, and abbreviated variable name ¹Measures

Next, I would like to know how many variable or unique values there are in the Sector column and in the Fuel type column.

n_sectors <- length(unique(Supply_Demand_NG_CAN$Sector))
print(n_sectors)

## [1] 33

n_fueltype <- length(unique(Supply_Demand_NG_CAN$`Fuel type`))
print(n_fueltype)

## [1] 4