Required packages
# Load required packages
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(outliers)
library(ggplot2)
library(forecast)
Custom functions
- is.special returns TRUE if x is either an infinite value or not a number, FALSE otherwise.
- is.specialorNA returns TRUE if x is either an infinite value, not a number or a missing value, FALSE otherwise.
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}
Executive Summary
Data for analysis of Carbon emissions from nations across the years 1995 to 2016, along with nations vulnerability to climate change, was prepared.
3 datasets from two sources were obtained and imported for preparation: Emissions from CDIAC; NDGain (Vulnerability) and Population from NDGAI.
The country names from the datasets were cleaned up / standardised in preparation for the merge of sets. Country name is a join column.
Subsetting of data was conducted: Ony those countries with complete set of NDGain indices were considered. And Data for years 1995 to 2016 were considered.
Datasets for NDGain and population were made into tidy datasets.
All 3 datasets were merged.
The resulting dataset was inspected to understand the structure , datatypes, and the necessary conversions to factors, numeric datatypes were conducted.
New variable included: All emissions total, using total carbon emissions and bunker fuel emissions.
Scanning and dealing with missing values, special values, errors.
Scanning and dealing with outliers.
Transformation of total carbon emissions (tot_co2_emi) and population (Population) distributions to normal distributions.
Data
Source descriptions
Carbon emissions data (nation.1751_2016.csv) from Carbon Dioxide Information Analysis Center
Dataset of different carbon emissions by countries across the years with the following variables.
Nation - name of country to which emissions pertain to.
Year - year in which emissions occurred.
tot_co2_emi - Total CO2 emissions from fossil-fuels (solid, liquid, gas,flaring) and cement production. In thousand metric tons of C.
solid_cons - Emissions from solid fuel consumption.
liquid_cons - Emissions from liquid fuel consumption.
gas_cons - Emissions from gas fuel consumption.
cement_prod - Emissions from cement production.
gas_flare - Emissions from gas flaring.
cap_co2_emi - Per capita CO2 emissions (metric tons of carbon).
bunker_cons - Emissions from bunker fuels (not included in the totals). Emissions from fuel used for international aviation and maritime transport. Convention and protocol dictate these emissions are not included in national totals (tot_co2_emi).
NDGain Index data (gain.csv) from the Notre Dame Global Adaptation Initiative
Data set of NDGain index (vulnerability to climate change) for each country across the years, with the following variables.
This index is based on numerous (quantitative) indicators eg economic and governance readiness to adapt to change , and exposure to negative impacts.
The smaller the index, the more vulnerable to climate change is the nation.
ISO3 - 3 character ISO 3166 coding of country.
Name - Name of country to which the NDGain Index pertains to.
1995 - NDGain index for year 1995.
Indices for subsequent years, through to 2018, are provided as seperate variables.
Population data (input.csv) from the Notre Dame Global Adaptation Initiative
Data set of human population for each country across the years, with the following variables.
ISO3 - 3 character ISO 3166 coding of country.
Name - Name of country to which the population pertains to.
1995 - population for year 1995.
Population for subsequent years, through to 2018, are provided as seperate variables.
Tidy & Manipulate Data I
The NDGain dataset (gain) is not tidy as it has values for variable year as column names (1995, 1996 …). And the observations for each year are on one row. The year needs to be along a column , and each observation on a seperate row.
The same applies to the population dataset (pop): values for variable year as column names (1995, 1996 …). And the observations for each year are on one row. The year needs to be along a column , and each observation on a seperate row.
To make the datasets tidy the following steps are conducted.
- Make the data set longer using pivot_longer() where the years 1995, 1996 etc are moved to column “Year”, and the values to column “NDGain” for gain , “Population” for population.
#Make gain dataset tidy
gain_tidy <- gain_nation %>% pivot_longer(cols = -(1:2), names_to = "Year", values_to = "NDGain")
#Convert Year to integer
gain_tidy$Year <- as.integer(gain_tidy$Year)
#Make population dataset tidy
pop_tidy <- pop %>% pivot_longer(cols = -(1), names_to = "Year", values_to = "Population")
#Convert Year to integer
pop_tidy$Year <- as.integer(pop_tidy$Year)
# Show tidy gain dataset
head(gain_tidy)
# Show tidy population dataset
head(pop_tidy)
Merge data sets
The gain and population data sets are then merged together (gain_tidy) by ISO3 and Year.
The final dataset (emigain) is the result of the merge of the tidy gain dataset(gain_tidy) and emissions dataset(c_emi_nation) by Nation/Name and Year.
#left join gain and population sets together on ISO3 and Year
gain_tidy <- gain_tidy %>% left_join(pop_tidy, by = c("ISO3" = "ISO3", "Year" = "Year"))
#merge gain and emissions datasets together
emigain <- gain_tidy %>% inner_join(c_emi_nation , by = c("Name" = "Nation", "Year" = "Year") )
Understand
The emigain dataset is a dataframe of dimension 3887 x 13 (13 variables, 3887 observations).
Variables, their data type and any type conversions are as follows.
- ISO3: character datatype. Will be converted to factor. Country is categorical.
- Name: character datatype. Will be converted to factor. Country is categorical.
- Year: numeric datatype. Will be converted to ordered factor. Year is ordinal.
- NDGain: numeric datatype. No conversion.
- Population: numeric datatype. No conversion.
- tot_co2_emi: numeric datatype. No conversion.
- solid_cons: numeric datatype. No conversion.
- liquid_cons: numeric datatype. No conversion.
- gas_cons: numeric datatype. No conversion.
- cement_prod: numeric datatype. No conversion.
- gas_flare: character datatype. Will be converted to numeric(double) datatype. This is an emissions variable so continuous/rational.
- cap_co2_emi: character datatype. Will be converted to numeric(double) datatype. This is an emissions variable so continuous/rational.
- bunker_cons: numeric datatype. No conversion.
# Understand the data and structure
emigain %>% str()
tibble [3,887 x 13] (S3: tbl_df/tbl/data.frame)
$ ISO3 : chr [1:3887] "AFG" "AFG" "AFG" "AFG" ...
$ Name : chr [1:3887] "AFGHANISTAN" "AFGHANISTAN" "AFGHANISTAN" "AFGHANISTAN" ...
$ Year : num [1:3887] 1995 1996 1997 1998 1999 ...
$ NDGain : num [1:3887] 34.3 34.3 34.3 34.5 34.5 ...
$ Population : num [1:3887] 18110657 18853437 19357126 19737765 20170844 ...
$ tot_co2_emi: num [1:3887] 339 321 299 284 224 211 223 292 331 250 ...
$ solid_cons : num [1:3887] 4 2 1 1 1 1 19 15 25 25 ...
$ liquid_cons: num [1:3887] 225 213 198 188 135 136 134 120 169 153 ...
$ gas_cons : num [1:3887] 88 84 77 72 66 61 57 149 127 62 ...
$ cement_prod: num [1:3887] 16 16 16 16 16 7 7 8 10 10 ...
$ gas_flare : chr [1:3887] "6" "6" "6" "6" ...
$ cap_co2_emi: chr [1:3887] "0.02" "0.02" "0.02" "0.01" ...
$ bunker_cons: num [1:3887] 4 4 4 4 4 4 4 4 7 9 ...
The conversion to factor is done via the factor() function. No labelling is required this time.
For the ordered factor (Year), the factor() function is used along with parameters levels = [vector of ordered years] and ordered=TRUE.
The conversion of the two emissions (gas_flare, cap_co2_emi) from character to numeric double is completed using the conversion function as.double().
#ISO3 as factor
emigain$ISO3 <- emigain$ISO3 %>% factor()
#Name as factor
emigain$Name <- emigain$Name %>% factor()
#Name as factor
emigain$Name <- emigain$Name %>% factor()
#Year as order factor
#emigain$Year %>% unique()
emigain$Year <- emigain$Year %>% factor(levels = c(1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016), ordered = TRUE)
#gas_flare & cap_co2_emi
emigain$gas_flare <- emigain$gas_flare %>% as.double()
emigain$cap_co2_emi <- emigain$cap_co2_emi %>% as.double()
NAs introduced by coercion
After the necessary conversions, inspecting the resulting data types below show the 3 factors, one ordered, and all emission variables are numeric.
emigain %>% str()
tibble [3,887 x 13] (S3: tbl_df/tbl/data.frame)
$ ISO3 : Factor w/ 178 levels "AFG","AGO","ALB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Name : Factor w/ 178 levels "AFGHANISTAN",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Year : Ord.factor w/ 22 levels "1995"<"1996"<..: 1 2 3 4 5 6 7 8 9 10 ...
$ NDGain : num [1:3887] 34.3 34.3 34.3 34.5 34.5 ...
$ Population : num [1:3887] 18110657 18853437 19357126 19737765 20170844 ...
$ tot_co2_emi: num [1:3887] 339 321 299 284 224 211 223 292 331 250 ...
$ solid_cons : num [1:3887] 4 2 1 1 1 1 19 15 25 25 ...
$ liquid_cons: num [1:3887] 225 213 198 188 135 136 134 120 169 153 ...
$ gas_cons : num [1:3887] 88 84 77 72 66 61 57 149 127 62 ...
$ cement_prod: num [1:3887] 16 16 16 16 16 7 7 8 10 10 ...
$ gas_flare : num [1:3887] 6 6 6 6 6 6 6 0 0 0 ...
$ cap_co2_emi: num [1:3887] 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
$ bunker_cons: num [1:3887] 4 4 4 4 4 4 4 4 7 9 ...
Tidy & Manipulate Data II
Addition of variable all_co2_emi
This variable is the total of ALL emissions, including those from bunker(international aviation and maritime) fuel.
Long standing convention/protocol states that bunker emissions are excluded from nation reporting. Historically hard to attribute to country. Source country? Desintation? etc.
It’s handy to have this variable as it gives a better picture of emissions. Why exclude another source of carbon emissions?
As total emissions (tot_co2_emi) is provided, add bunker emissions to this.
all_co2_emi = tot_co2_emi + bunker_cons
The variable is inclued in the emigain dataset via the mutate function, where it is derived using the indicated formula.
#Add all_co2_emi variable - total co2 emissions + bunker emissions.
emigain <- emigain %>% mutate(
all_co2_emi = tot_co2_emi + bunker_cons
)
#Inspect variable structure
emigain$all_co2_emi %>% str()
num [1:3887] 343 325 303 288 228 215 227 296 338 259 ...
Scan I
Scanning and dealing with missing values, special values, errors
Missing values either through merging of datasets (Population) or as literals ‘.’ , ‘..’ were standardised to R’s NA.
gas_flare/cap_co2_emi conversion to numeric coerced the ‘.’/‘..’ to NA.
Complete cases in emigain, across numeric variables (4:14), were examined. ISO3/Name/Year (1:3) were ignored as they are join columns.
Column counts of missing values reveal that missing values are confined to Population and cap_co2_emi.
Verifying the records where there are missing values for either Population or cap_co2_emi, show that they can easily be imputed.
Impute using formula ‘carbon emission per capita’ = ‘total carbon emission’ x 1000 / population
cap_co2_emi = tot_co2_emi x 1000 / Population
Scan for missing values
Summary of missing values show
#Get incomplete cases to check against post impute
emigain_notc2 <- emigain[!complete.cases(emigain[, 4:14]),]
#Scan for NAs in data set
colSums(is.na(emigain))
ISO3 Name Year NDGain Population tot_co2_emi solid_cons liquid_cons gas_cons cement_prod gas_flare cap_co2_emi bunker_cons
0 0 0 0 22 0 0 0 0 0 0 17 0
all_co2_emi
0
Show samples of records with missing values
#Check Population NAs
emigain[is.na(emigain$Population),] %>% head(5)
#Check cap_co2_emi NAs
emigain[is.na(emigain$cap_co2_emi),] %>% head(5)
NA
Impute missing values
Impute missing population and per capita emission based on formula above.
Checking imputation by looking up results against incomplete case dataset (emigain_notc2) show that the imputations are correct.
#Impute missing population
emigain$Population[is.na(emigain$Population)] <- emigain$tot_co2_emi[is.na(emigain$Population)] * 1000 / emigain$cap_co2_emi[is.na(emigain$Population)]
#Impute missing per capita emission
emigain$cap_co2_emi[is.na(emigain$cap_co2_emi)] <- round(emigain$tot_co2_emi[is.na(emigain$cap_co2_emi)] * 1000/ emigain$Population[is.na(emigain$cap_co2_emi)], 2)
#Check imputations
emigain_notc2 %>% inner_join(select(emigain, Year, Name, Population, cap_co2_emi), by = c("Year", "Name"))
NA
Scan for special values and errors
Special values (NaN , Inf) in emigain are scanned for, revealing none. Via applying the is.special() function across the columns.
Errors (negative emissions, non positive NDGain/Population) are scanned for, revealing none. Via applying comparison operators against relevant subsets.
#Check for special values
sapply(emigain, function(x) sum(is.special(x)))
ISO3 Name Year NDGain Population tot_co2_emi solid_cons liquid_cons gas_cons cement_prod gas_flare cap_co2_emi bunker_cons
0 0 0 0 0 0 0 0 0 0 0 0 0
all_co2_emi
0
#Check for negative emissions
colSums(emigain[, -(1:5)] < 0, na.rm = TRUE)
tot_co2_emi solid_cons liquid_cons gas_cons cement_prod gas_flare cap_co2_emi bunker_cons all_co2_emi
0 0 0 0 0 0 0 0 0
#Check for non positive NDGain or Population
colSums(emigain[, (4:5)] < 0, na.rm = TRUE)
NDGain Population
0 0
Scan II
Approach & Summary
Univariate outlier detection is used on each NDGain , Population and the 9 emission variables.
This was decided as the emissions don’t share special relationships with each other and all the variables can be treated individually.
z-scores are applied across each variable. Outliers are flagged as those values with |z-score| > 3.
NDGain has no outliers flagged. The other variables eg. Population (44 flaged) , tot_co2_emi (47) do have outliers flagged.
Further inspection shows that the population for China and India are indeed valid values.
Likewise, total emission and other emissions from China, India and USA are legitimate values.
The distribution for these variables above are extremely positively skewed. These “outliers” are so far from the next set of observations in the same quartile.
Therefore the suggested outliers will be left alone - no transformation nor imputation.
Outlier summary
The z-scores across each of the numeric variables (column 4 to 14) will be established via applying the scores(type = “z”) function using function lapply().
Summary of outliers using the scores (|z-score| > 3)
00 NDGain
44 Population
47 tot_co2_emi
49 solid_cons
48 liquid_cons
45 gas_cons
22 cement_prod
88 gas_flare
78 cap_co2_emi
69 bunker_cons
46 all_co2_emi
# Get z-score for each numeric variable. Score() function with type= "z"
z_scores <- emigain[, (4:14)] %>% lapply(FUN = function(x) scores(x, type= "z")) %>% as.data.frame()
# Outlier summary
colSums(abs(z_scores) > 3)
NDGain Population tot_co2_emi solid_cons liquid_cons gas_cons cement_prod gas_flare cap_co2_emi bunker_cons all_co2_emi
0 44 47 49 48 45 22 88 78 69 46
Outlier inspection
Closer inspection of the Population outliers show that the populations are those for China and India. These are all valid. Confirmed these via the internet.
emigain[abs(z_scores$Population) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the tot_co2_emi outliers show that the total emissions are those for China, India and USA. These are all valid. Confirmed these via the internet.
emigain[abs(z_scores$tot_co2_emi) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the solid_cons outliers show that these emissions are those for China, India and USA. These are all valid. Confirmed these via the internet.
emigain[abs(z_scores$solid_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the liquid_cons outliers show that these emissions are those for China, India, Japan and USA. These are all valid. Confirmed these via the internet.
emigain[abs(z_scores$liquid_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the gas_cons outliers show that these emissions are those for China, Russia and USA. These are all valid. Confirmed these via the internet.
emigain[abs(z_scores$gas_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the cement_prod outliers show that these emissions are those for China. These are all valid. Confirmed these via the internet.
emigain[abs(z_scores$cement_prod) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the gas_flare outliers show that these emissions all valid. Confirmed these via the internet.
emigain[abs(z_scores$gas_flare) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the cap_co2_emi outliers show that these per capita emissions are valid. Confirmed these via the internet.
emigain[abs(z_scores$cap_co2_emi) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
Closer inspection of the bunker_cons outliers show that these emissions are all valid. Confirmed these via the internet.
emigain[abs(z_scores$bunker_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
NA
Closer inspection of the all_co2_emi outliers show that these emissions are those for China, India and USA. These are all valid. They would follow the total emissions above.
emigain[abs(z_scores$all_co2_emi) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)
NA
---
title: "Data wrangling of Carbon emissions data"
author: "Kevin Kuek (s3835152)"
subtitle: MATH2349 Data Wrangling Assignment 2
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---


## Required packages 

```{r message=FALSE}
# Load required packages

library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(outliers)
library(ggplot2)
library(forecast)

```

**Custom functions**

* is.special returns TRUE if x is either an infinite value or not a number, FALSE otherwise.
* is.specialorNA returns TRUE if x is either an infinite value, not a number or a missing value, FALSE otherwise.

```{r}

is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

```


## Executive Summary 

Data for analysis of Carbon emissions from nations across the years 1995 to 2016, along with nations vulnerability to climate change, was prepared.

* 3 datasets from two sources were obtained and imported for preparation: Emissions from CDIAC; NDGain (Vulnerability) and Population from NDGAI.

* The country names from the datasets were cleaned up / standardised in preparation for the merge of sets.  Country name is a join column.

* Subsetting of data was conducted: Ony those countries with complete set of NDGain indices were considered.  And Data for years 1995 to 2016 were considered.

* Datasets for NDGain and population were made into tidy datasets.

* All 3 datasets were merged.

* The resulting dataset was inspected to understand the structure , datatypes, and the necessary conversions to factors, numeric datatypes were conducted.

* New variable included: All emissions total, using total carbon emissions and bunker fuel emissions.

* Scanning and dealing with missing values, special values, errors.

* Scanning and dealing with outliers.

* Transformation of total carbon emissions (tot_co2_emi) and population (Population) distributions to normal distributions.


## Data 

### Source descriptions

**Carbon emissions data (nation.1751_2016.csv) from Carbon Dioxide Information Analysis Center**

Dataset of different carbon emissions by countries across the years with the following variables.

***Nation*** - name of country to which emissions pertain to.

***Year*** - year in which emissions occurred.

***tot_co2_emi*** - Total CO2 emissions from fossil-fuels (solid, liquid, gas,flaring) and cement production. In thousand metric tons of C.

***solid_cons*** - Emissions from solid fuel consumption.

***liquid_cons*** - Emissions from liquid fuel consumption.

***gas_cons*** - Emissions from gas fuel consumption.

***cement_prod*** - Emissions from cement production.

***gas_flare*** - Emissions from gas flaring.

***cap_co2_emi*** - Per capita CO2 emissions (metric tons of carbon).

***bunker_cons*** - Emissions from bunker fuels (not included in the totals). Emissions from fuel used for international aviation and maritime transport.  Convention and protocol dictate these emissions are not included in national totals (tot_co2_emi).


<br>
**NDGain Index data (gain.csv) from the Notre Dame Global Adaptation Initiative**

Data set of NDGain index (vulnerability to climate change) for each country across the years, with the following variables.

This index is based on numerous (quantitative) indicators eg economic and governance readiness to adapt to change , and exposure to negative impacts.

The smaller the index, the more vulnerable to climate change is the nation.

***ISO3*** - 3 character ISO 3166 coding of country.

***Name*** - Name of country to which the NDGain Index pertains to.

***1995*** - NDGain index for year 1995.

Indices for subsequent years, through to 2018, are provided as seperate variables.


<br>
**Population data (input.csv) from the Notre Dame Global Adaptation Initiative**

Data set of human population for each country across the years, with the following variables.

***ISO3*** - 3 character ISO 3166 coding of country.

***Name*** - Name of country to which the population pertains to.

***1995*** - population for year 1995.

Population for subsequent years, through to 2018, are provided as seperate variables.

<br>

### Extract Steps
The three data sets are read in via readr::read_csv function.

The variables for carbon emissions are named appropriately.  They are each too long for analysis.

```{r}
#Source data

# Carbon Dioxide Information Analysis Center at Appalachian State University, Boone North Carolina, https://energy.appstate/CDIAC.
c_emissions <- read_csv("nation.1751_2016.csv", skip = 4)

# Notre Dame Global Adaptation Initiative, University of Notre Dame,  https://gain.nd.edu/our-work/country-index/rankings/
gain <- read_csv("resources_2020_04_05_19h09/resources/gain/gain.csv")

# Notre Dame Global Adaptation Initiative, University of Notre Dame,  https://gain.nd.edu/our-work/country-index/rankings/
# Get population data for inputing missing CO2 emission per capita 
pop <- read_csv("resources_2020_04_05_19h09/resources/indicators/pop/input.csv") %>% select(-(2))


#Renaming emission variables appropriately
names(c_emissions)[names(c_emissions)== "Total CO2 emissions from fossil-fuels and cement production (thousand metric tons of C)"] <- "tot_co2_emi"
names(c_emissions)[names(c_emissions)== "Emissions from solid fuel consumption"] <- "solid_cons"
names(c_emissions)[names(c_emissions)== "Emissions from liquid fuel consumption"] <- "liquid_cons"
names(c_emissions)[names(c_emissions)== "Emissions from gas fuel consumption"] <- "gas_cons"
names(c_emissions)[names(c_emissions)== "Emissions from cement production"] <- "cement_prod"
names(c_emissions)[names(c_emissions)== "Emissions from gas flaring"] <- "gas_flare"
names(c_emissions)[names(c_emissions)== "Per capita CO2 emissions (metric tons of carbon)"] <- "cap_co2_emi"
names(c_emissions)[names(c_emissions)== "Emissions from bunker fuels (not included in the totals)"] <- "bunker_cons"

#Sample emissions data
c_emissions %>% head()

#Sample NDGain data
gain %>% head()

#Sample population data
pop %>% head()

```

### Extract Steps (continued)

Clean up and standardise join column Nation/Name in both source data sets c_emissions and gain

* Capitalise
* Standardise eg. 'AND' to '&', shortening of name etc.

```{r}
# Clean up / standardise join column Nation
c_emi_nation <- c_emissions %>% 
  mutate(
    Nation = str_replace_all(Nation,  c(" AND " = " & ", " \\(.*\\)" = "", "UNITED STATES OF AMERICA" = "UNITED STATES", "DEMOCRATIC REPUBLIC OF THE CONGO" = "ZAIRE","COTE D IVOIRE" = "COTE D'IVOIRE", "BRUNEI DARUSSALAM"= "BRUNEI", "REPUBLIC OF " = "", "ISLAMIC IRAN" = "IRAN", "UNITED TANZANIA" = "TANZANIA", "LAO PEOPLE S DEMOCRATIC REPUBLIC" = "LAO PEOPLE'S DEMOCRATIC REPUBLIC", "PLURINATIONAL STATE OF BOLIVIA" = "BOLIVIA", "GUINEA BISSAU" = "GUINEA-BISSAU", "RUSSIAN FEDERATION" = "RUSSIA"))
  )

# clean join column
gain_nation <- gain %>%
              mutate(
                Name = toupper(Name) %>% str_replace_all(c(" AND " = " & ",", REPUBLIC OF" = "", ", BOLIVARIAN REPUBLIC O" = "", "BRUNEI DARUSSALAM"= "BRUNEI", ", UNITED REPUBLIC OF" = "", ", ISLAMIC REPUBLIC OF" = "", "CONGO, THE DEMOCRATIC REPUBLIC O" = "ZAIRE", ", PLURINATIONAL STATE OF" = "", "RUSSIAN FEDERATION" = "RUSSIA"))
                            
              )

```

**Subsetting**

* Only those records from gain(gain_nation) that have no missing values (complete cases) will be considered.  All incomplete cases have no NDGain index for any year.
* Subset NDGains 1995 - 2016.  Emissions dataset go as far as 2016.

```{r}
# Scan for incomplete cases
gain_notcomplete <- gain_nation[!complete.cases(gain_nation),]

# All incomplete cases have no index for any year
gain_notcomplete %>% head()

# Subset gain data to 1995 - 2016
gain_nation <- gain_nation[complete.cases(gain_nation),] %>% select(1:24)

```

##	Tidy & Manipulate Data I 

The NDGain dataset (gain) is not tidy as it has values for variable year as column names (1995, 1996 ...). And the observations for each year are on one row.  The year needs to be along a column , and each observation on a seperate row.

The same applies to the population dataset (pop): values for variable year as column names (1995, 1996 ...). And the observations for each year are on one row.  The year needs to be along a column , and each observation on a seperate row.

To make the datasets tidy the following steps are conducted.

* Make the data set longer using pivot_longer() where the years 1995, 1996 etc are moved to column "Year", and the values to column "NDGain" for gain , "Population" for population.

```{r}
#Make gain dataset tidy
gain_tidy <- gain_nation %>% pivot_longer(cols = -(1:2), names_to = "Year",  values_to = "NDGain")
#Convert Year to integer
gain_tidy$Year <- as.integer(gain_tidy$Year)

#Make population dataset tidy
pop_tidy <- pop %>% pivot_longer(cols = -(1), names_to = "Year",  values_to = "Population")
#Convert Year to integer
pop_tidy$Year <- as.integer(pop_tidy$Year)

# Show tidy gain dataset
head(gain_tidy)

# Show tidy population dataset
head(pop_tidy)
```

### Merge data sets
The gain and population data sets are then merged together (gain_tidy) by ISO3 and Year.

The final dataset (emigain) is the result of the merge of the tidy gain dataset(gain_tidy) and emissions dataset(c_emi_nation) by Nation/Name and Year.
```{r}
#left join gain and population sets together on ISO3 and Year
gain_tidy <- gain_tidy %>% left_join(pop_tidy, by = c("ISO3" = "ISO3", "Year" = "Year"))

#merge gain and emissions datasets together
emigain <- gain_tidy %>% inner_join(c_emi_nation , by = c("Name" = "Nation", "Year" = "Year") )

```


## Understand 

The ***emigain*** dataset is a dataframe of dimension 3887 x 13 (13 variables, 3887 observations).

Variables, their data type and any type conversions are as follows.

* *ISO3*: character datatype. Will be converted to factor.  Country is categorical.
* *Name*: character datatype. Will be converted to factor.  Country is categorical.
* *Year*: numeric datatype. Will be converted to ordered factor.  Year is ordinal.
* *NDGain*: numeric datatype. No conversion.
* *Population*: numeric datatype. No conversion.
* *tot_co2_emi*: numeric datatype. No conversion.
* *solid_cons*: numeric datatype. No conversion.
* *liquid_cons*: numeric datatype. No conversion.
* *gas_cons*: numeric datatype. No conversion.
* *cement_prod*: numeric datatype. No conversion.
* *gas_flare*: character datatype. Will be converted to numeric(double) datatype.  This is an emissions variable so continuous/rational.
* *cap_co2_emi*: character datatype. Will be converted to numeric(double) datatype.  This is an emissions variable so continuous/rational.
* *bunker_cons*: numeric datatype. No conversion.

```{r}
# Understand the data and structure
emigain %>% str()

```

<br>

The conversion to factor is done via the factor() function.  No labelling is required this time.

For the ordered factor (Year), the factor() function is used along with parameters *levels = [vector of ordered years] * and *ordered=TRUE*.

The conversion of the two emissions (*gas_flare*, *cap_co2_emi*) from character to numeric double is completed using the conversion function as.double().

```{r}
#ISO3 as factor
emigain$ISO3 <- emigain$ISO3 %>% factor()

#Name as factor
emigain$Name <- emigain$Name %>% factor()

#Name as factor
emigain$Name <- emigain$Name %>% factor()

#Year as order factor
#emigain$Year %>% unique() 

emigain$Year <- emigain$Year %>% factor(levels = c(1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016), ordered = TRUE)

#gas_flare & cap_co2_emi
emigain$gas_flare <- emigain$gas_flare %>% as.double()
emigain$cap_co2_emi <- emigain$cap_co2_emi %>% as.double()

```

<br>
After the necessary conversions, inspecting the resulting data types below show the 3 factors, one ordered, and all emission variables are numeric.
```{r}
emigain %>% str()

```


##	Tidy & Manipulate Data II 

Addition of variable ***all_co2_emi***

* This variable is the total of ALL emissions, including those from bunker(international aviation and maritime) fuel.  

* Long standing convention/protocol states that bunker emissions are excluded from nation reporting. Historically hard to attribute to country.  Source country? Desintation? etc.

* It's handy to have this variable as it gives a better picture of emissions. Why exclude another source of carbon emissions?

* As total emissions (tot_co2_emi) is provided, add bunker emissions to this.

* all_co2_emi = tot_co2_emi + bunker_cons

The variable is inclued in the emigain dataset via the mutate function, where it is derived using the indicated formula.

```{r}
#Add all_co2_emi variable - total co2 emissions + bunker emissions.

emigain <- emigain %>% mutate(
                              all_co2_emi = tot_co2_emi + bunker_cons  
                      )

#Inspect variable structure
emigain$all_co2_emi %>% str()
```

##	Scan I 

### Scanning and dealing with missing values, special values, errors

* Missing values either through merging of datasets (Population) or as literals '.' , '..' were standardised to R's NA.  

* *gas_flare*/*cap_co2_emi* conversion to numeric coerced the '.'/'..' to NA.

* Complete cases in emigain, across numeric variables (4:14), were examined. ISO3/Name/Year (1:3) were ignored as they are join columns.

* Column counts of missing values reveal that missing values are confined to *Population* and *cap_co2_emi*.

* Verifying the records where there are missing values for either *Population* or *cap_co2_emi*, show that they can easily be imputed.

* Impute using formula 'carbon emission per capita' = 'total carbon emission' x 1000 / population

> *cap_co2_emi* = *tot_co2_emi* x 1000 / *Population*


```{r include=FALSE}
t <- emigain

```

<br>

### Scan for missing values

Summary of missing values show

* 22 missing in Population

* 17 missing in cap_co2_emi 
```{r}
#Get incomplete cases to check against post impute
emigain_notc2 <- emigain[!complete.cases(emigain[, 4:14]),]

#Scan for NAs in data set
colSums(is.na(emigain))

```

Show samples of records with missing values
```{r}
#Check Population NAs
emigain[is.na(emigain$Population),]  %>% head(5)

#Check cap_co2_emi NAs
emigain[is.na(emigain$cap_co2_emi),] %>% head(5)

```

### Impute missing values

* Impute missing population and per capita emission based on formula above.

* Checking imputation by looking up results against incomplete case dataset (emigain_notc2) show that the imputations are correct.

```{r}
#Impute missing population
emigain$Population[is.na(emigain$Population)] <- emigain$tot_co2_emi[is.na(emigain$Population)] * 1000 /  emigain$cap_co2_emi[is.na(emigain$Population)]

#Impute missing per capita emission
emigain$cap_co2_emi[is.na(emigain$cap_co2_emi)] <- round(emigain$tot_co2_emi[is.na(emigain$cap_co2_emi)] * 1000/  emigain$Population[is.na(emigain$cap_co2_emi)], 2)

#Check imputations
emigain_notc2 %>% inner_join(select(emigain, Year, Name, Population, cap_co2_emi), by = c("Year", "Name"))

```

### Scan for special values and errors

* Special values (NaN , Inf) in emigain are scanned for, revealing none.  Via applying the is.special() function across the columns.

* Errors (negative emissions, non positive NDGain/Population) are scanned for, revealing none.  Via applying comparison operators against relevant subsets.
```{r}
#Check for special values
sapply(emigain, function(x) sum(is.special(x)))

#Check for negative emissions
colSums(emigain[, -(1:5)] < 0, na.rm = TRUE)

#Check for non positive NDGain or Population
colSums(emigain[, (4:5)] < 0, na.rm = TRUE)
```

```{r eval=FALSE, include=FALSE}
emigain[emigain$ISO3 %in% c('SDN', 'MKD'), ]
setdiff(t, emigain)

#library(deductive)
#library(validate)
# testing another impute method that doesnt work
rules <- validator(cap = tot_co2_emi * 1000/ population  == cap_co2_emi )
rules
# Use impute_lr function

imputed_df <- impute_lr(emigain,rules)

imputed_df[imputed_df$ISO3 %in% c('MKD', 'SDN'),]


```

##	Scan II

### Approach & Summary

* Univariate outlier detection is used on each NDGain , Population and the 9 emission variables.

* This was decided as the emissions don't share special relationships with each other and all the variables can be treated individually.

* z-scores are applied across each variable.  Outliers are flagged as those values with |z-score| > 3.

* *NDGain* has no outliers flagged.  The other variables eg. *Population* (44 flaged) , *tot_co2_emi* (47) do have outliers flagged.

* Further inspection shows that the population for China and India are indeed valid values.

* Likewise, total emission and other emissions from China, India and USA are legitimate values.

* The distribution for these variables above are extremely positively skewed.  These "outliers" are so far from the next set of observations in the same quartile.

* Therefore the suggested outliers will be left alone - no transformation nor imputation.

```{r eval=FALSE, include=FALSE}
library(MVN)

emigain_sub <- emigain[, -(1:5)]

results <- mvn(data = emigain_sub, multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)

results$multivariateOutliers
results$newData

```

### Outlier summary

The z-scores across each of the numeric variables (column 4 to 14) will be established via applying the scores(type = "z") function using function lapply().

**Summary of outliers using the scores (|z-score| > 3)**

00 NDGain  

44 Population 

47 tot_co2_emi  

49 solid_cons 

48 liquid_cons    

45 gas_cons 

22 cement_prod

88 gas_flare 

78 cap_co2_emi 

69 bunker_cons 

46 all_co2_emi 

```{r}
# Get z-score for each numeric variable.  Score() function with type= "z"
z_scores <- emigain[, (4:14)] %>% lapply(FUN = function(x) scores(x, type= "z")) %>% as.data.frame()

# Outlier summary
colSums(abs(z_scores) > 3) 

```

### Outlier inspection

Closer inspection of the Population outliers show that the populations are those for China and India.  These are all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$Population) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the tot_co2_emi outliers show that the total emissions are those for China, India and USA.  These are all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$tot_co2_emi) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the solid_cons outliers show that these emissions are those for China, India and USA.  These are all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$solid_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the liquid_cons outliers show that these emissions are those for China, India, Japan and USA.  These are all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$liquid_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the gas_cons outliers show that these emissions are those for China, Russia and USA.  These are all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$gas_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the cement_prod outliers show that these emissions are those for China.  These are all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$cement_prod) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the gas_flare outliers show that these emissions all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$gas_flare) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the cap_co2_emi outliers show that these per capita emissions are valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$cap_co2_emi) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```

Closer inspection of the bunker_cons outliers show that these emissions are all valid.  Confirmed these via the internet.

```{r}
emigain[abs(z_scores$bunker_cons) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)


```

Closer inspection of the all_co2_emi outliers show that these emissions are those for China, India and USA.  These are all valid. They would follow the total emissions above.

```{r}
emigain[abs(z_scores$all_co2_emi) > 3, ] %>% group_by(ISO3, Name) %>% slice_head(n=2)

```


##	Transform 

### Approach & Summary

The total carbon emissions (tot_co2_emi) and population (Population) distributions are both heavily right skewed.

Transformations are applied to move them to normal distributions.  tot_co2_emi.t and Population.t respectively.

Given that they are right skewed, mathematical functions log(), log10(), sqrt() are applied. 

The resulting distributions are verified using histogram plots. At least one of the functions is considered suitable for the transformation, ie. closest to normal.  So there is no need to explore elsewhere for other methods.

The Box Cox transformaton is used, using forecast::BoxCox.lambda to work out the best lambda, and then BoxCox() to perform the transformation.   This is to discover alternative methods.


### Total CO2 emissions transformation

* Best transformation function is log10()

* BoxCox lambda is 0

```{r}
#total emissions distribution is heavily right skewed
emigain$tot_co2_emi %>% hist(main = "tot_co2_emi", breaks= 100)

#try transformations
emigain$tot_co2_emi %>% log() %>% hist(main = "log(tot_co2_emi)")
emigain$tot_co2_emi %>% log10() %>% hist(main = "log10(tot_co2_emi)") #best
emigain$tot_co2_emi %>% sqrt() %>% hist(main = "sqrt(tot_co2_emi)")

#get best lambda , then apply to transformation
emigain$tot_co2_emi %>% BoxCox.lambda(method = "loglik", lower = -3, upper = 3)
emigain$tot_co2_emi %>% BoxCox(lambda = 0) %>% hist(main = "BoxCox(tot_co2_emi)")

#apply best transformation
emigain$tot_co2_emi.t <- emigain$tot_co2_emi %>% log10()
```

### Population transformation

* Best transformation function is log()

* BoxCox lambda is 0.05
```{r}
#population distribution is heavily right skewed
emigain$Population %>% hist(main = "Population")

#try transformations
emigain$Population %>% log() %>% hist(main = "log(Population)") #best
emigain$Population %>% log10() %>% hist(main = "log10(Population)")
emigain$Population %>% sqrt() %>% hist(main = "sqrt(Population)")

#get best lambda , then apply to transformation
emigain$Population %>% BoxCox.lambda(method = "loglik", lower = -3, upper = 3)
emigain$Population %>% BoxCox(lambda = 0.05) %>% hist(main = "BoxCox(Population)")

#apply best transformation
emigain$Population.t <- emigain$Population %>% log()

```

View transformations *tot_co2_emi.t* and *Population.t*

```{r}
emigain %>% head()

```

```{r eval=FALSE, include=FALSE}
#NDGain distribution is mildly right skewed
emigain$NDGain %>% hist()

#try transformations
emigain$NDGain %>% log() %>% hist()
emigain$NDGain %>% log10() %>% hist() #best
emigain$NDGain %>% sqrt() %>% hist()

emigain$NDGain %>% BoxCox.lambda(method = "guerrero", lower = -3, upper = 3)
emigain$NDGain %>% BoxCox.lambda(method = "loglik", lower = -3, upper = 3)
emigain$NDGain %>% BoxCox(lambda = -0.25) %>% hist()

qplot(x=NDGain, y=tot_co2_emi, data=emigain, geom = "point")

qplot(x=log10(NDGain), y=log10(tot_co2_emi), data=emigain, geom = "point")

qplot(x=Population, y=tot_co2_emi, data=emigain, geom = "point")

qplot(x=log(Population), y=log10(tot_co2_emi), data=emigain, geom = "point")
```


### Data Reference

* D. Gilfillan, G. Marland, T. Boden, and R.  Andres. (2019). *Global, Regional, and National Fossil-Fuel CO2 Emissions: 1751-2016* [csv file nation.1751_2016.csv]. https://energy.appstate.edu/research/work-areas/cdiac-appstate
* University of Notre Dame, Notre Dame Global Adaptation Initiative.(2020). *ND-GAIN Country Index*. https://gain.nd.edu/our-work/country-index/rankings/


<br>
<br>
