Abstract

The greenhouse effect is a naturally occurring physical process in which greenhouse gases trap heat in the global atmosphere, acting as insulation. With increases in global population, food consumption, and energy dependence, there is evidence to suggest that humans are over-contributing to the greenhouse effect, which in turn is correlated with rising temperatures and more frequent natural disasters. These rising temperatures and the higher frequency of natural disasters call for multidisciplinary science and accurate predictions in climate forecasting.

The scope of this report is to explore different statistical learning approaches to predicting the change in greenhouse gas (GHG) emissions from 1990 to the present: outlining the explanatory variables that contribute to this change, identifying which methods perform with higher accuracy, detecting outliers, and forming clusters of countries according to their greenhouse emissions.

The primary use of statistical learning is to form predictions through regression, classification, and clustering analysis. Cross-validation improves the accuracy of these predictions and can reveal models that might not be recognized by more traditional statistical approaches. The data used is the World Bank Climate Change dataset, which covers 217 countries across the globe and records variables on contributions from the agriculture, economics, education, and energy industries, as well as statistics on population density and growth.

Unsupervised learning algorithms successfully identified Equatorial Guinea as an extreme outlier in the dataset and grouped countries into clusters according to their greenhouse emission outputs. Equatorial Guinea has a total greenhouse gas emission change of 2519.02%, and both hierarchical clustering and k-means flagged this observation.

Introduction

Observed changes over the last century include increases in atmospheric and oceanic temperatures, rising sea levels, and fluctuations in weather patterns and seasonal cycles. These changes are driven by a thermodynamic process called the greenhouse effect, which traps heat in the earth's atmosphere and acts as insulation.

The exponential growth of population and industry directly drives anthropogenic GHG emissions; the agriculture and forestry industries contribute around 13% of CO2, 44% of methane (CH4), and 82% of nitrous oxide (N2O) entering the earth's atmosphere. An intensified greenhouse effect causes more energetic weather patterns, which impact food production, quality, and security, and lowers the amount of CO2 converted into oxygen through photosynthesis (Arneth & Barbosa, 2009). This correlation between human industrial activity and the greenhouse effect suggests further warming and climatic change ahead. Teams of scientists across the globe are working to mitigate these changes by incorporating statistical prediction methods, and increases in computational power further improve the predictions made by machine learning algorithms (Chantry & Christensen, 2020).

Machine learning algorithms aim to optimize a statistical model's bias and variance. A model with high bias fails to capture the structure of the data: it is underfit and will not make accurate predictions. Allowing more flexibility can capture the overall phenomenon being studied, but too much flexibility fits the noise in the particular sample at hand and lowers accuracy on new observations: an overfit model. Optimizing this trade-off is essential to capturing the true phenomenon about which we wish to make inference.
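
As a quick illustration of this trade-off, the following is a minimal sketch on simulated data (illustrative only, not part of the report's analysis): polynomial fits of increasing degree drive the training error down while the test error eventually rises.

# Bias-variance sketch on simulated data: compare train/test MSE
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.5)   # the true signal is non-linear
sim <- data.frame(x, y)
train <- sample(100, 70)
mse <- function(fit, idx) mean((sim$y[idx] - predict(fit, sim[idx, ]))^2)
for (d in c(1, 4, 15)) {             # underfit, balanced, overfit
  fit <- lm(y ~ poly(x, d), data = sim[train, ])
  cat("degree", d, "train MSE:", round(mse(fit, train), 3),
      "test MSE:", round(mse(fit, -train), 3), "\n")
}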

The objective of this report is to explore different statistical learning approaches to modelling the percentage change in greenhouse gas emissions from 1990 to the present. I have selected explanatory variables including emissions produced in burning fossil fuels for energy production. The methods include multiple linear regression, to identify influential predictors of the total change in greenhouse emissions since 1990 and to build a predictive model that could be used to impute missing (NA) values in the dataset, and clustering algorithms, hierarchical and k-means, to group countries according to their greenhouse emissions and to identify outliers in the dataset. These clustering methods could also be used to predict NA values, though the report will not go into that much depth.

Data

The data has been sourced from the World Bank; the variables are indicators under the climate change topic. From this dataset, I created a subset of variables of interest in a data frame. These variables are: the total percentage change in greenhouse gas emissions from 1990 to the present (wb.EN.ATM.GHGT.ZG); the percentage of CO2 emissions from gaseous fuel consumption (wb.EN.ATM.CO2E.GF.ZS); the percentage of CO2 emissions from liquid fuel consumption (wb.EN.ATM.CO2E.LF.ZS); the percentage of CO2 emissions from solid fuel consumption (wb.EN.ATM.CO2E.SF.ZS); the percentage change in methane emissions from 1990 (wb.EN.ATM.METH.ZG); and the percentage change in nitrous oxide emissions from 1990 (wb.EN.ATM.NOXE.ZG).

The dataset contains many NA values, particularly for developing countries; these were handled with the na.omit() function on the subset of variables. The original dataset has a large dimensional space of 217 observations on 79 variables; the subset shrinks this to 129 observations on 15 variables. The correlation plot below suggests considerable collinearity between predictor variables and possibly non-linear relationships between the predictors and the response; a log transformation of the response variable was attempted to alleviate this problem.
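
The subset construction can be sketched as follows; this is a hedged reconstruction, assuming the raw World Bank data is loaded as a data frame wb with the column names used in the Appendix.

# Build the subset of climate indicators and drop rows with NA values
library(corrplot)
wb_sub <- data.frame(wb.EN.ATM.GHGT.ZG    = wb$EN.ATM.GHGT.ZG,
                     wb.EN.ATM.CO2E.GF.ZS = wb$EN.ATM.CO2E.GF.ZS,
                     wb.EN.ATM.CO2E.LF.ZS = wb$EN.ATM.CO2E.LF.ZS,
                     wb.EN.ATM.CO2E.SF.ZS = wb$EN.ATM.CO2E.SF.ZS,
                     wb.EN.ATM.METH.ZG    = wb$EN.ATM.METH.ZG,
                     wb.EN.ATM.NOXE.ZG    = wb$EN.ATM.NOXE.ZG)
wb_sub <- na.omit(wb_sub)
corrplot(cor(wb_sub))   # correlation plot to inspect collinearity
summary(wb_sub)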

##  wb.EN.ATM.GHGT.ZG  wb.EN.ATM.CO2E.GF.ZS wb.EN.ATM.CO2E.LF.ZS
##  Min.   : -77.988   Min.   :  0.000      Min.   :  2.928     
##  1st Qu.:  -9.139   1st Qu.:  0.000      1st Qu.: 38.761     
##  Median :  44.232   Median :  6.185      Median : 59.282     
##  Mean   :  86.958   Mean   : 16.914      Mean   : 60.650     
##  3rd Qu.: 112.372   3rd Qu.: 28.831      3rd Qu.: 84.651     
##  Max.   :2519.020   Max.   :112.740      Max.   :118.392     
##  wb.EN.ATM.CO2E.SF.ZS wb.EN.ATM.METH.ZG  wb.EN.ATM.NOXE.ZG
##  Min.   :  0.000      Min.   : -77.502   Min.   :-79.45   
##  1st Qu.:  0.000      1st Qu.:  -9.668   1st Qu.:-26.82   
##  Median :  3.612      Median :  19.083   Median : 13.69   
##  Mean   : 15.913      Mean   :  45.401   Mean   : 21.72   
##  3rd Qu.: 25.310      3rd Qu.:  55.971   3rd Qu.: 50.68   
##  Max.   :121.635      Max.   :2414.776   Max.   :329.13

Table 1: Summary statistics of climate variables selected.
Figure 1: Correlation matrix of selected climate variables.

Methods

Multiple Linear Regression

The multiple linear regression (MLR) model predicts the total change in greenhouse gas emissions from 1990 to the present based on the relationships between the GHG percentage change and the predictor variables outlined above. To make these predictions we must estimate the model's parameters; in MLR these parameters are the slopes relating each predictor to the greenhouse response, denoted beta.

Let the subscripts 1, 2, …, 6 denote the predictor variables selected for the regression, and set the null hypothesis on the slope coefficients:

\(H_0: \beta_1 = \beta_2 = ... = \beta_6 = 0\)
\(H_A:\) at least one \(\beta_j \neq 0\)

The alternative hypothesis states that at least one slope coefficient is non-zero, indicating a possible linear relationship with the response, the percentage change in greenhouse emissions. The model takes the form:
\[GHG = \beta_{0} + \beta_{1}x_1 +\beta_{2}x_2+...+\beta_{6}x_{6}+\epsilon\] The residual sum of squares (RSS) measures the dispersion of the data points around the fitted regression; it quantifies the variation left unexplained by the model, and is calculated as the sum of the squared differences between the observed and fitted values. \[RSS=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2\] The coefficient parameters are estimated by the least squares approach, which minimizes the model's RSS. To determine the amount of variation in the response explained by the selected predictors, we analyse the \(R^2\) value from the regression output; to decide which method performs better on this dataset, the RSS is used.

#MLR and MLR RSS
climate_lm <- lm(wb.EN.ATM.GHGT.ZG ~ ., data = wb_sub)
# RSS is the sum of *squared* residuals; sum(climate_lm$residuals)^2
# squares the sum instead, which is ~0 by construction since
# least-squares residuals always sum to zero
RSS <- sum(climate_lm$residuals^2)

Ward's Algorithm

Ward's method follows the agglomerative hierarchical clustering algorithm: it starts with each observation as a cluster of its own and repeatedly merges the pair of clusters whose merge yields the smallest increase in the total within-cluster sum of squares (SSE), continuing until a single cluster remains. The result is a sequence of nested partitions, displayed as a dendrogram. Setting the parameter hang = 1 in the plot shifts the labels to hang below the branches, which made finding the outliers relatively easier.

# Euclidean distance between observations, clustered with the ward.D2 method
set.seed(0)
library(cluster)
wb_sub2 <- scale(wb_sub)
wb_d <- dist(wb_sub2, method = "euclidean")
wb_hc <- hclust(wb_d, "ward.D2") # hierarchical clustering

The silhouette width criterion compares, for each observation, the average distance to the other observations in its own cluster with the average distance to the observations in the nearest neighbouring cluster. Averaging these widths over all observations gives a selection criterion for k, the number of clusters: we want the value of k with the highest average silhouette width.
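
A minimal sketch of this selection criterion, reusing the SWC() helper defined in the Appendix (assumed to be sourced already):

# Average silhouette width for candidate numbers of clusters k
for (k in 2:6) {
  labels_k <- cutree(wb_hc, k = k)
  cat("k =", k, "average silhouette width:", SWC(labels_k, wb_d), "\n")
}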

K-means Clustering

The k-means algorithm partitions the dataset into subgroups, where each observation belongs to exactly one group. It assigns each observation to a cluster according to its squared Euclidean distance (other distance measures are sometimes used) to the cluster mean, or centroid. In assigning observations to centroids, we want to minimise the variation within the clusters. The nstart parameter of the kmeans() function attempts multiple initial configurations and reports the best one; nstart = 25, a commonly recommended setting, tries 25 random sets of initial centroids. The k-means algorithm works as follows:

  1. Select the number of clusters, k.
  2. Randomly select k observations from the dataset as the initial centroids, without replacement.
  3. Assign each observation to its closest centroid.
  4. Recompute each centroid as the mean of the observations assigned to its cluster.
  5. Repeat steps 3-4 until the centroids no longer change.

The k-means algorithm aims to minimise the total SSE, and cluster quality is summarised by the between-cluster sum of squares divided by the total sum of squares: the proportion of the overall variation that the clustering explains. When the number of clusters is increased this proportion rises, but at a cost of complexity, giving a model that is harder to interpret with greater overlap between clusters. A compact elbow check over several values of k is sketched after the code below.

# Clusters = 3, number of iterations = 25
wb_km <- kmeans(wb_sub2, centers = 3, nstart = 25)
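
The elbow check referenced above can be sketched compactly as follows; this is a condensed, hedged version of the k2-k5 comparison in the Appendix.

# Total within-cluster SS for a range of k (the "elbow" check)
set.seed(0)
wss <- sapply(2:6, function(k)
  kmeans(wb_sub2, centers = k, nstart = 25)$tot.withinss)
plot(2:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")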

K-means Visualization

The function fviz_cluster(cluster object, data frame, …) from the factoextra library transforms the variables into new variables using principal component analysis (PCA), which reduces the dimensionality of the data by projecting the variables onto principal components. The plot axes are the first two principal components, which explain most of the variation contained in the dataset wb_sub. The function uses ggplot() to visualize the clusters against the principal components explaining the variation in the data (Kassambara).

library(factoextra)
wb_clust <- fviz_cluster(wb_km, data = wb_sub2, geom = "point",
                         ellipse.type = "norm", outlier.color = "black",
                         ggtheme = theme_bw())
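
As a hedged cross-check of the variance shown on the plot axes, the first two principal components can be inspected directly with prcomp() on the same scaled data:

# Proportion of variance captured by the first two principal components
wb_pca <- prcomp(wb_sub2)
summary(wb_pca)$importance[, 1:2]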

Results and Discussion

MLR

At a 5% significance level, the multiple linear regression output shows two significant predictors of the percentage change in GHG emissions since 1990: the percentage changes in methane (\(P < 0.001\)) and nitrous oxide (\(P = 0.00359\)). Both are well below the significance level, giving evidence to reject \(H_0\) in favour of the alternative: there is a significant linear relationship between at least one predictor and the total change in GHG emissions. From the \(R^2\) value of 0.7504, the selected explanatory variables do a reasonable job of explaining the variability in the response, and the adjusted \(\overline{R}^2 = 0.743\) confirms this after penalising for the number of predictors. The reported \(RSS = 2.100389 \times 10^{-25}\) is an artifact of squaring the sum of the residuals, which is zero by construction in least squares, rather than summing the squared residuals; the corrected RSS can be recovered from the residual standard error as \(RSS = RSE^2 \times df \approx 115^2 \times 168 \approx 2.2 \times 10^6\). Either way, I am unsure the model will be flexible enough to predict the NA values in the original dataset.

#MLR output
summary(climate_lm)
## 
## Call:
## lm(formula = wb.EN.ATM.GHGT.ZG ~ ., data = wb_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -340.43  -54.13  -16.11   25.11  836.13 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -13.55856   68.46146  -0.198  0.84325    
## wb.EN.ATM.CO2E.GF.ZS  -0.75671    0.76697  -0.987  0.32525    
## wb.EN.ATM.CO2E.LF.ZS   0.72989    0.75311   0.969  0.33385    
## wb.EN.ATM.CO2E.SF.ZS   0.95414    0.74245   1.285  0.20052    
## wb.EN.ATM.METH.ZG      0.98567    0.04781  20.617  < 2e-16 ***
## wb.EN.ATM.NOXE.ZG      0.41954    0.14202   2.954  0.00359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115 on 168 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.743 
## F-statistic:   101 on 5 and 168 DF,  p-value: < 2.2e-16
list(RSS = RSS)
## $RSS
## [1] 2.100389e-25

Table 2: Multiple Linear Regression Summary.

The residual vs fitted plot indicates an assumption violation: a clear pattern remains in the residuals, suggesting a non-linear relationship between the predictors and the response. I tried several transformations, such as exponential, quadratic, and reciprocal, but the assumption was still violated. The normality-of-residuals assumption is also violated; the normal Q-Q plot shows observations dispersing away from the reference line. Because of these assumption violations the null hypothesis is retained, accepting an elevated chance of committing a Type II error, and the practical consequences of such an error could be very significant.

# MLR assumption diagnostics; autoplot() returns ggplot objects,
# which are arranged in a grid automatically
library(ggfortify)
autoplot(climate_lm, which = c(1, 2))

Figures 2 & 3: Diagnostic plots for the MLR model. Note the failed assumptions and outliers.

Hierarchical Clustering

Ward's method of hierarchical clustering indicated an outlier efficiently and was easy to interpret from the dendrogram. It identifies observation 76, corresponding to Equatorial Guinea, as an outlier in the clusters. It also indicates two main clusters when the dendrogram is cut at a height of approximately 15.

library(cluster)
#Hierarchical Clustering dendrogram
plot(wb_hc, main="Hierarchical Dendrogram", xlab = "", sub= "", hang = 1)

Figure 4: Dendrogram from Ward's method of hierarchical clustering. Note the outlier, observation 76.

Using Ward's method of hierarchical clustering and cutting the dendrogram at k = 3 gives an average silhouette width of 0.2875157. This is not very high, though there is little improvement for k = {4, 5, 6}. Therefore, for interpretability I believe k = 3 is the right number of clusters, as there is little change in the average distance from the observations in each cluster to their nearest neighbouring cluster. Introducing a higher value of k adds complexity without improving separation, producing considerable overlap between clusters, as seen in the PCA cluster plot.

wb_ward <- cutree(wb_hc, k = 3) #3 clusters
wb_sil <- silhouette(wb_ward, dist=wb_d)
mean(wb_sil[,3]) # Average silhouette width
## [1] 0.2875157

K-means

Using k-means with the number of clusters set to 3 yields between_SS / total_SS = 50.0%, the proportion of total variation explained by the clustering. Higher values of k increase this proportion but add complexity, making the assignment of observations to clusters harder to interpret. This method successfully isolates observation 76 in its own cluster (cluster 3, of size 1) as an outlier, and gives an easy visual interpretation of the cluster plot in the principal component dimensions. When the fviz_cluster(.) function reduces the dimensionality of the data, the two main components together explain \(Dim1 + Dim2 = 66.1\%\) of the total variation. The principal component analysis increased the variation explained in the cluster plot without increasing the number of clusters, keeping the interpretability of the cluster analysis the same, that is, without an increase in complexity.

wb_km
## K-means clustering with 3 clusters of sizes 87, 86, 1
## 
## Cluster means:
##   wb.EN.ATM.GHGT.ZG wb.EN.ATM.CO2E.GF.ZS wb.EN.ATM.CO2E.LF.ZS
## 1        0.08251016           -0.6032311            0.8259509
## 2       -0.20813150            0.5874215           -0.8156806
## 3       10.72092460            1.9628585           -1.7092018
##   wb.EN.ATM.CO2E.SF.ZS wb.EN.ATM.METH.ZG wb.EN.ATM.NOXE.ZG
## 1           -0.5058678       -0.02900663         0.3435366
## 2            0.5197327       -0.11078149        -0.3619892
## 3           -0.6865182       12.05078491         1.2433839
## 
## Clustering vector:
##   2   3   4   6   7   8  10  11  12  13  14  15  16  17  18  19  20  21  22  23 
##   2   2   1   2   2   2   1   2   2   2   1   2   1   1   2   2   2   1   2   2 
##  24  26  27  28  29  30  31  32  33  34  36  37  38  39  40  41  42  43  44  45 
##   1   2   1   1   2   1   2   1   2   1   2   2   2   1   1   1   2   1   1   1 
##  46  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  68  69 
##   1   1   2   2   1   1   2   1   2   1   2   1   2   2   1   2   1   2   1   2 
##  70  71  73  74  75  76  77  78  80  82  84  85  86  87  88  90  91  92  93  94 
##   2   1   1   1   1   3   2   1   1   1   1   2   1   2   2   2   2   2   1   1 
##  95  96  97  98  99 100 101 102 103 104 106 108 110 111 112 114 115 116 117 118 
##   2   2   1   2   2   2   1   2   1   1   2   2   1   1   1   1   2   2   1   2 
## 121 123 124 126 128 129 130 131 133 135 136 138 139 140 142 143 144 145 146 147 
##   2   2   1   2   2   1   1   2   2   1   1   1   2   1   1   1   1   2   2   1 
## 149 150 151 152 153 154 156 157 159 160 161 164 165 166 167 169 170 172 173 174 
##   2   2   2   1   2   2   1   2   2   2   1   2   2   2   1   1   1   1   1   1 
## 176 179 180 181 182 183 184 186 187 189 190 191 192 193 194 195 197 198 199 200 
##   1   1   1   2   2   1   1   1   1   1   1   2   2   2   1   1   2   2   1   1 
## 201 202 203 204 205 206 207 210 211 212 214 215 216 217 
##   1   2   1   2   2   1   2   2   1   1   1   2   2   2 
## 
## Within cluster sum of squares by cluster:
## [1] 178.4684 340.2000   0.0000
##  (between_SS / total_SS =  50.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
wb_clust

Table 3: K-means cluster summary. Note observation 76 forming its own cluster (cluster 3).
Figure 5: Cluster plot of the k-means clusters, with dimensionality reduced by PCA.

Conclusion

In summary, multiple linear regression is not an applicable method for predicting the total percentage change in GHG emissions: non-linear relationships and suspected collinearity between the predictor variables violate the model's assumptions. Possible remedies are non-parametric methods that do not rely on normality and homogeneous variance, or a data transformation that improves linearity.
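
One hedged sketch of the non-parametric route uses Spearman rank correlations, which rely on neither linearity nor normality (illustrative only, not part of the fitted models):

# Rank-based association between the response and the significant predictors
cor(wb_sub$wb.EN.ATM.GHGT.ZG, wb_sub$wb.EN.ATM.METH.ZG, method = "spearman")
cor.test(wb_sub$wb.EN.ATM.GHGT.ZG, wb_sub$wb.EN.ATM.NOXE.ZG, method = "spearman")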

Hierarchical and k-means clustering successfully identified an extreme outlier in greenhouse gas emission contributions: observation 76, corresponding to Equatorial Guinea in the original dataset. Ward's method of hierarchical clustering found the outlier somewhat more easily than k-means, though the k-means clusters were easier to interpret. Visualizing the k-means clusters with principal component analysis proved useful: lowering the dimensionality onto principal components explained variation comparable to increasing cluster complexity from k = 3 to k = 5. This visualization kept flexibility and interpretability high while increasing the share of variation attributed to the selected greenhouse gas variables.

Reference List

Arneth, A., & Barbosa, H. (2009). IPCC Special Report on Climate Change, Desertification, Land Degradation, Sustainable Land Management, Food Security, and Greenhouse Gas Fluxes in Terrestrial Ecosystems (pp. 5-7). Retrieved from https://www.ipcc.ch/site/assets/uploads/2019/08/4.-SPM_Approved_Microsite_FINAL.pdf

Chantry, M., Christensen, H., Dueben, P., & Palmer, T. (2020). Opportunities and challenges for machine learning in weather and climate modelling: hard, medium and soft AI. Retrieved from https://royalsocietypublishing.org/doi/pdf/10.1098/rsta.2020.0083

Kassambara, A. K-means clustering visualization in R: step by step guide (fviz_cluster). Retrieved from https://www.datanovia.com/en/blog/k-means-clustering-visualization-in-r-step-by-step-guide/

Appendix

# Testing candidate numbers of centroids against their total within-cluster SS

k2 <- kmeans(wb_sub2, centers = 2, nstart = 25)
k3 <- kmeans(wb_sub2, centers = 3, nstart = 25)
k4 <- kmeans(wb_sub2, centers = 4, nstart = 25)
k5 <- kmeans(wb_sub2, centers = 5, nstart = 25)
k3$tot.withinss
## [1] 518.6684
k4$tot.withinss
## [1] 392.7355
k5$tot.withinss
## [1] 319.9771
clust_tab <- table(wb_km$cluster, wb_ward)
clust_tab
##    wb_ward
##      1  2  3
##   1 23 64  0
##   2 85  1  0
##   3  0  0  1
# New data frame matching countries to their k-means clusters
# (named wb_countries to avoid overwriting the scaled matrix wb_sub2)
wb_countries <- data.frame(wb$country, wb$EN.ATM.GHGT.ZG,
                     wb$EN.ATM.CO2E.GF.ZS,
                     wb$EN.ATM.CO2E.LF.ZS,
                     wb$EN.ATM.CO2E.SF.ZS, wb$EN.ATM.METH.ZG,
                     wb$EN.ATM.NOXE.ZG)
wb_country <- na.omit(wb_countries)
wb_country$cluster <- wb_km$cluster
# Silhouette width criterion function

SWC <- function(clusterLabels, distanceMatrix)
{
  require(cluster)
  sil <- silhouette(x = clusterLabels, dist = distanceMatrix)
  return(mean(sil[, 3]))  # average silhouette width over all observations
}