A Gentle Data Analysis of Climate Change

Ha Le - c#3409610

University of Newcastle

STAT6020: PREDICTIVE ANALYTICS

05/11/2022

Abstract

Global climate change is one of the most pressing concerns for every country on Earth. In this report, we briefly review some of the indicators related to climate change, using data from the World Bank Climate Change Data website. We first explored the descriptive statistics of the variables in the data source and then carried out some basic analysis. We also applied an outlier detection method to identify unusual countries within each group of indicators, and Principal Component Analysis to reduce the number of features and visualize the data. Finally, we built a Support Vector Machine model to predict whether total emissions rose, using correlated indicators as predictors. We found that more countries are focusing on producing energy from renewable resources, and that a small number of top-emitting countries account for a large share of the world's total emissions. The findings in this report raise awareness of global climate change and suggest that we should take more responsibility for this matter.

1. Introduction

Climate change is a crucial issue that affects everyone on the planet. It is also an interesting subject that lends itself to a wide range of data analytics across various topics. Climate change refers to long-term changes in temperature and weather patterns caused by natural processes or human activities, mainly the burning of coal, oil and gas. Burning these fuels produces greenhouse gas emissions that act like a blanket around the Earth, trapping heat from the sun and raising temperatures. Climate change heavily affects people in every sector, from health, food, and safety to education, business, and the economy.

We want to understand more about climate change: how people are affected by it, what the key indicators of increasing global warming are, and what solutions exist. These questions are both our challenges and our motivation. In this study, we used the World Bank Climate Change data to find some facts and highlights from the indicators, such as which countries had the highest greenhouse gas emissions and which countries produced the most renewable electricity. We also built a linear regression model to predict the proportion of the population with access to electricity, and a support vector machine classifier to classify whether total greenhouse gas emissions increased or decreased relative to 1990. Finally, we performed outlier detection and used Principal Component Analysis as a dimensionality reduction technique.

2. Data

The data was downloaded from the World Bank (“World Development Indicators,”). The data set contains values for 217 countries on 77 indicators relevant to climate change. It is cross-sectional rather than longitudinal. Because of the way the data was compiled, there are many missing values and some values are more recent than others: the value reported for a country may date from any year between 2001 and 2020.

2.1 Indicators

The 77 indicators fall into 13 categories. The category counts and descriptive statistics are shown below:

Category and Count of Indicators
category  AG  BX  EG  EN  ER  IC  IQ  IS  NV  SE  SH  SI  SP
count     13   1  13  30   5   1   1   1   1   2   3   1   5
Descriptive Stats of numeric variables
variable min q1 median mean q3 max missing pct_missing variable_desc
AG.LND.AGRI.K2 3 1855 26528 227271 164225 5285287 6 2.76 Agricultural land (sq. km)
AG.LND.AGRI.ZS 0.54 19.67 38.48 37.44 53.39 80.77 6 2.76 Agricultural land (% of land area)
AG.LND.ARBL.ZS 0.0 3.7 10.6 14.1 20.7 59.8 9 4.15 Arable land (% of land area)
AG.LND.EL5M.RU.K2 0 38 800 6441 3519 158139 38 17.51 Rural land area where elevation is below 5 meters (sq. km)
AG.LND.EL5M.RU.ZS 0.00 0.43 1.28 4.18 2.85 55.12 38 17.51 Rural land area where elevation is below 5 meters (% of total land area)
AG.LND.EL5M.UR.K2 0 7 57 752 391 23929 38 17.51 Urban land area where elevation is below 5 meters (sq. km)
AG.LND.EL5M.UR.ZS 0.00 0.02 0.11 1.09 0.62 22.65 38 17.51 Urban land area where elevation is below 5 meters (% of total land area)
AG.LND.EL5M.ZS 0.00 0.53 1.56 5.27 3.58 55.88 38 17.51 Land area where elevation is below 5 meters (% of total land area)
AG.LND.FRST.K2 0 1081 20420 189251 96590 8153116 3 1.38 Forest area (sq. km)
AG.LND.FRST.ZS 0 11 31 32 50 97 3 1.38 Forest area (% of land area)
AG.LND.IRIG.AG.ZS 0.01 1.00 3.98 9.08 11.20 73.81 93 42.86 Agricultural irrigated land (% of total agricultural land)
AG.LND.PRCP.MM 18 563 1031 1171 1710 3240 35 16.13 Average precipitation in depth (mm per year)
AG.YLD.CREL.KG 123 1568 2928 3596 4806 27838 36 16.59 Cereal yield (kg per hectare)
BX.KLT.DINV.WD.GD.ZS -1275.2 1.2 2.6 -2.1 4.4 103.9 19 8.76 Foreign direct investment, net inflows (% of GDP)
EG.ELC.ACCS.ZS 6.7 84.8 100.0 86.5 100.0 100.0 1 0.46 Access to electricity (% of population)
EG.ELC.COAL.ZS 0 0 1 17 30 97 76 35.02 Electricity production from coal sources (% of total)
EG.ELC.HYRO.ZS 0.0 1.1 10.5 26.6 49.6 100.0 76 35.02 Electricity production from hydroelectric sources (% of total)
EG.ELC.NGAS.ZS 0 0 15 27 45 100 76 35.02 Electricity production from natural gas sources (% of total)
EG.ELC.NUCL.ZS 0 0 0 5 0 78 76 35.02 Electricity production from nuclear sources (% of total)
EG.ELC.PETR.ZS 0.00 0.31 1.66 15.45 17.82 100.00 76 35.02 Electricity production from oil sources (% of total)
EG.ELC.RNEW.ZS 0.0 1.6 16.2 30.5 52.8 100.0 NA NA Renewable electricity output (% of total electricity output)
EG.ELC.RNWX.KH 0 2000000 398000000 11631248227 3902000000 317421000000 76 35.02 Electricity production from renewable sources, excluding hydroelectric (kWh)
EG.ELC.RNWX.ZS 0.00 0.03 1.85 7.06 8.78 65.44 76 35.02 Electricity production from renewable sources, excluding hydroelectric (% of total)
EG.FEC.RNEW.ZS 0.0 5.9 20.9 28.8 44.0 96.4 4 1.84 Renewable energy consumption (% of total final energy consumption)
EG.USE.COMM.GD.PP.KD 4.5 66.9 90.5 113.5 133.5 498.8 54 24.88 Energy use (kg of oil equivalent) per $1,000 GDP (constant 2017 PPP)
EG.USE.ELEC.KH.PC 39 913 2604 4246 5539 53832 75 34.56 Electric power consumption (kWh per capita)
EG.USE.PCAP.KG.OE 10 570 1233 2244 2713 17923 45 20.74 Energy use (kg of oil equivalent per capita)
EN.ATM.CO2E.EG.ZS 0.17 1.68 2.31 2.87 2.85 103.16 49 22.58 CO2 intensity (kg per kg of oil equivalent energy use)
EN.ATM.CO2E.GF.KT 0 0 319 33453 15183 1498556 10 4.61 CO2 emissions from gaseous fuel consumption (kt)
EN.ATM.CO2E.GF.ZS 0.0 0.0 5.2 17.4 29.1 207.4 26 11.98 CO2 emissions from gaseous fuel consumption (% of total)
EN.ATM.CO2E.KD.GD 0.037 0.236 0.361 0.455 0.569 1.603 29 13.36 CO2 emissions (kg per 2010 US$ of GDP)
EN.ATM.CO2E.KT 10 2275 11590 175806 65985 10313460 26 11.98 CO2 emissions (kt)
EN.ATM.CO2E.LF.KT 0 1082 5625 50234 23115 2127054 10 4.61 CO2 emissions from liquid fuel consumption (kt)
EN.ATM.CO2E.LF.ZS 0 39 61 61 85 118 26 11.98 CO2 emissions from liquid fuel consumption (% of total)
EN.ATM.CO2E.PC 0.03 0.76 2.54 4.19 5.92 32.42 26 11.98 CO2 emissions (metric tons per capita)
EN.ATM.CO2E.PP.GD 0.024 0.115 0.173 0.210 0.253 0.857 31 14.29 CO2 emissions (kg per PPP $ of GDP)
EN.ATM.CO2E.PP.GD.KD 0.02 0.12 0.18 0.22 0.26 0.88 35 16.13 CO2 emissions (kg per 2017 PPP $ of GDP)
EN.ATM.CO2E.SF.KT 0 0 216 67060 7028 6951653 10 4.61 CO2 emissions from solid fuel consumption (kt)
EN.ATM.CO2E.SF.ZS 0.0 0.0 3.2 15.4 24.6 121.6 26 11.98 CO2 emissions from solid fuel consumption (% of total)
EN.ATM.GHGO.KT.CE -364711 -1388 150 -3671 1917 98711 33 15.21 Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent)
EN.ATM.GHGO.ZG -620 -79 30 12820 352 875606 29 13.36 Other greenhouse gas emissions (% change from 1990)
EN.ATM.GHGT.KT.CE 30 9280 39060 240177 107065 12355240 26 11.98 Total greenhouse gas emissions (kt of CO2 equivalent)
EN.ATM.GHGT.ZG -78.0 -7.5 44.2 86.8 113.8 2519.0 35 16.13 Total greenhouse gas emissions (% change from 1990)
EN.ATM.HFCG.KT.CE 0 0 105 5663 1203 300896 80 36.87 HFC gas emissions (thousand metric tons of CO2 equivalent)
EN.ATM.METH.KT.CE 0 2390 9200 42798 33710 1238630 26 11.98 Methane emissions (kt of CO2 equivalent)
EN.ATM.METH.ZG -100 3 26 49 62 2415 13 5.99 Methane emissions (% change from 1990)
EN.ATM.NOXE.KT.CE 0 765 3430 15625 12180 538790 26 11.98 Nitrous oxide emissions (thousand metric tons of CO2 equivalent)
EN.ATM.NOXE.ZG -100 -24 17 38 51 2510 12 5.53 Nitrous oxide emissions (% change from 1990)
EN.ATM.PFCG.KT.CE 0 0 0 511 134 20578 80 36.87 PFC gas emissions (thousand metric tons of CO2 equivalent)
EN.ATM.SF6G.KT.CE 0 0 0 1186 366 57054 80 36.87 SF6 gas emissions (thousand metric tons of CO2 equivalent)
EN.CLC.DRSK.XQ 1.0 2.8 3.2 3.3 3.8 4.8 134 61.75 Disaster risk reduction progress score (1-5 scale; 5=best)
EN.CLC.GHGR.MT.CE -990.1 -24.1 -3.6 -17.5 -0.3 1329.0 155 71.43 GHG net emissions/removals by LUCF (Mt of CO2 equivalent)
EN.CLC.MDAT.ZS 0.00 0.02 0.25 1.17 1.27 9.23 49 22.58 Droughts, floods, extreme temperatures (% of population, average 1990-2009)
EN.POP.EL5M.RU.ZS 0.00 0.40 0.98 3.58 3.07 48.24 38 17.51 Rural population living in areas where elevation is below 5 meters (% of total population)
EN.POP.EL5M.UR.ZS 0.00 0.61 1.85 3.84 3.71 51.59 38 17.51 Urban population living in areas where elevation is below 5 meters (% of total population)
EN.POP.EL5M.ZS 0.0 1.2 3.5 7.4 7.6 58.5 38 17.51 Population living in areas where elevation is below 5 meters (% of total population)
EN.URB.MCTY.TL.ZS 4.2 14.4 21.6 26.3 32.2 100.0 96 44.24 Population in urban agglomerations of more than 1 million (% of total population)
ER.H2O.FWTL.K3 0.0 0.4 1.8 21.4 10.4 647.5 36 16.59 Annual freshwater withdrawals, total (billion cubic meters)
ER.H2O.FWTL.ZS 0 2 9 122 28 6420 42 19.35 Annual freshwater withdrawals, total (% of internal resources)
ER.LND.PTLD.ZS 0.0 6.9 15.2 16.8 23.2 54.4 5 2.30 Terrestrial protected areas (% of total land area)
ER.MRN.PTMR.ZS 0.00 0.11 1.04 9.16 8.34 213.43 47 21.66 Marine protected areas (% of territorial waters)
ER.PTD.TOTL.ZS 0.0 1.8 8.5 12.9 18.3 99.5 6 2.76 Terrestrial and marine protected areas (% of total territorial area)
IC.BUS.EASE.XQ 1 49 96 96 143 190 28 12.90 Ease of doing business index (1=most business-friendly regulations)
IQ.CPA.PUBS.XQ 1.4 2.7 3.1 3.0 3.3 4.1 130 59.91 CPIA public sector management and institutions cluster average (1=low to 6=high)
IS.ROD.PAVE.ZS 1.8 12.3 20.6 30.8 40.6 98.0 166 76.50 Roads, paved (% of total roads)
NV.AGR.TOTL.ZS 0.02 2.23 6.83 10.49 16.10 61.29 15 6.91 Agriculture, forestry, and fishing, value added (% of GDP)
SE.ENR.PRSC.FM.ZS 0.54 0.98 1.00 0.98 1.02 1.14 23 10.60 School enrollment, primary and secondary (gross), gender parity index (GPI)
SE.PRM.CMPT.ZS 27 82 95 90 101 129 28 12.90 Primary completion rate, total (% of relevant age group)
SH.DYN.MORT 1.7 6.6 16.6 27.6 42.4 117.2 24 11.06 NA
SH.MED.CMHW.P3 0.00 0.08 0.24 0.43 0.51 3.65 157 72.35 Community health workers (per 1,000 people)
SH.STA.MALN.ZS 0.2 2.7 6.8 10.0 15.1 39.9 67 30.88 Prevalence of underweight, weight for age (% of children under 5)
SI.POV.DDAY 0.0 0.3 1.7 13.8 19.0 78.8 55 25.35 NA
SP.POP.GROW -1.72 0.32 1.06 1.14 1.92 4.12 NA NA Population growth (annual %)
SP.POP.TOTL 10834 786559 6624554 35617163 25687041 1402112000 NA NA Population, total
SP.URB.GROW -1.59 0.69 1.60 1.78 2.85 5.67 2 0.92 Urban population growth (annual %)
SP.URB.TOTL 5498 458630 3899416 20154875 11327426 861289359 2 0.92 Urban population
SP.URB.TOTL.IN.ZS 13 43 62 61 81 100 2 0.92 Urban population (% of total population)
Variables that have more than 40% missing values
variable min q1 median mean q3 max missing pct_missing variable_desc
AG.LND.IRIG.AG.ZS 0.01 1.00 3.98 9.08 11.20 73.81 93 42.86 Agricultural irrigated land (% of total agricultural land)
EN.CLC.DRSK.XQ 1.0 2.8 3.2 3.3 3.8 4.8 134 61.75 Disaster risk reduction progress score (1-5 scale; 5=best)
EN.CLC.GHGR.MT.CE -990.1 -24.1 -3.6 -17.5 -0.3 1329.0 155 71.43 GHG net emissions/removals by LUCF (Mt of CO2 equivalent)
EN.URB.MCTY.TL.ZS 4.2 14.4 21.6 26.3 32.2 100.0 96 44.24 Population in urban agglomerations of more than 1 million (% of total population)
IQ.CPA.PUBS.XQ 1.4 2.7 3.1 3.0 3.3 4.1 130 59.91 CPIA public sector management and institutions cluster average (1=low to 6=high)
IS.ROD.PAVE.ZS 1.8 12.3 20.6 30.8 40.6 98.0 166 76.50 Roads, paved (% of total roads)
SH.MED.CMHW.P3 0.00 0.08 0.24 0.43 0.51 3.65 157 72.35 Community health workers (per 1,000 people)
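
The pct_missing column can be reproduced from the missing counts and the 217 countries in the data set. The following is a minimal Python sketch of that arithmetic, not the code used for the report; the function names pct_missing and flag_sparse are hypothetical, and only a handful of indicators from the table are copied in for illustration.

```python
# Number of countries (rows) in the data set, as stated in Section 2.
N_COUNTRIES = 217

# Missing-value counts copied from the descriptive statistics table
# (a small illustrative subset, not all 77 indicators).
missing_counts = {
    "AG.LND.AGRI.K2": 6,
    "AG.LND.IRIG.AG.ZS": 93,
    "EN.CLC.GHGR.MT.CE": 155,
    "IS.ROD.PAVE.ZS": 166,
    "SP.URB.TOTL": 2,
}


def pct_missing(count, n_rows=N_COUNTRIES):
    """Percentage of rows with a missing value, rounded to 2 decimals."""
    return round(100 * count / n_rows, 2)


def flag_sparse(counts, threshold=40.0):
    """Return the indicator codes whose missing percentage exceeds the threshold."""
    return sorted(code for code, c in counts.items() if pct_missing(c) > threshold)


for code, count in missing_counts.items():
    print(code, pct_missing(count))
print(flag_sparse(missing_counts))
```

With the 40% threshold used in the table above, the three sparse indicators in this subset are flagged and the other two are kept.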

2.2 Pre-processing

We noticed that some pairs of variables have very high correlation coefficients. Their descriptions show that several are clearly derived from other variables. Identifying these high correlations helps us choose appropriate variables for the report and drop redundant ones to reduce complexity.

Highly correlated variable pairs (correlation coefficient > 0.90)
Var1 Var1_desc Var2 Var2_desc value
EN.ATM.CO2E.PP.GD CO2 emissions (kg per PPP $ of GDP) EN.ATM.CO2E.PP.GD.KD CO2 emissions (kg per 2017 PPP $ of GDP) 0.9999
EN.ATM.CO2E.KT CO2 emissions (kt) EN.ATM.GHGT.KT.CE Total greenhouse gas emissions (kt of CO2 equivalent) 0.9974
EN.ATM.CO2E.KT CO2 emissions (kt) EN.ATM.SF6G.KT.CE SF6 gas emissions (thousand metric tons of CO2 equivalent) 0.9729
EN.ATM.GHGT.KT.CE Total greenhouse gas emissions (kt of CO2 equivalent) EN.ATM.SF6G.KT.CE SF6 gas emissions (thousand metric tons of CO2 equivalent) 0.9663
EN.ATM.NOXE.KT.CE Nitrous oxide emissions (thousand metric tons of CO2 equivalent) SP.URB.TOTL Urban population 0.9649
EN.ATM.CO2E.KT CO2 emissions (kt) EN.ATM.CO2E.SF.KT CO2 emissions from solid fuel consumption (kt) 0.9594
SP.POP.TOTL Population, total SP.URB.TOTL Urban population 0.9546
EN.ATM.CO2E.SF.KT CO2 emissions from solid fuel consumption (kt) EN.ATM.GHGT.KT.CE Total greenhouse gas emissions (kt of CO2 equivalent) 0.9522
AG.LND.EL5M.RU.ZS Rural land area where elevation is below 5 meters (% of total land area) AG.LND.EL5M.ZS Land area where elevation is below 5 meters (% of total land area) 0.9510
EN.ATM.CO2E.LF.KT CO2 emissions from liquid fuel consumption (kt) EN.ATM.HFCG.KT.CE HFC gas emissions (thousand metric tons of CO2 equivalent) 0.9509
EN.ATM.GHGT.KT.CE Total greenhouse gas emissions (kt of CO2 equivalent) EN.ATM.NOXE.KT.CE Nitrous oxide emissions (thousand metric tons of CO2 equivalent) 0.9430
EN.ATM.GHGT.KT.CE Total greenhouse gas emissions (kt of CO2 equivalent) SP.URB.TOTL Urban population 0.9400
EG.ELC.HYRO.ZS Electricity production from hydroelectric sources (% of total) EG.ELC.RNEW.ZS Renewable electricity output (% of total electricity output) 0.9377
SP.POP.GROW Population growth (annual %) SP.URB.GROW Urban population growth (annual %) 0.9346
ER.H2O.FWTL.K3 Annual freshwater withdrawals, total (billion cubic meters) SP.POP.TOTL Population, total 0.9324
EN.ATM.CO2E.KT CO2 emissions (kt) SP.URB.TOTL Urban population 0.9266
EN.ATM.CO2E.KT CO2 emissions (kt) EN.ATM.NOXE.KT.CE Nitrous oxide emissions (thousand metric tons of CO2 equivalent) 0.9264
EN.ATM.CO2E.SF.KT CO2 emissions from solid fuel consumption (kt) SP.URB.TOTL Urban population 0.9156
ER.H2O.FWTL.K3 Annual freshwater withdrawals, total (billion cubic meters) SP.URB.TOTL Urban population 0.9099
EN.ATM.METH.KT.CE Methane emissions (kt of CO2 equivalent) EN.ATM.NOXE.KT.CE Nitrous oxide emissions (thousand metric tons of CO2 equivalent) 0.9091
EN.ATM.HFCG.KT.CE HFC gas emissions (thousand metric tons of CO2 equivalent) EN.ATM.SF6G.KT.CE SF6 gas emissions (thousand metric tons of CO2 equivalent) 0.9056
AG.LND.EL5M.UR.K2 Urban land area where elevation is below 5 meters (sq. km) EN.ATM.CO2E.KT CO2 emissions (kt) 0.9052
EG.ELC.RNWX.KH Electricity production from renewable sources, excluding hydroelectric (kWh) EN.ATM.CO2E.LF.KT CO2 emissions from liquid fuel consumption (kt) 0.9032
AG.LND.EL5M.UR.K2 Urban land area where elevation is below 5 meters (sq. km) EN.ATM.SF6G.KT.CE SF6 gas emissions (thousand metric tons of CO2 equivalent) 0.9021
EN.ATM.GHGT.KT.CE Total greenhouse gas emissions (kt of CO2 equivalent) EN.ATM.METH.KT.CE Methane emissions (kt of CO2 equivalent) 0.9008
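
A screen for highly correlated pairs like the one in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used for the report: it assumes Pearson correlation computed on pairwise-complete observations and a threshold on the absolute value, and the column names and numbers in the toy data are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def high_corr_pairs(columns, threshold=0.9):
    """List (var1, var2, r) for every pair with |r| above the threshold."""
    found = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if r is not None and abs(r) > threshold:
            found.append((a, b, round(r, 4)))
    return sorted(found, key=lambda t: -abs(t[2]))


# Toy data (hypothetical values); None marks a missing observation.
toy = {
    "pop_total": [1, 2, 3, 4, None],
    "urb_total": [0.5, 1.1, 1.4, 2.1, 3.0],
    "forest_pct": [30, 10, 50, 20, 40],
}
print(high_corr_pairs(toy))
```

In the toy data only the pair (pop_total, urb_total) passes the threshold, which mirrors how total and urban population appear together in the table.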

We also considered the following pre-processing steps:

  • Missing values: there were many missing values, but we did not remove entire rows from the data source. An indicator may be missing for some countries yet still be valuable for the analysis, so we tracked missing values and handled them at each analysis step.
  • High correlation: we excluded derived variables from our analysis. Even for non-derived variables, the redundancy between two highly correlated variables can be removed with PCA; as a demonstration, we use PCA to reduce the number of dimensions in Section 3.2.
  • Duplicates: no duplicated records were found.
  • Implausible values: from the descriptive statistics, we did not find any impossible values (such as a temperature above 100 degrees or a negative population).
  • Imputation: imputing missing or incorrect values requires cross-checking against several reliable sources, so we skipped this step as out of scope for this report.
  • Ethics: the data set raises no ethical concerns, as it contains no personally identifiable information.

3. Method, Results and Discussion

3.1 Outlier Detection

We used an unsupervised learning method, the weighted-distance kNN outlier algorithm, to detect outliers in the data set. The algorithm is simple: it takes the mean distance from an observation to its k nearest neighbors as a measure of outlyingness. If this measure is large, the observation is far from its neighbors and may lie outside their cluster.

We ran the kNN outlier algorithm on the subset of indicators in each category. Note that this method only lists the observations with the largest distances to their k neighbors; it does not establish that those observations are outliers.
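
The scoring idea described above (mean distance to the k nearest neighbors) can be sketched in plain Python. This is a simplified, unweighted illustration and not the implementation used for the tables below; the function names, the choice k=3, and the toy points are assumptions made for the example. In practice the indicators would probably need to be put on a common scale first, which this sketch leaves out.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_outlier_scores(points, k=3):
    """Mean Euclidean distance from each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(d[:k]) / k)
    return scores


def top_outliers(labels, points, k=3, n=8):
    """Return the n (label, score) pairs with the largest scores."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(zip(labels, scores), key=lambda t: t[1], reverse=True)
    return ranked[:n]


# Toy data (hypothetical): four points close together and one far away.
labels = ["A", "B", "C", "D", "E"]
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
for label, score in top_outliers(labels, points, k=3, n=3):
    print(label, round(score, 3))
```

The point E ranks first because its three nearest neighbors are all far away, while the four clustered points receive similar, much smaller scores. As in the report, a high rank is only a prompt for further investigation.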

Top 8 Outliers in AG
original_id country outlier_score
145 Netherlands 15.247
37 China 9.288
166 Russian Federation 8.746
204 United States 7.555
6 United Arab Emirates 7.327
33 Canada 6.902
27 Brazil 5.603
180 Suriname 5.467
Top 8 Outliers in EG
original_id country outlier_score
94 Iceland 9.821
204 United States 7.079
37 China 6.343
196 Trinidad and Tobago 5.867
164 Qatar 5.024
40 Congo, Dem. Rep.  4.832
65 France 4.448
54 Denmark 4.422
Top 8 Outliers in EN
original_id country outlier_score
37 China 22.011
204 United States 16.230
166 Russian Federation 11.003
145 Netherlands 10.763
210 Vietnam 10.298
49 Cyprus 9.982
90 India 9.672
135 Mozambique 7.086
Top 8 Outliers in ER
original_id country outlier_score
182 Slovenia 9.891
58 Egypt, Arab Rep.  8.714
90 India 6.353
37 China 5.732
20 Bahrain 5.124
204 United States 4.710
193 Turkmenistan 2.605
6 United Arab Emirates 2.459
Top 8 Outliers in SP
original_id country outlier_score
37 China 13.733
90 India 10.434
204 United States 3.503
88 Indonesia 2.247
27 Brazil 2.235
130 Malta 2.054
123 Moldova 1.997
143 Nigeria 1.886

From the kNN outlier tables above, in the AG (Agricultural) category the Netherlands is ‘far’ from the other countries (a distance score of 15.2, compared with 9.3 and 8.7 for the second and third highest scores). On further investigation, we found that the area and proportion of land where elevation is below 5 meters are considerably higher for the Netherlands than for other countries. In the EN (Environment) category, China and the United States (22.0 and 16.2, compared with 11.0 for the third highest score) differ markedly from the rest, which could be due to the very high CO2 emissions of these countries. In the SP (Population) category, the top two scores are 13.7 and 10.4, followed by 3.5 for the third country, a large gap between the top two and the rest. Unsurprisingly, these are China and India, the world's two largest populations.

Outlier detection helps us identify not only potentially erroneous observations (if any) but also special observations (if any). Both are important in data analysis, whether to fix errors or to detect rare cases (in fraud, crime, medicine…).

3.2 Principal Component Analysis (PCA)

In Section 2, we noticed high correlations among some of the five variables in the SP (Population) category. The main role of PCA is to remove highly linearly correlated variables and thus reduce the number of features while retaining most of the information in the data, which also makes it possible to visualize the data in one, two, or three dimensions. In this part, we applied PCA to the five indicators of the SP category.

##                        PC1      PC2       PC3      PC4      PC5
## SP.POP.GROW        0.62435 -0.04174  0.390287  0.67393  0.04400
## SP.POP.TOTL        0.02283  0.70644 -0.007256  0.07275 -0.70362
## SP.URB.GROW        0.65156  0.02199  0.217084 -0.72575 -0.03406
## SP.URB.TOTL       -0.02054  0.70581  0.053320 -0.01423  0.70595
## SP.URB.TOTL.IN.ZS -0.42979 -0.02350  0.893115 -0.11666 -0.05881
## [1] "The proportion of the Variance Explained PVE by each component:"
## [1] 0.441708 0.391837 0.147963 0.010086 0.008406

The five variables are projected onto five components (PC1-PC5), and the first two components alone explain about 83% of the variance. A further benefit of PCA is that, with fewer features, we can visualize the data in 2D or 3D. The figure, a 2D plot of the data approximated by the first two principal components, shows some outliers, and these were also detected by kNN in Section 3.1 (red circles). However, country id 130 (Malta), which kNN flagged as an outlier, falls inside the main cluster. This may indicate that the precision of the outlier detection was not high enough, or that some information was lost when the data was reduced to 2D and Malta's position would be reflected correctly in the full 5-dimensional space.
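
The share of variance explained by the first two components follows directly from the PVE values printed above. The short Python sketch below accumulates them; the helper components_needed is a hypothetical name introduced only for this illustration.

```python
from itertools import accumulate

# Proportion of variance explained (PVE) by PC1..PC5, copied from the output above.
pve = [0.441708, 0.391837, 0.147963, 0.010086, 0.008406]

# Running total: entry i is the variance explained by the first i+1 components.
cumulative = list(accumulate(pve))


def components_needed(pve, target):
    """Smallest number of leading components whose cumulative PVE reaches target."""
    for i, c in enumerate(accumulate(pve), start=1):
        if c >= target:
            return i
    return len(pve)


print([round(c, 4) for c in cumulative])
print(components_needed(pve, 0.80))
print(components_needed(pve, 0.95))
```

The second entry of the running total is 0.4417 + 0.3918 = 0.8335, so two components are enough for an 80% target, while a 95% target needs a third.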

We also built some linear regression models to predict cereal yield from the agricultural indicators and the proportion of the population with access to electricity from the electricity indicators. However, the models were not significant, so we did not include their results in the report; the corresponding code chunks are included below but hidden.

3.3 Some exploratory analysis

Top countries by agricultural land or forest area.

Top 10 Countries by agricultural land
country Agricultural_Land Percent_Land
China 5285287 56.08
United States 4058104 44.36
Australia 3588950 46.66
Brazil 2368788 28.34
Kazakhstan 2160365 80.02
Russian Federation 2154940 13.16
India 1796740 60.43
Saudi Arabia 1736290 80.77
Argentina 1487680 54.36
Mongolia 1134330 72.84
Top 10 Countries by Forest Area
country Forest_Area Percent_Forest
Russian Federation 8153116 49.78
Brazil 4966196 59.42
Canada 3469281 38.70
United States 3097950 33.87
China 2199782 23.34
Australia 1340051 17.42
Congo, Dem. Rep.  1261552 55.65
Indonesia 921332 49.07
Peru 723304 56.51
India 721600 24.27

Producing electricity from renewable sources is one of the best ways to reduce greenhouse gas emissions. The table below lists the countries that produce the most electricity from renewable sources, excluding hydroelectric, in gigawatt-hours (1 GWh = 1,000,000 kWh).

Top 10 Countries by Electricity Production from Renewable Sources (GWh)
country Renewable_Production Percent_Renewable
United States 317421 7.387
China 283851 4.857
Germany 168389 26.271
Japan 80292 7.756
United Kingdom 77262 22.970
India 74143 5.361
Brazil 70487 12.118
Spain 68948 24.820
Italy 63368 22.506
Canada 42037 6.267

Carbon dioxide CO2 Emissions

Top 18 Countries by Emissions
region Emissions
China 10313460
USA 4981300
India 2434520
Russia 1607550
Japan 1106150
Germany 709540
South Korea 630870
Iran 629290
Indonesia 583110
Canada 574400
Saudi Arabia 514600
Mexico 472140
South Africa 433250
Brazil 427710
Turkey 412970
Australia 386620
UK 358800
Italy 324850

The table and map above show the top 18 countries, which together “contribute” 80.11% of the carbon dioxide (CO2) emissions on our planet. Some of these countries may not be labeled on the map because of overlapping labels (South Korea is in the top 18 but does not appear on the plot because its label overlaps with those of China and Japan). Sadly, Australia is on the list too.

3.4 Support Vector Machines.

In this part, we consider the indicator EN.ATM.GHGT.ZG, total greenhouse gas emissions (% change from 1990). We coded it as a binary class: 1 if emissions increased (positive change) and 0 if emissions decreased (negative change). We then used 10-fold cross-validation to fit support vector machines that classify this variable under various settings of the cost (C) and gamma parameters, and picked the best model from the cross-validation. As predictors, we chose the 9 indicators most highly correlated with EN.ATM.GHGT.ZG. Of the 134 observations with no missing values on all 9 predictors, we randomly selected 10 for testing and used the remaining 124 to train the model. Finally, we applied the best model to the test set. Because the test set had only 10 observations, we measured only the test accuracy and did not produce a confusion matrix or other model assessment metrics. The model predicted 9 of the 10 test observations correctly.

## [1] "Selected Indicators "
##  [1] "EN.ATM.GHGT.ZG" "EN.ATM.METH.ZG" "SE.PRM.CMPT.ZS" "SP.URB.GROW"   
##  [5] "SP.POP.GROW"    "SI.POV.DDAY"    "SH.DYN.MORT"    "IC.BUS.EASE.XQ"
##  [9] "EN.CLC.MDAT.ZS" "EG.ELC.ACCS.ZS"
## [1] "Best model:"
## 
## Call:
## best.tune(method = svm, train.x = class ~ ., data = train_data, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 10, 100, 1000), gamma = c(0.001, 0.01, 0.1, 1, 
##     2, 10)), kernel = "radial", probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
## 
## Number of Support Vectors:  65
## [1] "Probabilities of prediction for Test set:"
##     decrease increase
## 154  0.20229   0.7977
## 85   0.55504   0.4450
## 128  0.73408   0.2659
## 23   0.66323   0.3368
## 194  0.02430   0.9757
## 3    0.34140   0.6586
## 151  0.09499   0.9050
## 80   0.19109   0.8089
## 7    0.47356   0.5264
## 160  0.41263   0.5874
## [1] "Test accuracy: "
## [1] 0.9
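
The data preparation around the SVM (binary coding of the target, the 124/10 hold-out split, the parameter grid, and the accuracy measure) can be sketched in Python without any modelling library. This is a sketch under stated assumptions, not the code behind the output above: the function names and the random seed are hypothetical, a change of exactly zero is assigned to class 0 for want of a stated rule, and the SVM fitting itself is omitted.

```python
import random
from itertools import product


def encode_change(pct_change):
    """1 if emissions increased since 1990 (positive change), otherwise 0.

    Assumption: a change of exactly 0 is coded as 0; the report does not say.
    """
    return 1 if pct_change > 0 else 0


def holdout_split(n_rows, n_test, seed=0):
    """Randomly split row indices into (train, test); the seed is arbitrary."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Parameter grid copied from the best.tune call shown above.
costs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
gammas = [0.001, 0.01, 0.1, 1, 2, 10]
grid = list(product(costs, gammas))

train_idx, test_idx = holdout_split(134, 10)
print(len(train_idx), len(test_idx), len(grid))

# Nine correct predictions out of ten give the reported test accuracy.
print(accuracy([1] * 10, [1] * 9 + [0]))
```

The grid has 7 x 6 = 42 cost and gamma combinations for cross-validation to compare. With only 10 test observations, each prediction moves the accuracy by 0.1, which is why the test accuracy is a rough estimate.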

4. Conclusions.

From this study, we found that outlier detection helps us identify “special” observations, if any. PCA gave us a simpler way to reduce complexity while retaining most of the information in the data and making it possible to visualize. The SVM showed an ability to predict whether total emissions increase or decrease from the predictors. Some exploratory results about the top countries (top in both the good and the bad sense) also raise concern. On the good side, more countries are producing electricity from renewable sources and the overall proportion is rising (which means less reliance on coal sources that yield large amounts of greenhouse gas emissions). On the bad side, the top 18 countries together account for more than 80% of the world's total CO2 emissions. These countries should take responsibility and act first to keep our planet greener (or at least not yellower, as shown in the map in Section 3).

There are still many facts that we did not include in this report because of limited data sources and time. The data used here is only a small part of the vast body of data related to climate change; to gain deeper insights, we would need more time and larger data sources, ideally longer time series, so that we can investigate how the climate changes over time.

5. References

World Development Indicators. Retrieved from https://databank.worldbank.org/source/world-development-indicators