Ha Le - c#3409610
University of Newcastle
STAT6020: PREDICTIVE ANALYTICS
05/11/2022
Global climate change is one of the most pressing concerns for every country on Earth. In this report, we briefly review some of the indicators related to climate change, using data from the World Bank Climate Change Data website. We first explored descriptive statistics of the variables in the data source and carried out some basic analysis. We then applied an outlier detection method to flag unusual countries within each group of indicators, and Principal Component Analysis (PCA) to reduce the number of features and visualize the data. Finally, we built a Support Vector Machine (SVM) model to predict whether total emissions increased, using correlated indicators as predictors. We found that more countries are focusing on producing energy from renewable resources, and that a small number of top-emitting countries account for a large share of the world's total emissions. The findings in this report raise awareness of global climate change and of the need to take more responsibility for it.
Climate change is a crucial issue that affects everyone on the planet. It is also an interesting subject that lends itself to a wide range of data analytics. Climate change refers to long-term changes in temperature and weather patterns caused by natural or human activities, mainly the burning of coal, oil and gas. The burning produces greenhouse gas emissions that act like a blanket around the Earth, trapping heat from the sun and raising the temperature. Climate change affects people heavily in every sector, from health, food, and safety to education, business, and the economy.
How are people affected by climate change, what are the key indicators of increasing global warming, and what are the solutions? These questions are both our challenges and our motivations. In this study, we used the World Bank Climate Change data to find some facts and highlights from the indicators, such as which countries had the highest greenhouse gas emissions and which countries produced the most renewable electricity. We also fitted linear regression models to predict the proportion of the population with access to electricity, and built a support vector machine classifier to classify whether total greenhouse gas emissions increased or decreased relative to 1990. Finally, we performed outlier detection and used Principal Component Analysis to reduce the dimensionality of the features.
The data were downloaded from the World Bank ("World Development Indicators"). The data set contains values for 217 countries on 77 indicators relevant to climate change. It is cross-sectional rather than longitudinal. Due to the nature of the data, there are many missing values, and some values are more recent than others: the value for a given country may come from any year between 2001 and 2020.
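The 77 indicators fall into 13 categories. The number of indicators per category and the descriptive statistics of each indicator are shown below:
| category | AG | BX | EG | EN | ER | IC | IQ | IS | NV | SE | SH | SI | SP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13 | 1 | 13 | 30 | 5 | 1 | 1 | 1 | 1 | 2 | 3 | 1 | 5 |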
| indicator | min | q1 | median | mean | q3 | max | missing | pct_missing | variable_desc |
|---|---|---|---|---|---|---|---|---|---|
| AG.LND.AGRI.K2 | 3 | 1855 | 26528 | 227271 | 164225 | 5285287 | 6 | 2.76 | Agricultural land (sq. km) |
| AG.LND.AGRI.ZS | 0.54 | 19.67 | 38.48 | 37.44 | 53.39 | 80.77 | 6 | 2.76 | Agricultural land (% of land area) |
| AG.LND.ARBL.ZS | 0.0 | 3.7 | 10.6 | 14.1 | 20.7 | 59.8 | 9 | 4.15 | Arable land (% of land area) |
| AG.LND.EL5M.RU.K2 | 0 | 38 | 800 | 6441 | 3519 | 158139 | 38 | 17.51 | Rural land area where elevation is below 5 meters (sq. km) |
| AG.LND.EL5M.RU.ZS | 0.00 | 0.43 | 1.28 | 4.18 | 2.85 | 55.12 | 38 | 17.51 | Rural land area where elevation is below 5 meters (% of total land area) |
| AG.LND.EL5M.UR.K2 | 0 | 7 | 57 | 752 | 391 | 23929 | 38 | 17.51 | Urban land area where elevation is below 5 meters (sq. km) |
| AG.LND.EL5M.UR.ZS | 0.00 | 0.02 | 0.11 | 1.09 | 0.62 | 22.65 | 38 | 17.51 | Urban land area where elevation is below 5 meters (% of total land area) |
| AG.LND.EL5M.ZS | 0.00 | 0.53 | 1.56 | 5.27 | 3.58 | 55.88 | 38 | 17.51 | Land area where elevation is below 5 meters (% of total land area) |
| AG.LND.FRST.K2 | 0 | 1081 | 20420 | 189251 | 96590 | 8153116 | 3 | 1.38 | Forest area (sq. km) |
| AG.LND.FRST.ZS | 0 | 11 | 31 | 32 | 50 | 97 | 3 | 1.38 | Forest area (% of land area) |
| AG.LND.IRIG.AG.ZS | 0.01 | 1.00 | 3.98 | 9.08 | 11.20 | 73.81 | 93 | 42.86 | Agricultural irrigated land (% of total agricultural land) |
| AG.LND.PRCP.MM | 18 | 563 | 1031 | 1171 | 1710 | 3240 | 35 | 16.13 | Average precipitation in depth (mm per year) |
| AG.YLD.CREL.KG | 123 | 1568 | 2928 | 3596 | 4806 | 27838 | 36 | 16.59 | Cereal yield (kg per hectare) |
| BX.KLT.DINV.WD.GD.ZS | -1275.2 | 1.2 | 2.6 | -2.1 | 4.4 | 103.9 | 19 | 8.76 | Foreign direct investment, net inflows (% of GDP) |
| EG.ELC.ACCS.ZS | 6.7 | 84.8 | 100.0 | 86.5 | 100.0 | 100.0 | 1 | 0.46 | Access to electricity (% of population) |
| EG.ELC.COAL.ZS | 0 | 0 | 1 | 17 | 30 | 97 | 76 | 35.02 | Electricity production from coal sources (% of total) |
| EG.ELC.HYRO.ZS | 0.0 | 1.1 | 10.5 | 26.6 | 49.6 | 100.0 | 76 | 35.02 | Electricity production from hydroelectric sources (% of total) |
| EG.ELC.NGAS.ZS | 0 | 0 | 15 | 27 | 45 | 100 | 76 | 35.02 | Electricity production from natural gas sources (% of total) |
| EG.ELC.NUCL.ZS | 0 | 0 | 0 | 5 | 0 | 78 | 76 | 35.02 | Electricity production from nuclear sources (% of total) |
| EG.ELC.PETR.ZS | 0.00 | 0.31 | 1.66 | 15.45 | 17.82 | 100.00 | 76 | 35.02 | Electricity production from oil sources (% of total) |
| EG.ELC.RNEW.ZS | 0.0 | 1.6 | 16.2 | 30.5 | 52.8 | 100.0 | NA | NA | Renewable electricity output (% of total electricity output) |
| EG.ELC.RNWX.KH | 0 | 2000000 | 398000000 | 11631248227 | 3902000000 | 317421000000 | 76 | 35.02 | Electricity production from renewable sources, excluding hydroelectric (kWh) |
| EG.ELC.RNWX.ZS | 0.00 | 0.03 | 1.85 | 7.06 | 8.78 | 65.44 | 76 | 35.02 | Electricity production from renewable sources, excluding hydroelectric (% of total) |
| EG.FEC.RNEW.ZS | 0.0 | 5.9 | 20.9 | 28.8 | 44.0 | 96.4 | 4 | 1.84 | Renewable energy consumption (% of total final energy consumption) |
| EG.USE.COMM.GD.PP.KD | 4.5 | 66.9 | 90.5 | 113.5 | 133.5 | 498.8 | 54 | 24.88 | Energy use (kg of oil equivalent) per $1,000 GDP (constant 2017 PPP) |
| EG.USE.ELEC.KH.PC | 39 | 913 | 2604 | 4246 | 5539 | 53832 | 75 | 34.56 | Electric power consumption (kWh per capita) |
| EG.USE.PCAP.KG.OE | 10 | 570 | 1233 | 2244 | 2713 | 17923 | 45 | 20.74 | Energy use (kg of oil equivalent per capita) |
| EN.ATM.CO2E.EG.ZS | 0.17 | 1.68 | 2.31 | 2.87 | 2.85 | 103.16 | 49 | 22.58 | CO2 intensity (kg per kg of oil equivalent energy use) |
| EN.ATM.CO2E.GF.KT | 0 | 0 | 319 | 33453 | 15183 | 1498556 | 10 | 4.61 | CO2 emissions from gaseous fuel consumption (kt) |
| EN.ATM.CO2E.GF.ZS | 0.0 | 0.0 | 5.2 | 17.4 | 29.1 | 207.4 | 26 | 11.98 | CO2 emissions from gaseous fuel consumption (% of total) |
| EN.ATM.CO2E.KD.GD | 0.037 | 0.236 | 0.361 | 0.455 | 0.569 | 1.603 | 29 | 13.36 | CO2 emissions (kg per 2010 US$ of GDP) |
| EN.ATM.CO2E.KT | 10 | 2275 | 11590 | 175806 | 65985 | 10313460 | 26 | 11.98 | CO2 emissions (kt) |
| EN.ATM.CO2E.LF.KT | 0 | 1082 | 5625 | 50234 | 23115 | 2127054 | 10 | 4.61 | CO2 emissions from liquid fuel consumption (kt) |
| EN.ATM.CO2E.LF.ZS | 0 | 39 | 61 | 61 | 85 | 118 | 26 | 11.98 | CO2 emissions from liquid fuel consumption (% of total) |
| EN.ATM.CO2E.PC | 0.03 | 0.76 | 2.54 | 4.19 | 5.92 | 32.42 | 26 | 11.98 | CO2 emissions (metric tons per capita) |
| EN.ATM.CO2E.PP.GD | 0.024 | 0.115 | 0.173 | 0.210 | 0.253 | 0.857 | 31 | 14.29 | CO2 emissions (kg per PPP $ of GDP) |
| EN.ATM.CO2E.PP.GD.KD | 0.02 | 0.12 | 0.18 | 0.22 | 0.26 | 0.88 | 35 | 16.13 | CO2 emissions (kg per 2017 PPP $ of GDP) |
| EN.ATM.CO2E.SF.KT | 0 | 0 | 216 | 67060 | 7028 | 6951653 | 10 | 4.61 | CO2 emissions from solid fuel consumption (kt) |
| EN.ATM.CO2E.SF.ZS | 0.0 | 0.0 | 3.2 | 15.4 | 24.6 | 121.6 | 26 | 11.98 | CO2 emissions from solid fuel consumption (% of total) |
| EN.ATM.GHGO.KT.CE | -364711 | -1388 | 150 | -3671 | 1917 | 98711 | 33 | 15.21 | Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent) |
| EN.ATM.GHGO.ZG | -620 | -79 | 30 | 12820 | 352 | 875606 | 29 | 13.36 | Other greenhouse gas emissions (% change from 1990) |
| EN.ATM.GHGT.KT.CE | 30 | 9280 | 39060 | 240177 | 107065 | 12355240 | 26 | 11.98 | Total greenhouse gas emissions (kt of CO2 equivalent) |
| EN.ATM.GHGT.ZG | -78.0 | -7.5 | 44.2 | 86.8 | 113.8 | 2519.0 | 35 | 16.13 | Total greenhouse gas emissions (% change from 1990) |
| EN.ATM.HFCG.KT.CE | 0 | 0 | 105 | 5663 | 1203 | 300896 | 80 | 36.87 | HFC gas emissions (thousand metric tons of CO2 equivalent) |
| EN.ATM.METH.KT.CE | 0 | 2390 | 9200 | 42798 | 33710 | 1238630 | 26 | 11.98 | Methane emissions (kt of CO2 equivalent) |
| EN.ATM.METH.ZG | -100 | 3 | 26 | 49 | 62 | 2415 | 13 | 5.99 | Methane emissions (% change from 1990) |
| EN.ATM.NOXE.KT.CE | 0 | 765 | 3430 | 15625 | 12180 | 538790 | 26 | 11.98 | Nitrous oxide emissions (thousand metric tons of CO2 equivalent) |
| EN.ATM.NOXE.ZG | -100 | -24 | 17 | 38 | 51 | 2510 | 12 | 5.53 | Nitrous oxide emissions (% change from 1990) |
| EN.ATM.PFCG.KT.CE | 0 | 0 | 0 | 511 | 134 | 20578 | 80 | 36.87 | PFC gas emissions (thousand metric tons of CO2 equivalent) |
| EN.ATM.SF6G.KT.CE | 0 | 0 | 0 | 1186 | 366 | 57054 | 80 | 36.87 | SF6 gas emissions (thousand metric tons of CO2 equivalent) |
| EN.CLC.DRSK.XQ | 1.0 | 2.8 | 3.2 | 3.3 | 3.8 | 4.8 | 134 | 61.75 | Disaster risk reduction progress score (1-5 scale; 5=best) |
| EN.CLC.GHGR.MT.CE | -990.1 | -24.1 | -3.6 | -17.5 | -0.3 | 1329.0 | 155 | 71.43 | GHG net emissions/removals by LUCF (Mt of CO2 equivalent) |
| EN.CLC.MDAT.ZS | 0.00 | 0.02 | 0.25 | 1.17 | 1.27 | 9.23 | 49 | 22.58 | Droughts, floods, extreme temperatures (% of population, average 1990-2009) |
| EN.POP.EL5M.RU.ZS | 0.00 | 0.40 | 0.98 | 3.58 | 3.07 | 48.24 | 38 | 17.51 | Rural population living in areas where elevation is below 5 meters (% of total population) |
| EN.POP.EL5M.UR.ZS | 0.00 | 0.61 | 1.85 | 3.84 | 3.71 | 51.59 | 38 | 17.51 | Urban population living in areas where elevation is below 5 meters (% of total population) |
| EN.POP.EL5M.ZS | 0.0 | 1.2 | 3.5 | 7.4 | 7.6 | 58.5 | 38 | 17.51 | Population living in areas where elevation is below 5 meters (% of total population) |
| EN.URB.MCTY.TL.ZS | 4.2 | 14.4 | 21.6 | 26.3 | 32.2 | 100.0 | 96 | 44.24 | Population in urban agglomerations of more than 1 million (% of total population) |
| ER.H2O.FWTL.K3 | 0.0 | 0.4 | 1.8 | 21.4 | 10.4 | 647.5 | 36 | 16.59 | Annual freshwater withdrawals, total (billion cubic meters) |
| ER.H2O.FWTL.ZS | 0 | 2 | 9 | 122 | 28 | 6420 | 42 | 19.35 | Annual freshwater withdrawals, total (% of internal resources) |
| ER.LND.PTLD.ZS | 0.0 | 6.9 | 15.2 | 16.8 | 23.2 | 54.4 | 5 | 2.30 | Terrestrial protected areas (% of total land area) |
| ER.MRN.PTMR.ZS | 0.00 | 0.11 | 1.04 | 9.16 | 8.34 | 213.43 | 47 | 21.66 | Marine protected areas (% of territorial waters) |
| ER.PTD.TOTL.ZS | 0.0 | 1.8 | 8.5 | 12.9 | 18.3 | 99.5 | 6 | 2.76 | Terrestrial and marine protected areas (% of total territorial area) |
| IC.BUS.EASE.XQ | 1 | 49 | 96 | 96 | 143 | 190 | 28 | 12.90 | Ease of doing business index (1=most business-friendly regulations) |
| IQ.CPA.PUBS.XQ | 1.4 | 2.7 | 3.1 | 3.0 | 3.3 | 4.1 | 130 | 59.91 | CPIA public sector management and institutions cluster average (1=low to 6=high) |
| IS.ROD.PAVE.ZS | 1.8 | 12.3 | 20.6 | 30.8 | 40.6 | 98.0 | 166 | 76.50 | Roads, paved (% of total roads) |
| NV.AGR.TOTL.ZS | 0.02 | 2.23 | 6.83 | 10.49 | 16.10 | 61.29 | 15 | 6.91 | Agriculture, forestry, and fishing, value added (% of GDP) |
| SE.ENR.PRSC.FM.ZS | 0.54 | 0.98 | 1.00 | 0.98 | 1.02 | 1.14 | 23 | 10.60 | School enrollment, primary and secondary (gross), gender parity index (GPI) |
| SE.PRM.CMPT.ZS | 27 | 82 | 95 | 90 | 101 | 129 | 28 | 12.90 | Primary completion rate, total (% of relevant age group) |
| SH.DYN.MORT | 1.7 | 6.6 | 16.6 | 27.6 | 42.4 | 117.2 | 24 | 11.06 | NA |
| SH.MED.CMHW.P3 | 0.00 | 0.08 | 0.24 | 0.43 | 0.51 | 3.65 | 157 | 72.35 | Community health workers (per 1,000 people) |
| SH.STA.MALN.ZS | 0.2 | 2.7 | 6.8 | 10.0 | 15.1 | 39.9 | 67 | 30.88 | Prevalence of underweight, weight for age (% of children under 5) |
| SI.POV.DDAY | 0.0 | 0.3 | 1.7 | 13.8 | 19.0 | 78.8 | 55 | 25.35 | NA |
| SP.POP.GROW | -1.72 | 0.32 | 1.06 | 1.14 | 1.92 | 4.12 | NA | NA | Population growth (annual %) |
| SP.POP.TOTL | 10834 | 786559 | 6624554 | 35617163 | 25687041 | 1402112000 | NA | NA | Population, total |
| SP.URB.GROW | -1.59 | 0.69 | 1.60 | 1.78 | 2.85 | 5.67 | 2 | 0.92 | Urban population growth (annual %) |
| SP.URB.TOTL | 5498 | 458630 | 3899416 | 20154875 | 11327426 | 861289359 | 2 | 0.92 | Urban population |
| SP.URB.TOTL.IN.ZS | 13 | 43 | 62 | 61 | 81 | 100 | 2 | 0.92 | Urban population (% of total population) |
Indicators with the highest proportion of missing values:
| indicator | min | q1 | median | mean | q3 | max | missing | pct_missing | variable_desc |
|---|---|---|---|---|---|---|---|---|---|
| AG.LND.IRIG.AG.ZS | 0.01 | 1.00 | 3.98 | 9.08 | 11.20 | 73.81 | 93 | 42.86 | Agricultural irrigated land (% of total agricultural land) |
| EN.CLC.DRSK.XQ | 1.0 | 2.8 | 3.2 | 3.3 | 3.8 | 4.8 | 134 | 61.75 | Disaster risk reduction progress score (1-5 scale; 5=best) |
| EN.CLC.GHGR.MT.CE | -990.1 | -24.1 | -3.6 | -17.5 | -0.3 | 1329.0 | 155 | 71.43 | GHG net emissions/removals by LUCF (Mt of CO2 equivalent) |
| EN.URB.MCTY.TL.ZS | 4.2 | 14.4 | 21.6 | 26.3 | 32.2 | 100.0 | 96 | 44.24 | Population in urban agglomerations of more than 1 million (% of total population) |
| IQ.CPA.PUBS.XQ | 1.4 | 2.7 | 3.1 | 3.0 | 3.3 | 4.1 | 130 | 59.91 | CPIA public sector management and institutions cluster average (1=low to 6=high) |
| IS.ROD.PAVE.ZS | 1.8 | 12.3 | 20.6 | 30.8 | 40.6 | 98.0 | 166 | 76.50 | Roads, paved (% of total roads) |
| SH.MED.CMHW.P3 | 0.00 | 0.08 | 0.24 | 0.43 | 0.51 | 3.65 | 157 | 72.35 | Community health workers (per 1,000 people) |
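
The columns of the summary tables above can be reproduced for a single indicator with a few lines of code. The following is a minimal sketch in Python, not the code used for the report: the function name `summarize` and the toy column are hypothetical, and the quartile method (`"inclusive"`, linear interpolation between order statistics) is an assumption about how the quartiles were computed.

```python
import statistics


def summarize(values):
    """Summary statistics for one indicator; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    # Quartiles by linear interpolation between order statistics (assumed method).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    n_missing = len(values) - len(present)
    return {
        "min": present[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(present),
        "q3": q3,
        "max": present[-1],
        "missing": n_missing,
        "pct_missing": round(100 * n_missing / len(values), 2),
    }


# Hypothetical toy column: 8 countries, 3 of them without a value.
toy_column = [4.0, None, 1.0, 3.0, None, 2.0, 5.0, None]
summary = summarize(toy_column)
print(summary)
```

With 217 countries, the same formula gives `round(100 * 6 / 217, 2) = 2.76` for an indicator with 6 missing values, which agrees with the `pct_missing` reported for AG.LND.AGRI.K2.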
We noticed that some pairs of variables have very high correlation coefficients. From their descriptions, many of these are clearly derived from other variables. Identifying these highly correlated pairs helps us choose suitable variables for the analysis and drop redundant ones to reduce complexity.
| Var1 | Var1_desc | Var2 | Var2_desc | value |
|---|---|---|---|---|
| EN.ATM.CO2E.PP.GD | CO2 emissions (kg per PPP $ of GDP) | EN.ATM.CO2E.PP.GD.KD | CO2 emissions (kg per 2017 PPP $ of GDP) | 0.9999 |
| EN.ATM.CO2E.KT | CO2 emissions (kt) | EN.ATM.GHGT.KT.CE | Total greenhouse gas emissions (kt of CO2 equivalent) | 0.9974 |
| EN.ATM.CO2E.KT | CO2 emissions (kt) | EN.ATM.SF6G.KT.CE | SF6 gas emissions (thousand metric tons of CO2 equivalent) | 0.9729 |
| EN.ATM.GHGT.KT.CE | Total greenhouse gas emissions (kt of CO2 equivalent) | EN.ATM.SF6G.KT.CE | SF6 gas emissions (thousand metric tons of CO2 equivalent) | 0.9663 |
| EN.ATM.NOXE.KT.CE | Nitrous oxide emissions (thousand metric tons of CO2 equivalent) | SP.URB.TOTL | Urban population | 0.9649 |
| EN.ATM.CO2E.KT | CO2 emissions (kt) | EN.ATM.CO2E.SF.KT | CO2 emissions from solid fuel consumption (kt) | 0.9594 |
| SP.POP.TOTL | Population, total | SP.URB.TOTL | Urban population | 0.9546 |
| EN.ATM.CO2E.SF.KT | CO2 emissions from solid fuel consumption (kt) | EN.ATM.GHGT.KT.CE | Total greenhouse gas emissions (kt of CO2 equivalent) | 0.9522 |
| AG.LND.EL5M.RU.ZS | Rural land area where elevation is below 5 meters (% of total land area) | AG.LND.EL5M.ZS | Land area where elevation is below 5 meters (% of total land area) | 0.9510 |
| EN.ATM.CO2E.LF.KT | CO2 emissions from liquid fuel consumption (kt) | EN.ATM.HFCG.KT.CE | HFC gas emissions (thousand metric tons of CO2 equivalent) | 0.9509 |
| EN.ATM.GHGT.KT.CE | Total greenhouse gas emissions (kt of CO2 equivalent) | EN.ATM.NOXE.KT.CE | Nitrous oxide emissions (thousand metric tons of CO2 equivalent) | 0.9430 |
| EN.ATM.GHGT.KT.CE | Total greenhouse gas emissions (kt of CO2 equivalent) | SP.URB.TOTL | Urban population | 0.9400 |
| EG.ELC.HYRO.ZS | Electricity production from hydroelectric sources (% of total) | EG.ELC.RNEW.ZS | Renewable electricity output (% of total electricity output) | 0.9377 |
| SP.POP.GROW | Population growth (annual %) | SP.URB.GROW | Urban population growth (annual %) | 0.9346 |
| ER.H2O.FWTL.K3 | Annual freshwater withdrawals, total (billion cubic meters) | SP.POP.TOTL | Population, total | 0.9324 |
| EN.ATM.CO2E.KT | CO2 emissions (kt) | SP.URB.TOTL | Urban population | 0.9266 |
| EN.ATM.CO2E.KT | CO2 emissions (kt) | EN.ATM.NOXE.KT.CE | Nitrous oxide emissions (thousand metric tons of CO2 equivalent) | 0.9264 |
| EN.ATM.CO2E.SF.KT | CO2 emissions from solid fuel consumption (kt) | SP.URB.TOTL | Urban population | 0.9156 |
| ER.H2O.FWTL.K3 | Annual freshwater withdrawals, total (billion cubic meters) | SP.URB.TOTL | Urban population | 0.9099 |
| EN.ATM.METH.KT.CE | Methane emissions (kt of CO2 equivalent) | EN.ATM.NOXE.KT.CE | Nitrous oxide emissions (thousand metric tons of CO2 equivalent) | 0.9091 |
| EN.ATM.HFCG.KT.CE | HFC gas emissions (thousand metric tons of CO2 equivalent) | EN.ATM.SF6G.KT.CE | SF6 gas emissions (thousand metric tons of CO2 equivalent) | 0.9056 |
| AG.LND.EL5M.UR.K2 | Urban land area where elevation is below 5 meters (sq. km) | EN.ATM.CO2E.KT | CO2 emissions (kt) | 0.9052 |
| EG.ELC.RNWX.KH | Electricity production from renewable sources, excluding hydroelectric (kWh) | EN.ATM.CO2E.LF.KT | CO2 emissions from liquid fuel consumption (kt) | 0.9032 |
| AG.LND.EL5M.UR.K2 | Urban land area where elevation is below 5 meters (sq. km) | EN.ATM.SF6G.KT.CE | SF6 gas emissions (thousand metric tons of CO2 equivalent) | 0.9021 |
| EN.ATM.GHGT.KT.CE | Total greenhouse gas emissions (kt of CO2 equivalent) | EN.ATM.METH.KT.CE | Methane emissions (kt of CO2 equivalent) | 0.9008 |
We also considered some pre-processing steps:
We used an unsupervised learning method, the weighted-distance kNN outlier algorithm, to detect outliers in the data set. The algorithm is simple: it takes the mean distance from an observation to its k nearest neighbors as a measure of outlyingness. If this measure is large, the observation is far from its neighbors and may lie outside their cluster.
We ran the kNN outlier algorithm separately on the subset of indicators in each category. Note that this method only lists the observations with the largest distances to their k nearest neighbors; it does not conclude that those observations are outliers.
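
The scoring idea described above can be sketched as follows. This is a simplified illustration, not the implementation used in the report: it uses the plain (unweighted) mean Euclidean distance, the value of `k` and the toy points are hypothetical, and any scaling of the indicators before computing distances is left out.

```python
import math


def knn_outlier_scores(points, k=3):
    """Mean Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        distances = sorted(math.dist(p, q)
                           for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical toy data: four points in a tight square and one far away.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_outlier_scores(toy_points, k=2)

# Rank observations from largest to smallest score, as in the result tables.
ranked = sorted(range(len(toy_points)), key=lambda i: scores[i], reverse=True)
for i in ranked:
    print(i, round(scores[i], 3))
```

The isolated point gets by far the largest score, while the four clustered points all score 1.0. As with the tables in this section, a high rank only marks a candidate for further inspection.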
| original_id | country | outlier_score |
|---|---|---|
| 145 | Netherlands | 15.247 |
| 37 | China | 9.288 |
| 166 | Russian Federation | 8.746 |
| 204 | United States | 7.555 |
| 6 | United Arab Emirates | 7.327 |
| 33 | Canada | 6.902 |
| 27 | Brazil | 5.603 |
| 180 | Suriname | 5.467 |
| original_id | country | outlier_score |
|---|---|---|
| 94 | Iceland | 9.821 |
| 204 | United States | 7.079 |
| 37 | China | 6.343 |
| 196 | Trinidad and Tobago | 5.867 |
| 164 | Qatar | 5.024 |
| 40 | Congo, Dem. Rep. | 4.832 |
| 65 | France | 4.448 |
| 54 | Denmark | 4.422 |
| original_id | country | outlier_score |
|---|---|---|
| 37 | China | 22.011 |
| 204 | United States | 16.230 |
| 166 | Russian Federation | 11.003 |
| 145 | Netherlands | 10.763 |
| 210 | Vietnam | 10.298 |
| 49 | Cyprus | 9.982 |
| 90 | India | 9.672 |
| 135 | Mozambique | 7.086 |
| original_id | country | outlier_score |
|---|---|---|
| 182 | Slovenia | 9.891 |
| 58 | Egypt, Arab Rep. | 8.714 |
| 90 | India | 6.353 |
| 37 | China | 5.732 |
| 20 | Bahrain | 5.124 |
| 204 | United States | 4.710 |
| 193 | Turkmenistan | 2.605 |
| 6 | United Arab Emirates | 2.459 |
| original_id | country | outlier_score |
|---|---|---|
| 37 | China | 13.733 |
| 90 | India | 10.434 |
| 204 | United States | 3.503 |
| 88 | Indonesia | 2.247 |
| 27 | Brazil | 2.235 |
| 130 | Malta | 2.054 |
| 123 | Moldova | 1.997 |
| 143 | Nigeria | 1.886 |
From the kNN outlier tables above, in the AG (Agriculture) category the Netherlands was 'far' from the other countries (a distance score of 15.2, compared with 9.3 and 8.7 for the second and third highest scores). On further investigation, we found that the Netherlands' values for the area and proportion of land where elevation is below 5 meters are significantly higher than those of other countries. In the EN (Environment) category, China and the United States (22 and 16, compared with 11 for the third highest score) were very different from the rest, which could be due to the very high CO2 emissions of these countries. In the SP (Population) category, the top two countries scored 13.7 and 10.4, followed by 3.5 for the third country, a large gap between the top two and the rest. Unsurprisingly, they are China and India, the world's two largest populations.
Outlier detection helps us identify not only potentially erroneous observations but also special ones, if any exist. Both are important in data analysis: the former so that errors can be fixed, the latter so that rare cases can be detected (for example in fraud, crime, or medicine).
In Part 2, we noticed high correlations among the five variables in the SP (Population) category. The main role of PCA is to remove highly linearly correlated variables, thereby reducing the number of features while retaining most of the information in the data, and making it possible to visualize the data in one, two or three dimensions. In this part, we applied PCA to the five indicators of the SP category.
| indicator | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| SP.POP.GROW | 0.62435 | -0.04174 | 0.390287 | 0.67393 | 0.04400 |
| SP.POP.TOTL | 0.02283 | 0.70644 | -0.007256 | 0.07275 | -0.70362 |
| SP.URB.GROW | 0.65156 | 0.02199 | 0.217084 | -0.72575 | -0.03406 |
| SP.URB.TOTL | -0.02054 | 0.70581 | 0.053320 | -0.01423 | 0.70595 |
| SP.URB.TOTL.IN.ZS | -0.42979 | -0.02350 | 0.893115 | -0.11666 | -0.05881 |

The proportion of the variance explained (PVE) by each component:

| component | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| PVE | 0.441708 | 0.391837 | 0.147963 | 0.010086 | 0.008406 |
Starting from 5 variables, we projected onto 5 components (PC1-PC5); the first two components alone explain about 83% of the variance. A benefit of PCA is that it reduces the number of features, so the data can be visualized in a 2D or 3D space. The figure, a 2D plot of the data approximated by the first two principal components, shows some outliers, which were also detected by kNN in Part 3.1 (red circles). However, country id 130 (Malta), flagged as an outlier by the unsupervised method, falls inside the main cluster. This may mean that the precision of the outlier detection was not high enough, or that the projection onto a 2D space lost some information that would be visible in the full 5-dimensional space.
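
The share of variance retained by the leading components is simply the running sum of the PVE values reported above. A short check in Python, using only the numbers printed in this section (the variable names are illustrative):

```python
from itertools import accumulate

# PVE values for PC1..PC5 as reported in this section.
pve = [0.441708, 0.391837, 0.147963, 0.010086, 0.008406]

# Running total: variance explained by the first 1, 2, ..., 5 components.
cumulative = list(accumulate(pve))

for i, (p, c) in enumerate(zip(pve, cumulative), start=1):
    print(f"PC{i}: PVE = {p:.3f}, cumulative = {c:.3f}")
```

The first two components together account for 0.8335 of the variance, that is, roughly 83%, and the first three for about 98%.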
We also fitted linear regression models to predict cereal yield from the agricultural indicators, and the proportion of the population with access to electricity from the electricity indicators. However, the models were not significant, so we did not report them; the corresponding code chunks are included but hidden.
Top ten countries by agricultural land and by forest area (areas in sq. km):
| country | Agricultural_Land | Percent_Land |
|---|---|---|
| China | 5285287 | 56.08 |
| United States | 4058104 | 44.36 |
| Australia | 3588950 | 46.66 |
| Brazil | 2368788 | 28.34 |
| Kazakhstan | 2160365 | 80.02 |
| Russian Federation | 2154940 | 13.16 |
| India | 1796740 | 60.43 |
| Saudi Arabia | 1736290 | 80.77 |
| Argentina | 1487680 | 54.36 |
| Mongolia | 1134330 | 72.84 |
| country | Forest_Area | Percent_Forest |
|---|---|---|
| Russian Federation | 8153116 | 49.78 |
| Brazil | 4966196 | 59.42 |
| Canada | 3469281 | 38.70 |
| United States | 3097950 | 33.87 |
| China | 2199782 | 23.34 |
| Australia | 1340051 | 17.42 |
| Congo, Dem. Rep. | 1261552 | 55.65 |
| Indonesia | 921332 | 49.07 |
| Peru | 723304 | 56.51 |
| India | 721600 | 24.27 |
Producing electricity from renewable sources is one of the best ways to reduce greenhouse gas emissions. The table below lists the top countries by electricity production from renewable sources, excluding hydroelectric, in gigawatt-hours (1 GWh = 1,000,000 kWh):
| country | Renewable_Production | Percent_Renewable |
|---|---|---|
| United States | 317421 | 7.387 |
| China | 283851 | 4.857 |
| Germany | 168389 | 26.271 |
| Japan | 80292 | 7.756 |
| United Kingdom | 77262 | 22.970 |
| India | 74143 | 5.361 |
| Brazil | 70487 | 12.118 |
| Spain | 68948 | 24.820 |
| Italy | 63368 | 22.506 |
| Canada | 42037 | 6.267 |
Carbon dioxide (CO2) emissions (kt)
| region | Emissions |
|---|---|
| China | 10313460 |
| USA | 4981300 |
| India | 2434520 |
| Russia | 1607550 |
| Japan | 1106150 |
| Germany | 709540 |
| South Korea | 630870 |
| Iran | 629290 |
| Indonesia | 583110 |
| Canada | 574400 |
| Saudi Arabia | 514600 |
| Mexico | 472140 |
| South Africa | 433250 |
| Brazil | 427710 |
| Turkey | 412970 |
| Australia | 386620 |
| UK | 358800 |
| Italy | 324850 |
The table and map above show the 18 countries that together "contribute" 80.11% of the world's carbon dioxide (CO2) emissions. Some of the top countries may not be labeled on the map because of overlapping labels (South Korea is in the list but is not shown on the plot because it overlaps with China and Japan). Sadly, Australia is on the list too.
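
One way to arrive at such a list is to sort countries by emissions and accumulate their shares until a target share of the world total is reached. The sketch below illustrates that idea under stated assumptions: the 80% cut-off rule, the function name `top_emitters`, and the toy figures are hypothetical and are not taken from the report's code or data.

```python
def top_emitters(emissions, share=0.80):
    """Smallest set of the largest emitters whose combined share reaches `share`."""
    total = sum(emissions.values())
    running = 0.0
    selected = []
    # Walk through countries from the largest emitter downwards.
    for country, value in sorted(emissions.items(),
                                 key=lambda kv: kv[1], reverse=True):
        selected.append(country)
        running += value
        if running / total >= share:
            break
    return selected, running / total


# Hypothetical toy emissions (arbitrary units), not the World Bank figures.
toy_emissions = {"A": 50.0, "B": 25.0, "C": 10.0, "D": 8.0, "E": 7.0}
selected, reached = top_emitters(toy_emissions)
print(selected, round(reached, 4))
```

In the toy example three of the five countries are enough to pass the 80% mark.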
For this part, we considered the indicator EN.ATM.GHGT.ZG, total greenhouse gas emissions (% change from 1990). We coded it as a binary class: 1 if emissions increased (positive change) and 0 if they decreased (negative change). We then used 10-fold cross-validation to fit support vector machines that classify this class variable under various settings of the cost (C) and gamma parameters, and picked the best model from the cross-validation. As predictors we picked the 9 indicators most highly correlated with EN.ATM.GHGT.ZG. From the 134 observations with no missing values on all 9 predictors, we randomly took 10 observations for testing and used the remaining 124 for training. Finally, we applied the best model to the test set. Since the test set had only 10 observations, we measured only the test accuracy and did not produce a confusion matrix or other assessment metrics. 9 out of 10 observations in the test set were predicted correctly.
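
To make the set-up concrete, the sketch below shows the class coding, the radial kernel, the size of the parameter grid, and how 124 training observations split into 10 folds. It is an illustration in Python rather than the R code behind the report: the function names are hypothetical, the treatment of an exactly zero change (coded 0 here) is an assumption, and the cost and gamma values are copied from the tuning call printed in the output.

```python
import math
from itertools import product


def code_class(pct_change):
    """1 if emissions rose since 1990 (positive % change), otherwise 0."""
    return 1 if pct_change > 0 else 0


def rbf_kernel(u, v, gamma):
    """Radial kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


# Candidate values taken from the tuning call shown in the output.
costs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
gammas = [0.001, 0.01, 0.1, 1, 2, 10]
grid = list(product(costs, gammas))


def fold_sizes(n_obs, n_folds=10):
    """Sizes of the validation folds when n_obs rows are split as evenly as possible."""
    base, extra = divmod(n_obs, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(124)
print(len(grid), "parameter combinations")
print(sizes, "fold sizes")
print(len(grid) * len(sizes), "model fits in total")
```

With 7 cost values and 6 gamma values there are 42 combinations, so a 10-fold search fits 420 models, each validated on 12 or 13 held-out observations.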
## [1] "Selected Indicators "
## [1] "EN.ATM.GHGT.ZG" "EN.ATM.METH.ZG" "SE.PRM.CMPT.ZS" "SP.URB.GROW"
## [5] "SP.POP.GROW" "SI.POV.DDAY" "SH.DYN.MORT" "IC.BUS.EASE.XQ"
## [9] "EN.CLC.MDAT.ZS" "EG.ELC.ACCS.ZS"
## [1] "Best model:"
##
## Call:
## best.tune(method = svm, train.x = class ~ ., data = train_data, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 10, 100, 1000), gamma = c(0.001, 0.01, 0.1, 1,
## 2, 10)), kernel = "radial", probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 100
##
## Number of Support Vectors: 65
## [1] "Proabilities of prediction for Test set:"
## decrease increase
## 154 0.20229 0.7977
## 85 0.55504 0.4450
## 128 0.73408 0.2659
## 23 0.66323 0.3368
## 194 0.02430 0.9757
## 3 0.34140 0.6586
## 151 0.09499 0.9050
## 80 0.19109 0.8089
## 7 0.47356 0.5264
## 160 0.41263 0.5874
## [1] "Test accuracy: "
## [1] 0.9
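
The class probabilities printed above can be turned into predicted labels by picking the more probable class. The snippet below does this for the ten test observations. It is only a sketch: the 0.5 threshold is an assumption (an SVM implementation may assign labels from decision values rather than from the estimated probabilities), and because the true classes of the test observations are not listed in this report, the 0.9 accuracy cannot be recomputed here.

```python
# Estimated probability of "increase" for each test observation,
# copied from the output above (keys are the observation ids).
p_increase = {
    154: 0.7977, 85: 0.4450, 128: 0.2659, 23: 0.3368, 194: 0.9757,
    3: 0.6586, 151: 0.9050, 80: 0.8089, 7: 0.5264, 160: 0.5874,
}

# Assumed rule: predict the class with the higher probability.
predicted = {obs: "increase" if p > 0.5 else "decrease"
             for obs, p in p_increase.items()}

n_increase = sum(label == "increase" for label in predicted.values())
print(predicted)
print(n_increase, "of", len(predicted), "predicted as increase")
```

Under this rule, seven of the ten test observations are predicted as "increase"; observations 85 and 7 are the closest to the threshold.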
From this study, we found that outlier detection helps us identify "special" observations, if any exist. PCA gave us a simple way to reduce complexity while retaining most of the information in the data and making it possible to visualize. The SVM was able to predict an increase or decrease in total emissions from the predictors. Some exploratory results about the top countries (on both the good and the bad side) also raise concern. On the good side, more countries are producing electricity from renewable sources and the overall proportion is rising (which means a reduction in coal sources, which produce large amounts of greenhouse gas emissions). On the bad side, the top 18 countries account for more than 80% of the world's total emissions. These countries should take responsibility and act first to keep our planet greener (or at least not yellower, as shown on the map in Part 3).
There are still plenty of facts that we did not include in this report because of limited data sources and time. The data used in this report are just a small part of the large body of data related to climate change. To gain deeper insights, we would need more time and larger data sources, and ideally longer time series so that we can investigate how the climate changes over time.
World Development Indicators. Retrieved from https://databank.worldbank.org/source/world-development-indicators