The Happy Planet Index (HPI) is an index of human well-being and environmental impact created by Nic Marks. It tells us how well nations are doing at achieving long, happy, sustainable lives. The index combines four elements to show how efficiently residents of different countries are using environmental resources to lead long, happy lives. I downloaded the 2016 dataset from the Happy Planet Index website.
My goal is to find correlations between several variables, then use clustering techniques to separate these 140 countries into clusters according to wellbeing, wealth (GDP), life expectancy and carbon emissions.
First, load the packages needed for this analysis.
library(tidyverse)
library(plotly)
library(stringr)
library(cluster)
library(FactoMineR)
library(factoextra)
library(reshape2)
library(ggthemes)
library(NbClust)
library(readxl)
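I imported the data with read_xlsx() from the readxl package and dropped the rows that do not hold country data (rows 1 to 4 and 145 to 161 of the imported table), which leaves the 140 countries used in our data visualization and analysis.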
hpi <- read_xlsx("hpi-data-2016.xlsx", sheet = 5, col_names = TRUE)
hpi <- hpi[-c(1:4, 145:161), ]
I renamed the columns (variables) to make the analysis easier.
names(hpi) <- c("Rank", "Country", "Region", "Avg.Life.Expectancy", "Avg.Wellbeing", "Happy.Life.Years", "Footprint.gha", "Inequality", "Inequality.LE", "Inequality.W", "HPI", "GDP", "Population", "GINI.Index")
head(hpi)
## # A tibble: 6 x 14
## Rank Country Region Avg.Life.Expect… Avg.Wellbeing Happy.Life.Years
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 110 Afghan… Middl… 59.667999999999… 3.8 12.396023808740…
## 2 13 Albania Post-… 77.346999999999… 5.5 34.414736010872…
## 3 30 Algeria Middl… 74.313000000000… 5.6 30.469461311230…
## 4 19 Argent… Ameri… 75.927000000000… 6.5 40.166673874579…
## 5 73 Armenia Post-… 74.445999999999… 4.3 24.018760060702…
## 6 105 Austra… Asia … 82.052000000000… 7.2 53.069497709526…
## # ... with 8 more variables: Footprint.gha <chr>, Inequality <chr>,
## # Inequality.LE <chr>, Inequality.W <chr>, HPI <chr>, GDP <chr>,
## # Population <chr>, GINI.Index <chr>
Every column was read in as character (chr), as we can see above, so I convert each variable to its proper class.
hpi$Rank <- as.integer(hpi$Rank)
hpi$Region <- as.factor(hpi$Region)
hpi$Avg.Life.Expectancy <- as.numeric(hpi$Avg.Life.Expectancy)
hpi$Avg.Wellbeing <- as.numeric(hpi$Avg.Wellbeing)
hpi$Happy.Life.Years <- as.numeric(hpi$Happy.Life.Years)
hpi$Footprint.gha <- as.numeric(hpi$Footprint.gha)
hpi$Inequality <- as.numeric(hpi$Inequality)
hpi$Inequality.LE <- as.numeric(hpi$Inequality.LE)
hpi$Inequality.W <- as.numeric(hpi$Inequality.W)
hpi$HPI <- as.numeric(hpi$HPI)
hpi$GDP <- as.numeric(hpi$GDP)
hpi$Population <- as.numeric(hpi$Population)
glimpse(hpi)
## Observations: 140
## Variables: 14
## $ Rank <int> 110, 13, 30, 19, 73, 105, 43, 8, 102, 87, ...
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Arge...
## $ Region <fct> Middle East and North Africa, Post-communi...
## $ Avg.Life.Expectancy <dbl> 59.668, 77.347, 74.313, 75.927, 74.446, 82...
## $ Avg.Wellbeing <dbl> 3.800000, 5.500000, 5.600000, 6.500000, 4....
## $ Happy.Life.Years <dbl> 12.396024, 34.414736, 30.469461, 40.166674...
## $ Footprint.gha <dbl> 0.79000, 2.21000, 2.12000, 3.14000, 2.2300...
## $ Inequality <dbl> 0.42655744, 0.16513372, 0.24486175, 0.1642...
## $ Inequality.LE <dbl> 38.34882, 69.67116, 60.47454, 68.34958, 66...
## $ Inequality.W <dbl> 3.390494, 5.097650, 5.196449, 6.034707, 3....
## $ HPI <dbl> 20.22535, 36.76687, 33.30054, 35.19024, 25...
## $ GDP <dbl> 690.8426, 4247.4854, 5583.6162, 14357.4116...
## $ Population <dbl> 29726803, 2900489, 37439427, 42095224, 297...
## $ GINI.Index <chr> "Data unavailable", "28.96", "Data unavail...
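The conversions above can also be written more compactly. The following is a minimal sketch on a made-up three-row data frame (the values and the `num_cols` helper vector are illustrative assumptions, not part of the original analysis). It also shows why GINI.Index is still chr in the output above: its "Data unavailable" entries would become NA once converted.

```r
# Toy stand-in for the imported sheet: every column arrives as character.
toy <- data.frame(
  Rank       = c("110", "13", "30"),
  GDP        = c("690.8", "4247.5", "5583.6"),
  GINI.Index = c("Data unavailable", "28.96", "Data unavailable"),
  stringsAsFactors = FALSE
)

# Convert several columns in one step instead of one line per column.
num_cols <- c("GDP")                       # hypothetical helper vector
toy[num_cols] <- lapply(toy[num_cols], as.numeric)
toy$Rank <- as.integer(toy$Rank)

# Non-numeric placeholders turn into NA (as.numeric() warns about this).
toy$GINI.Index <- suppressWarnings(as.numeric(toy$GINI.Index))

str(toy)
```

With the real data, `num_cols` would simply list the ten numeric column names.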
Let’s see the statistics for this dataset, with summary().
summary(hpi[, 3:12])
## Region Avg.Life.Expectancy Avg.Wellbeing
## Americas :25 Min. :48.91 Min. :2.867
## Asia Pacific :21 1st Qu.:65.04 1st Qu.:4.575
## Europe :20 Median :73.50 Median :5.250
## Middle East and North Africa:14 Mean :70.93 Mean :5.408
## Post-communist :26 3rd Qu.:77.02 3rd Qu.:6.225
## Sub Saharan Africa :34 Max. :83.57 Max. :7.800
## Happy.Life.Years Footprint.gha Inequality Inequality.LE
## Min. : 8.97 Min. : 0.610 Min. :0.04322 Min. :27.32
## 1st Qu.:18.69 1st Qu.: 1.425 1st Qu.:0.13353 1st Qu.:48.21
## Median :29.40 Median : 2.680 Median :0.21174 Median :63.41
## Mean :30.25 Mean : 3.258 Mean :0.23291 Mean :60.34
## 3rd Qu.:39.71 3rd Qu.: 4.482 3rd Qu.:0.32932 3rd Qu.:72.57
## Max. :59.32 Max. :15.820 Max. :0.50734 Max. :81.26
## Inequality.W HPI GDP
## Min. :2.421 Min. :12.78 Min. : 244.2
## 1st Qu.:4.047 1st Qu.:21.21 1st Qu.: 1628.1
## Median :4.816 Median :26.29 Median : 5691.1
## Mean :4.973 Mean :26.41 Mean : 13911.1
## 3rd Qu.:5.704 3rd Qu.:31.54 3rd Qu.: 15159.1
## Max. :7.625 Max. :44.71 Max. :105447.1
Let’s compare GDP and average life expectancy and see whether we can find a connection.
ggplot(hpi, aes(x=GDP, y=Avg.Life.Expectancy)) +
geom_point(aes(size=Population, color=Region)) +
coord_trans(x = 'log10') +
geom_smooth(method = 'loess') +
ggtitle('Life Expectancy and GDP per Capita in USD') +
theme_classic() +
theme(legend.justification = "left", legend.title = element_text(face = "bold")) +
ylim(40, 90) +
labs(x = "GDP in USD (with log transformation)",
y = "Average Life Expectancy (year)")
After the log transformation, the relationship between GDP per capita and life expectancy is clearer and looks relatively strong. The two variables move together. The Pearson correlation between them, computed below on the untransformed GDP values, is reasonably high at approximately 0.62.
cor.test(hpi$GDP, hpi$Avg.Life.Expectancy)
##
## Pearson's product-moment correlation
##
## data: hpi$GDP and hpi$Avg.Life.Expectancy
## t = 9.3042, df = 138, p-value = 2.766e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5072215 0.7133067
## sample estimates:
## cor
## 0.6208781
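Note that the plot uses a log axis while cor.test() above works on raw GDP. A minimal sketch of how the choice matters, using made-up numbers (not taken from the HPI dataset): Pearson correlation changes when GDP is log-transformed, whereas the rank-based Spearman correlation does not.

```r
# Toy data, invented for illustration only.
gdp  <- c(300, 800, 2000, 5000, 12000, 30000, 60000, 100000)
life <- c(52, 58, 64, 69, 74, 78, 81, 83)

r_raw <- cor(gdp, life)                          # Pearson on raw GDP
r_log <- cor(log10(gdp), life)                   # Pearson on log10(GDP)

# Spearman depends only on ranks, so a monotone transform leaves it unchanged.
rho_raw <- cor(gdp, life, method = "spearman")
rho_log <- cor(log10(gdp), life, method = "spearman")

c(pearson_raw = r_raw, pearson_log = r_log,
  spearman_raw = rho_raw, spearman_log = rho_log)
```

On the real data the same comparison would be `cor.test(log10(hpi$GDP), hpi$Avg.Life.Expectancy)`.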
Let’s compare average life expectancy and the Happy Planet Index score and see whether we can find a connection.
ggplot(hpi, aes(x=Avg.Life.Expectancy, y=HPI)) +
geom_point(aes(size=Population, color=Region)) +
geom_smooth(method = 'loess') +
ggtitle('Average Life Expectancy and Happy Planet Index Score') +
theme_classic() +
theme(legend.justification = "left", legend.title = element_text(face = "bold")) +
ylim(0, 50) +
labs(x = "Average Life Expectancy (year)",
y = "Happy Planet Index Score")
Many countries in Europe and the Americas end up with middle-to-low HPI scores despite long life expectancy, probably because of their large carbon footprints.
cor.test(hpi$Avg.Life.Expectancy, hpi$HPI)
##
## Pearson's product-moment correlation
##
## data: hpi$Avg.Life.Expectancy and hpi$HPI
## t = 7.5519, df = 138, p-value = 5.314e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4118020 0.6484859
## sample estimates:
## cor
## 0.5407609
Let’s compare GDP and the Happy Planet Index score and see whether we can find a connection.
ggplot(hpi, aes(x=GDP, y=HPI)) +
geom_point(aes(size=Population, color=Region)) +
coord_trans(x = 'log10') +
geom_smooth(method = 'loess') +
ggtitle('GDP in USD (with log transformation) and Happy Planet Index Score') +
theme_classic() +
theme(legend.justification = "left", legend.title = element_text(face = "bold")) +
ylim(0, 50) +
labs(x = "GDP",
y = "Happy Planet Index Score")
Apparently, money (GDP) can’t buy happiness. The correlation between GDP and the Happy Planet Index score is indeed very low, at about 0.11, and it is not statistically significant (p = 0.179).
cor.test(hpi$GDP, hpi$HPI)
##
## Pearson's product-moment correlation
##
## data: hpi$GDP and hpi$HPI
## t = 1.3507, df = 138, p-value = 0.179
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05267424 0.27492060
## sample estimates:
## cor
## 0.1142272
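The confidence interval in this output straddles zero, which is another way of reading the non-significant result. As a sketch, the interval can be reproduced by hand with the Fisher z-transformation, using only the correlation and sample size printed above (typed in here rather than read from the data):

```r
r <- 0.1142272   # sample correlation from the output above
n <- 140         # number of countries (df = 138 means n = 140)

z  <- atanh(r)               # Fisher z-transform of r
se <- 1 / sqrt(n - 3)        # approximate standard error of z
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)   # back-transform the limits

ci                          # lower and upper 95% limits
ci[1] < 0 && ci[2] > 0      # TRUE when the interval contains zero
```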
An important step for meaningful clustering is to standardize the variables so that each has mean zero and standard deviation one. Otherwise variables on large scales, such as GDP and population, would dominate the distance calculations.
hpi_scale <- hpi[, 4:13]
hpi_scale <- scale(hpi_scale)
summary(hpi_scale)
## Avg.Life.Expectancy Avg.Wellbeing Happy.Life.Years
## Min. :-2.5153 Min. :-2.2128 Min. :-1.60493
## 1st Qu.:-0.6729 1st Qu.:-0.7252 1st Qu.:-0.87191
## Median : 0.2939 Median :-0.1374 Median :-0.06378
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6968 3rd Qu.: 0.7116 3rd Qu.: 0.71388
## Max. : 1.4449 Max. : 2.0831 Max. : 2.19247
## Footprint.gha Inequality Inequality.LE Inequality.W
## Min. :-1.1493 Min. :-1.5692 Min. :-2.2192 Min. :-2.1491
## 1st Qu.:-0.7955 1st Qu.:-0.8222 1st Qu.:-0.8152 1st Qu.:-0.7795
## Median :-0.2507 Median :-0.1751 Median : 0.2060 Median :-0.1317
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5317 3rd Qu.: 0.7976 3rd Qu.: 0.8221 3rd Qu.: 0.6162
## Max. : 5.4532 Max. : 2.2702 Max. : 1.4059 Max. : 2.2339
## HPI GDP Population
## Min. :-1.86308 Min. :-0.6921 Min. :-0.2990
## 1st Qu.:-0.71120 1st Qu.:-0.6220 1st Qu.:-0.2740
## Median :-0.01653 Median :-0.4163 Median :-0.2339
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.70106 3rd Qu.: 0.0632 3rd Qu.:-0.0913
## Max. : 2.50110 Max. : 4.6356 Max. : 8.1562
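For readers who want to see what scale() does, here is a small sketch on invented values (the matrix `x` is an assumption for illustration): each column is centered on its mean and divided by its standard deviation.

```r
# Toy matrix with two variables on very different scales.
x <- cbind(gdp       = c(500, 4000, 15000, 45000),
           wellbeing = c(4.1, 5.3, 6.2, 7.4))

z <- scale(x)   # center and scale each column

# The same z-scores computed by hand.
centered <- sweep(x, 2, colMeans(x))                 # subtract column means
manual   <- sweep(centered, 2, apply(x, 2, sd), "/") # divide by column SDs

all.equal(z, manual, check.attributes = FALSE)
round(colMeans(z), 10)   # column means after scaling
apply(z, 2, sd)          # column standard deviations after scaling
```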
This heatmap shows how strongly each variable is correlated with every other variable.
qplot(x=Var1, y=Var2, data=melt(cor(hpi_scale, use="p")), fill=value, geom="tile") +
scale_fill_gradient2(limits=c(-1, 1)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title="Heatmap of Correlation Matrix",
x=NULL, y=NULL)
PCA is a procedure for identifying a smaller number of uncorrelated variables, called principal components, from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the minimum number of principal components.
hpi.pca <- PCA(hpi_scale, graph=FALSE)
print(hpi.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 140 individuals, described by 10 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
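To show where the eigenvalues of a PCA come from, here is a hedged sketch in base R. It uses the built-in USArrests data and prcomp() instead of the HPI data and FactoMineR, so the numbers are unrelated to this analysis; the point is only that a PCA on standardized variables is an eigen-decomposition of the correlation matrix.

```r
X <- USArrests   # built-in data set, used purely for illustration

# Eigenvalues of the correlation matrix ...
eig <- eigen(cor(X))$values

# ... equal the variances of the principal components of the scaled data.
pca <- prcomp(X, scale. = TRUE)
all.equal(eig, pca$sdev^2)

# With standardized variables the eigenvalues sum to the number of variables,
# so each eigenvalue divided by that sum is a share of the total variance.
sum(eig)
pct <- 100 * eig / sum(eig)
round(cumsum(pct), 2)
```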
eigenvalues <- hpi.pca$eig
head(eigenvalues)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 6.66741533 66.6741533 66.67415
## comp 2 1.31161290 13.1161290 79.79028
## comp 3 0.97036077 9.7036077 89.49389
## comp 4 0.70128270 7.0128270 96.50672
## comp 5 0.24150648 2.4150648 98.92178
## comp 6 0.05229306 0.5229306 99.44471
The proportion of variation retained by each principal component is extracted above. An eigenvalue is the amount of variation retained by a principal component (PC), and the first PC always captures the largest share of the variation in the data set. In this case, the first two principal components are worth keeping because they are the only ones with eigenvalues greater than one, a commonly used retention criterion proposed by Kaiser (1960).
fviz_screeplot(hpi.pca, addlabels = TRUE, ylim = c(0, 70)) +
theme_classic()
The scree plot shows us which components explain most of the variability in the data. In this case, almost 80% of the variance in the data is retained by the first two principal components.
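Both statements can be checked directly from the eigenvalue table. A minimal sketch, with the six eigenvalues typed in by hand from the head(eigenvalues) output (the remaining four components are not shown there and are left out):

```r
# First six eigenvalues, copied from the printed output.
eig <- c(6.66741533, 1.31161290, 0.97036077,
         0.70128270, 0.24150648, 0.05229306)
n_vars <- 10   # ten standardized variables, so the eigenvalues sum to 10

# Kaiser criterion: keep components whose eigenvalue exceeds one.
keep <- which(eig > 1)

# Share of variance per component and cumulative share.
pct     <- 100 * eig / n_vars
cum_pct <- cumsum(pct)

keep
round(cum_pct, 2)
```

The third component, with an eigenvalue of about 0.97, only just misses the cutoff, so the rule is best treated as a guideline rather than a hard threshold.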
head(hpi.pca$var$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4
## Avg.Life.Expectancy 12.275001 2.29815687 0.002516184 18.4965447
## Avg.Wellbeing 12.318469 0.07472989 0.198445432 22.1593907
## Happy.Life.Years 14.793710 0.01288175 0.027105103 0.7180341
## Footprint.gha 9.021277 24.71161977 2.982449522 0.4891428
## Inequality 13.363651 0.30494623 0.010038818 9.7957329
## Inequality.LE 12.677892 0.95800977 0.001525891 19.5589843
## Dim.5
## Avg.Life.Expectancy 0.31797242
## Avg.Wellbeing 6.37614629
## Happy.Life.Years 0.03254368
## Footprint.gha 7.62967135
## Inequality 2.97699333
## Inequality.LE 0.07545215
Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set.
The contributions of the variables are extracted above: the larger the value, the more the variable contributes to the component.
fviz_pca_var(hpi.pca, col.var="contrib",
gradient.cols = c("red", "yellow", "blue"),
repel = TRUE
)
This highlights the most important variables in explaining the variations retained by the principal components.
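As a rough illustration of what a contribution is, the sketch below computes contributions in base R with prcomp() on the built-in USArrests data (an assumption made for the sake of a self-contained example, not the HPI data or the FactoMineR object). A variable's contribution to a component is its squared loading expressed as a percentage, so the contributions to each component add up to 100.

```r
pca <- prcomp(USArrests, scale. = TRUE)   # illustrative data set only

# Loadings are unit-length vectors, so squared loadings sum to 1 per component.
contrib <- 100 * pca$rotation^2

round(contrib, 2)
colSums(contrib)   # each component's contributions total 100

# Variable contributing most to the first component.
rownames(contrib)[which.max(contrib[, "PC1"])]
```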
When using clustering algorithms, the analyst must specify k. I use the following method to help find the best k.
The NbClust package provides 30 indices for determining the number of clusters and proposes the best clustering scheme from the results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
number <- NbClust(hpi_scale,
distance="euclidean",
min.nc=2,
max.nc=15, # By default, max.nc=15
method='ward.D',
index='all',
alphaBeale = 0.1)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 5 proposed 6 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
I will apply K = 3 in the following steps.
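The majority rule in the NbClust output is a simple vote count. A minimal sketch that re-enters the tallies printed above by hand (the `votes` vector is typed in from that output, not extracted from the `number` object):

```r
# Number of indices voting for each candidate k, copied from the output.
votes <- c("2" = 4, "3" = 7, "5" = 1, "6" = 5, "10" = 3, "15" = 3)

# The winning k is the one with the most votes.
best_k <- as.integer(names(votes)[which.max(votes)])

# Share of the counted indices that back the winner.
share <- max(votes) / sum(votes)

best_k
round(share, 2)
```

The winner is backed by fewer than a third of the counted indices, and k = 6 is not far behind, so k = 3 is a reasonable choice rather than a clear-cut one.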
set.seed(2018)
pam <- pam(hpi_scale, diss=FALSE, k = 3, keep.data=TRUE)
The silhouette plot also reports the number of countries assigned to each cluster and the average silhouette width per cluster.
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 43 0.46
## 2 2 66 0.32
## 3 3 31 0.37
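For reference, a silhouette width compares how close an observation is to its own cluster (a) with how close it is to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes it by hand for one point of an invented one-dimensional data set; the function name `silhouette_width` is a hypothetical helper, not part of the cluster package, and it does not handle single-member clusters.

```r
# Silhouette width of observation i, given data x and cluster labels.
silhouette_width <- function(i, x, cluster) {
  d <- as.matrix(dist(x))[i, ]          # distances from i to every point
  own <- cluster == cluster[i]
  own[i] <- FALSE                       # exclude the point itself
  a <- mean(d[own])                     # mean distance within own cluster
  others <- setdiff(unique(cluster), cluster[i])
  b <- min(sapply(others, function(k) mean(d[cluster == k])))
  (b - a) / max(a, b)
}

x       <- c(0, 1, 2, 10, 11)   # toy values
cluster <- c(1, 1, 1, 2, 2)     # toy cluster labels

s1 <- silhouette_width(1, x, cluster)
s1
```

Values near 1 mean the observation sits firmly inside its cluster; values near 0 mean it lies between two clusters.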
This prints the medoid of each cluster, that is, one typical country that represents the cluster.
hpi$Country[pam$id.med]
## [1] "Liberia" "Romania" "Ireland"
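A medoid is the member of a cluster with the smallest total distance to all other members, which is why it is always a real country rather than an average. A small base R sketch on made-up points (the coordinates and labels are assumptions for illustration):

```r
# Five toy observations forming a single cluster.
pts <- rbind(p1 = c(0, 0),
             p2 = c(2, 0),
             p3 = c(0, 2),
             p4 = c(2, 2),
             e  = c(1, 1.1))

d <- as.matrix(dist(pts))      # pairwise Euclidean distances
total <- rowSums(d)            # total distance from each point to the rest

medoid <- names(which.min(total))
medoid
```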
I use fviz_cluster() which provides ggplot2-based elegant visualization of partitioning methods including kmeans, pam, HCPC (from FactoMineR), etc. Observations are represented by points in the plot, using principal components if ncol(data) > 2. An ellipse is drawn around each cluster.
fviz_cluster(pam, stand = FALSE, geom = "point",
ellipse.type = "norm", ggtheme = theme_classic())
Hmmm... It’s easy to understand, but which country belongs to which cluster? Well, I think we’ll get a better idea if we use a world map to visualize our clusters.
I join the map data from map_data() with our Happy Planet Index dataset using left_join() from the dplyr package, after a little tweaking of “USA” so that the country names match.
hpi['Cluster'] <- as.factor(pam$clustering)
map <- map_data("world")
##
## Attaching package: 'maps'
## The following object is masked from 'package:cluster':
##
## votes.repub
## The following object is masked from 'package:purrr':
##
## map
map$region[map$region == "USA"] <- "United States of America"
map1 <- left_join(map, hpi[, c("Region", "Country", "Avg.Life.Expectancy", "Avg.Wellbeing", "Footprint.gha", "Inequality", "HPI", "GDP", "Population", "Cluster")], by = c("region" = "Country"))
glimpse(map1)
## Observations: 99,338
## Variables: 15
## $ long <dbl> -69.89912, -69.89571, -69.94219, -70.00415...
## $ lat <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12...
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, ...
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14,...
## $ region <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba...
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Region <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, Mi...
## $ Avg.Life.Expectancy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 59...
## $ Avg.Wellbeing <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3....
## $ Footprint.gha <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0....
## $ Inequality <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0....
## $ HPI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20...
## $ GDP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 69...
## $ Population <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 29...
## $ Cluster <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1,...
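Because the join matches on country names, any country spelled differently in the two sources ends up with NA on the map. The sketch below shows one way to list such mismatches and recode them; the two name vectors and the `recode` lookup are invented examples, not the actual contents of the map data or the HPI sheet.

```r
# Illustrative name lists (assumed, not read from the real data).
map_names <- c("USA", "UK", "France", "Ivory Coast")
hpi_names <- c("United States of America", "United Kingdom",
               "France", "Cote d'Ivoire")

# Countries in the HPI list that have no counterpart in the map list.
unmatched_before <- setdiff(hpi_names, map_names)

# Recode map names through a lookup vector (hypothetical entries).
recode <- c("USA" = "United States of America", "UK" = "United Kingdom")
idx <- map_names %in% names(recode)
map_names[idx] <- unname(recode[map_names[idx]])

unmatched_after <- setdiff(hpi_names, map_names)

unmatched_before
unmatched_after
```

On the real objects the same check would be `setdiff(hpi$Country, unique(map$region))`.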
HPI_map <- ggplot(map1) +
geom_polygon(aes(x = long,
y = lat,
group = group,
fill = Cluster,
colour = Cluster)) +
coord_equal() +
labs(title = "Clustering Happy Planet Index",
subtitle = "Based on data from: http://happyplanetindex.org/",
x = NULL,
y = NULL) +
theme_classic() +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(face = "italic"),
legend.position = "bottom",
legend.justification = "center",
legend.title = element_text(face = "bold")
)
HPI_map
C1 <- hpi[hpi$Cluster == 1, ]
C1 <- C1[, c("Country", "Region", "Avg.Life.Expectancy", "Avg.Wellbeing", "Footprint.gha", "Inequality", "HPI", "GDP", "Population")]
C1$Country
## [1] "Afghanistan" "Benin" "Bolivia"
## [4] "Botswana" "Burkina Faso" "Burundi"
## [7] "Cambodia" "Cameroon" "Chad"
## [10] "Comoros" "Cote d'Ivoire" "Djibouti"
## [13] "Ethiopia" "Gabon" "Ghana"
## [16] "Guinea" "Haiti" "India"
## [19] "Kenya" "Lesotho" "Liberia"
## [22] "Malawi" "Mauritania" "Mozambique"
## [25] "Myanmar" "Namibia" "Niger"
## [28] "Nigeria" "Pakistan" "Republic of Congo"
## [31] "Rwanda" "Senegal" "Sierra Leone"
## [34] "South Africa" "Swaziland" "Syria"
## [37] "Tanzania" "Togo" "Turkmenistan"
## [40] "Uganda" "Yemen" "Zambia"
## [43] "Zimbabwe"
summary(C1[, c(2:9)])
## Region Avg.Life.Expectancy Avg.Wellbeing
## Americas : 2 Min. :48.91 Min. :2.867
## Asia Pacific : 4 1st Qu.:56.71 1st Qu.:3.900
## Europe : 0 Median :60.31 Median :4.400
## Middle East and North Africa: 3 Mean :59.88 Mean :4.340
## Post-communist : 1 3rd Qu.:63.44 3rd Qu.:4.850
## Sub Saharan Africa :33 Max. :70.39 Max. :6.000
## Footprint.gha Inequality HPI GDP
## Min. :0.610 Min. :0.2630 Min. :12.78 Min. : 244.2
## 1st Qu.:1.030 1st Qu.:0.3529 1st Qu.:16.52 1st Qu.: 670.2
## Median :1.240 Median :0.3873 Median :19.63 Median : 1158.8
## Mean :1.559 Mean :0.3867 Mean :20.11 Mean : 1917.2
## 3rd Qu.:1.610 3rd Qu.:0.4268 3rd Qu.:22.93 3rd Qu.: 1664.2
## Max. :5.470 Max. :0.5073 Max. :31.50 Max. :10642.4
## Population
## Min. :7.337e+05
## 1st Qu.:5.608e+06
## Median :1.457e+07
## Mean :5.414e+07
## 3rd Qu.:2.564e+07
## Max. :1.264e+09
Cluster 1 consists mostly of Sub-Saharan African countries, along with countries experiencing conflict such as Afghanistan, Syria, and Myanmar.
With low income (average GDP of USD 1,917), low wellbeing scores (average 4.34) and low life expectancy (about 60 years), the countries in this cluster have an average HPI of 20.11, the lowest of the three clusters.
C2 <- hpi[hpi$Cluster == 2, ]
C2 <- C2[, c("Country", "Region", "Avg.Life.Expectancy", "Avg.Wellbeing", "Footprint.gha", "Inequality", "HPI", "GDP", "Population")]
C2$Country
## [1] "Albania" "Algeria"
## [3] "Argentina" "Armenia"
## [5] "Bangladesh" "Belarus"
## [7] "Belize" "Bhutan"
## [9] "Bosnia and Herzegovina" "Brazil"
## [11] "Bulgaria" "China"
## [13] "Colombia" "Croatia"
## [15] "Dominican Republic" "Ecuador"
## [17] "Egypt" "El Salvador"
## [19] "Estonia" "Georgia"
## [21] "Greece" "Guatemala"
## [23] "Honduras" "Hungary"
## [25] "Indonesia" "Iran"
## [27] "Iraq" "Jamaica"
## [29] "Kazakhstan" "Kyrgyzstan"
## [31] "Latvia" "Lebanon"
## [33] "Lithuania" "Macedonia"
## [35] "Malaysia" "Malta"
## [37] "Mauritius" "Mongolia"
## [39] "Montenegro" "Morocco"
## [41] "Nepal" "Nicaragua"
## [43] "Palestine" "Panama"
## [45] "Paraguay" "Peru"
## [47] "Philippines" "Poland"
## [49] "Portugal" "Romania"
## [51] "Russia" "Serbia"
## [53] "Slovakia" "Sri Lanka"
## [55] "Suriname" "Tajikistan"
## [57] "Thailand" "Trinidad and Tobago"
## [59] "Tunisia" "Turkey"
## [61] "Ukraine" "Uruguay"
## [63] "Uzbekistan" "Vanuatu"
## [65] "Venezuela" "Vietnam"
summary(C2[, c(2:9)])
## Region Avg.Life.Expectancy Avg.Wellbeing
## Americas :18 Min. :67.95 Min. :4.200
## Asia Pacific :12 1st Qu.:70.85 1st Qu.:4.800
## Europe : 3 Median :74.03 Median :5.400
## Middle East and North Africa: 9 Mean :73.48 Mean :5.404
## Post-communist :23 3rd Qu.:75.36 3rd Qu.:5.900
## Sub Saharan Africa : 1 Max. :80.50 Max. :7.100
## Footprint.gha Inequality HPI GDP
## Min. :0.720 Min. :0.1018 Min. :14.27 Min. : 685.5
## 1st Qu.:1.890 1st Qu.:0.1652 1st Qu.:25.17 1st Qu.: 3599.3
## Median :2.750 Median :0.1904 Median :29.01 Median : 5942.5
## Mean :3.022 Mean :0.1983 Mean :29.29 Mean : 7698.1
## 3rd Qu.:3.825 3rd Qu.:0.2292 3rd Qu.:34.44 3rd Qu.:11798.9
## Max. :7.920 Max. :0.3111 Max. :40.70 Max. :22242.7
## Population
## Min. :2.475e+05
## 1st Qu.:3.484e+06
## Median :9.692e+06
## Mean :4.958e+07
## 3rd Qu.:3.293e+07
## Max. :1.351e+09
Cluster 2 is dominated by post-communist countries and developing countries in Asia Pacific and the Americas.
The average HPI of the countries in this cluster is 29.29, markedly higher than cluster 1’s average of 20.11.
C3 <- hpi[hpi$Cluster == 3, ]
C3 <- C3[, c("Country", "Region", "Avg.Life.Expectancy", "Avg.Wellbeing", "Footprint.gha", "Inequality", "HPI", "GDP", "Population")]
C3$Country
## [1] "Australia" "Austria"
## [3] "Belgium" "Canada"
## [5] "Chile" "Costa Rica"
## [7] "Cyprus" "Czech Republic"
## [9] "Denmark" "Finland"
## [11] "France" "Germany"
## [13] "Hong Kong" "Iceland"
## [15] "Ireland" "Israel"
## [17] "Italy" "Japan"
## [19] "Luxembourg" "Mexico"
## [21] "Netherlands" "New Zealand"
## [23] "Norway" "Oman"
## [25] "Slovenia" "South Korea"
## [27] "Spain" "Sweden"
## [29] "Switzerland" "United Kingdom"
## [31] "United States of America"
summary(C3[, c(2:9)])
## Region Avg.Life.Expectancy Avg.Wellbeing
## Americas : 5 Min. :76.30 Min. :5.500
## Asia Pacific : 5 1st Qu.:80.17 1st Qu.:6.450
## Europe :17 Median :81.11 Median :7.000
## Middle East and North Africa: 2 Mean :80.80 Mean :6.897
## Post-communist : 2 3rd Qu.:81.89 3rd Qu.:7.400
## Sub Saharan Africa : 0 Max. :83.57 Max. :7.800
## Footprint.gha Inequality HPI GDP
## Min. : 2.840 Min. :0.04322 Min. :13.15 Min. : 9703
## 1st Qu.: 5.000 1st Qu.:0.07118 1st Qu.:24.71 1st Qu.: 28758
## Median : 5.600 Median :0.08537 Median :30.02 Median : 44011
## Mean : 6.114 Mean :0.09316 Mean :29.03 Mean : 43775
## 3rd Qu.: 6.838 3rd Qu.:0.10641 3rd Qu.:31.79 3rd Qu.: 50466
## Max. :15.820 Max. :0.18770 Max. :44.71 Max. :105447
## Population
## Min. : 320716
## 1st Qu.: 4836360
## Median : 9519374
## Mean : 36172836
## 3rd Qu.: 48388748
## Max. :314112078
Cluster 3 is the happiest of them all in terms of wellbeing. The average HPI of the countries in this cluster is 29.03, almost the same as cluster 2, but there are big differences in average life expectancy (73.48 vs 80.80), average wellbeing (5.404 vs 6.897), inequality (19.83% vs 9.32%), and carbon footprint (3.022 gha vs 6.114 gha). The countries in cluster 3 have a footprint per capita roughly twice that of the countries in cluster 2, and that is what pulls their HPI score down to practically the same level as cluster 2.
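To make the comparison concrete, the cluster means reported in the summaries above can be put side by side. A small sketch with the means typed in by hand from those summaries:

```r
# Cluster means copied from the summary() outputs for clusters 2 and 3.
c2 <- c(life = 73.48, wellbeing = 5.404, inequality = 0.1983,
        footprint = 3.022, hpi = 29.29)
c3 <- c(life = 80.80, wellbeing = 6.897, inequality = 0.09316,
        footprint = 6.114, hpi = 29.03)

# Ratio of cluster 3 to cluster 2 for each measure.
ratio <- round(c3 / c2, 2)
ratio
```

Life expectancy and wellbeing are higher in cluster 3 and inequality is lower, yet the footprint ratio is the largest of all, which is consistent with the two clusters ending up at nearly the same HPI.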
I will use ggplotly() from the plotly package to make the visualization interactive, so we can inspect other data for each country, such as life expectancy and wellbeing.
ggplotly(
ggplot(map1, aes(text = paste("Country : ", map1$region, "\n",
"Life Exp : ", floor(map1$Avg.Life.Expectancy), "\n",
"Wellbeing : ", round(map1$Avg.Wellbeing, 2), "\n",
"Footprint : ", round(map1$Footprint.gha, 2), " gha", "\n",
"GDP : USD ", floor(map1$GDP), "\n",
"Population : ", format(map1$Population, big.mark = ","), "\n",
"HPI : ", round(map1$HPI, 2), "\n",
sep = ""))
) +
geom_polygon(aes(x = long,
y = lat,
group = group,
fill = Cluster)) +
coord_equal() +
ggtitle("Clustering Happy Planet Index") +
labs(x = "Longitude",
y = "Latitude"),
tooltip ="text"
)