How Happy is the Planet?

Clustering Analysis with Happy Planet Index

Pavan Singh

2022-06-11

Preliminary Discussion and Background

In this project we shall have a look at the Happy Planet Index. More details about the index can be seen here. The Happy Planet Index measures what matters: sustainable well-being for all. It tells us how well nations are doing at achieving long, happy, sustainable lives. It was developed in 2006 to challenge the idea that countries should focus on continuous economic growth. The Happy Planet Index helps to answer the question: “Is it possible to live good lives without costing the Earth?”.

The Happy Planet Index was brought to fruition by statistician Nic Marks; it essentially tracks national well-being against resource use (“because a happy life doesn’t have to cost the earth”). A detailed breakdown of the index is given in Nic Marks’s TEDx talk, available on YouTube.

This project applies unsupervised learning methods. These are often used as part of an exploratory data analysis; that is, they provide scope for exploring and understanding patterns in the data without known labels or classifications. The methods applied in this analysis include clustering algorithms such as K-means and PAM, as well as Self-Organizing Maps (SOMs). We also apply multivariate techniques such as PCA.


Aim and Project Goal

For this project, we are particularly interested in applying clustering and multivariate techniques to the Happy Planet Index. The primary goal of the project is to explore the data and try to get a better understanding of the relationships between countries using clustering techniques. We shall also look for correlations between several variables, and then use clustering techniques (K-means, PAM) to separate the countries into different clusters according to happiness, wealth, life expectancy and carbon emissions.


Data Description and Cleaning

The data are from 2016 and were downloaded from the Happy Planet Index website. They contain information such as the happiness score, life expectancy, GDP per capita and other variables. An extract of the data is shown below.

library(readxl)  # For reading excel files
library(DT)

#Read in Data (sheet 2 of the HPI workbook)
happy <- read_excel('hpi_data.xlsx', sheet = 2)
happy <- happy[-(1:7),]      # drop the notes/header rows at the top of the sheet
names(happy) <- happy[1,]    # use the next row as column names...
happy <- happy[-1,]          # ...and then drop it
names(happy) <- c('rank', 'country','iso', 'code', 'continent',  'population', 'life_expectancy', 'wellbeing', 'footprint', 'happy', 'biocapacity', 'gdp')
datatable(happy)

Now we look at the structure of our variables and convert them to the correct data types. They are initially stored as character vectors, so we change them to numeric and factor types where applicable. The original structure of the data is shown below.

#Change Data Types
str(happy)
## tibble [152 × 12] (S3: tbl_df/tbl/data.frame)
##  $ rank           : chr [1:152] "1" "2" "3" "4" ...
##  $ country        : chr [1:152] "Costa Rica" "Vanuatu" "Colombia" "Switzerland" ...
##  $ iso            : chr [1:152] "CRI" "VUT" "COL" "CHE" ...
##  $ code           : chr [1:152] "2019CRI" "2019VUT" "2019COL" "2019CHE" ...
##  $ continent      : chr [1:152] "1" "8" "1" "3" ...
##  $ population     : chr [1:152] "5047.5609999999997" "299.88200000000001" "50339.442999999999" "8591.3610000000008" ...
##  $ life_expectancy: chr [1:152] "80.3" "70.5" "77.3" "83.8" ...
##  $ wellbeing      : chr [1:152] "6.9976186752319336" "6.9556203169964634" "6.3502979278564453" "7.694221019744873" ...
##  $ footprint      : chr [1:152] "2.648522265553356" "1.6160936601940632" "1.904750865111537" "4.1425161151167673" ...
##  $ happy          : chr [1:152] "62.057551770146681" "60.363875808250292" "60.165167498468215" "60.104650343177852" ...
##  $ biocapacity    : chr [1:152] "1.56" "1.56" "1.56" "1.56" ...
##  $ gdp            : chr [1:152] "20296.82150273478" "3153.0151677584245" "14624.971296534655" "68390.712985453865" ...

Now we convert each variable to its appropriate data type.

happy$rank <- as.numeric(happy$rank)
happy$country <- as.character(happy$country)
happy$continent <- as.factor(happy$continent)
happy$life_expectancy <- as.numeric(happy$life_expectancy)
happy$wellbeing <- as.numeric(happy$wellbeing)
happy$happy <- as.numeric(happy$happy)
happy$footprint <- as.numeric(happy$footprint)
happy$gdp <- as.numeric(happy$gdp)
happy$population <- as.numeric(happy$population)
happy$biocapacity <- as.numeric(happy$biocapacity)
levels(happy$continent) <- c("Latin America", "North America & Oceania", "Western Europe", "Middle East", "Africa", "South Asia", "Eastern Europe & Central Asia", "East Asia")

Next we check whether any variables have missing observations in this data set.

sapply(happy, function(x) sum(is.na(x))) #number of missing values
##            rank         country             iso            code       continent 
##               0               0               0               0               0 
##      population life_expectancy       wellbeing       footprint           happy 
##               0               0               0               0               0 
##     biocapacity             gdp 
##               0               4

We see that only 4 countries have missing gdp values. For now we proceed without imputing or removing these data points.


Setup

Begin by loading (and, if necessary, installing) the packages required for the study.

# For clustering and multivariate analysis
library(cluster)
library(FactoMineR)
library(factoextra)
library(ggpubr)
library(NbClust)
library(reshape2)      # For melting the correlation matrix
library(kohonen)       # For self-organizing maps
library(plotly)        # For interactive plots
library(dplyr)         # For data manipulation
library(ggplot2)       # For plotting
library(stringr)       # For working with strings
library(ggthemes)      # For extra themes

We begin our analysis with some brief exploration of the data.


Initial EDA

First up, we have a summary of the numerical variables in our data set. These will also be used for multivariate analysis in the next stage.

summary(happy[,6:12]) #summary of some numerical variables
##    population        life_expectancy   wellbeing       footprint      
##  Min.   :    299.9   Min.   :53.30   Min.   :2.375   Min.   : 0.5157  
##  1st Qu.:   4970.4   1st Qu.:67.25   1st Qu.:4.819   1st Qu.: 1.5019  
##  Median :  11521.8   Median :74.65   Median :5.526   Median : 2.5710  
##  Mean   :  49738.8   Mean   :73.06   Mean   :5.532   Mean   : 3.3252  
##  3rd Qu.:  37530.2   3rd Qu.:78.62   3rd Qu.:6.282   3rd Qu.: 4.4461  
##  Max.   :1433783.7   Max.   :84.90   Max.   :7.780   Max.   :15.0376  
##                                                                       
##      happy        biocapacity        gdp          
##  Min.   :24.33   Min.   :1.56   Min.   :   751.7  
##  1st Qu.:39.32   1st Qu.:1.56   1st Qu.:  5140.5  
##  Median :44.70   Median :1.56   Median : 13688.9  
##  Mean   :44.55   Mean   :1.56   Mean   : 22084.3  
##  3rd Qu.:50.87   3rd Qu.:1.56   3rd Qu.: 33645.9  
##  Max.   :62.06   Max.   :1.56   Max.   :114304.0  
##                                 NA's   :4

We see that the happiness score (happy) ranges from just over 24 to just over 62. The mean is 44.55, which is close to countries like Yemen (ranked 77th, with a score of 44.63) and Georgia (ranked 78th, with a score of 44.27). Population (recorded in thousands) has a distinctly large range: the smallest country (Vanuatu) has a population of just under 300 thousand, and the largest (China) a population of 1,433,784 thousand, just over 1.4 billion.
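
These extremes can be confirmed with a quick check (illustrative only, not part of the original output):

happy$country[which.min(happy$population)]   # smallest population: Vanuatu
happy$country[which.max(happy$population)]   # largest population: China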

So as noted, we only have missing data for 4 countries, and only for gdp. Let’s see which countries these are.

datatable(happy[which(is.na((happy$gdp))),])

These are a mix of countries from different regions around the globe, with no obvious pattern to hint at a reason for the data being missing. We shall proceed with these countries included in our data.

Let’s have a look at the univariate distributions.

g1 <- ggplot(happy) +
 aes(x = life_expectancy) +
 geom_histogram(bins = 30L, fill = "yellow") +
 theme_minimal()

g2 <- ggplot(happy) +
 aes(x = gdp) +
 geom_histogram(bins = 30L, fill = "orange") +
 theme_minimal()

g3 <- ggplot(happy) +
 aes(x = wellbeing) +
 geom_histogram(bins = 30L, fill = "blue") +
 theme_minimal()

g4 <- ggplot(happy) +
 aes(x = footprint) +
 geom_histogram(bins = 30L, fill = "dark grey") +
 theme_minimal()


g5 <- ggplot(happy) +
 aes(x = happy) +
 geom_histogram(bins = 30L, fill = "green") +
 theme_minimal()

g6 <- ggplot(happy) +
 aes(x = population) +
 geom_histogram(bins = 30L, fill = "red") +
 theme_minimal()


ggarrange(g1, g2, g3, g4, g5, g6, nrow = 3, ncol =2)

The population variable is heavily right skewed (tail on right side). The gdp variable also is right skewed, as well as footprint, but not as strong as population. The happy score appears to be closest to normally distributed compared to the other variables in data set. We see that life_expectancy is slightly exhibiting a left skewed distribution. We do not transform any variables. However given the distributions for population, footprint and gdp, we could have done so (taking the logs).
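
As an illustration of what such a transformation would look like (not applied in the rest of the analysis), the gdp histogram could be redrawn on the log scale:

ggplot(happy) +
 aes(x = log(gdp)) +
 geom_histogram(bins = 30L, fill = "orange") +
 theme_minimal() + xlab("log(gdp)")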

ggplot(happy) +
 aes(x = continent) +
 geom_bar(fill = "#112446") +
 theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust = 0.7, size=7,face="bold")) +ggtitle("Number of Countries in Continents")

We see that the majority of countries in this index are from Africa; few are from South Asia or North America (and Oceania). Let’s look at the specific populations of the countries; more precisely, the most populated countries in the index.

# top 22 countries by population
datatable(happy %>% dplyr::select("country", "continent", "population") %>%
  filter(population > 65125) %>%
  arrange(desc(population)))

We can visualize these results. This is shown below.

#visualize
happy %>% 
  filter(population > 65125) %>%
  arrange(desc(population)) %>%
  ggplot(aes(x = population, y = country)) +geom_col(fill = "darkgreen") +
  xlab(NULL) + ggtitle("Populations") + theme_minimal()

Let’s see the happiest countries according to the Happy Planet Index. An extract, the top 10, is shown below.

happy_sorted <- happy[with(happy, order(-happy)), ]
happy_sorted.20 <- head(happy_sorted, 20)
happy_sorted.20 <- happy_sorted.20[, c(1, 2, 5, 12, 10)]
happy_sorted.20 %>% head(10) %>% datatable()

Costa Rica takes the top spot, followed by Vanuatu and Colombia. Switzerland is the highest-ranked European country. Of equal interest are the bottom countries (the least happy countries).

happy_bottom.20 <- tail(happy_sorted, 20)
happy_bottom.20 <- happy_bottom.20[, c(1, 2, 5, 12, 10)]
datatable(tail(happy_bottom.20,10))

We find that the country with the lowest happiness index is Qatar, followed by Mongolia. There are many African countries in the bottom 10. Next we shall explore some correlations between variables in our data set.

mydatz <- happy[,c(6,7,8,9,10,12)]
mydatz <- mydatz[complete.cases(mydatz),]
cormat <- round(cor(mydatz),2)


reorder_cormat <- function(cormat){
  # Use correlation between variables as a distance for hierarchical clustering
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}

# Get the lower triangle of the correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  cormat
}

# Get the upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  cormat
}

# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
    name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()

ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

We see many high correlations between variables in the data set. Maybe surprisingly, ecological footprint and gdp have a strong positive correlation, indicating that countries with higher GDPs have larger footprints (bad). Also, maybe less surprisingly, life_expectancy has a strong positive correlation with gdp, and wellbeing has a strong positive correlation with life_expectancy. There are also some other weaker correlations within the data, although these are still worth investigating.

We now investigate some relationships between GDP, life expectancy and the happy index scores. Three separate relationships are illustrated below. The blue line is a LOESS (locally estimated scatterplot smoothing) fit, with a confidence band around the smoothed line.
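
A sketch of how these scatterplots can be produced (the styling here is illustrative) uses geom_smooth(method = "loess"), which draws the LOESS line together with its confidence band:

p1 <- ggplot(happy, aes(x = gdp, y = happy)) +
 geom_point() + geom_smooth(method = "loess") + theme_minimal()

p2 <- ggplot(happy, aes(x = log(gdp), y = life_expectancy)) +
 geom_point() + geom_smooth(method = "loess") + theme_minimal()

p3 <- ggplot(happy, aes(x = life_expectancy, y = happy)) +
 geom_point() + geom_smooth(method = "loess") + theme_minimal()

ggarrange(p1, p2, p3, nrow = 1)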

Many countries in Europe and the Americas end up with middle-to-low HPI scores because of their big carbon footprints, despite long life expectancy. GDP cannot buy happiness: the correlation between GDP and the Happy Planet Index score is indeed very low, at about 0.11. After a log transformation, the relationship between GDP per capita and life expectancy is relatively strong; these two variables move together, with a Pearson correlation of approximately 0.62.

Principal Component Analysis

TL;DR: PCA starts with the data set being normalized, after which an eigen-analysis of the covariance matrix (of the normalized data, i.e. the correlation matrix) is carried out. To reduce the dimension, the data set is then projected onto the first few principal components. It is a method for reducing the dimensionality of the data whilst retaining as much of the information in the data as possible.

Principal components analysis (PCA) refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features \(X_1, X_2, . . . , X_p\) and no associated response \(Y\). Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (of the observations or of the variables).

PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the \(p\) features.
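
For example, the first principal component is the normalized linear combination of the features with the largest variance,

\[
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^2 = 1,
\]

where the \(\phi_{j1}\) are the loadings of the first principal component; each subsequent component maximizes variance subject to being uncorrelated with the earlier ones.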

As mentioned, we center and scale the data; that is, each variable is normalized using its mean and standard deviation prior to implementing PCA. Before scaling, we first extract the numeric (quantitative) variables from the original data.

mydata <-  happy
mydata <- mydata[complete.cases(mydata),]
mydata_quant <- mydata[,c(6,7,8,9,10,12)]

Now we use the prcomp() function to perform PCA, specifying that the data are to be centered and scaled.

fit_prcomp <- prcomp(x=mydata_quant, center = T, scale. = T)
round(fit_prcomp$rotation[,1:6],2)
##                   PC1   PC2   PC3   PC4   PC5   PC6
## population       0.07 -0.08 -0.99  0.12  0.01  0.01
## life_expectancy -0.50  0.16 -0.14 -0.72 -0.29 -0.32
## wellbeing       -0.51  0.21  0.03  0.67 -0.15 -0.48
## footprint       -0.45 -0.47  0.02  0.13 -0.45  0.60
## happy           -0.18  0.80 -0.07  0.03  0.09  0.56
## gdp             -0.50 -0.24 -0.01 -0.07  0.82  0.07

The result of PCA is shown above. We see the loadings for the 6 principal components.

We can also look at the standard deviations of the components (the square roots of the eigenvalues).

fit_prcomp$sdev
## [1] 1.7910494 1.1763537 0.9997059 0.4932020 0.3860578 0.1289696

More generally, we can summarize the PCA fit using the summary() function.

summary(fit_prcomp)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.7910 1.1764 0.9997 0.49320 0.38606 0.12897
## Proportion of Variance 0.5346 0.2306 0.1666 0.04054 0.02484 0.00277
## Cumulative Proportion  0.5346 0.7653 0.9318 0.97239 0.99723 1.00000

The proportion of variation retained by the principal components is extracted above. Note that the eigenvalue of a component is the amount of variation it retains, and the first PC corresponds to the maximum amount of variation in the data set. Under the commonly used eigenvalues-greater-than-one rule proposed by Kaiser, the first two principal components are worthy of consideration, with the third sitting essentially on the threshold. We also see that the first 3 components account for over 90% of the variation in the data. To establish how many components we need, we can use a scree plot.
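
As a quick check of the Kaiser rule, the eigenvalues can be computed directly from the component standard deviations (an illustrative snippet, not part of the original output):

eigenvalues <- fit_prcomp$sdev^2   # eigenvalues are the squared standard deviations
round(eigenvalues, 3)
sum(eigenvalues >= 1)              # PC1 and PC2 exceed 1; PC3 sits essentially on the threshold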

var_explained_df <- data.frame(PC= (1:6),
                               var_explained=(fit_prcomp$sdev)^2/sum((fit_prcomp$sdev)^2))

# SCREE PLOT (eigenvalues are the squared standard deviations of the components)
eigen_vs_pc <- data.frame(PC = (1:6), eig = (fit_prcomp$sdev)^2)

#CUMULATIVE VARIANCE EXPLAINED PLOT
std_dev <- fit_prcomp$sdev
pr_var <- std_dev^2 #save eigenvalues in pr_var
prop_varex <- pr_var/sum(pr_var)
cumsum_var <- data.frame(Variance = cumsum(prop_varex), PC = 1:6)

 #EIGENVALUE VS PC (SCREE)
scree1 <- ggplot(eigen_vs_pc) +
 aes(x = PC, y = eig) +
 geom_line(size = 1L, colour = "darkblue") + geom_point(size = 1.5, colour = "darkblue") +
 labs(x = "Principal Components", 
 y = "Eigenvalue") +
 theme_linedraw() +
 xlim(1, 6) +
 ylim(0, 3.5) + geom_hline(yintercept = 1, col = "green")   # Kaiser threshold (eigenvalue = 1)

#CUM SUM OF VARIANCE PER PC
scree2 <- ggplot(cumsum_var) +
 aes(x = PC, y = Variance) +
 geom_line(size = 1L, colour = "darkblue") +geom_point(size = 1.5, colour = "darkblue") +
 labs(x = "Principal Components", 
 y = "Cumulative Explained Variance (%)") +
 theme_linedraw() + geom_vline(xintercept = 3, col = "green")


ggarrange(scree1, scree2,
          ncol = 2, nrow = 1)

The scree plot shows which components explain most of the variability in the data. The elbow occurs at about 3 or 4 components; we choose 3, as the third eigenvalue is approximately 1. Three PCs therefore explain enough of the variance, and increasing the number of PCs beyond three does not significantly improve the amount of variation explained. We can see that 3 components have eigenvalues at or very near 1, and the elbow occurs at the same point.

Let’s look at the loadings on the PCs; specifically, the variable contributions to each of the first two dimensions (PCs).

g1 <- fviz_contrib(fit_prcomp, choice = "var", axes = 1) #Contributions of variables to PC1

g2 <- fviz_contrib(fit_prcomp, choice = "var", axes = 2) # Contributions of variables to PC2

ggarrange(g1, g2, nrow = 1)

We see that wellbeing, gdp and life_expectancy have large loadings (contributions) on the first PC. On the second PC, happy and footprint load the highest, but the majority of the contribution comes from happy; this suggests that PC2 is essentially the variable happy. Essentially, here we are just visualizing the loadings we have seen before, as the short check below confirms.
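
For a standardized PCA, the contributions shown by fviz_contrib can be reproduced directly from the loadings (an illustrative snippet):

# Contribution (%) of each variable to PC1: squared loading as a share of the
# total squared loadings for that component
loadings_pc1 <- fit_prcomp$rotation[, 1]
round(100 * loadings_pc1^2 / sum(loadings_pc1^2), 1)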

fviz_pca_var(fit_prcomp, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), title="")

The contribution plot shown above combines and summarises the individual contribution plots. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set. The larger the contribution value, the more the variable contributes to that component; this highlights the most important variables in explaining the variation retained by the principal components.

We can plot the first two principal components using a biplot.

# visualise the first two PCs (biplot)
(x1 <- fviz_pca_biplot(fit_prcomp, col.ind = mydata$continent, title="") +  theme_gray() + xlab("Dimension 1 (53.5%)")+ ylab("Dimension 2 (23.1%)"))

In the biplot, the points are the projected observations and the vectors are the projected variables. The variance of the data along the principal component directions is associated with the magnitude of the eigenvalues; that is, the length of a vector is a measure of the variance of the data when projected onto that axis. Moreover, the further these vectors are from the PC origin, the more influence they have on that PC.

Also, the angles between the vectors tell us how the variables correlate with one another: when two vectors are close, forming a small angle, the two variables they represent are positively correlated.
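
As a rough illustration (the choice of variables here is arbitrary, and the comparison is only approximate since the first two PCs capture about 77% of the variance), the cosine of the angle between two variables’ coordinates on the PC1 and PC2 plane can be compared with their Pearson correlation:

# Variable coordinates on the first two PCs (loadings scaled by component standard deviations)
var_coord <- t(t(fit_prcomp$rotation[, 1:2]) * fit_prcomp$sdev[1:2])
cos_angle <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_angle(var_coord["wellbeing", ], var_coord["life_expectancy", ])  # angle-based measure
cor(mydata_quant$wellbeing, mydata_quant$life_expectancy)            # Pearson correlation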

We show the same plot, but add ellipses around the data.

# Change the color by groups, add ellipses
fviz_pca_biplot(fit_prcomp, label="var", habillage=mydata$continent,
               addEllipses=TRUE, ellipse.level=0.95)

The ellipses are drawn to try to group the continents together. We see a great deal of overlap; continent may not play a significant role in clustering the countries.

In summary, we conducted PCA on the Happy Planet Index data. PCA is a procedure for identifying a smaller number of uncorrelated variables, called “principal components”, from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the minimum number of principal components. We have settled on 3 principal components as enough to explain the variation in the data.


Clustering

We apply two non-hierarchical clustering methods: PAM and K-means. First K-means.

K-means

K-means is classified as a non-hierarchical clustering method; a key difference is that, for non-hierarchical clustering, the desired number of clusters needs to be specified in advance. Moreover, this type of clustering does not produce a tree-like cluster structure; rather, it works by splitting up and merging clusters.

To begin the algorithm, each observation is randomly assigned to one of the specified clusters. Once assigned, the centroid of each cluster is calculated. The objective of the K-means algorithm is to minimize the within-cluster error sum of squares (ESS), based on squared Euclidean distance, which ensures that all observations are close to the centroid of the cluster they lie in. The algorithm then iterates by reassigning each observation to the closest cluster centroid, reducing the ESS, and continues until no observation moves from the cluster in which it is located. A minimal sketch of this iteration is given below.
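
The sketch below is illustrative only; it initializes with randomly chosen centroids rather than random assignments, and ignores the empty-cluster edge case. In practice we simply use the built-in kmeans() function.

simple_kmeans <- function(X, k, iters = 20) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]            # random initial centroids
  for (i in seq_len(iters)) {
    # distances from every observation to every centroid
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)                                     # assign to the nearest centroid
    # recompute each centroid as the mean of its assigned observations
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}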

In order to assess the most suitable number of clusters, an Elbow Plot will be examined.

scaled <- scale(mydata_quant)
df <- scaled
fviz_nbclust(df[,1:6], kmeans, method = "wss")

We shall use 4 clusters, as indicated by the elbow. Let’s run the K-means algorithm using the kmeans() function. The resulting output is shown below.

#make this example reproducible
set.seed(1)

#perform k-means clustering with k = 4 clusters
km <- kmeans(df[,1:6], centers = 4, nstart = 25)

#view results
km
## K-means clustering with 4 clusters of sizes 44, 59, 43, 2
## 
## Cluster means:
##    population life_expectancy   wellbeing  footprint      happy        gdp
## 1 -0.09931592     -1.27340693 -1.05200706 -0.7480507 -0.7566963 -0.7929422
## 2 -0.11068406      0.26321796  0.08488625 -0.3416266  0.6836538 -0.3237515
## 3 -0.12311703      0.94061438  1.01421012  1.2497614 -0.1330298  1.2782246
## 4  8.09714616      0.02681336 -1.16550671 -0.3347698 -0.6603280 -0.4864311
## 
## Clustering vector:
##   [1] 2 2 2 3 2 2 2 2 2 2 3 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 3 3 3 2 3 2 3 2 2
##  [38] 3 2 3 3 2 2 2 3 2 2 3 2 2 3 3 2 2 1 3 3 3 2 2 2 3 2 2 1 2 3 2 3 2 1 1 3 2
##  [75] 1 2 1 1 2 2 2 2 3 1 3 2 1 2 3 1 4 2 1 3 1 1 1 2 2 1 1 3 2 1 1 1 3 1 1 1 1
## [112] 1 2 1 3 1 1 1 3 1 1 3 1 1 4 1 1 3 1 3 1 3 3 1 3 3 3 1 3 3 1 1 1 1 1 1 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  58.806519  79.124874 141.020682   2.686531
##  (between_SS / total_SS =  68.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The output shows the cluster means, the cluster that each country belongs to, the cluster sizes and the components available in the output object. We have a look at the countries and points in the clusters formed below.

#plot results of final k-means model
fviz_cluster(km, data = df[,1:6])

The means for each cluster are shown below.

#find means of each cluster (the mean of non-numeric columns is NA; those columns are dropped below)
means_clusts <- aggregate(mydata, by=list(cluster=km$cluster), mean)
datatable(means_clusts[,c(1,7:13)])

We see that the cluster means do exhibit some clear differences, some more pronounced than others.

We now attach the cluster assignment for each respective country to the original data set (i.e., recording which cluster each country belongs to). This is shown below.

#add cluster assignment to original data
final_data <- cbind(mydata, cluster = km$cluster)
datatable(final_data[,c(1,2,5:13)])

PAM Clustering

A disadvantage of K-means is that it is sensitive to outliers, and different results can occur if the ordering of the data is changed. PAM clustering deals more effectively with the noise and outliers present in data, because it partitions objects around medoids rather than centroids, as in K-means. PAM performs clustering on the overall data rather than only on selected samples from the data set, and hence it does not work efficiently for very large data sets.

The k-medoids algorithm minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster. PAM (Partitioning Around Medoids) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.
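
For illustration (the helper function and example values here are hypothetical, not part of the original analysis), the quantity PAM minimizes can be written as a small function: for a candidate set of medoid indices, each observation contributes its dissimilarity to the nearest medoid.

# Total cost of a candidate set of medoids; PAM keeps swapping medoids and
# non-medoids while this total decreases.
total_cost <- function(d, medoids) {
  d <- as.matrix(d)
  sum(apply(d[, medoids, drop = FALSE], 1, min))
}
# e.g. total_cost(dist(scale(mydata_quant)), medoids = c(1, 50, 100))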

We begin by preparing the appropriate data set. It contains the same six variables used for the K-means algorithm.

scaled <- scale(mydata[,6:12])    # scale the numeric variables
scaled <- as.data.frame(scaled)
scaled$country <- mydata$country
scaled$continent <- mydata$continent
scaled <- scaled[,-6]             # drop biocapacity (constant in this data, so uninformative)

Like K-means, K-medoids (and hence PAM) requires the user to specify \(k\), the number of clusters. We now use both the elbow plot and average silhouette method to determine number of clusters to use.

#ELBOW METHOD
elb <- fviz_nbclust(scaled[,1:6], pam, method = "wss") +  theme_gray()


#AV SILHOUETTE METHOD
sil <- fviz_nbclust(scaled[,1:6], pam, method = "silhouette") +  theme_gray() + xlab("Number of Clusters (k)") + ylab("Average Silhouette Width") + ggtitle("") 

ggarrange(elb, sil, nrow = 1)

We decide on using 6 clusters. Let’s apply the PAM algorithm.

#PAM CLUSTERING (K-MEDOIDS)
# SET SEED
set.seed(123)
df = scaled[,1:6]
pam.out <- pam(df, 6)
fviz_cluster(pam.out, data = df)

The cluster formations actually appear very similar to those from K-means. Let’s investigate the clusters further by looking at the quality of the clustering: the average silhouette width.

fviz_silhouette(pam.out) + labs(title="PAM using All Variables") +ylab("Silhouette Width") + xlab("Clusters")
##   cluster size ave.sil.width
## 1       1   28          0.41
## 2       2   32          0.35
## 3       3   30          0.30
## 4       4   40          0.36
## 5       5    2          0.72
## 6       6   16          0.15

The PAM clustering produces an average silhouette width of 0.34 when using all variables. We can inspect whether this increases or decreases when using the PCs from the earlier sections of this project. The silhouette width per cluster is illustrated above. An average silhouette width of 0.34 for 6 clusters suggests that a structure, albeit a weak one, is present.
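
Recall that for each observation \(i\), the silhouette width compares the average dissimilarity to the other members of its own cluster, \(a(i)\), with the smallest average dissimilarity to any other cluster, \(b(i)\):

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\]

so values near 1 indicate well-separated observations, values near 0 indicate observations lying between clusters, and negative values suggest possible misassignment.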

Let’s extract the cluster that each country was assigned to, and view an extract of that data frame. This is shown below.

clust_class <- as.data.frame(pam.out$clustering)
clust_class$country <- mydata$country
clust_class$continent <- mydata$continent
names(clust_class) <- c("Cluster", "Country", "Continent")
clust_class %>% head(10)
##    Cluster     Country      Continent
## 1        1  Costa Rica  Latin America
## 2        1     Vanuatu      East Asia
## 3        1    Colombia  Latin America
## 4        2 Switzerland Western Europe
## 5        1     Ecuador  Latin America
## 6        1      Panama  Latin America
## 7        1     Jamaica  Latin America
## 8        1   Guatemala  Latin America
## 9        1    Honduras  Latin America
## 10       1     Uruguay  Latin America

Let’s see whether using the PCs from earlier improves our clustering. We repeat the process followed earlier in this section, but using the first three principal components.

pca_components <- fit_prcomp$x[,1:3]

#ELBOW METHOD
elb <- fviz_nbclust(pca_components, pam, method = "wss") +  theme_gray()


#AV SILHOUETTE METHOD
sil <- fviz_nbclust(pca_components, pam, method = "silhouette") +  theme_gray() + xlab("Number of Clusters (k)") + ylab("Average Silhouette Width") + ggtitle("") 

# SELECTING K
ggarrange(elb, sil, nrow = 1)

We shall again use 6 clusters. Let’s run the algorithm on the PCs.

#PAM CLUSTERING (K-MEDOIDS)
# SET SEED
set.seed(123)
df = pca_components
pam.out.pc <- pam(df, 6)
fviz_cluster(pam.out.pc, data = df)

Visually, the cluster structure looks quite different from the earlier K-means and PAM results on all variables.

fviz_silhouette(pam.out.pc)  + labs(title="PAM With Components") +ylab("Silhouette Width") + xlab("Clusters")
##   cluster size ave.sil.width
## 1       1   31          0.41
## 2       2   27          0.47
## 3       3   31          0.32
## 4       4   40          0.45
## 5       5    2          0.74
## 6       6   17          0.16

We get an average silhouette width of 0.39, an improvement in clustering quality. A comparison of the cluster plots for all variables and for the PCs is given below.

c1 <- fviz_cluster(pam.out.pc, data = pca_components) + labs(title="PAM With Components")
c2 <- fviz_cluster(pam.out, data = scaled[,1:6])  + labs(title="PAM With All Variables")

ggarrange(c1,c2,ncol=2) # cluster plots

And finally, a comparison of the silhouette widths per cluster for the two different data sets is given below.

c3 <- fviz_silhouette(pam.out) + labs(title="PAM using All Variables") +ylab("Silhouette Width") + xlab("Clusters")
##   cluster size ave.sil.width
## 1       1   28          0.41
## 2       2   32          0.35
## 3       3   30          0.30
## 4       4   40          0.36
## 5       5    2          0.72
## 6       6   16          0.15
c4 <- fviz_silhouette(pam.out.pc) + labs(title="PAM with Components") +ylab("Silhouette Width") + xlab("Clusters")
##   cluster size ave.sil.width
## 1       1   31          0.41
## 2       2   27          0.47
## 3       3   31          0.32
## 4       4   40          0.45
## 5       5    2          0.74
## 6       6   17          0.16
ggarrange(c3,c4,ncol=2) # silhouette plots

We see an equal or better silhouette width for every cluster when using the components rather than the raw variables. Moreover, there are fewer observations with negative silhouette widths (i.e., observations that may sit in the wrong cluster) when using the components; see the check below.
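
A quick check along these lines (illustrative, not part of the original output) is to count the countries with negative silhouette widths under each solution:

sum(silhouette(pam.out)[, "sil_width"] < 0)      # PAM on all variables
sum(silhouette(pam.out.pc)[, "sil_width"] < 0)   # PAM on the first three PCs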

Finally, we end this clustering section by displaying the countries on a world map, colour-coded by cluster. Note that many countries were not included in the index and have been greyed out.

# Using Components 
mydata['cluster'] <- as.factor(pam.out.pc$clustering)

# Map data (map_data() requires the 'maps' package to be installed)
map <- map_data("world")
map <- left_join(map, mydata, by = c('region' = 'country'))

# Plot Countries with Clusters by Color
ggplot() + geom_polygon(data = map, aes(x = long, y = lat, group = group, fill=cluster, color=cluster)) +
  labs(title = "Clustering Happy Planet Index", x=NULL, y=NULL) + theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        legend.position = "bottom",
        legend.justification = "center", 
        legend.title = element_text(face = "bold"),
        )

We can make the map interactive to show each country’s specific statistics, using the ggplotly() function from the plotly package.

#Make Map Interactive 
ggplotly(
ggplot(map, aes(text = paste("Country : ", map$region, "\n",
                              "Life Exp : ", floor(map$life_expectancy), "\n",
                              "Wellbeing : ", round(map$wellbeing, 2), "\n",
                              "Footprint : ", round(map$footprint, 2), " gha", "\n",
                              "GDP : USD ", floor(map$gdp), "\n",
                              "Population : ", format(map$population, big.mark = ","), "\n",
                              "HPI : ", round(map$happy, 2), "\n",
                              sep = ""))
       ) + 
  geom_polygon(aes(x = long, 
                   y = lat, 
                   group = group, 
                   fill = cluster)) +
  coord_equal() +
  ggtitle("Clustering Happy Planet Index") +
  labs(x = "Longitude",
       y = "Latitude"),
tooltip ="text"
)

Self Organizing Maps Application

A useful tool for visualizing high-dimensional data is the self-organizing map (SOM). The algorithm handles large, complex data effectively, grouping observations (here, countries) into nodes.

data_matrix <- as.matrix(scaled[,1:6])
som_grid <- somgrid(xdim = 4, ydim=4, topo="hexagonal")
set.seed(7)
som_model <- som(data_matrix,grid=som_grid, 
                 rlen=5000, 
                 alpha=c(0.05,0.01),
                 keep.data = TRUE
                 )

As the SOM training iterations progress, the distance from each node’s weights to the samples represented by that node is reduced. Ideally, this distance should reach a minimum plateau; if the curve is still decreasing, more iterations are required.

change.df <- as.data.frame(som_model$changes)
change.df <- cbind(change.df, x = c(1:5000))

#TRAINING CHANGES
ggplot(change.df) +
 aes(x = x, y = V1) +
 geom_line(size = 0.5, colour = "#112446") +
 theme_minimal()  +xlab("Iteration") +ylab("Mean Distance to Closest Unit") + labs(title = "")

We see that the curve reaches a minimum plateau (at around 3,200 iterations), much earlier than the 5,000 iterations used; no further iterations are needed after roughly 3,500.

The kohonen package allows us to visualize how many samples are mapped to each node on the map. This metric can be used as a measure of map quality: ideally the sample distribution is relatively uniform. Large values in some map areas suggest that a larger map would be beneficial, while empty nodes indicate that the map size is too big for the number of samples.

plot(som_model, type="counts", main="")

Here, the lighter-coloured areas represent nodes with more observations mapped onto them, and the darker (red) areas represent nodes with fewer. We see that the first four nodes all have relatively few observations (countries) mapped onto them, whilst the fifth node has the most.

The neighbour distances reveal natural cluster formations. It is useful to examine the distance of each node to its neighbours: high distances (lighter areas) indicate nodes that are very different from their surrounding nodes.

plot(som_model, type="dist.neighbours", main="")

We see that node 1 has the largest distance to its surrounding neighbors. Similarly, node 5 has a relatively large distance to its surrounding neighbors.

The node weight vectors, or “codes”, are made up of normalized values of the original variables used to generate the SOM. Each node’s weight vector is representative of (similar to) the samples mapped to that node. By visualizing the weight vectors across the map, we can see patterns in the distribution of samples and variables. Each segment of a node’s fan represents one of the variables used to build the SOM.

plot(som_model, type="codes", main="")
## Warning in par(opar): argument 1 does not name a graphical parameter

A large segment in a node’s fan indicates that observations (countries) with high values of that variable are mapped onto that node. For example, the first node contains countries with very high population and relatively high life expectancy, while node 4 contains countries with low values across most variables.

We can also examine the distribution of each individual variable across the grid nodes. A SOM heat map allows the visualisation of the distribution of a single variable across the map.

codes=som_model$codes[[1]]
par(mfrow=c(2,3))
for(i in 1:6){
plot(som_model, type = "property", property = codes[,i], main=colnames(scaled[,1:6])[i])
}

Looking at the per-variable plots, there is a good spread of countries for each variable. Only population has a single node containing countries with very high values of the variable; all the other nodes have much lower values and hence a darker colour.

SOMs have proved to be a useful method for visualizing high-dimensional data and for showing how classes (countries) group together. This insight provides greater understanding of the data and variables for further analyses.