World Happiness Data for Clustering & Dimension Reduction Analysis

Introduction

The World Happiness Report 2022 dataset contains happiness index scores for 146 countries (the raw file has 147 rows, the last of which is a placeholder with no data). It has 12 columns, including the score itself and the six factors used to explain it: GDP per capita, healthy life expectancy, social support, freedom to make life choices, generosity, and corruption perception.

The variables provided in the dataset are as follows:

RANK: This variable represents the country’s rank in terms of its happiness score compared to other countries in the world.

Country: This variable represents the name of the country being analyzed.

Happiness score: This variable represents the measure of happiness obtained from the Cantril ladder question in the Gallup World Poll.

Whisker-high: This variable represents the upper limit of the happiness score based on the confidence interval of 95%.

Whisker-low: This variable represents the lower limit of the happiness score based on the confidence interval of 95%.

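Dystopia (1.83) + residual: Dystopia is an imaginary country that has the world’s least happy people and serves as a benchmark against which all other countries can be compared. This variable is the Dystopia score (1.83) plus the residual, that is, the part of the happiness score not explained by the six factors below.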

GDP per capita: This variable represents the extent to which GDP per capita (the country’s economic production divided by its total population) contributes to the calculation of the happiness score.

Social support: This variable represents the extent to which social support contributes to the calculation of the happiness score.

Life expectancy: This variable represents the extent to which healthy life expectancy contributes to the calculation of the happiness score.

Freedom of Life Choices: This variable represents the extent to which freedom contributes to the calculation of the happiness score.

Generosity: This variable represents the extent to which generosity contributes to the calculation of the happiness score.

Corruption: This variable represents the extent to which perceptions of corruption contribute to the calculation of the happiness score.

The dataset was obtained from Kaggle.

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.1.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.1.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(clValid)

## Warning: package 'clValid' was built under R version 4.1.3

## Loading required package: cluster

## Warning: package 'cluster' was built under R version 4.1.3

library(flexclust)

## Warning: package 'flexclust' was built under R version 4.1.3

## Loading required package: grid

## Loading required package: lattice

## Loading required package: modeltools

## Loading required package: stats4

## 
## Attaching package: 'modeltools'

## The following object is masked from 'package:clValid':
## 
##     clusters

library(clustertend)

## Package `clustertend` is deprecated.  Use package `hopkins` instead.

library(cluster)
library(ClusterR)

## Warning: package 'ClusterR' was built under R version 4.1.3

library(readxl)

## Warning: package 'readxl' was built under R version 4.1.3

library(fpc)

## Warning: package 'fpc' was built under R version 4.1.3

library(gridExtra)
library(corrplot)

## corrplot 0.92 loaded

data <- read.csv("2022.csv", header = TRUE, sep = ",", dec = ",")

The last row of the dataset (row 147) has missing values (NAs) in every variable except “Rank” and “Country”. It was treated as dummy data, and the entire row was removed.

Specifically, there is one missing value for each of the following variables: “Happiness score”, “Whisker-high”, “Whisker-low”, “Dystopia (1.83) + residual”, “GDP per capita”, “Social support”, “Life expectancy”, “Freedom of Life Choices”, “Generosity”, and “Corruption”.

sapply(data, function(x) sum(is.na(x)))

##                                       RANK 
##                                          0 
##                                    Country 
##                                          0 
##                            Happiness.score 
##                                          1 
##                               Whisker.high 
##                                          1 
##                                Whisker.low 
##                                          1 
##                 Dystopia..1.83....residual 
##                                          1 
##               Explained.by..GDP.per.capita 
##                                          1 
##               Explained.by..Social.support 
##                                          1 
##      Explained.by..Healthy.life.expectancy 
##                                          1 
## Explained.by..Freedom.to.make.life.choices 
##                                          1 
##                   Explained.by..Generosity 
##                                          1 
##    Explained.by..Perceptions.of.corruption 
##                                          1

data <- data[-147,]
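The hard-coded row index above works for this particular file. As a minimal sketch of an alternative, assuming the goal is simply to drop any row with missing values, `complete.cases()` avoids relying on the row position. The data frame `toy` below is a hypothetical stand-in, not the real dataset.

```r
# Hypothetical stand-in for the raw file: the last row is a placeholder
# with missing values in every numeric column.
toy <- data.frame(
  RANK    = 1:4,
  Country = c("A", "B", "C", "xx"),
  Score   = c(7.8, 7.6, 7.5, NA),
  GDP     = c(1.90, 1.95, 1.94, NA)
)

# Count missing values per column, as the report does with sapply().
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Keep only rows without any missing value instead of indexing by position.
toy_clean <- toy[complete.cases(toy), ]

print(na_counts)
print(nrow(toy_clean))
```

The same call on the real data frame would remove row 147 only if it is the sole incomplete row, which is what the NA counts above suggest.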

sapply(data, function(x) sum(is.na(x)))

##                                       RANK 
##                                          0 
##                                    Country 
##                                          0 
##                            Happiness.score 
##                                          0 
##                               Whisker.high 
##                                          0 
##                                Whisker.low 
##                                          0 
##                 Dystopia..1.83....residual 
##                                          0 
##               Explained.by..GDP.per.capita 
##                                          0 
##               Explained.by..Social.support 
##                                          0 
##      Explained.by..Healthy.life.expectancy 
##                                          0 
## Explained.by..Freedom.to.make.life.choices 
##                                          0 
##                   Explained.by..Generosity 
##                                          0 
##    Explained.by..Perceptions.of.corruption 
##                                          0

str(data)

## 'data.frame':    146 obs. of  12 variables:
##  $ RANK                                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country                                   : chr  "Finland" "Denmark" "Iceland" "Switzerland" ...
##  $ Happiness.score                           : num  7.82 7.64 7.56 7.51 7.42 ...
##  $ Whisker.high                              : num  7.89 7.71 7.65 7.59 7.47 ...
##  $ Whisker.low                               : num  7.76 7.56 7.46 7.44 7.36 ...
##  $ Dystopia..1.83....residual                : num  2.52 2.23 2.32 2.15 2.14 ...
##  $ Explained.by..GDP.per.capita              : num  1.89 1.95 1.94 2.03 1.95 ...
##  $ Explained.by..Social.support              : num  1.26 1.24 1.32 1.23 1.21 ...
##  $ Explained.by..Healthy.life.expectancy     : num  0.775 0.777 0.803 0.822 0.787 0.79 0.803 0.786 0.818 0.752 ...
##  $ Explained.by..Freedom.to.make.life.choices: num  0.736 0.719 0.718 0.677 0.651 0.7 0.724 0.728 0.568 0.68 ...
##  $ Explained.by..Generosity                  : num  0.109 0.188 0.27 0.147 0.271 0.12 0.218 0.217 0.155 0.245 ...
##  $ Explained.by..Perceptions.of.corruption   : num  0.534 0.532 0.191 0.461 0.419 0.388 0.512 0.474 0.143 0.483 ...

As described above, the dataset includes the Dystopia benchmark alongside the six explanatory factors: GDP per capita, social support, life expectancy, freedom to make life choices, generosity, and corruption perception. All columns except Country are numeric. The country names were therefore moved to the row names, and the Country column was dropped.

rownames(data) <- data$Country
data <- data[,-2]
str(data)

## 'data.frame':    146 obs. of  11 variables:
##  $ RANK                                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Happiness.score                           : num  7.82 7.64 7.56 7.51 7.42 ...
##  $ Whisker.high                              : num  7.89 7.71 7.65 7.59 7.47 ...
##  $ Whisker.low                               : num  7.76 7.56 7.46 7.44 7.36 ...
##  $ Dystopia..1.83....residual                : num  2.52 2.23 2.32 2.15 2.14 ...
##  $ Explained.by..GDP.per.capita              : num  1.89 1.95 1.94 2.03 1.95 ...
##  $ Explained.by..Social.support              : num  1.26 1.24 1.32 1.23 1.21 ...
##  $ Explained.by..Healthy.life.expectancy     : num  0.775 0.777 0.803 0.822 0.787 0.79 0.803 0.786 0.818 0.752 ...
##  $ Explained.by..Freedom.to.make.life.choices: num  0.736 0.719 0.718 0.677 0.651 0.7 0.724 0.728 0.568 0.68 ...
##  $ Explained.by..Generosity                  : num  0.109 0.188 0.27 0.147 0.271 0.12 0.218 0.217 0.155 0.245 ...
##  $ Explained.by..Perceptions.of.corruption   : num  0.534 0.532 0.191 0.461 0.419 0.388 0.512 0.474 0.143 0.483 ...

Clustering: K Means Clustering

K-means clustering is a widely used unsupervised machine learning algorithm that aims to partition a dataset into a predefined number of clusters based on similarity between data points. In this report, we will explore K-means clustering using two different methods to select the optimal number of clusters for the given data.

To start, we used the fviz_nbclust function from the factoextra package to evaluate K-means solutions with different numbers of clusters. Two criteria were used to choose the optimal number: the within-cluster sum of squares (WSS) and the silhouette method.

Firstly, we used the “wss” method to determine the optimal number of clusters. The WSS method calculates the sum of squared distances between each point and its assigned centroid within each cluster. The optimal number of clusters can be determined by identifying the elbow point in a plot of WSS versus the number of clusters. We plotted the WSS values for different numbers of clusters using fviz_nbclust function and identified the number of clusters that corresponds to the elbow point as the optimal number of clusters.
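The WSS criterion described above can be sketched in base R without factoextra. This is a minimal illustration on simulated data (two well-separated groups, an assumption made for the example), not a reproduction of the report's plot.

```r
set.seed(42)

# Simulated data: two well-separated groups in two dimensions.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6.
# nstart = 25 reruns kmeans from several random starts and keeps the best fit.
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# The elbow is the k after which WSS stops dropping sharply.
print(data.frame(k = 1:6, wss = round(wss, 2)))
```

In this simulated case the drop from k = 1 to k = 2 is far larger than any later drop, which is the pattern the elbow method looks for.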

Next, we used the “silhouette” method to determine the optimal number of clusters. The silhouette method calculates a silhouette coefficient for each data point, which measures how similar it is to its own cluster compared to other clusters. The silhouette coefficient ranges from -1 to 1, with values closer to 1 indicating a good clustering. The optimal number of clusters can be determined by identifying the number of clusters that maximizes the average silhouette coefficient across all data points. We plotted the average silhouette coefficient values for different numbers of clusters using fviz_nbclust function and identified the number of clusters that corresponds to the peak point as the optimal number of clusters.
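The average silhouette criterion can likewise be computed by hand with `silhouette()` from the cluster package, which the report already loads. The sketch below uses simulated data (an assumption for illustration) and picks the k with the largest average silhouette width.

```r
library(cluster)  # provides silhouette()

set.seed(42)

# Simulated data: two well-separated groups in two dimensions.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)
d <- dist(x)  # Euclidean distances between all pairs of points

# Average silhouette width for k = 2, ..., 6 (it is undefined for k = 1).
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k that maximizes the average silhouette width.
best_k <- as.integer(names(which.max(avg_sil)))

print(round(avg_sil, 3))
print(best_k)
```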

Number of clusters

fviz_nbclust(data, FUNcluster = kmeans, method = "wss", k.max = 10, verbose = FALSE, nboot = 10) +
  theme(axis.text.x = element_text(size = 15)) +
  geom_line(aes(group = 1), color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_point(group = 1, size = 2, color = "blue")

fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette", k.max = 10, verbose = FALSE, nboot = 10) +
  theme(axis.text.x = element_text(size = 15)) +
  geom_line(aes(group = 1), color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_point(group = 1, size = 2, color = "blue")

Based on the WSS and average silhouette width plots produced by fviz_nbclust, the optimal number of clusters appears to lie between two and four, with the two-cluster solution looking best. Further analysis can be conducted to confirm this choice.

kmeans_clustering <- eclust(data, "kmeans", k = 2)

kmeans_clustering

## K-means clustering with 2 clusters of sizes 73, 73
## 
## Cluster means:
##   RANK Happiness.score Whisker.high Whisker.low Dystopia..1.83....residual
## 1   37         6.43526     6.534726    6.335740                   2.014575
## 2  110         4.67189     4.812452    4.531397                   1.649041
##   Explained.by..GDP.per.capita Explained.by..Social.support
## 1                     1.696671                     1.100000
## 2                     1.124219                     0.711726
##   Explained.by..Healthy.life.expectancy
## 1                             0.6997397
## 2                             0.4726027
##   Explained.by..Freedom.to.make.life.choices Explained.by..Generosity
## 1                                  0.5947671                0.1431918
## 2                                  0.4396849                0.1515616
##   Explained.by..Perceptions.of.corruption
## 1                               0.1864384
## 2                               0.1231233
## 
## Clustering vector:
##                   Finland                   Denmark                   Iceland 
##                         1                         1                         1 
##               Switzerland               Netherlands               Luxembourg* 
##                         1                         1                         1 
##                    Sweden                    Norway                    Israel 
##                         1                         1                         1 
##               New Zealand                   Austria                 Australia 
##                         1                         1                         1 
##                   Ireland                   Germany                    Canada 
##                         1                         1                         1 
##             United States            United Kingdom                   Czechia 
##                         1                         1                         1 
##                   Belgium                    France                   Bahrain 
##                         1                         1                         1 
##                  Slovenia                Costa Rica      United Arab Emirates 
##                         1                         1                         1 
##              Saudi Arabia  Taiwan Province of China                 Singapore 
##                         1                         1                         1 
##                   Romania                     Spain                   Uruguay 
##                         1                         1                         1 
##                     Italy                    Kosovo                     Malta 
##                         1                         1                         1 
##                 Lithuania                  Slovakia                   Estonia 
##                         1                         1                         1 
##                    Panama                    Brazil                Guatemala* 
##                         1                         1                         1 
##                Kazakhstan                    Cyprus                    Latvia 
##                         1                         1                         1 
##                    Serbia                     Chile                 Nicaragua 
##                         1                         1                         1 
##                    Mexico                   Croatia                    Poland 
##                         1                         1                         1 
##               El Salvador                   Kuwait*                   Hungary 
##                         1                         1                         1 
##                 Mauritius                Uzbekistan                     Japan 
##                         1                         1                         1 
##                  Honduras                  Portugal                 Argentina 
##                         1                         1                         1 
##                    Greece               South Korea               Philippines 
##                         1                         1                         1 
##                  Thailand                   Moldova                   Jamaica 
##                         1                         1                         1 
##                Kyrgyzstan                  Belarus*                  Colombia 
##                         1                         1                         1 
##    Bosnia and Herzegovina                  Mongolia        Dominican Republic 
##                         1                         1                         1 
##                  Malaysia                   Bolivia                     China 
##                         1                         1                         1 
##                  Paraguay                      Peru                Montenegro 
##                         1                         2                         2 
##                   Ecuador                   Vietnam             Turkmenistan* 
##                         2                         2                         2 
##             North Cyprus*                    Russia Hong Kong S.A.R. of China 
##                         2                         2                         2 
##                   Armenia                Tajikistan                     Nepal 
##                         2                         2                         2 
##                  Bulgaria                    Libya*                 Indonesia 
##                         2                         2                         2 
##               Ivory Coast           North Macedonia                   Albania 
##                         2                         2                         2 
##              South Africa               Azerbaijan*                   Gambia* 
##                         2                         2                         2 
##                Bangladesh                      Laos                   Algeria 
##                         2                         2                         2 
##                  Liberia*                   Ukraine                     Congo 
##                         2                         2                         2 
##                   Morocco                Mozambique                  Cameroon 
##                         2                         2                         2 
##                   Senegal                    Niger*                   Georgia 
##                         2                         2                         2 
##                     Gabon                      Iraq                 Venezuela 
##                         2                         2                         2 
##                    Guinea                      Iran                     Ghana 
##                         2                         2                         2 
##                    Turkey              Burkina Faso                  Cambodia 
##                         2                         2                         2 
##                     Benin                  Comoros*                    Uganda 
##                         2                         2                         2 
##                   Nigeria                     Kenya                   Tunisia 
##                         2                         2                         2 
##                  Pakistan  Palestinian Territories*                      Mali 
##                         2                         2                         2 
##                   Namibia     Eswatini, Kingdom of*                   Myanmar 
##                         2                         2                         2 
##                 Sri Lanka               Madagascar*                     Egypt 
##                         2                         2                         2 
##                     Chad*                  Ethiopia                    Yemen* 
##                         2                         2                         2 
##               Mauritania*                    Jordan                      Togo 
##                         2                         2                         2 
##                     India                    Zambia                    Malawi 
##                         2                         2                         2 
##                  Tanzania              Sierra Leone                  Lesotho* 
##                         2                         2                         2 
##                 Botswana*                   Rwanda*                  Zimbabwe 
##                         2                         2                         2 
##                   Lebanon               Afghanistan 
##                         2                         2 
## 
## Within cluster sum of squares by cluster:
## [1] 32498.08 32564.07
##  (between_SS / total_SS =  75.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "clust_plot"  
## [11] "silinfo"      "nbclust"      "data"

The K-means algorithm produced two clusters of equal size, each containing 73 countries. The mean ranks of the clusters (37 and 110) show that the split follows the ranking: the first cluster holds the top 73 countries and the second the remaining 73. The first cluster has a mean happiness score of 6.44 and the second a mean of 4.67. Because the clusters are the same size, the overall mean of the dataset is the average of the two, about 5.55, so the first cluster lies above it and the second below. The other variables also differ between the clusters. The first cluster has higher mean values for GDP per capita, social support, healthy life expectancy, freedom to make life choices, and perceptions of corruption. Generosity is the only variable with a slightly higher mean in the second cluster (0.152 versus 0.143).

The clustering vector provides information on which countries are assigned to which cluster. The first cluster includes countries like Finland, Denmark, Switzerland, and Norway, while the second cluster includes countries like Pakistan, India, Egypt, and Yemen.

fviz_silhouette(kmeans_clustering)

##   cluster size ave.sil.width
## 1       1   73          0.62
## 2       2   73          0.62

The two clusters appear relatively well separated and internally coherent: both have an average silhouette width of 0.62.
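One point worth checking when reading these results is scale. K-means relies on Euclidean distance, so a column with a much larger range than the others (here RANK runs from 1 to 146, while most other columns stay below 2) can dominate the distances. The sketch below illustrates the effect on simulated data and shows standardization with `scale()` as one possible remedy. The data frame `toy` and its columns are hypothetical, and whether to standardize or to exclude RANK is a modelling choice the report does not discuss.

```r
set.seed(1)
n <- 100

# Hypothetical data: 'rank' has a large range but carries no group signal,
# while 'score' has a small range and separates two groups cleanly.
group <- rep(1:2, each = n / 2)
toy <- data.frame(
  rank  = sample(1:n),
  score = c(rnorm(n / 2, mean = 0, sd = 0.1),
            rnorm(n / 2, mean = 1, sd = 0.1))
)

raw_fit    <- kmeans(toy, centers = 2, nstart = 25)         # unscaled columns
scaled_fit <- kmeans(scale(toy), centers = 2, nstart = 25)  # standardized columns

# Share of points assigned consistently with the true grouping. Cluster labels
# are arbitrary, so take the better of the two possible label matchings.
agreement <- function(cl) max(mean(cl == group), mean(cl != group))

print(agreement(raw_fit$cluster))
print(agreement(scaled_fit$cluster))
```

On the unscaled data the split is driven by the large-range column, while after standardization the small-range column that actually separates the groups determines the clusters.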

Dimension Reduction

The objective of using PCA is to identify the variables that contribute most to the variation in the data and to uncover its underlying structure. Summarizing the data with a few principal components reduces its complexity and makes it more manageable for further analysis.

correlation_df <- cor(data, method = 'pearson')

corrplot(correlation_df, tl.col = "red", tl.srt = 50, bg = "White",
         title = "\n\n Correlation Plot",pch.cex = 3,tl.pos = 'ld',tl.cex=0.7,
         type = "lower")

pca_df <- prcomp(data, center = TRUE, scale. = TRUE)
fviz_eig(pca_df,
  barfill = "steelblue",
  barcolor = "steelblue",
  linecolor = "black",
  ncp = 10,
  addlabels = TRUE)

summary(pca_df)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.5844 1.1878 1.0677 0.8630 0.72075 0.54873 0.41899
## Proportion of Variance 0.6072 0.1283 0.1036 0.0677 0.04723 0.02737 0.01596
## Cumulative Proportion  0.6072 0.7355 0.8391 0.9068 0.95404 0.98141 0.99737
##                            PC8     PC9      PC10      PC11
## Standard deviation     0.16493 0.04155 0.0007337 0.0002421
## Proportion of Variance 0.00247 0.00016 0.0000000 0.0000000
## Cumulative Proportion  0.99984 1.00000 1.0000000 1.0000000

The first principal component (PC1) has the highest standard deviation of 2.5844, which indicates that it explains the most variability in the data. The proportion of variance explained by PC1 is 0.6072, which means that it accounts for 60.72% of the total variation in the data. The cumulative proportion of variance explained by PC1 and PC2 is 0.7355, indicating that these two components together account for 73.55% of the total variation in the data.

PC2 has the second-highest standard deviation of 1.1878, indicating that it explains the second-most variability in the data. The proportion of variance explained by PC2 is 0.1283, which means that it accounts for 12.83% of the total variation in the data. The cumulative proportion of variance explained by PC1, PC2, and PC3 is 0.8391, indicating that these three components together account for 83.91% of the total variation in the data.

PC3 has the third-highest standard deviation of 1.0677, indicating that it explains the third-most variability in the data. The proportion of variance explained by PC3 is 0.1036, which means that it accounts for 10.36% of the total variation in the data. The cumulative proportion of variance explained by PC1, PC2, PC3, and PC4 is 0.9068, indicating that these four components together account for 90.68% of the total variation in the data.

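The remaining principal components (PC4 through PC11) have progressively smaller standard deviations and explain progressively less variance. PC4 still accounts for 6.77% of the variance, and PC4 through PC11 together account for the remaining 16.09%. PC10 and PC11 have standard deviations close to zero, which suggests that some columns are almost exact linear combinations of others.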
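The proportions in the summary table follow directly from the standard deviations: each component's variance is its squared standard deviation, and its proportion is that variance divided by the total. As a worked check, the sketch below recomputes the table from the standard deviations printed above (copied by hand, so results agree only up to rounding).

```r
# Standard deviations copied from the summary(pca_df) output above.
sdev <- c(2.5844, 1.1878, 1.0677, 0.8630, 0.72075, 0.54873,
          0.41899, 0.16493, 0.04155, 0.0007337, 0.0002421)

variance   <- sdev^2                    # variance captured by each component
proportion <- variance / sum(variance)  # share of the total variance
cumulative <- cumsum(proportion)        # running total across components

print(round(data.frame(PC = seq_along(sdev), proportion, cumulative), 4))

# With 11 standardized variables the total variance is (approximately) 11.
print(sum(variance))
```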

varpca <- get_pca_var(pca_df)
options(ggrepel.max.overlaps = Inf)
fviz_pca_var(pca_df, col.var="steelblue", alpha.var="contrib", repel = TRUE)

summary(varpca$contrib)

##      Dim.1              Dim.2              Dim.3             Dim.4         
##  Min.   : 0.01658   Min.   : 0.09375   Min.   : 0.2536   Min.   : 0.00058  
##  1st Qu.: 5.03532   1st Qu.: 1.60673   1st Qu.: 0.3269   1st Qu.: 0.04450  
##  Median :10.04548   Median : 2.19017   Median : 0.5723   Median : 0.96463  
##  Mean   : 9.09091   Mean   : 9.09091   Mean   : 9.0909   Mean   : 9.09091  
##  3rd Qu.:14.37384   3rd Qu.:11.60474   3rd Qu.:11.5449   3rd Qu.:10.44393  
##  Max.   :14.57906   Max.   :50.98167   Max.   :49.1253   Max.   :52.96955  
##      Dim.5              Dim.6              Dim.7             Dim.8        
##  Min.   : 0.05354   Min.   : 0.00157   Min.   : 0.1888   Min.   : 0.0140  
##  1st Qu.: 0.18918   1st Qu.: 0.06434   1st Qu.: 0.2576   1st Qu.: 0.1805  
##  Median : 0.34262   Median : 1.02782   Median : 0.6220   Median : 1.8838  
##  Mean   : 9.09091   Mean   : 9.09091   Mean   : 9.0909   Mean   : 9.0909  
##  3rd Qu.: 6.46846   3rd Qu.: 4.20840   3rd Qu.: 6.0097   3rd Qu.: 3.8548  
##  Max.   :74.40557   Max.   :54.31513   Max.   :48.5725   Max.   :82.1938  
##      Dim.9              Dim.10              Dim.11        
##  Min.   : 0.00021   Min.   : 0.000015   Min.   : 0.00000  
##  1st Qu.: 0.00320   1st Qu.: 1.850733   1st Qu.: 0.00021  
##  Median : 0.01903   Median : 7.751442   Median : 0.00087  
##  Mean   : 9.09091   Mean   : 9.090909   Mean   : 9.09091  
##  3rd Qu.: 0.09246   3rd Qu.:13.302172   3rd Qu.: 7.86571  
##  Max.   :49.87431   Max.   :28.249326   Max.   :67.26027

The summary statistics include minimum, maximum, mean, median, and quartiles of the variable contributions for each PC. For example, the first quartile of the contribution of variables to the first PC is 5.03532, and the third quartile is 14.37384. This means that 25% of the variables contribute less than 5.03532 to the first PC, and 75% of the variables contribute less than 14.37384.

The table also shows that the mean contribution is 9.09091 for every PC. This follows from how contributions are defined: for each PC they are percentages that sum to 100 across the 11 variables, so their mean is always 100/11 ≈ 9.09, regardless of how much variance that PC explains. The maximum contribution varies widely between PCs, with some variables contributing heavily to a particular PC and others contributing very little.
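The contribution percentages can be reproduced from the loadings alone. Each column of the `rotation` matrix returned by `prcomp` is a unit-length vector, so the squared loadings in a column sum to 1 and, multiplied by 100, give percentage contributions. This should agree with the `contrib` table from `get_pca_var`, although that equivalence is stated here as an assumption rather than checked against factoextra. The built-in `USArrests` data stands in for the happiness data.

```r
# USArrests ships with base R and has 4 numeric variables.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared loadings times 100: percentage contribution of each variable
# (rows) to each principal component (columns).
contrib <- 100 * pca$rotation^2

print(round(contrib, 2))
print(colSums(contrib))   # 100 for every component
print(colMeans(contrib))  # 100 / 4 = 25 for every component
```

With 11 variables, as in the report, the same reasoning gives a mean of 100/11 ≈ 9.09 for every component.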

fviz_contrib(pca_df, choice = "var", axes = 1:3, fill = "steelblue", color = "steelblue", sort.val ="desc")

Conclusion

The dataset contains information about 146 countries and their respective happiness index scores. We used two different methods to determine the optimal number of clusters for this dataset: the within-cluster sum of squares (WSS) and the silhouette method.

Both the WSS method and the silhouette method pointed to 2 as the optimal number of clusters. We then used the K-means algorithm to partition the data into two clusters.

We also analyzed the relationships between the variables in the dataset using a Pearson correlation matrix and visualized the results with a correlation plot. The plot showed strong positive correlations between the happiness score and GDP per capita, social support, and life expectancy.

In conclusion, K-means clustering is a useful technique for unsupervised learning and can help identify patterns in large datasets. Our analysis showed that there were distinct clusters within the World Happiness Report 2022 dataset, and that these clusters were based on the country’s level of happiness, GDP per capita, social support, life expectancy, freedom of life choices, generosity, and corruption perception.


Rashad Naghizade

02/05/2023
