The World Happiness Report 2022 dataset covers 146 countries and their happiness index scores (the raw file has 147 rows, the last of which is a placeholder with no values). The file contains 12 columns. The happiness score is related to six factors: GDP per capita, healthy life expectancy, social support, freedom to make life choices, generosity, and perceptions of corruption.
The variables provided in the dataset are as follows:
RANK: This variable represents the country’s rank in terms of its happiness score compared to other countries in the world.
Country: This variable represents the name of the country being analyzed.
Happiness score: This variable represents the measure of happiness obtained from the Cantril ladder question in the Gallup World Poll.
Whisker-high: This variable represents the upper bound of the 95% confidence interval for the happiness score.
Whisker-low: This variable represents the lower bound of the 95% confidence interval for the happiness score.
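Dystopia (1.83) + residual: Dystopia is an imaginary country that has the world’s least happy people and serves as a benchmark against which all other countries can be compared. This variable is the Dystopia score (1.83) plus the country’s residual, that is, the part of its happiness score that the six factors below do not explain.
GDP per capita: This variable represents the extent to which GDP per capita (the country’s economic production divided by its total population) contributes to the calculation of the happiness score.
Social support: This variable represents the extent to which social support contributes to the calculation of the happiness score.
Life expectancy: This variable represents the extent to which healthy life expectancy contributes to the calculation of the happiness score.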
Freedom of Life Choices: This variable represents the extent to which freedom contributes to the calculation of the happiness score.
Generosity: This variable represents the extent to which generosity contributes to the calculation of the happiness score.
Corruption: This variable represents the extent to which perceptions of corruption contribute to the calculation of the happiness score.
The dataset was obtained from Kaggle.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(clValid)
## Warning: package 'clValid' was built under R version 4.1.3
## Loading required package: cluster
## Warning: package 'cluster' was built under R version 4.1.3
library(flexclust)
## Warning: package 'flexclust' was built under R version 4.1.3
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:clValid':
##
## clusters
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(cluster)
library(ClusterR)
## Warning: package 'ClusterR' was built under R version 4.1.3
library(readxl)
## Warning: package 'readxl' was built under R version 4.1.3
library(fpc)
## Warning: package 'fpc' was built under R version 4.1.3
library(gridExtra)
library(corrplot)
## corrplot 0.92 loaded
data <- read.csv("2022.csv", header = TRUE, sep = ",", dec = ",")
The last row of the dataset contains missing values (NAs) for every variable except “Rank” and “Country”. It was therefore treated as dummy data and the entire row was removed.
Specifically, there is one missing value for each of the following variables: “Happiness score”, “Whisker-high”, “Whisker-low”, “Dystopia (1.83) + residual”, “GDP per capita”, “Social support”, “Life expectancy”, “Freedom of Life Choices”, “Generosity”, and “Corruption”.
sapply(data, function(x) sum(is.na(x)))
## RANK
## 0
## Country
## 0
## Happiness.score
## 1
## Whisker.high
## 1
## Whisker.low
## 1
## Dystopia..1.83....residual
## 1
## Explained.by..GDP.per.capita
## 1
## Explained.by..Social.support
## 1
## Explained.by..Healthy.life.expectancy
## 1
## Explained.by..Freedom.to.make.life.choices
## 1
## Explained.by..Generosity
## 1
## Explained.by..Perceptions.of.corruption
## 1
data <- data[-147,]
sapply(data, function(x) sum(is.na(x)))
## RANK
## 0
## Country
## 0
## Happiness.score
## 0
## Whisker.high
## 0
## Whisker.low
## 0
## Dystopia..1.83....residual
## 0
## Explained.by..GDP.per.capita
## 0
## Explained.by..Social.support
## 0
## Explained.by..Healthy.life.expectancy
## 0
## Explained.by..Freedom.to.make.life.choices
## 0
## Explained.by..Generosity
## 0
## Explained.by..Perceptions.of.corruption
## 0
str(data)
## 'data.frame': 146 obs. of 12 variables:
## $ RANK : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : chr "Finland" "Denmark" "Iceland" "Switzerland" ...
## $ Happiness.score : num 7.82 7.64 7.56 7.51 7.42 ...
## $ Whisker.high : num 7.89 7.71 7.65 7.59 7.47 ...
## $ Whisker.low : num 7.76 7.56 7.46 7.44 7.36 ...
## $ Dystopia..1.83....residual : num 2.52 2.23 2.32 2.15 2.14 ...
## $ Explained.by..GDP.per.capita : num 1.89 1.95 1.94 2.03 1.95 ...
## $ Explained.by..Social.support : num 1.26 1.24 1.32 1.23 1.21 ...
## $ Explained.by..Healthy.life.expectancy : num 0.775 0.777 0.803 0.822 0.787 0.79 0.803 0.786 0.818 0.752 ...
## $ Explained.by..Freedom.to.make.life.choices: num 0.736 0.719 0.718 0.677 0.651 0.7 0.724 0.728 0.568 0.68 ...
## $ Explained.by..Generosity : num 0.109 0.188 0.27 0.147 0.271 0.12 0.218 0.217 0.155 0.245 ...
## $ Explained.by..Perceptions.of.corruption : num 0.534 0.532 0.191 0.461 0.419 0.388 0.512 0.474 0.143 0.483 ...
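The column names suggest that the happiness score decomposes additively into the Dystopia + residual column and the six "Explained by" columns. The sketch below checks this on the rounded values printed above for the first three rows. The data frame `top3` and its short column names are ours, introduced only for illustration, and the additive decomposition is an assumption being tested, not something the output states directly.

```r
# Rounded values for the first three countries, copied from the str() output.
top3 <- data.frame(
  country    = c("Finland", "Denmark", "Iceland"),
  score      = c(7.82, 7.64, 7.56),
  dystopia   = c(2.52, 2.23, 2.32),
  gdp        = c(1.89, 1.95, 1.94),
  social     = c(1.26, 1.24, 1.32),
  life       = c(0.775, 0.777, 0.803),
  freedom    = c(0.736, 0.719, 0.718),
  generosity = c(0.109, 0.188, 0.270),
  corruption = c(0.534, 0.532, 0.191)
)

# Sum the seven components and compare with the reported score.
components <- c("dystopia", "gdp", "social", "life",
                "freedom", "generosity", "corruption")
top3$rebuilt <- rowSums(top3[, components])
top3$gap     <- top3$rebuilt - top3$score

print(top3[, c("country", "score", "rebuilt", "gap")])
```

Any small gaps are what one would expect from rounding in the printed values.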
As noted above, the dataset includes a benchmark column based on Dystopia, an imaginary country with the world’s least happy people, alongside the contributions of GDP per capita, social support, life expectancy, freedom to make life choices, generosity, and corruption perception. All columns except the country column are numeric. The country names were therefore moved to the row names and the country column was dropped, leaving 11 numeric variables.
rownames(data) <- data$Country
data <- data[,-2]
str(data)
## 'data.frame': 146 obs. of 11 variables:
## $ RANK : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Happiness.score : num 7.82 7.64 7.56 7.51 7.42 ...
## $ Whisker.high : num 7.89 7.71 7.65 7.59 7.47 ...
## $ Whisker.low : num 7.76 7.56 7.46 7.44 7.36 ...
## $ Dystopia..1.83....residual : num 2.52 2.23 2.32 2.15 2.14 ...
## $ Explained.by..GDP.per.capita : num 1.89 1.95 1.94 2.03 1.95 ...
## $ Explained.by..Social.support : num 1.26 1.24 1.32 1.23 1.21 ...
## $ Explained.by..Healthy.life.expectancy : num 0.775 0.777 0.803 0.822 0.787 0.79 0.803 0.786 0.818 0.752 ...
## $ Explained.by..Freedom.to.make.life.choices: num 0.736 0.719 0.718 0.677 0.651 0.7 0.724 0.728 0.568 0.68 ...
## $ Explained.by..Generosity : num 0.109 0.188 0.27 0.147 0.271 0.12 0.218 0.217 0.155 0.245 ...
## $ Explained.by..Perceptions.of.corruption : num 0.534 0.532 0.191 0.461 0.419 0.388 0.512 0.474 0.143 0.483 ...
K-means clustering is a widely used unsupervised machine learning algorithm that aims to partition a dataset into a predefined number of clusters based on similarity between data points. In this report, we will explore K-means clustering using two different methods to select the optimal number of clusters for the given data.
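As a minimal illustration of what K-means returns, the sketch below runs base R's `kmeans` on synthetic data. The `toy` matrix, the seed, and `nstart = 10` are our choices for the example and are not part of the analysis in this report.

```r
set.seed(42)

# Toy data: two well-separated groups of 20 points in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# Partition into k = 2 clusters; nstart reruns the algorithm from several
# random initialisations and keeps the best solution.
fit <- kmeans(toy, centers = 2, nstart = 10)

fit$size                   # number of points per cluster
fit$centers                # cluster centroids
fit$betweenss / fit$totss  # share of total variation explained by the split
```

The same components (`size`, `centers`, `withinss`, `betweenss`, `totss`) appear in the K-means output for the happiness data further down.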
To start, we used the fviz_nbclust function from the factoextra package to choose the number of clusters for K-means. We compared two criteria: the within-cluster sum of squares (WSS) and the silhouette method.
First, we used the “wss” method. The WSS is the sum of squared distances between each point and the centroid of its assigned cluster. It always decreases as clusters are added, so the optimal number of clusters is taken at the elbow point in a plot of WSS versus the number of clusters, where adding another cluster stops reducing the WSS substantially. We plotted the WSS for different numbers of clusters using the fviz_nbclust function and took the number of clusters at the elbow as optimal.
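The elbow idea can be sketched in base R without factoextra. The example below uses synthetic data with three groups (our assumption, chosen so the elbow is obvious) and computes the total WSS for k = 1 to 6. It is an illustration of the criterion, not a reproduction of what fviz_nbclust does internally.

```r
set.seed(42)

# Toy data: three well-separated groups of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# The drop in WSS from one k to the next; the elbow is where it flattens.
elbow_table <- data.frame(k = ks, wss = round(wss, 1),
                          drop = round(c(NA, -diff(wss)), 1))
print(elbow_table)
```

Plotting `wss` against `ks` gives the elbow curve. For this toy data the drop should become small after k = 3.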
Next, we used the “silhouette” method. It calculates a silhouette coefficient for each data point, which measures how similar the point is to its own cluster compared to the nearest other cluster. The coefficient ranges from -1 to 1, with values closer to 1 indicating a good clustering. The optimal number of clusters is the one that maximizes the average silhouette coefficient across all data points. We plotted the average silhouette coefficient for different numbers of clusters using the fviz_nbclust function and took the number of clusters at the peak as optimal.
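The silhouette criterion can be sketched with the `silhouette` function from the cluster package, which is already loaded above. The synthetic three-group data and the range k = 2 to 6 are our assumptions for the example.

```r
library(cluster)

set.seed(42)

# Toy data: three well-separated groups of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for k = 2..6 (it is undefined for k = 1).
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  sil <- silhouette(fit$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks

print(round(avg_sil, 3))
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

For each point, the silhouette width is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster.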
# Elbow plot: total within-cluster sum of squares for k = 1..10
fviz_nbclust(data, FUNcluster = kmeans, method = "wss", k.max = 10, verbose = FALSE, nboot = 10) +
  theme(axis.text.x = element_text(size = 15)) +
  geom_line(aes(group = 1), color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_point(group = 1, size = 2, color = "blue")
# Average silhouette width for k = 1..10
fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette", k.max = 10, verbose = FALSE, nboot = 10) +
  theme(axis.text.x = element_text(size = 15)) +
  geom_line(aes(group = 1), color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_point(group = 1, size = 2, color = "blue")
Based on the WSS and average silhouette width plots produced by fviz_nbclust, the optimal number of clusters lies between two and four, and a two-cluster solution appears to be the best choice. Further analysis can be conducted to confirm this.
kmeans_clustering <- eclust(data, "kmeans", k = 2)
kmeans_clustering
## K-means clustering with 2 clusters of sizes 73, 73
##
## Cluster means:
## RANK Happiness.score Whisker.high Whisker.low Dystopia..1.83....residual
## 1 37 6.43526 6.534726 6.335740 2.014575
## 2 110 4.67189 4.812452 4.531397 1.649041
## Explained.by..GDP.per.capita Explained.by..Social.support
## 1 1.696671 1.100000
## 2 1.124219 0.711726
## Explained.by..Healthy.life.expectancy
## 1 0.6997397
## 2 0.4726027
## Explained.by..Freedom.to.make.life.choices Explained.by..Generosity
## 1 0.5947671 0.1431918
## 2 0.4396849 0.1515616
## Explained.by..Perceptions.of.corruption
## 1 0.1864384
## 2 0.1231233
##
## Clustering vector:
## Finland Denmark Iceland
## 1 1 1
## Switzerland Netherlands Luxembourg*
## 1 1 1
## Sweden Norway Israel
## 1 1 1
## New Zealand Austria Australia
## 1 1 1
## Ireland Germany Canada
## 1 1 1
## United States United Kingdom Czechia
## 1 1 1
## Belgium France Bahrain
## 1 1 1
## Slovenia Costa Rica United Arab Emirates
## 1 1 1
## Saudi Arabia Taiwan Province of China Singapore
## 1 1 1
## Romania Spain Uruguay
## 1 1 1
## Italy Kosovo Malta
## 1 1 1
## Lithuania Slovakia Estonia
## 1 1 1
## Panama Brazil Guatemala*
## 1 1 1
## Kazakhstan Cyprus Latvia
## 1 1 1
## Serbia Chile Nicaragua
## 1 1 1
## Mexico Croatia Poland
## 1 1 1
## El Salvador Kuwait* Hungary
## 1 1 1
## Mauritius Uzbekistan Japan
## 1 1 1
## Honduras Portugal Argentina
## 1 1 1
## Greece South Korea Philippines
## 1 1 1
## Thailand Moldova Jamaica
## 1 1 1
## Kyrgyzstan Belarus* Colombia
## 1 1 1
## Bosnia and Herzegovina Mongolia Dominican Republic
## 1 1 1
## Malaysia Bolivia China
## 1 1 1
## Paraguay Peru Montenegro
## 1 2 2
## Ecuador Vietnam Turkmenistan*
## 2 2 2
## North Cyprus* Russia Hong Kong S.A.R. of China
## 2 2 2
## Armenia Tajikistan Nepal
## 2 2 2
## Bulgaria Libya* Indonesia
## 2 2 2
## Ivory Coast North Macedonia Albania
## 2 2 2
## South Africa Azerbaijan* Gambia*
## 2 2 2
## Bangladesh Laos Algeria
## 2 2 2
## Liberia* Ukraine Congo
## 2 2 2
## Morocco Mozambique Cameroon
## 2 2 2
## Senegal Niger* Georgia
## 2 2 2
## Gabon Iraq Venezuela
## 2 2 2
## Guinea Iran Ghana
## 2 2 2
## Turkey Burkina Faso Cambodia
## 2 2 2
## Benin Comoros* Uganda
## 2 2 2
## Nigeria Kenya Tunisia
## 2 2 2
## Pakistan Palestinian Territories* Mali
## 2 2 2
## Namibia Eswatini, Kingdom of* Myanmar
## 2 2 2
## Sri Lanka Madagascar* Egypt
## 2 2 2
## Chad* Ethiopia Yemen*
## 2 2 2
## Mauritania* Jordan Togo
## 2 2 2
## India Zambia Malawi
## 2 2 2
## Tanzania Sierra Leone Lesotho*
## 2 2 2
## Botswana* Rwanda* Zimbabwe
## 2 2 2
## Lebanon Afghanistan
## 2 2
##
## Within cluster sum of squares by cluster:
## [1] 32498.08 32564.07
## (between_SS / total_SS = 75.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "clust_plot"
## [11] "silinfo" "nbclust" "data"
The K-means algorithm produced two clusters of equal size, each containing 73 countries. Countries in the first cluster are happier on average: the mean happiness score is 6.44 in the first cluster and 4.67 in the second, against an overall mean of 5.55 (the average of the two cluster means, since the clusters are the same size). The other variables also differ between the clusters. The first cluster has higher mean values for GDP per capita, social support, healthy life expectancy, freedom to make life choices, and perceptions of corruption (0.186 versus 0.123). Generosity is the only variable on which the second cluster is higher, and only slightly (0.152 versus 0.143).
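The clustering vector shows which cluster each country is assigned to. The first cluster includes countries like Finland, Denmark, Switzerland, and Norway, while the second cluster includes countries like Pakistan, India, Egypt, and Yemen. The split follows the ranking exactly: the 73 top-ranked countries (Finland through Paraguay) form the first cluster and the remaining 73 form the second. This is consistent with RANK dominating the distance calculation, since the data were not standardized and RANK has by far the largest range.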
fviz_silhouette(kmeans_clustering)
## cluster size ave.sil.width
## 1 1 73 0.62
## 2 2 73 0.62
The average silhouette width is 0.62 for both clusters, which suggests that the two clusters are reasonably well separated and internally coherent.
The objective of PCA is to find a small number of uncorrelated components, each a linear combination of the original variables, that capture most of the variation in the data and reveal its underlying structure. Examining which variables contribute most to these components simplifies the data and makes it more manageable for further analysis.
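To make the idea concrete, the sketch below shows on synthetic data that PCA on standardized variables amounts to an eigendecomposition of the correlation matrix. The `toy` data frame and its columns are hypothetical and exist only for this example.

```r
set.seed(42)

# Toy data: columns a and b are strongly related, column c is independent
# and on a much larger scale.
n  <- 100
x1 <- rnorm(n)
toy <- data.frame(
  a = x1,
  b = 2 * x1 + rnorm(n, sd = 0.5),
  c = 10 * rnorm(n)
)

# PCA on centred, standardized variables.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Eigendecomposition of the correlation matrix.
eig <- eigen(cor(toy))

# The component variances equal the eigenvalues.
print(rbind(prcomp = pca$sdev^2, eigen = eig$values))

# With standardized variables, the variances sum to the number of variables.
sum(pca$sdev^2)
```

Because each standardized variable has unit variance, the total variance equals the number of variables. That is why the proportions of variance for the 11-variable happiness data are the squared standard deviations divided by 11.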
# Pearson correlations between all numeric variables
correlation_df <- cor(data, method = 'pearson')
corrplot(correlation_df, tl.col = "red", tl.srt = 50, bg = "White",
         title = "\n\n Correlation Plot", pch.cex = 3, tl.pos = 'ld', tl.cex = 0.7,
         type = "lower")
# PCA on centred, standardized variables
pca_df <- prcomp(data, center = TRUE, scale. = TRUE)
# Scree plot of the first 10 components
fviz_eig(pca_df, barfill = "steelblue",
         barcolor = "steelblue",
         linecolor = "black",
         ncp = 10,
         addlabels = TRUE)
summary(pca_df)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5844 1.1878 1.0677 0.8630 0.72075 0.54873 0.41899
## Proportion of Variance 0.6072 0.1283 0.1036 0.0677 0.04723 0.02737 0.01596
## Cumulative Proportion 0.6072 0.7355 0.8391 0.9068 0.95404 0.98141 0.99737
## PC8 PC9 PC10 PC11
## Standard deviation 0.16493 0.04155 0.0007337 0.0002421
## Proportion of Variance 0.00247 0.00016 0.0000000 0.0000000
## Cumulative Proportion 0.99984 1.00000 1.0000000 1.0000000
The first principal component (PC1) has the highest standard deviation of 2.5844, which indicates that it explains the most variability in the data. The proportion of variance explained by PC1 is 0.6072, which means that it accounts for 60.72% of the total variation in the data. The cumulative proportion of variance explained by PC1 and PC2 is 0.7355, indicating that these two components together account for 73.55% of the total variation in the data.
PC2 has the second-highest standard deviation of 1.1878, indicating that it explains the second-most variability in the data. The proportion of variance explained by PC2 is 0.1283, which means that it accounts for 12.83% of the total variation in the data. The cumulative proportion of variance explained by PC1, PC2, and PC3 is 0.8391, indicating that these three components together account for 83.91% of the total variation in the data.
PC3 has the third-highest standard deviation of 1.0677, indicating that it explains the third-most variability in the data. The proportion of variance explained by PC3 is 0.1036, which means that it accounts for 10.36% of the total variation in the data. The cumulative proportion of variance explained by PC1, PC2, PC3, and PC4 is 0.9068, indicating that these four components together account for 90.68% of the total variation in the data.
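The remaining principal components (PC4 through PC11) have progressively smaller standard deviations and each explains less variance than the one before it. Together, PC4 through PC11 account for the remaining 16.09% of the variation. PC10 and PC11 have standard deviations close to zero, which points to near-exact linear dependencies among the columns (the happiness score, for instance, appears to be the sum of the Dystopia + residual column and the six “Explained by” columns).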
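The proportions quoted above can be recomputed from the standard deviations in the summary output. The sketch below copies those rounded values by hand, so small discrepancies in the last digit are possible.

```r
# Standard deviations of PC1..PC11, copied from summary(pca_df).
sdev <- c(2.5844, 1.1878, 1.0677, 0.8630, 0.72075, 0.54873, 0.41899,
          0.16493, 0.04155, 0.0007337, 0.0002421)

variance   <- sdev^2                    # variance of each component
proportion <- variance / sum(variance)  # share of the total variance
cumulative <- cumsum(proportion)        # running total

print(round(data.frame(PC = seq_along(sdev), proportion, cumulative), 4))

# With 11 standardized variables the total variance should be close to 11.
sum(variance)
```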
varpca <- get_pca_var(pca_df)
options(ggrepel.max.overlaps = Inf)
fviz_pca_var(pca_df, col.var="steelblue", alpha.var="contrib", repel = TRUE)
summary(varpca$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4
## Min. : 0.01658 Min. : 0.09375 Min. : 0.2536 Min. : 0.00058
## 1st Qu.: 5.03532 1st Qu.: 1.60673 1st Qu.: 0.3269 1st Qu.: 0.04450
## Median :10.04548 Median : 2.19017 Median : 0.5723 Median : 0.96463
## Mean : 9.09091 Mean : 9.09091 Mean : 9.0909 Mean : 9.09091
## 3rd Qu.:14.37384 3rd Qu.:11.60474 3rd Qu.:11.5449 3rd Qu.:10.44393
## Max. :14.57906 Max. :50.98167 Max. :49.1253 Max. :52.96955
## Dim.5 Dim.6 Dim.7 Dim.8
## Min. : 0.05354 Min. : 0.00157 Min. : 0.1888 Min. : 0.0140
## 1st Qu.: 0.18918 1st Qu.: 0.06434 1st Qu.: 0.2576 1st Qu.: 0.1805
## Median : 0.34262 Median : 1.02782 Median : 0.6220 Median : 1.8838
## Mean : 9.09091 Mean : 9.09091 Mean : 9.0909 Mean : 9.0909
## 3rd Qu.: 6.46846 3rd Qu.: 4.20840 3rd Qu.: 6.0097 3rd Qu.: 3.8548
## Max. :74.40557 Max. :54.31513 Max. :48.5725 Max. :82.1938
## Dim.9 Dim.10 Dim.11
## Min. : 0.00021 Min. : 0.000015 Min. : 0.00000
## 1st Qu.: 0.00320 1st Qu.: 1.850733 1st Qu.: 0.00021
## Median : 0.01903 Median : 7.751442 Median : 0.00087
## Mean : 9.09091 Mean : 9.090909 Mean : 9.09091
## 3rd Qu.: 0.09246 3rd Qu.:13.302172 3rd Qu.: 7.86571
## Max. :49.87431 Max. :28.249326 Max. :67.26027
The summary statistics include the minimum, maximum, mean, median, and quartiles of the variable contributions (in percent) for each PC. For example, for the first PC the first quartile of the contributions is 5.04% and the third quartile is 14.37%. This means that a quarter of the variables contribute less than 5.04% to the first PC, and three quarters contribute less than 14.37%.
The table also shows that the mean contribution is 9.09091 for every PC. This follows from the definition rather than from the data: the contributions of the 11 variables to any one PC sum to 100%, so their mean is always 100/11 ≈ 9.09. The maximum contribution varies widely across PCs, with a single variable accounting for as much as 82.19% of one component (Dim.8).
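The constant mean can be checked on synthetic data. The sketch assumes that a variable's contribution to a component is its squared loading expressed in percent, which is our understanding of the `contrib` values returned by get_pca_var for a prcomp object. The four-column `toy` data frame is hypothetical.

```r
set.seed(42)

# Toy data: 200 observations of 4 unrelated variables.
toy <- as.data.frame(matrix(rnorm(200 * 4), ncol = 4))
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Each column of the rotation matrix is a unit-length loading vector, so the
# squared loadings sum to 1. Scaling by 100 gives contributions in percent
# (assumed definition, see the note above).
contrib <- 100 * pca$rotation^2

colSums(contrib)   # 100 for every component
colMeans(contrib)  # 100 / 4 = 25 for every component
```

With 11 variables instead of 4, the same argument gives a mean of 100/11 for every component, whatever the data.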
fviz_contrib(pca_df, choice = "var", axes = 1:3, fill = "steelblue", color = "steelblue", sort.val ="desc")
## Conclusion
The dataset contains information about 146 countries and their happiness index scores. We used two methods to determine the optimal number of clusters for this dataset: the within-cluster sum of squares (WSS) and the silhouette method.
Both methods pointed to two clusters as the best choice, so we used the K-means algorithm to partition the data into two clusters.
We also analyzed the relationship between the variables in the dataset using a correlation matrix and visualized the results using a heatmap. The heatmap showed that there were strong positive correlations between the happiness score and GDP per capita, social support, and life expectancy.
In conclusion, K-means clustering is a useful technique for unsupervised learning and can help identify patterns in large datasets. Our analysis showed that there were distinct clusters within the World Happiness Report 2022 dataset, and that these clusters were based on the country’s level of happiness, GDP per capita, social support, life expectancy, freedom of life choices, generosity, and corruption perception.