At the beginning, I activated the required libraries and hid the
code with “include=FALSE”.
First, I imported a data set about World happiness report from Kaggle.com, based on 2015 data so I could analyze how happy people were 10 years ago!
Data found on: https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2015.csv
World_data <- read.table("C:/Users/Adam/OneDrive - Univerza v Ljubljani/FAKS/IMB/MULTIVARIATE ANALYSIS/Homework assignments/HW2/HW2-File/2015.csv", header=TRUE, sep=",", dec=".")
World_data <- as.data.frame(World_data)
I used read.table instead of read.csv to see if that fixes my Knitting problems, I also renamed the columns so there’s no spaces in the wording: Note, it didn’t fix my knitting problem.
colnames(World_data) [1] <- "Country"
colnames(World_data) [2] <- "Region"
colnames(World_data) [3] <- "ID_rank"
colnames(World_data) [4] <- "Happiness_score"
colnames(World_data) [5] <- "Standard_error"
colnames(World_data) [6] <- "GDP_PerCapita"
colnames(World_data) [7] <- "Family"
colnames(World_data) [8] <- "Health_life_expectancy"
colnames(World_data) [9] <- "Freedom"
colnames(World_data) [10] <- "Trust_GovCorruption"
colnames(World_data) [11] <- "Generosity"
colnames(World_data) [12] <- "Dystopia_residual"
Let’s take a quick look at the data:
head(World_data)
## Country Region ID_rank Happiness_score Standard_error GDP_PerCapita Family Health_life_expectancy Freedom Trust_GovCorruption
## 1 Switzerland Western Europe 1 7.587 0.03411 1.39651 1.34951 0.94143 0.66557 0.41978
## 2 Iceland Western Europe 2 7.561 0.04884 1.30232 1.40223 0.94784 0.62877 0.14145
## 3 Denmark Western Europe 3 7.527 0.03328 1.32548 1.36058 0.87464 0.64938 0.48357
## 4 Norway Western Europe 4 7.522 0.03880 1.45900 1.33095 0.88521 0.66973 0.36503
## 5 Canada North America 5 7.427 0.03553 1.32629 1.32261 0.90563 0.63297 0.32957
## 6 Finland Western Europe 6 7.406 0.03140 1.29025 1.31826 0.88911 0.64169 0.41372
## Generosity Dystopia_residual
## 1 0.29678 2.51738
## 2 0.43630 2.70201
## 3 0.34139 2.49204
## 4 0.34699 2.46531
## 5 0.45811 2.45176
## 6 0.23351 2.61955
In the dataset, we can observe the following 12 variables for each of the 158 countries:
Country: Name of the country.
Region: Region the country belongs to.
Happiness Rank: Rank of the country based on the Happiness Score.
Happiness Score: A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness”.
Standard Error: The standard error of the happiness score.
Economy (GDP per Capita): The extent to which GDP contributes to the calculation of the Happiness Score.
Family: The extent to which Family contributes to the calculation of the Happiness Score.
Health (Life expectancy): The extent to which Life expectancy contributed to the calculation of the Happiness Score.
Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.
Trust (Government corruption): The extent to which Perception of Corruption contributes to Happiness Score.
Generosity: The extent to which Generosity contributed to the calculation of the Happiness Score.
Dystopia Residual: The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.
summary1 <- summary(World_data)
print(summary1)
## Country Region ID_rank Happiness_score Standard_error GDP_PerCapita Family Health_life_expectancy
## Length:158 Length:158 Min. : 1.00 Min. :2.839 Min. :0.01848 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :character Class :character 1st Qu.: 40.25 1st Qu.:4.526 1st Qu.:0.03727 1st Qu.:0.5458 1st Qu.:0.8568 1st Qu.:0.4392
## Mode :character Mode :character Median : 79.50 Median :5.232 Median :0.04394 Median :0.9102 Median :1.0295 Median :0.6967
## Mean : 79.49 Mean :5.376 Mean :0.04788 Mean :0.8461 Mean :0.9910 Mean :0.6303
## 3rd Qu.:118.75 3rd Qu.:6.244 3rd Qu.:0.05230 3rd Qu.:1.1584 3rd Qu.:1.2144 3rd Qu.:0.8110
## Max. :158.00 Max. :7.587 Max. :0.13693 Max. :1.6904 Max. :1.4022 Max. :1.0252
## Freedom Trust_GovCorruption Generosity Dystopia_residual
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.3286
## 1st Qu.:0.3283 1st Qu.:0.06168 1st Qu.:0.1506 1st Qu.:1.7594
## Median :0.4355 Median :0.10722 Median :0.2161 Median :2.0954
## Mean :0.4286 Mean :0.14342 Mean :0.2373 Mean :2.0990
## 3rd Qu.:0.5491 3rd Qu.:0.18025 3rd Qu.:0.3099 3rd Qu.:2.4624
## Max. :0.6697 Max. :0.55191 Max. :0.7959 Max. :3.6021
We can see the Happiness Scores ranging from 2.839 (min) to 7.587 (max). The GDP per Capita (Mean = 0.8461), Family, and Health (Life Expectancy) indicators show variation. Freedom has a mean of 0.428 and Generosity at 0.237. We’ll later see how this data impacts our analysis.
As some of the variables are unimportant for our clustering analysis, we will be removing them:
World_data_clean <- World_data[ , c(-2, -3, -5, -12)]
head(World_data_clean)
## Country Happiness_score GDP_PerCapita Family Health_life_expectancy Freedom Trust_GovCorruption Generosity
## 1 Switzerland 7.587 1.39651 1.34951 0.94143 0.66557 0.41978 0.29678
## 2 Iceland 7.561 1.30232 1.40223 0.94784 0.62877 0.14145 0.43630
## 3 Denmark 7.527 1.32548 1.36058 0.87464 0.64938 0.48357 0.34139
## 4 Norway 7.522 1.45900 1.33095 0.88521 0.66973 0.36503 0.34699
## 5 Canada 7.427 1.32629 1.32261 0.90563 0.63297 0.32957 0.45811
## 6 Finland 7.406 1.29025 1.31826 0.88911 0.64169 0.41372 0.23351
First, we have to check if the data is appropriate for clustering.
I initially used GDP per Capita in the analysis as well, but removed it later as it caused problems in the clustering.
We have to standardize the data (scale function), so the variables all provide the same effect/impact.
World_data_clean_STD <- as.data.frame(scale(World_data_clean[c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity")]))
Now we can look for potential outliers:
World_data_clean$Dissimilarity <- sqrt(World_data_clean_STD$Family^2 + World_data_clean_STD$Health_life_expectancy^2 + World_data_clean_STD$Freedom^2 + World_data_clean_STD$Trust_GovCorruption^2 + World_data_clean_STD$Generosity^2)
head(World_data_clean[order(-World_data_clean$Dissimilarity), c("Country", "Dissimilarity")])
## Country Dissimilarity
## 129 Myanmar 4.585831
## 148 Central African Republic 4.341715
## 154 Rwanda 3.750601
## 9 New Zealand 3.724099
## 3 Denmark 3.697382
## 28 Qatar 3.602008
We see the biggest jump from Central African Republic onward (4.3 to 3.7), so based on that we will drop Myanmar and CAR from our data and standardize our data again:
World_data_clean2 <- World_data_clean %>%
filter(!Country %in% c("Myanmar", "Central African Republic"))
n* = 156
World_data_clean_STD2 <- as.data.frame(scale(World_data_clean2[c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity")]))
rownames(World_data_clean_STD2) <- World_data_clean2$Country
Dissimilarity matrix
Distance <- get_dist(World_data_clean_STD2,
method = "euclidian")
fviz_dist1 <- fviz_dist(Distance,
gradient = list(low = "salmon",
mid = "grey95",
high = "white")) +
theme(axis.text.x = element_text(size = 3),
axis.text.y = element_text(size = 3)) #GPT recommended this to reduce big text
print(fviz_dist1)
Here we have the dissimilarity matrix, which uses Euclidian distances to aid us in visualizing clusters. There is a lot of observations (Countries, n* = 156), so we can’t see the country names clearly. From the matrix I can see at least 3 clusters forming, but it requires further analysis:
Hopkins statistics
Using Hopkin’s statistics we can check if data is clusterable, with 0,5 being the treshold.
Hopkins_stat <- get_clust_tendency(World_data_clean_STD2,
n =nrow(World_data_clean_STD2) -1,
graph = FALSE)
print(Hopkins_stat)
## $hopkins_stat
## [1] 0.6830286
##
## $plot
## NULL
The statistics are at 0.68 > 0.5, meaning we can proceed with our
analysis.
Elbow method
Elbow_method <- fviz_nbclust(World_data_clean_STD2, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
print(Elbow_method)
In the elbow method, there is not a highly evident break in the
slope, but it appears to happen at 3 and 7 number of clusters.
Silhouette analysis
Silhouette_analysis <- fviz_nbclust(World_data_clean_STD2, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
print(Silhouette_analysis)
In the Silhouette analysis we see the highest value at 2. Two other
potential number of clusters are seen at 3 and 7.
Cluster dendrogram
WARD <- World_data_clean_STD2 %>%
get_dist(method = "euclidian") %>%
hclust(method = "ward.D2")
print(WARD)
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 156
Dendrogram <- fviz_dend(WARD,
cex = 0.3) #cex is a ChatGPT recommendation to make text under Dendrogram less clustered
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
print(Dendrogram)
The biggest visible jump in the vertical line in the Dendrogram is
seen at 3 clusters.
Indice method
Indice_method <- NbClust(World_data_clean_STD2,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
print(Indice_method)
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale
## 2 1.3166 73.1674 44.9343 -1.5735 195.3316 35315734948 11930.198 525.3834 3.1311 1.4751 0.4039 1.4501 0.2746 0.9513 5.1717 0.1581
## 3 2.0042 69.2725 27.5202 -0.2824 388.4529 23041609882 7117.389 406.7125 5.4273 1.9055 0.3448 1.3972 0.2695 0.8513 11.1827 0.5389
## 4 1.3647 63.2457 20.7245 0.4435 513.8824 18331678934 5527.242 344.7094 7.3639 2.2483 0.3454 1.3639 0.2507 1.3808 -20.1320 -0.8440
## 5 1.5892 58.6942 7.5026 1.1318 615.2159 14959454904 4127.544 303.3490 9.0857 2.5548 0.3854 1.3663 0.2563 1.9208 -34.5147 -1.4662
## 6 3.1160 50.4525 8.3136 -0.1628 639.4804 18438577563 3618.243 288.9902 9.5448 2.6818 0.3329 1.3624 0.2363 1.4354 -13.9523 -0.9196
## 7 0.0751 45.4546 26.1072 -0.8857 700.8194 16937798874 3464.357 273.8143 11.1882 2.8304 0.3129 1.3345 0.2142 1.5957 -15.6800 -1.1295
## 8 14.0319 49.1849 8.4114 2.1103 812.2501 10829902828 2158.047 232.9906 13.2899 3.3263 0.3940 1.3100 0.2380 1.2300 -6.1715 -0.5644
## 9 0.4673 46.2198 9.0653 1.8704 848.3482 10875136180 1939.257 220.4610 13.9415 3.5154 0.3822 1.3482 0.2247 1.2898 -7.8649 -0.6740
## 10 15.0520 44.3215 5.8537 1.9318 891.5193 10180380410 1769.317 207.6552 14.8895 3.7321 0.3761 1.2708 0.2209 1.0139 -0.4671 -0.0413
## Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 0.3905 262.6917 0.4451 0.1263 0.7159 0.1253 0.0023 1.8367 1.7447 1.2727
## 3 0.3947 135.5708 0.5458 0.8865 1.1204 0.1081 0.0034 1.7352 1.5212 0.9307
## 4 0.3707 86.1773 0.5084 0.2018 1.6782 0.1272 0.0036 1.6230 1.3906 0.5690
## 5 0.3477 60.6698 0.5213 0.9459 1.9563 0.1523 0.0041 1.5588 1.3112 0.4438
## 6 0.3224 48.1650 0.4977 -1.6330 2.2866 0.0510 0.0039 1.7100 1.2835 0.5630
## 7 0.3028 39.1163 0.4533 0.0967 2.7039 0.0473 0.0040 1.8817 1.2448 0.3736
## 8 0.2953 29.1238 0.4725 0.4949 3.1287 0.1271 0.0044 1.8822 1.1630 0.3509
## 9 0.2817 24.4957 0.4566 0.5625 3.4987 0.1109 0.0045 1.7358 1.1291 0.3323
## 10 0.2705 20.7655 0.4459 0.6159 3.7482 0.1109 0.0047 1.7848 1.0938 0.3176
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.6588 52.3182 0.9775
## 3 0.6513 34.2721 0.7467
## 4 0.5995 48.7619 1.0000
## 5 0.5965 48.7016 1.0000
## 6 0.5502 37.6080 1.0000
## 7 0.5399 35.7855 1.0000
## 8 0.5287 29.4215 1.0000
## 9 0.5022 34.6984 1.0000
## 10 0.5094 32.7506 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB Silhouette Duda PseudoT2 Beale
## Number_clusters 10.000 2.0000 7.0000 8.0000 3.0000 3 3.000 3.0000 3.0000 8.0000 7.0000 10.0000 2.0000 2.0000 2.0000 2.0000
## Value_Index 15.052 73.1674 17.7936 2.1103 193.1213 7564194118 4812.809 56.6678 2.2963 -0.3069 0.3129 1.2708 0.2746 0.9513 5.1717 0.1581
## Ratkowsky Ball PtBiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## Number_clusters 3.0000 3.0000 3.0000 1 2.0000 5.0000 0 5.0000 0 10.0000
## Value_Index 0.3947 127.1209 0.5458 NA 0.7159 0.1523 0 1.5588 0 0.3176
##
## $Best.partition
## Switzerland Iceland Denmark Norway Canada Finland
## 3 3 3 3 3 3
## Netherlands Sweden New Zealand Australia Israel Costa Rica
## 3 3 3 3 1 1
## Austria Mexico United States Brazil Luxembourg Ireland
## 3 1 3 1 3 3
## Belgium United Arab Emirates United Kingdom Oman Venezuela Singapore
## 3 3 3 3 1 3
## Panama Germany Chile Qatar France Argentina
## 1 3 1 3 1 1
## Czech Republic Uruguay Colombia Thailand Saudi Arabia Spain
## 1 3 1 3 1 1
## Malta Taiwan Kuwait Suriname Trinidad and Tobago El Salvador
## 3 1 1 1 1 1
## Guatemala Uzbekistan Slovakia Japan South Korea Ecuador
## 1 3 1 1 1 1
## Bahrain Italy Bolivia Moldova Paraguay Kazakhstan
## 3 1 1 1 1 1
## Slovenia Lithuania Nicaragua Peru Belarus Poland
## 1 1 3 1 1 1
## Malaysia Croatia Libya Russia Jamaica North Cyprus
## 1 1 1 1 1 1
## Cyprus Algeria Kosovo Turkmenistan Mauritius Hong Kong
## 1 1 2 3 1 3
## Estonia Indonesia Vietnam Turkey Kyrgyzstan Nigeria
## 1 1 1 1 1 2
## Bhutan Azerbaijan Pakistan Jordan Montenegro China
## 3 1 2 1 1 1
## Zambia Romania Serbia Portugal Latvia Philippines
## 2 1 1 1 1 1
## Somaliland region Morocco Macedonia Mozambique Albania Bosnia and Herzegovina
## 3 2 1 2 1 1
## Lesotho Dominican Republic Laos Mongolia Swaziland Greece
## 2 1 3 1 2 1
## Lebanon Hungary Honduras Tajikistan Tunisia Palestinian Territories
## 1 1 1 2 2 1
## Bangladesh Iran Ukraine Iraq South Africa Ghana
## 2 2 1 2 2 2
## Zimbabwe Liberia India Sudan Haiti Congo (Kinshasa)
## 2 2 2 2 2 2
## Nepal Ethiopia Sierra Leone Mauritania Kenya Djibouti
## 2 2 2 2 2 2
## Armenia Botswana Georgia Malawi Sri Lanka Cameroon
## 1 2 2 2 1 2
## Bulgaria Egypt Yemen Angola Mali Congo (Brazzaville)
## 1 2 2 2 2 2
## Comoros Uganda Senegal Gabon Niger Cambodia
## 2 2 2 2 2 2
## Tanzania Madagascar Chad Guinea Ivory Coast Burkina Faso
## 2 2 2 2 2 2
## Afghanistan Rwanda Benin Syria Burundi Togo
## 2 3 2 2 2 2
The indice method recommends using 3 clusters, which is what we’ve already established with the previous tests.
For further analysis, we will be using 3 clusters.
K-Means clustering
With K-Means clustering, we will start with 3 random points that preform 25 full trials to find the best options.
Clustering <- kmeans(World_data_clean_STD2,
centers = 3,
nstart = 25)
print(Clustering)
## K-means clustering with 3 clusters of sizes 31, 73, 52
##
## Cluster means:
## Family Health_life_expectancy Freedom Trust_GovCorruption Generosity
## 1 0.8028006 0.7727040 1.19108894 1.4387314 1.10542963
## 2 0.3320646 0.4634078 -0.05718025 -0.4072969 -0.40343257
## 3 -0.9447603 -1.1112038 -0.62979997 -0.2859231 -0.09264887
##
## Clustering vector:
## Switzerland Iceland Denmark Norway Canada Finland
## 1 1 1 1 1 1
## Netherlands Sweden New Zealand Australia Israel Costa Rica
## 1 1 1 1 2 2
## Austria Mexico United States Brazil Luxembourg Ireland
## 1 2 1 2 1 1
## Belgium United Arab Emirates United Kingdom Oman Venezuela Singapore
## 1 1 1 1 2 1
## Panama Germany Chile Qatar France Argentina
## 2 1 2 1 2 2
## Czech Republic Uruguay Colombia Thailand Saudi Arabia Spain
## 2 1 2 1 2 2
## Malta Taiwan Kuwait Suriname Trinidad and Tobago El Salvador
## 1 2 2 2 2 2
## Guatemala Uzbekistan Slovakia Japan South Korea Ecuador
## 2 1 2 2 2 2
## Bahrain Italy Bolivia Moldova Paraguay Kazakhstan
## 2 2 2 2 2 2
## Slovenia Lithuania Nicaragua Peru Belarus Poland
## 2 2 1 2 2 2
## Malaysia Croatia Libya Russia Jamaica North Cyprus
## 2 2 2 2 2 2
## Cyprus Algeria Kosovo Turkmenistan Mauritius Hong Kong
## 2 2 3 2 2 1
## Estonia Indonesia Vietnam Turkey Kyrgyzstan Nigeria
## 2 2 2 2 2 3
## Bhutan Azerbaijan Pakistan Jordan Montenegro China
## 1 2 3 2 2 2
## Zambia Romania Serbia Portugal Latvia Philippines
## 3 2 2 2 2 2
## Somaliland region Morocco Macedonia Mozambique Albania Bosnia and Herzegovina
## 1 3 2 3 2 2
## Lesotho Dominican Republic Laos Mongolia Swaziland Greece
## 3 2 1 2 3 2
## Lebanon Hungary Honduras Tajikistan Tunisia Palestinian Territories
## 2 2 2 2 3 2
## Bangladesh Iran Ukraine Iraq South Africa Ghana
## 3 3 2 3 3 3
## Zimbabwe Liberia India Sudan Haiti Congo (Kinshasa)
## 3 3 3 3 3 3
## Nepal Ethiopia Sierra Leone Mauritania Kenya Djibouti
## 3 3 3 3 3 3
## Armenia Botswana Georgia Malawi Sri Lanka Cameroon
## 2 3 3 3 2 3
## Bulgaria Egypt Yemen Angola Mali Congo (Brazzaville)
## 2 3 3 3 3 3
## Comoros Uganda Senegal Gabon Niger Cambodia
## 3 3 3 3 3 3
## Tanzania Madagascar Chad Guinea Ivory Coast Burkina Faso
## 3 3 3 3 3 3
## Afghanistan Rwanda Benin Syria Burundi Togo
## 3 1 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 83.86715 161.73718 160.97704
## (between_SS / total_SS = 47.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
We created 3 clusters with the following sizes:
1st: 73 +
2nd: 52 +
3rd: 31 = 156
Please note the order of 1st, 2nd and 3rd might change due to rerun of the code.
The biggest variability is seen in cluster 1 at 161.73 (73), which tells us it is the most heterogeneous.
The best we could achieve with the data and 3 groups is a ratio of 47.5%, which means that 47.5% of total variability is explained by classification of countries into 3 groups.
Not the best ratio, but it’s the best we were able to get and we’ll work with it.
Cluster plot
Now time to visualize our clusters:
Cluster_plot <- fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
labelsize = 10,
ggtheme = theme_bw(),
data = World_data_clean_STD2)
print(Cluster_plot)
In the cluster plot, we can see our 3 clusters in action. We see that we will need to remove some countries as they are far away from other units in their respective clusters.
I will be removing Somaliland region, Rwanda, Laos (Cluster 1), Greece (Cluster 2) and Cambodia (Cluster 3).
World_data_clean3 <- World_data_clean2 %>%
filter(!Country %in% c("Somaliland region", "Rwanda", "Laos", "Cambodia", "Greece"))
World_data_clean_STD3 <- as.data.frame(scale(World_data_clean3[c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity")]))
rownames(World_data_clean_STD3) <- World_data_clean3$Country
Clustering1 <- kmeans(World_data_clean_STD3,
centers = 3,
nstart = 25)
print(Clustering1)
## K-means clustering with 3 clusters of sizes 54, 27, 70
##
## Cluster means:
## Family Health_life_expectancy Freedom Trust_GovCorruption Generosity
## 1 -0.9476684 -1.0167833 -0.76248707 -0.2751518 -0.1594749
## 2 0.9492764 0.9299521 1.27992474 1.5011733 1.1708433
## 3 0.3649090 0.4256799 0.09451905 -0.3667640 -0.3285875
##
## Clustering vector:
## Switzerland Iceland Denmark Norway Canada Finland
## 2 2 2 2 2 2
## Netherlands Sweden New Zealand Australia Israel Costa Rica
## 2 2 2 2 3 3
## Austria Mexico United States Brazil Luxembourg Ireland
## 2 3 2 3 2 2
## Belgium United Arab Emirates United Kingdom Oman Venezuela Singapore
## 2 2 2 2 3 2
## Panama Germany Chile Qatar France Argentina
## 3 2 3 2 3 3
## Czech Republic Uruguay Colombia Thailand Saudi Arabia Spain
## 3 2 3 2 3 3
## Malta Taiwan Kuwait Suriname Trinidad and Tobago El Salvador
## 2 3 3 3 3 3
## Guatemala Uzbekistan Slovakia Japan South Korea Ecuador
## 3 2 3 3 3 3
## Bahrain Italy Bolivia Moldova Paraguay Kazakhstan
## 3 3 3 3 3 3
## Slovenia Lithuania Nicaragua Peru Belarus Poland
## 3 3 3 3 3 3
## Malaysia Croatia Libya Russia Jamaica North Cyprus
## 3 3 3 3 3 3
## Cyprus Algeria Kosovo Turkmenistan Mauritius Hong Kong
## 3 3 1 3 3 2
## Estonia Indonesia Vietnam Turkey Kyrgyzstan Nigeria
## 3 3 3 3 3 1
## Bhutan Azerbaijan Pakistan Jordan Montenegro China
## 2 3 1 3 1 3
## Zambia Romania Serbia Portugal Latvia Philippines
## 1 3 3 3 3 3
## Morocco Macedonia Mozambique Albania Bosnia and Herzegovina Lesotho
## 1 3 1 3 1 1
## Dominican Republic Mongolia Swaziland Lebanon Hungary Honduras
## 3 3 1 3 3 3
## Tajikistan Tunisia Palestinian Territories Bangladesh Iran Ukraine
## 3 1 1 1 1 3
## Iraq South Africa Ghana Zimbabwe Liberia India
## 1 1 1 1 1 1
## Sudan Haiti Congo (Kinshasa) Nepal Ethiopia Sierra Leone
## 1 1 1 1 1 1
## Mauritania Kenya Djibouti Armenia Botswana Georgia
## 1 1 1 1 1 1
## Malawi Sri Lanka Cameroon Bulgaria Egypt Yemen
## 1 3 1 3 1 1
## Angola Mali Congo (Brazzaville) Comoros Uganda Senegal
## 1 1 1 1 3 1
## Gabon Niger Tanzania Madagascar Chad Guinea
## 1 1 1 1 1 1
## Ivory Coast Burkina Faso Afghanistan Benin Syria Burundi
## 1 1 1 1 1 1
## Togo
## 1
##
## Within cluster sum of squares by cluster:
## [1] 169.97585 62.26983 147.19870
## (between_SS / total_SS = 49.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
With the 5 countries removed, the K-Means clustering tells us the ratio (between_SS / total_SS) is now 49.4%, which is better than the previous 47.5%. Our sample now includes 151 countries.
Cluster_plot1 <- fviz_cluster(Clustering1,
palette = "Set1",
repel = FALSE,
labelsize = 10,
ggtheme = theme_bw(),
data = World_data_clean_STD3)
print(Cluster_plot1)
We see the clusters look better after our manipulation. There is a slight overlap seen between clusters 1 and 3, but this poses no problem. Let’s continue:
Cluster profiles
averages <- Clustering1$centers
print(averages)
## Family Health_life_expectancy Freedom Trust_GovCorruption Generosity
## 1 -0.9476684 -1.0167833 -0.76248707 -0.2751518 -0.1594749
## 2 0.9492764 0.9299521 1.27992474 1.5011733 1.1708433
## 3 0.3649090 0.4256799 0.09451905 -0.3667640 -0.3285875
figure <- as.data.frame(averages)
figure$id <- 1:nrow(figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
## The following object is masked from 'package:pastecs':
##
## extract
## The following object is masked from 'package:reshape2':
##
## smiths
figure <- pivot_longer(figure, cols = c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity"))
figure$Group <- factor(figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
figure$ImeF <- factor(figure$name,
levels = c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity"),
labels = c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity"))
ggplot1 <- ggplot(figure, aes(x = ImeF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
scale_color_brewer(palette = "Set1") +
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
print(ggplot1)
World_data_clean3$Group <- Clustering1$cluster
In the cluster profiles we can see how the 3 clusters are rated on average in the Family, Health & life expectancy, Freedom, Trust in government and Generosity variables.
We see cluster 2 ranking above average in every variable. Cluster 3 has an above average Family and Health & life expectancy contribution to the calculation of the Happiness Score, a slightly above average Freedom contribution, but a less than average contribution in Trust in Government and Generosity. Cluster 1 however, deviates negatively from the mean in every variable.
These values make sense as we can see the countries in Cluster 1 are generally less developed countries, and Cluster 2 generally has highly developed countries, with some exceptions.
In the last r chunk I also assigned the countries in the dataset groups by adding a “Group” column (1, 2 or 3).
Checking the appropriateness of the cluster variables used.
fit <- aov(cbind(Family, Health_life_expectancy, Freedom, Trust_GovCorruption, Generosity) ~ as.factor(Group),
data = World_data_clean3)
fit_summary <- summary(fit)
print(fit_summary)
## Response Family :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.6431 2.82154 89.59 < 2.2e-16 ***
## Residuals 148 4.6611 0.03149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Health_life_expectancy :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.5678 2.78390 116.92 < 2.2e-16 ***
## Residuals 148 3.5238 0.02381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Freedom :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.6909 0.84547 76.512 < 2.2e-16 ***
## Residuals 148 1.6354 0.01105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Trust_GovCorruption :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.98651 0.49326 72.727 < 2.2e-16 ***
## Residuals 148 1.00378 0.00678
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Generosity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.61662 0.308308 32.674 1.765e-12 ***
## Residuals 148 1.39650 0.009436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using ANOVA we’ve checked the appropriateness for cluster variables and we tested the following:
Family:
Health & life expectancy:
Freedom:
Trust in Government:
Generosity:
This tells us that all of the groups (Group 1, Group 2 and Group 3) are statistically different in the mean values of the cluster variables (5). There is a significant difference between the formed groups, all of the p-values are very close to 0, which further proves the selection of the variables is appropriate.
Checking criterion validity
Now we have to check the variables that were not used for the clustering, which in our case is GDP per Capita. The only other variable left is Happiness Score, but that is directly correlated to and measured by the chosen 5 variables.
aggregate1 <- aggregate(World_data_clean3$GDP_PerCapita,
by = list(World_data_clean3$Group),
FUN = mean)
print(aggregate1)
## Group.1 x
## 1 1 0.4910052
## 2 2 1.2927037
## 3 3 0.9901024
As expected, we see group 2 has the highest GDP per Capita at 1.29, and Group 1 has the lowest GDP per Capita.
With Levene test we’ll be looking at Homogenity of variance:
leveneTest(World_data_clean3$GDP_PerCapita, as.factor(World_data_clean3$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 4.9078 0.008635 **
## 148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We tested the following:
As H0 was rejected, we know that at least one variance of the Groups is different than the rest.
World_data_clean3 %>%
group_by(Group) %>%
shapiro_test(GDP_PerCapita)
## # A tibble: 3 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 GDP_PerCapita 0.962 0.0884
## 2 2 GDP_PerCapita 0.840 0.000723
## 3 3 GDP_PerCapita 0.967 0.0608
We tested the following for each group:
Group 1
Group 2
Group 3
Because the Homogenity of variances is violated as well as normality of GDP per Capita is violated in Group 2, we have to use the Kruskal-Wallis rank sum test (non-parametric test).
kruskal <- kruskal.test(GDP_PerCapita ~ Group,
data = World_data_clean3)
print(kruskal)
##
## Kruskal-Wallis rank sum test
##
## data: GDP_PerCapita by Group
## Kruskal-Wallis chi-squared = 89.489, df = 2, p-value < 2.2e-16
With Kruskal-Wallis we tested the following:
This means that at least one is different, meaning we have validated the validation and our clusters are trustworthy.
kruskal_eff <- kruskal_effsize(GDP_PerCapita ~ Group,
data = World_data_clean3)
print(kruskal_eff)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 GDP_PerCapita 151 0.591 eta2[H] large
The Kruskal-Wallis test showed a large, statistically significant difference in GDP per Capita across groups, meaning group membership strongly influences GDP per Capita distribution.
Based on our analysis, we created 3 distinct groups (Clusters) using 5 standardized variables: Family, Health & life expectancy, Freedom, Trust in government and Generosity (All contributing variables to calculation of the Happiness Score) for 151 of our observed countries based on our cleaned data from the World Happiness Report from 2015.
Cluster 1 (Group 1): (54 of 151 = 35,76%) represents countries that are less developed and tend to have below-average values across all variables. The averages for “Family,” “Health & life expectancy,” “Freedom,” “Trust in Government,” and “Generosity” are all negative, highlighting challenges these countries face in factors contributing to happiness. These countries are characterized by weaker support systems, lower life expectancy, limited freedom, lack of trust in government institutions, and reduced levels of generosity. Furthermore, the GDP per Capita for this group is the lowest among the three clusters (0.491), reinforcing the idea that these countries generally have limited economic resources. Cluster 1 likely consists of countries from lower-income regions, struggling with systemic issues that directly and indirectly impact happiness and well-being.
Cluster 2 (Group 2): (27 of 151 = 17,88%) includes highly developed countries with above-average performance in every variable. These countries rank positively in “Family,” “Health & life expectancy,” “Freedom,” “Trust in Government,” and “Generosity,” indicating a robust support system, strong institutional trust, and economic affluence. The GDP per Capita for this group is the highest among the three clusters (1.29), reflecting significant economic prosperity. These nations are characterized by their advanced healthcare systems, social safety nets, democratic values, and community-driven support. Cluster 2 represents countries where citizens enjoy a higher quality of life and have the resources to lead fulfilling, secure, and generous lives.
Cluster 3 (Group 3): (70 of 151 = 46,36%) lies between the other two clusters in terms of performance. Countries in this group exhibit slightly above-average values in “Family,” “Health & life expectancy,” and “Freedom,” but they fall below average in “Trust in Government” and “Generosity.” This suggests that while these countries maintain decent social structures, healthcare, and personal freedoms, there may be issues with institutional transparency or corruption and a slightly lower emphasis on altruistic behavior. The GDP per Capita for this group (0.99) is moderate, sitting between Cluster 1 and Cluster 2. Overall, these countries are relatively well-off but may still face challenges in fostering trust and community engagement, which affects overall happiness.