library(readxl)
mydata <- read_xlsx("/cloud/project/Student depression/2015.xlsx")
mydata <- as.data.frame(mydata)
colnames(mydata) [1] <- "Country"
colnames(mydata) [2] <- "Region"
colnames(mydata) [3] <- "ID"
colnames(mydata) [4] <- "Happiness_Score"
colnames(mydata) [5] <- "GDP_per_capita"
colnames(mydata) [6] <- "Family"
colnames(mydata) [7] <- "Life_Expectancy"
colnames(mydata) [8] <- "Freedom"
colnames(mydata) [9] <- "Trust"
colnames(mydata) [10] <- "Generosity"
colnames(mydata) [11] <- "Dystopia"
head(mydata)
## Country Region ID Happiness_Score GDP_per_capita Family
## 1 Switzerland Western Europe 1 7.587 1.39651 1.34951
## 2 Iceland Western Europe 2 7.561 1.30232 1.40223
## 3 Denmark Western Europe 3 7.527 1.32548 1.36058
## 4 Norway Western Europe 4 7.522 1.45900 1.33095
## 5 Canada North America 5 7.427 1.32629 1.32261
## 6 Finland Western Europe 6 7.406 1.29025 1.31826
## Life_Expectancy Freedom Trust Generosity Dystopia
## 1 0.94143 0.66557 0.41978 0.29678 2.51738
## 2 0.94784 0.62877 0.14145 0.43630 2.70201
## 3 0.87464 0.64938 0.48357 0.34139 2.49204
## 4 0.88521 0.66973 0.36503 0.34699 2.46531
## 5 0.90563 0.63297 0.32957 0.45811 2.45176
## 6 0.88911 0.64169 0.41372 0.23351 2.61955
Source: Kaggle
Unit of Observation: Each row represents a country.
Sample Size: 158 countries (rows in the dataset).
Variables Analyzed:
Economy (GDP per Capita): GDP per capita (logarithmic scale).
Family: Social support metric.
Health (Life Expectancy): Life expectancy in years.
Freedom: Perceived freedom to make life choices.
Trust (Government Corruption): Perception of government corruption.
Generosity: Willingness to donate to others.
Dystopia Residual: Baseline metric for unhappiness.
summary(mydata[ , c(-1, -2, -3)])
## Happiness_Score GDP_per_capita Family Life_Expectancy
## Min. :2.839 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:4.526 1st Qu.:0.5458 1st Qu.:0.8568 1st Qu.:0.4392
## Median :5.232 Median :0.9102 Median :1.0295 Median :0.6967
## Mean :5.376 Mean :0.8461 Mean :0.9910 Mean :0.6303
## 3rd Qu.:6.244 3rd Qu.:1.1584 3rd Qu.:1.2144 3rd Qu.:0.8110
## Max. :7.587 Max. :1.6904 Max. :1.4022 Max. :1.0252
## Freedom Trust Generosity Dystopia
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.3286
## 1st Qu.:0.3283 1st Qu.:0.06168 1st Qu.:0.1506 1st Qu.:1.7594
## Median :0.4355 Median :0.10722 Median :0.2161 Median :2.0954
## Mean :0.4286 Mean :0.14342 Mean :0.2373 Mean :2.0990
## 3rd Qu.:0.5491 3rd Qu.:0.18025 3rd Qu.:0.3099 3rd Qu.:2.4624
## Max. :0.6697 Max. :0.55191 Max. :0.7959 Max. :3.6021
The average happiness score across all countries is 5.421, the median is slightly lower than the mean, suggesting that the distribution of happiness scores might be slightly skewed to the right.
The lowest happiness score is 2.839, and the highest is 7.587, showing a wide range of happiness across countries.
The average GDP per capita contribution to happiness is 0.8601.
The median generosity score is 0.216, indicating that half of the countries have a generosity score below this value and the other half above it
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")]))
mydata$Dissimilarity <- sqrt(mydata_clu_std$Family^2 + mydata_clu_std$`Life_Expectancy`^2 + mydata_clu_std$Freedom^2 + mydata_clu_std$Trust^2 + mydata_clu_std$Generosity^2 + mydata_clu_std$Dystopia^2)
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")])
## ID Dissimilarity
## 129 129 4.747948
## 156 156 4.558639
## 154 154 4.552705
## 148 148 4.485362
## 72 72 3.956991
## 3 3 3.764949
I have identified ID129, ID156, ID154 and ID148 as potential outliers, as there is a big jump in disimilarity numbers between units. I have decided to remove these units.
print(mydata[c(129, 156, 154, 148), ])
## Country Region ID
## 129 Myanmar Southeastern Asia 129
## 156 Syria Middle East and Northern Africa 156
## 154 Rwanda Sub-Saharan Africa 154
## 148 Central African Republic Sub-Saharan Africa 148
## Happiness_Score GDP_per_capita Family Life_Expectancy Freedom Trust
## 129 4.307 0.27108 0.70905 0.48246 0.44017 0.19034
## 156 3.006 0.66320 0.47489 0.72193 0.15684 0.18906
## 154 3.465 0.22208 0.77370 0.42864 0.59201 0.55191
## 148 3.678 0.07850 0.00000 0.06699 0.48879 0.08289
## Generosity Dystopia Dissimilarity
## 129 0.79588 1.41805 4.747948
## 156 0.47179 0.32858 4.558639
## 154 0.22628 0.67042 4.552705
## 148 0.23835 2.72230 4.485362
mydata <- mydata %>%
filter(!ID %in% c(129, 156, 154, 148))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")]))
rownames(mydata_clu_std) <- mydata$Country
Distances <- get_dist(mydata_clu_std,
method = "euclidian")
fviz_dist(Distances, #Showing matrix of distances
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
library(factoextra)
get_clust_tendency(mydata_clu_std, #Hopkins statistics
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6641082
##
## $plot
## NULL
Hopkins statistics are 0.66, my data is clusterable as it is above 0.5. With the help of Hierarhical clustering (dendrogram) and K-Means clustering (Elbow method and Silhouette analysis) I will now determine how many clusters to use.
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 154
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on the dendrogram, I would choose 2 or 3 clusters, as there is the biggest jump in vertical line.
library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 154
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
With the elbow method the slope changes most evidently at 3, therefore I’d choose 3 clusters.
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
The higest value of the Silhouette analysis is at 2.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 4 proposed 9 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.6663 58.0809 32.7492 -1.5670 191.4581 4.709559e+12 12898.695 664.2014
## 3 2.4828 51.3320 19.4393 -0.6937 384.4317 3.026628e+12 8378.690 546.4629
## 4 3.3715 44.8079 12.8153 -1.2094 502.4637 2.500203e+12 7152.162 484.1367
## 5 0.1162 39.4164 20.4654 -1.8145 590.9604 2.199005e+12 6536.666 446.0300
## 6 1.9507 39.6891 13.6274 0.0982 653.7504 2.106275e+12 4666.327 392.1654
## 7 2.4432 38.1315 9.0788 0.8676 755.7490 1.478301e+12 4072.358 359.1005
## 8 0.5142 35.7548 10.5778 0.8516 789.7330 1.548491e+12 3231.614 338.2123
## 9 7.8805 34.6355 5.7791 1.3620 842.1439 1.394473e+12 2612.292 315.3640
## 10 0.5265 32.4310 5.7777 0.9888 870.6843 1.430337e+12 2483.474 303.2766
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 2.9644 1.3821 0.4536 1.5497 0.2499 0.8298 20.7196 0.7812 0.3395
## 3 5.4668 1.6799 0.4217 1.6095 0.2166 1.1427 -9.6160 -0.4726 0.3477
## 4 7.3087 1.8962 0.4005 1.5968 0.2070 1.0469 -2.6000 -0.1692 0.3238
## 5 8.5477 2.0582 0.3859 1.6463 0.1948 1.8851 -31.9279 -1.7613 0.3169
## 6 9.0831 2.3408 0.3733 1.4245 0.2005 1.4051 -11.8205 -1.0710 0.3079
## 7 11.1795 2.5564 0.3908 1.5612 0.1980 2.0770 -22.8153 -1.9043 0.2935
## 8 11.6042 2.7143 0.3828 1.4400 0.1993 1.4646 -11.4194 -1.1735 0.2803
## 9 12.4236 2.9109 0.3649 1.3892 0.2035 2.1703 -18.8735 -1.9655 0.2699
## 10 13.0009 3.0269 0.3607 1.4309 0.1868 1.1266 -3.1465 -0.4035 0.2585
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 332.1007 0.4553 0.5515 0.6330 0.1318 0.0025 1.5878 1.9959 1.8120
## 3 182.1543 0.4870 0.5910 1.3499 0.1224 0.0029 1.5813 1.8033 1.0185
## 4 121.0342 0.4748 0.5322 1.8308 0.1224 0.0031 1.5262 1.6878 0.5972
## 5 89.2060 0.4545 0.5445 2.4749 0.0978 0.0034 1.5611 1.6129 0.5406
## 6 65.3609 0.4355 0.1703 3.0153 0.1229 0.0034 1.4491 1.5135 0.4381
## 7 51.3001 0.4380 0.5377 3.6951 0.1155 0.0039 1.5478 1.4557 0.4376
## 8 42.2765 0.4215 0.1287 4.1858 0.1246 0.0039 1.4182 1.4093 0.3952
## 9 35.0404 0.4219 0.6363 4.5788 0.1388 0.0042 1.4448 1.3593 0.3741
## 10 30.3277 0.4087 0.3902 4.9934 0.1005 0.0043 1.4130 1.3324 0.3638
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7212 39.0422 0.5849
## 3 0.6757 36.9584 1.0000
## 4 0.6579 30.1615 1.0000
## 5 0.6256 40.7028 1.0000
## 6 0.5802 29.6656 1.0000
## 7 0.5356 38.1467 1.0000
## 8 0.5632 27.9172 1.0000
## 9 0.5097 33.6684 1.0000
## 10 0.4643 32.2998 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 9.0000 2.0000 3.00 9.000 3.0000 3.000000e+00 3.000
## Value_Index 7.8805 58.0809 13.31 1.362 192.9736 1.156505e+12 4520.005
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 3.0000 10.0000 9.0000 2.0000 2.0000
## Value_Index 55.4122 2.5024 -0.0815 0.3607 1.3892 0.2499 0.8298
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 3.000 1 2.000
## Value_Index 20.7196 0.7812 0.3477 149.9464 0.487 NA 0.633
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 9.0000 0 10.000 0 10.0000
## Value_Index 0.1388 0 1.413 0 0.3638
##
## $Best.partition
## Switzerland Iceland Denmark
## 3 3 3
## Norway Canada Finland
## 3 3 3
## Netherlands Sweden New Zealand
## 3 3 3
## Australia Israel Costa Rica
## 3 1 1
## Austria Mexico United States
## 3 1 3
## Brazil Luxembourg Ireland
## 1 3 3
## Belgium United Arab Emirates United Kingdom
## 3 3 3
## Oman Venezuela Singapore
## 3 1 3
## Panama Germany Chile
## 1 3 1
## Qatar France Argentina
## 3 1 1
## Czech Republic Uruguay Colombia
## 1 3 1
## Thailand Saudi Arabia Spain
## 3 1 1
## Malta Taiwan Kuwait
## 3 1 3
## Suriname Trinidad and Tobago El Salvador
## 1 1 1
## Guatemala Uzbekistan Slovakia
## 1 3 1
## Japan South Korea Ecuador
## 1 1 1
## Bahrain Italy Bolivia
## 3 1 1
## Moldova Paraguay Kazakhstan
## 1 3 1
## Slovenia Lithuania Nicaragua
## 1 1 3
## Peru Belarus Poland
## 1 1 1
## Malaysia Croatia Libya
## 3 1 1
## Russia Jamaica North Cyprus
## 1 1 1
## Cyprus Algeria Kosovo
## 1 1 2
## Turkmenistan Mauritius Hong Kong
## 3 3 3
## Estonia Indonesia Vietnam
## 1 3 1
## Turkey Kyrgyzstan Nigeria
## 1 1 2
## Bhutan Azerbaijan Pakistan
## 3 1 2
## Jordan Montenegro China
## 1 1 1
## Zambia Romania Serbia
## 2 1 1
## Portugal Latvia Philippines
## 1 1 3
## Somaliland region Morocco Macedonia
## 3 2 1
## Mozambique Albania Bosnia and Herzegovina
## 2 1 1
## Lesotho Dominican Republic Laos
## 2 3 3
## Mongolia Swaziland Greece
## 1 2 1
## Lebanon Hungary Honduras
## 1 1 1
## Tajikistan Tunisia Palestinian Territories
## 2 2 1
## Bangladesh Iran Ukraine
## 2 2 1
## Iraq South Africa Ghana
## 2 2 2
## Zimbabwe Liberia India
## 2 2 2
## Sudan Haiti Congo (Kinshasa)
## 2 2 2
## Nepal Ethiopia Sierra Leone
## 2 2 2
## Mauritania Kenya Djibouti
## 2 2 2
## Armenia Botswana Georgia
## 1 2 2
## Malawi Sri Lanka Cameroon
## 2 3 2
## Bulgaria Egypt Yemen
## 1 2 2
## Angola Mali Congo (Brazzaville)
## 2 2 2
## Comoros Uganda Senegal
## 2 2 2
## Gabon Niger Cambodia
## 2 2 3
## Tanzania Madagascar Chad
## 2 2 2
## Guinea Ivory Coast Burkina Faso
## 2 2 2
## Afghanistan Benin Burundi
## 2 2 2
## Togo
## 2
The final method is telling us that the best number of clusters to use is 3, therefore I will proceed with 3 clusters. I have tried to test with 2 clusters (based on Elbow method and Silhouette analysis), between_SS / total_SS ratio was much lower than in the case if 3 clusters are formed.
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
I will remove Cambodia and Greece as they are outliers.
mydata <- mydata %>%
filter(!ID %in% c(144,102))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")]))
rownames(mydata_clu_std) <- mydata$Country
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
mydata <- mydata %>%
filter(!ID %in% c(129))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")]))
rownames(mydata_clu_std) <- mydata$Country
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
With the help of Principal Component Analysis around 59.2% (18.2% + 41%) of information is showed when combining the 5 variables into 2 dimensions.
Clustering
## K-means clustering with 3 clusters of sizes 29, 53, 69
##
## Cluster means:
## Family Life_Expectancy Freedom Trust Generosity Dystopia
## 1 0.8446439 0.8307036 1.22703520 1.4587738 1.2082289 0.08699786
## 2 -0.9212850 -1.0423774 -0.76549278 -0.3046756 -0.2130384 -0.13800652
## 3 0.3526584 0.4515304 0.07227676 -0.3790817 -0.3441681 0.06944069
##
## Clustering vector:
## Switzerland Iceland Denmark
## 1 1 1
## Norway Canada Finland
## 1 1 1
## Netherlands Sweden New Zealand
## 1 1 1
## Australia Israel Costa Rica
## 1 3 3
## Austria Mexico United States
## 1 3 1
## Brazil Luxembourg Ireland
## 3 1 1
## Belgium United Arab Emirates United Kingdom
## 1 1 1
## Oman Venezuela Singapore
## 1 3 1
## Panama Germany Chile
## 3 1 3
## Qatar France Argentina
## 1 3 3
## Czech Republic Uruguay Colombia
## 3 1 3
## Thailand Saudi Arabia Spain
## 1 3 3
## Malta Taiwan Kuwait
## 1 3 3
## Suriname Trinidad and Tobago El Salvador
## 3 3 3
## Guatemala Uzbekistan Slovakia
## 3 1 3
## Japan South Korea Ecuador
## 3 3 3
## Bahrain Italy Bolivia
## 3 3 3
## Moldova Paraguay Kazakhstan
## 3 3 3
## Slovenia Lithuania Nicaragua
## 3 3 3
## Peru Belarus Poland
## 3 3 3
## Malaysia Croatia Libya
## 3 3 3
## Russia Jamaica North Cyprus
## 3 3 3
## Cyprus Algeria Kosovo
## 3 3 2
## Turkmenistan Mauritius Hong Kong
## 3 3 1
## Estonia Indonesia Vietnam
## 3 3 3
## Turkey Kyrgyzstan Nigeria
## 3 3 2
## Bhutan Azerbaijan Pakistan
## 1 3 2
## Jordan Montenegro China
## 3 2 3
## Zambia Romania Serbia
## 2 3 3
## Portugal Latvia Philippines
## 3 3 3
## Somaliland region Morocco Macedonia
## 1 2 3
## Mozambique Albania Bosnia and Herzegovina
## 2 3 2
## Lesotho Dominican Republic Laos
## 2 3 1
## Mongolia Swaziland Lebanon
## 3 2 3
## Hungary Honduras Tajikistan
## 3 3 3
## Tunisia Palestinian Territories Bangladesh
## 2 2 2
## Iran Ukraine Iraq
## 2 3 2
## South Africa Ghana Zimbabwe
## 2 2 2
## Liberia India Sudan
## 2 2 2
## Haiti Congo (Kinshasa) Nepal
## 2 2 2
## Ethiopia Sierra Leone Mauritania
## 2 2 2
## Kenya Djibouti Armenia
## 2 2 2
## Botswana Georgia Sri Lanka
## 2 2 3
## Cameroon Bulgaria Egypt
## 2 3 2
## Yemen Angola Mali
## 2 2 2
## Congo (Brazzaville) Comoros Uganda
## 2 2 2
## Senegal Gabon Niger
## 2 2 2
## Tanzania Madagascar Chad
## 2 2 2
## Guinea Ivory Coast Burkina Faso
## 2 2 2
## Afghanistan Benin Burundi
## 2 2 2
## Togo
## 2
##
## Within cluster sum of squares by cluster:
## [1] 93.17516 205.76961 229.03003
## (between_SS / total_SS = 41.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Family Life_Expectancy Freedom Trust Generosity Dystopia
## 1 0.8446439 0.8307036 1.22703520 1.4587738 1.2082289 0.08699786
## 2 -0.9212850 -1.0423774 -0.76549278 -0.3046756 -0.2130384 -0.13800652
## 3 0.3526584 0.4515304 0.07227676 -0.3790817 -0.3441681 0.06944069
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
Figure <- pivot_longer(Figure, cols = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia"),
labels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
Group 1 (High Performers in Happiness)
Countries in this cluster show above-average values for most variables, such as Family, Life Expectancy, Freedom, Trust, and Generosity.
These are countries with higher happiness scores, robust economic performance (GDP per capita), and strong social support systems. Examples: Switzerland, Iceland, Denmark, Norway, and Canada.
Group 2 (Low Performers in Happiness)
This group has below-average scores for most variables, with the lowest scores for Family and Life Expectancy.
These countries struggle with socioeconomic challenges, weaker institutional support, and lower trust levels.
Examples: Countries like Zimbabwe, Afghanistan, and Haiti.
Group 3 (Moderate Performers in Happiness)
Countries in this cluster have mixed performance. Some variables are close to the average, while others (e.g., Trust and Generosity) are below average.
These countries are transitioning economies with moderate happiness scores and average institutional support.
Examples: Countries like Brazil, Chile, and Slovenia.
mydata$Group <- Clustering$cluster
#Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(Family, Life_Expectancy, Freedom, Trust, Generosity, Dystopia) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Family :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 4.8214 2.41069 72.545 < 2.2e-16 ***
## Residuals 148 4.9181 0.03323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Life_Expectancy :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.4769 2.73846 116.29 < 2.2e-16 ***
## Residuals 148 3.4853 0.02355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Freedom :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.6430 0.82148 74.158 < 2.2e-16 ***
## Residuals 148 1.6395 0.01108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Trust :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.0512 0.52560 77.119 < 2.2e-16 ***
## Residuals 148 1.0087 0.00682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Generosity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.72508 0.36254 40.331 1.045e-14 ***
## Residuals 148 1.33040 0.00899
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Dystopia :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.418 0.20893 0.7785 0.461
## Residuals 148 39.719 0.26837
Response for Family:
H0: μ(Family, G1) = μ(Family, G2) = μ(Family, G3)
H1: At least one μ(Family, j) is different.
We can reject H0 at p < 0.001. We can reject H0 for all cluster variables at p < 0.001, but for Dystopia.
We cannot reject H0, since p=0.461. The arithmetic mean of the variable Dystopia in one group is not statistically different from the arithmetic means in the other groups for the same variable. We would need additional clustering variables to enhance the differentiation of clusters.
aggregate(mydata$GDP_per_capita,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 1.2304183
## 2 2 0.4914351
## 3 3 1.0013935
On average the 1 group has the biggest GDP per capita.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$GDP_per_capita, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 2.238 0.1103
## 148
H0: σ2 (GDP_per_capita, G1) = σ2 (GDP_per_capita, G2) = σ2 (GDP_per_capita, G3)
H1: At least one σ2 (GDP_per_capita, j) is different.
We cannot reject H0. We can assume homogeneity of variance.
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(GDP_per_capita)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 GDP_per_capita 0.789 0.0000527
## 2 2 GDP_per_capita 0.958 0.0575
## 3 3 GDP_per_capita 0.981 0.388
H0: GDP per capita is normally distributed in G1.
H1: GDP per capita is not normally distributed in G1.
We reject H0 at p < 0.001.
H0: GDP per capita is normally distributed in G2.
H1: GDP per capita is not normally distributed in G2.
We cannot reject H0
H0: GDP per capita is normally distributed in G3.
H1: GDP per capita is not normally distributed in G3.
We cannot reject H0.
Our result is not validated.
Kruskal-Wallis:
H0: The location distribution of GDP per capita is the same in all groups
H1: The location distribution of GDP per capita is different in at least on of the groups
kruskal.test(GDP_per_capita ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: GDP_per_capita by as.factor(Group)
## Kruskal-Wallis chi-squared = 81.992, df = 2, p-value < 2.2e-16
We reject H0 at p<0.001. We can’t say that groups significantly differ in GDP per capita.
chi_square <- chisq.test(mydata$Region, as.factor(mydata$Group))
## Warning in chisq.test(mydata$Region, as.factor(mydata$Group)): Chi-squared
## approximation may be incorrect
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$Region and as.factor(mydata$Group)
## X-squared = 154.09, df = 18, p-value < 2.2e-16
H0:There is no association between Region and classification of countries into 3 groups.
H1: There is association between Region and classification of countries into 3 groups.
We reject H0 at p<0.001.
addmargins(chi_square$observed)
## as.factor(mydata$Group)
## mydata$Region 1 2 3 Sum
## Australia and New Zealand 2 0 0 2
## Central and Eastern Europe 1 5 23 29
## Eastern Asia 1 0 5 6
## Latin America and Caribbean 1 1 20 22
## Middle East and Northern Africa 3 7 9 19
## North America 2 0 0 2
## Southeastern Asia 3 0 4 7
## Southern Asia 1 5 1 7
## Sub-Saharan Africa 1 35 1 37
## Western Europe 14 0 6 20
## Sum 29 53 69 151
addmargins(round(chi_square$expected, 2))
## as.factor(mydata$Group)
## mydata$Region 1 2 3 Sum
## Australia and New Zealand 0.38 0.70 0.91 1.99
## Central and Eastern Europe 5.57 10.18 13.25 29.00
## Eastern Asia 1.15 2.11 2.74 6.00
## Latin America and Caribbean 4.23 7.72 10.05 22.00
## Middle East and Northern Africa 3.65 6.67 8.68 19.00
## North America 0.38 0.70 0.91 1.99
## Southeastern Asia 1.34 2.46 3.20 7.00
## Southern Asia 1.34 2.46 3.20 7.00
## Sub-Saharan Africa 7.11 12.99 16.91 37.01
## Western Europe 3.84 7.02 9.14 20.00
## Sum 28.99 53.01 68.99 150.99
Most of expected frequencies aren’t larger than 5. The test is invalid.
round(chi_square$res, 2)
## as.factor(mydata$Group)
## mydata$Region 1 2 3
## Australia and New Zealand 2.61 -0.84 -0.96
## Central and Eastern Europe -1.94 -1.62 2.68
## Eastern Asia -0.14 -1.45 1.36
## Latin America and Caribbean -1.57 -2.42 3.14
## Middle East and Northern Africa -0.34 0.13 0.11
## North America 2.61 -0.84 -0.96
## Southeastern Asia 1.43 -1.57 0.45
## Southern Asia -0.30 1.62 -1.23
## Sub-Saharan Africa -2.29 6.11 -3.87
## Western Europe 5.18 -2.65 -1.04
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
effectsize::cramers_v(mydata$Region, mydata$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.68 | [0.52, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.68)
## [1] "very large"
## (Rules: funder2019)
We can’t use residuals, since not all assumptions are met.
I clustered 151 countries based on six standardized variables: Family, Life Expectancy, Freedom, Trust, Generosity, and Dystopia.
We clustered 151 countries based on six standardized variables: Family, Life Expectancy, Freedom, Trust, Generosity, and Dystopia. Our analysis revealed distinct patterns and relationships across the three identified clusters, supported by various statistical tests.
Cluster 1 (High Performers in Happiness): This group consists of 29 countries (19.2%) that exhibit above-average values for most variables. These countries tend to have high GDP per capita and strong social and institutional support systems. Our statistical tests confirmed that this cluster had the highest average GDP per capita, and the distribution of variables like Family, Life Expectancy, and Freedom differed significantly from the other clusters. However, we could not find a significant difference in the Dystopia variable between clusters, indicating it may not be a strong differentiator.
Cluster 2 (Low performers in Happiness): This cluster, with 53 countries (35.1%), includes those facing substantial challenges across variables. These countries report below-average values for most variables, particularly Family and Life Expectancy. While Kruskal-Wallis tests showed that GDP per capita significantly differs between clusters, the Shapiro-Wilk test revealed that GDP per capita is not normally distributed in this cluster, which slightly limits our ability to draw broader conclusions. Despite this, the overall trend highlights the need for interventions to address systemic issues in this cluster.
Cluster 3 (Moderate Performers in Happiness): Representing 69 countries (45.7%), this cluster contains countries with average performance across variables. While their values are closer to the global mean, Trust and Generosity are slightly lower, indicating areas for improvement. Statistical analysis confirmed that this cluster occupies a middle ground in terms of GDP per capita and other variables, suggesting that these countries may benefit from targeted efforts to improve governance and social trust.
Overall, the clustering analysis highlights significant disparities in happiness-related variables among countries, suggesting targeted approaches to address the unique challenges faced by each cluster. Solutions tailored to specific clusters—such as increasing trust and infrastructure in Cluster 2 or sustaining progress in Cluster 1—could help reduce global inequalities and promote well-being. While most findings were robust and statistically validated, certain limitations (e.g., chi-squared test validity and normality violations) should be considered when interpreting the results.