Author: Miha Fabjan
mydata <- read.table("./world-happiness-report-2021.csv", header=TRUE, sep=",", dec=".")
mydata <- cbind(ID = 1:nrow(mydata), mydata)
mydata<-mydata[,c(1,2,4,8,9,10,11,12,13)]
head(mydata,10)
## ID Country.name Ladder.score Logged.GDP.per.capita Social.support
## 1 1 Finland 7.842 10.775 0.954
## 2 2 Denmark 7.620 10.933 0.954
## 3 3 Switzerland 7.571 11.117 0.942
## 4 4 Iceland 7.554 10.878 0.983
## 5 5 Netherlands 7.464 10.932 0.942
## 6 6 Norway 7.392 11.053 0.954
## 7 7 Sweden 7.363 10.867 0.934
## 8 8 Luxembourg 7.324 11.647 0.908
## 9 9 New Zealand 7.277 10.643 0.948
## 10 10 Austria 7.268 10.906 0.934
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1 72.0 0.949 -0.098
## 2 72.7 0.946 0.030
## 3 74.4 0.919 0.025
## 4 73.0 0.955 0.160
## 5 72.4 0.913 0.175
## 6 73.3 0.960 0.093
## 7 72.7 0.945 0.086
## 8 72.6 0.907 -0.034
## 9 73.4 0.929 0.134
## 10 73.3 0.908 0.042
## Perceptions.of.corruption
## 1 0.186
## 2 0.179
## 3 0.292
## 4 0.673
## 5 0.338
## 6 0.270
## 7 0.237
## 8 0.386
## 9 0.242
## 10 0.481
Unit of observation is an individual country, with different variables, which describe the situations in the country, which will be explained down below. Sample size is 149.
Explanation of variables:
Ladder score
Represents the overall happiness score of a country, based on survey
responses to the Cantril ladder question, where respondents evaluate
their current life relative to the best and worst possible life. A
higher score indicates a higher level of well-being and
happiness.
Logged GDP per capita
Measures the economic production of a country on a per-person basis,
adjusted logarithmically to reflect diminishing returns of income on
happiness. Higher values suggest a higher standard of living and
economic prosperity.
Social support index
Reflects the extent to which people feel they have someone to count on
in times of trouble, based on survey responses. A higher value indicates
stronger social support networks and connections within the
country.
Healthy life expectancy
Represents the number of years a person can expect to live in good
health. A higher value suggests better healthcare systems and living
conditions.
Freedom to make life choices index
Measures peopleโs perceptions of the freedom they have to make life decisions. Higher values indicate greater personal autonomy and life satisfaction.
Generosity index
Captures the extent to which people donate time and money to charities and help others. A higher value suggests a more altruistic society.
Perceptions of corruption index
Reflects the publicโs perception of the level of corruption within their government and businesses. Lower values indicate lower perceived corruption and higher trust in institutions.
Source of data: https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(mydata[,c(-1,-2)]), 2)
## Ladder.score Logged.GDP.per.capita Social.support
## nbr.val 149.00 149.00 149.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 2.52 6.64 0.46
## max 7.84 11.65 0.98
## range 5.32 5.01 0.52
## sum 824.39 1405.40 121.40
## median 5.53 9.57 0.83
## mean 5.53 9.43 0.81
## SE.mean 0.09 0.09 0.01
## CI.mean.0.95 0.17 0.19 0.02
## var 1.15 1.34 0.01
## std.dev 1.07 1.16 0.11
## coef.var 0.19 0.12 0.14
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## nbr.val 149.00 149.00 149.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 48.48 0.38 -0.29
## max 76.95 0.97 0.54
## range 28.48 0.59 0.83
## sum 9683.93 117.95 -2.26
## median 66.60 0.80 -0.04
## mean 64.99 0.79 -0.02
## SE.mean 0.55 0.01 0.01
## CI.mean.0.95 1.09 0.02 0.02
## var 45.73 0.01 0.02
## std.dev 6.76 0.11 0.15
## coef.var 0.10 0.14 -9.95
## Perceptions.of.corruption
## nbr.val 149.00
## nbr.null 0.00
## nbr.na 0.00
## min 0.08
## max 0.94
## range 0.86
## sum 108.39
## median 0.78
## mean 0.73
## SE.mean 0.01
## CI.mean.0.95 0.03
## var 0.03
## std.dev 0.18
## coef.var 0.25
Minimum (Min):
The minimum healthy life expectancy is 48.48 years, indicating that the country with the lowest healthy life expectancy has an expected lifespan of 48.48 years in good health.
Maximum (Max):
The maximum healthy life expectancy is 76.95 years, meaning that the country with the highest healthy life expectancy has an expected lifespan of 76.95 years in good health.
Median: The median healthy life expectancy is 66.60 years, which means that half of the countries have a healthy life expectancy of 66.60 years or less, while the other half have a healthy life expectancy greater than 66.60 years.
#Saving standardized cluster variables into new data frame
mydata_clu_std <- as.data.frame(scale(mydata[c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption")]))
mydata$Dissimilarity <- sqrt(mydata_clu_std$Logged.GDP.per.capita^2 + mydata_clu_std$Social.support^2 + mydata_clu_std$Healthy.life.expectancy^2 + mydata_clu_std$Freedom.to.make.life.choices^2 + mydata_clu_std$Generosity^2 + mydata_clu_std$Perceptions.of.corruption^2) #Finding outliers
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")], 10) #Finding units with highest value of dissimilarity
## ID Dissimilarity
## 149 149 5.443879
## 143 143 4.684666
## 32 32 4.629847
## 140 140 4.398610
## 147 147 4.325374
## 128 128 3.988596
## 2 2 3.969410
## 1 1 3.904402
## 82 82 3.868888
## 98 98 3.842972
Since we can spot that there is a big gap between units with highest value of dissimiliarty, we remove following units:
mydata <- mydata %>%
filter(!ID %in% c(149,32,143,140,147)) #Removing ID 32, 140,143,147, 149 from original data frame
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption")]))
We show the matrix of distances, from which we can recognize, that we have multiple groups, which can be indentified.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Finding Eudlidean distances, based on 6 Cluster variables, then saving them into object Distances
Distances <- get_dist(mydata_clu_std,
method = "euclidian")
fviz_dist(Distances, #Showing matrix of distances
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
We also calculate Hopkins statistics, which is also above 0.5 - our clusters are good.
library(factoextra)
get_clust_tendency(mydata_clu_std, #Hopkins statistics
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7133091
##
## $plot
## NULL
For further clustering, we selected K-Means clustering. First we predetermine the number of groups with a help of elbow and silhouette analysis.
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 12 proposed 3 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.4646 77.7111 50.5306 -1.5166 204.4336 464874908577 13186.697 554.5281
## 3 3.6099 77.3987 22.5467 0.2894 381.9961 304790789090 5103.647 408.9895
## 4 0.8139 66.8879 21.4936 0.5364 466.3435 301643079108 3679.977 352.6058
## 5 4.2803 62.7894 11.1952 1.6211 571.5838 226942369375 2795.892 305.6765
## 6 0.5570 56.1097 11.8020 1.1950 622.0625 230164041938 2324.914 282.8920
## 7 1.1800 52.3418 10.1196 1.2826 701.3449 180642752049 1806.601 260.6047
## 8 0.8524 49.2618 10.0932 1.3262 739.2676 181314396821 1805.312 242.6791
## 9 2.6609 47.2150 6.5736 1.5525 811.7554 138713494017 1553.825 225.9129
## 10 2.1967 44.4114 3.5427 1.3127 856.1134 125849902307 1447.072 215.4232
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 7.9547 1.5473 0.3917 1.2987 0.3254 0.7530 33.1273 1.2482 0.3808
## 3 11.1927 2.0979 0.3413 1.1711 0.3224 0.5472 33.9222 3.1424 0.4092
## 4 12.4770 2.4333 0.4018 1.2502 0.3008 1.6853 -19.1116 -1.4933 0.3820
## 5 14.8348 2.8069 0.3630 1.3237 0.2605 1.7393 -17.4271 -1.5535 0.3575
## 6 15.6725 3.0330 0.3910 1.3523 0.2317 1.1785 -4.9988 -0.5504 0.3335
## 7 18.0169 3.2923 0.3864 1.3492 0.2262 1.3629 -5.8575 -0.9561 0.3147
## 8 18.4804 3.5355 0.3559 1.3821 0.2296 1.5254 -8.9548 -1.2046 0.2989
## 9 20.4884 3.7979 0.3402 1.3573 0.2195 1.1817 -3.3827 -0.5324 0.2857
## 10 22.0848 3.9829 0.3827 1.3607 0.2125 1.2336 -3.7868 -0.6880 0.2734
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 277.2640 0.5323 0.2208 0.5596 0.1597 0.0023 1.9152 1.8382 0.9108
## 3 136.3298 0.6261 0.7058 0.8501 0.1249 0.0031 1.6285 1.5676 0.5427
## 4 88.1515 0.6143 1.0228 1.1622 0.1375 0.0032 1.6473 1.4732 0.4739
## 5 61.1353 0.5449 2.2211 1.8704 0.1017 0.0034 1.9415 1.3722 0.4254
## 6 47.1487 0.4751 0.6225 2.6418 0.0863 0.0034 1.9845 1.3182 0.3810
## 7 37.2292 0.4507 -0.1628 3.1382 0.1373 0.0036 2.0513 1.2668 0.3478
## 8 30.3349 0.4803 1.1307 2.9675 0.0874 0.0037 1.9746 1.2264 0.3394
## 9 25.1014 0.4348 1.0884 3.7974 0.1319 0.0038 2.1685 1.1774 0.3140
## 10 21.5423 0.4162 0.7337 4.2153 0.1509 0.0038 2.1883 1.1447 0.2961
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7157 40.1151 0.2799
## 3 0.7006 17.5226 0.0050
## 4 0.5356 40.7476 1.0000
## 5 0.5190 38.0049 1.0000
## 6 0.4997 33.0377 1.0000
## 7 0.4643 25.3784 1.0000
## 8 0.3979 39.3438 1.0000
## 9 0.3758 36.5355 1.0000
## 10 0.4997 20.0228 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 5.0000 2.0000 3.0000 5.0000 3.0000 3 3.00
## Value_Index 4.2803 77.7111 27.9839 1.6211 177.5625 156936409505 8083.05
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 3.0000 9.0000 3.0000 2.0000 2.000
## Value_Index 89.1549 3.2381 -0.2151 0.3402 1.1711 0.3254 0.753
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 3.0000 1 2.0000
## Value_Index 33.1273 1.2482 0.4092 140.9342 0.6261 NA 0.5596
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 2.0000 0 3.0000 0 10.0000
## Value_Index 0.1597 0 1.6285 0 0.2961
##
## $Best.partition
## [1] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 1 3 1 3 3 3 1 3 1 1 1 1 1 3 1 1 1 1 1 1
## [38] 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 3 1 1 1 1 2 2 1 2 1 2 1 1 1 2 2 1 1 2 2 1 2 2 2 2 2 1 1 2 2 1 1 1 1 2 1
## [112] 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2
From here we can see that we will have 3 groups. Now we proceed with K-Means method.
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 24, 79, 41
##
## Cluster means:
## Logged.GDP.per.capita Social.support Healthy.life.expectancy
## 1 1.1733233 0.9899720 1.0233188
## 2 0.2787572 0.3257858 0.3282251
## 3 -1.2239408 -1.2072294 -1.2314497
## Freedom.to.make.life.choices Generosity Perceptions.of.corruption
## 1 1.02130098 0.5727610 -1.7417075
## 2 -0.01723578 -0.4458824 0.3996773
## 3 -0.56462432 0.5238645 0.2494263
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 2 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2
## [38] 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 1 2 2 2 2 3 3 2 3 2 3 2 2 2 3 3 2 2 3 3 2 3 3 3 3 3 2 2 3 3 2 2 2 2 3 2
## [112] 3 3 3 3 3 3 3 3 3 2 2 2 2 3 2 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3
##
## Within cluster sum of squares by cluster:
## [1] 51.36146 212.36967 145.25836
## (between_SS / total_SS = 52.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Since we see that ID 41,67 81,94,97,125 are clearly outliers, we remove them and repeat the process.
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(41,67,81,94,97,125)) #Removing IDs from original data frame
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption")]))
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 36, 80, 22
##
## Cluster means:
## Logged.GDP.per.capita Social.support Healthy.life.expectancy
## 1 -1.2908312 -1.2856269 -1.298512
## 2 0.2349987 0.3112376 0.290719
## 3 1.2577285 0.9719799 1.067678
## Freedom.to.make.life.choices Generosity Perceptions.of.corruption
## 1 -0.70874597 0.4260038 0.2358709
## 2 0.04147238 -0.3977309 0.3835836
## 3 1.00895748 0.7491973 -1.7808198
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 2 3 2 3 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3
## [75] 2 2 2 2 1 2 1 2 1 2 2 2 1 1 2 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 1 2 1 2 1 1 1
## [112] 1 1 1 1 2 2 2 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 108.64553 233.81083 44.06584
## (between_SS / total_SS = 53.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
rownames(mydata_clu_std) <- mydata$Country.name
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Logged.GDP.per.capita Social.support Healthy.life.expectancy
## 1 -1.2908312 -1.2856269 -1.298512
## 2 0.2349987 0.3112376 0.290719
## 3 1.2577285 0.9719799 1.067678
## Freedom.to.make.life.choices Generosity Perceptions.of.corruption
## 1 -0.70874597 0.4260038 0.2358709
## 2 0.04147238 -0.3977309 0.3835836
## 3 1.00895748 0.7491973 -1.7808198
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
From here we can see which clustering variables are above and below average for each formed group.
For group 1, variables like Perception of corruption and Generosity are above average,while Healthy life expectancy, Logged GDP per capita Freedom to make life choices and Social support are below average.
For group 2, all variables are above average except Generosity.
For group 3, all variables are above average except Perceptions of corruption.
mydata$Group <- Clustering$cluster
Then we test if variables are different for each group. Here is just an example of 1 variable:
\(H0: ๐_{Generosity,G1} = ๐_{Generosity,G2} = ๐_{Generosity,G3}\)
\(H1\): At least one group is different.
#Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(Logged.GDP.per.capita, Social.support, Healthy.life.expectancy,Freedom.to.make.life.choices,Generosity,Perceptions.of.corruption) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Logged.GDP.per.capita :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 123.385 61.692 177.17 < 2.2e-16 ***
## Residuals 135 47.008 0.348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Social.support :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.97873 0.48937 121.36 < 2.2e-16 ***
## Residuals 135 0.54435 0.00403
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Healthy.life.expectancy :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 4033.7 2016.86 140.5 < 2.2e-16 ***
## Residuals 135 1937.9 14.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Freedom.to.make.life.choices :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.45553 0.227767 28.445 4.913e-11 ***
## Residuals 135 1.08097 0.008007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Generosity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.49356 0.246781 20.185 2.141e-08 ***
## Residuals 135 1.65052 0.012226
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Perceptions.of.corruption :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 2.3594 1.17970 105.49 < 2.2e-16 ***
## Residuals 135 1.5097 0.01118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output we can conduct that we have to reject \(H0\) at p<0,001, which applies for all selected variables.
Then we also check the ladder score, for which group it is the highest (official happiness leaderboard score)
aggregate(mydata$Ladder.score,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 4.522278
## 2 2 5.683650
## 3 3 7.062818
3rd group has the highest average score.
We perform Leveneโs test for homogeneity of variance
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$Ladder.score, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.8227 0.1655
## 135
Based on the levene test, we have:
\(H0\): \(ฯ_1^2 = ฯ_2^2 = ฯ_3^2\)
\(H1\): At least one is different.
From output we can see that we can not reject \(H0\).
We also check the normality:
\(H0\): Normality of distribution is met.
\(H1\): Normality of distribution is not met.
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(Ladder.score)
## # A tibble: 3 ร 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Ladder.score 0.941 0.0534
## 2 2 Ladder.score 0.978 0.174
## 3 3 Ladder.score 0.904 0.0363
From the output we can see we have to reject \(H0\) at p < 0.04.
Therefore, we have to perform kruskal-wallis test:
\(H0\): Location distribution of Ladder.Score is the same.
\(H1\): Location distribution of Ladder.Score is not the same.
kruskal.test(Ladder.score ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Ladder.score by as.factor(Group)
## Kruskal-Wallis chi-squared = 81.148, df = 2, p-value < 2.2e-16
We have to reject \(H0\) at p < 0,001.
We clustered 138 countries into 3 segments based on six standardized variables.
Group 1 represents the second largest group (26%, 36/138). For those countries have perception of corruption and generosity are above average, while healthy life expectancy, logged GDP per capita, freedom to make life choices and social support are below average. Countries in this group are Iran,Nepal and Kenya
Group 2 represents the largest group of countries (58%, 80/138).Those countries have below average Generosity, while Healthy life expectancy, Logged GDP per capita, Social support, freedom to make life choices and perception of corruption are above average. Countries in this group are Malaysia, Sri Lanka, Kosovo.
Group 3 represents the smallest group from all 3 (16% 22/138).Those countries have variables like freedom to make life choices, generosity, healthy life expectancy, logged GDP per capita and social support above average, while perception of corruption is below average. Countries in this group are Finland, Denmark, Switzerland.