3 How can countries be segmented based on key happiness-related
variables (Family, Life Expectancy, Freedom, Trust, and Generosity) to
identify patterns in well-being and socioeconomic
characteristics?
#Standardizing the clustering variables
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity")]))
#Euclidean distance of each unit from the overall (standardized) mean
mydata$Dissimilarity <- sqrt(mydata_clu_std$Family^2 + mydata_clu_std$Life_Expectancy^2 + mydata_clu_std$Freedom^2 + mydata_clu_std$Trust^2 + mydata_clu_std$Generosity^2)
#Units with the largest dissimilarity are outlier candidates
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")])
## ID Dissimilarity
## 129 129 4.585831
## 148 148 4.341715
## 154 154 3.750601
## 9 9 3.724099
## 3 3 3.697382
## 28 28 3.602008
I identified ID129 and ID148 as potential outliers, as there is a clear jump in the dissimilarity values between these two units and the rest. I decided to remove them.
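The jump is easiest to see when the dissimilarity values are plotted in decreasing order; a minimal sketch using base graphics (not part of the original analysis):
plot(sort(mydata$Dissimilarity, decreasing = TRUE),
type = "b", #Points connected by lines
xlab = "Unit (ordered by dissimilarity)",
ylab = "Dissimilarity") #The gap after the first two points marks the two outliers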
print(mydata[c(129, 148), ])
## Country Region ID Happiness_Score
## 129 Myanmar Southeastern Asia 129 4.307
## 148 Central African Republic Sub-Saharan Africa 148 3.678
## GDP_per_capita Family Life_Expectancy Freedom Trust Generosity Dystopia
## 129 0.27108 0.70905 0.48246 0.44017 0.19034 0.79588 1.41805
## 148 0.07850 0.00000 0.06699 0.48879 0.08289 0.23835 2.72230
## Dissimilarity
## 129 4.585831
## 148 4.341715
mydata <- mydata %>%
filter(!ID %in% c(129, 148)) #Removing the two outlier units
mydata$ID <- seq(1, nrow(mydata)) #Re-numbering the IDs
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity")])) #Re-standardizing without the outliers
rownames(mydata_clu_std) <- mydata$Country
Distances <- get_dist(mydata_clu_std,
method = "euclidean")
fviz_dist(Distances, #Showing matrix of distances
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))

library(factoextra)
get_clust_tendency(mydata_clu_std, #Hopkins statistics
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6830286
##
## $plot
## NULL
The Hopkins statistic is 0.68; since it is above 0.5, the data is clusterable. Using hierarchical clustering (a dendrogram) and K-means diagnostics (the elbow method and silhouette analysis), I will now determine how many clusters to use.
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 156
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the dendrogram, I would choose 3 clusters, as that is where the largest jump in the merge heights (the vertical distances) occurs.
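To make this cut visible, the dendrogram can be redrawn with the three clusters highlighted; a minimal sketch (the k = 3 cut and the rectangles are illustrative additions, not part of the original call):
fviz_dend(WARD,
k = 3, #Cut the tree into 3 clusters
rect = TRUE) #Draw a rectangle around each cluster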
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")

With the elbow method the slope changes most markedly at 3 clusters, so I would choose 3 clusters.
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")

The highest average silhouette width is at 2 clusters.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##

## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.3166 73.1674 44.9343 -1.5735 195.3316 35315734948 11930.198 525.3834
## 3 2.0042 69.2725 27.5202 -0.2824 388.4529 23041609882 7117.389 406.7125
## 4 1.3647 63.2457 20.7245 0.4435 513.8824 18331678934 5527.242 344.7094
## 5 1.5892 58.6942 7.5026 1.1318 615.2159 14959454904 4127.544 303.3490
## 6 3.1160 50.4525 8.3136 -0.1628 639.4804 18438577563 3618.243 288.9902
## 7 0.0751 45.4546 26.1072 -0.8857 700.8194 16937798874 3464.357 273.8143
## 8 14.0319 49.1849 8.4114 2.1103 812.2501 10829902828 2158.047 232.9906
## 9 0.4673 46.2198 9.0653 1.8704 848.3482 10875136180 1939.257 220.4610
## 10 15.0520 44.3215 5.8537 1.9318 891.5193 10180380410 1769.317 207.6552
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 3.1311 1.4751 0.4039 1.4501 0.2746 0.9513 5.1717 0.1581 0.3905
## 3 5.4273 1.9055 0.3448 1.3972 0.2695 0.8513 11.1827 0.5389 0.3947
## 4 7.3639 2.2483 0.3454 1.3639 0.2507 1.3808 -20.1320 -0.8440 0.3707
## 5 9.0857 2.5548 0.3854 1.3663 0.2563 1.9208 -34.5147 -1.4662 0.3477
## 6 9.5448 2.6818 0.3329 1.3624 0.2363 1.4354 -13.9523 -0.9196 0.3224
## 7 11.1882 2.8304 0.3129 1.3345 0.2142 1.5957 -15.6800 -1.1295 0.3028
## 8 13.2899 3.3263 0.3940 1.3100 0.2380 1.2300 -6.1715 -0.5644 0.2953
## 9 13.9415 3.5154 0.3822 1.3482 0.2247 1.2898 -7.8649 -0.6740 0.2817
## 10 14.8895 3.7321 0.3761 1.2708 0.2209 1.0139 -0.4671 -0.0413 0.2705
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 262.6917 0.4451 0.1263 0.7159 0.1253 0.0023 1.8367 1.7447 1.2727
## 3 135.5708 0.5458 0.8865 1.1204 0.1081 0.0034 1.7352 1.5212 0.9307
## 4 86.1773 0.5084 0.2018 1.6782 0.1272 0.0036 1.6230 1.3906 0.5690
## 5 60.6698 0.5213 0.9459 1.9563 0.1523 0.0041 1.5588 1.3112 0.4438
## 6 48.1650 0.4977 -1.6330 2.2866 0.0510 0.0039 1.7100 1.2835 0.5630
## 7 39.1163 0.4533 0.0967 2.7039 0.0473 0.0040 1.8817 1.2448 0.3736
## 8 29.1238 0.4725 0.4949 3.1287 0.1271 0.0044 1.8822 1.1630 0.3509
## 9 24.4957 0.4566 0.5625 3.4987 0.1109 0.0045 1.7358 1.1291 0.3323
## 10 20.7655 0.4459 0.6159 3.7482 0.1109 0.0047 1.7848 1.0938 0.3176
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.6588 52.3182 0.9775
## 3 0.6513 34.2721 0.7467
## 4 0.5995 48.7619 1.0000
## 5 0.5965 48.7016 1.0000
## 6 0.5502 37.6080 1.0000
## 7 0.5399 35.7855 1.0000
## 8 0.5287 29.4215 1.0000
## 9 0.5022 34.6984 1.0000
## 10 0.5094 32.7506 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 10.000 2.0000 7.0000 8.0000 3.0000 3 3.000
## Value_Index 15.052 73.1674 17.7936 2.1103 193.1213 7564194118 4812.809
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 8.0000 7.0000 10.0000 2.0000 2.0000
## Value_Index 56.6678 2.2963 -0.3069 0.3129 1.2708 0.2746 0.9513
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 3.0000 1 2.0000
## Value_Index 5.1717 0.1581 0.3947 127.1209 0.5458 NA 0.7159
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 5.0000 0 5.0000 0 10.0000
## Value_Index 0.1523 0 1.5588 0 0.3176
##
## $Best.partition
## Switzerland Iceland Denmark
## 3 3 3
## Norway Canada Finland
## 3 3 3
## Netherlands Sweden New Zealand
## 3 3 3
## Australia Israel Costa Rica
## 3 1 1
## Austria Mexico United States
## 3 1 3
## Brazil Luxembourg Ireland
## 1 3 3
## Belgium United Arab Emirates United Kingdom
## 3 3 3
## Oman Venezuela Singapore
## 3 1 3
## Panama Germany Chile
## 1 3 1
## Qatar France Argentina
## 3 1 1
## Czech Republic Uruguay Colombia
## 1 3 1
## Thailand Saudi Arabia Spain
## 3 1 1
## Malta Taiwan Kuwait
## 3 1 1
## Suriname Trinidad and Tobago El Salvador
## 1 1 1
## Guatemala Uzbekistan Slovakia
## 1 3 1
## Japan South Korea Ecuador
## 1 1 1
## Bahrain Italy Bolivia
## 3 1 1
## Moldova Paraguay Kazakhstan
## 1 1 1
## Slovenia Lithuania Nicaragua
## 1 1 3
## Peru Belarus Poland
## 1 1 1
## Malaysia Croatia Libya
## 1 1 1
## Russia Jamaica North Cyprus
## 1 1 1
## Cyprus Algeria Kosovo
## 1 1 2
## Turkmenistan Mauritius Hong Kong
## 3 1 3
## Estonia Indonesia Vietnam
## 1 1 1
## Turkey Kyrgyzstan Nigeria
## 1 1 2
## Bhutan Azerbaijan Pakistan
## 3 1 2
## Jordan Montenegro China
## 1 1 1
## Zambia Romania Serbia
## 2 1 1
## Portugal Latvia Philippines
## 1 1 1
## Somaliland region Morocco Macedonia
## 3 2 1
## Mozambique Albania Bosnia and Herzegovina
## 2 1 1
## Lesotho Dominican Republic Laos
## 2 1 3
## Mongolia Swaziland Greece
## 1 2 1
## Lebanon Hungary Honduras
## 1 1 1
## Tajikistan Tunisia Palestinian Territories
## 2 2 1
## Bangladesh Iran Ukraine
## 2 2 1
## Iraq South Africa Ghana
## 2 2 2
## Zimbabwe Liberia India
## 2 2 2
## Sudan Haiti Congo (Kinshasa)
## 2 2 2
## Nepal Ethiopia Sierra Leone
## 2 2 2
## Mauritania Kenya Djibouti
## 2 2 2
## Armenia Botswana Georgia
## 1 2 2
## Malawi Sri Lanka Cameroon
## 2 1 2
## Bulgaria Egypt Yemen
## 1 2 2
## Angola Mali Congo (Brazzaville)
## 2 2 2
## Comoros Uganda Senegal
## 2 2 2
## Gabon Niger Cambodia
## 2 2 2
## Tanzania Madagascar Chad
## 2 2 2
## Guinea Ivory Coast Burkina Faso
## 2 2 2
## Afghanistan Rwanda Benin
## 2 3 2
## Syria Burundi Togo
## 2 2 2
We used four approaches to determine the optimal number of clusters:
Dendrogram (hierarchical clustering): suggested 3 clusters.
Elbow method: largest slope change at 3 clusters.
Silhouette analysis: optimal at 2 clusters.
NbClust majority rule: 3 clusters.
Since most criteria point to 3 clusters, we proceed with 3 clusters.
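Because kmeans() chooses its starting leaders at random, the exact solution can vary slightly between runs even with nstart = 25. Fixing a seed before the call (the value below is arbitrary and not part of the original analysis) makes the grouping reproducible:
set.seed(123) #Arbitrary seed so that the kmeans() call below returns the same solution on every run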
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)

The cluster plot projects the 5 standardized variables onto their first two principal components, which together show about 67.8% (46.8% + 21%) of the information in the data.
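These percentages can be reproduced directly from a principal component analysis of the standardized clustering variables; a minimal sketch (the object name PCA is only illustrative):
PCA <- prcomp(mydata_clu_std) #Variables are already standardized, so no extra scaling is needed
summary(PCA)$importance["Proportion of Variance", 1:2] #Share of variability captured by the first two components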
Clustering
## K-means clustering with 3 clusters of sizes 31, 73, 52
##
## Cluster means:
## Family Life_Expectancy Freedom Trust Generosity
## 1 0.8028006 0.7727040 1.19108894 1.4387314 1.10542963
## 2 0.3320646 0.4634078 -0.05718025 -0.4072969 -0.40343257
## 3 -0.9447603 -1.1112038 -0.62979997 -0.2859231 -0.09264887
##
## Clustering vector:
## Switzerland Iceland Denmark
## 1 1 1
## Norway Canada Finland
## 1 1 1
## Netherlands Sweden New Zealand
## 1 1 1
## Australia Israel Costa Rica
## 1 2 2
## Austria Mexico United States
## 1 2 1
## Brazil Luxembourg Ireland
## 2 1 1
## Belgium United Arab Emirates United Kingdom
## 1 1 1
## Oman Venezuela Singapore
## 1 2 1
## Panama Germany Chile
## 2 1 2
## Qatar France Argentina
## 1 2 2
## Czech Republic Uruguay Colombia
## 2 1 2
## Thailand Saudi Arabia Spain
## 1 2 2
## Malta Taiwan Kuwait
## 1 2 2
## Suriname Trinidad and Tobago El Salvador
## 2 2 2
## Guatemala Uzbekistan Slovakia
## 2 1 2
## Japan South Korea Ecuador
## 2 2 2
## Bahrain Italy Bolivia
## 2 2 2
## Moldova Paraguay Kazakhstan
## 2 2 2
## Slovenia Lithuania Nicaragua
## 2 2 1
## Peru Belarus Poland
## 2 2 2
## Malaysia Croatia Libya
## 2 2 2
## Russia Jamaica North Cyprus
## 2 2 2
## Cyprus Algeria Kosovo
## 2 2 3
## Turkmenistan Mauritius Hong Kong
## 2 2 1
## Estonia Indonesia Vietnam
## 2 2 2
## Turkey Kyrgyzstan Nigeria
## 2 2 3
## Bhutan Azerbaijan Pakistan
## 1 2 3
## Jordan Montenegro China
## 2 2 2
## Zambia Romania Serbia
## 3 2 2
## Portugal Latvia Philippines
## 2 2 2
## Somaliland region Morocco Macedonia
## 1 3 2
## Mozambique Albania Bosnia and Herzegovina
## 3 2 2
## Lesotho Dominican Republic Laos
## 3 2 1
## Mongolia Swaziland Greece
## 2 3 2
## Lebanon Hungary Honduras
## 2 2 2
## Tajikistan Tunisia Palestinian Territories
## 2 3 2
## Bangladesh Iran Ukraine
## 3 3 2
## Iraq South Africa Ghana
## 3 3 3
## Zimbabwe Liberia India
## 3 3 3
## Sudan Haiti Congo (Kinshasa)
## 3 3 3
## Nepal Ethiopia Sierra Leone
## 3 3 3
## Mauritania Kenya Djibouti
## 3 3 3
## Armenia Botswana Georgia
## 2 3 3
## Malawi Sri Lanka Cameroon
## 3 2 3
## Bulgaria Egypt Yemen
## 2 3 3
## Angola Mali Congo (Brazzaville)
## 3 3 3
## Comoros Uganda Senegal
## 3 3 3
## Gabon Niger Cambodia
## 3 3 3
## Tanzania Madagascar Chad
## 3 3 3
## Guinea Ivory Coast Burkina Faso
## 3 3 3
## Afghanistan Rwanda Benin
## 3 1 3
## Syria Burundi Togo
## 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 83.86715 161.73718 160.97704
## (between_SS / total_SS = 47.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Family Life_Expectancy Freedom Trust Generosity
## 1 0.8028006 0.7727040 1.19108894 1.4387314 1.10542963
## 2 0.3320646 0.4634078 -0.05718025 -0.4072969 -0.40343257
## 3 -0.9447603 -1.1112038 -0.62979997 -0.2859231 -0.09264887
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
Figure <- pivot_longer(Figure, cols = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity"),
labels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Group 1 (High Performers in Happiness):
Countries in this cluster consistently exhibit above-average values for all analyzed variables: Family, Life Expectancy, Freedom, Trust, and Generosity. These nations tend to have strong institutional and social support systems, robust economies, and high happiness scores. As shown below, Group 1 also has the highest average GDP per capita contribution among the clusters.
Examples: Switzerland, Iceland, Norway, Canada, and Hong Kong.
Group 2 (Moderate Performers in Happiness):
This cluster consists of countries with mixed performance across the variables. Family and Life Expectancy sit around or slightly above the global average, while Trust and Generosity fall below it. These nations are often transitioning economies, with moderate happiness levels and stable but average institutional support systems.
Examples: Slovenia, Brazil, Chile, Romania, and Indonesia.
Group 3 (Low Performers in Happiness):
Countries in this group score below average across all variables, particularly in Family, Life Expectancy, and Freedom. These nations face significant socioeconomic challenges, weaker institutions, and lower levels of social support. This cluster represents countries struggling with systemic issues that hold down their happiness scores.
Examples: Zimbabwe, Afghanistan, Haiti, Yemen, and Burundi.
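The full assignment behind these examples can be listed directly from the k-means result; a minimal sketch (the country names are available because they were set as row names above):
split(names(Clustering$cluster), Clustering$cluster) #Countries belonging to each of the three clusters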
mydata$Group <- Clustering$cluster
#Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(Family, Life_Expectancy, Freedom, Trust, Generosity) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Family :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.0789 2.53943 70.693 < 2.2e-16 ***
## Residuals 153 5.4961 0.03592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Life_Expectancy :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.8669 2.93344 132.97 < 2.2e-16 ***
## Residuals 153 3.3752 0.02206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Freedom :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.4899 0.74496 55.022 < 2.2e-16 ***
## Residuals 153 2.0715 0.01354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Trust :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.1722 0.58610 82.724 < 2.2e-16 ***
## Residuals 153 1.0840 0.00709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Generosity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.71448 0.35724 36.654 9.873e-14 ***
## Residuals 153 1.49120 0.00975
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response for Family:
H0: μ(Family, G1) = μ(Family, G2) = μ(Family, G3)
H1: At least one μ(Family, j) is different.
We reject H0 at p < 0.001, and the same holds for every other clustering variable: the clusters differ significantly in all clustering variables.
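To see which pairs of clusters differ on a given variable, a pairwise post-hoc comparison can follow each ANOVA; a minimal sketch for Family (the same call works for the other variables):
TukeyHSD(aov(Family ~ as.factor(Group), data = mydata)) #Pairwise cluster differences with family-wise error control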
aggregate(mydata$GDP_per_capita,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 1.1773374
## 2 2 0.9976774
## 3 3 0.4617729
On average, Group 1 has the largest GDP per capita contribution to the Happiness Score.
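The difference is also easy to inspect visually; a minimal sketch using ggplot2 (loaded above):
ggplot(mydata, aes(x = as.factor(Group), y = GDP_per_capita)) +
geom_boxplot() + #Distribution of GDP per capita within each cluster
xlab("Group") +
ylab("GDP per capita contribution") +
theme_bw()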
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$GDP_per_capita, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.7196 0.1826
## 153
H0: σ²(GDP_per_capita, G1) = σ²(GDP_per_capita, G2) = σ²(GDP_per_capita, G3)
H1: At least one σ²(GDP_per_capita, j) is different.
We cannot reject H0, so we can assume homogeneity of variances.
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(GDP_per_capita)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 GDP_per_capita 0.799 0.0000504
## 2 2 GDP_per_capita 0.984 0.491
## 3 3 GDP_per_capita 0.959 0.0703
H0: GDP per capita is normally distributed in group j.
H1: GDP per capita is not normally distributed in group j.
For Group 1 we reject H0 at p < 0.001; for Groups 2 and 3 we cannot reject H0 (p = 0.491 and p = 0.070).
Because normality is violated in Group 1, the assumptions for a parametric comparison are not met, so we check the result with the non-parametric Kruskal-Wallis test.
kruskal.test(GDP_per_capita ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: GDP_per_capita by as.factor(Group)
## Kruskal-Wallis chi-squared = 79.541, df = 2, p-value < 2.2e-16
H0: The distribution (location) of GDP per capita is the same in all groups.
H1: The distribution (location) of GDP per capita differs in at least one of the groups.
We reject H0 at p < 0.001: the groups differ significantly in GDP per capita.
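To identify which pairs of groups drive this difference, a pairwise Dunn test can follow the Kruskal-Wallis test; a minimal sketch using rstatix (loaded above), with Holm-adjusted p-values:
mydata %>%
mutate(Group = as.factor(Group)) %>%
dunn_test(GDP_per_capita ~ Group, p.adjust.method = "holm") #Pairwise post-hoc comparisons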
chi_square <- chisq.test(mydata$Region, as.factor(mydata$Group))
## Warning in chisq.test(mydata$Region, as.factor(mydata$Group)): Chi-squared
## approximation may be incorrect
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$Region and as.factor(mydata$Group)
## X-squared = 154.96, df = 18, p-value < 2.2e-16
H0: There is no association between Region and the classification of countries into 3 groups.
H1: There is an association between Region and the classification of countries into 3 groups.
We reject H0 at p < 0.001.
addmargins(chi_square$observed)
## as.factor(mydata$Group)
## mydata$Region 1 2 3 Sum
## Australia and New Zealand 2 0 0 2
## Central and Eastern Europe 1 26 2 29
## Eastern Asia 1 5 0 6
## Latin America and Caribbean 2 19 1 22
## Middle East and Northern Africa 3 10 7 20
## North America 2 0 0 2
## Southeastern Asia 3 4 1 8
## Southern Asia 1 1 5 7
## Sub-Saharan Africa 2 1 36 39
## Western Europe 14 7 0 21
## Sum 31 73 52 156
addmargins(round(chi_square$expected, 2))
## as.factor(mydata$Group)
## mydata$Region 1 2 3 Sum
## Australia and New Zealand 0.40 0.94 0.67 2.01
## Central and Eastern Europe 5.76 13.57 9.67 29.00
## Eastern Asia 1.19 2.81 2.00 6.00
## Latin America and Caribbean 4.37 10.29 7.33 21.99
## Middle East and Northern Africa 3.97 9.36 6.67 20.00
## North America 0.40 0.94 0.67 2.01
## Southeastern Asia 1.59 3.74 2.67 8.00
## Southern Asia 1.39 3.28 2.33 7.00
## Sub-Saharan Africa 7.75 18.25 13.00 39.00
## Western Europe 4.17 9.83 7.00 21.00
## Sum 30.99 73.01 52.01 156.01
18 of the 30 expected frequencies (60%) are below 5, far more than the usual 20% threshold, so the chi-squared approximation is unreliable and the result should be interpreted with caution.
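A common remedy when many expected counts are small is to replace the asymptotic p-value with a simulated (Monte Carlo) one; a minimal sketch (B, the number of replicates, is an arbitrary choice):
chisq.test(mydata$Region, as.factor(mydata$Group),
simulate.p.value = TRUE, B = 10000) #p-value by Monte Carlo simulation, no expected-count assumption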
round(chi_square$res, 2)
## as.factor(mydata$Group)
## mydata$Region 1 2 3
## Australia and New Zealand 2.54 -0.97 -0.82
## Central and Eastern Europe -1.98 3.37 -2.47
## Eastern Asia -0.18 1.31 -1.41
## Latin America and Caribbean -1.13 2.71 -2.34
## Middle East and Northern Africa -0.49 0.21 0.13
## North America 2.54 -0.97 -0.82
## Southeastern Asia 1.12 0.13 -1.02
## Southern Asia -0.33 -1.26 1.75
## Sub-Saharan Africa -2.07 -4.04 6.38
## Western Europe 4.81 -0.90 -2.65
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
effectsize::cramers_v(mydata$Region, mydata$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.67 | [0.52, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.67) #Interpreting the adjusted Cramer's V computed above
## [1] "very large"
## (Rules: funder2019)
We can’t use residuals, since not all assumptions are met.