Homework assignment 2 at course MVA

1 Data import

library(readxl)
mydata <- read_xlsx("/cloud/project/Student depression/2015.xlsx")
mydata <- as.data.frame(mydata)

2 Data description

colnames(mydata) [1] <- "Country"
colnames(mydata) [2] <- "Region"
colnames(mydata) [3] <- "ID"
colnames(mydata) [4] <- "Happiness_Score"
colnames(mydata) [5] <- "GDP_per_capita"
colnames(mydata) [6] <- "Family"
colnames(mydata) [7] <- "Life_Expectancy"
colnames(mydata) [8] <- "Freedom"
colnames(mydata) [9] <- "Trust"
colnames(mydata) [10] <- "Generosity"
colnames(mydata) [11] <- "Dystopia"


head(mydata)

##       Country         Region ID Happiness_Score GDP_per_capita  Family
## 1 Switzerland Western Europe  1           7.587        1.39651 1.34951
## 2     Iceland Western Europe  2           7.561        1.30232 1.40223
## 3     Denmark Western Europe  3           7.527        1.32548 1.36058
## 4      Norway Western Europe  4           7.522        1.45900 1.33095
## 5      Canada  North America  5           7.427        1.32629 1.32261
## 6     Finland Western Europe  6           7.406        1.29025 1.31826
##   Life_Expectancy Freedom   Trust Generosity Dystopia
## 1         0.94143 0.66557 0.41978    0.29678  2.51738
## 2         0.94784 0.62877 0.14145    0.43630  2.70201
## 3         0.87464 0.64938 0.48357    0.34139  2.49204
## 4         0.88521 0.66973 0.36503    0.34699  2.46531
## 5         0.90563 0.63297 0.32957    0.45811  2.45176
## 6         0.88911 0.64169 0.41372    0.23351  2.61955

Source: Kaggle

Unit of Observation: Each row represents a country.

Sample Size: 158 countries (rows in the dataset).

Variables Analyzed:

Economy (GDP per Capita): GDP per capita (logarithmic scale).
Family: Social support metric.
Health (Life Expectancy): Life expectancy in years.
Freedom: Perceived freedom to make life choices.
Trust (Government Corruption): Perception of government corruption.
Generosity: Willingness to donate to others.
Dystopia Residual: Baseline metric for unhappiness.

summary(mydata[ , c(-1, -2, -3)])

##  Happiness_Score GDP_per_capita       Family       Life_Expectancy 
##  Min.   :2.839   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:4.526   1st Qu.:0.5458   1st Qu.:0.8568   1st Qu.:0.4392  
##  Median :5.232   Median :0.9102   Median :1.0295   Median :0.6967  
##  Mean   :5.376   Mean   :0.8461   Mean   :0.9910   Mean   :0.6303  
##  3rd Qu.:6.244   3rd Qu.:1.1584   3rd Qu.:1.2144   3rd Qu.:0.8110  
##  Max.   :7.587   Max.   :1.6904   Max.   :1.4022   Max.   :1.0252  
##     Freedom           Trust           Generosity        Dystopia     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.3286  
##  1st Qu.:0.3283   1st Qu.:0.06168   1st Qu.:0.1506   1st Qu.:1.7594  
##  Median :0.4355   Median :0.10722   Median :0.2161   Median :2.0954  
##  Mean   :0.4286   Mean   :0.14342   Mean   :0.2373   Mean   :2.0990  
##  3rd Qu.:0.5491   3rd Qu.:0.18025   3rd Qu.:0.3099   3rd Qu.:2.4624  
##  Max.   :0.6697   Max.   :0.55191   Max.   :0.7959   Max.   :3.6021

The average happiness score across all countries is 5.421, the median is slightly lower than the mean, suggesting that the distribution of happiness scores might be slightly skewed to the right.

The lowest happiness score is 2.839, and the highest is 7.587, showing a wide range of happiness across countries.

The average GDP per capita contribution to happiness is 0.8601.

The median generosity score is 0.216, indicating that half of the countries have a generosity score below this value and the other half above it

3 Based on variables, such as Family, Life_Expectancy, Freedom, Trust, Generosity and Dystopia, how can we identify distinct segments?

mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")]))

mydata$Dissimilarity <- sqrt(mydata_clu_std$Family^2 + mydata_clu_std$`Life_Expectancy`^2 + mydata_clu_std$Freedom^2 + mydata_clu_std$Trust^2 + mydata_clu_std$Generosity^2 + mydata_clu_std$Dystopia^2)

head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")])

##      ID Dissimilarity
## 129 129      4.747948
## 156 156      4.558639
## 154 154      4.552705
## 148 148      4.485362
## 72   72      3.956991
## 3     3      3.764949

I have identified ID129, ID156, ID154 and ID148 as potential outliers, as there is a big jump in disimilarity numbers between units. I have decided to remove these units.

print(mydata[c(129, 156, 154, 148), ])

##                      Country                          Region  ID
## 129                  Myanmar               Southeastern Asia 129
## 156                    Syria Middle East and Northern Africa 156
## 154                   Rwanda              Sub-Saharan Africa 154
## 148 Central African Republic              Sub-Saharan Africa 148
##     Happiness_Score GDP_per_capita  Family Life_Expectancy Freedom   Trust
## 129           4.307        0.27108 0.70905         0.48246 0.44017 0.19034
## 156           3.006        0.66320 0.47489         0.72193 0.15684 0.18906
## 154           3.465        0.22208 0.77370         0.42864 0.59201 0.55191
## 148           3.678        0.07850 0.00000         0.06699 0.48879 0.08289
##     Generosity Dystopia Dissimilarity
## 129    0.79588  1.41805      4.747948
## 156    0.47179  0.32858      4.558639
## 154    0.22628  0.67042      4.552705
## 148    0.23835  2.72230      4.485362

mydata <- mydata %>%
  filter(!ID %in% c(129, 156, 154, 148)) 

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")])) 
rownames(mydata_clu_std) <- mydata$Country

Distances <- get_dist(mydata_clu_std, 
                      method = "euclidian")

fviz_dist(Distances, #Showing matrix of distances
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

library(factoextra) 
get_clust_tendency(mydata_clu_std, #Hopkins statistics
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.6641082
## 
## $plot
## NULL

Hopkins statistics are 0.66, my data is clusterable as it is above 0.5. With the help of Hierarhical clustering (dendrogram) and K-Means clustering (Elbow method and Silhouette analysis) I will now determine how many clusters to use.

WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 154

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the dendrogram, I would choose 2 or 3 clusters, as there is the biggest jump in vertical line.

library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 154

library(factoextra)
library(NbClust)

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

With the elbow method the slope changes most evidently at 3, therefore I’d choose 3 clusters.

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

The higest value of the Silhouette analysis is at 2.

library(NbClust)
NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 4 proposed 9 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##        KL      CH Hartigan     CCC    Scott      Marriot    TrCovW   TraceW
## 2  1.6663 58.0809  32.7492 -1.5670 191.4581 4.709559e+12 12898.695 664.2014
## 3  2.4828 51.3320  19.4393 -0.6937 384.4317 3.026628e+12  8378.690 546.4629
## 4  3.3715 44.8079  12.8153 -1.2094 502.4637 2.500203e+12  7152.162 484.1367
## 5  0.1162 39.4164  20.4654 -1.8145 590.9604 2.199005e+12  6536.666 446.0300
## 6  1.9507 39.6891  13.6274  0.0982 653.7504 2.106275e+12  4666.327 392.1654
## 7  2.4432 38.1315   9.0788  0.8676 755.7490 1.478301e+12  4072.358 359.1005
## 8  0.5142 35.7548  10.5778  0.8516 789.7330 1.548491e+12  3231.614 338.2123
## 9  7.8805 34.6355   5.7791  1.3620 842.1439 1.394473e+12  2612.292 315.3640
## 10 0.5265 32.4310   5.7777  0.9888 870.6843 1.430337e+12  2483.474 303.2766
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2    2.9644 1.3821 0.4536 1.5497     0.2499 0.8298  20.7196  0.7812    0.3395
## 3    5.4668 1.6799 0.4217 1.6095     0.2166 1.1427  -9.6160 -0.4726    0.3477
## 4    7.3087 1.8962 0.4005 1.5968     0.2070 1.0469  -2.6000 -0.1692    0.3238
## 5    8.5477 2.0582 0.3859 1.6463     0.1948 1.8851 -31.9279 -1.7613    0.3169
## 6    9.0831 2.3408 0.3733 1.4245     0.2005 1.4051 -11.8205 -1.0710    0.3079
## 7   11.1795 2.5564 0.3908 1.5612     0.1980 2.0770 -22.8153 -1.9043    0.2935
## 8   11.6042 2.7143 0.3828 1.4400     0.1993 1.4646 -11.4194 -1.1735    0.2803
## 9   12.4236 2.9109 0.3649 1.3892     0.2035 2.1703 -18.8735 -1.9655    0.2699
## 10  13.0009 3.0269 0.3607 1.4309     0.1868 1.1266  -3.1465 -0.4035    0.2585
##        Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  332.1007     0.4553 0.5515  0.6330 0.1318 0.0025  1.5878 1.9959 1.8120
## 3  182.1543     0.4870 0.5910  1.3499 0.1224 0.0029  1.5813 1.8033 1.0185
## 4  121.0342     0.4748 0.5322  1.8308 0.1224 0.0031  1.5262 1.6878 0.5972
## 5   89.2060     0.4545 0.5445  2.4749 0.0978 0.0034  1.5611 1.6129 0.5406
## 6   65.3609     0.4355 0.1703  3.0153 0.1229 0.0034  1.4491 1.5135 0.4381
## 7   51.3001     0.4380 0.5377  3.6951 0.1155 0.0039  1.5478 1.4557 0.4376
## 8   42.2765     0.4215 0.1287  4.1858 0.1246 0.0039  1.4182 1.4093 0.3952
## 9   35.0404     0.4219 0.6363  4.5788 0.1388 0.0042  1.4448 1.3593 0.3741
## 10  30.3277     0.4087 0.3902  4.9934 0.1005 0.0043  1.4130 1.3324 0.3638
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.7212            39.0422       0.5849
## 3          0.6757            36.9584       1.0000
## 4          0.6579            30.1615       1.0000
## 5          0.6256            40.7028       1.0000
## 6          0.5802            29.6656       1.0000
## 7          0.5356            38.1467       1.0000
## 8          0.5632            27.9172       1.0000
## 9          0.5097            33.6684       1.0000
## 10         0.4643            32.2998       1.0000
## 
## $Best.nc
##                     KL      CH Hartigan   CCC    Scott      Marriot   TrCovW
## Number_clusters 9.0000  2.0000     3.00 9.000   3.0000 3.000000e+00    3.000
## Value_Index     7.8805 58.0809    13.31 1.362 192.9736 1.156505e+12 4520.005
##                  TraceW Friedman   Rubin  Cindex     DB Silhouette   Duda
## Number_clusters  3.0000   3.0000  3.0000 10.0000 9.0000     2.0000 2.0000
## Value_Index     55.4122   2.5024 -0.0815  0.3607 1.3892     0.2499 0.8298
##                 PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters   2.0000 2.0000    3.0000   3.0000      3.000    1   2.000
## Value_Index      20.7196 0.7812    0.3477 149.9464      0.487   NA   0.633
##                   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 9.0000      0  10.000      0 10.0000
## Value_Index     0.1388      0   1.413      0  0.3638
## 
## $Best.partition
##             Switzerland                 Iceland                 Denmark 
##                       3                       3                       3 
##                  Norway                  Canada                 Finland 
##                       3                       3                       3 
##             Netherlands                  Sweden             New Zealand 
##                       3                       3                       3 
##               Australia                  Israel              Costa Rica 
##                       3                       1                       1 
##                 Austria                  Mexico           United States 
##                       3                       1                       3 
##                  Brazil              Luxembourg                 Ireland 
##                       1                       3                       3 
##                 Belgium    United Arab Emirates          United Kingdom 
##                       3                       3                       3 
##                    Oman               Venezuela               Singapore 
##                       3                       1                       3 
##                  Panama                 Germany                   Chile 
##                       1                       3                       1 
##                   Qatar                  France               Argentina 
##                       3                       1                       1 
##          Czech Republic                 Uruguay                Colombia 
##                       1                       3                       1 
##                Thailand            Saudi Arabia                   Spain 
##                       3                       1                       1 
##                   Malta                  Taiwan                  Kuwait 
##                       3                       1                       3 
##                Suriname     Trinidad and Tobago             El Salvador 
##                       1                       1                       1 
##               Guatemala              Uzbekistan                Slovakia 
##                       1                       3                       1 
##                   Japan             South Korea                 Ecuador 
##                       1                       1                       1 
##                 Bahrain                   Italy                 Bolivia 
##                       3                       1                       1 
##                 Moldova                Paraguay              Kazakhstan 
##                       1                       3                       1 
##                Slovenia               Lithuania               Nicaragua 
##                       1                       1                       3 
##                    Peru                 Belarus                  Poland 
##                       1                       1                       1 
##                Malaysia                 Croatia                   Libya 
##                       3                       1                       1 
##                  Russia                 Jamaica            North Cyprus 
##                       1                       1                       1 
##                  Cyprus                 Algeria                  Kosovo 
##                       1                       1                       2 
##            Turkmenistan               Mauritius               Hong Kong 
##                       3                       3                       3 
##                 Estonia               Indonesia                 Vietnam 
##                       1                       3                       1 
##                  Turkey              Kyrgyzstan                 Nigeria 
##                       1                       1                       2 
##                  Bhutan              Azerbaijan                Pakistan 
##                       3                       1                       2 
##                  Jordan              Montenegro                   China 
##                       1                       1                       1 
##                  Zambia                 Romania                  Serbia 
##                       2                       1                       1 
##                Portugal                  Latvia             Philippines 
##                       1                       1                       3 
##       Somaliland region                 Morocco               Macedonia 
##                       3                       2                       1 
##              Mozambique                 Albania  Bosnia and Herzegovina 
##                       2                       1                       1 
##                 Lesotho      Dominican Republic                    Laos 
##                       2                       3                       3 
##                Mongolia               Swaziland                  Greece 
##                       1                       2                       1 
##                 Lebanon                 Hungary                Honduras 
##                       1                       1                       1 
##              Tajikistan                 Tunisia Palestinian Territories 
##                       2                       2                       1 
##              Bangladesh                    Iran                 Ukraine 
##                       2                       2                       1 
##                    Iraq            South Africa                   Ghana 
##                       2                       2                       2 
##                Zimbabwe                 Liberia                   India 
##                       2                       2                       2 
##                   Sudan                   Haiti        Congo (Kinshasa) 
##                       2                       2                       2 
##                   Nepal                Ethiopia            Sierra Leone 
##                       2                       2                       2 
##              Mauritania                   Kenya                Djibouti 
##                       2                       2                       2 
##                 Armenia                Botswana                 Georgia 
##                       1                       2                       2 
##                  Malawi               Sri Lanka                Cameroon 
##                       2                       3                       2 
##                Bulgaria                   Egypt                   Yemen 
##                       1                       2                       2 
##                  Angola                    Mali     Congo (Brazzaville) 
##                       2                       2                       2 
##                 Comoros                  Uganda                 Senegal 
##                       2                       2                       2 
##                   Gabon                   Niger                Cambodia 
##                       2                       2                       3 
##                Tanzania              Madagascar                    Chad 
##                       2                       2                       2 
##                  Guinea             Ivory Coast            Burkina Faso 
##                       2                       2                       2 
##             Afghanistan                   Benin                 Burundi 
##                       2                       2                       2 
##                    Togo 
##                       2

The final method is telling us that the best number of clusters to use is 3, therefore I will proceed with 3 clusters. I have tried to test with 2 clusters (based on Elbow method and Silhouette analysis), between_SS / total_SS ratio was much lower than in the case if 3 clusters are formed.

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

I will remove Cambodia and Greece as they are outliers.

mydata <- mydata %>%
  filter(!ID %in% c(144,102))

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")]))
rownames(mydata_clu_std) <- mydata$Country

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

mydata <- mydata %>%
  filter(!ID %in% c(129))

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia")]))
rownames(mydata_clu_std) <- mydata$Country

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

With the help of Principal Component Analysis around 59.2% (18.2% + 41%) of information is showed when combining the 5 variables into 2 dimensions.

Clustering

## K-means clustering with 3 clusters of sizes 29, 53, 69
## 
## Cluster means:
##       Family Life_Expectancy     Freedom      Trust Generosity    Dystopia
## 1  0.8446439       0.8307036  1.22703520  1.4587738  1.2082289  0.08699786
## 2 -0.9212850      -1.0423774 -0.76549278 -0.3046756 -0.2130384 -0.13800652
## 3  0.3526584       0.4515304  0.07227676 -0.3790817 -0.3441681  0.06944069
## 
## Clustering vector:
##             Switzerland                 Iceland                 Denmark 
##                       1                       1                       1 
##                  Norway                  Canada                 Finland 
##                       1                       1                       1 
##             Netherlands                  Sweden             New Zealand 
##                       1                       1                       1 
##               Australia                  Israel              Costa Rica 
##                       1                       3                       3 
##                 Austria                  Mexico           United States 
##                       1                       3                       1 
##                  Brazil              Luxembourg                 Ireland 
##                       3                       1                       1 
##                 Belgium    United Arab Emirates          United Kingdom 
##                       1                       1                       1 
##                    Oman               Venezuela               Singapore 
##                       1                       3                       1 
##                  Panama                 Germany                   Chile 
##                       3                       1                       3 
##                   Qatar                  France               Argentina 
##                       1                       3                       3 
##          Czech Republic                 Uruguay                Colombia 
##                       3                       1                       3 
##                Thailand            Saudi Arabia                   Spain 
##                       1                       3                       3 
##                   Malta                  Taiwan                  Kuwait 
##                       1                       3                       3 
##                Suriname     Trinidad and Tobago             El Salvador 
##                       3                       3                       3 
##               Guatemala              Uzbekistan                Slovakia 
##                       3                       1                       3 
##                   Japan             South Korea                 Ecuador 
##                       3                       3                       3 
##                 Bahrain                   Italy                 Bolivia 
##                       3                       3                       3 
##                 Moldova                Paraguay              Kazakhstan 
##                       3                       3                       3 
##                Slovenia               Lithuania               Nicaragua 
##                       3                       3                       3 
##                    Peru                 Belarus                  Poland 
##                       3                       3                       3 
##                Malaysia                 Croatia                   Libya 
##                       3                       3                       3 
##                  Russia                 Jamaica            North Cyprus 
##                       3                       3                       3 
##                  Cyprus                 Algeria                  Kosovo 
##                       3                       3                       2 
##            Turkmenistan               Mauritius               Hong Kong 
##                       3                       3                       1 
##                 Estonia               Indonesia                 Vietnam 
##                       3                       3                       3 
##                  Turkey              Kyrgyzstan                 Nigeria 
##                       3                       3                       2 
##                  Bhutan              Azerbaijan                Pakistan 
##                       1                       3                       2 
##                  Jordan              Montenegro                   China 
##                       3                       2                       3 
##                  Zambia                 Romania                  Serbia 
##                       2                       3                       3 
##                Portugal                  Latvia             Philippines 
##                       3                       3                       3 
##       Somaliland region                 Morocco               Macedonia 
##                       1                       2                       3 
##              Mozambique                 Albania  Bosnia and Herzegovina 
##                       2                       3                       2 
##                 Lesotho      Dominican Republic                    Laos 
##                       2                       3                       1 
##                Mongolia               Swaziland                 Lebanon 
##                       3                       2                       3 
##                 Hungary                Honduras              Tajikistan 
##                       3                       3                       3 
##                 Tunisia Palestinian Territories              Bangladesh 
##                       2                       2                       2 
##                    Iran                 Ukraine                    Iraq 
##                       2                       3                       2 
##            South Africa                   Ghana                Zimbabwe 
##                       2                       2                       2 
##                 Liberia                   India                   Sudan 
##                       2                       2                       2 
##                   Haiti        Congo (Kinshasa)                   Nepal 
##                       2                       2                       2 
##                Ethiopia            Sierra Leone              Mauritania 
##                       2                       2                       2 
##                   Kenya                Djibouti                 Armenia 
##                       2                       2                       2 
##                Botswana                 Georgia               Sri Lanka 
##                       2                       2                       3 
##                Cameroon                Bulgaria                   Egypt 
##                       2                       3                       2 
##                   Yemen                  Angola                    Mali 
##                       2                       2                       2 
##     Congo (Brazzaville)                 Comoros                  Uganda 
##                       2                       2                       2 
##                 Senegal                   Gabon                   Niger 
##                       2                       2                       2 
##                Tanzania              Madagascar                    Chad 
##                       2                       2                       2 
##                  Guinea             Ivory Coast            Burkina Faso 
##                       2                       2                       2 
##             Afghanistan                   Benin                 Burundi 
##                       2                       2                       2 
##                    Togo 
##                       2 
## 
## Within cluster sum of squares by cluster:
## [1]  93.17516 205.76961 229.03003
##  (between_SS / total_SS =  41.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups

##       Family Life_Expectancy     Freedom      Trust Generosity    Dystopia
## 1  0.8446439       0.8307036  1.22703520  1.4587738  1.2082289  0.08699786
## 2 -0.9212850      -1.0423774 -0.76549278 -0.3046756 -0.2130384 -0.13800652
## 3  0.3526584       0.4515304  0.07227676 -0.3790817 -0.3441681  0.06944069

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

Figure <- pivot_longer(Figure, cols = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia"), 
                       labels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity", "Dystopia"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Group 1 (High Performers in Happiness)

Countries in this cluster show above-average values for most variables, such as Family, Life Expectancy, Freedom, Trust, and Generosity.

These are countries with higher happiness scores, robust economic performance (GDP per capita), and strong social support systems. Examples: Switzerland, Iceland, Denmark, Norway, and Canada.

Group 2 (Low Performers in Happiness)

This group has below-average scores for most variables, with the lowest scores for Family and Life Expectancy.

These countries struggle with socioeconomic challenges, weaker institutional support, and lower trust levels.

Examples: Countries like Zimbabwe, Afghanistan, and Haiti.

Group 3 (Moderate Performers in Happiness)

Countries in this cluster have mixed performance. Some variables are close to the average, while others (e.g., Trust and Generosity) are below average.

These countries are transitioning economies with moderate happiness scores and average institutional support.

Examples: Countries like Brazil, Chile, and Slovenia.

mydata$Group <- Clustering$cluster

#Checking if clustering variables successfully differentiate between groups

fit <- aov(cbind(Family, Life_Expectancy, Freedom, Trust, Generosity, Dystopia) ~ as.factor(Group), 
           data = mydata)

summary(fit)

##  Response Family :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 4.8214 2.41069  72.545 < 2.2e-16 ***
## Residuals        148 4.9181 0.03323                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Life_Expectancy :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 5.4769 2.73846  116.29 < 2.2e-16 ***
## Residuals        148 3.4853 0.02355                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Freedom :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 1.6430 0.82148  74.158 < 2.2e-16 ***
## Residuals        148 1.6395 0.01108                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Trust :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 1.0512 0.52560  77.119 < 2.2e-16 ***
## Residuals        148 1.0087 0.00682                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Generosity :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.72508 0.36254  40.331 1.045e-14 ***
## Residuals        148 1.33040 0.00899                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Dystopia :
##                   Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group)   2  0.418 0.20893  0.7785  0.461
## Residuals        148 39.719 0.26837

Response for Family:

H0: μ(Family, G1) = μ(Family, G2) = μ(Family, G3)

H1: At least one μ(Family, j) is different.

We can reject H0 at p < 0.001. We can reject H0 for all cluster variables at p < 0.001, but for Dystopia.

We cannot reject H0, since p=0.461. The arithmetic mean of the variable Dystopia in one group is not statistically different from the arithmetic means in the other groups for the same variable. We would need additional clustering variables to enhance the differentiation of clusters.

aggregate(mydata$GDP_per_capita, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1         x
## 1       1 1.2304183
## 2       2 0.4914351
## 3       3 1.0013935

On average the 1 group has the biggest GDP per capita.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(mydata$GDP_per_capita, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2   2.238 0.1103
##       148

H0: σ2 (GDP_per_capita, G1) = σ2 (GDP_per_capita, G2) = σ2 (GDP_per_capita, G3)

H1: At least one σ2 (GDP_per_capita, j) is different.

We cannot reject H0. We can assume homogeneity of variance.

library(dplyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(GDP_per_capita)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable       statistic         p
##   <fct>                     <chr>              <dbl>     <dbl>
## 1 1                         GDP_per_capita     0.789 0.0000527
## 2 2                         GDP_per_capita     0.958 0.0575   
## 3 3                         GDP_per_capita     0.981 0.388

H0: GDP per capita is normally distributed in G1.

H1: GDP per capita is not normally distributed in G1.

We reject H0 at p < 0.001.

H0: GDP per capita is normally distributed in G2.

H1: GDP per capita is not normally distributed in G2.

We cannot reject H0

H0: GDP per capita is normally distributed in G3.

H1: GDP per capita is not normally distributed in G3.

We cannot reject H0.

Our result is not validated.

Kruskal-Wallis:

H0: The location distribution of GDP per capita is the same in all groups

H1: The location distribution of GDP per capita is different in at least on of the groups

kruskal.test(GDP_per_capita ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  GDP_per_capita by as.factor(Group)
## Kruskal-Wallis chi-squared = 81.992, df = 2, p-value < 2.2e-16

We reject H0 at p<0.001. We can’t say that groups significantly differ in GDP per capita.

chi_square <- chisq.test(mydata$Region, as.factor(mydata$Group))

## Warning in chisq.test(mydata$Region, as.factor(mydata$Group)): Chi-squared
## approximation may be incorrect

chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Region and as.factor(mydata$Group)
## X-squared = 154.09, df = 18, p-value < 2.2e-16

H0:There is no association between Region and classification of countries into 3 groups.

H1: There is association between Region and classification of countries into 3 groups.

We reject H0 at p<0.001.

addmargins(chi_square$observed)

##                                  as.factor(mydata$Group)
## mydata$Region                       1   2   3 Sum
##   Australia and New Zealand         2   0   0   2
##   Central and Eastern Europe        1   5  23  29
##   Eastern Asia                      1   0   5   6
##   Latin America and Caribbean       1   1  20  22
##   Middle East and Northern Africa   3   7   9  19
##   North America                     2   0   0   2
##   Southeastern Asia                 3   0   4   7
##   Southern Asia                     1   5   1   7
##   Sub-Saharan Africa                1  35   1  37
##   Western Europe                   14   0   6  20
##   Sum                              29  53  69 151

addmargins(round(chi_square$expected, 2))

##                                  as.factor(mydata$Group)
## mydata$Region                         1     2     3    Sum
##   Australia and New Zealand        0.38  0.70  0.91   1.99
##   Central and Eastern Europe       5.57 10.18 13.25  29.00
##   Eastern Asia                     1.15  2.11  2.74   6.00
##   Latin America and Caribbean      4.23  7.72 10.05  22.00
##   Middle East and Northern Africa  3.65  6.67  8.68  19.00
##   North America                    0.38  0.70  0.91   1.99
##   Southeastern Asia                1.34  2.46  3.20   7.00
##   Southern Asia                    1.34  2.46  3.20   7.00
##   Sub-Saharan Africa               7.11 12.99 16.91  37.01
##   Western Europe                   3.84  7.02  9.14  20.00
##   Sum                             28.99 53.01 68.99 150.99

Most of expected frequencies aren’t larger than 5. The test is invalid.

round(chi_square$res, 2)

##                                  as.factor(mydata$Group)
## mydata$Region                         1     2     3
##   Australia and New Zealand        2.61 -0.84 -0.96
##   Central and Eastern Europe      -1.94 -1.62  2.68
##   Eastern Asia                    -0.14 -1.45  1.36
##   Latin America and Caribbean     -1.57 -2.42  3.14
##   Middle East and Northern Africa -0.34  0.13  0.11
##   North America                    2.61 -0.84 -0.96
##   Southeastern Asia                1.43 -1.57  0.45
##   Southern Asia                   -0.30  1.62 -1.23
##   Sub-Saharan Africa              -2.29  6.11 -3.87
##   Western Europe                   5.18 -2.65 -1.04

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

effectsize::cramers_v(mydata$Region, mydata$Group)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.68              | [0.52, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.68)

## [1] "very large"
## (Rules: funder2019)

We can’t use residuals, since not all assumptions are met.

4 Conclusion

I clustered 151 countries based on six standardized variables: Family, Life Expectancy, Freedom, Trust, Generosity, and Dystopia.

We clustered 151 countries based on six standardized variables: Family, Life Expectancy, Freedom, Trust, Generosity, and Dystopia. Our analysis revealed distinct patterns and relationships across the three identified clusters, supported by various statistical tests.

Cluster 1 (High Performers in Happiness): This group consists of 29 countries (19.2%) that exhibit above-average values for most variables. These countries tend to have high GDP per capita and strong social and institutional support systems. Our statistical tests confirmed that this cluster had the highest average GDP per capita, and the distribution of variables like Family, Life Expectancy, and Freedom differed significantly from the other clusters. However, we could not find a significant difference in the Dystopia variable between clusters, indicating it may not be a strong differentiator.

Cluster 2 (Low performers in Happiness): This cluster, with 53 countries (35.1%), includes those facing substantial challenges across variables. These countries report below-average values for most variables, particularly Family and Life Expectancy. While Kruskal-Wallis tests showed that GDP per capita significantly differs between clusters, the Shapiro-Wilk test revealed that GDP per capita is not normally distributed in this cluster, which slightly limits our ability to draw broader conclusions. Despite this, the overall trend highlights the need for interventions to address systemic issues in this cluster.

Cluster 3 (Moderate Performers in Happiness): Representing 69 countries (45.7%), this cluster contains countries with average performance across variables. While their values are closer to the global mean, Trust and Generosity are slightly lower, indicating areas for improvement. Statistical analysis confirmed that this cluster occupies a middle ground in terms of GDP per capita and other variables, suggesting that these countries may benefit from targeted efforts to improve governance and social trust.

Overall, the clustering analysis highlights significant disparities in happiness-related variables among countries, suggesting targeted approaches to address the unique challenges faced by each cluster. Solutions tailored to specific clusters—such as increasing trust and infrastructure in Cluster 2 or sustaining progress in Cluster 1—could help reduce global inequalities and promote well-being. While most findings were robust and statistically validated, certain limitations (e.g., chi-squared test validity and normality violations) should be considered when interpreting the results.

Homework assignment 2 at course MVA

Simon Sever

2024-12-12

1 Data import

2 Data description

3 Based on variables, such as Family, Life_Expectancy, Freedom, Trust, Generosity and Dystopia, how can we identify distinct segments?

4 Conclusion