HW2 - Clustering

1. Data import, manipulation & description

At the beginning, I activated the required libraries and hid the code with “include=FALSE”.

First, I imported a data set about World happiness report from Kaggle.com, based on 2015 data so I could analyze how happy people were 10 years ago!

Data found on: https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2015.csv

World_data <- read.table("C:/Users/Adam/OneDrive - Univerza v Ljubljani/FAKS/IMB/MULTIVARIATE ANALYSIS/Homework assignments/HW2/HW2-File/2015.csv", header=TRUE, sep=",", dec=".")
World_data <- as.data.frame(World_data)

I used read.table instead of read.csv to see if that fixes my Knitting problems, I also renamed the columns so there’s no spaces in the wording: Note, it didn’t fix my knitting problem.

colnames(World_data) [1] <- "Country"
colnames(World_data) [2] <- "Region"
colnames(World_data) [3] <- "ID_rank"
colnames(World_data) [4] <- "Happiness_score"
colnames(World_data) [5] <- "Standard_error"
colnames(World_data) [6] <- "GDP_PerCapita"
colnames(World_data) [7] <- "Family"
colnames(World_data) [8] <- "Health_life_expectancy"
colnames(World_data) [9] <- "Freedom"
colnames(World_data) [10] <- "Trust_GovCorruption"
colnames(World_data) [11] <- "Generosity"
colnames(World_data) [12] <- "Dystopia_residual"

Let’s take a quick look at the data:

head(World_data)

##       Country         Region ID_rank Happiness_score Standard_error GDP_PerCapita  Family Health_life_expectancy Freedom Trust_GovCorruption
## 1 Switzerland Western Europe       1           7.587        0.03411       1.39651 1.34951                0.94143 0.66557             0.41978
## 2     Iceland Western Europe       2           7.561        0.04884       1.30232 1.40223                0.94784 0.62877             0.14145
## 3     Denmark Western Europe       3           7.527        0.03328       1.32548 1.36058                0.87464 0.64938             0.48357
## 4      Norway Western Europe       4           7.522        0.03880       1.45900 1.33095                0.88521 0.66973             0.36503
## 5      Canada  North America       5           7.427        0.03553       1.32629 1.32261                0.90563 0.63297             0.32957
## 6     Finland Western Europe       6           7.406        0.03140       1.29025 1.31826                0.88911 0.64169             0.41372
##   Generosity Dystopia_residual
## 1    0.29678           2.51738
## 2    0.43630           2.70201
## 3    0.34139           2.49204
## 4    0.34699           2.46531
## 5    0.45811           2.45176
## 6    0.23351           2.61955

In the dataset, we can observe the following 12 variables for each of the 158 countries:

Country: Name of the country.
Region: Region the country belongs to.
Happiness Rank: Rank of the country based on the Happiness Score.
Happiness Score: A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness”.
Standard Error: The standard error of the happiness score.
Economy (GDP per Capita): The extent to which GDP contributes to the calculation of the Happiness Score.
Family: The extent to which Family contributes to the calculation of the Happiness Score.
Health (Life expectancy): The extent to which Life expectancy contributed to the calculation of the Happiness Score.
Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.
Trust (Government corruption): The extent to which Perception of Corruption contributes to Happiness Score.
Generosity: The extent to which Generosity contributed to the calculation of the Happiness Score.
Dystopia Residual: The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.

summary1 <- summary(World_data)
print(summary1)

##    Country             Region             ID_rank       Happiness_score Standard_error    GDP_PerCapita        Family       Health_life_expectancy
##  Length:158         Length:158         Min.   :  1.00   Min.   :2.839   Min.   :0.01848   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000        
##  Class :character   Class :character   1st Qu.: 40.25   1st Qu.:4.526   1st Qu.:0.03727   1st Qu.:0.5458   1st Qu.:0.8568   1st Qu.:0.4392        
##  Mode  :character   Mode  :character   Median : 79.50   Median :5.232   Median :0.04394   Median :0.9102   Median :1.0295   Median :0.6967        
##                                        Mean   : 79.49   Mean   :5.376   Mean   :0.04788   Mean   :0.8461   Mean   :0.9910   Mean   :0.6303        
##                                        3rd Qu.:118.75   3rd Qu.:6.244   3rd Qu.:0.05230   3rd Qu.:1.1584   3rd Qu.:1.2144   3rd Qu.:0.8110        
##                                        Max.   :158.00   Max.   :7.587   Max.   :0.13693   Max.   :1.6904   Max.   :1.4022   Max.   :1.0252        
##     Freedom       Trust_GovCorruption   Generosity     Dystopia_residual
##  Min.   :0.0000   Min.   :0.00000     Min.   :0.0000   Min.   :0.3286   
##  1st Qu.:0.3283   1st Qu.:0.06168     1st Qu.:0.1506   1st Qu.:1.7594   
##  Median :0.4355   Median :0.10722     Median :0.2161   Median :2.0954   
##  Mean   :0.4286   Mean   :0.14342     Mean   :0.2373   Mean   :2.0990   
##  3rd Qu.:0.5491   3rd Qu.:0.18025     3rd Qu.:0.3099   3rd Qu.:2.4624   
##  Max.   :0.6697   Max.   :0.55191     Max.   :0.7959   Max.   :3.6021

We can see the Happiness Scores ranging from 2.839 (min) to 7.587 (max). The GDP per Capita (Mean = 0.8461), Family, and Health (Life Expectancy) indicators show variation. Freedom has a mean of 0.428 and Generosity at 0.237. We’ll later see how this data impacts our analysis.

As some of the variables are unimportant for our clustering analysis, we will be removing them:

World_data_clean <- World_data[ , c(-2, -3, -5, -12)]

head(World_data_clean)

##       Country Happiness_score GDP_PerCapita  Family Health_life_expectancy Freedom Trust_GovCorruption Generosity
## 1 Switzerland           7.587       1.39651 1.34951                0.94143 0.66557             0.41978    0.29678
## 2     Iceland           7.561       1.30232 1.40223                0.94784 0.62877             0.14145    0.43630
## 3     Denmark           7.527       1.32548 1.36058                0.87464 0.64938             0.48357    0.34139
## 4      Norway           7.522       1.45900 1.33095                0.88521 0.66973             0.36503    0.34699
## 5      Canada           7.427       1.32629 1.32261                0.90563 0.63297             0.32957    0.45811
## 6     Finland           7.406       1.29025 1.31826                0.88911 0.64169             0.41372    0.23351

2. Research question: How can countries be segmented based on their performance in key happiness factors, including Family support, Health and life expectancy, Freedom, Trust in government, and Generosity?

First, we have to check if the data is appropriate for clustering.

I initially used GDP per Capita in the analysis as well, but removed it later as it caused problems in the clustering.

We have to standardize the data (scale function), so the variables all provide the same effect/impact.

World_data_clean_STD <- as.data.frame(scale(World_data_clean[c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity")]))

Now we can look for potential outliers:

World_data_clean$Dissimilarity <- sqrt(World_data_clean_STD$Family^2 + World_data_clean_STD$Health_life_expectancy^2 + World_data_clean_STD$Freedom^2 + World_data_clean_STD$Trust_GovCorruption^2 + World_data_clean_STD$Generosity^2)

head(World_data_clean[order(-World_data_clean$Dissimilarity), c("Country", "Dissimilarity")])

##                      Country Dissimilarity
## 129                  Myanmar      4.585831
## 148 Central African Republic      4.341715
## 154                   Rwanda      3.750601
## 9                New Zealand      3.724099
## 3                    Denmark      3.697382
## 28                     Qatar      3.602008

We see the biggest jump from Central African Republic onward (4.3 to 3.7), so based on that we will drop Myanmar and CAR from our data and standardize our data again:

World_data_clean2 <- World_data_clean %>%
  filter(!Country %in% c("Myanmar", "Central African Republic"))

n* = 156

World_data_clean_STD2 <- as.data.frame(scale(World_data_clean2[c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity")]))

rownames(World_data_clean_STD2) <- World_data_clean2$Country

Dissimilarity matrix

Distance <- get_dist(World_data_clean_STD2,
                     method = "euclidian")

fviz_dist1 <- fviz_dist(Distance,
          gradient = list(low = "salmon",
                          mid = "grey95",
                          high = "white")) + 
  theme(axis.text.x = element_text(size = 3),
        axis.text.y = element_text(size = 3)) #GPT recommended this to reduce big text
  
print(fviz_dist1)

Here we have the dissimilarity matrix, which uses Euclidian distances to aid us in visualizing clusters. There is a lot of observations (Countries, n* = 156), so we can’t see the country names clearly. From the matrix I can see at least 3 clusters forming, but it requires further analysis:

Hopkins statistics

Using Hopkin’s statistics we can check if data is clusterable, with 0,5 being the treshold.

Hopkins_stat <- get_clust_tendency(World_data_clean_STD2,
                   n =nrow(World_data_clean_STD2) -1,
                   graph = FALSE)
print(Hopkins_stat)

## $hopkins_stat
## [1] 0.6830286
## 
## $plot
## NULL

The statistics are at 0.68 > 0.5, meaning we can proceed with our analysis.

Elbow method

Elbow_method <- fviz_nbclust(World_data_clean_STD2, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")
print(Elbow_method)

In the elbow method, there is not a highly evident break in the slope, but it appears to happen at 3 and 7 number of clusters.

Silhouette analysis

Silhouette_analysis <- fviz_nbclust(World_data_clean_STD2, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette analysis")
print(Silhouette_analysis)

In the Silhouette analysis we see the highest value at 2. Two other potential number of clusters are seen at 3 and 7.

Cluster dendrogram

WARD <- World_data_clean_STD2 %>%
  get_dist(method = "euclidian") %>%
  hclust(method = "ward.D2")
print(WARD)

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 156

Dendrogram <- fviz_dend(WARD,
                        cex = 0.3) #cex is a ChatGPT recommendation to make text under Dendrogram less clustered

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

print(Dendrogram)

The biggest visible jump in the vertical line in the Dendrogram is seen at 3 clusters.

Indice method

Indice_method <- NbClust(World_data_clean_STD2,
                         distance = "euclidean",
                         min.nc = 2, max.nc = 10,
                         method = "kmeans",
                         index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

print(Indice_method)

## $All.index
##         KL      CH Hartigan     CCC    Scott     Marriot    TrCovW   TraceW Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale
## 2   1.3166 73.1674  44.9343 -1.5735 195.3316 35315734948 11930.198 525.3834   3.1311 1.4751 0.4039 1.4501     0.2746 0.9513   5.1717  0.1581
## 3   2.0042 69.2725  27.5202 -0.2824 388.4529 23041609882  7117.389 406.7125   5.4273 1.9055 0.3448 1.3972     0.2695 0.8513  11.1827  0.5389
## 4   1.3647 63.2457  20.7245  0.4435 513.8824 18331678934  5527.242 344.7094   7.3639 2.2483 0.3454 1.3639     0.2507 1.3808 -20.1320 -0.8440
## 5   1.5892 58.6942   7.5026  1.1318 615.2159 14959454904  4127.544 303.3490   9.0857 2.5548 0.3854 1.3663     0.2563 1.9208 -34.5147 -1.4662
## 6   3.1160 50.4525   8.3136 -0.1628 639.4804 18438577563  3618.243 288.9902   9.5448 2.6818 0.3329 1.3624     0.2363 1.4354 -13.9523 -0.9196
## 7   0.0751 45.4546  26.1072 -0.8857 700.8194 16937798874  3464.357 273.8143  11.1882 2.8304 0.3129 1.3345     0.2142 1.5957 -15.6800 -1.1295
## 8  14.0319 49.1849   8.4114  2.1103 812.2501 10829902828  2158.047 232.9906  13.2899 3.3263 0.3940 1.3100     0.2380 1.2300  -6.1715 -0.5644
## 9   0.4673 46.2198   9.0653  1.8704 848.3482 10875136180  1939.257 220.4610  13.9415 3.5154 0.3822 1.3482     0.2247 1.2898  -7.8649 -0.6740
## 10 15.0520 44.3215   5.8537  1.9318 891.5193 10180380410  1769.317 207.6552  14.8895 3.7321 0.3761 1.2708     0.2209 1.0139  -0.4671 -0.0413
##    Ratkowsky     Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2     0.3905 262.6917     0.4451  0.1263  0.7159 0.1253 0.0023  1.8367 1.7447 1.2727
## 3     0.3947 135.5708     0.5458  0.8865  1.1204 0.1081 0.0034  1.7352 1.5212 0.9307
## 4     0.3707  86.1773     0.5084  0.2018  1.6782 0.1272 0.0036  1.6230 1.3906 0.5690
## 5     0.3477  60.6698     0.5213  0.9459  1.9563 0.1523 0.0041  1.5588 1.3112 0.4438
## 6     0.3224  48.1650     0.4977 -1.6330  2.2866 0.0510 0.0039  1.7100 1.2835 0.5630
## 7     0.3028  39.1163     0.4533  0.0967  2.7039 0.0473 0.0040  1.8817 1.2448 0.3736
## 8     0.2953  29.1238     0.4725  0.4949  3.1287 0.1271 0.0044  1.8822 1.1630 0.3509
## 9     0.2817  24.4957     0.4566  0.5625  3.4987 0.1109 0.0045  1.7358 1.1291 0.3323
## 10    0.2705  20.7655     0.4459  0.6159  3.7482 0.1109 0.0047  1.7848 1.0938 0.3176
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.6588            52.3182       0.9775
## 3          0.6513            34.2721       0.7467
## 4          0.5995            48.7619       1.0000
## 5          0.5965            48.7016       1.0000
## 6          0.5502            37.6080       1.0000
## 7          0.5399            35.7855       1.0000
## 8          0.5287            29.4215       1.0000
## 9          0.5022            34.6984       1.0000
## 10         0.5094            32.7506       1.0000
## 
## $Best.nc
##                     KL      CH Hartigan    CCC    Scott    Marriot   TrCovW  TraceW Friedman   Rubin Cindex      DB Silhouette   Duda PseudoT2  Beale
## Number_clusters 10.000  2.0000   7.0000 8.0000   3.0000          3    3.000  3.0000   3.0000  8.0000 7.0000 10.0000     2.0000 2.0000   2.0000 2.0000
## Value_Index     15.052 73.1674  17.7936 2.1103 193.1213 7564194118 4812.809 56.6678   2.2963 -0.3069 0.3129  1.2708     0.2746 0.9513   5.1717 0.1581
##                 Ratkowsky     Ball PtBiserial Frey McClain   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters    3.0000   3.0000     3.0000    1  2.0000 5.0000      0  5.0000      0 10.0000
## Value_Index        0.3947 127.1209     0.5458   NA  0.7159 0.1523      0  1.5588      0  0.3176
## 
## $Best.partition
##             Switzerland                 Iceland                 Denmark                  Norway                  Canada                 Finland 
##                       3                       3                       3                       3                       3                       3 
##             Netherlands                  Sweden             New Zealand               Australia                  Israel              Costa Rica 
##                       3                       3                       3                       3                       1                       1 
##                 Austria                  Mexico           United States                  Brazil              Luxembourg                 Ireland 
##                       3                       1                       3                       1                       3                       3 
##                 Belgium    United Arab Emirates          United Kingdom                    Oman               Venezuela               Singapore 
##                       3                       3                       3                       3                       1                       3 
##                  Panama                 Germany                   Chile                   Qatar                  France               Argentina 
##                       1                       3                       1                       3                       1                       1 
##          Czech Republic                 Uruguay                Colombia                Thailand            Saudi Arabia                   Spain 
##                       1                       3                       1                       3                       1                       1 
##                   Malta                  Taiwan                  Kuwait                Suriname     Trinidad and Tobago             El Salvador 
##                       3                       1                       1                       1                       1                       1 
##               Guatemala              Uzbekistan                Slovakia                   Japan             South Korea                 Ecuador 
##                       1                       3                       1                       1                       1                       1 
##                 Bahrain                   Italy                 Bolivia                 Moldova                Paraguay              Kazakhstan 
##                       3                       1                       1                       1                       1                       1 
##                Slovenia               Lithuania               Nicaragua                    Peru                 Belarus                  Poland 
##                       1                       1                       3                       1                       1                       1 
##                Malaysia                 Croatia                   Libya                  Russia                 Jamaica            North Cyprus 
##                       1                       1                       1                       1                       1                       1 
##                  Cyprus                 Algeria                  Kosovo            Turkmenistan               Mauritius               Hong Kong 
##                       1                       1                       2                       3                       1                       3 
##                 Estonia               Indonesia                 Vietnam                  Turkey              Kyrgyzstan                 Nigeria 
##                       1                       1                       1                       1                       1                       2 
##                  Bhutan              Azerbaijan                Pakistan                  Jordan              Montenegro                   China 
##                       3                       1                       2                       1                       1                       1 
##                  Zambia                 Romania                  Serbia                Portugal                  Latvia             Philippines 
##                       2                       1                       1                       1                       1                       1 
##       Somaliland region                 Morocco               Macedonia              Mozambique                 Albania  Bosnia and Herzegovina 
##                       3                       2                       1                       2                       1                       1 
##                 Lesotho      Dominican Republic                    Laos                Mongolia               Swaziland                  Greece 
##                       2                       1                       3                       1                       2                       1 
##                 Lebanon                 Hungary                Honduras              Tajikistan                 Tunisia Palestinian Territories 
##                       1                       1                       1                       2                       2                       1 
##              Bangladesh                    Iran                 Ukraine                    Iraq            South Africa                   Ghana 
##                       2                       2                       1                       2                       2                       2 
##                Zimbabwe                 Liberia                   India                   Sudan                   Haiti        Congo (Kinshasa) 
##                       2                       2                       2                       2                       2                       2 
##                   Nepal                Ethiopia            Sierra Leone              Mauritania                   Kenya                Djibouti 
##                       2                       2                       2                       2                       2                       2 
##                 Armenia                Botswana                 Georgia                  Malawi               Sri Lanka                Cameroon 
##                       1                       2                       2                       2                       1                       2 
##                Bulgaria                   Egypt                   Yemen                  Angola                    Mali     Congo (Brazzaville) 
##                       1                       2                       2                       2                       2                       2 
##                 Comoros                  Uganda                 Senegal                   Gabon                   Niger                Cambodia 
##                       2                       2                       2                       2                       2                       2 
##                Tanzania              Madagascar                    Chad                  Guinea             Ivory Coast            Burkina Faso 
##                       2                       2                       2                       2                       2                       2 
##             Afghanistan                  Rwanda                   Benin                   Syria                 Burundi                    Togo 
##                       2                       3                       2                       2                       2                       2

The indice method recommends using 3 clusters, which is what we’ve already established with the previous tests.

For further analysis, we will be using 3 clusters.

K-Means clustering

With K-Means clustering, we will start with 3 random points that preform 25 full trials to find the best options.

Clustering <- kmeans(World_data_clean_STD2,
                     centers = 3,
                     nstart = 25)
print(Clustering)

## K-means clustering with 3 clusters of sizes 31, 73, 52
## 
## Cluster means:
##       Family Health_life_expectancy     Freedom Trust_GovCorruption  Generosity
## 1  0.8028006              0.7727040  1.19108894           1.4387314  1.10542963
## 2  0.3320646              0.4634078 -0.05718025          -0.4072969 -0.40343257
## 3 -0.9447603             -1.1112038 -0.62979997          -0.2859231 -0.09264887
## 
## Clustering vector:
##             Switzerland                 Iceland                 Denmark                  Norway                  Canada                 Finland 
##                       1                       1                       1                       1                       1                       1 
##             Netherlands                  Sweden             New Zealand               Australia                  Israel              Costa Rica 
##                       1                       1                       1                       1                       2                       2 
##                 Austria                  Mexico           United States                  Brazil              Luxembourg                 Ireland 
##                       1                       2                       1                       2                       1                       1 
##                 Belgium    United Arab Emirates          United Kingdom                    Oman               Venezuela               Singapore 
##                       1                       1                       1                       1                       2                       1 
##                  Panama                 Germany                   Chile                   Qatar                  France               Argentina 
##                       2                       1                       2                       1                       2                       2 
##          Czech Republic                 Uruguay                Colombia                Thailand            Saudi Arabia                   Spain 
##                       2                       1                       2                       1                       2                       2 
##                   Malta                  Taiwan                  Kuwait                Suriname     Trinidad and Tobago             El Salvador 
##                       1                       2                       2                       2                       2                       2 
##               Guatemala              Uzbekistan                Slovakia                   Japan             South Korea                 Ecuador 
##                       2                       1                       2                       2                       2                       2 
##                 Bahrain                   Italy                 Bolivia                 Moldova                Paraguay              Kazakhstan 
##                       2                       2                       2                       2                       2                       2 
##                Slovenia               Lithuania               Nicaragua                    Peru                 Belarus                  Poland 
##                       2                       2                       1                       2                       2                       2 
##                Malaysia                 Croatia                   Libya                  Russia                 Jamaica            North Cyprus 
##                       2                       2                       2                       2                       2                       2 
##                  Cyprus                 Algeria                  Kosovo            Turkmenistan               Mauritius               Hong Kong 
##                       2                       2                       3                       2                       2                       1 
##                 Estonia               Indonesia                 Vietnam                  Turkey              Kyrgyzstan                 Nigeria 
##                       2                       2                       2                       2                       2                       3 
##                  Bhutan              Azerbaijan                Pakistan                  Jordan              Montenegro                   China 
##                       1                       2                       3                       2                       2                       2 
##                  Zambia                 Romania                  Serbia                Portugal                  Latvia             Philippines 
##                       3                       2                       2                       2                       2                       2 
##       Somaliland region                 Morocco               Macedonia              Mozambique                 Albania  Bosnia and Herzegovina 
##                       1                       3                       2                       3                       2                       2 
##                 Lesotho      Dominican Republic                    Laos                Mongolia               Swaziland                  Greece 
##                       3                       2                       1                       2                       3                       2 
##                 Lebanon                 Hungary                Honduras              Tajikistan                 Tunisia Palestinian Territories 
##                       2                       2                       2                       2                       3                       2 
##              Bangladesh                    Iran                 Ukraine                    Iraq            South Africa                   Ghana 
##                       3                       3                       2                       3                       3                       3 
##                Zimbabwe                 Liberia                   India                   Sudan                   Haiti        Congo (Kinshasa) 
##                       3                       3                       3                       3                       3                       3 
##                   Nepal                Ethiopia            Sierra Leone              Mauritania                   Kenya                Djibouti 
##                       3                       3                       3                       3                       3                       3 
##                 Armenia                Botswana                 Georgia                  Malawi               Sri Lanka                Cameroon 
##                       2                       3                       3                       3                       2                       3 
##                Bulgaria                   Egypt                   Yemen                  Angola                    Mali     Congo (Brazzaville) 
##                       2                       3                       3                       3                       3                       3 
##                 Comoros                  Uganda                 Senegal                   Gabon                   Niger                Cambodia 
##                       3                       3                       3                       3                       3                       3 
##                Tanzania              Madagascar                    Chad                  Guinea             Ivory Coast            Burkina Faso 
##                       3                       3                       3                       3                       3                       3 
##             Afghanistan                  Rwanda                   Benin                   Syria                 Burundi                    Togo 
##                       3                       1                       3                       3                       3                       3 
## 
## Within cluster sum of squares by cluster:
## [1]  83.86715 161.73718 160.97704
##  (between_SS / total_SS =  47.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

We created 3 clusters with the following sizes:

1st: 73 +
2nd: 52 +
3rd: 31 = 156

Please note the order of 1st, 2nd and 3rd might change due to rerun of the code.

The biggest variability is seen in cluster 1 at 161.73 (73), which tells us it is the most heterogeneous.

The best we could achieve with the data and 3 groups is a ratio of 47.5%, which means that 47.5% of total variability is explained by classification of countries into 3 groups.

Not the best ratio, but it’s the best we were able to get and we’ll work with it.

Cluster plot

Now time to visualize our clusters:

Cluster_plot <- fviz_cluster(Clustering,
                             palette = "Set1",
                             repel = FALSE,
                             labelsize = 10,
                             ggtheme = theme_bw(),
                             data = World_data_clean_STD2)
print(Cluster_plot)

In the cluster plot, we can see our 3 clusters in action. We see that we will need to remove some countries as they are far away from other units in their respective clusters.

I will be removing Somaliland region, Rwanda, Laos (Cluster 1), Greece (Cluster 2) and Cambodia (Cluster 3).

World_data_clean3 <- World_data_clean2 %>%
  filter(!Country %in% c("Somaliland region", "Rwanda", "Laos", "Cambodia", "Greece"))

World_data_clean_STD3 <- as.data.frame(scale(World_data_clean3[c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity")]))

rownames(World_data_clean_STD3) <- World_data_clean3$Country

Clustering1 <- kmeans(World_data_clean_STD3,
                     centers = 3,
                     nstart = 25)
print(Clustering1)

## K-means clustering with 3 clusters of sizes 54, 27, 70
## 
## Cluster means:
##       Family Health_life_expectancy     Freedom Trust_GovCorruption Generosity
## 1 -0.9476684             -1.0167833 -0.76248707          -0.2751518 -0.1594749
## 2  0.9492764              0.9299521  1.27992474           1.5011733  1.1708433
## 3  0.3649090              0.4256799  0.09451905          -0.3667640 -0.3285875
## 
## Clustering vector:
##             Switzerland                 Iceland                 Denmark                  Norway                  Canada                 Finland 
##                       2                       2                       2                       2                       2                       2 
##             Netherlands                  Sweden             New Zealand               Australia                  Israel              Costa Rica 
##                       2                       2                       2                       2                       3                       3 
##                 Austria                  Mexico           United States                  Brazil              Luxembourg                 Ireland 
##                       2                       3                       2                       3                       2                       2 
##                 Belgium    United Arab Emirates          United Kingdom                    Oman               Venezuela               Singapore 
##                       2                       2                       2                       2                       3                       2 
##                  Panama                 Germany                   Chile                   Qatar                  France               Argentina 
##                       3                       2                       3                       2                       3                       3 
##          Czech Republic                 Uruguay                Colombia                Thailand            Saudi Arabia                   Spain 
##                       3                       2                       3                       2                       3                       3 
##                   Malta                  Taiwan                  Kuwait                Suriname     Trinidad and Tobago             El Salvador 
##                       2                       3                       3                       3                       3                       3 
##               Guatemala              Uzbekistan                Slovakia                   Japan             South Korea                 Ecuador 
##                       3                       2                       3                       3                       3                       3 
##                 Bahrain                   Italy                 Bolivia                 Moldova                Paraguay              Kazakhstan 
##                       3                       3                       3                       3                       3                       3 
##                Slovenia               Lithuania               Nicaragua                    Peru                 Belarus                  Poland 
##                       3                       3                       3                       3                       3                       3 
##                Malaysia                 Croatia                   Libya                  Russia                 Jamaica            North Cyprus 
##                       3                       3                       3                       3                       3                       3 
##                  Cyprus                 Algeria                  Kosovo            Turkmenistan               Mauritius               Hong Kong 
##                       3                       3                       1                       3                       3                       2 
##                 Estonia               Indonesia                 Vietnam                  Turkey              Kyrgyzstan                 Nigeria 
##                       3                       3                       3                       3                       3                       1 
##                  Bhutan              Azerbaijan                Pakistan                  Jordan              Montenegro                   China 
##                       2                       3                       1                       3                       1                       3 
##                  Zambia                 Romania                  Serbia                Portugal                  Latvia             Philippines 
##                       1                       3                       3                       3                       3                       3 
##                 Morocco               Macedonia              Mozambique                 Albania  Bosnia and Herzegovina                 Lesotho 
##                       1                       3                       1                       3                       1                       1 
##      Dominican Republic                Mongolia               Swaziland                 Lebanon                 Hungary                Honduras 
##                       3                       3                       1                       3                       3                       3 
##              Tajikistan                 Tunisia Palestinian Territories              Bangladesh                    Iran                 Ukraine 
##                       3                       1                       1                       1                       1                       3 
##                    Iraq            South Africa                   Ghana                Zimbabwe                 Liberia                   India 
##                       1                       1                       1                       1                       1                       1 
##                   Sudan                   Haiti        Congo (Kinshasa)                   Nepal                Ethiopia            Sierra Leone 
##                       1                       1                       1                       1                       1                       1 
##              Mauritania                   Kenya                Djibouti                 Armenia                Botswana                 Georgia 
##                       1                       1                       1                       1                       1                       1 
##                  Malawi               Sri Lanka                Cameroon                Bulgaria                   Egypt                   Yemen 
##                       1                       3                       1                       3                       1                       1 
##                  Angola                    Mali     Congo (Brazzaville)                 Comoros                  Uganda                 Senegal 
##                       1                       1                       1                       1                       3                       1 
##                   Gabon                   Niger                Tanzania              Madagascar                    Chad                  Guinea 
##                       1                       1                       1                       1                       1                       1 
##             Ivory Coast            Burkina Faso             Afghanistan                   Benin                   Syria                 Burundi 
##                       1                       1                       1                       1                       1                       1 
##                    Togo 
##                       1 
## 
## Within cluster sum of squares by cluster:
## [1] 169.97585  62.26983 147.19870
##  (between_SS / total_SS =  49.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

With the 5 countries removed, the K-Means clustering tells us the ratio (between_SS / total_SS) is now 49.4%, which is better than the previous 47.5%. Our sample now includes 151 countries.

Cluster_plot1 <- fviz_cluster(Clustering1,
                             palette = "Set1",
                             repel = FALSE,
                             labelsize = 10,
                             ggtheme = theme_bw(),
                             data = World_data_clean_STD3)
print(Cluster_plot1)

We see the clusters look better after our manipulation. There is a slight overlap seen between clusters 1 and 3, but this poses no problem. Let’s continue:

Cluster profiles

averages <- Clustering1$centers
print(averages)

##       Family Health_life_expectancy     Freedom Trust_GovCorruption Generosity
## 1 -0.9476684             -1.0167833 -0.76248707          -0.2751518 -0.1594749
## 2  0.9492764              0.9299521  1.27992474           1.5011733  1.1708433
## 3  0.3649090              0.4256799  0.09451905          -0.3667640 -0.3285875

figure <- as.data.frame(averages)
figure$id <- 1:nrow(figure)

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

## The following object is masked from 'package:pastecs':
## 
##     extract

## The following object is masked from 'package:reshape2':
## 
##     smiths

figure <- pivot_longer(figure, cols = c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity"))

figure$Group <- factor(figure$id,
                       levels = c(1, 2, 3),
                       labels = c("1", "2", "3"))

figure$ImeF <- factor(figure$name,
                       levels = c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity"),
                      labels = c("Family", "Health_life_expectancy", "Freedom", "Trust_GovCorruption", "Generosity"))

ggplot1 <- ggplot(figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  scale_color_brewer(palette = "Set1") +
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

print(ggplot1)

World_data_clean3$Group <- Clustering1$cluster

In the cluster profiles we can see how the 3 clusters are rated on average in the Family, Health & life expectancy, Freedom, Trust in government and Generosity variables.

We see cluster 2 ranking above average in every variable. Cluster 3 has an above average Family and Health & life expectancy contribution to the calculation of the Happiness Score, a slightly above average Freedom contribution, but a less than average contribution in Trust in Government and Generosity. Cluster 1 however, deviates negatively from the mean in every variable.

These values make sense as we can see the countries in Cluster 1 are generally less developed countries, and Cluster 2 generally has highly developed countries, with some exceptions.

In the last r chunk I also assigned the countries in the dataset groups by adding a “Group” column (1, 2 or 3).

Checking the appropriateness of the cluster variables used.

fit <- aov(cbind(Family, Health_life_expectancy, Freedom, Trust_GovCorruption, Generosity) ~ as.factor(Group),
           data = World_data_clean3)
fit_summary <- summary(fit)
print(fit_summary)

##  Response Family :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 5.6431 2.82154   89.59 < 2.2e-16 ***
## Residuals        148 4.6611 0.03149                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Health_life_expectancy :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 5.5678 2.78390  116.92 < 2.2e-16 ***
## Residuals        148 3.5238 0.02381                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Freedom :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 1.6909 0.84547  76.512 < 2.2e-16 ***
## Residuals        148 1.6354 0.01105                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Trust_GovCorruption :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.98651 0.49326  72.727 < 2.2e-16 ***
## Residuals        148 1.00378 0.00678                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Generosity :
##                   Df  Sum Sq  Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.61662 0.308308  32.674 1.765e-12 ***
## Residuals        148 1.39650 0.009436                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using ANOVA we’ve checked the appropriateness for cluster variables and we tested the following:

Family:

H0: μ(Family, Group 1) = μ(Family, Group 2) = μ(Family, Group 3)
H1: At least 1 μ(Family, Group j) is different.
We reject H0 at p < 0.001.

Health & life expectancy:

H0: μ(Health & life expectancy, Group 1) = μ(Health & life expectancy, Group 2) = μ(Health & life expectancy, Group 3)
H1: At least 1 μ(Health & life expectancy, Group j) is different.
We reject H0 at p < 0.001.

Freedom:

H0: μ(Freedom, Group 1) = μ(Freedom, Group 2) = μ(Freedom, Group 3)
H1: At least 1 μ(Freedom, Group j) is different.
We reject H0 at p < 0.001.

Trust in Government:

H0: μ(Trust in Government, Group 1) = μ(Trust in Government, Group 2) = μ(Trust in Government, Group 3)
H1: At least 1 μ(Trust in Government, Group j) is different.
We reject H0 at p < 0.001.

Generosity:

H0: μ(Generosity, Group 1) = μ(Generosity, Group 2) = μ(Generosity, Group 3)
H1: At least 1 μ(Generosity, Group j) is different.
We reject H0 at p < 0.001.

This tells us that all of the groups (Group 1, Group 2 and Group 3) are statistically different in the mean values of the cluster variables (5). There is a significant difference between the formed groups, all of the p-values are very close to 0, which further proves the selection of the variables is appropriate.

Checking criterion validity

Now we have to check the variables that were not used for the clustering, which in our case is GDP per Capita. The only other variable left is Happiness Score, but that is directly correlated to and measured by the chosen 5 variables.

aggregate1 <- aggregate(World_data_clean3$GDP_PerCapita,
          by = list(World_data_clean3$Group),
          FUN = mean)
print(aggregate1)

##   Group.1         x
## 1       1 0.4910052
## 2       2 1.2927037
## 3       3 0.9901024

As expected, we see group 2 has the highest GDP per Capita at 1.29, and Group 1 has the lowest GDP per Capita.

With Levene test we’ll be looking at Homogenity of variance:

leveneTest(World_data_clean3$GDP_PerCapita, as.factor(World_data_clean3$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  4.9078 0.008635 **
##       148                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We tested the following:

H0: G^2(GDP per Capita, Group 1) = G^2(GDP per Capita, Group 2) = G^2(GDP per Capita, Group 3)
H1: At least one G^2(GDP per Capita, Group j) is different.
We reject H0 at p = 0.009.

As H0 was rejected, we know that at least one variance of the Groups is different than the rest.

World_data_clean3 %>%
  group_by(Group) %>%
  shapiro_test(GDP_PerCapita)

## # A tibble: 3 × 4
##   Group variable      statistic        p
##   <int> <chr>             <dbl>    <dbl>
## 1     1 GDP_PerCapita     0.962 0.0884  
## 2     2 GDP_PerCapita     0.840 0.000723
## 3     3 GDP_PerCapita     0.967 0.0608

We tested the following for each group:

Group 1

H0: GDP per Capita in Group 1 is normally distributed.
H1: GDP per Capita is not normally distributed.
We can not reject H0 (p > 0.05).

Group 2

H0: GDP per Capita in Group 2 is normally distributed.
H1: GDP per Capita in Group 2 is not normally distributed.
We can reject H0 (p < 0.001).

Group 3

H0: GDP per Capita in Group 3 is normally distributed.
H1: GDP per Capita in Group 3 is not normally distributed.
We can not reject H0 (p > 0.05).

Because the Homogenity of variances is violated as well as normality of GDP per Capita is violated in Group 2, we have to use the Kruskal-Wallis rank sum test (non-parametric test).

kruskal <- kruskal.test(GDP_PerCapita ~ Group,
             data = World_data_clean3)
print(kruskal)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  GDP_PerCapita by Group
## Kruskal-Wallis chi-squared = 89.489, df = 2, p-value < 2.2e-16

With Kruskal-Wallis we tested the following:

H0: Location distributions of GDP per Capita are the same in all 3 groups.
H1: Location distributions of GDP per Capita are NOT the same in all 3 groups.
We reject at p < 0.001.

This means that at least one is different, meaning we have validated the validation and our clusters are trustworthy.

kruskal_eff <- kruskal_effsize(GDP_PerCapita ~ Group,
                               data = World_data_clean3)
print(kruskal_eff)

## # A tibble: 1 × 5
##   .y.               n effsize method  magnitude
## * <chr>         <int>   <dbl> <chr>   <ord>    
## 1 GDP_PerCapita   151   0.591 eta2[H] large

The Kruskal-Wallis test showed a large, statistically significant difference in GDP per Capita across groups, meaning group membership strongly influences GDP per Capita distribution.

3. Conclusion & explanation of results

Based on our analysis, we created 3 distinct groups (Clusters) using 5 standardized variables: Family, Health & life expectancy, Freedom, Trust in government and Generosity (All contributing variables to calculation of the Happiness Score) for 151 of our observed countries based on our cleaned data from the World Happiness Report from 2015.

Cluster 1 (Group 1): (54 of 151 = 35,76%) represents countries that are less developed and tend to have below-average values across all variables. The averages for “Family,” “Health & life expectancy,” “Freedom,” “Trust in Government,” and “Generosity” are all negative, highlighting challenges these countries face in factors contributing to happiness. These countries are characterized by weaker support systems, lower life expectancy, limited freedom, lack of trust in government institutions, and reduced levels of generosity. Furthermore, the GDP per Capita for this group is the lowest among the three clusters (0.491), reinforcing the idea that these countries generally have limited economic resources. Cluster 1 likely consists of countries from lower-income regions, struggling with systemic issues that directly and indirectly impact happiness and well-being.
Cluster 2 (Group 2): (27 of 151 = 17,88%) includes highly developed countries with above-average performance in every variable. These countries rank positively in “Family,” “Health & life expectancy,” “Freedom,” “Trust in Government,” and “Generosity,” indicating a robust support system, strong institutional trust, and economic affluence. The GDP per Capita for this group is the highest among the three clusters (1.29), reflecting significant economic prosperity. These nations are characterized by their advanced healthcare systems, social safety nets, democratic values, and community-driven support. Cluster 2 represents countries where citizens enjoy a higher quality of life and have the resources to lead fulfilling, secure, and generous lives.
Cluster 3 (Group 3): (70 of 151 = 46,36%) lies between the other two clusters in terms of performance. Countries in this group exhibit slightly above-average values in “Family,” “Health & life expectancy,” and “Freedom,” but they fall below average in “Trust in Government” and “Generosity.” This suggests that while these countries maintain decent social structures, healthcare, and personal freedoms, there may be issues with institutional transparency or corruption and a slightly lower emphasis on altruistic behavior. The GDP per Capita for this group (0.99) is moderate, sitting between Cluster 1 and Cluster 2. Overall, these countries are relatively well-off but may still face challenges in fostering trust and community engagement, which affects overall happiness.

HW2 - Clustering

Adam Čeru

23.01.2025

1. Data import, manipulation & description

2. Research question: How can countries be segmented based on their performance in key happiness factors, including Family support, Health and life expectancy, Freedom, Trust in government, and Generosity?

3. Conclusion & explanation of results