knitr::opts_chunk$set(echo = TRUE)
options(width = 120)
#install.packages(ggplot2)
library(ggplot2)
#install.packages("ggfortify")
library(ggfortify)
#install.packages("ranger")
library(ranger)
#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#install.packages("Hmisc")
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

#install.packages("factoextra")
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#install.packages("cluster")
library(cluster)
#install.packages("magrittr")
library(magrittr)
#install.packages("NbClust")
library("NbClust")

h1.title {color: salmon;font-family: "Monaco", monospace; font-size: 48px;}
.author {color: salmon ;font-family: "Monaco", monospace; font-size: 24px;}
.date {color: salmon ;font-family: "Monaco", monospace; font-size: 18px;}

columns {display: flex;}
h1 {color: salmon;font-family: "Monaco", monospace; font-size: 36px;}

columns {display: flex;}
h3 {color: coral;font-family: "Monaco", monospace; font-size: 24px;}

Descriptive statistics

mydata <- read.table("~/Documents/Šola/IMB/2. semester/Multivariate analysis/HW2/2019.csv", header=TRUE, sep=",", dec="." )

colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Country"
colnames(mydata) [3] <- "Happiness_Score"
colnames(mydata) [4] <- "GDP_per_capita"
colnames(mydata) [5] <- "Social_support"
colnames(mydata) [6] <- "Health_life_expectancy"
colnames(mydata) [7] <- "Freedom"
colnames(mydata) [8] <- "Generosity"
colnames(mydata) [9] <- "Perceptions_of_corruptions"

head(mydata)

##   ID     Country Happiness_Score GDP_per_capita Social_support Health_life_expectancy Freedom Generosity
## 1  1     Finland           7.769          1.340          1.587                  0.986   0.596      0.153
## 2  2     Denmark           7.600          1.383          1.573                  0.996   0.592      0.252
## 3  3      Norway           7.554          1.488          1.582                  1.028   0.603      0.271
## 4  4     Iceland           7.494          1.380          1.624                  1.026   0.591      0.354
## 5  5 Netherlands           7.488          1.396          1.522                  0.999   0.557      0.322
## 6  6 Switzerland           7.480          1.452          1.526                  1.052   0.572      0.263
##   Perceptions_of_corruptions
## 1                      0.393
## 2                      0.410
## 3                      0.341
## 4                      0.118
## 5                      0.298
## 6                      0.343

Unit of observation: individual country

Sample size: 156 countries

Definition of variables:

ID: A unique identifier for each country.
Country: Name of the country.
Happiness_score: A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness?”.
GDP_per_capita: The extent to which GDP contributes to the calculation of the Happiness Score.
Social_support: The extent to which Social support contributes to the calculation of the Happiness Score.
Health_life_expectancy: The extent to which Life expectancy contributed to the calculation of the Happiness Score.
Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.
Generosity: The extent to which Willingness to donate to others contributes to Happiness Score.
Perceptions_of_corruptions: The extent to which Perception of Corruption contributes to Happiness Score.

Source: Vikarna. (2022, April 4). World happiness report. Kaggle. https://www.kaggle.com/code/vikarna/world-happiness-report/input?select=2019.csv.

anyNA(mydata)

## [1] FALSE

There is no missing data in my data frame.

summary(mydata[ , c(-1,-2)])

##  Happiness_Score GDP_per_capita   Social_support  Health_life_expectancy    Freedom         Generosity    
##  Min.   :2.853   Min.   :0.0000   Min.   :0.000   Min.   :0.0000         Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:4.545   1st Qu.:0.6028   1st Qu.:1.056   1st Qu.:0.5477         1st Qu.:0.3080   1st Qu.:0.1087  
##  Median :5.380   Median :0.9600   Median :1.272   Median :0.7890         Median :0.4170   Median :0.1775  
##  Mean   :5.407   Mean   :0.9051   Mean   :1.209   Mean   :0.7252         Mean   :0.3926   Mean   :0.1848  
##  3rd Qu.:6.184   3rd Qu.:1.2325   3rd Qu.:1.452   3rd Qu.:0.8818         3rd Qu.:0.5072   3rd Qu.:0.2482  
##  Max.   :7.769   Max.   :1.6840   Max.   :1.624   Max.   :1.1410         Max.   :0.6310   Max.   :0.5660  
##  Perceptions_of_corruptions
##  Min.   :0.0000            
##  1st Qu.:0.0470            
##  Median :0.0855            
##  Mean   :0.1106            
##  3rd Qu.:0.1412            
##  Max.   :0.4530

The average number of happiness score in the data frame is 5.407.

The median number of the happines score is 5.380, meaning that half of countries in the data frame had happiness score lower than 2.445, while the other half had more.

The minimum happiness score in the data frame is 2.853, meaning that the country which had the lowest score it had it at 2.853.

RQ

Based on variables, such as social support, health life expectancy, freedom, generosity and perceptions of corruptions, how can we identify distinct segments?

Is data appropriate?

It has to be numerical, which it is
Dissimilarity metric should be checked with Hopkins statistics

#Saving standardized cluster variables into new data frame

mydata_clu_std <- as.data.frame(scale(mydata[c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions")]))

We want all of the variables to have the same effect, not that the one with the highest variance to contribute the most, for this reason we standardize all cluster variables.

mydata$Dissimilarity <- sqrt(mydata_clu_std$Social_support^2 + mydata_clu_std$Health_life_expectancy^2 + mydata_clu_std$Freedom^2 + 
                             mydata_clu_std$Generosity^2 + mydata_clu_std$Perceptions_of_corruptions^2) #Finding outliers

head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")]) #Finding units with highest value of dissimilarity

##      ID Dissimilarity
## 155 155      5.017007
## 34   34      4.348179
## 149 149      4.309707
## 131 131      4.217867
## 147 147      4.113304
## 154 154      4.001882

I see ID155 as a potential outlier, as there is a big jump in disimilarity numbers between units. For this reason I will remove this unit.

mydata <- mydata %>%
  filter(!ID %in% c(155)) #Removing ID155 from original data frame

mydata$ID <- seq(1, nrow(mydata)) #Numbering the data again

mydata_clu_std <- as.data.frame(scale(mydata[c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions")])) #Standardizing the data again

rownames(mydata_clu_std) <- mydata$Country #Showing the units with country names

After removing one country, the sample size is now 155.

#Finding Eudlidean distances, based on 5 Cluster variables, then saving them into object Distances

Distances <- get_dist(mydata_clu_std, 
                      method = "euclidian")

fviz_dist(Distances, #Showing matrix of distances
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

We can see on the matrix of distances that some groups of clusters are forming, I see either 5 or 3 groups.

get_clust_tendency(mydata_clu_std, #Hopkins statistics
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7320906
## 
## $plot
## NULL

My data is clusterable as it is above 0.5. If it would be more close to 1, it would be even more appropriate. However the threshold is 0.5, therefore my data is appropriate. Now the next question is how many clusters to use. I will check this with Hierarhical clustering (dendrogram) and K-Means clustering (Elbow method, Silhouette analysis and with the help of indices).

How many clusters to use?

WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 155

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Based on the dendrogram, I would choose 3 clusters, as there is the biggest jump in vertical line.

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

With the elbow method the slope changes most evidently at 3 and 5, therefore both are possible options.

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

The higest value of the Silhouette analysis is at 3.

NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 14 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##        KL      CH Hartigan     CCC    Scott     Marriot    TrCovW   TraceW Friedman  Rubin Cindex     DB Silhouette
## 2  0.4144 63.6298  65.4352 -2.5065 182.2619 27534595130 12463.334 543.8310   3.4388 1.4159 0.3792 1.5106     0.2719
## 3  4.4201 77.6284  27.2111  0.3594 400.9012 15116660268  6157.016 380.9190   6.4907 2.0214 0.3291 1.2059     0.3280
## 4  0.8275 69.6263  24.9101  1.0464 516.6175 12738195366  4951.317 323.0810   8.2318 2.3833 0.3134 1.3163     0.2761
## 5  1.6082 66.6177  18.1976  2.3924 617.9681 10350316732  3015.686 277.3304   9.9180 2.7765 0.3341 1.2358     0.2814
## 6  3.4600 62.9765  11.4451  2.8928 707.8913  8343734049  2420.214 247.3255  11.9637 3.1133 0.3658 1.2588     0.2827
## 7  0.5114 58.0270  12.0065  2.5947 747.2997  8807155450  1962.561 229.6829  12.4286 3.3524 0.3590 1.3261     0.2681
## 8  0.6875 55.1127  12.8804  2.7061 797.6182  8314422951  1801.445 212.4481  13.4753 3.6244 0.3365 1.3804     0.2589
## 9  4.9293 53.6913   7.4541  3.1697 864.4944  6835288033  1594.328 195.3328  15.2076 3.9420 0.3551 1.3328     0.2608
## 10 0.6076 50.6413   7.5632  2.8851 902.6896  6595572550  1474.097 185.8444  16.2635 4.1433 0.3531 1.3143     0.2347
##      Duda Pseudot2   Beale Ratkowsky     Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  0.7836  30.3822  0.8552    0.3418 271.9155     0.4014 -0.1897  0.6617 0.1010 0.0022  2.0458 1.7333 1.2820
## 3  0.6762  25.8563  1.4803    0.4068 126.9730     0.6008  1.2330  0.8815 0.1133 0.0035  1.7378 1.4531 1.1676
## 4  1.6155 -19.8116 -1.1552    0.3793  80.7702     0.5351  0.8433  1.4868 0.1079 0.0037  1.8536 1.3138 0.8145
## 5  2.2216 -31.8922 -1.6616    0.3574  55.4661     0.4975  0.1048  2.0003 0.1006 0.0037  1.7776 1.2166 0.4353
## 6  1.2917  -7.2265 -0.6816    0.3359  41.2209     0.5105  0.6072  2.1345 0.1174 0.0043  1.8505 1.1633 0.3640
## 7  1.4571 -11.2931 -0.9441    0.3165  32.8118     0.4884  0.4057  2.5073 0.1314 0.0045  2.1046 1.1253 0.3517
## 8  1.3334  -6.2515 -0.7366    0.3008  26.5560     0.4697  0.2215  2.9409 0.1314 0.0047  2.1801 1.0824 0.3373
## 9  1.2804  -8.5396 -0.6625    0.2879  21.7036     0.4645  1.4268  3.1784 0.1459 0.0047  2.1742 1.0466 0.3318
## 10 0.9823   0.4673  0.0534    0.2754  18.5844     0.4548  0.4866  3.3398 0.1316 0.0049  2.0619 1.0230 0.3258
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.6816            51.3852       0.5112
## 3          0.6693            26.6776       0.1951
## 4          0.5502            42.5134       1.0000
## 5          0.5344            50.5235       1.0000
## 6          0.5287            28.5299       1.0000
## 7          0.5162            33.7471       1.0000
## 8          0.4360            32.3364       1.0000
## 9          0.5399            33.2294       1.0000
## 10         0.4684            29.5064       0.9982
## 
## $Best.nc
##                     KL      CH Hartigan    CCC    Scott     Marriot   TrCovW   TraceW Friedman   Rubin Cindex     DB
## Number_clusters 9.0000  3.0000   3.0000 9.0000   3.0000           3    3.000   3.0000   3.0000  3.0000 4.0000 3.0000
## Value_Index     4.9293 77.6284  38.2242 3.1697 218.6393 10039469960 6306.318 105.0739   3.0518 -0.2437 0.3134 1.2059
##                 Silhouette   Duda PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey McClain   Dunn Hubert SDindex
## Number_clusters      3.000 2.0000   2.0000 2.0000    3.0000   3.0000     3.0000    1  2.0000 9.0000      0  3.0000
## Value_Index          0.328 0.7836  30.3822 0.8552    0.4068 144.9425     0.6008   NA  0.6617 0.1459      0  1.7378
##                 Dindex    SDbw
## Number_clusters      0 10.0000
## Value_Index          0  0.3258
## 
## $Best.partition
##                 Finland                 Denmark                  Norway                 Iceland             Netherlands 
##                       3                       3                       3                       3                       3 
##             Switzerland                  Sweden             New Zealand                  Canada                 Austria 
##                       3                       3                       3                       3                       3 
##               Australia              Costa Rica                  Israel              Luxembourg          United Kingdom 
##                       3                       1                       1                       3                       3 
##                 Ireland                 Germany                 Belgium           United States          Czech Republic 
##                       3                       3                       3                       3                       1 
##    United Arab Emirates                   Malta                  Mexico                  France                  Taiwan 
##                       3                       3                       1                       1                       1 
##                   Chile               Guatemala            Saudi Arabia                   Qatar                   Spain 
##                       1                       1                       1                       3                       1 
##                  Panama                  Brazil                 Uruguay               Singapore             El Salvador 
##                       1                       1                       1                       3                       1 
##                   Italy                 Bahrain                Slovakia       Trinidad & Tobago                  Poland 
##                       1                       3                       1                       1                       1 
##              Uzbekistan               Lithuania                Colombia                Slovenia               Nicaragua 
##                       3                       1                       1                       1                       1 
##                  Kosovo               Argentina                 Romania                  Cyprus                 Ecuador 
##                       1                       1                       1                       1                       1 
##                  Kuwait                Thailand                  Latvia             South Korea                 Estonia 
##                       1                       3                       1                       1                       1 
##                 Jamaica               Mauritius                   Japan                Honduras              Kazakhstan 
##                       1                       1                       1                       1                       1 
##                 Bolivia                 Hungary                Paraguay         Northern Cyprus                    Peru 
##                       1                       1                       1                       1                       1 
##                Portugal                Pakistan                  Russia             Philippines                  Serbia 
##                       1                       2                       1                       1                       1 
##                 Moldova                   Libya              Montenegro              Tajikistan                 Croatia 
##                       1                       1                       1                       1                       1 
##               Hong Kong      Dominican Republic  Bosnia and Herzegovina                  Turkey                Malaysia 
##                       3                       1                       1                       1                       1 
##                 Belarus                  Greece                Mongolia         North Macedonia                 Nigeria 
##                       1                       1                       1                       1                       2 
##              Kyrgyzstan            Turkmenistan                 Algeria                 Morocco              Azerbaijan 
##                       1                       1                       1                       1                       1 
##                 Lebanon               Indonesia                   China                 Vietnam                  Bhutan 
##                       1                       3                       1                       1                       3 
##                Cameroon                Bulgaria                   Ghana             Ivory Coast                   Nepal 
##                       2                       1                       2                       2                       1 
##                  Jordan                   Benin     Congo (Brazzaville)                   Gabon                    Laos 
##                       1                       2                       2                       1                       2 
##            South Africa                 Albania               Venezuela                Cambodia Palestinian Territories 
##                       1                       1                       1                       1                       1 
##                 Senegal                 Somalia                 Namibia                   Niger            Burkina Faso 
##                       2                       2                       1                       2                       2 
##                 Armenia                    Iran                  Guinea                 Georgia                  Gambia 
##                       1                       2                       2                       2                       2 
##                   Kenya              Mauritania              Mozambique                 Tunisia              Bangladesh 
##                       2                       2                       2                       1                       1 
##                    Iraq        Congo (Kinshasa)                    Mali            Sierra Leone               Sri Lanka 
##                       2                       2                       2                       2                       1 
##                 Myanmar                    Chad                 Ukraine                Ethiopia               Swaziland 
##                       3                       2                       1                       2                       2 
##                  Uganda                   Egypt                  Zambia                    Togo                   India 
##                       2                       1                       2                       2                       2 
##                 Liberia                 Comoros              Madagascar                 Lesotho                 Burundi 
##                       2                       2                       2                       2                       2 
##                Zimbabwe                   Haiti                Botswana                   Syria                  Malawi 
##                       2                       2                       1                       2                       2 
##                   Yemen                  Rwanda                Tanzania             Afghanistan             South Sudan 
##                       2                       3                       2                       2                       2

Also the final method with indices is telling us that the best number of clusters to use is 3, therefore I will proceed with 3 clusters.

K-Means Clustering

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions

Clustering

## K-means clustering with 3 clusters of sizes 43, 83, 29
## 
## Cluster means:
##   Social_support Health_life_expectancy     Freedom Generosity Perceptions_of_corruptions
## 1     -1.2158031             -1.2846751 -0.62058401  0.1954757                 -0.1697575
## 2      0.3626721              0.3733243 -0.04100557 -0.5095995                 -0.4417800
## 3      0.7647501              0.8363831  1.03753706  1.1686656                  1.5161142
## 
## Clustering vector:
##                 Finland                 Denmark                  Norway                 Iceland             Netherlands 
##                       3                       3                       3                       3                       3 
##             Switzerland                  Sweden             New Zealand                  Canada                 Austria 
##                       3                       3                       3                       3                       3 
##               Australia              Costa Rica                  Israel              Luxembourg          United Kingdom 
##                       3                       2                       2                       3                       3 
##                 Ireland                 Germany                 Belgium           United States          Czech Republic 
##                       3                       3                       3                       3                       2 
##    United Arab Emirates                   Malta                  Mexico                  France                  Taiwan 
##                       3                       3                       2                       2                       2 
##                   Chile               Guatemala            Saudi Arabia                   Qatar                   Spain 
##                       2                       2                       2                       3                       2 
##                  Panama                  Brazil                 Uruguay               Singapore             El Salvador 
##                       2                       2                       2                       3                       2 
##                   Italy                 Bahrain                Slovakia       Trinidad & Tobago                  Poland 
##                       2                       3                       2                       2                       2 
##              Uzbekistan               Lithuania                Colombia                Slovenia               Nicaragua 
##                       3                       2                       2                       2                       2 
##                  Kosovo               Argentina                 Romania                  Cyprus                 Ecuador 
##                       2                       2                       2                       2                       2 
##                  Kuwait                Thailand                  Latvia             South Korea                 Estonia 
##                       2                       3                       2                       2                       2 
##                 Jamaica               Mauritius                   Japan                Honduras              Kazakhstan 
##                       2                       2                       2                       2                       2 
##                 Bolivia                 Hungary                Paraguay         Northern Cyprus                    Peru 
##                       2                       2                       2                       2                       2 
##                Portugal                Pakistan                  Russia             Philippines                  Serbia 
##                       2                       1                       2                       2                       2 
##                 Moldova                   Libya              Montenegro              Tajikistan                 Croatia 
##                       2                       2                       2                       2                       2 
##               Hong Kong      Dominican Republic  Bosnia and Herzegovina                  Turkey                Malaysia 
##                       3                       2                       2                       2                       2 
##                 Belarus                  Greece                Mongolia         North Macedonia                 Nigeria 
##                       2                       2                       2                       2                       1 
##              Kyrgyzstan            Turkmenistan                 Algeria                 Morocco              Azerbaijan 
##                       2                       2                       2                       2                       2 
##                 Lebanon               Indonesia                   China                 Vietnam                  Bhutan 
##                       2                       3                       2                       2                       3 
##                Cameroon                Bulgaria                   Ghana             Ivory Coast                   Nepal 
##                       1                       2                       1                       1                       2 
##                  Jordan                   Benin     Congo (Brazzaville)                   Gabon                    Laos 
##                       2                       1                       1                       2                       1 
##            South Africa                 Albania               Venezuela                Cambodia Palestinian Territories 
##                       2                       2                       2                       2                       2 
##                 Senegal                 Somalia                 Namibia                   Niger            Burkina Faso 
##                       1                       1                       2                       1                       1 
##                 Armenia                    Iran                  Guinea                 Georgia                  Gambia 
##                       2                       1                       1                       1                       1 
##                   Kenya              Mauritania              Mozambique                 Tunisia              Bangladesh 
##                       1                       1                       1                       2                       2 
##                    Iraq        Congo (Kinshasa)                    Mali            Sierra Leone               Sri Lanka 
##                       1                       1                       1                       1                       2 
##                 Myanmar                    Chad                 Ukraine                Ethiopia               Swaziland 
##                       3                       1                       2                       1                       1 
##                  Uganda                   Egypt                  Zambia                    Togo                   India 
##                       1                       2                       1                       1                       1 
##                 Liberia                 Comoros              Madagascar                 Lesotho                 Burundi 
##                       1                       1                       1                       1                       1 
##                Zimbabwe                   Haiti                Botswana                   Syria                  Malawi 
##                       1                       1                       2                       1                       1 
##                   Yemen                  Rwanda                Tanzania             Afghanistan             South Sudan 
##                       1                       3                       1                       1                       1 
## 
## Within cluster sum of squares by cluster:
## [1] 117.04777 172.38772  91.48354
##  (between_SS / total_SS =  50.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
## [8] "iter"         "ifault"

Here I performed K-Means clustering. Biggest ratio of final leaders is 50.5%, which explains the total variability of clustering the 5 variables into 3 groups, and the other 49.5% is not explained.

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

With the help of Principal Component Analysis around 71.9% (26.2 + 45.7) of information is showed when combining the 5 variables into 2 dimensions. I decided to remove countries “Myanmar”, “Rwanda”, “Indonesia” and “Bhutan”, since they are quite far away from the other units in this cluster.

mydata <- mydata %>%
  filter(!Country %in% c("Myanmar", "Rwanda", "Indonesia", "Bhutan"))

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions")]))

rownames(mydata_clu_std) <- mydata$Country

After removing the four countries our sample size is now 151.

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions

Clustering

## K-means clustering with 3 clusters of sizes 43, 86, 22
## 
## Cluster means:
##   Social_support Health_life_expectancy     Freedom Generosity Perceptions_of_corruptions
## 1     -1.2236409             -1.2860540 -0.59729541  0.2860715                 -0.1502989
## 2      0.3626510              0.3622542  0.01110329 -0.4278306                 -0.4222057
## 3      0.9740258              1.0975664  1.12403726  1.1132890                  1.9442067
## 
## Clustering vector:
##                 Finland                 Denmark                  Norway                 Iceland             Netherlands 
##                       3                       3                       3                       3                       3 
##             Switzerland                  Sweden             New Zealand                  Canada                 Austria 
##                       3                       3                       3                       3                       3 
##               Australia              Costa Rica                  Israel              Luxembourg          United Kingdom 
##                       3                       2                       2                       3                       3 
##                 Ireland                 Germany                 Belgium           United States          Czech Republic 
##                       3                       3                       3                       2                       2 
##    United Arab Emirates                   Malta                  Mexico                  France                  Taiwan 
##                       3                       3                       2                       2                       2 
##                   Chile               Guatemala            Saudi Arabia                   Qatar                   Spain 
##                       2                       2                       2                       3                       2 
##                  Panama                  Brazil                 Uruguay               Singapore             El Salvador 
##                       2                       2                       2                       3                       2 
##                   Italy                 Bahrain                Slovakia       Trinidad & Tobago                  Poland 
##                       2                       2                       2                       2                       2 
##              Uzbekistan               Lithuania                Colombia                Slovenia               Nicaragua 
##                       3                       2                       2                       2                       2 
##                  Kosovo               Argentina                 Romania                  Cyprus                 Ecuador 
##                       2                       2                       2                       2                       2 
##                  Kuwait                Thailand                  Latvia             South Korea                 Estonia 
##                       2                       2                       2                       2                       2 
##                 Jamaica               Mauritius                   Japan                Honduras              Kazakhstan 
##                       2                       2                       2                       2                       2 
##                 Bolivia                 Hungary                Paraguay         Northern Cyprus                    Peru 
##                       2                       2                       2                       2                       2 
##                Portugal                Pakistan                  Russia             Philippines                  Serbia 
##                       2                       1                       2                       2                       2 
##                 Moldova                   Libya              Montenegro              Tajikistan                 Croatia 
##                       2                       2                       2                       2                       2 
##               Hong Kong      Dominican Republic  Bosnia and Herzegovina                  Turkey                Malaysia 
##                       3                       2                       2                       2                       2 
##                 Belarus                  Greece                Mongolia         North Macedonia                 Nigeria 
##                       2                       2                       2                       2                       1 
##              Kyrgyzstan            Turkmenistan                 Algeria                 Morocco              Azerbaijan 
##                       2                       2                       2                       2                       2 
##                 Lebanon                   China                 Vietnam                Cameroon                Bulgaria 
##                       2                       2                       2                       1                       2 
##                   Ghana             Ivory Coast                   Nepal                  Jordan                   Benin 
##                       1                       1                       2                       2                       1 
##     Congo (Brazzaville)                   Gabon                    Laos            South Africa                 Albania 
##                       1                       2                       1                       2                       2 
##               Venezuela                Cambodia Palestinian Territories                 Senegal                 Somalia 
##                       2                       2                       2                       1                       1 
##                 Namibia                   Niger            Burkina Faso                 Armenia                    Iran 
##                       2                       1                       1                       2                       1 
##                  Guinea                 Georgia                  Gambia                   Kenya              Mauritania 
##                       1                       1                       1                       1                       1 
##              Mozambique                 Tunisia              Bangladesh                    Iraq        Congo (Kinshasa) 
##                       1                       2                       2                       1                       1 
##                    Mali            Sierra Leone               Sri Lanka                    Chad                 Ukraine 
##                       1                       1                       2                       1                       2 
##                Ethiopia               Swaziland                  Uganda                   Egypt                  Zambia 
##                       1                       1                       1                       2                       1 
##                    Togo                   India                 Liberia                 Comoros              Madagascar 
##                       1                       1                       1                       1                       1 
##                 Lesotho                 Burundi                Zimbabwe                   Haiti                Botswana 
##                       1                       1                       1                       1                       2 
##                   Syria                  Malawi                   Yemen                Tanzania             Afghanistan 
##                       1                       1                       1                       1                       1 
##             South Sudan 
##                       1 
## 
## Within cluster sum of squares by cluster:
## [1] 122.39566 196.70505  36.29094
##  (between_SS / total_SS =  52.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
## [8] "iter"         "ifault"

Here I performed K-Means clustering again, without the removed countries. As we can see the ratio between SS between and SS total is now bigger at 52.6% compared to 50.5% we had before we removed some countries. This means that the biggest ratio of final leaders is now 52.6%, which explains the total variability of clustering the 5 variables into 3 groups, and the other 47.4% is not explained. The final sizes of our clusters are 43, 86 and 22. We can say that the variability is highest in cluster 2 (196.71), as the SS within is the largest.

Sum of SS between is 122.40 + 196.71 + 36.30 = 355.41. Therefore the SS between is 372.21. The brings us to the conclusion that SS total = 355.41 + 372.21 = 727.62.

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Now the clusters are more homogeneous.

Group profiles

Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups

##   Social_support Health_life_expectancy     Freedom Generosity Perceptions_of_corruptions
## 1     -1.2236409             -1.2860540 -0.59729541  0.2860715                 -0.1502989
## 2      0.3626510              0.3622542  0.01110329 -0.4278306                 -0.4222057
## 3      0.9740258              1.0975664  1.12403726  1.1132890                  1.9442067

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

Figure <- pivot_longer(Figure, cols = c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions"), 
                       labels = c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

I will use this profiles of groups at the end for my conclusion.

mydata$Group <- Clustering$cluster #Assignings units to groups

Appropriateness of cluster variables used

#Checking if clustering variables successfully differentiate between groups

fit <- aov(cbind(Social_support, Health_life_expectancy, Freedom, Generosity, Perceptions_of_corruptions) ~ as.factor(Group), 
           data = mydata)

summary(fit)

##  Response Social_support :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 7.8121  3.9061  133.73 < 2.2e-16 ***
## Residuals        148 4.3228  0.0292                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Health_life_expectancy :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 6.2686 3.13430  196.12 < 2.2e-16 ***
## Residuals        148 2.3653 0.01598                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Freedom :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.89058 0.44529  29.881 1.257e-11 ***
## Residuals        148 2.20547 0.01490                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Generosity :
##                   Df  Sum Sq  Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.34712 0.173559  33.275 1.165e-12 ***
## Residuals        148 0.77196 0.005216                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Perceptions_of_corruptions :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.84549 0.42275  145.63 < 2.2e-16 ***
## Residuals        148 0.42963 0.00290                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response for Social support:

H0: μ(Social_support, G1) = μ(Social_support, G2) = μ(Social_support, G3)
H1: At least one μ(Social_support, j) is different.

We can reject H0 at p < 0.001. We can reject H0 for all cluster variables at p < 0.001. Therefore we can assume that the groups are statistically different in the mean values of the cluster variables.

Criterion Validity

Next step is to check the criterion validity of the classification with variables that were not used in the clustering process. For this I chose GDP per capita.

We have to check two assumptions: 1. normal distribution of the GDP per capita in each group. 2. homogeneity of variances

aggregate(mydata$GDP_per_capita, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1         x
## 1       1 0.4444651
## 2       2 1.0302093
## 3       3 1.3933636

On average the 3 group has the biggest GDP per capita (descriptive approach).

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(mydata$GDP_per_capita, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2  3.4586 0.03404 *
##       148                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: σ2 (GDP_per_capita, G1) = σ2 (GDP_per_capita, G2) = σ2 (GDP_per_capita, G3)
H1: At least one σ2 (GDP_per_capita, j) is different.

We can reject H0 at p = 0.035, meaning that at least one variance is different from the rest. With this we checked the homogeneity of variances, therefore we can proceed with Welch F-test (if normality is not violated).

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(GDP_per_capita)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable       statistic        p
##   <fct>                     <chr>              <dbl>    <dbl>
## 1 1                         GDP_per_capita     0.963 0.181   
## 2 2                         GDP_per_capita     0.979 0.184   
## 3 3                         GDP_per_capita     0.754 0.000105

H0: GDP per capita is normally distributed in G1.
H1: GDP per capita is not normally distributed in G1.

We cannot reject H0.

H0: GDP per capita is normally distributed in G2.
H1: GDP per capita is not normally distributed in G2.

We cannot reject H0.

H0: GDP per capita is normally distributed in G3.
H1: GDP per capita is not normally distributed in G3.

We reject H0 at p = 0.001.

Because we rejected one, we have to use the Kruskal-Wallis Rank Sum Test.

kruskal.test(GDP_per_capita ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  GDP_per_capita by as.factor(Group)
## Kruskal-Wallis chi-squared = 97.25, df = 2, p-value < 2.2e-16

H0: Location distribution of GDP per capita are the same for all groups.
H1: Location distribution of GDP per capita are not the same for all groups.

We reject H0 at p < 0.001, therefore at least one location distribution is different from the other, meaning I trust that my clusters are meaningful. All in all, my result is validated.

Conclusion

Based on 5 standardized variables (social support, health life expectancy, freedom, generosity and perceptions of corruptions), we divided 151 countries into 3 groups using hierarchical and K-means clustering.

Group 1 (43/151 = 28%) consisting mainly of developing countries with low standards of living (for example Iraq, Nigeria, Haiti). These countries are the worst performing and below average in social support, health life expectancy and freedom. They are better than group 2 in generosity, in which they perform a bit above average, on the other hand they are better than group 2 in perceptions of corruption, however they also perform a bit below average. On average they have the lowest GDP per capita.

Group 2 (86/151 = 57%) is the largest group, consisisting mainly of emerging markets or tranisition economies (for example Costa Rica, Taiwan, Vietnam). These countries are a bit above average in social support and health life expectancy. They are right on average with freedom, however a bit below average in generosity and perceptions of corruptions, which also makes them the worst in these two cluster variables among all countries. On average they have the second biggest GDP per capita.

Group 3 (22/151 = 15%) is the smallest group, which consists mainly of developed countries with high standards of living (for example Finland, Switzerland, Singapore). These countries are the best in all cluster variables; social support, health life expectancy, freedom, generosity and they have the least perceptions of corruptions. On average they have the biggest GDP per capita.

Homework 2

Manja Modic

2025-01-22