knitr::opts_chunk$set(echo = TRUE)
options(width = 120)
#install.packages(ggplot2)
library(ggplot2)
#install.packages("ggfortify")
library(ggfortify)
#install.packages("ranger")
library(ranger)
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("Hmisc")
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
#install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#install.packages("cluster")
library(cluster)
#install.packages("magrittr")
library(magrittr)
#install.packages("NbClust")
library("NbClust")
h1.title {color: salmon;font-family: "Monaco", monospace; font-size: 48px;}
.author {color: salmon ;font-family: "Monaco", monospace; font-size: 24px;}
.date {color: salmon ;font-family: "Monaco", monospace; font-size: 18px;}
columns {display: flex;}
h1 {color: salmon;font-family: "Monaco", monospace; font-size: 36px;}
columns {display: flex;}
h3 {color: coral;font-family: "Monaco", monospace; font-size: 24px;}
mydata <- read.table("~/Documents/Šola/IMB/2. semester/Multivariate analysis/HW2/2019.csv", header=TRUE, sep=",", dec="." )
colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Country"
colnames(mydata) [3] <- "Happiness_Score"
colnames(mydata) [4] <- "GDP_per_capita"
colnames(mydata) [5] <- "Social_support"
colnames(mydata) [6] <- "Health_life_expectancy"
colnames(mydata) [7] <- "Freedom"
colnames(mydata) [8] <- "Generosity"
colnames(mydata) [9] <- "Perceptions_of_corruptions"
head(mydata)
## ID Country Happiness_Score GDP_per_capita Social_support Health_life_expectancy Freedom Generosity
## 1 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153
## 2 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252
## 3 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271
## 4 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354
## 5 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322
## 6 6 Switzerland 7.480 1.452 1.526 1.052 0.572 0.263
## Perceptions_of_corruptions
## 1 0.393
## 2 0.410
## 3 0.341
## 4 0.118
## 5 0.298
## 6 0.343
Unit of observation: individual country
Sample size: 156 countries
Definition of variables:
ID: A unique identifier for each country.
Country: Name of the country.
Happiness_score: A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness?”.
GDP_per_capita: The extent to which GDP contributes to the calculation of the Happiness Score.
Social_support: The extent to which Social support contributes to the calculation of the Happiness Score.
Health_life_expectancy: The extent to which Life expectancy contributed to the calculation of the Happiness Score.
Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.
Generosity: The extent to which Willingness to donate to others contributes to Happiness Score.
Perceptions_of_corruptions: The extent to which Perception of Corruption contributes to Happiness Score.
Source: Vikarna. (2022, April 4). World happiness report. Kaggle. https://www.kaggle.com/code/vikarna/world-happiness-report/input?select=2019.csv.
anyNA(mydata)
## [1] FALSE
There is no missing data in my data frame.
summary(mydata[ , c(-1,-2)])
## Happiness_Score GDP_per_capita Social_support Health_life_expectancy Freedom Generosity
## Min. :2.853 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:4.545 1st Qu.:0.6028 1st Qu.:1.056 1st Qu.:0.5477 1st Qu.:0.3080 1st Qu.:0.1087
## Median :5.380 Median :0.9600 Median :1.272 Median :0.7890 Median :0.4170 Median :0.1775
## Mean :5.407 Mean :0.9051 Mean :1.209 Mean :0.7252 Mean :0.3926 Mean :0.1848
## 3rd Qu.:6.184 3rd Qu.:1.2325 3rd Qu.:1.452 3rd Qu.:0.8818 3rd Qu.:0.5072 3rd Qu.:0.2482
## Max. :7.769 Max. :1.6840 Max. :1.624 Max. :1.1410 Max. :0.6310 Max. :0.5660
## Perceptions_of_corruptions
## Min. :0.0000
## 1st Qu.:0.0470
## Median :0.0855
## Mean :0.1106
## 3rd Qu.:0.1412
## Max. :0.4530
The average number of happiness score in the data frame is 5.407.
The median number of the happines score is 5.380, meaning that half of countries in the data frame had happiness score lower than 2.445, while the other half had more.
The minimum happiness score in the data frame is 2.853, meaning that the country which had the lowest score it had it at 2.853.
Based on variables, such as social support, health life expectancy, freedom, generosity and perceptions of corruptions, how can we identify distinct segments?
#Saving standardized cluster variables into new data frame
mydata_clu_std <- as.data.frame(scale(mydata[c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions")]))
We want all of the variables to have the same effect, not that the one with the highest variance to contribute the most, for this reason we standardize all cluster variables.
mydata$Dissimilarity <- sqrt(mydata_clu_std$Social_support^2 + mydata_clu_std$Health_life_expectancy^2 + mydata_clu_std$Freedom^2 +
mydata_clu_std$Generosity^2 + mydata_clu_std$Perceptions_of_corruptions^2) #Finding outliers
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")]) #Finding units with highest value of dissimilarity
## ID Dissimilarity
## 155 155 5.017007
## 34 34 4.348179
## 149 149 4.309707
## 131 131 4.217867
## 147 147 4.113304
## 154 154 4.001882
I see ID155 as a potential outlier, as there is a big jump in disimilarity numbers between units. For this reason I will remove this unit.
mydata <- mydata %>%
filter(!ID %in% c(155)) #Removing ID155 from original data frame
mydata$ID <- seq(1, nrow(mydata)) #Numbering the data again
mydata_clu_std <- as.data.frame(scale(mydata[c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions")])) #Standardizing the data again
rownames(mydata_clu_std) <- mydata$Country #Showing the units with country names
After removing one country, the sample size is now 155.
#Finding Eudlidean distances, based on 5 Cluster variables, then saving them into object Distances
Distances <- get_dist(mydata_clu_std,
method = "euclidian")
fviz_dist(Distances, #Showing matrix of distances
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
We can see on the matrix of distances that some groups of clusters are forming, I see either 5 or 3 groups.
get_clust_tendency(mydata_clu_std, #Hopkins statistics
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7320906
##
## $plot
## NULL
My data is clusterable as it is above 0.5. If it would be more close to 1, it would be even more appropriate. However the threshold is 0.5, therefore my data is appropriate. Now the next question is how many clusters to use. I will check this with Hierarhical clustering (dendrogram) and K-Means clustering (Elbow method, Silhouette analysis and with the help of indices).
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 155
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Based on the dendrogram, I would choose 3 clusters, as there is the biggest jump in vertical line.
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
With the elbow method the slope changes most evidently at 3 and 5, therefore both are possible options.
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
The higest value of the Silhouette analysis is at 3.
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 14 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB Silhouette
## 2 0.4144 63.6298 65.4352 -2.5065 182.2619 27534595130 12463.334 543.8310 3.4388 1.4159 0.3792 1.5106 0.2719
## 3 4.4201 77.6284 27.2111 0.3594 400.9012 15116660268 6157.016 380.9190 6.4907 2.0214 0.3291 1.2059 0.3280
## 4 0.8275 69.6263 24.9101 1.0464 516.6175 12738195366 4951.317 323.0810 8.2318 2.3833 0.3134 1.3163 0.2761
## 5 1.6082 66.6177 18.1976 2.3924 617.9681 10350316732 3015.686 277.3304 9.9180 2.7765 0.3341 1.2358 0.2814
## 6 3.4600 62.9765 11.4451 2.8928 707.8913 8343734049 2420.214 247.3255 11.9637 3.1133 0.3658 1.2588 0.2827
## 7 0.5114 58.0270 12.0065 2.5947 747.2997 8807155450 1962.561 229.6829 12.4286 3.3524 0.3590 1.3261 0.2681
## 8 0.6875 55.1127 12.8804 2.7061 797.6182 8314422951 1801.445 212.4481 13.4753 3.6244 0.3365 1.3804 0.2589
## 9 4.9293 53.6913 7.4541 3.1697 864.4944 6835288033 1594.328 195.3328 15.2076 3.9420 0.3551 1.3328 0.2608
## 10 0.6076 50.6413 7.5632 2.8851 902.6896 6595572550 1474.097 185.8444 16.2635 4.1433 0.3531 1.3143 0.2347
## Duda Pseudot2 Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 0.7836 30.3822 0.8552 0.3418 271.9155 0.4014 -0.1897 0.6617 0.1010 0.0022 2.0458 1.7333 1.2820
## 3 0.6762 25.8563 1.4803 0.4068 126.9730 0.6008 1.2330 0.8815 0.1133 0.0035 1.7378 1.4531 1.1676
## 4 1.6155 -19.8116 -1.1552 0.3793 80.7702 0.5351 0.8433 1.4868 0.1079 0.0037 1.8536 1.3138 0.8145
## 5 2.2216 -31.8922 -1.6616 0.3574 55.4661 0.4975 0.1048 2.0003 0.1006 0.0037 1.7776 1.2166 0.4353
## 6 1.2917 -7.2265 -0.6816 0.3359 41.2209 0.5105 0.6072 2.1345 0.1174 0.0043 1.8505 1.1633 0.3640
## 7 1.4571 -11.2931 -0.9441 0.3165 32.8118 0.4884 0.4057 2.5073 0.1314 0.0045 2.1046 1.1253 0.3517
## 8 1.3334 -6.2515 -0.7366 0.3008 26.5560 0.4697 0.2215 2.9409 0.1314 0.0047 2.1801 1.0824 0.3373
## 9 1.2804 -8.5396 -0.6625 0.2879 21.7036 0.4645 1.4268 3.1784 0.1459 0.0047 2.1742 1.0466 0.3318
## 10 0.9823 0.4673 0.0534 0.2754 18.5844 0.4548 0.4866 3.3398 0.1316 0.0049 2.0619 1.0230 0.3258
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.6816 51.3852 0.5112
## 3 0.6693 26.6776 0.1951
## 4 0.5502 42.5134 1.0000
## 5 0.5344 50.5235 1.0000
## 6 0.5287 28.5299 1.0000
## 7 0.5162 33.7471 1.0000
## 8 0.4360 32.3364 1.0000
## 9 0.5399 33.2294 1.0000
## 10 0.4684 29.5064 0.9982
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 9.0000 3.0000 3.0000 9.0000 3.0000 3 3.000 3.0000 3.0000 3.0000 4.0000 3.0000
## Value_Index 4.9293 77.6284 38.2242 3.1697 218.6393 10039469960 6306.318 105.0739 3.0518 -0.2437 0.3134 1.2059
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain Dunn Hubert SDindex
## Number_clusters 3.000 2.0000 2.0000 2.0000 3.0000 3.0000 3.0000 1 2.0000 9.0000 0 3.0000
## Value_Index 0.328 0.7836 30.3822 0.8552 0.4068 144.9425 0.6008 NA 0.6617 0.1459 0 1.7378
## Dindex SDbw
## Number_clusters 0 10.0000
## Value_Index 0 0.3258
##
## $Best.partition
## Finland Denmark Norway Iceland Netherlands
## 3 3 3 3 3
## Switzerland Sweden New Zealand Canada Austria
## 3 3 3 3 3
## Australia Costa Rica Israel Luxembourg United Kingdom
## 3 1 1 3 3
## Ireland Germany Belgium United States Czech Republic
## 3 3 3 3 1
## United Arab Emirates Malta Mexico France Taiwan
## 3 3 1 1 1
## Chile Guatemala Saudi Arabia Qatar Spain
## 1 1 1 3 1
## Panama Brazil Uruguay Singapore El Salvador
## 1 1 1 3 1
## Italy Bahrain Slovakia Trinidad & Tobago Poland
## 1 3 1 1 1
## Uzbekistan Lithuania Colombia Slovenia Nicaragua
## 3 1 1 1 1
## Kosovo Argentina Romania Cyprus Ecuador
## 1 1 1 1 1
## Kuwait Thailand Latvia South Korea Estonia
## 1 3 1 1 1
## Jamaica Mauritius Japan Honduras Kazakhstan
## 1 1 1 1 1
## Bolivia Hungary Paraguay Northern Cyprus Peru
## 1 1 1 1 1
## Portugal Pakistan Russia Philippines Serbia
## 1 2 1 1 1
## Moldova Libya Montenegro Tajikistan Croatia
## 1 1 1 1 1
## Hong Kong Dominican Republic Bosnia and Herzegovina Turkey Malaysia
## 3 1 1 1 1
## Belarus Greece Mongolia North Macedonia Nigeria
## 1 1 1 1 2
## Kyrgyzstan Turkmenistan Algeria Morocco Azerbaijan
## 1 1 1 1 1
## Lebanon Indonesia China Vietnam Bhutan
## 1 3 1 1 3
## Cameroon Bulgaria Ghana Ivory Coast Nepal
## 2 1 2 2 1
## Jordan Benin Congo (Brazzaville) Gabon Laos
## 1 2 2 1 2
## South Africa Albania Venezuela Cambodia Palestinian Territories
## 1 1 1 1 1
## Senegal Somalia Namibia Niger Burkina Faso
## 2 2 1 2 2
## Armenia Iran Guinea Georgia Gambia
## 1 2 2 2 2
## Kenya Mauritania Mozambique Tunisia Bangladesh
## 2 2 2 1 1
## Iraq Congo (Kinshasa) Mali Sierra Leone Sri Lanka
## 2 2 2 2 1
## Myanmar Chad Ukraine Ethiopia Swaziland
## 3 2 1 2 2
## Uganda Egypt Zambia Togo India
## 2 1 2 2 2
## Liberia Comoros Madagascar Lesotho Burundi
## 2 2 2 2 2
## Zimbabwe Haiti Botswana Syria Malawi
## 2 2 1 2 2
## Yemen Rwanda Tanzania Afghanistan South Sudan
## 2 3 2 2 2
Also the final method with indices is telling us that the best number of clusters to use is 3, therefore I will proceed with 3 clusters.
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 43, 83, 29
##
## Cluster means:
## Social_support Health_life_expectancy Freedom Generosity Perceptions_of_corruptions
## 1 -1.2158031 -1.2846751 -0.62058401 0.1954757 -0.1697575
## 2 0.3626721 0.3733243 -0.04100557 -0.5095995 -0.4417800
## 3 0.7647501 0.8363831 1.03753706 1.1686656 1.5161142
##
## Clustering vector:
## Finland Denmark Norway Iceland Netherlands
## 3 3 3 3 3
## Switzerland Sweden New Zealand Canada Austria
## 3 3 3 3 3
## Australia Costa Rica Israel Luxembourg United Kingdom
## 3 2 2 3 3
## Ireland Germany Belgium United States Czech Republic
## 3 3 3 3 2
## United Arab Emirates Malta Mexico France Taiwan
## 3 3 2 2 2
## Chile Guatemala Saudi Arabia Qatar Spain
## 2 2 2 3 2
## Panama Brazil Uruguay Singapore El Salvador
## 2 2 2 3 2
## Italy Bahrain Slovakia Trinidad & Tobago Poland
## 2 3 2 2 2
## Uzbekistan Lithuania Colombia Slovenia Nicaragua
## 3 2 2 2 2
## Kosovo Argentina Romania Cyprus Ecuador
## 2 2 2 2 2
## Kuwait Thailand Latvia South Korea Estonia
## 2 3 2 2 2
## Jamaica Mauritius Japan Honduras Kazakhstan
## 2 2 2 2 2
## Bolivia Hungary Paraguay Northern Cyprus Peru
## 2 2 2 2 2
## Portugal Pakistan Russia Philippines Serbia
## 2 1 2 2 2
## Moldova Libya Montenegro Tajikistan Croatia
## 2 2 2 2 2
## Hong Kong Dominican Republic Bosnia and Herzegovina Turkey Malaysia
## 3 2 2 2 2
## Belarus Greece Mongolia North Macedonia Nigeria
## 2 2 2 2 1
## Kyrgyzstan Turkmenistan Algeria Morocco Azerbaijan
## 2 2 2 2 2
## Lebanon Indonesia China Vietnam Bhutan
## 2 3 2 2 3
## Cameroon Bulgaria Ghana Ivory Coast Nepal
## 1 2 1 1 2
## Jordan Benin Congo (Brazzaville) Gabon Laos
## 2 1 1 2 1
## South Africa Albania Venezuela Cambodia Palestinian Territories
## 2 2 2 2 2
## Senegal Somalia Namibia Niger Burkina Faso
## 1 1 2 1 1
## Armenia Iran Guinea Georgia Gambia
## 2 1 1 1 1
## Kenya Mauritania Mozambique Tunisia Bangladesh
## 1 1 1 2 2
## Iraq Congo (Kinshasa) Mali Sierra Leone Sri Lanka
## 1 1 1 1 2
## Myanmar Chad Ukraine Ethiopia Swaziland
## 3 1 2 1 1
## Uganda Egypt Zambia Togo India
## 1 2 1 1 1
## Liberia Comoros Madagascar Lesotho Burundi
## 1 1 1 1 1
## Zimbabwe Haiti Botswana Syria Malawi
## 1 1 2 1 1
## Yemen Rwanda Tanzania Afghanistan South Sudan
## 1 3 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 117.04777 172.38772 91.48354
## (between_SS / total_SS = 50.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
## [8] "iter" "ifault"
Here I performed K-Means clustering. Biggest ratio of final leaders is 50.5%, which explains the total variability of clustering the 5 variables into 3 groups, and the other 49.5% is not explained.
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
With the help of Principal Component Analysis around 71.9% (26.2 + 45.7) of information is showed when combining the 5 variables into 2 dimensions. I decided to remove countries “Myanmar”, “Rwanda”, “Indonesia” and “Bhutan”, since they are quite far away from the other units in this cluster.
mydata <- mydata %>%
filter(!Country %in% c("Myanmar", "Rwanda", "Indonesia", "Bhutan"))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions")]))
rownames(mydata_clu_std) <- mydata$Country
After removing the four countries our sample size is now 151.
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 43, 86, 22
##
## Cluster means:
## Social_support Health_life_expectancy Freedom Generosity Perceptions_of_corruptions
## 1 -1.2236409 -1.2860540 -0.59729541 0.2860715 -0.1502989
## 2 0.3626510 0.3622542 0.01110329 -0.4278306 -0.4222057
## 3 0.9740258 1.0975664 1.12403726 1.1132890 1.9442067
##
## Clustering vector:
## Finland Denmark Norway Iceland Netherlands
## 3 3 3 3 3
## Switzerland Sweden New Zealand Canada Austria
## 3 3 3 3 3
## Australia Costa Rica Israel Luxembourg United Kingdom
## 3 2 2 3 3
## Ireland Germany Belgium United States Czech Republic
## 3 3 3 2 2
## United Arab Emirates Malta Mexico France Taiwan
## 3 3 2 2 2
## Chile Guatemala Saudi Arabia Qatar Spain
## 2 2 2 3 2
## Panama Brazil Uruguay Singapore El Salvador
## 2 2 2 3 2
## Italy Bahrain Slovakia Trinidad & Tobago Poland
## 2 2 2 2 2
## Uzbekistan Lithuania Colombia Slovenia Nicaragua
## 3 2 2 2 2
## Kosovo Argentina Romania Cyprus Ecuador
## 2 2 2 2 2
## Kuwait Thailand Latvia South Korea Estonia
## 2 2 2 2 2
## Jamaica Mauritius Japan Honduras Kazakhstan
## 2 2 2 2 2
## Bolivia Hungary Paraguay Northern Cyprus Peru
## 2 2 2 2 2
## Portugal Pakistan Russia Philippines Serbia
## 2 1 2 2 2
## Moldova Libya Montenegro Tajikistan Croatia
## 2 2 2 2 2
## Hong Kong Dominican Republic Bosnia and Herzegovina Turkey Malaysia
## 3 2 2 2 2
## Belarus Greece Mongolia North Macedonia Nigeria
## 2 2 2 2 1
## Kyrgyzstan Turkmenistan Algeria Morocco Azerbaijan
## 2 2 2 2 2
## Lebanon China Vietnam Cameroon Bulgaria
## 2 2 2 1 2
## Ghana Ivory Coast Nepal Jordan Benin
## 1 1 2 2 1
## Congo (Brazzaville) Gabon Laos South Africa Albania
## 1 2 1 2 2
## Venezuela Cambodia Palestinian Territories Senegal Somalia
## 2 2 2 1 1
## Namibia Niger Burkina Faso Armenia Iran
## 2 1 1 2 1
## Guinea Georgia Gambia Kenya Mauritania
## 1 1 1 1 1
## Mozambique Tunisia Bangladesh Iraq Congo (Kinshasa)
## 1 2 2 1 1
## Mali Sierra Leone Sri Lanka Chad Ukraine
## 1 1 2 1 2
## Ethiopia Swaziland Uganda Egypt Zambia
## 1 1 1 2 1
## Togo India Liberia Comoros Madagascar
## 1 1 1 1 1
## Lesotho Burundi Zimbabwe Haiti Botswana
## 1 1 1 1 2
## Syria Malawi Yemen Tanzania Afghanistan
## 1 1 1 1 1
## South Sudan
## 1
##
## Within cluster sum of squares by cluster:
## [1] 122.39566 196.70505 36.29094
## (between_SS / total_SS = 52.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
## [8] "iter" "ifault"
Here I performed K-Means clustering again, without the removed countries. As we can see the ratio between SS between and SS total is now bigger at 52.6% compared to 50.5% we had before we removed some countries. This means that the biggest ratio of final leaders is now 52.6%, which explains the total variability of clustering the 5 variables into 3 groups, and the other 47.4% is not explained. The final sizes of our clusters are 43, 86 and 22. We can say that the variability is highest in cluster 2 (196.71), as the SS within is the largest.
Sum of SS between is 122.40 + 196.71 + 36.30 = 355.41. Therefore the SS between is 372.21. The brings us to the conclusion that SS total = 355.41 + 372.21 = 727.62.
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Now the clusters are more homogeneous.
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Social_support Health_life_expectancy Freedom Generosity Perceptions_of_corruptions
## 1 -1.2236409 -1.2860540 -0.59729541 0.2860715 -0.1502989
## 2 0.3626510 0.3622542 0.01110329 -0.4278306 -0.4222057
## 3 0.9740258 1.0975664 1.12403726 1.1132890 1.9442067
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
Figure <- pivot_longer(Figure, cols = c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions"),
labels = c("Social_support", "Health_life_expectancy", "Freedom", "Generosity", "Perceptions_of_corruptions"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
I will use this profiles of groups at the end for my conclusion.
mydata$Group <- Clustering$cluster #Assignings units to groups
#Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(Social_support, Health_life_expectancy, Freedom, Generosity, Perceptions_of_corruptions) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Social_support :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 7.8121 3.9061 133.73 < 2.2e-16 ***
## Residuals 148 4.3228 0.0292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Health_life_expectancy :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 6.2686 3.13430 196.12 < 2.2e-16 ***
## Residuals 148 2.3653 0.01598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Freedom :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.89058 0.44529 29.881 1.257e-11 ***
## Residuals 148 2.20547 0.01490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Generosity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.34712 0.173559 33.275 1.165e-12 ***
## Residuals 148 0.77196 0.005216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Perceptions_of_corruptions :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.84549 0.42275 145.63 < 2.2e-16 ***
## Residuals 148 0.42963 0.00290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response for Social support:
H0: μ(Social_support, G1) = μ(Social_support, G2) = μ(Social_support, G3)
H1: At least one μ(Social_support, j) is different.
We can reject H0 at p < 0.001. We can reject H0 for all cluster variables at p < 0.001. Therefore we can assume that the groups are statistically different in the mean values of the cluster variables.
Next step is to check the criterion validity of the classification with variables that were not used in the clustering process. For this I chose GDP per capita.
We have to check two assumptions: 1. normal distribution of the GDP per capita in each group. 2. homogeneity of variances
aggregate(mydata$GDP_per_capita,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 0.4444651
## 2 2 1.0302093
## 3 3 1.3933636
On average the 3 group has the biggest GDP per capita (descriptive approach).
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$GDP_per_capita, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 3.4586 0.03404 *
## 148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: σ2 (GDP_per_capita, G1) = σ2 (GDP_per_capita, G2) = σ2 (GDP_per_capita, G3)
H1: At least one σ2 (GDP_per_capita, j) is different.
We can reject H0 at p = 0.035, meaning that at least one variance is different from the rest. With this we checked the homogeneity of variances, therefore we can proceed with Welch F-test (if normality is not violated).
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(GDP_per_capita)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 GDP_per_capita 0.963 0.181
## 2 2 GDP_per_capita 0.979 0.184
## 3 3 GDP_per_capita 0.754 0.000105
H0: GDP per capita is normally distributed in G1.
H1: GDP per capita is not normally distributed in G1.
We cannot reject H0.
H0: GDP per capita is normally distributed in G2.
H1: GDP per capita is not normally distributed in G2.
We cannot reject H0.
H0: GDP per capita is normally distributed in G3.
H1: GDP per capita is not normally distributed in G3.
We reject H0 at p = 0.001.
Because we rejected one, we have to use the Kruskal-Wallis Rank Sum Test.
kruskal.test(GDP_per_capita ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: GDP_per_capita by as.factor(Group)
## Kruskal-Wallis chi-squared = 97.25, df = 2, p-value < 2.2e-16
H0: Location distribution of GDP per capita are the same for all groups.
H1: Location distribution of GDP per capita are not the same for all groups.
We reject H0 at p < 0.001, therefore at least one location distribution is different from the other, meaning I trust that my clusters are meaningful. All in all, my result is validated.
Based on 5 standardized variables (social support, health life expectancy, freedom, generosity and perceptions of corruptions), we divided 151 countries into 3 groups using hierarchical and K-means clustering.
Group 1 (43/151 = 28%) consisting mainly of developing countries with low standards of living (for example Iraq, Nigeria, Haiti). These countries are the worst performing and below average in social support, health life expectancy and freedom. They are better than group 2 in generosity, in which they perform a bit above average, on the other hand they are better than group 2 in perceptions of corruption, however they also perform a bit below average. On average they have the lowest GDP per capita.
Group 2 (86/151 = 57%) is the largest group, consisisting mainly of emerging markets or tranisition economies (for example Costa Rica, Taiwan, Vietnam). These countries are a bit above average in social support and health life expectancy. They are right on average with freedom, however a bit below average in generosity and perceptions of corruptions, which also makes them the worst in these two cluster variables among all countries. On average they have the second biggest GDP per capita.
Group 3 (22/151 = 15%) is the smallest group, which consists mainly of developed countries with high standards of living (for example Finland, Switzerland, Singapore). These countries are the best in all cluster variables; social support, health life expectancy, freedom, generosity and they have the least perceptions of corruptions. On average they have the biggest GDP per capita.