country <- read.table("C:/Users/Tinkara/Desktop/IMB 2024 R/Bootcamp/Quality of life in a country comparison.csv", header=TRUE, sep=",", dec=".")
### Removing not needed variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
country <- country %>% select (3:10)

head(country)
##       Country Stability.15.. Rights.20.. Health.15.. Safety.10.. Climate.15..
## 1   Australia             86          92          87         100           90
## 2  San Marino             81          82          91          87           62
## 3   Hong Kong             77          85         100          97           49
## 4 Switzerland             91          99          93          99           34
## 5       Macao             90          71         100          93           48
## 6       Malta             86          72          91          92           75
##   Costs.15.. Popularity.10..
## 1         23              41
## 2         61              47
## 3         61              41
## 4         44              46
## 5         61              44
## 6         41              44

2 Data description

Unit of observation: one country

Initial sample size: 137 countries

Definition of variables and units of measurement:

Source of data: https://www.kaggle.com/datasets/shivamsingh0194/quality-of-life-in-a-country-comparison (explanation fo data taken from original cource here: https://www.worlddata.info/quality-of-life.php)

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
round(stat.desc(country[ , c (2, 4, 5, 6, 7, 8) ]), 2)
##              Stability.15.. Health.15.. Safety.10.. Climate.15.. Costs.15..
## nbr.val              137.00      137.00      137.00       137.00     137.00
## nbr.null               0.00        1.00        0.00         0.00       0.00
## nbr.na                 0.00        0.00        0.00         0.00       0.00
## min                    8.00        0.00        4.00         2.00      23.00
## max                   93.00      100.00      100.00        95.00      76.00
## range                 85.00      100.00       96.00        93.00      53.00
## sum                 7888.00     7963.00    10519.00      7293.00    7171.00
## median                57.00       63.00       85.00        53.00      54.00
## mean                  57.58       58.12       76.78        53.23      52.34
## SE.mean                1.65        2.23        1.86         1.71       0.99
## CI.mean.0.95           3.27        4.41        3.68         3.37       1.96
## var                  373.60      680.86      474.86       398.43     134.86
## std.dev               19.33       26.09       21.79        19.96      11.61
## coef.var               0.34        0.45        0.28         0.37       0.22
##              Popularity.10..
## nbr.val               137.00
## nbr.null                0.00
## nbr.na                  0.00
## min                    11.00
## max                    73.00
## range                  62.00
## sum                  5203.00
## median                 37.00
## mean                   37.98
## SE.mean                 1.03
## CI.mean.0.95            2.04
## var                   145.21
## std.dev                12.05
## coef.var                0.32

The average indicator of stability of countries in our sample was 57.58.

The minimum indicator of health in countries in our sample was 0 and the maximum was 100.

50% of countries in our sample have the safety indicator up to and including 85, and 50% have a higher indicator.

country_clu_std <- as.data.frame(scale(country[c(4:8)]))

Research question

Can we cluster countries based on their quality of life?

To do this, I will cluster them by indicators for stability, rights, health, safety and climate.

country_clu_std <- as.data.frame(country[c(2:6)])
### Loooking for outliers

country$Dissimilarity <- sqrt(country$Stability.15..^2 + 
                                        country$Rights.20..^2 + 
                                        country$Health.15..^2 + 
                                        country$Safety.10..^2 + 
                                       country$Climate.15..^2)

head(country[order(-country$Dissimilarity), c("Country", "Dissimilarity")])
##        Country Dissimilarity
## 1    Australia      203.7867
## 4  Switzerland      194.1340
## 8      Bermuda      191.0340
## 7   Luxembourg      190.5256
## 11       Japan      190.2682
## 16      Norway      189.4175

Dissimilarity tells us how different each country is from the rest of the countries (higher value means more different). From this case, it is evident that the countries are quite different from one another, which makes sense, since countries from all around the world are included. Australia seems to be the most different form the others, so I will remove it to avoind issues with clustering.

### Removing Australia 

country <- country %>%
filter(!(row_number() %in% c(1)))

country_clu_std <- as.data.frame(country[c(2:6)])

# Checking outliers again
country$Dissimilarity <- sqrt(country$Stability.15..^2 + 
                                        country$Rights.20..^2 + 
                                        country$Health.15..^2 + 
                                        country$Safety.10..^2 + 
                                       country$Climate.15..^2)

head(country[order(-country$Dissimilarity), c("Country", "Dissimilarity")])
##        Country Dissimilarity
## 3  Switzerland      194.1340
## 7      Bermuda      191.0340
## 6   Luxembourg      190.5256
## 10       Japan      190.2682
## 15      Norway      189.4175
## 18 New Zealand      187.3286

Is data suitable for clustering?

library(factoextra) 
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(country_clu_std, 
                      method = "euclidian")

fviz_dist(Distances, #Showing matrix of distances
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

### Hopkins statistics

library(factoextra) 

get_clust_tendency(country_clu_std, 
                   n = nrow(country_clu_std) - 1,
                   graph = FALSE) 
## $hopkins_stat
## [1] 0.7267446
## 
## $plot
## NULL

From the graph of Euclidian distances and Hopkins statistics we can check if data can be clustered. Hopking statistic is close to 1, which confirms that data can be used for clusters. The graph also shows the possibility of clustering, but maybe the number of clusters is not so clear, so I will continue the analysis to figure out how many clusters I should make.

How many clusters?

I will check how many clusters I should make based on the Elbow method and Silhouette analysis.

library(factoextra)
library(NbClust)

fviz_nbclust(country_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

Wen using the elbow methond to determine the optimal number of clusters, we are looking for the most evident breaks. For me, this seems to be at 2 or 3, but I will check additionally with silhouette analysis.

fviz_nbclust(country_clu_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")

Silhouette methods measures how well data fits into their assigned cluster. In this case, the analysis indicates that 2 clusters will provide the best fit for my data.

I will check this also with K-method:

library(NbClust)
nc <- NbClust(country_clu_std, 
              distance = "euclidean",
              min.nc = 2, max.nc = 10,
              method = "kmeans",
              index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 10 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

Again, it is confirmed that it’s best to use 2 clusters.

Clustering

Clustering <- kmeans(country_clu_std,
centers = 2,
nstart = 25)

Clustering
## K-means clustering with 2 clusters of sizes 84, 52
## 
## Cluster means:
##   Stability.15.. Rights.20.. Health.15.. Safety.10.. Climate.15..
## 1       45.71429    33.66667    43.40476    67.19048     60.09524
## 2       76.19231    77.38462    81.34615    91.82692     41.44231
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 128774.24  45972.42
##  (between_SS / total_SS =  49.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
library(factoextra)
fviz_cluster(Clustering,
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw(),
             data = country_clu_std)

Based on the graph, we can see that there is still a lot of variability within groups, so in order to achieve a better clustering outcome, I will try making 3 clusters. This decision can be supported by the elbow method analysis, as well as it is a close second result in the K-means method.

Clustering <- kmeans(country_clu_std,
centers = 3,
nstart = 25)

Clustering
## K-means clustering with 3 clusters of sizes 48, 55, 33
## 
## Cluster means:
##   Stability.15.. Rights.20.. Health.15.. Safety.10.. Climate.15..
## 1       41.14583    27.89583    30.60417    55.31250     64.29167
## 2       56.92727    48.85455    64.74545    84.65455     54.94545
## 3       81.69697    85.63636    86.24242    94.18182     33.18182
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 2 2 3 3 2 2 2 2 3 2 2 3 3
##  [38] 3 3 3 2 3 3 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 1 1 1 2 1 2 2 2 1 2 2 1 2 1 2 2 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 64794.17 41676.25 19152.48
##  (between_SS / total_SS =  63.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
library(factoextra)
fviz_cluster(Clustering,
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw(),
             data = country_clu_std)

Explanation of results:

  • When clustering to 3 clusters, clusters contain 33, 48 and 55 countries respectively.