country <- read.table("C:/Users/Tinkara/Desktop/IMB 2024 R/Bootcamp/Quality of life in a country comparison.csv", header=TRUE, sep=",", dec=".")
### Removing not needed variables
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
country <- country %>% select (3:10)
head(country)
## Country Stability.15.. Rights.20.. Health.15.. Safety.10.. Climate.15..
## 1 Australia 86 92 87 100 90
## 2 San Marino 81 82 91 87 62
## 3 Hong Kong 77 85 100 97 49
## 4 Switzerland 91 99 93 99 34
## 5 Macao 90 71 100 93 48
## 6 Malta 86 72 91 92 75
## Costs.15.. Popularity.10..
## 1 23 41
## 2 61 47
## 3 61 41
## 4 44 46
## 5 61 44
## 6 41 44
Unit of observation: one country
Initial sample size: 137 countries
Definition of variables and units of measurement:
Source of data: https://www.kaggle.com/datasets/shivamsingh0194/quality-of-life-in-a-country-comparison (explanation fo data taken from original cource here: https://www.worlddata.info/quality-of-life.php)
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(country[ , c (2, 4, 5, 6, 7, 8) ]), 2)
## Stability.15.. Health.15.. Safety.10.. Climate.15.. Costs.15..
## nbr.val 137.00 137.00 137.00 137.00 137.00
## nbr.null 0.00 1.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 8.00 0.00 4.00 2.00 23.00
## max 93.00 100.00 100.00 95.00 76.00
## range 85.00 100.00 96.00 93.00 53.00
## sum 7888.00 7963.00 10519.00 7293.00 7171.00
## median 57.00 63.00 85.00 53.00 54.00
## mean 57.58 58.12 76.78 53.23 52.34
## SE.mean 1.65 2.23 1.86 1.71 0.99
## CI.mean.0.95 3.27 4.41 3.68 3.37 1.96
## var 373.60 680.86 474.86 398.43 134.86
## std.dev 19.33 26.09 21.79 19.96 11.61
## coef.var 0.34 0.45 0.28 0.37 0.22
## Popularity.10..
## nbr.val 137.00
## nbr.null 0.00
## nbr.na 0.00
## min 11.00
## max 73.00
## range 62.00
## sum 5203.00
## median 37.00
## mean 37.98
## SE.mean 1.03
## CI.mean.0.95 2.04
## var 145.21
## std.dev 12.05
## coef.var 0.32
The average indicator of stability of countries in our sample was 57.58.
The minimum indicator of health in countries in our sample was 0 and the maximum was 100.
50% of countries in our sample have the safety indicator up to and including 85, and 50% have a higher indicator.
country_clu_std <- as.data.frame(scale(country[c(4:8)]))
Can we cluster countries based on their quality of life?
To do this, I will cluster them by indicators for stability, rights, health, safety and climate.
country_clu_std <- as.data.frame(country[c(2:6)])
### Loooking for outliers
country$Dissimilarity <- sqrt(country$Stability.15..^2 +
country$Rights.20..^2 +
country$Health.15..^2 +
country$Safety.10..^2 +
country$Climate.15..^2)
head(country[order(-country$Dissimilarity), c("Country", "Dissimilarity")])
## Country Dissimilarity
## 1 Australia 203.7867
## 4 Switzerland 194.1340
## 8 Bermuda 191.0340
## 7 Luxembourg 190.5256
## 11 Japan 190.2682
## 16 Norway 189.4175
Dissimilarity tells us how different each country is from the rest of the countries (higher value means more different). From this case, it is evident that the countries are quite different from one another, which makes sense, since countries from all around the world are included. Australia seems to be the most different form the others, so I will remove it to avoind issues with clustering.
### Removing Australia
country <- country %>%
filter(!(row_number() %in% c(1)))
country_clu_std <- as.data.frame(country[c(2:6)])
# Checking outliers again
country$Dissimilarity <- sqrt(country$Stability.15..^2 +
country$Rights.20..^2 +
country$Health.15..^2 +
country$Safety.10..^2 +
country$Climate.15..^2)
head(country[order(-country$Dissimilarity), c("Country", "Dissimilarity")])
## Country Dissimilarity
## 3 Switzerland 194.1340
## 7 Bermuda 191.0340
## 6 Luxembourg 190.5256
## 10 Japan 190.2682
## 15 Norway 189.4175
## 18 New Zealand 187.3286
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(country_clu_std,
method = "euclidian")
fviz_dist(Distances, #Showing matrix of distances
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
### Hopkins statistics
library(factoextra)
get_clust_tendency(country_clu_std,
n = nrow(country_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7267446
##
## $plot
## NULL
From the graph of Euclidian distances and Hopkins statistics we can check if data can be clustered. Hopking statistic is close to 1, which confirms that data can be used for clusters. The graph also shows the possibility of clustering, but maybe the number of clusters is not so clear, so I will continue the analysis to figure out how many clusters I should make.
I will check how many clusters I should make based on the Elbow method and Silhouette analysis.
library(factoextra)
library(NbClust)
fviz_nbclust(country_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
Wen using the elbow methond to determine the optimal number of clusters,
we are looking for the most evident breaks. For me, this seems to be at
2 or 3, but I will check additionally with silhouette analysis.
fviz_nbclust(country_clu_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
Silhouette methods measures how well data fits into their assigned
cluster. In this case, the analysis indicates that 2 clusters will
provide the best fit for my data.
I will check this also with K-method:
library(NbClust)
nc <- NbClust(country_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 10 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
Again, it is confirmed that it’s best to use 2 clusters.
Clustering <- kmeans(country_clu_std,
centers = 2,
nstart = 25)
Clustering
## K-means clustering with 2 clusters of sizes 84, 52
##
## Cluster means:
## Stability.15.. Rights.20.. Health.15.. Safety.10.. Climate.15..
## 1 45.71429 33.66667 43.40476 67.19048 60.09524
## 2 76.19231 77.38462 81.34615 91.82692 41.44231
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 128774.24 45972.42
## (between_SS / total_SS = 49.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = country_clu_std)
Based on the graph, we can see that there is still a lot of variability
within groups, so in order to achieve a better clustering outcome, I
will try making 3 clusters. This decision can be supported by the elbow
method analysis, as well as it is a close second result in the K-means
method.
Clustering <- kmeans(country_clu_std,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 48, 55, 33
##
## Cluster means:
## Stability.15.. Rights.20.. Health.15.. Safety.10.. Climate.15..
## 1 41.14583 27.89583 30.60417 55.31250 64.29167
## 2 56.92727 48.85455 64.74545 84.65455 54.94545
## 3 81.69697 85.63636 86.24242 94.18182 33.18182
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 2 2 3 3 2 2 2 2 3 2 2 3 3
## [38] 3 3 3 2 3 3 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 1 1 1 2 1 2 2 2 1 2 2 1 2 1 2 2 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 64794.17 41676.25 19152.48
## (between_SS / total_SS = 63.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = country_clu_std)
Explanation of results: