The World Happiness Report is an annual survey of the state of global happiness, which ranks the world’s countries by happiness based on a handful of factors. The goal of this analysis is to find a proper way to divide the countries into clusters based on only 6 variables: GDP per capita, Social support, Life expectancy, Freedom to make life choices, Generosity and Perceptions of corruption. Although there is no “curse of dimensionality” here, an additional goal is to reduce the number of dimensions in the dataset to visualize the data better.
The dataset consists of 149 observations - one for each country, measured in the year 2020.
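The setup chunk is not shown in the report; the following preamble is inferred from the functions called later (a hedged reconstruction - package versions are not pinned):
# Packages inferred from the calls below - hedged reconstruction of the setup chunk
library(data.table)   # data.table(), setnames()
library(dplyr)        # group_by(), summarise(), arrange()
library(ggplot2)      # plots
library(Rtsne)        # t-SNE embedding
library(factoextra)   # fviz_eig(), fviz_nbclust(), eclust(), fviz_cluster(), fviz_silhouette(), fviz_dend()
library(plotly)       # interactive 3-d scatter
library(RColorBrewer) # brewer.pal()
library(reshape2)     # melt()
library(fclust)       # fuzzy k-means (FKM)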
df <- read.csv('world-happiness-report-2021.csv')
print(colnames(df))
## [1] "Country.name"
## [2] "Regional.indicator"
## [3] "Ladder.score"
## [4] "Standard.error.of.ladder.score"
## [5] "upperwhisker"
## [6] "lowerwhisker"
## [7] "Logged.GDP.per.capita"
## [8] "Social.support"
## [9] "Healthy.life.expectancy"
## [10] "Freedom.to.make.life.choices"
## [11] "Generosity"
## [12] "Perceptions.of.corruption"
## [13] "Ladder.score.in.Dystopia"
## [14] "Explained.by..Log.GDP.per.capita"
## [15] "Explained.by..Social.support"
## [16] "Explained.by..Healthy.life.expectancy"
## [17] "Explained.by..Freedom.to.make.life.choices"
## [18] "Explained.by..Generosity"
## [19] "Explained.by..Perceptions.of.corruption"
## [20] "Dystopia...residual"
We only need the columns holding the values themselves, so we will work with the variables: Country, Region, Logged GDP per capita, Social support, Life expectancy, Freedom to make life choices, Generosity and Perceptions of corruption. For aesthetic reasons, all dots in the column names are changed to underscores. We save the “Country” and “Region” labels into separate objects and keep only the numerical values inside the data frame.
df <- df[, c(1:2, 7:12)]  # keep the two label columns and the 6 raw variables
old_names <- colnames(df)
new_names <- gsub('.', '_', old_names, fixed = TRUE)
df <- setnames(df, old = old_names, new = new_names)
Regions <- df[2]     # Regional_indicator labels
Countries <- df[1]   # Country_name labels
df <- df[-(1:2)]     # numeric variables only
t-SNE stands for t-distributed Stochastic Neighbor Embedding - a non-linear statistical method of dimensionality reduction. The biggest difference from PCA is that t-SNE is a non-linear algorithm: it tries to preserve the local structure of the data and can handle outliers, but it is intended mainly for visualisation. The creators of the technique state¹ that the performance of the algorithm is fairly robust to changes in the perplexity parameter. Results of t-SNE are generally hard to reproduce, so it is usually a good idea to make a few visualisations and compare them. Below is the best plot I could make with the given data:
# t-SNE (the embedding is stochastic, so we fix the seed for reproducibility)
set.seed(42)  # hypothetical seed - the original value is not shown
tsne <- Rtsne(df, dims = 2, perplexity = 35, max_iter = 700)
tsne_df <- data.table(tsne$Y)
tsne_df$Region <- Regions$Regional_indicator  # plain vector, not a one-column data frame
tsne_df %>%
  ggplot(aes(x = V1,
             y = V2,
             color = Region)) +
  geom_point() +
  theme(legend.position = 'left')
We see three scattered clusters: the first one consists mostly of Sub-Saharan countries, the bottom-right one of Western European countries, and the middle one of the Middle East, Central and Eastern Europe, the Commonwealth of Independent States and others. That is our starting point, and we can assume for now that the countries are best clustered into 3 groups.
pca <- prcomp(df, center = TRUE, scale. = TRUE)
Principal Component Analysis is a linear dimensionality reduction technique which can be used to reduce the number of variables for modelling purposes. It produces components that are linear combinations of the input variables. The optimal number of components is usually determined by the explained variance or the eigenvalues.
fviz_eig(pca)                          # scree plot: % of variance explained
fviz_eig(pca, choice = 'eigenvalue')   # scree plot: eigenvalues
summary(pca)$importance
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.764773 1.134382 0.8384161 0.71998 0.5002993 0.3565696
## Proportion of Variance 0.519070 0.214470 0.1171600 0.08640 0.0417200 0.0211900
## Cumulative Proportion 0.519070 0.733540 0.8507000 0.93709 0.9788100 1.0000000
Judging by the percentage of explained variance, the first 3 principal components would be enough - they explain roughly 85% of the variance. Regarding the eigenvalues, the Kaiser rule would suggest only 2 PCs, as only for those two is the eigenvalue higher than 1. That approach would be too coarse, as only 73% of the variance would be explained. 3 components seem a reasonable number: we reduce the dimensionality by 50% while losing only 15% of the information.
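The Kaiser rule can be verified directly, since the eigenvalues of a prcomp object are the squared component standard deviations:
# Eigenvalues = squared standard deviations of the components;
# only PC1 (~3.11) and PC2 (~1.29) exceed 1, hence the Kaiser suggestion of 2 PCs
pca$sdev^2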
The first component is highly correlated with GDP, Social support, Life expectancy and Freedom to make life choices - the determinants of a developed country. There is also a reasonable correlation (in the opposite direction) with Perceptions of corruption.
The second component is highly correlated with Generosity, and there is also a relationship with Perceptions of corruption - again in the opposite direction.
The third component is almost solely correlated with Perceptions of corruption, in the opposite direction to the first two components. It will probably be a determinant of an underdeveloped country.
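These interpretations follow from the loadings in pca$rotation. The data frame pca_df used in the next chunks is not created in the code shown; a minimal reconstruction (hedged - the column layout is inferred from the pca_df[, -4] and pca_df$PC1 calls below) is:
round(pca$rotation, 2)  # signs and magnitudes behind the interpretations above

# Hedged reconstruction, not shown in the original code: the first three PCs
# plus the Region label as column 4 (hence pca_df[, -4] in later calls)
pca_df <- data.frame(pca$x[, 1:3])
pca_df$Region <- Regions$Regional_indicator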
Silhouette and Total Within Sum of Squares are the most popular techniques for finding the optimal number of clusters for the K-means algorithm.
fviz_nbclust(pca_df[, -4], kmeans, method = 'silhouette') +
  labs(subtitle = 'Silhouette method')
fviz_nbclust(pca_df[, -4], kmeans, method = 'wss') +
  labs(subtitle = 'Within Sum of Squares')
The Silhouette method suggests 3 clusters, and WSS gives similar results. Bearing in mind that we have 149 observations, one for each country in the world, that partition seems reasonable. It will probably result in clusters covering, more or less, highly developed, developing and underdeveloped countries. Choosing a smaller k would mean partitioning the 149 countries between only two clusters (developed/undeveloped, rich/poor), and some information would be lost. In this analysis k = 3 clusters is therefore recommended - not only for the K-means algorithm, but for the following methods as well.
K-means is one of the most popular clustering techniques. It assigns every observation to one of k clusters, depending on which (artificially created) cluster centre is nearest to that observation.
kmeans_ <- eclust(pca_df[, -4], k = 3,
                  hc_metric = 'euclidean', graph = FALSE)
colors <- brewer.pal(n = 3, 'Set2')
fviz_cluster(kmeans_, geom = c("point")) +
  scale_color_manual(values = colors) +
  scale_fill_manual(values = colors) +
  ggtitle('K-means with 3 clusters')
plot_ly(x = pca_df$PC1,
        y = pca_df$PC2,
        z = pca_df$PC3,
        type = 'scatter3d',
        mode = 'markers',
        color = as.factor(kmeans_$cluster))
fviz_silhouette(kmeans_, print.summary = F)
kmeans_df <- data.table(kmeans_$cluster)
kmeans_df$Region <- Regions$Regional_indicator  # plain character vector
kmeans_df$Counter <- 1
kmeans_df %>%
  group_by(V1, Region) %>%
  summarise(Count = sum(Counter), .groups = 'keep') %>%
  arrange(V1, desc(Count)) %>%
  print(n = 22)
## # A tibble: 22 x 3
## # Groups: V1, Region [22]
## V1 Region Count
## <int> <chr> <dbl>
## 1 1 Western Europe 14
## 2 1 North America and ANZ 3
## 3 1 Central and Eastern Europe 1
## 4 1 Commonwealth of Independent States 1
## 5 1 East Asia 1
## 6 1 Middle East and North Africa 1
## 7 1 Southeast Asia 1
## 8 2 Latin America and Caribbean 19
## 9 2 Central and Eastern Europe 16
## 10 2 Middle East and North Africa 12
## 11 2 Commonwealth of Independent States 11
## 12 2 Western Europe 7
## 13 2 East Asia 5
## 14 2 Southeast Asia 5
## 15 2 Sub-Saharan Africa 5
## 16 2 South Asia 2
## 17 2 North America and ANZ 1
## 18 3 Sub-Saharan Africa 31
## 19 3 South Asia 5
## 20 3 Middle East and North Africa 4
## 21 3 Southeast Asia 3
## 22 3 Latin America and Caribbean 1
kmeans_df %>%
  group_by(V1) %>%
  summarise(Count = sum(Counter), .groups = 'keep') %>%
  arrange(V1, desc(Count))
## # A tibble: 3 x 2
## # Groups: V1 [3]
## V1 Count
## <int> <dbl>
## 1 1 22
## 2 2 83
## 3 3 44
df$cluster <- factor(kmeans_$cluster, levels = 1:3, labels = letters[1:3])
melted <- reshape2::melt(df, id.vars = 'cluster')
ggplot(melted, aes(x = cluster,
                   y = value,
                   color = cluster)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free_y') +
  theme(legend.position = 'bottom')
At first glance, the 2-d plot shows overlapping clusters and generally does not look very good. The situation changes when we switch to the 3-d plot (as we used 3 PCs), and the cluster silhouette plot shows that the observations are generally clustered well - the average silhouette width equals 0.41, which is a relatively high score (on a scale from -1 to 1, the bigger the better), and there are only a few observations with negative silhouette values, meaning they should belong to another cluster. In our case, all such countries sit between the 2nd and 3rd clusters - some assigned to the 2nd should belong to the 3rd and vice versa - while the 1st cluster has no such cases.
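For reference, the silhouette of an observation compares its average distance to its own cluster, a(i), with its average distance to the nearest other cluster, b(i): s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies in [-1, 1]. A hedged sketch of recomputing the average width directly with the cluster package (fviz_silhouette above reports the same quantity from the eclust object):
# Hedged sketch: recompute the average silhouette width with cluster::silhouette()
library(cluster)
sil <- silhouette(kmeans_$cluster, dist(pca_df[, -4]))
mean(sil[, 'sil_width'])  # should be close to the 0.41 reported above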
Looking at the countries’ classification and the variable distributions, the initial claims are confirmed - the clusters represent countries’ development levels:
1. The 1st cluster consists mostly of Western European countries and those belonging to the Anglosphere - with high GDP per capita, Social support, Life expectancy and Freedom to make life choices, and low Perceptions of corruption.
2. The 2nd cluster is mostly developing countries - Latin America, Central and Eastern Europe, the Middle East and the Commonwealth of Independent States - with moderate values.
3. The 3rd cluster consists of poor countries - almost entirely Sub-Saharan African - with low values for every variable except Perceptions of corruption and Generosity (a few outliers).
Even though these results make sense, for robustness two other algorithms will be applied as well.
Hierarchical clustering is based on a dissimilarity matrix. The algorithm starts by treating every observation as a separate cluster; then, iteratively, the most similar clusters are merged into one, resulting in a tree (dendrogram) which can be cut at any height to obtain a given number of clusters.
h_df <- pca$x[, 1:3]
rownames(h_df) <- Countries$Country_name
distMatrix <- dist(h_df)
groups <- hclust(distMatrix, method = 'ward.D')
fviz_dend(groups, cex = 0.5, k = 3, rect = TRUE,
          k_colors = colors)
# the same dendrogram, but with labels coloured by the K-means assignment for comparison
fviz_dend(groups, cex = 0.5, k = 3, rect = TRUE,
          k_colors = colors,
          label_cols = kmeans_df$V1[groups$order])
trees <- cutree(groups, k = 3)
dendo_df <- cbind(trees, Regions)
dendo_df$Counter <- 1
dendo_df %>%
  group_by(trees, Regional_indicator) %>%
  summarise(group = sum(Counter)) %>%
  arrange(trees, desc(group)) %>%
  `colnames<-`(c('Cluster', 'Region', 'Count')) %>%
  print(n = 18)
## # A tibble: 18 x 3
## # Groups: Cluster [3]
## Cluster Region Count
## <int> <chr> <dbl>
## 1 1 Western Europe 11
## 2 1 North America and ANZ 3
## 3 1 Southeast Asia 1
## 4 2 Latin America and Caribbean 18
## 5 2 Central and Eastern Europe 17
## 6 2 Commonwealth of Independent States 12
## 7 2 Western Europe 10
## 8 2 Middle East and North Africa 7
## 9 2 Southeast Asia 7
## 10 2 East Asia 6
## 11 2 South Asia 3
## 12 2 Sub-Saharan Africa 2
## 13 2 North America and ANZ 1
## 14 3 Sub-Saharan Africa 34
## 15 3 Middle East and North Africa 10
## 16 3 South Asia 4
## 17 3 Latin America and Caribbean 2
## 18 3 Southeast Asia 1
dendo_df %>%
  group_by(trees) %>%
  summarise(group = sum(Counter)) %>%
  arrange(trees, desc(group)) %>%
  `colnames<-`(c('Cluster', 'Count'))
## # A tibble: 3 x 2
## Cluster Count
## <int> <dbl>
## 1 1 15
## 2 2 83
## 3 3 51
df$hclust <- factor(dendo_df$trees, levels = 1:3, labels = letters[1:3])
melted <- reshape2::melt(df[-7], id.vars = 'hclust')  # df[-7] drops the K-means cluster column
ggplot(melted, aes(x = hclust,
                   y = value,
                   color = hclust)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free_y') +
  theme(legend.position = 'bottom')
The results are similar to K-means, but the 1st cluster is smaller - some of its members moved to the 2nd cluster, which in turn passed several members to the 3rd (compare the two dendrograms and the regional tables). Because of the smaller number of observations in the 1st cluster, its boxplots are correspondingly narrower - but the distribution of every variable stays relatively similar to the K-means scenario.
The biggest advantage of fuzzy clustering is that each data point can belong to more than one cluster, with a membership degree for each. Observations with a maximal membership degree below 50% are classified as “unclear”, i.e. not clearly belonging to any single cluster.
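The object fuzzy_knn used below is not created in the code shown. A plausible reconstruction with the fclust package (hedged - the exact call, variant and seed are unknown; the SIL.F values for k = 2..6 in the summary further down suggest several k were compared):
# Hedged reconstruction: fuzzy k-means on the three PCs; with a vector k,
# FKM() compares the solutions by the fuzzy silhouette index (SIL.F) and keeps
# the best one, here k = 3. $clus holds the closest hard assignment (column 1)
# and the membership degree (column 2) for each observation.
set.seed(42)  # hypothetical seed - the original value is not shown
fuzzy_knn <- FKM(X = pca_df[, -4], k = 2:6)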
fuzzy_df <- data.frame(fuzzy_knn$clus)
fuzzy_df$Regions <- Regions$Regional_indicator
fuzzy_df$Counter <- 1
fuzzy_df %>%
  group_by(Cluster, Regions) %>%
  summarise(group = sum(Counter)) %>%
  arrange(Cluster, desc(group)) %>%
  print(n = 22)
## # A tibble: 22 x 3
## # Groups: Cluster [3]
## Cluster Regions group
## <dbl> <chr> <dbl>
## 1 1 Western Europe 15
## 2 1 North America and ANZ 4
## 3 1 Middle East and North Africa 2
## 4 1 Central and Eastern Europe 1
## 5 1 Commonwealth of Independent States 1
## 6 1 East Asia 1
## 7 1 Latin America and Caribbean 1
## 8 1 Southeast Asia 1
## 9 2 Sub-Saharan Africa 31
## 10 2 South Asia 5
## 11 2 Middle East and North Africa 4
## 12 2 Southeast Asia 3
## 13 2 Latin America and Caribbean 1
## 14 3 Latin America and Caribbean 18
## 15 3 Central and Eastern Europe 16
## 16 3 Commonwealth of Independent States 11
## 17 3 Middle East and North Africa 11
## 18 3 Western Europe 6
## 19 3 East Asia 5
## 20 3 Southeast Asia 5
## 21 3 Sub-Saharan Africa 5
## 22 3 South Asia 2
fuzzy_df %>%
  group_by(Cluster) %>%
  summarise(group = sum(Counter)) %>%
  arrange(Cluster)
## # A tibble: 3 x 2
## Cluster group
## <dbl> <dbl>
## 1 1 26
## 2 2 44
## 3 3 79
# reorder the levels (1, 3, 2) so that labels a/b/c match the K-means clusters
df$fclust <- factor(fuzzy_df$Cluster, levels = c(1, 3, 2), labels = letters[1:3])
melted <- reshape2::melt(df[c(-7, -8)], id.vars = 'fclust')  # drop the K-means and hclust columns
ggplot(melted, aes(x = fclust,
                   y = value,
                   color = fclust)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free_y') +
  theme(legend.position = 'bottom')
fuzzy_df$Country <- Countries$Country_name
fuzzy_df$fclust <- factor(fuzzy_df$Cluster, levels = c(1, 3, 2), labels = letters[1:3])
fuzzy_df[fuzzy_df$Membership.degree < 0.5,
         c('Country', 'Membership.degree', 'fclust')] %>%
  arrange(desc(Membership.degree))
## Country Membership.degree fclust
## Obj 124 Namibia 0.4984663 b
## Obj 125 Palestinian Territories 0.4956267 b
## Obj 109 Algeria 0.4951335 b
## Obj 100 Laos 0.4870567 c
## Obj 78 Tajikistan 0.4854723 b
## Obj 19 United States 0.4833566 a
## Obj 101 Bangladesh 0.4800726 c
## Obj 74 North Cyprus 0.4740861 a
## Obj 97 Turkmenistan 0.4512418 b
## Obj 54 Thailand 0.4468175 b
## Obj 33 Kosovo 0.4285203 b
## Obj 114 Cambodia 0.4256271 b
## Obj 126 Myanmar 0.4231886 c
## Obj 147 Rwanda 0.4213193 c
## Obj 82 Indonesia 0.3734274 c
summary.fclust(fuzzy_knn)
##
## Fuzzy clustering object of class 'fclust'
##
## Number of objects:
## 149
##
## Number of clusters:
## 3
##
## Cluster sizes:
## Clus 1 Clus 2 Clus 3
## 26 44 79
##
##
## Clustering index values:
## SIL.F k=2 SIL.F k=3 SIL.F k=4 SIL.F k=5 SIL.F k=6
## 0.6486382 0.6657629 0.5805586 0.6176952 0.5940933
##
##
## Closest hard clustering partition:
## Obj 1 Obj 2 Obj 3 Obj 4 Obj 5 Obj 6 Obj 7 Obj 8 Obj 9 Obj 10
## 1 1 1 1 1 1 1 1 1 1
## Obj 11 Obj 12 Obj 13 Obj 14 Obj 15 Obj 16 Obj 17 Obj 18 Obj 19 Obj 20
## 1 3 1 1 1 3 1 3 1 3
## Obj 21 Obj 22 Obj 23 Obj 24 Obj 25 Obj 26 Obj 27 Obj 28 Obj 29 Obj 30
## 1 1 1 3 1 3 3 3 3 3
## Obj 31 Obj 32 Obj 33 Obj 34 Obj 35 Obj 36 Obj 37 Obj 38 Obj 39 Obj 40
## 1 1 3 3 3 3 3 3 3 1
## Obj 41 Obj 42 Obj 43 Obj 44 Obj 45 Obj 46 Obj 47 Obj 48 Obj 49 Obj 50
## 3 1 3 3 3 3 3 3 3 3
## Obj 51 Obj 52 Obj 53 Obj 54 Obj 55 Obj 56 Obj 57 Obj 58 Obj 59 Obj 60
## 3 3 3 3 3 3 3 3 3 3
## Obj 61 Obj 62 Obj 63 Obj 64 Obj 65 Obj 66 Obj 67 Obj 68 Obj 69 Obj 70
## 3 3 3 3 3 3 3 3 3 3
## Obj 71 Obj 72 Obj 73 Obj 74 Obj 75 Obj 76 Obj 77 Obj 78 Obj 79 Obj 80
## 3 3 3 1 3 3 1 3 3 3
## Obj 81 Obj 82 Obj 83 Obj 84 Obj 85 Obj 86 Obj 87 Obj 88 Obj 89 Obj 90
## 3 2 2 3 2 3 2 3 3 3
## Obj 91 Obj 92 Obj 93 Obj 94 Obj 95 Obj 96 Obj 97 Obj 98 Obj 99 Obj 100
## 2 2 3 3 2 2 3 2 2 2
## Obj 101 Obj 102 Obj 103 Obj 104 Obj 105 Obj 106 Obj 107 Obj 108 Obj 109 Obj 110
## 2 2 3 3 2 2 3 3 3 3
## Obj 111 Obj 112 Obj 113 Obj 114 Obj 115 Obj 116 Obj 117 Obj 118 Obj 119 Obj 120
## 2 3 2 3 2 2 2 2 2 2
## Obj 121 Obj 122 Obj 123 Obj 124 Obj 125 Obj 126 Obj 127 Obj 128 Obj 129 Obj 130
## 2 3 3 3 3 2 3 2 3 2
## Obj 131 Obj 132 Obj 133 Obj 134 Obj 135 Obj 136 Obj 137 Obj 138 Obj 139 Obj 140
## 2 3 2 2 2 2 2 2 2 2
## Obj 141 Obj 142 Obj 143 Obj 144 Obj 145 Obj 146 Obj 147 Obj 148 Obj 149
## 2 2 2 2 2 3 2 2 2
##
## Cluster memberships:
## Clus 1
## [1] "Obj 1" "Obj 2" "Obj 3" "Obj 4" "Obj 5" "Obj 6" "Obj 7" "Obj 8"
## [9] "Obj 9" "Obj 10" "Obj 11" "Obj 13" "Obj 14" "Obj 15" "Obj 17" "Obj 19"
## [17] "Obj 21" "Obj 22" "Obj 23" "Obj 25" "Obj 31" "Obj 32" "Obj 40" "Obj 42"
## [25] "Obj 74" "Obj 77"
## Clus 2
## [1] "Obj 82" "Obj 83" "Obj 85" "Obj 87" "Obj 91" "Obj 92" "Obj 95"
## [8] "Obj 96" "Obj 98" "Obj 99" "Obj 100" "Obj 101" "Obj 102" "Obj 105"
## [15] "Obj 106" "Obj 111" "Obj 113" "Obj 115" "Obj 116" "Obj 117" "Obj 118"
## [22] "Obj 119" "Obj 120" "Obj 121" "Obj 126" "Obj 128" "Obj 130" "Obj 131"
## [29] "Obj 133" "Obj 134" "Obj 135" "Obj 136" "Obj 137" "Obj 138" "Obj 139"
## [36] "Obj 140" "Obj 141" "Obj 142" "Obj 143" "Obj 144" "Obj 145" "Obj 147"
## [43] "Obj 148" "Obj 149"
## Clus 3 (First 50 objects)
## [1] "Obj 12" "Obj 16" "Obj 18" "Obj 20" "Obj 24" "Obj 26" "Obj 27" "Obj 28"
## [9] "Obj 29" "Obj 30" "Obj 33" "Obj 34" "Obj 35" "Obj 36" "Obj 37" "Obj 38"
## [17] "Obj 39" "Obj 41" "Obj 43" "Obj 44" "Obj 45" "Obj 46" "Obj 47" "Obj 48"
## [25] "Obj 49" "Obj 50" "Obj 51" "Obj 52" "Obj 53" "Obj 54" "Obj 55" "Obj 56"
## [33] "Obj 57" "Obj 58" "Obj 59" "Obj 60" "Obj 61" "Obj 62" "Obj 63" "Obj 64"
## [41] "Obj 65" "Obj 66" "Obj 67" "Obj 68" "Obj 69" "Obj 70" "Obj 71" "Obj 72"
## [49] "Obj 73" "Obj 75"
##
## Number of objects with unclear assignment (maximal membership degree <0.5):
## 15
##
## Objects with unclear assignment:
## [1] "Obj 19" "Obj 33" "Obj 54" "Obj 74" "Obj 78" "Obj 82" "Obj 97"
## [8] "Obj 100" "Obj 101" "Obj 109" "Obj 114" "Obj 124" "Obj 125" "Obj 126"
## [15] "Obj 147"
##
## Cluster sizes (without unclear assignments):
## Clus 1 Clus 2 Clus 3 No clus
## 24 39 71 15
##
## Membership degree matrix (rounded):
## Clus 1 Clus 2 Clus 3
## Obj 1 0.78 0.07 0.15
## Obj 2 0.82 0.06 0.12
## Obj 3 0.88 0.04 0.08
## Obj 4 0.69 0.08 0.24
## Obj 5 0.90 0.03 0.07
## Obj 6 0.86 0.04 0.09
## Obj 7 0.86 0.05 0.10
## Obj 8 0.89 0.03 0.08
## Obj 9 0.86 0.05 0.10
## Obj 10 0.97 0.01 0.02
## Obj 11 0.93 0.02 0.05
## Obj 12 0.31 0.07 0.62
## Obj 13 0.97 0.01 0.03
## Obj 14 0.97 0.01 0.02
## Obj 15 0.95 0.01 0.04
## Obj 16 0.17 0.05 0.77
## Obj 17 0.88 0.03 0.08
## Obj 18 0.19 0.08 0.73
## Obj 19 0.48 0.07 0.44
## Obj 20 0.32 0.08 0.60
## Obj 21 0.53 0.07 0.39
## Obj 22 0.54 0.08 0.38
## Obj 23 0.73 0.06 0.21
## Obj 24 0.16 0.05 0.79
## Obj 25 0.85 0.03 0.12
## Obj 26 0.25 0.06 0.69
## Obj 27 0.22 0.06 0.71
## Obj 28 0.09 0.06 0.86
## Obj 29 0.35 0.08 0.57
## Obj 30 0.12 0.12 0.76
## Obj 31 0.58 0.06 0.37
## Obj 32 0.72 0.10 0.18
## Obj 33 0.23 0.34 0.43
## Obj 34 0.10 0.07 0.83
## Obj 35 0.01 0.01 0.98
## Obj 36 0.02 0.01 0.97
## Obj 37 0.02 0.02 0.96
## Obj 38 0.12 0.07 0.81
## Obj 39 0.03 0.02 0.95
## Obj 40 0.77 0.04 0.19
## Obj 41 0.11 0.05 0.84
## Obj 42 0.53 0.20 0.27
## Obj 43 0.02 0.01 0.97
## Obj 44 0.16 0.05 0.79
## Obj 45 0.12 0.04 0.84
## Obj 46 0.07 0.06 0.87
## Obj 47 0.13 0.04 0.83
## Obj 48 0.03 0.03 0.94
## Obj 49 0.15 0.15 0.70
## Obj 50 0.08 0.04 0.88
## Obj 51 0.07 0.05 0.87
## Obj 52 0.00 0.00 0.99
## Obj 53 0.11 0.08 0.82
## Obj 54 0.32 0.23 0.45
## Obj 55 0.21 0.16 0.63
## Obj 56 0.34 0.11 0.55
## Obj 57 0.06 0.04 0.90
## Obj 58 0.17 0.08 0.75
## Obj 59 0.17 0.26 0.58
## Obj 60 0.10 0.08 0.82
## Obj 61 0.10 0.11 0.79
## Obj 62 0.10 0.06 0.84
## Obj 63 0.03 0.03 0.95
## Obj 64 0.11 0.17 0.72
## Obj 65 0.04 0.06 0.90
## Obj 66 0.00 0.00 0.99
## Obj 67 0.22 0.26 0.52
## Obj 68 0.17 0.20 0.63
## Obj 69 0.06 0.11 0.83
## Obj 70 0.12 0.21 0.68
## Obj 71 0.14 0.11 0.75
## Obj 72 0.01 0.01 0.99
## Obj 73 0.06 0.03 0.91
## Obj 74 0.47 0.06 0.46
## Obj 75 0.14 0.12 0.74
## Obj 76 0.04 0.04 0.93
## Obj 77 0.84 0.04 0.12
## Obj 78 0.24 0.27 0.49
## Obj 79 0.06 0.04 0.90
## Obj 80 0.11 0.14 0.75
## Obj 81 0.26 0.15 0.59
## Obj 82 0.30 0.37 0.33
## Obj 83 0.03 0.90 0.07
## Obj 84 0.06 0.03 0.91
## Obj 85 0.01 0.98 0.02
## Obj 86 0.18 0.12 0.70
## Obj 87 0.13 0.61 0.27
## Obj 88 0.08 0.06 0.86
## Obj 89 0.18 0.07 0.75
## Obj 90 0.31 0.16 0.53
## Obj 91 0.01 0.97 0.02
## Obj 92 0.03 0.90 0.08
## Obj 93 0.07 0.17 0.77
## Obj 94 0.08 0.16 0.76
## Obj 95 0.06 0.78 0.16
## Obj 96 0.03 0.91 0.06
## Obj 97 0.31 0.24 0.45
## Obj 98 0.13 0.68 0.19
## Obj 99 0.07 0.82 0.11
## Obj 100 0.21 0.49 0.30
## Obj 101 0.15 0.48 0.37
## Obj 102 0.02 0.95 0.03
## Obj 103 0.08 0.29 0.63
## Obj 104 0.11 0.17 0.72
## Obj 105 0.01 0.98 0.02
## Obj 106 0.11 0.52 0.36
## Obj 107 0.10 0.21 0.69
## Obj 108 0.17 0.31 0.52
## Obj 109 0.11 0.40 0.50
## Obj 110 0.07 0.13 0.80
## Obj 111 0.07 0.58 0.34
## Obj 112 0.10 0.28 0.62
## Obj 113 0.01 0.96 0.03
## Obj 114 0.15 0.42 0.43
## Obj 115 0.06 0.83 0.11
## Obj 116 0.02 0.93 0.05
## Obj 117 0.02 0.95 0.04
## Obj 118 0.09 0.65 0.25
## Obj 119 0.03 0.89 0.08
## Obj 120 0.01 0.96 0.03
## Obj 121 0.11 0.69 0.20
## Obj 122 0.11 0.36 0.53
## Obj 123 0.10 0.25 0.66
## Obj 124 0.08 0.42 0.50
## Obj 125 0.09 0.41 0.50
## Obj 126 0.29 0.42 0.28
## Obj 127 0.12 0.19 0.70
## Obj 128 0.07 0.81 0.12
## Obj 129 0.14 0.15 0.70
## Obj 130 0.08 0.67 0.25
## Obj 131 0.03 0.92 0.06
## Obj 132 0.10 0.30 0.61
## Obj 133 0.03 0.91 0.06
## Obj 134 0.07 0.68 0.24
## Obj 135 0.04 0.88 0.09
## Obj 136 0.05 0.86 0.09
## Obj 137 0.01 0.95 0.04
## Obj 138 0.04 0.89 0.07
## Obj 139 0.08 0.76 0.16
## Obj 140 0.11 0.72 0.17
## Obj 141 0.07 0.70 0.23
## Obj 142 0.15 0.66 0.19
## Obj 143 0.12 0.72 0.17
## Obj 144 0.04 0.88 0.08
## Obj 145 0.05 0.81 0.14
## Obj 146 0.09 0.17 0.75
## Obj 147 0.32 0.42 0.26
## Obj 148 0.02 0.91 0.07
## Obj 149 0.13 0.63 0.24
##
## Cluster summary:
## Cl.size Min.memb.deg. Max.memb.deg. Av.memb.deg. N.uncl.assignm.
## Clus 1 26 0.47 0.97 0.78 2
## Clus 2 44 0.37 0.98 0.77 5
## Clus 3 79 0.43 0.99 0.74 8
##
## Euclidean distance matrix for the prototypes (rounded):
## Clus 1 Clus 2
## Clus 2 4.47
## Clus 3 2.57 2.87
##
## Available components:
## [1] "U" "H" "F" "clus" "medoid" "value"
## [7] "criterion" "iter" "k" "m" "ent" "b"
## [13] "vp" "delta" "stand" "Xca" "X" "D"
## [19] "call"
##
##
Cluster sizes are similar to K-means, but here the 1st cluster, containing the most developed countries, is a little larger. The variable distributions stay practically the same, but what is interesting in fuzzy clustering is that unclear cluster assignments can be observed. For the given FKM algorithm there are 15 observations with a membership degree below 0.5, i.e. below a 50% chance that the observation belongs to its assigned cluster. 14 of those observations have a membership degree higher than 42% (including the United States) and only one has approximately 37% (Indonesia).
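As a final check (not in the original analysis), cross-tabulating the hard assignments makes the agreement between the methods explicit - cluster labels are arbitrary, so agreement shows up as one dominant cell per row:
# Cross-tabulate the hard assignments of the three methods
table(kmeans = kmeans_$cluster, hclust = trees)
table(kmeans = kmeans_$cluster, fuzzy = fuzzy_knn$clus[, 1])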
Every algorithm produced similar results, so the analysis shows that the world’s countries can be sensibly divided into 3 groups based on their development level:
1. Highly developed countries - mostly Western Europe and North America.
2. Developing countries - mostly Latin America, the Middle East and North Africa, and Central and Eastern Europe.
3. Underdeveloped countries - almost exclusively Sub-Saharan Africa.