Unsupervised learning is a family of algorithms that, based on the data they receive, try to find characteristic patterns, structures and anomalies without being given any hints or labelled answers. These algorithms are used, for example, for face and image recognition, as well as for purely business purposes such as shopping-basket analysis.
In the following work I will deal with the concept of clustering, which consists in dividing data into groups of observations with similar characteristics. Using this method, I will try to find out which countries charities should give financial aid to first, based on quality-of-life and economic statistics.
# Libraries for data analysis
library("tidyverse")
library("psych")
library("data.table")
library("tidytext")
# Libraries for visualisation
library("corrplot")
library("ggplot2")
library("DataExplorer")
library("RColorBrewer")
library("kableExtra")
library("tidytext")
#libraries for clustering
library("clustertend")
library("factoextra") # drawing charts for clustering
library("gridExtra")
library("mclust") # Model Based Algorithm
The dataset was downloaded from https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data and contains socio-economic and health indicators for 167 countries.
unchanged_countries<- read.csv('country-data.csv')
countries <- read.csv('country-data.csv')
countries_description <- read.csv('data-dictionary.csv', sep = ",", header = FALSE)
kable(head(countries)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 |
kable(countries_description[-1, ])
| Variable | Description |
|---|---|
| country | Name of the country |
| child_mort | Deaths of children under 5 years of age per 1000 live births |
| exports | Exports of goods and services per capita, given as a percentage of GDP per capita |
| health | Total health spending per capita, given as a percentage of GDP per capita |
| imports | Imports of goods and services per capita, given as a percentage of GDP per capita |
| income | Net income per person |
| inflation | The measurement of the annual growth rate of total GDP |
| life_expec | The average number of years a newborn child would live if current mortality patterns remain the same |
| total_fer | The number of children that would be born to each woman if current age-fertility rates remain the same |
| gdpp | GDP per capita, calculated as total GDP divided by the total population |
At the beginning of my analysis, I decided to check basic descriptive statistics and find rows with missing data.
cat("Dimension of the dataset: ","(",dim(countries),")")
## Dimension of the dataset: ( 167 10 )
kable(psych::describe(countries)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country* | 1 | 167 | 84.000000 | 48.352870 | 84.00 | 84.000000 | 62.269200 | 1.0000 | 167.00 | 166.0000 | 0.0000000 | -1.2215774 | 3.7416574 |
| child_mort | 2 | 167 | 38.270060 | 40.328932 | 19.30 | 31.588889 | 21.942480 | 2.6000 | 208.00 | 205.4000 | 1.4248158 | 1.6228439 | 3.1207464 |
| exports | 3 | 167 | 41.108976 | 27.412010 | 35.00 | 37.799259 | 21.201180 | 0.1090 | 200.00 | 199.8910 | 2.4020624 | 9.6490756 | 2.1212050 |
| health | 4 | 167 | 6.815689 | 2.746837 | 6.32 | 6.663333 | 2.639028 | 1.8100 | 17.90 | 16.0900 | 0.6931186 | 0.5944336 | 0.2125567 |
| imports | 5 | 167 | 46.890215 | 24.209589 | 43.30 | 44.337037 | 21.052920 | 0.0659 | 174.00 | 173.9341 | 1.8711858 | 6.4058921 | 1.8733942 |
| income | 6 | 167 | 17144.688623 | 19278.067698 | 9960.00 | 13807.629630 | 11638.410000 | 609.0000 | 125000.00 | 124391.0000 | 2.1915532 | 6.6674347 | 1491.7816663 |
| inflation | 7 | 167 | 7.781832 | 10.570704 | 5.39 | 6.265133 | 5.722836 | -4.2100 | 104.00 | 108.2100 | 5.0618313 | 39.9484352 | 0.8179856 |
| life_expec | 8 | 167 | 70.555689 | 8.893172 | 73.10 | 71.327407 | 8.895600 | 32.1000 | 82.80 | 50.7000 | -0.9536222 | 1.0329499 | 0.6881743 |
| total_fer | 9 | 167 | 2.947964 | 1.513848 | 2.41 | 2.768741 | 1.215732 | 1.1500 | 7.49 | 6.3400 | 0.9497881 | -0.2501780 | 0.1171450 |
| gdpp | 10 | 167 | 12964.155689 | 18328.704809 | 4660.00 | 9146.429630 | 5814.757200 | 231.0000 | 105000.00 | 104769.0000 | 2.1783653 | 5.2286139 | 1418.3177603 |
check_missing <- plot_missing(
countries,
ggtheme = theme_bw(),
title = "Missing Values") +
theme_update(plot.title = element_text(hjust = 0.5))
#### Visualisations
Now, let's dig into the data and look for some interesting relationships and additional information.
countries %>%
gather(Info, value, 2:10) %>%
ggplot(aes(x=value, fill=Info)) +
geom_boxplot(color = "black", show.legend=FALSE) +
facet_wrap(~Info, scales="free") +
labs(x="Values", y="",
title="countries Data - Boxplots") +
theme_update(plot.title = element_text(hjust = 0.5),
axis.ticks.y=element_blank(),
axis.text.y=element_blank())
Based on the boxplots above, we can see that there may be outliers which could significantly affect subsequent results. I will bear this in mind and try to do something about it later in the project.
countries %>%
gather(Info, value, 2:10) %>%
ggplot(aes(x=value, fill=Info)) +
geom_histogram(colour="white", show.legend=FALSE) +
facet_wrap(~Info, scales="free_x") +
labs(x="Values", y="Frequency",
title="countries Data - Histograms") +
theme_update(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, our data is mostly right-skewed. In this case, for the right-skewed variables I will focus only on their maximum values, while for the left-skewed ones on their minimum values. This trick will help me understand the data better and learn more about the countries included in the analysis.
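As a quick numerical confirmation, we can list the skewness of each numeric column with the skew() function from the psych package loaded earlier (a small sketch; positive values indicate right skew, negative values left skew):
# Skewness per variable: positive = right-skewed, negative = left-skewed
sapply(countries[, -1], psych::skew)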
dataframes_sorted <- lapply(countries[, -1], function(x) countries %>%
arrange(desc(x)))
reorder_columns_dfs <- lapply(dataframes_sorted, function (x){x[1:5,c(2,3,4,5,6,7,8,9,10,1)]})
# In the dataframe below I will save the top 5 countries for each variable
top_features <- data.frame(variable = "x", country = "y", values = 1)
for (i in 1:length(reorder_columns_dfs)) {
temp <- reorder_columns_dfs[[i]] %>%
select(country) %>%
mutate("variable" = colnames(reorder_columns_dfs[[i]])[i], .before = country) %>%
mutate("values" = reorder_columns_dfs[[i]][, i])
top_features <- rbind(top_features, temp)
}
# Delete temporary row
top_features <- top_features[-1, ]
# Order the countries and plot the variables
indexes_top_5_countries <- top_features %>%
group_by(variable) %>%
ungroup() %>%
mutate(variable = as.factor(variable),
country = reorder_within(country, values, variable)) %>%
ggplot(aes(x = country, y = values, fill = variable)) +
geom_col(show.legend=FALSE) +
facet_wrap(~variable, scales="free") +
scale_x_reordered() +
labs(title = "Top 5 countries per each variable", y = "", x = "Countries"
)
indexes_top_5_countries + theme(axis.text.x = element_text(color = "#000033",
size = 7, angle = 20))
Based on the barplots above, I can say that my data set seems to reflect reality. Therefore, I already have the first indication that my outliers are not spurious values. In addition, I can conclude that the countries that will need financial assistance first will predominantly be from Africa. In later parts of the paper, I will try to specify more precisely which countries these will be.
Let's examine the correlations, using the Spearman method, because the variables do not follow a normal distribution.
par(mfrow=c(1,1))
countries_matrix <- data.matrix(countries[, -1]) # drop the country name column before computing correlations
corr.data <- cor(countries_matrix, method = "spearman")
corrplot(corr.data, type = 'lower', order = 'hclust', tl.col = 'black',
cl.ratio = 0.2, tl.srt = 45, col = COL2('PuOr', 10))
There is some correlation between the variables; however, it should not have a major impact on my analysis.
Now, let’s try to clean the data by removing outliers with IQR method.
temp_country <- countries[, -1]
# Count the outliers flagged by boxplot.stats() (Tukey's fences) for each variable
result <- do.call("cbind", lapply(temp_country, function(x) length(boxplot.stats(x)$out)))
result
## child_mort exports health imports income inflation life_expec total_fer
## [1,] 4 5 2 4 8 5 3 1
## gdpp
## [1,] 25
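For reference, boxplot.stats() flags points lying outside Tukey's fences, i.e. roughly 1.5 interquartile ranges beyond the quartiles. A minimal sketch of that rule applied to gdpp (quantile() may differ slightly from the hinge-based fences that boxplot.stats() actually uses):
# Tukey's fences for gdpp: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q <- quantile(countries$gdpp, c(0.25, 0.75))
fences <- c(lower = q[[1]] - 1.5 * IQR(countries$gdpp), upper = q[[2]] + 1.5 * IQR(countries$gdpp))
fences
# Count of gdpp values outside the fences; should be close to the 25 reported above
sum(countries$gdpp < fences["lower"] | countries$gdpp > fences["upper"])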
The variables income and gdpp seem especially interesting, as a large number of outliers were detected in them. Let's assess whether it would be worthwhile to remove these records.
out <- boxplot.stats(countries$income)$out
out_ind <- which(countries$income %in% c(out))
countries[out_ind, ]$country
## [1] "Brunei" "Kuwait" "Luxembourg"
## [4] "Norway" "Qatar" "Singapore"
## [7] "Switzerland" "United Arab Emirates"
out <- boxplot.stats(countries$gdpp)$out
out_ind <- which(countries$gdpp %in% c(out))
countries[out_ind, ]$country
## [1] "Australia" "Austria" "Belgium"
## [4] "Brunei" "Canada" "Denmark"
## [7] "Finland" "France" "Germany"
## [10] "Iceland" "Ireland" "Italy"
## [13] "Japan" "Kuwait" "Luxembourg"
## [16] "Netherlands" "New Zealand" "Norway"
## [19] "Qatar" "Singapore" "Sweden"
## [22] "Switzerland" "United Arab Emirates" "United Kingdom"
## [25] "United States"
As regards the income variable, let us recall that it is calculated per capita. We can see that this group consists of countries that we can definitely consider rich, so there is no point in removing them.
For the variable gdpp the situation is very similar, which agrees with intuition. Therefore, I again decided not to remove any country from the analysis.
countries_unscaled <- data.frame(countries) # making copy, because it will be helpful in comparison of the clusters
# Now scale
countries <- countries %>%
mutate_if(is.numeric, scale)
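A quick sketch to verify the standardisation: every numeric column should now have mean approximately 0 and standard deviation 1. (Note that scale() stores each column as a one-column matrix; mutate(across(where(is.numeric), ~ as.numeric(scale(.)))) would be an equivalent that keeps plain numeric vectors.)
# After scaling, means should be ~0 and standard deviations equal to 1
round(sapply(countries[, -1], mean), 10)
sapply(countries[, -1], sd)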
Before applying the k-means and PAM algorithms, it is useful to first check that our dataset actually has "clustering" characteristics. To do this, we use the Hopkins statistic: the closer its value is to 1, the more "clusterable" our data is. We can also illustrate this graphically with a dissimilarity plot, where clearly visible rectangles confirm that clusters can be found. In my case, the Hopkins statistic equals 0.9287641, so I can confidently use clustering methods.
get_clust_tendency(countries[,-1], 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9287641
##
## $plot
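As a cross-check, the clustertend package loaded at the beginning provides its own hopkins() function. Note that its convention is inverted relative to get_clust_tendency(): values close to 0 (roughly 1 minus the statistic above) indicate clusterable data. A sketch, with the sample size n = 50 chosen arbitrarily:
set.seed(123)
# Here values near 0 indicate good clustering tendency
clustertend::hopkins(countries[, -1], n = 50)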
In this step, we are going to find the optimal number of clusters for each method. My dataset is rather small, so I will not use the CLARA method, which is mainly helpful for large datasets.
Silhouette method
opt_kmeans_sill <- fviz_nbclust(countries[,-1], FUNcluster = kmeans, method = "silhouette") +
theme_classic() +
labs(subtitle = "Silhouette method with K-means")
opt_pam_sill <- fviz_nbclust(countries[,-1], FUNcluster = cluster::pam, method = "silhouette") +
theme_classic() +
labs(subtitle = "Silhouette method with PAM")
opt_hct_sill <- fviz_nbclust(countries[,-1], FUNcluster = hcut, method = "silhouette") +
theme_classic() +
labs(subtitle = "Silhouette method with Hierarchical Clustering")
grid.arrange(opt_kmeans_sill, opt_pam_sill, opt_hct_sill, ncol=2)
Applying the silhouette method to the different algorithms produced varied results, but for the PAM and hierarchical clustering methods the optimal number of clusters is 2, which is not a very satisfactory result. I therefore decided to perform a similar analysis using the gap statistic, in order to find a better result.
GAP method
opt_kmeans_gap <- fviz_nbclust(countries[,-1], FUNcluster = kmeans, method = "gap") + theme_classic() +
labs(subtitle = "GAP method with K-means")
opt_pam_gap <- fviz_nbclust(countries[,-1], FUNcluster = cluster::pam, method = "gap") + theme_classic() +
labs(subtitle = "GAP method with PAM")
opt_hct_gap <- fviz_nbclust(countries[,-1], FUNcluster = hcut, method = "gap") + theme_classic() +
labs(subtitle = "GAP method with Hierarchical Clustering")
grid.arrange(opt_kmeans_gap, opt_pam_gap, opt_hct_gap, ncol=2)
In the case of the gap statistic, we see that the results of the three algorithms are similar to each other.
Based on this, it seems to me that a division into 3 clusters might be best from a business point of view. In a way, this division can reflect rich, middle-income and poor countries. Nevertheless, in further analysis we will also examine how the algorithms behave when considering only 2 clusters.
The best-known and simplest clustering algorithms are k-means and PAM. Although they are very similar, two important differences between them should be mentioned. In k-means, our task is to find centroids, i.e. group centres, such that the sum of distances from each point to the centre of its cluster is as small as possible. In the PAM algorithm the idea is similar; however, we replace centroids with medoids, i.e. points taken from the dataset itself. Therefore, in the PAM method the set of possible cluster centres is smaller, but the method is more robust to outliers. There are many ways to measure these distances, such as the Manhattan or Canberra metrics; nevertheless, in my work I will use the basic Euclidean distance.
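To make the centroid/medoid distinction concrete, here is a small illustrative sketch run outside the main loop below: k-means returns synthetic centres averaged from the data, whereas PAM returns medoids that are actual rows, so we can even name the medoid countries.
set.seed(123)
km_demo <- kmeans(countries[, -1], centers = 3, nstart = 25)
pam_demo <- cluster::pam(countries[, -1], k = 3)
km_demo$centers # synthetic centroids: averages, not real observations
countries$country[pam_demo$id.med] # medoids are actual countries from the dataset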
# In the below dataframe I will save the results of each algorithm
countries_division <- data.frame(countries$country)
temp_chart_clust_plot <- list()
temp_chart_clust_sill <- list()
temp <- 1
cluster_types <- c("kmeans", "pam") # change your values here
cluster_numbers <- c(2,3)
for(number in cluster_numbers) {
for (type in cluster_types) {
# Plot clusters
clustering_info<- eclust(countries[,-1], FUNcluster = type, k=number, hc_metric = 'euclidean', graph = FALSE)
clustering_kmeans_chart_plot <- fviz_cluster(clustering_info, geom = c("point")) +
ggtitle(paste(paste(toupper(substr(type, 1, 1)),
substr(type, 2, nchar(type)), sep = ""),"with", number, "clusters", collapse = ""))
# Plot silhouette
clustering_kmeans_chart_sill <- fviz_silhouette(clustering_info) +
ggtitle(paste(paste(toupper(substr(type, 1, 1)),
substr(type, 2, nchar(type)), sep = ""),"with", number, "clusters",
"\n and width equal: ", round(mean(clustering_info$silinfo$avg.width), 2), collapse = ""))
# Save plots into list
temp_chart_clust_plot[[temp]] <- clustering_kmeans_chart_plot
temp_chart_clust_sill[[temp]] <- clustering_kmeans_chart_sill
# Save number of cluster for each country based on each type of algorithm and number of clusters
countries_division$new <- clustering_info$cluster
colnames(countries_division)[temp+1] <- paste(type, number, sep = "_")
temp <- temp + 1
}
}
## cluster size ave.sil.width
## 1 1 99 0.29
## 2 2 68 0.29
## cluster size ave.sil.width
## 1 1 69 0.28
## 2 2 98 0.29
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
grid.arrange(grobs = temp_chart_clust_plot, ncol=2 , top = "Clustering plots")
grid.arrange(grobs = temp_chart_clust_sill, ncol=2 , top = "Silhouette plots")
As you can see from the graphs above, I was able to split the observations in such a way that I obtained well-separated clusters. Let's see what conclusions we can draw after considering other methods.
In hierarchical clustering, our goal is to determine the similarities between individual observations, which yields graphs in the form of trees (dendrograms). At first, we treat each observation as a separate cluster, and then we merge similar observations into larger sets. Of course, also in this case there are many methods of calculating distances between clusters. For the following 4 linkage methods I decided to present the results graphically, to illustrate how significantly they can influence the final outcome. The best of them (and also the most time-consuming) is undoubtedly Ward's method, which merges clusters so as to minimise within-cluster variance. I decided to save the results obtained from this method to the final set, so that I can compare them with the other algorithms.
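Before inspecting the individual dendrograms, one rough way to rank linkage methods is the agglomerative coefficient from cluster::agnes() (the cluster package is already used for PAM); values closer to 1 suggest a stronger clustering structure. A sketch, in which Ward's method can be expected to score highest:
linkages <- c("single", "complete", "average", "ward")
# Agglomerative coefficient for each linkage method
sapply(linkages, function(m) cluster::agnes(countries[, -1], method = m)$ac)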
Single method
hc_single <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
hc_single$labels<-countries$country
plot(hc_single, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_single, k=3, border='red')
Complete method
hc_complete <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
hc_complete$labels<-countries$country
plot(hc_complete, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_complete, k=3, border='red')
Average method
hc_average <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
hc_average$labels<-countries$country
plot(hc_average, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_average, k=3, border='red')
Ward.D2 method (the best of the four)
hc_ward <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
hc_ward$labels<-countries$country
plot(hc_ward, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_ward, k=3, border='red')
ward_hclust<-cutree(hc_ward, k=3)
table(ward_hclust)
## ward_hclust
## 1 2 3
## 27 106 34
countries_division$ward_hclust <- ward_hclust
Model-based clustering assumes that the data is generated by an underlying mixture of probability distributions and tries to recover those distributions from the data. The most popular approach is the Gaussian mixture model, which uses a mean vector, a covariance matrix and the probability of each observation belonging to each cluster in its calculations. Our main task is to maximise the BIC statistic, which penalises models with too many parameters, and hence too many clusters.
mc_countries <- Mclust(countries[, -1]) # cluster on the numeric columns only
summary(mc_countries)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVE (ellipsoidal, equal volume and orientation) model with 5 components:
##
## log-likelihood n df BIC ICL
## -1725.771 167 145 -4193.651 -4209.05
##
## Clustering table:
## 1 2 3 4 5
## 47 38 11 30 41
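It is worth noting that mclust defines BIC as 2 * log-likelihood - df * log(n), so, unlike the usual convention, higher values are better. A one-line sanity check that reproduces the BIC from the fitted object's own components:
# Should match the BIC reported in the summary (-4193.651)
2 * mc_countries$loglik - mc_countries$df * log(mc_countries$n)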
# BIC values used for choosing the number of clusters; in mclust, the higher the BIC, the better
(BIC_mc_countries <- fviz_mclust(mc_countries, "BIC", palette = "jco"))
# Classification: plot showing the clustering
fviz_mclust(mc_countries, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
At first we see that the optimal number of clusters is 5, while the classification plot itself does not look very convincing. Let's see what happens if we force the algorithm to use 2 and 3 clusters.
mc_countries_2 <- Mclust(countries[, -1], G = 2)
mc_countries_3 <- Mclust(countries[, -1], G = 3)
summary(mc_countries_3)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:
##
## log-likelihood n df BIC ICL
## -1904.855 167 105 -4347.1 -4357.081
##
## Clustering table:
## 1 2 3
## 53 72 42
# Save the results into my final dataframe
countries_division$mbc_2 <- mc_countries_2$classification
countries_division$mbc_3 <- mc_countries_3$classification
fviz_cluster(mc_countries_2)
fviz_cluster(mc_countries_3)
As we can see, the situation has not improved much; nevertheless, for the sake of comparison and exercise, I decided to include these results in the final analysis.
Because we are concentrating on financial aid, let's focus only on the variables that might be most important for us:
countries_division$exports <- unchanged_countries$exports
countries_division$imports <- unchanged_countries$imports
countries_division$income <- unchanged_countries$income
countries_division$inflation <- unchanged_countries$inflation
# Creating empty dataframe where I will store calculated results
clusters_summarised <- data.frame(Algorithm = character(),
Clusters = integer(),
exports = double(),
imports = double(),
income = double(),
inflation = double())
# Add results into dataframe
groups <- c(quo(kmeans_2), quo(pam_2), quo(kmeans_3), quo(pam_3), quo(ward_hclust), quo(mbc_2), quo(mbc_3))
for (i in seq_along(groups)) {
temp <- countries_division %>%
group_by(!!groups[[i]]) %>% # Unquote with !!
summarise(across(c(exports, imports, income, inflation), median)) %>% # summarise_each()/funs() are deprecated
mutate("Algorithm" = colnames(countries_division[i+1])) %>%
rename(Clusters = !!groups[[i]]) %>%
relocate(Algorithm, .before = Clusters) %>%
print()
clusters_summarised <- rbind(clusters_summarised, temp)
}
## # A tibble: 2 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 kmeans_2 1 40.1 46.2 19400 3.53
## 2 kmeans_2 2 25.0 41.5 2590 8.21
## # A tibble: 2 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 pam_2 1 25 39.2 2660 8.92
## 2 pam_2 2 40.0 47.2 19250 3.38
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 kmeans_3 1 50.4 39.0 40550 1.19
## 2 kmeans_3 2 37.3 48.7 10180 5.80
## 3 kmeans_3 3 23.8 40.3 1870 8.92
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <int> <dbl>
## 1 pam_3 1 22.8 39.2 1990 8.98
## 2 pam_3 2 39.8 50.9 11200 5.14
## 3 pam_3 3 42.3 37.1 40700 1.16
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ward_hclust 1 22.2 39.2 1430 5.45
## 2 ward_hclust 2 37.3 49.2 9890 6.04
## 3 ward_hclust 3 50.0 37.2 41250 1.67
## # A tibble: 2 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mbc_2 1 41.2 37.8 34850 1.94
## 2 mbc_2 2 31.7 45.9 5900 6.35
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mbc_3 1 22.8 43 1990 7.64
## 2 mbc_3 2 38.9 48.9 11700 5.29
## 3 mbc_3 3 49.5 39.0 40100 1.67
head(clusters_summarised)
## # A tibble: 6 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 kmeans_2 1 40.1 46.2 19400 3.53
## 2 kmeans_2 2 25.0 41.5 2590 8.21
## 3 pam_2 1 25 39.2 2660 8.92
## 4 pam_2 2 40.0 47.2 19250 3.38
## 5 kmeans_3 1 50.4 39.0 40550 1.19
## 6 kmeans_3 2 37.3 48.7 10180 5.80
So we see that for each algorithm we have calculated the median of the 4 variables. Now let's check which algorithms produced clusters with the lowest export, import and income values and the highest inflation rate.
# Positions of our desired values
as.matrix(apply(clusters_summarised[, 3:5],2,which.min))
## [,1]
## exports 11
## imports 10
## income 11
as.matrix(apply(clusters_summarised[, 6],2,which.max))
## [,1]
## inflation 8
clusters_summarised[c(8, 10, 11), ]
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pam_3 1 22.8 39.2 1990 8.98
## 2 pam_3 3 42.3 37.1 40700 1.16
## 3 ward_hclust 1 22.2 39.2 1430 5.45
ward_hclust has the lowest exports and income (1st cluster), while pam_3 has the lowest imports (3rd cluster) and the highest inflation (1st cluster).
Let's see which countries belong to these clusters.
countries_division %>%
filter((pam_3 == 1 & ward_hclust == 1) | (pam_3 == 3 & ward_hclust == 1)) %>%
select(countries.country)
## countries.country
## 1 Afghanistan
## 2 Benin
## 3 Burkina Faso
## 4 Burundi
## 5 Cameroon
## 6 Central African Republic
## 7 Chad
## 8 Comoros
## 9 Congo, Dem. Rep.
## 10 Cote d'Ivoire
## 11 Gambia
## 12 Guinea
## 13 Guinea-Bissau
## 14 Haiti
## 15 Kenya
## 16 Madagascar
## 17 Malawi
## 18 Mali
## 19 Mozambique
## 20 Niger
## 21 Rwanda
## 22 Senegal
## 23 Sierra Leone
## 24 Tanzania
## 25 Togo
## 26 Uganda
## 27 Zambia
So we have 27 countries out of 167 that charities may consider targeting first for financial aid. Most of them, unsurprisingly, are in Africa, which seems logical.
The main aim of the study was to answer the question of which countries charities should help financially first. In an initial analysis of the data, I checked whether the values of the indicators corresponded to reality. Through helpful visualisations I came to the conclusion that the countries in need would mostly come from Africa. To identify which countries these might be, I used four algorithms: k-means, PAM, hierarchical clustering and model-based clustering. For the first three I obtained satisfactory results and well-separated clusters; the same cannot be said of the last method. Nevertheless, this did not affect the final results, and in the end two models (PAM with its 1st and 3rd cluster groups, and hierarchical clustering with its 1st cluster group) were taken into account. Of the 27 potentially needy countries identified, most are in Africa, which is in line with my initial assumptions. I therefore believe that I have succeeded in fulfilling the purpose of the paper and that its findings translate meaningfully to global reality.