Unsupervised learning is a family of algorithms that, based on the data they receive, try to find characteristic patterns, structures and anomalies without being given any hints or labelled answers. These algorithms are used, for example, for face and image recognition, as well as for purely business purposes such as shopping-basket analysis.
In the following work I will deal with the concept of clustering, which consists in dividing data into groups of observations with similar characteristics. Using this method, I will try to find out which countries charities should give financial aid to first, based on quality-of-life and economic statistics.
# Libraries for data analysis
library("tidyverse")
library("psych")
library("data.table")
library("tidytext")
# Libraries for visualisation
library("corrplot")
library("ggplot2")
library("DataExplorer")
library("RColorBrewer")
library("kableExtra")
library("tidytext")
#libraries for clustering
library("clustertend")
library("factoextra") # drawing charts for clustering
library("gridExtra")
library("mclust") # Model Based Algorithm
The dataset was downloaded from https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data and contains socio-economic and health indicators for 167 countries.
unchanged_countries<- read.csv('country-data.csv')
countries <- read.csv('country-data.csv')
countries_description <- read.csv('data-dictionary.csv', sep = ",", header = FALSE)
kable(head(countries)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 |
kable(countries_description[-1, ])
| Variable | Description |
|---|---|
| country | Name of the country |
| child_mort | Deaths of children under 5 years of age per 1000 live births |
| exports | Exports of goods and services per capita, given as a percentage of GDP per capita |
| health | Total health spending per capita, given as a percentage of GDP per capita |
| imports | Imports of goods and services per capita, given as a percentage of GDP per capita |
| income | Net income per person |
| inflation | The measurement of the annual growth rate of total GDP |
| life_expec | The average number of years a newborn child would live if current mortality patterns remain the same |
| total_fer | The number of children that would be born to each woman if current age-fertility rates remain the same |
| gdpp | GDP per capita, calculated as total GDP divided by the total population |
At the beginning of my analysis, I decided to check basic descriptive statistics and find rows with missing data.
cat("Dimension of the dataset: ","(",dim(countries),")")
## Dimension of the dataset: ( 167 10 )
kable(psych::describe(countries)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country* | 1 | 167 | 84.000000 | 48.352870 | 84.00 | 84.000000 | 62.269200 | 1.0000 | 167.00 | 166.0000 | 0.0000000 | -1.2215774 | 3.7416574 |
| child_mort | 2 | 167 | 38.270060 | 40.328932 | 19.30 | 31.588889 | 21.942480 | 2.6000 | 208.00 | 205.4000 | 1.4248158 | 1.6228439 | 3.1207464 |
| exports | 3 | 167 | 41.108976 | 27.412010 | 35.00 | 37.799259 | 21.201180 | 0.1090 | 200.00 | 199.8910 | 2.4020624 | 9.6490756 | 2.1212050 |
| health | 4 | 167 | 6.815689 | 2.746837 | 6.32 | 6.663333 | 2.639028 | 1.8100 | 17.90 | 16.0900 | 0.6931186 | 0.5944336 | 0.2125567 |
| imports | 5 | 167 | 46.890215 | 24.209589 | 43.30 | 44.337037 | 21.052920 | 0.0659 | 174.00 | 173.9341 | 1.8711858 | 6.4058921 | 1.8733942 |
| income | 6 | 167 | 17144.688623 | 19278.067698 | 9960.00 | 13807.629630 | 11638.410000 | 609.0000 | 125000.00 | 124391.0000 | 2.1915532 | 6.6674347 | 1491.7816663 |
| inflation | 7 | 167 | 7.781832 | 10.570704 | 5.39 | 6.265133 | 5.722836 | -4.2100 | 104.00 | 108.2100 | 5.0618313 | 39.9484352 | 0.8179856 |
| life_expec | 8 | 167 | 70.555689 | 8.893172 | 73.10 | 71.327407 | 8.895600 | 32.1000 | 82.80 | 50.7000 | -0.9536222 | 1.0329499 | 0.6881743 |
| total_fer | 9 | 167 | 2.947964 | 1.513848 | 2.41 | 2.768741 | 1.215732 | 1.1500 | 7.49 | 6.3400 | 0.9497881 | -0.2501780 | 0.1171450 |
| gdpp | 10 | 167 | 12964.155689 | 18328.704809 | 4660.00 | 9146.429630 | 5814.757200 | 231.0000 | 105000.00 | 104769.0000 | 2.1783653 | 5.2286139 | 1418.3177603 |
check_missing <- plot_missing(
countries,
ggtheme = theme_bw(),
title = "Missing Values") +
theme_update(plot.title = element_text(hjust = 0.5))
#### Visualisations
Now, let's dig into the data and look for some interesting relationships and additional information.
countries %>%
gather(Info, value, 2:10) %>%
ggplot(aes(x=value, fill=Info)) +
geom_boxplot(color = "black", show.legend=FALSE) +
facet_wrap(~Info, scales="free") +
labs(x="Values", y="",
title="countries Data - Boxplots") +
theme_update(plot.title = element_text(hjust = 0.5),
axis.ticks.y=element_blank(),
axis.text.y=element_blank())
Based on the boxplots above, we can see that there may be outliers which could significantly affect subsequent results. I will bear this in mind and try to do something about it later in the project.
countries %>%
gather(Info, value, 2:10) %>%
ggplot(aes(x=value, fill=Info)) +
geom_histogram(colour="white", show.legend=FALSE) +
facet_wrap(~Info, scales="free_x") +
labs(x="Values", y="Frequency",
title="countries Data - Histograms") +
theme_update(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, our data is mostly right-skewed. In this case, for the right-skewed variables I will focus only on their maximum values, while for the left-skewed ones on their minimum values. This trick will help me understand the data better and learn more about the countries included in the analysis.
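As a quick numerical confirmation, we can list the skewness of each numeric column with the skew() function from the psych package loaded earlier (a small sketch; positive values indicate right skew, negative values left skew):
# Skewness per variable: positive = right-skewed, negative = left-skewed
sapply(countries[, -1], psych::skew)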
dataframes_sorted <- lapply(countries[, -1], function(x) countries %>%
arrange(desc(x)))
reorder_columns_dfs <- lapply(dataframes_sorted, function (x){x[1:5,c(2,3,4,5,6,7,8,9,10,1)]})
# In the dataframe below I will save the top 5 countries for each variable
top_features <- data.frame(variable = "x", country = "y", values = 1)
for (i in 1:length(reorder_columns_dfs)) {
temp <- reorder_columns_dfs[[i]] %>%
select(country) %>%
mutate("variable" = colnames(reorder_columns_dfs[[i]])[i], .before = country) %>%
mutate("values" = reorder_columns_dfs[[i]][, i])
top_features <- rbind(top_features, temp)
}
# Delete temporary row
top_features <- top_features[-1, ]
# Order the countries and plot the variables
indexes_top_5_countries <- top_features %>%
group_by(variable) %>%
ungroup() %>%
mutate(variable = as.factor(variable),
country = reorder_within(country, values, variable)) %>%
ggplot(aes(x = country, y = values, fill = variable)) +
geom_col(show.legend=FALSE) +
facet_wrap(~variable, scales="free") +
scale_x_reordered() +
labs(title = "Top 5 countries per each variable", y = "", x = "Countries"
)
indexes_top_5_countries + theme(axis.text.x = element_text(color = "#000033",
size = 7, angle = 20))
Based on the barplots above, I can say that my data set seems to reflect reality. Therefore, I already have the first indication that my outliers are not spurious values. In addition, I can conclude that the countries that will need financial assistance first will predominantly be from Africa. In later parts of the paper, I will try to specify more precisely which countries these will be.
Let's examine the correlations, using the Spearman method, because the variables do not follow a normal distribution.
par(mfrow=c(1,1))
countries_matrix <- data.matrix(countries[, -1]) # drop the country name column before computing correlations
corr.data <- cor(countries_matrix, method = "spearman")
corrplot(corr.data, type = 'lower', order = 'hclust', tl.col = 'black',
cl.ratio = 0.2, tl.srt = 45, col = COL2('PuOr', 10))
There is some correlation between the variables; however, it should not have a major impact on my analysis.
Now, let’s try to clean the data by removing outliers with IQR method.
temp_country <- countries[, -1]
# Count the outliers flagged by boxplot.stats() (Tukey's fences) for each variable
result <- do.call("cbind", lapply(temp_country, function(x) length(boxplot.stats(x)$out)))
result
## child_mort exports health imports income inflation life_expec total_fer
## [1,] 4 5 2 4 8 5 3 1
## gdpp
## [1,] 25
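For reference, boxplot.stats() flags points lying outside Tukey's fences, i.e. roughly 1.5 interquartile ranges beyond the quartiles. A minimal sketch of that rule applied to gdpp (quantile() may differ slightly from the hinge-based fences that boxplot.stats() actually uses):
# Tukey's fences for gdpp: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q <- quantile(countries$gdpp, c(0.25, 0.75))
fences <- c(lower = q[[1]] - 1.5 * IQR(countries$gdpp), upper = q[[2]] + 1.5 * IQR(countries$gdpp))
fences
# Count of gdpp values outside the fences; should be close to the 25 reported above
sum(countries$gdpp < fences["lower"] | countries$gdpp > fences["upper"])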
The variables income and gdpp seem especially interesting, as a large number of outliers were detected in them. Let's assess whether it would be worthwhile to remove these records.
out <- boxplot.stats(countries$income)$out
out_ind <- which(countries$income %in% c(out))
countries[out_ind, ]$country
## [1] "Brunei" "Kuwait" "Luxembourg"
## [4] "Norway" "Qatar" "Singapore"
## [7] "Switzerland" "United Arab Emirates"
out <- boxplot.stats(countries$gdpp)$out
out_ind <- which(countries$gdpp %in% c(out))
countries[out_ind, ]$country
## [1] "Australia" "Austria" "Belgium"
## [4] "Brunei" "Canada" "Denmark"
## [7] "Finland" "France" "Germany"
## [10] "Iceland" "Ireland" "Italy"
## [13] "Japan" "Kuwait" "Luxembourg"
## [16] "Netherlands" "New Zealand" "Norway"
## [19] "Qatar" "Singapore" "Sweden"
## [22] "Switzerland" "United Arab Emirates" "United Kingdom"
## [25] "United States"
As regards the income variable, let us recall that it is calculated per capita. We can see that this group consists of countries that we can definitely consider rich, so there is no point in removing them.
For the variable gdpp the situation is very similar, which agrees with intuition. Therefore, I again decided not to remove any country from the analysis.
countries_unscaled <- data.frame(countries) # making copy, because it will be helpful in comparison of the clusters
# Now scale
countries <- countries %>%
mutate_if(is.numeric, scale)
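A quick sketch to verify the standardisation: every numeric column should now have mean approximately 0 and standard deviation 1. (Note that scale() stores each column as a one-column matrix; mutate(across(where(is.numeric), ~ as.numeric(scale(.)))) would be an equivalent that keeps plain numeric vectors.)
# After scaling, means should be ~0 and standard deviations equal to 1
round(sapply(countries[, -1], mean), 10)
sapply(countries[, -1], sd)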
Before applying the k-means and PAM algorithms, it is useful to first check that our dataset actually has "clustering" characteristics. To do this, we use the Hopkins statistic: the closer its value is to 1, the more "clusterable" our data is. We can also illustrate this graphically with a dissimilarity plot, where clearly visible rectangles confirm that clusters can be found. In my case, the Hopkins statistic equals 0.9287641, so I can confidently use clustering methods.
get_clust_tendency(countries[,-1], 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9287641
##
## $plot
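As a cross-check, the clustertend package loaded at the beginning provides its own hopkins() function. Note that its convention is inverted relative to get_clust_tendency(): values close to 0 (roughly 1 minus the statistic above) indicate clusterable data. A sketch, with the sample size n = 50 chosen arbitrarily:
set.seed(123)
# Here values near 0 indicate good clustering tendency
clustertend::hopkins(countries[, -1], n = 50)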
In this step, we are going to find the optimal number of clusters for each method. My dataset is rather small, so I will not use the CLARA method, which is mainly helpful for large datasets.
Silhouette method
opt_kmeans_sill <- fviz_nbclust(countries[,-1], FUNcluster = kmeans, method = "silhouette") +
theme_classic() +
labs(subtitle = "Silhouette method with K-means")
opt_pam_sill <- fviz_nbclust(countries[,-1], FUNcluster = cluster::pam, method = "silhouette") +
theme_classic() +
labs(subtitle = "Silhouette method with PAM")
opt_hct_sill <- fviz_nbclust(countries[,-1], FUNcluster = hcut, method = "silhouette") +
theme_classic() +
labs(subtitle = "Silhouette method with Hierarchical Clustering")
grid.arrange(opt_kmeans_sill, opt_pam_sill, opt_hct_sill, ncol=2)
Applying the silhouette method to the different algorithms produced varied results, but for the PAM and hierarchical clustering methods the optimal number of clusters is 2, which is not a very satisfactory result. I therefore decided to perform a similar analysis using the gap statistic, in order to find a better result.
GAP method
opt_kmeans_gap <- fviz_nbclust(countries[,-1], FUNcluster = kmeans, method = "gap") + theme_classic() +
labs(subtitle = "GAP method with K-means")
opt_pam_gap <- fviz_nbclust(countries[,-1], FUNcluster = cluster::pam, method = "gap") + theme_classic() +
labs(subtitle = "GAP method with PAM")
opt_hct_gap <- fviz_nbclust(countries[,-1], FUNcluster = hcut, method = "gap") + theme_classic() +
labs(subtitle = "GAP method with Hierarchical Clustering")
grid.arrange(opt_kmeans_gap, opt_pam_gap, opt_hct_gap, ncol=2)
In the case of the gap statistic, we see that the results of the three algorithms are similar to each other.
Based on this, it seems to me that a division into 3 clusters might be best from a business point of view. In a way, this division can reflect rich, middle-income and poor countries. Nevertheless, in further analysis we will also examine how the algorithms behave when considering only 2 clusters.
The best-known and simplest clustering algorithms are k-means and PAM. Although they are very similar, two important differences between them should be mentioned. In k-means, our task is to find centroids, i.e. group centres, such that the sum of distances from each point to the centre of its cluster is as small as possible. In the PAM algorithm the idea is similar; however, we replace centroids with medoids, i.e. points taken from the dataset itself. Therefore, in the PAM method the set of possible cluster centres is smaller, but the method is more robust to outliers. There are many ways to measure these distances, such as the Manhattan or Canberra metrics; nevertheless, in my work I will use the basic Euclidean distance.
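To make the centroid/medoid distinction concrete, here is a small illustrative sketch run outside the main loop below: k-means returns synthetic centres averaged from the data, whereas PAM returns medoids that are actual rows, so we can even name the medoid countries.
set.seed(123)
km_demo <- kmeans(countries[, -1], centers = 3, nstart = 25)
pam_demo <- cluster::pam(countries[, -1], k = 3)
km_demo$centers # synthetic centroids: averages, not real observations
countries$country[pam_demo$id.med] # medoids are actual countries from the dataset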
# In the below dataframe I will save the results of each algorithm
countries_division <- data.frame(countries$country)
temp_chart_clust_plot <- list()
temp_chart_clust_sill <- list()
temp <- 1
cluster_types <- c("kmeans", "pam") # change your values here
cluster_numbers <- c(2,3)
for(number in cluster_numbers) {
for (type in cluster_types) {
# Plot clusters
clustering_info<- eclust(countries[,-1], FUNcluster = type, k=number, hc_metric = 'euclidean', graph = FALSE)
clustering_kmeans_chart_plot <- fviz_cluster(clustering_info, geom = c("point")) +
ggtitle(paste(paste(toupper(substr(type, 1, 1)),
substr(type, 2, nchar(type)), sep = ""),"with", number, "clusters", collapse = ""))
# Plot silhouette
clustering_kmeans_chart_sill <- fviz_silhouette(clustering_info) +
ggtitle(paste(paste(toupper(substr(type, 1, 1)),
substr(type, 2, nchar(type)), sep = ""),"with", number, "clusters",
"\n and width equal: ", round(mean(clustering_info$silinfo$avg.width), 2), collapse = ""))
# Save plots into list
temp_chart_clust_plot[[temp]] <- clustering_kmeans_chart_plot
temp_chart_clust_sill[[temp]] <- clustering_kmeans_chart_sill
# Save number of cluster for each country based on each type of algorithm and number of clusters
countries_division$new <- clustering_info$cluster
colnames(countries_division)[temp+1] <- paste(type, number, sep = "_")
temp <- temp + 1
}
}
## cluster size ave.sil.width
## 1 1 99 0.29
## 2 2 68 0.29
## cluster size ave.sil.width
## 1 1 69 0.28
## 2 2 98 0.29
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
grid.arrange(grobs = temp_chart_clust_plot, ncol=2 , top = "Clustering plots")
grid.arrange(grobs = temp_chart_clust_sill, ncol=2 , top = "Silhouette plots")
As you can see from the graphs above, I was able to split the observations in such a way that I obtained well-separated clusters. Let's see what conclusions we can draw after considering other methods.
In hierarchical clustering, our goal is to determine the similarities between individual observations, which yields graphs in the form of trees (dendrograms). At first, we treat each observation as a separate cluster, and then we merge similar observations into larger sets. Of course, also in this case there are many methods of calculating distances between clusters. For the following 4 linkage methods I decided to present the results graphically, to illustrate how significantly they can influence the final outcome. The best of them (and also the most time-consuming) is undoubtedly Ward's method, which merges clusters so as to minimise within-cluster variance. I decided to save the results obtained from this method to the final set, so that I can compare them with the other algorithms.
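Before inspecting the individual dendrograms, one rough way to rank linkage methods is the agglomerative coefficient from cluster::agnes() (the cluster package is already used for PAM); values closer to 1 suggest a stronger clustering structure. A sketch, in which Ward's method can be expected to score highest:
linkages <- c("single", "complete", "average", "ward")
# Agglomerative coefficient for each linkage method
sapply(linkages, function(m) cluster::agnes(countries[, -1], method = m)$ac)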
Single method
hc_single <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
hc_single$labels<-countries$country
plot(hc_single, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_single, k=3, border='red')
Complete method
hc_complete <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
hc_complete$labels<-countries$country
plot(hc_complete, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_complete, k=3, border='red')
Average method
hc_average <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
hc_average$labels<-countries$country
plot(hc_average, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_average, k=3, border='red')
Ward.D2 method (the best of the four)
hc_ward <- eclust(countries[,-1], k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
hc_ward$labels<-countries$country
plot(hc_ward, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc_ward, k=3, border='red')
ward_hclust<-cutree(hc_ward, k=3)
table(ward_hclust)
## ward_hclust
## 1 2 3
## 27 106 34
countries_division$ward_hclust <- ward_hclust
Model-based clustering assumes that the data is generated by an underlying mixture of probability distributions and tries to recover those distributions from the data. The most popular approach is the Gaussian mixture model, which uses a mean vector, a covariance matrix and the probability of each observation belonging to each cluster in its calculations. Our main task is to maximise the BIC statistic, which penalises models with too many parameters, and hence too many clusters.
mc_countries <- Mclust(countries[, -1]) # cluster on the numeric columns only
summary(mc_countries)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVE (ellipsoidal, equal volume and orientation) model with 5 components:
##
## log-likelihood n df BIC ICL
## -1725.771 167 145 -4193.651 -4209.05
##
## Clustering table:
## 1 2 3 4 5
## 47 38 11 30 41
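It is worth noting that mclust defines BIC as 2 * log-likelihood - df * log(n), so, unlike the usual convention, higher values are better. A one-line sanity check that reproduces the BIC from the fitted object's own components:
# Should match the BIC reported in the summary (-4193.651)
2 * mc_countries$loglik - mc_countries$df * log(mc_countries$n)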
# BIC values used for choosing the number of clusters; in mclust, the higher the BIC, the better
(BIC_mc_countries <- fviz_mclust(mc_countries, "BIC", palette = "jco"))
# Classification: plot showing the clustering
fviz_mclust(mc_countries, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
At first we see that the optimal number of clusters is 5, while the classification plot itself does not look very convincing. Let's see what happens if we force the algorithm to use 2 and 3 clusters.
mc_countries_2 <- Mclust(countries[, -1], G = 2)
mc_countries_3 <- Mclust(countries[, -1], G = 3)
summary(mc_countries_3)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:
##
## log-likelihood n df BIC ICL
## -1904.855 167 105 -4347.1 -4357.081
##
## Clustering table:
## 1 2 3
## 53 72 42
# Save the results into my final dataframe
countries_division$mbc_2 <- mc_countries_2$classification
countries_division$mbc_3 <- mc_countries_3$classification
fviz_cluster(mc_countries_2)
fviz_cluster(mc_countries_3)
As we can see, the situation has not improved much; nevertheless, for the sake of comparison and exercise, I decided to include these results in the final analysis.
Because we are concentrating on financial aid, let's focus only on the variables that might be most important for us:
countries_division$exports <- unchanged_countries$exports
countries_division$imports <- unchanged_countries$imports
countries_division$income <- unchanged_countries$income
countries_division$inflation <- unchanged_countries$inflation
# Creating empty dataframe where I will store calculated results
clusters_summarised <- data.frame(Algorithm = character(),
Clusters = integer(),
exports = double(),
imports = double(),
income = double(),
inflation = double())
# Add results into dataframe
groups <- c(quo(kmeans_2), quo(pam_2), quo(kmeans_3), quo(pam_3), quo(ward_hclust), quo(mbc_2), quo(mbc_3))
for (i in seq_along(groups)) {
temp <- countries_division %>%
group_by(!!groups[[i]]) %>% # Unquote with !!
summarise(across(c(exports, imports, income, inflation), median)) %>% # summarise_each()/funs() are deprecated
mutate("Algorithm" = colnames(countries_division[i+1])) %>%
rename(Clusters = !!groups[[i]]) %>%
relocate(Algorithm, .before = Clusters) %>%
print()
clusters_summarised <- rbind(clusters_summarised, temp)
}
## # A tibble: 2 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 kmeans_2 1 40.1 46.2 19400 3.53
## 2 kmeans_2 2 25.0 41.5 2590 8.21
## # A tibble: 2 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 pam_2 1 25 39.2 2660 8.92
## 2 pam_2 2 40.0 47.2 19250 3.38
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 kmeans_3 1 50.4 39.0 40550 1.19
## 2 kmeans_3 2 37.3 48.7 10180 5.80
## 3 kmeans_3 3 23.8 40.3 1870 8.92
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <int> <dbl>
## 1 pam_3 1 22.8 39.2 1990 8.98
## 2 pam_3 2 39.8 50.9 11200 5.14
## 3 pam_3 3 42.3 37.1 40700 1.16
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ward_hclust 1 22.2 39.2 1430 5.45
## 2 ward_hclust 2 37.3 49.2 9890 6.04
## 3 ward_hclust 3 50.0 37.2 41250 1.67
## # A tibble: 2 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mbc_2 1 41.2 37.8 34850 1.94
## 2 mbc_2 2 31.7 45.9 5900 6.35
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mbc_3 1 22.8 43 1990 7.64
## 2 mbc_3 2 38.9 48.9 11700 5.29
## 3 mbc_3 3 49.5 39.0 40100 1.67
head(clusters_summarised)
## # A tibble: 6 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 kmeans_2 1 40.1 46.2 19400 3.53
## 2 kmeans_2 2 25.0 41.5 2590 8.21
## 3 pam_2 1 25 39.2 2660 8.92
## 4 pam_2 2 40.0 47.2 19250 3.38
## 5 kmeans_3 1 50.4 39.0 40550 1.19
## 6 kmeans_3 2 37.3 48.7 10180 5.80
So we see that for each algorithm we have calculated the median of the 4 variables. Now let's check which algorithms produced clusters with the lowest export, import and income values and the highest inflation rate.
# Positions of our desired values
as.matrix(apply(clusters_summarised[, 3:5],2,which.min))
## [,1]
## exports 11
## imports 10
## income 11
as.matrix(apply(clusters_summarised[, 6],2,which.max))
## [,1]
## inflation 8
clusters_summarised[c(8, 10, 11), ]
## # A tibble: 3 x 6
## Algorithm Clusters exports imports income inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pam_3 1 22.8 39.2 1990 8.98
## 2 pam_3 3 42.3 37.1 40700 1.16
## 3 ward_hclust 1 22.2 39.2 1430 5.45
ward_hclust has the lowest exports and income (1st cluster), while pam_3 has the lowest imports (3rd cluster) and the highest inflation (1st cluster).
Let's see which countries belong to these clusters.
countries_division %>%
filter((pam_3 == 1 & ward_hclust == 1) | (pam_3 == 3 & ward_hclust == 1)) %>%
select(countries.country)
## countries.country
## 1 Afghanistan
## 2 Benin
## 3 Burkina Faso
## 4 Burundi
## 5 Cameroon
## 6 Central African Republic
## 7 Chad
## 8 Comoros
## 9 Congo, Dem. Rep.
## 10 Cote d'Ivoire
## 11 Gambia
## 12 Guinea
## 13 Guinea-Bissau
## 14 Haiti
## 15 Kenya
## 16 Madagascar
## 17 Malawi
## 18 Mali
## 19 Mozambique
## 20 Niger
## 21 Rwanda
## 22 Senegal
## 23 Sierra Leone
## 24 Tanzania
## 25 Togo
## 26 Uganda
## 27 Zambia
So we have 27 countries out of 167 that charities may consider targeting first for financial aid. Most of them, unsurprisingly, are in Africa, which seems logical.
The main aim of the study was to answer the question of which countries charities should help financially first. In an initial analysis of the data, I checked whether the values of the indicators corresponded to reality. Through helpful visualisations I came to the conclusion that the countries in need would mostly come from Africa. To identify which countries these might be, I used four algorithms: k-means, PAM, hierarchical clustering and model-based clustering. For the first three I obtained satisfactory results and well-separated clusters; the same cannot be said of the last method. Nevertheless, this did not affect the final results, and in the end two models (PAM with its 1st and 3rd cluster groups, and hierarchical clustering with its 1st cluster group) were taken into account. Of the 27 potentially needy countries identified, most are in Africa, which is in line with my initial assumptions. I therefore believe that I have succeeded in fulfilling the purpose of the paper and that its findings translate meaningfully to global reality.