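Load the necessary libraries

library(ggplot2)     # ggplot(), geom_col(), geom_boxplot()
library(cluster)     # pam(), silhouette()
library(factoextra)  # fviz_nbclust(), eclust(), fviz_cluster(), fviz_silhouette()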

Intro

The idea for this project comes from Kaggle: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data

Problem Statement:

HELP International have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. So, CEO has to make decision to choose the countries that are in the direst need of aid. Hence, your Job as a Data scientist is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.

To answer this question, two unsupervised learning methods for clustering are presented: K-Means and PAM (Partitioning Around Medoids).

Let’s have a look at the data and the description that Kaggle provides:

country : Name of the country
child_mort : Death of children under 5 years of age per 1000 live births
exports : Exports of goods and services per capita. Given as %age of the GDP per capita
health : Total health spending per capita. Given as %age of GDP per capita
imports : Imports of goods and services per capita. Given as %age of the GDP per capita
income : Net income per person
inflation : The measurement of the annual growth rate of the Total GDP
life_expec : The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.

Data Analysis

country_data <- read.csv('C:/Users/1032745/OneDrive - Blue Yonder/Desktop/PRIVATE/STUDIA/UL/PROJECTS/Country-data.csv', header = TRUE)

head(country_data)

##               country child_mort exports health imports income inflation
## 1         Afghanistan       90.2    10.0   7.58    44.9   1610      9.44
## 2             Albania       16.6    28.0   6.55    48.6   9930      4.49
## 3             Algeria       27.3    38.4   4.17    31.4  12900     16.10
## 4              Angola      119.0    62.3   2.85    42.9   5900     22.40
## 5 Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44
## 6           Argentina       14.5    18.9   8.10    16.0  18700     20.90
##   life_expec total_fer  gdpp
## 1       56.2      5.82   553
## 2       76.3      1.65  4090
## 3       76.5      2.89  4460
## 4       60.1      6.16  3530
## 5       76.8      2.13 12200
## 6       75.8      2.37 10300

dim(country_data)

## [1] 167  10

summary(country_data)

##    country            child_mort        exports            health      
##  Length:167         Min.   :  2.60   Min.   :  0.109   Min.   : 1.810  
##  Class :character   1st Qu.:  8.25   1st Qu.: 23.800   1st Qu.: 4.920  
##  Mode  :character   Median : 19.30   Median : 35.000   Median : 6.320  
##                     Mean   : 38.27   Mean   : 41.109   Mean   : 6.816  
##                     3rd Qu.: 62.10   3rd Qu.: 51.350   3rd Qu.: 8.600  
##                     Max.   :208.00   Max.   :200.000   Max.   :17.900  
##     imports             income         inflation         life_expec   
##  Min.   :  0.0659   Min.   :   609   Min.   : -4.210   Min.   :32.10  
##  1st Qu.: 30.2000   1st Qu.:  3355   1st Qu.:  1.810   1st Qu.:65.30  
##  Median : 43.3000   Median :  9960   Median :  5.390   Median :73.10  
##  Mean   : 46.8902   Mean   : 17145   Mean   :  7.782   Mean   :70.56  
##  3rd Qu.: 58.7500   3rd Qu.: 22800   3rd Qu.: 10.750   3rd Qu.:76.80  
##  Max.   :174.0000   Max.   :125000   Max.   :104.000   Max.   :82.80  
##    total_fer          gdpp       
##  Min.   :1.150   Min.   :   231  
##  1st Qu.:1.795   1st Qu.:  1330  
##  Median :2.410   Median :  4660  
##  Mean   :2.948   Mean   : 12964  
##  3rd Qu.:3.880   3rd Qu.: 14050  
##  Max.   :7.490   Max.   :105000

any(is.na(country_data))

## [1] FALSE

Exploratory Data Analysis

Select the columns to use for clustering

cluster_data <- country_data[,c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer","gdpp")]

Let’s have a look at the distribution of each feature

par(mfcol = c(3,3))
titles<-colnames(cluster_data)
titles

## [1] "child_mort" "exports"    "health"     "imports"    "income"    
## [6] "inflation"  "life_expec" "total_fer"  "gdpp"

# One density-scaled histogram per feature
for (i in seq_along(titles)){
  hist(cluster_data[,i],
       main=titles[i],
       xlab=titles[i],
       col="darkmagenta",
       freq=FALSE
  )
}

par(mfcol = c(1,1))

child_mort, exports, income, total_fer, inflation and gdpp appear right-skewed, roughly log-normal. imports and health seem to be approximately normally distributed. life_expec also looks roughly normal, but with a negative coefficient of asymmetry (a longer left tail).
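The visual impression of asymmetry can be backed with a number. The following is a minimal sketch, not part of the original analysis: `skewness` is a hypothetical helper written here for illustration (the third standardized moment), and the two samples are simulated toy data rather than the Kaggle file.

```r
# Sample skewness (third standardized moment); hypothetical helper for illustration
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  s <- sqrt(mean((v - m)^2))   # population standard deviation
  mean((v - m)^3) / s^3
}

set.seed(1)
right_skewed <- rlnorm(500)    # log-normal toy sample, long right tail
left_skewed  <- -rlnorm(500)   # mirrored sample, long left tail

skew_right <- skewness(right_skewed)  # positive for a right-skewed sample
skew_left  <- skewness(left_skewed)   # negative for a left-skewed sample
```

On the real data, `sapply(cluster_data, skewness)` would give one value per feature; a clearly negative value for life_expec would support the reading of the histogram.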

Let us now see which countries are the best candidates for financial support when each variable is considered separately.

x=c(head(country_data[order(country_data$child_mort, decreasing = TRUE),1],5))
y=c(head(country_data[order(country_data$child_mort, decreasing = TRUE),2],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Child Mortality Rate") + ggtitle("Child Mortality by Country")

Apart from Haiti, all of these countries are African. Haiti is located in the Caribbean, and it has the highest child mortality rate of all the countries studied.

# y must come from the exports column (column 2 is child_mort)
x=c(head(country_data[order(country_data$exports, decreasing = FALSE),"country"],5))
y=c(head(country_data[order(country_data$exports, decreasing = FALSE),"exports"],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Export Rate") + ggtitle("Exports by Country")

Exports are one of the crucial elements in building a country's economic strength.

# y must come from the imports column (column 2 is child_mort)
x=c(head(country_data[order(country_data$imports, decreasing = FALSE),"country"],5))
y=c(head(country_data[order(country_data$imports, decreasing = FALSE),"imports"],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Import Rate") + ggtitle("Imports by Country")

We see that countries that export a lot also import a lot, so a high import rate does not mean that a country is in a bad position in the international arena.

# y must come from the health column (column 2 is child_mort)
x=c(head(country_data[order(country_data$health, decreasing = FALSE),"country"],5))
y=c(head(country_data[order(country_data$health, decreasing = FALSE),"health"],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Capital on health Rate") + ggtitle("Capital on health by Country")

Countries that can allocate a lot of capital to health are in a good position.

# y must come from the income column (column 2 is child_mort)
x=c(head(country_data[order(country_data$income, decreasing = FALSE),"country"],5))
y=c(head(country_data[order(country_data$income, decreasing = FALSE),"income"],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Income Rate") + ggtitle("Income by Country")

Countries with lower income would be preferable.

# y must come from the inflation column (column 2 is child_mort)
x=c(head(country_data[order(country_data$inflation, decreasing = TRUE),"country"],5))
y=c(head(country_data[order(country_data$inflation, decreasing = TRUE),"inflation"],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Inflation Rate") + ggtitle("Inflation by Country")

Countries with higher inflation would be preferable.

# y must come from the life_expec column (column 2 is child_mort)
x=c(head(country_data[order(country_data$life_expec, decreasing = FALSE),"country"],5))
y=c(head(country_data[order(country_data$life_expec, decreasing = FALSE),"life_expec"],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Years") + ggtitle("Expected life duration by Country")

Countries with lower expected life duration would be preferable.

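# Highest fertility; y must come from the total_fer column (column 2 is child_mort)
x=c(head(country_data[order(country_data$total_fer, decreasing = TRUE),"country"],5))
y=c(head(country_data[order(country_data$total_fer, decreasing = TRUE),"total_fer"],5))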
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Total fertility rate") + ggtitle("Total Fertility by Country")

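# Lowest fertility; y must come from the total_fer column (column 2 is child_mort)
x=c(head(country_data[order(country_data$total_fer, decreasing = FALSE),"country"],5))
y=c(head(country_data[order(country_data$total_fer, decreasing = FALSE),"total_fer"],5))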
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Total fertility rate") + ggtitle("Total Fertility by Country")

With the fertility rate the situation is less clear. According to World Bank data, fertility rates are high in less developed countries. On the other hand, developed countries would like a higher fertility rate to support their economies.

# y must come from the gdpp column (column 2 is child_mort)
x=c(head(country_data[order(country_data$gdpp, decreasing = FALSE),"country"],5))
y=c(head(country_data[order(country_data$gdpp, decreasing = FALSE),"gdpp"],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("GDPP Rate") + ggtitle("GDPP by Country")

Countries with lower GDP per capita would be preferable.
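The per-variable rankings above all repeat the same pattern: sort by one column, keep the first five rows, and plot the country against that same column. A minimal sketch of how this could be wrapped in one function follows. `top_countries` is a hypothetical helper, and `toy` is made-up data standing in for country_data.

```r
# Hypothetical helper: the n countries with the most extreme values of `variable`
top_countries <- function(data, variable, n = 5, decreasing = TRUE) {
  ranked <- data[order(data[[variable]], decreasing = decreasing), ]
  head(ranked[, c("country", variable)], n)
}

# Toy data, not the Kaggle file
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  child_mort = c(90, 17, 27, 119),
  gdpp       = c(553, 4090, 4460, 3530),
  stringsAsFactors = FALSE
)

worst_mortality <- top_countries(toy, "child_mort", n = 2, decreasing = TRUE)
lowest_gdpp     <- top_countries(toy, "gdpp", n = 2, decreasing = FALSE)
```

Selecting the value column by name rather than by position means the plotted values always come from the variable that was used for sorting.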

Summary of data Analysis

Overall, African countries seem to perform worst on these features. To analyze the data more deeply, unsupervised clustering methods are used. Let us start by choosing an adequate number of clusters. K-Means and PAM are used, as they were presented in our classes. CLARA and CLARANS are not needed here: they are PAM-style methods intended for large datasets, and this dataset is small (167 countries).

Before further analysis I standardize the data, because the variables are measured on different scales.

cluster_data.s<-scale(cluster_data) # standardised data (obs-mean)/sd
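As a small illustration of what scale() does with its default arguments, the sketch below standardizes a toy data frame and repeats the computation by hand for one column. The values in `toy` are made up for the example.

```r
# Toy data with two variables on very different scales
toy <- data.frame(income = c(1000, 5000, 9000), life_expec = c(55, 70, 82))

scaled <- scale(toy)   # centers each column and divides by its sample sd

# The same computation written out by hand for one column
manual_income <- (toy$income - mean(toy$income)) / sd(toy$income)

col_means <- colMeans(scaled)       # approximately 0 for every column
col_sds   <- apply(scaled, 2, sd)   # 1 for every column
```

After standardization every variable contributes on a comparable scale to the Euclidean distances used for clustering, so income and gdpp no longer dominate simply because their raw values are large.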

fviz_nbclust(cluster_data.s, FUNcluster = kmeans, method = "silhouette")

fviz_nbclust(cluster_data.s, FUNcluster = kmeans, method = "gap_stat")

fviz_nbclust(cluster_data.s, FUNcluster = kmeans, method = "wss")

Different methods give different hints about how many clusters to choose. I pick 3 clusters, as that gives the most interpretable result: one cluster for countries that do not require help, one for countries that may require help, and one for countries that must receive help.
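The "wss" (elbow) criterion can also be computed by hand, which makes it easier to see what the plot shows. This is a minimal sketch under stated assumptions: the built-in iris measurements stand in for cluster_data.s, and the range 1 to 6 and `nstart = 25` are arbitrary choices for illustration.

```r
set.seed(42)                 # k-means starts from random centers
toy <- scale(iris[, 1:4])    # built-in data set, standing in for cluster_data.s

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Relative drop in WSS when going from k to k + 1; the elbow is where this flattens
rel_drop <- -diff(wss) / wss[-length(wss)]
```

A large relative drop followed by small ones suggests that additional clusters add little, which is the same reading one takes from the elbow plot.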

fviz_nbclust(cluster_data.s, FUNcluster=cluster::pam, method="silhouette")+ theme_classic()

fviz_nbclust(cluster_data.s, FUNcluster=cluster::pam, method="gap_stat")+ theme_classic()

fviz_nbclust(cluster_data.s, FUNcluster=cluster::pam, method="wss")+ theme_classic()

The PAM-based plots lead to the same situation, so here as well I pick 3 clusters.

Let us start clustering with K-Means

km1<-eclust(cluster_data.s, "kmeans", hc_metric="euclidean",k=3)

fviz_cluster(km1, main="kmeans / Euclidean")

km1$centers

##   child_mort     exports     health     imports     income   inflation
## 1 -0.8249676  0.64314557  0.7252301  0.19006732  1.4797922 -0.48346661
## 2 -0.4052346 -0.03155768 -0.2237978  0.02408916 -0.2510155 -0.01711594
## 3  1.3561391 -0.43622118 -0.1555163 -0.18863644 -0.6848344  0.40090504
##   life_expec  total_fer       gdpp
## 1  1.0763414 -0.7895024  1.6111498
## 2  0.2539698 -0.4230704 -0.3534185
## 3 -1.2783352  1.3608511 -0.6024306

sil<-silhouette(km1$cluster, dist(cluster_data.s))
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1   36          0.15
## 2       2   84          0.36
## 3       3   47          0.24
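For reading the table above: the silhouette width of an observation compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b) as (b - a) / max(a, b), so it lies between -1 and 1, and values near 0 indicate overlapping clusters. The sketch below shows how the per-cluster averages can be computed directly. It uses the built-in iris data as a stand-in, so its numbers are unrelated to the country results.

```r
library(cluster)   # silhouette()

set.seed(7)
toy <- scale(iris[, 1:4])    # stand-in for cluster_data.s
km  <- kmeans(toy, centers = 3, nstart = 25)

sil <- silhouette(km$cluster, dist(toy))   # one row per observation

# Average silhouette width per cluster and overall
avg_by_cluster <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)
overall_avg    <- mean(sil[, "sil_width"])
```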

Now that the data is clustered, we need to determine which cluster contains the countries that require help. Let us go back to what we established in the EDA part and take two examples: we know that high child mortality is bad and that low GDP per capita (gdpp) is bad.

How do those two variables look within each cluster?

country_data["cluster"] <- km1$cluster

# cluster is numeric, so convert it to a factor to get one box per cluster
ggplot(country_data, aes(x=factor(cluster), y=child_mort)) +
  geom_boxplot() +
  xlab("cluster")

ggplot(country_data, aes(x=factor(cluster), y=gdpp)) +
  geom_boxplot() +
  xlab("cluster")

After checking those two variables, it is clear that:

Cluster 1: Countries that do not require any help.
Cluster 2: Countries in between, for which it should be carefully considered whether they require help.
Cluster 3: Countries that must receive financial help.
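Cluster numbers are arbitrary labels, so the cluster that needs help can be identified from the data instead of being read off a plot. The sketch below is one possible way to do that, assuming that the neediest cluster is the one with the highest mean child mortality. `toy` is made-up data standing in for country_data with its cluster column.

```r
# Toy stand-in for country_data after clustering
toy <- data.frame(
  country    = c("A", "B", "C", "D", "E", "F"),
  child_mort = c(5, 7, 30, 35, 110, 95),
  gdpp       = c(40000, 35000, 6000, 5000, 600, 900),
  cluster    = c(2, 2, 3, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Mean of each variable per cluster
profile <- aggregate(cbind(child_mort, gdpp) ~ cluster, data = toy, FUN = mean)

# Assumption: the cluster that needs help has the highest mean child mortality
needy_cluster   <- profile$cluster[which.max(profile$child_mort)]
needy_countries <- toy$country[toy$cluster == needy_cluster]
```

Selecting the cluster this way keeps working if a rerun of the algorithm numbers the clusters differently.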

df_kmeans_needy <- country_data[country_data$cluster == 3, ]  # cluster 3 = must receive help
list_of_countries_that_require_help_k_means <- df_kmeans_needy$country
list_of_countries_that_require_help_k_means

##  [1] "Afghanistan"              "Angola"                  
##  [3] "Benin"                    "Botswana"                
##  [5] "Burkina Faso"             "Burundi"                 
##  [7] "Cameroon"                 "Central African Republic"
##  [9] "Chad"                     "Comoros"                 
## [11] "Congo, Dem. Rep."         "Congo, Rep."             
## [13] "Cote d'Ivoire"            "Equatorial Guinea"       
## [15] "Eritrea"                  "Gabon"                   
## [17] "Gambia"                   "Ghana"                   
## [19] "Guinea"                   "Guinea-Bissau"           
## [21] "Haiti"                    "Iraq"                    
## [23] "Kenya"                    "Kiribati"                
## [25] "Lao"                      "Lesotho"                 
## [27] "Liberia"                  "Madagascar"              
## [29] "Malawi"                   "Mali"                    
## [31] "Mauritania"               "Mozambique"              
## [33] "Namibia"                  "Niger"                   
## [35] "Nigeria"                  "Pakistan"                
## [37] "Rwanda"                   "Senegal"                 
## [39] "Sierra Leone"             "South Africa"            
## [41] "Sudan"                    "Tanzania"                
## [43] "Timor-Leste"              "Togo"                    
## [45] "Uganda"                   "Yemen"                   
## [47] "Zambia"

Let us now move to the PAM algorithm

pam1<-eclust(cluster_data.s, "pam", k=3)
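Unlike K-Means, whose centers are averages, PAM represents each cluster by a medoid, which is an actual observation; this tends to make it less sensitive to extreme values. The sketch below illustrates the difference on a made-up two-dimensional data set, calling pam() from the cluster package directly rather than through eclust().

```r
library(cluster)   # pam()

# Two tight toy groups, near (1, 1) and near (5, 5)
toy <- rbind(
  c(1.0, 1.0), c(1.2, 0.9), c(0.8, 1.1),
  c(5.0, 5.0), c(5.1, 4.8), c(4.9, 5.2)
)

set.seed(3)
pam_fit <- pam(toy, k = 2)
km_fit  <- kmeans(toy, centers = 2, nstart = 10)

medoids    <- pam_fit$medoids   # rows taken from `toy`
medoid_ids <- pam_fit$id.med    # row indices of the medoids in `toy`
centers    <- km_fit$centers    # averages, usually not rows of `toy`
```

With the country data, the medoid rows would correspond to real countries that can serve as typical representatives of each cluster.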

fviz_silhouette(pam1)

##   cluster size ave.sil.width
## 1       1   51          0.25
## 2       2   85          0.31
## 3       3   31          0.26

fviz_cluster(pam1)

sil.p<-silhouette(pam1$cluster, dist(cluster_data.s))
fviz_silhouette(sil.p)

##   cluster size ave.sil.width
## 1       1   51          0.25
## 2       2   85          0.31
## 3       3   31          0.26

The same check of cluster assignment that was done for k-means is required here as well.

country_data["cluster_pam"] <- pam1$cluster

# cluster_pam is numeric, so convert it to a factor to get one box per cluster
ggplot(country_data, aes(x=factor(cluster_pam), y=child_mort)) +
  geom_boxplot() +
  xlab("cluster_pam")

ggplot(country_data, aes(x=factor(cluster_pam), y=gdpp)) +
  geom_boxplot() +
  xlab("cluster_pam")

Here the cluster numbering is different:

Cluster 1: Countries that must receive financial help.
Cluster 2: Countries in between, for which it should be carefully considered whether they require help.
Cluster 3: Countries that do not require any help.

df_pam_needy <- country_data[country_data$cluster_pam == 1, ]  # PAM cluster 1 = must receive help
list_of_countries_that_require_help_pam <- df_pam_needy$country
list_of_countries_that_require_help_pam

##  [1] "Afghanistan"              "Angola"                  
##  [3] "Benin"                    "Botswana"                
##  [5] "Burkina Faso"             "Burundi"                 
##  [7] "Cameroon"                 "Central African Republic"
##  [9] "Chad"                     "Comoros"                 
## [11] "Congo, Dem. Rep."         "Congo, Rep."             
## [13] "Cote d'Ivoire"            "Equatorial Guinea"       
## [15] "Eritrea"                  "Gabon"                   
## [17] "Gambia"                   "Ghana"                   
## [19] "Guinea"                   "Guinea-Bissau"           
## [21] "Haiti"                    "India"                   
## [23] "Iraq"                     "Kenya"                   
## [25] "Kiribati"                 "Lao"                     
## [27] "Lesotho"                  "Liberia"                 
## [29] "Madagascar"               "Malawi"                  
## [31] "Mali"                     "Mauritania"              
## [33] "Mozambique"               "Myanmar"                 
## [35] "Namibia"                  "Nepal"                   
## [37] "Niger"                    "Nigeria"                 
## [39] "Pakistan"                 "Rwanda"                  
## [41] "Senegal"                  "Sierra Leone"            
## [43] "South Africa"             "Sudan"                   
## [45] "Tajikistan"               "Tanzania"                
## [47] "Timor-Leste"              "Togo"                    
## [49] "Uganda"                   "Yemen"                   
## [51] "Zambia"

x <- intersect(list_of_countries_that_require_help_pam,list_of_countries_that_require_help_k_means)
x

##  [1] "Afghanistan"              "Angola"                  
##  [3] "Benin"                    "Botswana"                
##  [5] "Burkina Faso"             "Burundi"                 
##  [7] "Cameroon"                 "Central African Republic"
##  [9] "Chad"                     "Comoros"                 
## [11] "Congo, Dem. Rep."         "Congo, Rep."             
## [13] "Cote d'Ivoire"            "Equatorial Guinea"       
## [15] "Eritrea"                  "Gabon"                   
## [17] "Gambia"                   "Ghana"                   
## [19] "Guinea"                   "Guinea-Bissau"           
## [21] "Haiti"                    "Iraq"                    
## [23] "Kenya"                    "Kiribati"                
## [25] "Lao"                      "Lesotho"                 
## [27] "Liberia"                  "Madagascar"              
## [29] "Malawi"                   "Mali"                    
## [31] "Mauritania"               "Mozambique"              
## [33] "Namibia"                  "Niger"                   
## [35] "Nigeria"                  "Pakistan"                
## [37] "Rwanda"                   "Senegal"                 
## [39] "Sierra Leone"             "South Africa"            
## [41] "Sudan"                    "Tanzania"                
## [43] "Timor-Leste"              "Togo"                    
## [45] "Uganda"                   "Yemen"                   
## [47] "Zambia"
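Comparing the printed lists, all 47 K-Means countries also appear in the PAM list, and PAM adds four more: India, Myanmar, Nepal and Tajikistan. The differences can be extracted with setdiff(), as in the sketch below. The two vectors are short illustrative subsets of the lists above, not the full results.

```r
# Illustrative subsets of the two lists, not the full vectors
kmeans_list <- c("Afghanistan", "Haiti", "Iraq", "Zambia")
pam_list    <- c("Afghanistan", "Haiti", "India", "Iraq", "Nepal", "Zambia")

in_both        <- intersect(pam_list, kmeans_list)   # selected by both methods
only_in_pam    <- setdiff(pam_list, kmeans_list)     # selected by PAM only
only_in_kmeans <- setdiff(kmeans_list, pam_list)     # selected by K-Means only
```

Countries selected by only one method are borderline cases that may deserve a closer look before they are excluded.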

CONCLUSIONS

After analyzing the data with the K-Means and PAM algorithms, I have reached a conclusion on the most effective and strategic use of the available funds. My recommendation is to distribute the funds evenly among the 47 countries that both K-Means and PAM placed in the cluster that must receive help; these are stored in the variable “x”.

In order to arrive at this conclusion, several key steps were taken. Firstly, the data was checked for missing values and standardized. Next, the K-Means and PAM algorithms were applied to cluster the countries into groups with similar characteristics. Both algorithms group observations by their distance in the standardized feature space, so countries with similar socio-economic and health profiles end up in the same cluster.
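As a quick check of what an even split means in practice, the arithmetic can be written out. The total is taken as exactly $10 million here, although the problem statement only says "around $10 million", so the result is approximate.

```r
total_funds <- 10e6   # assumption: exactly $10 million
n_countries <- 47     # countries selected by both K-Means and PAM

per_country <- total_funds / n_countries
round(per_country)    # roughly 213 thousand dollars per country
```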

The distribution of funds to the 47 countries identified through the analysis can help to address key development challenges and support sustainable economic growth in those regions. This can be achieved by investing in key areas such as education, healthcare, infrastructure, and employment opportunities, which can ultimately lead to improved quality of life for citizens and increased economic prosperity.

Moreover, distributing the funds evenly to the 47 countries can ensure that the impact of the funds is maximized, as all countries receive a fair share of the resources available. This approach can also help to avoid any potential biases or inequalities in the distribution of funds, ensuring that all countries are given an equal opportunity to benefit from the investment.

In conclusion, distributing the funds evenly to the 47 countries identified through the K-Means and PAM analysis is the most effective and strategic approach to using the available funds. This approach can help to address key development challenges, support sustainable economic growth, and ensure equitable distribution of resources.

Clustering

Piotr Radziszewski