Clustering the Countries with PCA and K-means for HELP International Organization

Introduction

An international humanitarian Charity called HELP International is dedicated to eradicating poverty and giving the people of underdeveloped nations access to basic comforts and assistance during natural disasters. Over $10 million has been raised through HELP International. Now, this money needs to be used wisely and efficiently. So, data-driven decisions must be made to determine which nations are in the greatest need of aid. As a result, it becomes vital to group the nations according to the socioeconomic and health aspects that affect each nation’s overall development. Hence, funds would be provided for help after catastrophes and natural calamities based on these clusters of countries and their situations. It is a blatant example of unsupervised learning in which we must group the countries according to their various features.

RMarkdown view

Data Explanation

Dataset feature details:

country : Name of the country
child_mort : Death of children under 5 years of age per 1000 live births
exports : Exports of goods and services per capita. Given as %age of the GDP per capita
health : Total health spending per capita. Given as %age of GDP per capita
imports : Imports of goods and services per capita. Given as %age of the GDP per capita
Income : Net income per person
Inflation : The measurement of the annual growth rate of the Total GDP
life_expec : The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.

The data for this research is taken from https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data/code?datasetId=721951.

Exploratory Data Analysis

Necessary Libraries:

library(tidyverse)
library(dplyr)
library(magrittr)
library(xfun)
library(corrplot)
library(scales)
library(FactoMineR)
library(factoextra)
library(cowplot)
library(stats)
library(GGally)
library(cluster)

country_data <- read.csv("Country-data.csv")
head(country_data)

##               country child_mort exports health imports income inflation
## 1         Afghanistan       90.2    10.0   7.58    44.9   1610      9.44
## 2             Albania       16.6    28.0   6.55    48.6   9930      4.49
## 3             Algeria       27.3    38.4   4.17    31.4  12900     16.10
## 4              Angola      119.0    62.3   2.85    42.9   5900     22.40
## 5 Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44
## 6           Argentina       14.5    18.9   8.10    16.0  18700     20.90
##   life_expec total_fer  gdpp
## 1       56.2      5.82   553
## 2       76.3      1.65  4090
## 3       76.5      2.89  4460
## 4       60.1      6.16  3530
## 5       76.8      2.13 12200
## 6       75.8      2.37 10300

#checking duplicate values
sum(duplicated(country_data$country)) == 0

## [1] TRUE

There are no dublicate values.

#checking missing values
colSums(is.na(country_data))

##    country child_mort    exports     health    imports     income  inflation 
##          0          0          0          0          0          0          0 
## life_expec  total_fer       gdpp 
##          0          0          0

There are no missing values

dim(country_data)

## [1] 167  10

str(country_data)

## 'data.frame':    167 obs. of  10 variables:
##  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ child_mort: num  90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
##  $ exports   : num  10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
##  $ health    : num  7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
##  $ imports   : num  44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
##  $ income    : int  1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
##  $ inflation : num  9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
##  $ life_expec: num  56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
##  $ total_fer : num  5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
##  $ gdpp      : int  553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...

summary(country_data)

##    country            child_mort        exports            health      
##  Length:167         Min.   :  2.60   Min.   :  0.109   Min.   : 1.810  
##  Class :character   1st Qu.:  8.25   1st Qu.: 23.800   1st Qu.: 4.920  
##  Mode  :character   Median : 19.30   Median : 35.000   Median : 6.320  
##                     Mean   : 38.27   Mean   : 41.109   Mean   : 6.816  
##                     3rd Qu.: 62.10   3rd Qu.: 51.350   3rd Qu.: 8.600  
##                     Max.   :208.00   Max.   :200.000   Max.   :17.900  
##     imports             income         inflation         life_expec   
##  Min.   :  0.0659   Min.   :   609   Min.   : -4.210   Min.   :32.10  
##  1st Qu.: 30.2000   1st Qu.:  3355   1st Qu.:  1.810   1st Qu.:65.30  
##  Median : 43.3000   Median :  9960   Median :  5.390   Median :73.10  
##  Mean   : 46.8902   Mean   : 17145   Mean   :  7.782   Mean   :70.56  
##  3rd Qu.: 58.7500   3rd Qu.: 22800   3rd Qu.: 10.750   3rd Qu.:76.80  
##  Max.   :174.0000   Max.   :125000   Max.   :104.000   Max.   :82.80  
##    total_fer          gdpp       
##  Min.   :1.150   Min.   :   231  
##  1st Qu.:1.795   1st Qu.:  1330  
##  Median :2.410   Median :  4660  
##  Mean   :2.948   Mean   : 12964  
##  3rd Qu.:3.880   3rd Qu.: 14050  
##  Max.   :7.490   Max.   :105000

Examining the Big Relationships:

country_data %>%
  gather(Features, value, 2:10) %>%
  ggplot(aes(x=value)) +
  geom_histogram(fill="blue", colour="black", bins = 30) +
  facet_wrap(~Features, scales="free_x") +
  labs(x="Values", y="Frequency")

int_cols <- country_data %>% 
  select_if(is.numeric) %>% 
  names()

country_long <- country_data %>% 
  pivot_longer(
    cols = all_of(int_cols),
    names_to = "Features",
    values_to = "value"
  )

colors <- c("red3", "orange3", "yellow3", "green3", "blue3", "violet", "pink3","grey","lightblue")

ggplot(country_long, aes(x = value)) +
  geom_boxplot(fill = colors, colour = "black") +
  facet_wrap(~Features, ncol = 3, scales = "free_x") +
  labs(x = "Values", y = "") +
  theme_minimal()

Findings

The boxplot and histogram both distinctly demonstrate whether the numerical features are continuous or discrete values. Features without categorical values do not exist.
Most of the data exhibits a right or positively skewed data distribution except for life expectancy, which shows a left or - negatively skewed data distribution.
Two GDPP and total fertility peaks imply the possibility of at least two clusters in the data.
There are obvious outliers in the child mort, exports, imports, income, and gdpp aspects of the box plot. These facts, however, are the property of each nation. Since there are outliers, we cannot assume that the values are regularly distributed. Outliers will not be eliminated for the analysis’s purposes because they may be very instructive in highlighting nations that want aid and are in dire straits.
The outliers in this feature demonstrate that there are countries with higher than average/critical levels of child mortality, as child death is a strong indicator of poverty and necessity.

Correlation between features:

pairs(country_data[sapply(country_data, is.numeric)],col="blue")

my_colors <- colorRampPalette(c("white", "blue3"))(50)

corrplot(cor(country_data[2:10]), type = "lower", method = "ellipse", col = my_colors, tl.col = "black")

Positive Correlations - The positive correlation between exports and imports suggests that countries with higher levels of trade tend to have more open economies.

The high relationship between GDP per capita and income suggests that personal income and economic growth are closely related and that actions to boost economic growth may also positively impact individual incomes.
The relationship between child mortality and total fertility rate has a significance that points to the importance of supporting family planning and healthcare programs that enhance child health outcomes and reduce population expansion.
Given that those with greater earnings typically have better access to healthcare and lead healthier lifestyles, the positive relationship between income and life expectancy supports the idea that economic inequality might impact health outcomes.

Negative Correlations - The negative correlation between child mortality and life expectancy highlights the importance of improving child health outcomes in order to increase life expectancy and promote better overall public health.

According to the inverse relationship between life expectancy and total fertility rate, people tend to have fewer children as they live longer, which could impact population growth and aging demographics.

Data Scaling

Scaling the data is important in this case because the features in the dataset have different units and ranges, making it difficult to compare and analyze them accurately. For example, some features are expressed as percentages, some as dollar values, and some as whole numbers, making it hard to compare their magnitudes directly. Additionally, the range of values for each feature varies widely, with some features ranging from 0 to 200 and others from 0 to 100,000. This means that a change of 50 in one feature might have a significant impact, while in another, it might not be noticeable. If we do not scale the data, this level of variance can negatively impact the model’s performance, especially if it is based on measuring distances between data points. Some features may have higher magnitudes than others, giving them more weight in the model, even if they are not necessarily more important in determining the outcome. By scaling the data, we can remove potential bias that the model may have towards features with higher magnitudes and ensure that each feature is given equal consideration in the analysis. This can lead to more accurate results and a better understanding of the relationships between the different features in the dataset.

#Eliminating the country-specific column because we should only use numerical data.
dataset_country <- country_data[, -1] 
head(dataset_country)

##   child_mort exports health imports income inflation life_expec total_fer  gdpp
## 1       90.2    10.0   7.58    44.9   1610      9.44       56.2      5.82   553
## 2       16.6    28.0   6.55    48.6   9930      4.49       76.3      1.65  4090
## 3       27.3    38.4   4.17    31.4  12900     16.10       76.5      2.89  4460
## 4      119.0    62.3   2.85    42.9   5900     22.40       60.1      6.16  3530
## 5       10.3    45.5   6.03    58.9  19100      1.44       76.8      2.13 12200
## 6       14.5    18.9   8.10    16.0  18700     20.90       75.8      2.37 10300

Data Normalization

# Applying min-max scaling to all columns
dataset_normalised <- data.frame(lapply(dataset_country, function(x) rescale(x, to = c(0,1))))

# Assign original column names to the rescaled data frame
colnames(dataset_normalised) <- colnames(dataset_country)

head(dataset_normalised)

##   child_mort    exports     health    imports      income  inflation life_expec
## 1 0.42648491 0.04948197 0.35860783 0.25776487 0.008047206 0.12614361  0.4753452
## 2 0.06815969 0.13953104 0.29459291 0.27903729 0.074933074 0.08039922  0.8717949
## 3 0.12025316 0.19155940 0.14667495 0.18014926 0.098809399 0.18769060  0.8757396
## 4 0.56669912 0.31112456 0.06463642 0.24626626 0.042535232 0.24591073  0.5522682
## 5 0.03748783 0.22707876 0.26227470 0.33825512 0.148652234 0.05221329  0.8816568
## 6 0.05793574 0.09400623 0.39092604 0.09160998 0.145436567 0.23204879  0.8619329
##    total_fer        gdpp
## 1 0.73659306 0.003073428
## 2 0.07886435 0.036833414
## 3 0.27444795 0.040364993
## 4 0.79022082 0.031488322
## 5 0.15457413 0.114241808
## 6 0.19242902 0.096106673

Data Standardization

# Applying standardization to all columns
dataset_standardised  <- dataset_country %>% 
  scale()

# Assigning original column names to the rescaled data frame
colnames(dataset_standardised ) <- colnames(dataset_country)

head(dataset_standardised)

##      child_mort     exports      health     imports      income  inflation
## [1,]  1.2876597 -1.13486665  0.27825140 -0.08220771 -0.80582187  0.1568645
## [2,] -0.5373329 -0.47822017 -0.09672528  0.07062429 -0.37424335 -0.3114109
## [3,] -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227  0.7869076
## [4,]  2.0017872  0.77305618 -1.44372888 -0.16481961 -0.58328920  1.3828944
## [5,] -0.6935483  0.16018613 -0.28603389  0.49607554  0.10142673 -0.5999442
## [6,] -0.5894047 -0.81019144  0.46756001 -1.27594958  0.08067776  1.2409928
##      life_expec   total_fer        gdpp
## [1,] -1.6142372  1.89717646 -0.67714308
## [2,]  0.6459238 -0.85739418 -0.48416709
## [3,]  0.6684130 -0.03828924 -0.46398018
## [4,] -1.1756985  2.12176975 -0.51472026
## [5,]  0.7021467 -0.54032130 -0.04169175
## [6,]  0.5897009 -0.38178486 -0.14535428

#We need columns argument for building new dataframe
columns <- colnames(dataset_country) 

#We should make a new dataframe with the rescaled values and the column labels
df_normal <- as.data.frame(dataset_normalised)
colnames(df_normal) <- columns 

df_standard <- as.data.frame(dataset_standardised)
colnames(df_standard) <- columns 

head(df_normal)

##   child_mort    exports     health    imports      income  inflation life_expec
## 1 0.42648491 0.04948197 0.35860783 0.25776487 0.008047206 0.12614361  0.4753452
## 2 0.06815969 0.13953104 0.29459291 0.27903729 0.074933074 0.08039922  0.8717949
## 3 0.12025316 0.19155940 0.14667495 0.18014926 0.098809399 0.18769060  0.8757396
## 4 0.56669912 0.31112456 0.06463642 0.24626626 0.042535232 0.24591073  0.5522682
## 5 0.03748783 0.22707876 0.26227470 0.33825512 0.148652234 0.05221329  0.8816568
## 6 0.05793574 0.09400623 0.39092604 0.09160998 0.145436567 0.23204879  0.8619329
##    total_fer        gdpp
## 1 0.73659306 0.003073428
## 2 0.07886435 0.036833414
## 3 0.27444795 0.040364993
## 4 0.79022082 0.031488322
## 5 0.15457413 0.114241808
## 6 0.19242902 0.096106673

head(df_standard)

##   child_mort     exports      health     imports      income  inflation
## 1  1.2876597 -1.13486665  0.27825140 -0.08220771 -0.80582187  0.1568645
## 2 -0.5373329 -0.47822017 -0.09672528  0.07062429 -0.37424335 -0.3114109
## 3 -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227  0.7869076
## 4  2.0017872  0.77305618 -1.44372888 -0.16481961 -0.58328920  1.3828944
## 5 -0.6935483  0.16018613 -0.28603389  0.49607554  0.10142673 -0.5999442
## 6 -0.5894047 -0.81019144  0.46756001 -1.27594958  0.08067776  1.2409928
##   life_expec   total_fer        gdpp
## 1 -1.6142372  1.89717646 -0.67714308
## 2  0.6459238 -0.85739418 -0.48416709
## 3  0.6684130 -0.03828924 -0.46398018
## 4 -1.1756985  2.12176975 -0.51472026
## 5  0.7021467 -0.54032130 -0.04169175
## 6  0.5897009 -0.38178486 -0.14535428

Comparing Scaling Methods

plot(df_normal$gdpp, df_normal$child_mort, col = 'blue3', 
     xlab = 'GDP per Person', ylab = 'Child Mortality')

plot(df_standard$gdpp, df_standard$child_mort, col = 'blue3', 
     xlab = 'GDP per Person', ylab = 'Child Mortality')

Principal Component Analysis

PCA with normalized data

normal_pca <- prcomp(df_normal)
normal_pca

## Standard deviations (1, .., p=9):
## [1] 0.37657157 0.18576634 0.17808711 0.15854130 0.09869388 0.08814718 0.05540053
## [8] 0.04784283 0.04230136
## 
## Rotation (n x k) = (9 x 9):
##                    PC1          PC2          PC3         PC4         PC5
## child_mort -0.47646333 -0.281733093  0.100011701 -0.02908132  0.42095743
## exports     0.17033279 -0.466643928 -0.390523997  0.18848786 -0.10579788
## health      0.13388598 -0.044509771  0.781911221  0.51298346 -0.08330309
## imports     0.08161987 -0.394257286 -0.302774809  0.60665580 -0.10081385
## income      0.30698142 -0.404404812  0.068366711 -0.37282250  0.15996971
## inflation  -0.08800904  0.001601976 -0.057029275 -0.21262116 -0.21955510
## life_expec  0.42053483  0.137519745  0.007857375 -0.14802117 -0.55643730
## total_fer  -0.57108745 -0.369245977  0.174836469 -0.17306242 -0.62684939
## gdpp        0.33824337 -0.476077611  0.309636965 -0.31044817  0.13469013
##                    PC6         PC7         PC8          PC9
## child_mort -0.08309594  0.68892064 -0.05031919 -0.146653807
## exports    -0.15131228  0.06411594 -0.64019218  0.342268962
## health     -0.20814680 -0.02034690 -0.22728423 -0.056741715
## imports     0.01221525  0.01074206  0.54440667 -0.267186976
## income     -0.02247672 -0.24033746 -0.19006577 -0.691784599
## inflation  -0.92453091 -0.02335202  0.20070661  0.001198683
## life_expec  0.10955292  0.65603590  0.00204731 -0.173940114
## total_fer   0.23223954 -0.17206635 -0.01699321 -0.015072472
## gdpp        0.07433502  0.04950529  0.40358904  0.526959586

summary(normal_pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     0.3766 0.1858 0.1781 0.15854 0.09869 0.08815 0.0554
## Proportion of Variance 0.5500 0.1338 0.1230 0.09749 0.03778 0.03014 0.0119
## Cumulative Proportion  0.5500 0.6839 0.8069 0.90436 0.94214 0.97228 0.9842
##                            PC8     PC9
## Standard deviation     0.04784 0.04230
## Proportion of Variance 0.00888 0.00694
## Cumulative Proportion  0.99306 1.00000

country_normal_pca <- data.frame(normal_pca$x[,1:4])
head(country_normal_pca)

##            PC1         PC2        PC3         PC4
## 1 -0.599077641 -0.09549001  0.1575538  0.02433250
## 2  0.158474297  0.21209240 -0.0641890  0.06124679
## 3  0.003685518  0.13586725 -0.1341820 -0.13357441
## 4 -0.650234991 -0.27597526 -0.1426725 -0.15601831
## 5  0.200711243  0.06466223 -0.1007155  0.03790176
## 6  0.112534001  0.20604633  0.1189418 -0.10644779

fviz_screeplot(normal_pca, addlabels = TRUE)+
  labs(title = "Scree plot - Min-Max scaled data", x = "Principal component", y = "Percentage of variance explained")

fviz_pca_biplot(normal_pca, col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

PCA with standardised data

standard_pca <- prcomp(df_standard)
standard_pca

## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
## 
## Rotation (n x k) = (9 x 9):
##                   PC1          PC2         PC3          PC4         PC5
## child_mort -0.4195194 -0.192883937  0.02954353 -0.370653262  0.16896968
## exports     0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health      0.1508378  0.243086779  0.59663237 -0.461897497 -0.51800037
## imports     0.1614824 -0.671820644  0.29992674  0.071907461 -0.25537642
## income      0.3984411 -0.022535530 -0.30154750 -0.392159039  0.24714960
## inflation  -0.1931729  0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec  0.4258394  0.222706743 -0.11391854  0.203797235 -0.10821980
## total_fer  -0.4037290 -0.155233106 -0.01954925 -0.378303645  0.13526221
## gdpp        0.3926448  0.046022396 -0.12297749 -0.531994575  0.18016662
##                     PC6         PC7         PC8         PC9
## child_mort -0.200628153  0.07948854  0.68274306  0.32754180
## exports     0.059332832  0.70730269  0.01419742 -0.12308207
## health     -0.007276456  0.24983051 -0.07249683  0.11308797
## imports     0.030031537 -0.59218953  0.02894642  0.09903717
## income     -0.160346990 -0.09556237 -0.35262369  0.61298247
## inflation  -0.066285372 -0.10463252  0.01153775 -0.02523614
## life_expec  0.601126516 -0.01848639  0.50466425  0.29403981
## total_fer   0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp       -0.016778761 -0.24299776  0.24969636 -0.62564572

summary(standard_pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion  0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
##                            PC8     PC9
## Standard deviation     0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion  0.99257 1.00000

country_standard_pca <- data.frame(standard_pca$x[,1:4])
head(country_standard_pca)

##           PC1         PC2        PC3         PC4
## 1 -2.90428986 -0.09533386  0.7159652 -1.00224038
## 2  0.42862224  0.58639208  0.3324855  1.15757715
## 3 -0.28436983  0.45380957 -1.2178421  0.86551146
## 4 -2.92362976 -1.69047094 -1.5204709 -0.83710739
## 5  1.03047668 -0.13624894  0.2250441  0.84452276
## 6  0.02234007  1.77385167 -0.8673884  0.03685602

fviz_screeplot(standard_pca, addlabels = TRUE) +
  labs(title = "Scree plot - Standardized data", x = "Principal component", y = "Percentage of variance explained")

fviz_pca_biplot(standard_pca, col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

#data with PC1,  PC2, PC3, PC4
pca_data_s <- as.data.frame(standard_pca$x[, 1:4])

We can see that there are 4 principle components that can account for around 90% of the distribution of the original data after doing PCA on both the standardised and normalized versions of the original dataset.

K-Means Clustering

Setting Model

Executing model using different dataframe

# Creating KMeans model for normalized dataset
km <- kmeans(x = df_normal, centers = 3, nstart = 10, algorithm = "Lloyd",
             iter.max = 300, trace = T)
km

## K-means clustering with 3 clusters of sizes 35, 46, 86
## 
## Cluster means:
##   child_mort   exports    health   imports     income inflation life_expec
## 1 0.01118375 0.2917712 0.4417296 0.2957596 0.36332096 0.0623325  0.9496196
## 2 0.44150332 0.1459704 0.2814468 0.2485535 0.02337632 0.1506983  0.5383329
## 3 0.09652182 0.2014770 0.2738087 0.2694484 0.09777085 0.1092230  0.7984725
##   total_fer       gdpp
## 1 0.0932402 0.40934000
## 2 0.6214854 0.01398231
## 3 0.1803243 0.06193426
## 
## Clustering vector:
##   [1] 2 3 3 2 3 3 3 1 1 3 3 3 3 3 3 1 3 2 3 3 3 3 3 1 3 2 2 3 2 1 3 2 2 3 3 3 2
##  [38] 2 2 3 2 3 1 1 1 3 3 3 3 2 2 3 3 1 1 2 2 3 1 2 1 3 3 2 2 3 2 3 1 3 3 3 2 1
##  [75] 1 1 3 1 3 3 2 2 1 3 2 3 3 2 2 3 3 1 3 2 2 3 3 2 1 2 3 3 3 3 3 3 2 3 2 3 1
## [112] 1 2 2 1 3 2 3 3 3 3 3 1 1 3 3 2 3 3 2 3 3 2 1 1 1 2 3 1 1 3 3 2 3 1 1 3 2
## [149] 3 2 2 3 3 3 3 2 3 1 1 1 3 3 3 3 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 6.708499 6.161400 6.475220
##  (between_SS / total_SS =  54.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# Computing clusters and assign labels for normalised dataset
y_predicted_normal <- km$cluster

y_predicted_normal

##   [1] 2 3 3 2 3 3 3 1 1 3 3 3 3 3 3 1 3 2 3 3 3 3 3 1 3 2 2 3 2 1 3 2 2 3 3 3 2
##  [38] 2 2 3 2 3 1 1 1 3 3 3 3 2 2 3 3 1 1 2 2 3 1 2 1 3 3 2 2 3 2 3 1 3 3 3 2 1
##  [75] 1 1 3 1 3 3 2 2 1 3 2 3 3 2 2 3 3 1 3 2 2 3 3 2 1 2 3 3 3 3 3 3 2 3 2 3 1
## [112] 1 2 2 1 3 2 3 3 3 3 3 1 1 3 3 2 3 3 2 3 3 2 1 1 1 2 3 1 1 3 3 2 3 1 1 3 2
## [149] 3 2 2 3 3 3 3 2 3 1 1 1 3 3 3 3 3 2 2

# Creating KMeans model for standardised dataset
km_standard <- kmeans(x = df_standard, centers = 3, nstart = 10, algorithm = "Lloyd",
                      iter.max = 300, trace = T)
km_standard

## K-means clustering with 3 clusters of sizes 36, 84, 47
## 
## Cluster means:
##   child_mort     exports     health     imports     income   inflation
## 1 -0.8249676  0.64314557  0.7252301  0.19006732  1.4797922 -0.48346661
## 2 -0.4052346 -0.03155768 -0.2237978  0.02408916 -0.2510155 -0.01711594
## 3  1.3561391 -0.43622118 -0.1555163 -0.18863644 -0.6848344  0.40090504
##   life_expec  total_fer       gdpp
## 1  1.0763414 -0.7895024  1.6111498
## 2  0.2539698 -0.4230704 -0.3534185
## 3 -1.2783352  1.3608511 -0.6024306
## 
## Clustering vector:
##   [1] 3 2 2 3 2 2 2 1 1 2 2 1 2 2 2 1 2 3 2 2 2 3 2 1 2 3 3 2 3 1 2 3 3 2 2 2 3
##  [38] 3 3 2 3 2 1 1 1 2 2 2 2 3 3 2 2 1 1 3 3 2 1 3 1 2 2 3 3 2 3 2 1 2 2 2 3 1
##  [75] 1 1 2 1 2 2 3 3 1 2 3 2 2 3 3 2 2 1 2 3 3 2 2 3 1 3 2 2 2 2 2 2 3 2 3 2 1
## [112] 1 3 3 1 2 3 2 2 2 2 2 1 1 2 2 3 2 2 3 2 2 3 1 1 1 2 3 1 1 2 2 3 2 1 1 2 3
## [149] 2 3 3 2 2 2 2 3 2 1 1 1 2 2 2 2 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 297.2279 259.5575 269.6604
##  (between_SS / total_SS =  44.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# Computing clusters and assign labels for standardised dataset
y_predicted_standard <- km_standard$cluster

# the orginal data with standard scaling and 4 PCA to calculate the clusters and assign the labels.
dataset_pca <- kmeans(x = pca_data_s, centers = 3, nstart = 10, algorithm = "Lloyd",
                 iter.max = 300, trace = T)

# Computing clusters and assign labels for PCA data
y_predicted_pca<- dataset_pca$cluster

#adding cluster column to the dataframe
df_normal$cluster <- y_predicted_normal
head(df_normal)

##   child_mort    exports     health    imports      income  inflation life_expec
## 1 0.42648491 0.04948197 0.35860783 0.25776487 0.008047206 0.12614361  0.4753452
## 2 0.06815969 0.13953104 0.29459291 0.27903729 0.074933074 0.08039922  0.8717949
## 3 0.12025316 0.19155940 0.14667495 0.18014926 0.098809399 0.18769060  0.8757396
## 4 0.56669912 0.31112456 0.06463642 0.24626626 0.042535232 0.24591073  0.5522682
## 5 0.03748783 0.22707876 0.26227470 0.33825512 0.148652234 0.05221329  0.8816568
## 6 0.05793574 0.09400623 0.39092604 0.09160998 0.145436567 0.23204879  0.8619329
##    total_fer        gdpp cluster
## 1 0.73659306 0.003073428       2
## 2 0.07886435 0.036833414       3
## 3 0.27444795 0.040364993       3
## 4 0.79022082 0.031488322       2
## 5 0.15457413 0.114241808       3
## 6 0.19242902 0.096106673       3

#adding cluster column to the dataframe
df_standard$cluster <- y_predicted_standard
head(df_standard)

##   child_mort     exports      health     imports      income  inflation
## 1  1.2876597 -1.13486665  0.27825140 -0.08220771 -0.80582187  0.1568645
## 2 -0.5373329 -0.47822017 -0.09672528  0.07062429 -0.37424335 -0.3114109
## 3 -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227  0.7869076
## 4  2.0017872  0.77305618 -1.44372888 -0.16481961 -0.58328920  1.3828944
## 5 -0.6935483  0.16018613 -0.28603389  0.49607554  0.10142673 -0.5999442
## 6 -0.5894047 -0.81019144  0.46756001 -1.27594958  0.08067776  1.2409928
##   life_expec   total_fer        gdpp cluster
## 1 -1.6142372  1.89717646 -0.67714308       3
## 2  0.6459238 -0.85739418 -0.48416709       2
## 3  0.6684130 -0.03828924 -0.46398018       2
## 4 -1.1756985  2.12176975 -0.51472026       3
## 5  0.7021467 -0.54032130 -0.04169175       2
## 6  0.5897009 -0.38178486 -0.14535428       2

# adding the cluster column to the dataframe without 'country' feature)
dataset_country$cluster <- y_predicted_pca
head(dataset_country)

##   child_mort exports health imports income inflation life_expec total_fer  gdpp
## 1       90.2    10.0   7.58    44.9   1610      9.44       56.2      5.82   553
## 2       16.6    28.0   6.55    48.6   9930      4.49       76.3      1.65  4090
## 3       27.3    38.4   4.17    31.4  12900     16.10       76.5      2.89  4460
## 4      119.0    62.3   2.85    42.9   5900     22.40       60.1      6.16  3530
## 5       10.3    45.5   6.03    58.9  19100      1.44       76.8      2.13 12200
## 6       14.5    18.9   8.10    16.0  18700     20.90       75.8      2.37 10300
##   cluster
## 1       2
## 2       1
## 3       1
## 4       2
## 5       1
## 6       1

Elbow Method: number of clusters

Number of cluster with normalized data

sse = numeric(length = 10)
for (i in 1:10){
  km = kmeans(df_normal, centers = i, nstart = 100, algorithm = "Lloyd", iter.max = 100)
  sse[i] = km$tot.withinss
}

# plot
plot(1:10, sse, type="b", xlab="Number of clusters", ylab="SSE")

Number of cluster with standardized data

sse = numeric(length = 10)
for (i in 1:10){
  km = kmeans(df_standard, centers = i, nstart = 25, algorithm = "Lloyd", iter.max = 100)
  sse[i] = km$tot.withinss
}

# plot
plot(1:10, sse, type="b", xlab="Number of clusters", ylab="SSE")

Number of cluster with normalized data + PCA

sse = numeric(length = 10)
for (i in 1:10){
  km = suppressWarnings(kmeans(df_normal, centers = i, nstart = 25, algorithm = "Lloyd", iter.max = 100))
  sse[i] = km$tot.withinss
}

# plot
plot(1:10, sse, type="b", xlab="Number of clusters", ylab="SSE")

The ideal number of clusters is still 3 with various amounts of inertia after running the K-Means model with the normalized dataset, a standardized dataset, and a PCA with four components (with standardized scaling).

Clustering

# convert cluster column to factor
df_normal$cluster <- as.factor(df_normal$cluster)

# plot
ggpairs(df_normal, columns = 1:9, aes_string(color = "cluster"))

df_standard$cluster <- as.factor(df_standard$cluster)

ggpairs(df_standard, columns = 1:9, aes_string(color = "cluster"))

dataset_country$cluster <- as.factor(dataset_country$cluster)

ggpairs(dataset_country, columns = 1:9, aes_string(color = "cluster"))

We can observe that there is typically overlap across clusters after running the model with two different scaling methods and applying PCA. While clusters 1 and 2 frequently overlap, cluster 3 is more dispersed.

Cluster Descriptions

#adding cluster column to the original dataset
country_data$cluster <- as.factor(y_predicted_normal)

# convert cluster column to factor
country_data$cluster <- as.factor(country_data$cluster)

# create pivot table with mean values for each cluster and feature
clusters_table <- aggregate(country_data[,2:9], 
                            by = list(cluster = country_data$cluster), 
                            mean)

#for cluster 1

cluster_1 <- subset(country_data, cluster == 1)

#list of countries in this cluster
unique(cluster_1$country)

##  [1] "Australia"            "Austria"              "Belgium"             
##  [4] "Brunei"               "Canada"               "Cyprus"              
##  [7] "Czech Republic"       "Denmark"              "Finland"             
## [10] "France"               "Germany"              "Greece"              
## [13] "Iceland"              "Ireland"              "Israel"              
## [16] "Italy"                "Japan"                "Kuwait"              
## [19] "Luxembourg"           "Malta"                "Netherlands"         
## [22] "New Zealand"          "Norway"               "Portugal"            
## [25] "Qatar"                "Singapore"            "Slovak Republic"     
## [28] "Slovenia"             "South Korea"          "Spain"               
## [31] "Sweden"               "Switzerland"          "United Arab Emirates"
## [34] "United Kingdom"       "United States"

cluster_2 <- subset(country_data, cluster == 2)

#list of countries in this cluster
unique(cluster_2$country)

##  [1] "Afghanistan"              "Angola"                  
##  [3] "Benin"                    "Burkina Faso"            
##  [5] "Burundi"                  "Cameroon"                
##  [7] "Central African Republic" "Chad"                    
##  [9] "Comoros"                  "Congo, Dem. Rep."        
## [11] "Congo, Rep."              "Cote d'Ivoire"           
## [13] "Equatorial Guinea"        "Eritrea"                 
## [15] "Gabon"                    "Gambia"                  
## [17] "Ghana"                    "Guinea"                  
## [19] "Guinea-Bissau"            "Haiti"                   
## [21] "Iraq"                     "Kenya"                   
## [23] "Kiribati"                 "Lao"                     
## [25] "Lesotho"                  "Liberia"                 
## [27] "Madagascar"               "Malawi"                  
## [29] "Mali"                     "Mauritania"              
## [31] "Mozambique"               "Namibia"                 
## [33] "Niger"                    "Nigeria"                 
## [35] "Pakistan"                 "Rwanda"                  
## [37] "Senegal"                  "Sierra Leone"            
## [39] "Solomon Islands"          "Sudan"                   
## [41] "Tanzania"                 "Timor-Leste"             
## [43] "Togo"                     "Uganda"                  
## [45] "Yemen"                    "Zambia"

cluster_3 <- subset(country_data, cluster == 3)

#list of countries in this cluster
unique(cluster_3$country)

##  [1] "Albania"                        "Algeria"                       
##  [3] "Antigua and Barbuda"            "Argentina"                     
##  [5] "Armenia"                        "Azerbaijan"                    
##  [7] "Bahamas"                        "Bahrain"                       
##  [9] "Bangladesh"                     "Barbados"                      
## [11] "Belarus"                        "Belize"                        
## [13] "Bhutan"                         "Bolivia"                       
## [15] "Bosnia and Herzegovina"         "Botswana"                      
## [17] "Brazil"                         "Bulgaria"                      
## [19] "Cambodia"                       "Cape Verde"                    
## [21] "Chile"                          "China"                         
## [23] "Colombia"                       "Costa Rica"                    
## [25] "Croatia"                        "Dominican Republic"            
## [27] "Ecuador"                        "Egypt"                         
## [29] "El Salvador"                    "Estonia"                       
## [31] "Fiji"                           "Georgia"                       
## [33] "Grenada"                        "Guatemala"                     
## [35] "Guyana"                         "Hungary"                       
## [37] "India"                          "Indonesia"                     
## [39] "Iran"                           "Jamaica"                       
## [41] "Jordan"                         "Kazakhstan"                    
## [43] "Kyrgyz Republic"                "Latvia"                        
## [45] "Lebanon"                        "Libya"                         
## [47] "Lithuania"                      "Macedonia, FYR"                
## [49] "Malaysia"                       "Maldives"                      
## [51] "Mauritius"                      "Micronesia, Fed. Sts."         
## [53] "Moldova"                        "Mongolia"                      
## [55] "Montenegro"                     "Morocco"                       
## [57] "Myanmar"                        "Nepal"                         
## [59] "Oman"                           "Panama"                        
## [61] "Paraguay"                       "Peru"                          
## [63] "Philippines"                    "Poland"                        
## [65] "Romania"                        "Russia"                        
## [67] "Samoa"                          "Saudi Arabia"                  
## [69] "Serbia"                         "Seychelles"                    
## [71] "South Africa"                   "Sri Lanka"                     
## [73] "St. Vincent and the Grenadines" "Suriname"                      
## [75] "Tajikistan"                     "Thailand"                      
## [77] "Tonga"                          "Tunisia"                       
## [79] "Turkey"                         "Turkmenistan"                  
## [81] "Ukraine"                        "Uruguay"                       
## [83] "Uzbekistan"                     "Vanuatu"                       
## [85] "Venezuela"                      "Vietnam"

Conclusion

After thoroughly analyzing the country database, three clusters of countries have been identified based on their socio-economic and health factors.

The first cluster is characterized by strong positive values such as economic development, high life expectancy, and low child mortality. This cluster comprises 34 countries: Australia, Canada, Germany, Japan, and the United States. The second cluster is characterized by the most negative values, including high child mortality, low economic development, low gdpp, exports and imports, and the lowest life expectancy. This cluster comprises 58 countries, including Afghanistan, Angola, and Nigeria.

Average values for all features characterize the third cluster compared to the other clusters. This cluster comprises 97 countries, including Albania, Brazil, and Colombia. Based on this analysis, it is evident that the countries in the second cluster are in the greatest need of aid. These countries have the highest child mortality rates, lowest economic development, and lowest life expectancy. Hence, HELP International should prioritize providing aid to countries in this cluster to eradicate poverty and improve their living conditions.

Resources

https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/

https://towardsdatascience.com/understanding-dimension-reduction-and-principal-component-analysis-in-r-e3fbd02b29ae

https://www.datacamp.com/tutorial/pca-analysis-r