An international humanitarian Charity called HELP International is dedicated to eradicating poverty and giving the people of underdeveloped nations access to basic comforts and assistance during natural disasters. Over $10 million has been raised through HELP International. Now, this money needs to be used wisely and efficiently. So, data-driven decisions must be made to determine which nations are in the greatest need of aid. As a result, it becomes vital to group the nations according to the socioeconomic and health aspects that affect each nation’s overall development. Hence, funds would be provided for help after catastrophes and natural calamities based on these clusters of countries and their situations. It is a blatant example of unsupervised learning in which we must group the countries according to their various features.
RMarkdown view
Dataset feature details:
country : Name of the country
child_mort : Death of children under 5 years of age per 1000 live births
exports : Exports of goods and services per capita. Given as %age of the GDP per capita
health : Total health spending per capita. Given as %age of GDP per capita
imports : Imports of goods and services per capita. Given as %age of the GDP per capita
Income : Net income per person
Inflation : The measurement of the annual growth rate of the Total GDP
life_expec : The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.
The data for this research is taken from https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data/code?datasetId=721951.
Necessary Libraries:
library(tidyverse)
library(dplyr)
library(magrittr)
library(xfun)
library(corrplot)
library(scales)
library(FactoMineR)
library(factoextra)
library(cowplot)
library(stats)
library(GGally)
library(cluster)
country_data <- read.csv("Country-data.csv")
head(country_data)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.49
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.10
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.40
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp
## 1 56.2 5.82 553
## 2 76.3 1.65 4090
## 3 76.5 2.89 4460
## 4 60.1 6.16 3530
## 5 76.8 2.13 12200
## 6 75.8 2.37 10300
#checking duplicate values
sum(duplicated(country_data$country)) == 0
## [1] TRUE
There are no dublicate values.
#checking missing values
colSums(is.na(country_data))
## country child_mort exports health imports income inflation
## 0 0 0 0 0 0 0
## life_expec total_fer gdpp
## 0 0 0
There are no missing values
dim(country_data)
## [1] 167 10
str(country_data)
## 'data.frame': 167 obs. of 10 variables:
## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
summary(country_data)
## country child_mort exports health
## Length:167 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Class :character 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Mode :character Median : 19.30 Median : 35.000 Median : 6.320
## Mean : 38.27 Mean : 41.109 Mean : 6.816
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Max. :208.00 Max. :200.000 Max. :17.900
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
Examining the Big Relationships:
country_data %>%
gather(Features, value, 2:10) %>%
ggplot(aes(x=value)) +
geom_histogram(fill="blue", colour="black", bins = 30) +
facet_wrap(~Features, scales="free_x") +
labs(x="Values", y="Frequency")
int_cols <- country_data %>%
select_if(is.numeric) %>%
names()
country_long <- country_data %>%
pivot_longer(
cols = all_of(int_cols),
names_to = "Features",
values_to = "value"
)
colors <- c("red3", "orange3", "yellow3", "green3", "blue3", "violet", "pink3","grey","lightblue")
ggplot(country_long, aes(x = value)) +
geom_boxplot(fill = colors, colour = "black") +
facet_wrap(~Features, ncol = 3, scales = "free_x") +
labs(x = "Values", y = "") +
theme_minimal()
Findings
Correlation between features:
pairs(country_data[sapply(country_data, is.numeric)],col="blue")
my_colors <- colorRampPalette(c("white", "blue3"))(50)
corrplot(cor(country_data[2:10]), type = "lower", method = "ellipse", col = my_colors, tl.col = "black")
Positive Correlations - The positive correlation between exports and imports suggests that countries with higher levels of trade tend to have more open economies.
The high relationship between GDP per capita and income suggests that personal income and economic growth are closely related and that actions to boost economic growth may also positively impact individual incomes.
The relationship between child mortality and total fertility rate has a significance that points to the importance of supporting family planning and healthcare programs that enhance child health outcomes and reduce population expansion.
Given that those with greater earnings typically have better access to healthcare and lead healthier lifestyles, the positive relationship between income and life expectancy supports the idea that economic inequality might impact health outcomes.
Negative Correlations - The negative correlation between child mortality and life expectancy highlights the importance of improving child health outcomes in order to increase life expectancy and promote better overall public health.
Scaling the data is important in this case because the features in the dataset have different units and ranges, making it difficult to compare and analyze them accurately. For example, some features are expressed as percentages, some as dollar values, and some as whole numbers, making it hard to compare their magnitudes directly. Additionally, the range of values for each feature varies widely, with some features ranging from 0 to 200 and others from 0 to 100,000. This means that a change of 50 in one feature might have a significant impact, while in another, it might not be noticeable. If we do not scale the data, this level of variance can negatively impact the model’s performance, especially if it is based on measuring distances between data points. Some features may have higher magnitudes than others, giving them more weight in the model, even if they are not necessarily more important in determining the outcome. By scaling the data, we can remove potential bias that the model may have towards features with higher magnitudes and ensure that each feature is given equal consideration in the analysis. This can lead to more accurate results and a better understanding of the relationships between the different features in the dataset.
#Eliminating the country-specific column because we should only use numerical data.
dataset_country <- country_data[, -1]
head(dataset_country)
## child_mort exports health imports income inflation life_expec total_fer gdpp
## 1 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553
## 2 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090
## 3 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460
## 4 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530
## 5 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200
## 6 14.5 18.9 8.10 16.0 18700 20.90 75.8 2.37 10300
Data Normalization
# Applying min-max scaling to all columns
dataset_normalised <- data.frame(lapply(dataset_country, function(x) rescale(x, to = c(0,1))))
# Assign original column names to the rescaled data frame
colnames(dataset_normalised) <- colnames(dataset_country)
head(dataset_normalised)
## child_mort exports health imports income inflation life_expec
## 1 0.42648491 0.04948197 0.35860783 0.25776487 0.008047206 0.12614361 0.4753452
## 2 0.06815969 0.13953104 0.29459291 0.27903729 0.074933074 0.08039922 0.8717949
## 3 0.12025316 0.19155940 0.14667495 0.18014926 0.098809399 0.18769060 0.8757396
## 4 0.56669912 0.31112456 0.06463642 0.24626626 0.042535232 0.24591073 0.5522682
## 5 0.03748783 0.22707876 0.26227470 0.33825512 0.148652234 0.05221329 0.8816568
## 6 0.05793574 0.09400623 0.39092604 0.09160998 0.145436567 0.23204879 0.8619329
## total_fer gdpp
## 1 0.73659306 0.003073428
## 2 0.07886435 0.036833414
## 3 0.27444795 0.040364993
## 4 0.79022082 0.031488322
## 5 0.15457413 0.114241808
## 6 0.19242902 0.096106673
Data Standardization
# Applying standardization to all columns
dataset_standardised <- dataset_country %>%
scale()
# Assigning original column names to the rescaled data frame
colnames(dataset_standardised ) <- colnames(dataset_country)
head(dataset_standardised)
## child_mort exports health imports income inflation
## [1,] 1.2876597 -1.13486665 0.27825140 -0.08220771 -0.80582187 0.1568645
## [2,] -0.5373329 -0.47822017 -0.09672528 0.07062429 -0.37424335 -0.3114109
## [3,] -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227 0.7869076
## [4,] 2.0017872 0.77305618 -1.44372888 -0.16481961 -0.58328920 1.3828944
## [5,] -0.6935483 0.16018613 -0.28603389 0.49607554 0.10142673 -0.5999442
## [6,] -0.5894047 -0.81019144 0.46756001 -1.27594958 0.08067776 1.2409928
## life_expec total_fer gdpp
## [1,] -1.6142372 1.89717646 -0.67714308
## [2,] 0.6459238 -0.85739418 -0.48416709
## [3,] 0.6684130 -0.03828924 -0.46398018
## [4,] -1.1756985 2.12176975 -0.51472026
## [5,] 0.7021467 -0.54032130 -0.04169175
## [6,] 0.5897009 -0.38178486 -0.14535428
#We need columns argument for building new dataframe
columns <- colnames(dataset_country)
#We should make a new dataframe with the rescaled values and the column labels
df_normal <- as.data.frame(dataset_normalised)
colnames(df_normal) <- columns
df_standard <- as.data.frame(dataset_standardised)
colnames(df_standard) <- columns
head(df_normal)
## child_mort exports health imports income inflation life_expec
## 1 0.42648491 0.04948197 0.35860783 0.25776487 0.008047206 0.12614361 0.4753452
## 2 0.06815969 0.13953104 0.29459291 0.27903729 0.074933074 0.08039922 0.8717949
## 3 0.12025316 0.19155940 0.14667495 0.18014926 0.098809399 0.18769060 0.8757396
## 4 0.56669912 0.31112456 0.06463642 0.24626626 0.042535232 0.24591073 0.5522682
## 5 0.03748783 0.22707876 0.26227470 0.33825512 0.148652234 0.05221329 0.8816568
## 6 0.05793574 0.09400623 0.39092604 0.09160998 0.145436567 0.23204879 0.8619329
## total_fer gdpp
## 1 0.73659306 0.003073428
## 2 0.07886435 0.036833414
## 3 0.27444795 0.040364993
## 4 0.79022082 0.031488322
## 5 0.15457413 0.114241808
## 6 0.19242902 0.096106673
head(df_standard)
## child_mort exports health imports income inflation
## 1 1.2876597 -1.13486665 0.27825140 -0.08220771 -0.80582187 0.1568645
## 2 -0.5373329 -0.47822017 -0.09672528 0.07062429 -0.37424335 -0.3114109
## 3 -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227 0.7869076
## 4 2.0017872 0.77305618 -1.44372888 -0.16481961 -0.58328920 1.3828944
## 5 -0.6935483 0.16018613 -0.28603389 0.49607554 0.10142673 -0.5999442
## 6 -0.5894047 -0.81019144 0.46756001 -1.27594958 0.08067776 1.2409928
## life_expec total_fer gdpp
## 1 -1.6142372 1.89717646 -0.67714308
## 2 0.6459238 -0.85739418 -0.48416709
## 3 0.6684130 -0.03828924 -0.46398018
## 4 -1.1756985 2.12176975 -0.51472026
## 5 0.7021467 -0.54032130 -0.04169175
## 6 0.5897009 -0.38178486 -0.14535428
Comparing Scaling Methods
plot(df_normal$gdpp, df_normal$child_mort, col = 'blue3',
xlab = 'GDP per Person', ylab = 'Child Mortality')
plot(df_standard$gdpp, df_standard$child_mort, col = 'blue3',
xlab = 'GDP per Person', ylab = 'Child Mortality')
PCA with normalized data
normal_pca <- prcomp(df_normal)
normal_pca
## Standard deviations (1, .., p=9):
## [1] 0.37657157 0.18576634 0.17808711 0.15854130 0.09869388 0.08814718 0.05540053
## [8] 0.04784283 0.04230136
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.47646333 -0.281733093 0.100011701 -0.02908132 0.42095743
## exports 0.17033279 -0.466643928 -0.390523997 0.18848786 -0.10579788
## health 0.13388598 -0.044509771 0.781911221 0.51298346 -0.08330309
## imports 0.08161987 -0.394257286 -0.302774809 0.60665580 -0.10081385
## income 0.30698142 -0.404404812 0.068366711 -0.37282250 0.15996971
## inflation -0.08800904 0.001601976 -0.057029275 -0.21262116 -0.21955510
## life_expec 0.42053483 0.137519745 0.007857375 -0.14802117 -0.55643730
## total_fer -0.57108745 -0.369245977 0.174836469 -0.17306242 -0.62684939
## gdpp 0.33824337 -0.476077611 0.309636965 -0.31044817 0.13469013
## PC6 PC7 PC8 PC9
## child_mort -0.08309594 0.68892064 -0.05031919 -0.146653807
## exports -0.15131228 0.06411594 -0.64019218 0.342268962
## health -0.20814680 -0.02034690 -0.22728423 -0.056741715
## imports 0.01221525 0.01074206 0.54440667 -0.267186976
## income -0.02247672 -0.24033746 -0.19006577 -0.691784599
## inflation -0.92453091 -0.02335202 0.20070661 0.001198683
## life_expec 0.10955292 0.65603590 0.00204731 -0.173940114
## total_fer 0.23223954 -0.17206635 -0.01699321 -0.015072472
## gdpp 0.07433502 0.04950529 0.40358904 0.526959586
summary(normal_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.3766 0.1858 0.1781 0.15854 0.09869 0.08815 0.0554
## Proportion of Variance 0.5500 0.1338 0.1230 0.09749 0.03778 0.03014 0.0119
## Cumulative Proportion 0.5500 0.6839 0.8069 0.90436 0.94214 0.97228 0.9842
## PC8 PC9
## Standard deviation 0.04784 0.04230
## Proportion of Variance 0.00888 0.00694
## Cumulative Proportion 0.99306 1.00000
country_normal_pca <- data.frame(normal_pca$x[,1:4])
head(country_normal_pca)
## PC1 PC2 PC3 PC4
## 1 -0.599077641 -0.09549001 0.1575538 0.02433250
## 2 0.158474297 0.21209240 -0.0641890 0.06124679
## 3 0.003685518 0.13586725 -0.1341820 -0.13357441
## 4 -0.650234991 -0.27597526 -0.1426725 -0.15601831
## 5 0.200711243 0.06466223 -0.1007155 0.03790176
## 6 0.112534001 0.20604633 0.1189418 -0.10644779
fviz_screeplot(normal_pca, addlabels = TRUE)+
labs(title = "Scree plot - Min-Max scaled data", x = "Principal component", y = "Percentage of variance explained")
fviz_pca_biplot(normal_pca, col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
PCA with standardised data
standard_pca <- prcomp(df_standard)
standard_pca
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
summary(standard_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
## PC8 PC9
## Standard deviation 0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion 0.99257 1.00000
country_standard_pca <- data.frame(standard_pca$x[,1:4])
head(country_standard_pca)
## PC1 PC2 PC3 PC4
## 1 -2.90428986 -0.09533386 0.7159652 -1.00224038
## 2 0.42862224 0.58639208 0.3324855 1.15757715
## 3 -0.28436983 0.45380957 -1.2178421 0.86551146
## 4 -2.92362976 -1.69047094 -1.5204709 -0.83710739
## 5 1.03047668 -0.13624894 0.2250441 0.84452276
## 6 0.02234007 1.77385167 -0.8673884 0.03685602
fviz_screeplot(standard_pca, addlabels = TRUE) +
labs(title = "Scree plot - Standardized data", x = "Principal component", y = "Percentage of variance explained")
fviz_pca_biplot(standard_pca, col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
#data with PC1, PC2, PC3, PC4
pca_data_s <- as.data.frame(standard_pca$x[, 1:4])
We can see that there are 4 principle components that can account for around 90% of the distribution of the original data after doing PCA on both the standardised and normalized versions of the original dataset.
Executing model using different dataframe
# Creating KMeans model for normalized dataset
km <- kmeans(x = df_normal, centers = 3, nstart = 10, algorithm = "Lloyd",
iter.max = 300, trace = T)
km
## K-means clustering with 3 clusters of sizes 35, 46, 86
##
## Cluster means:
## child_mort exports health imports income inflation life_expec
## 1 0.01118375 0.2917712 0.4417296 0.2957596 0.36332096 0.0623325 0.9496196
## 2 0.44150332 0.1459704 0.2814468 0.2485535 0.02337632 0.1506983 0.5383329
## 3 0.09652182 0.2014770 0.2738087 0.2694484 0.09777085 0.1092230 0.7984725
## total_fer gdpp
## 1 0.0932402 0.40934000
## 2 0.6214854 0.01398231
## 3 0.1803243 0.06193426
##
## Clustering vector:
## [1] 2 3 3 2 3 3 3 1 1 3 3 3 3 3 3 1 3 2 3 3 3 3 3 1 3 2 2 3 2 1 3 2 2 3 3 3 2
## [38] 2 2 3 2 3 1 1 1 3 3 3 3 2 2 3 3 1 1 2 2 3 1 2 1 3 3 2 2 3 2 3 1 3 3 3 2 1
## [75] 1 1 3 1 3 3 2 2 1 3 2 3 3 2 2 3 3 1 3 2 2 3 3 2 1 2 3 3 3 3 3 3 2 3 2 3 1
## [112] 1 2 2 1 3 2 3 3 3 3 3 1 1 3 3 2 3 3 2 3 3 2 1 1 1 2 3 1 1 3 3 2 3 1 1 3 2
## [149] 3 2 2 3 3 3 3 2 3 1 1 1 3 3 3 3 3 2 2
##
## Within cluster sum of squares by cluster:
## [1] 6.708499 6.161400 6.475220
## (between_SS / total_SS = 54.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Computing clusters and assign labels for normalised dataset
y_predicted_normal <- km$cluster
y_predicted_normal
## [1] 2 3 3 2 3 3 3 1 1 3 3 3 3 3 3 1 3 2 3 3 3 3 3 1 3 2 2 3 2 1 3 2 2 3 3 3 2
## [38] 2 2 3 2 3 1 1 1 3 3 3 3 2 2 3 3 1 1 2 2 3 1 2 1 3 3 2 2 3 2 3 1 3 3 3 2 1
## [75] 1 1 3 1 3 3 2 2 1 3 2 3 3 2 2 3 3 1 3 2 2 3 3 2 1 2 3 3 3 3 3 3 2 3 2 3 1
## [112] 1 2 2 1 3 2 3 3 3 3 3 1 1 3 3 2 3 3 2 3 3 2 1 1 1 2 3 1 1 3 3 2 3 1 1 3 2
## [149] 3 2 2 3 3 3 3 2 3 1 1 1 3 3 3 3 3 2 2
# Creating KMeans model for standardised dataset
km_standard <- kmeans(x = df_standard, centers = 3, nstart = 10, algorithm = "Lloyd",
iter.max = 300, trace = T)
km_standard
## K-means clustering with 3 clusters of sizes 36, 84, 47
##
## Cluster means:
## child_mort exports health imports income inflation
## 1 -0.8249676 0.64314557 0.7252301 0.19006732 1.4797922 -0.48346661
## 2 -0.4052346 -0.03155768 -0.2237978 0.02408916 -0.2510155 -0.01711594
## 3 1.3561391 -0.43622118 -0.1555163 -0.18863644 -0.6848344 0.40090504
## life_expec total_fer gdpp
## 1 1.0763414 -0.7895024 1.6111498
## 2 0.2539698 -0.4230704 -0.3534185
## 3 -1.2783352 1.3608511 -0.6024306
##
## Clustering vector:
## [1] 3 2 2 3 2 2 2 1 1 2 2 1 2 2 2 1 2 3 2 2 2 3 2 1 2 3 3 2 3 1 2 3 3 2 2 2 3
## [38] 3 3 2 3 2 1 1 1 2 2 2 2 3 3 2 2 1 1 3 3 2 1 3 1 2 2 3 3 2 3 2 1 2 2 2 3 1
## [75] 1 1 2 1 2 2 3 3 1 2 3 2 2 3 3 2 2 1 2 3 3 2 2 3 1 3 2 2 2 2 2 2 3 2 3 2 1
## [112] 1 3 3 1 2 3 2 2 2 2 2 1 1 2 2 3 2 2 3 2 2 3 1 1 1 2 3 1 1 2 2 3 2 1 1 2 3
## [149] 2 3 3 2 2 2 2 3 2 1 1 1 2 2 2 2 2 3 3
##
## Within cluster sum of squares by cluster:
## [1] 297.2279 259.5575 269.6604
## (between_SS / total_SS = 44.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Computing clusters and assign labels for standardised dataset
y_predicted_standard <- km_standard$cluster
# the orginal data with standard scaling and 4 PCA to calculate the clusters and assign the labels.
dataset_pca <- kmeans(x = pca_data_s, centers = 3, nstart = 10, algorithm = "Lloyd",
iter.max = 300, trace = T)
# Computing clusters and assign labels for PCA data
y_predicted_pca<- dataset_pca$cluster
#adding cluster column to the dataframe
df_normal$cluster <- y_predicted_normal
head(df_normal)
## child_mort exports health imports income inflation life_expec
## 1 0.42648491 0.04948197 0.35860783 0.25776487 0.008047206 0.12614361 0.4753452
## 2 0.06815969 0.13953104 0.29459291 0.27903729 0.074933074 0.08039922 0.8717949
## 3 0.12025316 0.19155940 0.14667495 0.18014926 0.098809399 0.18769060 0.8757396
## 4 0.56669912 0.31112456 0.06463642 0.24626626 0.042535232 0.24591073 0.5522682
## 5 0.03748783 0.22707876 0.26227470 0.33825512 0.148652234 0.05221329 0.8816568
## 6 0.05793574 0.09400623 0.39092604 0.09160998 0.145436567 0.23204879 0.8619329
## total_fer gdpp cluster
## 1 0.73659306 0.003073428 2
## 2 0.07886435 0.036833414 3
## 3 0.27444795 0.040364993 3
## 4 0.79022082 0.031488322 2
## 5 0.15457413 0.114241808 3
## 6 0.19242902 0.096106673 3
#adding cluster column to the dataframe
df_standard$cluster <- y_predicted_standard
head(df_standard)
## child_mort exports health imports income inflation
## 1 1.2876597 -1.13486665 0.27825140 -0.08220771 -0.80582187 0.1568645
## 2 -0.5373329 -0.47822017 -0.09672528 0.07062429 -0.37424335 -0.3114109
## 3 -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227 0.7869076
## 4 2.0017872 0.77305618 -1.44372888 -0.16481961 -0.58328920 1.3828944
## 5 -0.6935483 0.16018613 -0.28603389 0.49607554 0.10142673 -0.5999442
## 6 -0.5894047 -0.81019144 0.46756001 -1.27594958 0.08067776 1.2409928
## life_expec total_fer gdpp cluster
## 1 -1.6142372 1.89717646 -0.67714308 3
## 2 0.6459238 -0.85739418 -0.48416709 2
## 3 0.6684130 -0.03828924 -0.46398018 2
## 4 -1.1756985 2.12176975 -0.51472026 3
## 5 0.7021467 -0.54032130 -0.04169175 2
## 6 0.5897009 -0.38178486 -0.14535428 2
# adding the cluster column to the dataframe without 'country' feature)
dataset_country$cluster <- y_predicted_pca
head(dataset_country)
## child_mort exports health imports income inflation life_expec total_fer gdpp
## 1 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553
## 2 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090
## 3 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460
## 4 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530
## 5 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200
## 6 14.5 18.9 8.10 16.0 18700 20.90 75.8 2.37 10300
## cluster
## 1 2
## 2 1
## 3 1
## 4 2
## 5 1
## 6 1
Number of cluster with normalized data
sse = numeric(length = 10)
for (i in 1:10){
km = kmeans(df_normal, centers = i, nstart = 100, algorithm = "Lloyd", iter.max = 100)
sse[i] = km$tot.withinss
}
# plot
plot(1:10, sse, type="b", xlab="Number of clusters", ylab="SSE")
Number of cluster with standardized data
sse = numeric(length = 10)
for (i in 1:10){
km = kmeans(df_standard, centers = i, nstart = 25, algorithm = "Lloyd", iter.max = 100)
sse[i] = km$tot.withinss
}
# plot
plot(1:10, sse, type="b", xlab="Number of clusters", ylab="SSE")
Number of cluster with normalized data + PCA
sse = numeric(length = 10)
for (i in 1:10){
km = suppressWarnings(kmeans(df_normal, centers = i, nstart = 25, algorithm = "Lloyd", iter.max = 100))
sse[i] = km$tot.withinss
}
# plot
plot(1:10, sse, type="b", xlab="Number of clusters", ylab="SSE")
The ideal number of clusters is still 3 with various amounts of inertia after running the K-Means model with the normalized dataset, a standardized dataset, and a PCA with four components (with standardized scaling).
# convert cluster column to factor
df_normal$cluster <- as.factor(df_normal$cluster)
# plot
ggpairs(df_normal, columns = 1:9, aes_string(color = "cluster"))
df_standard$cluster <- as.factor(df_standard$cluster)
ggpairs(df_standard, columns = 1:9, aes_string(color = "cluster"))
dataset_country$cluster <- as.factor(dataset_country$cluster)
ggpairs(dataset_country, columns = 1:9, aes_string(color = "cluster"))
We can observe that there is typically overlap across clusters after running the model with two different scaling methods and applying PCA. While clusters 1 and 2 frequently overlap, cluster 3 is more dispersed.
#adding cluster column to the original dataset
country_data$cluster <- as.factor(y_predicted_normal)
# convert cluster column to factor
country_data$cluster <- as.factor(country_data$cluster)
# create pivot table with mean values for each cluster and feature
clusters_table <- aggregate(country_data[,2:9],
by = list(cluster = country_data$cluster),
mean)
#for cluster 1
cluster_1 <- subset(country_data, cluster == 1)
#list of countries in this cluster
unique(cluster_1$country)
## [1] "Australia" "Austria" "Belgium"
## [4] "Brunei" "Canada" "Cyprus"
## [7] "Czech Republic" "Denmark" "Finland"
## [10] "France" "Germany" "Greece"
## [13] "Iceland" "Ireland" "Israel"
## [16] "Italy" "Japan" "Kuwait"
## [19] "Luxembourg" "Malta" "Netherlands"
## [22] "New Zealand" "Norway" "Portugal"
## [25] "Qatar" "Singapore" "Slovak Republic"
## [28] "Slovenia" "South Korea" "Spain"
## [31] "Sweden" "Switzerland" "United Arab Emirates"
## [34] "United Kingdom" "United States"
cluster_2 <- subset(country_data, cluster == 2)
#list of countries in this cluster
unique(cluster_2$country)
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Burkina Faso"
## [5] "Burundi" "Cameroon"
## [7] "Central African Republic" "Chad"
## [9] "Comoros" "Congo, Dem. Rep."
## [11] "Congo, Rep." "Cote d'Ivoire"
## [13] "Equatorial Guinea" "Eritrea"
## [15] "Gabon" "Gambia"
## [17] "Ghana" "Guinea"
## [19] "Guinea-Bissau" "Haiti"
## [21] "Iraq" "Kenya"
## [23] "Kiribati" "Lao"
## [25] "Lesotho" "Liberia"
## [27] "Madagascar" "Malawi"
## [29] "Mali" "Mauritania"
## [31] "Mozambique" "Namibia"
## [33] "Niger" "Nigeria"
## [35] "Pakistan" "Rwanda"
## [37] "Senegal" "Sierra Leone"
## [39] "Solomon Islands" "Sudan"
## [41] "Tanzania" "Timor-Leste"
## [43] "Togo" "Uganda"
## [45] "Yemen" "Zambia"
cluster_3 <- subset(country_data, cluster == 3)
#list of countries in this cluster
unique(cluster_3$country)
## [1] "Albania" "Algeria"
## [3] "Antigua and Barbuda" "Argentina"
## [5] "Armenia" "Azerbaijan"
## [7] "Bahamas" "Bahrain"
## [9] "Bangladesh" "Barbados"
## [11] "Belarus" "Belize"
## [13] "Bhutan" "Bolivia"
## [15] "Bosnia and Herzegovina" "Botswana"
## [17] "Brazil" "Bulgaria"
## [19] "Cambodia" "Cape Verde"
## [21] "Chile" "China"
## [23] "Colombia" "Costa Rica"
## [25] "Croatia" "Dominican Republic"
## [27] "Ecuador" "Egypt"
## [29] "El Salvador" "Estonia"
## [31] "Fiji" "Georgia"
## [33] "Grenada" "Guatemala"
## [35] "Guyana" "Hungary"
## [37] "India" "Indonesia"
## [39] "Iran" "Jamaica"
## [41] "Jordan" "Kazakhstan"
## [43] "Kyrgyz Republic" "Latvia"
## [45] "Lebanon" "Libya"
## [47] "Lithuania" "Macedonia, FYR"
## [49] "Malaysia" "Maldives"
## [51] "Mauritius" "Micronesia, Fed. Sts."
## [53] "Moldova" "Mongolia"
## [55] "Montenegro" "Morocco"
## [57] "Myanmar" "Nepal"
## [59] "Oman" "Panama"
## [61] "Paraguay" "Peru"
## [63] "Philippines" "Poland"
## [65] "Romania" "Russia"
## [67] "Samoa" "Saudi Arabia"
## [69] "Serbia" "Seychelles"
## [71] "South Africa" "Sri Lanka"
## [73] "St. Vincent and the Grenadines" "Suriname"
## [75] "Tajikistan" "Thailand"
## [77] "Tonga" "Tunisia"
## [79] "Turkey" "Turkmenistan"
## [81] "Ukraine" "Uruguay"
## [83] "Uzbekistan" "Vanuatu"
## [85] "Venezuela" "Vietnam"
After thoroughly analyzing the country database, three clusters of countries have been identified based on their socio-economic and health factors.
The first cluster is characterized by strong positive values such as economic development, high life expectancy, and low child mortality. This cluster comprises 34 countries: Australia, Canada, Germany, Japan, and the United States. The second cluster is characterized by the most negative values, including high child mortality, low economic development, low gdpp, exports and imports, and the lowest life expectancy. This cluster comprises 58 countries, including Afghanistan, Angola, and Nigeria.
Average values for all features characterize the third cluster compared to the other clusters. This cluster comprises 97 countries, including Albania, Brazil, and Colombia. Based on this analysis, it is evident that the countries in the second cluster are in the greatest need of aid. These countries have the highest child mortality rates, lowest economic development, and lowest life expectancy. Hence, HELP International should prioritize providing aid to countries in this cluster to eradicate poverty and improve their living conditions.
Resources
https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/