The objective of this report is to categorise countries using the socio-economic and health factors that determine their overall development. The data is obtained from HELP International, an international humanitarian NGO committed to fighting poverty and providing the people of underdeveloped countries with basic amenities and relief during disasters and natural calamities.
In this report, we will perform an unsupervised learning analysis on the country dataset. The analysis includes clustering with the K-means algorithm and dimensionality reduction with principal component analysis (PCA). The dataset is obtained from: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data
To build the model, we need the following libraries:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggforce)
library(cowplot)
country <- read.csv("archive_2/Country-data.csv")
head(country, 10)
glimpse(country)
## Rows: 167
## Columns: 10
## $ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua and…
## $ child_mort <dbl> 90.2, 16.6, 27.3, 119.0, 10.3, 14.5, 18.1, 4.8, 4.3, 39.2, …
## $ exports <dbl> 10.0, 28.0, 38.4, 62.3, 45.5, 18.9, 20.8, 19.8, 51.3, 54.3,…
## $ health <dbl> 7.58, 6.55, 4.17, 2.85, 6.03, 8.10, 4.40, 8.73, 11.00, 5.88…
## $ imports <dbl> 44.9, 48.6, 31.4, 42.9, 58.9, 16.0, 45.3, 20.9, 47.8, 20.7,…
## $ income <int> 1610, 9930, 12900, 5900, 19100, 18700, 6700, 41400, 43200, …
## $ inflation <dbl> 9.440, 4.490, 16.100, 22.400, 1.440, 20.900, 7.770, 1.160, …
## $ life_expec <dbl> 56.2, 76.3, 76.5, 60.1, 76.8, 75.8, 73.3, 82.0, 80.5, 69.1,…
## $ total_fer <dbl> 5.82, 1.65, 2.89, 6.16, 2.13, 2.37, 1.69, 1.93, 1.44, 1.92,…
## $ gdpp <int> 553, 4090, 4460, 3530, 12200, 10300, 3220, 51900, 46900, 58…
In this dataset, we have 167 rows and 10 columns. The columns are:
country : Name of the country
child_mort : Deaths of children under 5 years of age per 1000 live births
exports : Exports of goods and services per capita, given as a percentage of the GDP per capita
health : Total health spending per capita, given as a percentage of the GDP per capita
imports : Imports of goods and services per capita, given as a percentage of the GDP per capita
income : Net income per person
inflation : The measurement of the annual growth rate of the total GDP
life_expec : The average number of years a newborn child would live if the current mortality patterns remain the same
total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same
gdpp : The GDP per capita, calculated as the total GDP divided by the total population
We will check whether this dataset has any missing values:
anyNA(country)
## [1] FALSE
This dataset has no missing values, so we can proceed to the next step.
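`anyNA()` only answers yes or no. When a dataset does contain gaps, a per-column count is usually more useful. The sketch below is illustrative only: it uses the built-in `airquality` data as a stand-in, since the country data has no missing values to show.

```r
# Stand-in data frame with a few missing values (airquality is built into R).
na_per_column <- colSums(is.na(airquality))   # number of NAs in each column
print(na_per_column)

# complete.cases() flags the rows that have no missing value in any column.
n_complete <- sum(complete.cases(airquality))
print(n_complete)
```

The same two calls applied to `country` would return all zeros and 167 respectively, given the `FALSE` result above.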
Next, since the country column is only an identifier for each observation, it is not relevant to our analysis as a feature. Therefore we will use this column as the row names.
rownames(country) <- country[, 1]
country_clean <- country %>%
select(-country)
head(country_clean)
We want to see whether there is a high correlation between the numeric variables. Strong correlation between some variables implies that we can reduce the dimensionality, or number of features, using Principal Component Analysis (PCA).
ggcorr(country_clean, low = "navy", high = "red")
Some features are highly correlated, such as child_mort with total_fer and income with gdpp. Based on this result, we will try to reduce the dimension using PCA.
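The correlation plot can be complemented by listing the strongly correlated pairs numerically. The following is a minimal sketch, not part of the original analysis: the helper name `high_cor_pairs` and the 0.8 cutoff are assumptions chosen for illustration, and the built-in `mtcars` data stands in for the country data.

```r
# Hypothetical helper: list variable pairs whose absolute Pearson correlation
# exceeds a cutoff. The 0.8 default is an arbitrary illustrative threshold.
high_cor_pairs <- function(df, cutoff = 0.8) {
  cm <- cor(df)                                   # Pearson correlation matrix
  # Keep only the upper triangle so each pair is reported once.
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Stand-in data: a few numeric columns of the built-in mtcars data set.
pairs_found <- high_cor_pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])
print(pairs_found)
```

Calling `high_cor_pairs(country_clean)` would give the equivalent table for this report's data.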
country_scaled <- scale(country_clean)
summary(country_scaled)
## child_mort exports health imports
## Min. :-0.8845 Min. :-1.4957 Min. :-1.8223 Min. :-1.9341
## 1st Qu.:-0.7444 1st Qu.:-0.6314 1st Qu.:-0.6901 1st Qu.:-0.6894
## Median :-0.4704 Median :-0.2229 Median :-0.1805 Median :-0.1483
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5909 3rd Qu.: 0.3736 3rd Qu.: 0.6496 3rd Qu.: 0.4899
## Max. : 4.2086 Max. : 5.7964 Max. : 4.0353 Max. : 5.2504
## income inflation life_expec total_fer
## Min. :-0.8577 Min. :-1.1344 Min. :-4.3242 Min. :-1.1877
## 1st Qu.:-0.7153 1st Qu.:-0.5649 1st Qu.:-0.5910 1st Qu.:-0.7616
## Median :-0.3727 Median :-0.2263 Median : 0.2861 Median :-0.3554
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2934 3rd Qu.: 0.2808 3rd Qu.: 0.7021 3rd Qu.: 0.6157
## Max. : 5.5947 Max. : 9.1023 Max. : 1.3768 Max. : 3.0003
## gdpp
## Min. :-0.69471
## 1st Qu.:-0.63475
## Median :-0.45307
## Mean : 0.00000
## 3rd Qu.: 0.05924
## Max. : 5.02140
Scaling the data before k-means clustering is crucial. K-means groups observations by Euclidean distance, so without scaling, variables measured on large scales (such as income and gdpp) would dominate the distance and bias the clusters. Standardizing every variable to mean 0 and standard deviation 1 ensures that all variables contribute equally when forming clusters and gives a more accurate representation of the underlying data structure.
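To illustrate what `scale()` does and why it matters, the sketch below standardizes two columns with very different ranges and compares each column's share of the total variance before and after. It uses the built-in `mtcars` data as a stand-in; the variable names are illustrative.

```r
# Stand-in data: two columns on very different scales (built-in mtcars).
x <- as.matrix(mtcars[, c("disp", "drat")])

# scale() centres each column on its mean and divides by its standard deviation.
x_scaled <- scale(x)

# The same z-scores computed by hand, column by column.
x_manual <- sweep(sweep(x, 2, colMeans(x), "-"), 2, apply(x, 2, sd), "/")

# Share of the total variance carried by each column, before and after scaling.
share_before <- apply(x, 2, var) / sum(apply(x, 2, var))
share_after  <- apply(x_scaled, 2, var) / sum(apply(x_scaled, 2, var))

print(round(share_before, 4))
print(round(share_after, 4))
```

Before scaling, almost all of the variance (and hence of the Euclidean distance) comes from the wide-range column; after scaling, both columns contribute equally.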
To determine the optimal number of clusters, we will use the silhouette method:
fviz_nbclust(country_scaled, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
The silhouette method computes a silhouette coefficient for each observation from its mean intra-cluster distance and its mean distance to the nearest other cluster. We choose the number of clusters with the highest average silhouette score (the peak).
The method suggests that the optimal number of clusters is 13. Note that the model fitted below uses 5 clusters rather than 13.
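For an observation i, the silhouette coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the members of the nearest other cluster. The sketch below computes the average silhouette width over a range of k with `cluster::silhouette()`, which is roughly the quantity the plot above displays. It runs on the built-in `iris` measurements as a stand-in, and the range 2 to 8 and `nstart = 25` are illustrative choices.

```r
library(cluster)  # recommended package, shipped with standard R installations

set.seed(123)
x <- scale(iris[, 1:4])          # stand-in data, standardised as in the report
d <- dist(x)                     # Euclidean distances between observations

# Average silhouette width for k = 2..8 (the silhouette is undefined for k = 1).
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The suggested k is the one with the highest average silhouette width.
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))
print(best_k)
```

Replacing `x` with `country_scaled` and widening `ks` to `2:15` would mirror the settings used in this report.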
The main goal of K-means clustering is to partition a given dataset into ‘k’ clusters, where each cluster represents a group of data points that are similar to each other and dissimilar to points in other clusters. The ‘k’ in K-means refers to the number of clusters, which needs to be specified before running the algorithm.
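The partitioning idea can be sketched as a loop that alternates between assigning points to their nearest centre and moving each centre to the mean of its points. The code below is a minimal Lloyd-style illustration under stated assumptions, not the implementation used in this report: `simple_kmeans` is a hypothetical helper, R's `kmeans()` defaults to the Hartigan-Wong algorithm, and the built-in `iris` data is a stand-in.

```r
# Minimal Lloyd-style k-means, for illustration only. R's kmeans() defaults to
# the Hartigan-Wong algorithm, so results need not match it exactly.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random rows as starting centres
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every centre.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    if (identical(new_cluster, cluster)) break       # assignments stable: converged
    cluster <- new_cluster
    # Update step: each centre moves to the mean of its assigned points.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)  # keep old centre if empty
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(123)
fit <- simple_kmeans(scale(iris[, 1:4]), k = 3)
print(table(fit$cluster))
```

Because the result depends on the random starting centres, fixing the seed (as the report does with `set.seed(123)`) is what makes a run reproducible.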
set.seed(123)
clust <- kmeans(country_scaled, centers = 5)
clust
## K-means clustering with 5 clusters of sizes 22, 40, 45, 52, 8
##
## Cluster means:
## child_mort exports health imports income inflation life_expec
## 1 -0.8423248 -0.1170646 1.36809977 -0.4475776 1.1865034 -0.5902992 1.1647079
## 2 -0.2371141 -0.3634995 -0.56954175 -0.6532976 -0.2176016 0.4134368 0.1916427
## 3 1.4031544 -0.4676733 -0.14850199 -0.1595875 -0.7081369 0.3816471 -1.3076349
## 4 -0.5559776 0.3296093 0.08593233 0.5701085 -0.1597509 -0.3791089 0.3366978
## 5 -0.7769251 2.6276265 -0.63780206 1.6893011 2.8467745 -0.1264185 1.0057504
## total_fer gdpp
## 1 -0.7583089 1.7624828
## 2 -0.2526437 -0.3692435
## 3 1.3810523 -0.6138338
## 4 -0.5605195 -0.2418234
## 5 -0.7764745 2.0240571
##
## Clustering vector:
## Afghanistan Albania
## 3 4
## Algeria Angola
## 2 3
## Antigua and Barbuda Argentina
## 4 2
## Armenia Australia
## 2 1
## Austria Azerbaijan
## 1 2
## Bahamas Bahrain
## 4 4
## Bangladesh Barbados
## 2 4
## Belarus Belgium
## 4 1
## Belize Benin
## 4 3
## Bhutan Bolivia
## 4 2
## Bosnia and Herzegovina Botswana
## 4 3
## Brazil Brunei
## 2 5
## Bulgaria Burkina Faso
## 4 3
## Burundi Cambodia
## 3 4
## Cameroon Canada
## 3 1
## Cape Verde Central African Republic
## 4 3
## Chad Chile
## 3 2
## China Colombia
## 2 2
## Comoros Congo, Dem. Rep.
## 3 3
## Congo, Rep. Costa Rica
## 3 4
## Cote d'Ivoire Croatia
## 3 4
## Cyprus Czech Republic
## 4 4
## Denmark Dominican Republic
## 1 2
## Ecuador Egypt
## 2 2
## El Salvador Equatorial Guinea
## 4 3
## Eritrea Estonia
## 3 4
## Fiji Finland
## 4 1
## France Gabon
## 1 2
## Gambia Georgia
## 3 4
## Germany Ghana
## 1 3
## Greece Grenada
## 1 4
## Guatemala Guinea
## 2 3
## Guinea-Bissau Guyana
## 3 4
## Haiti Hungary
## 3 4
## Iceland India
## 1 2
## Indonesia Iran
## 2 2
## Iraq Ireland
## 2 5
## Israel Italy
## 1 1
## Jamaica Japan
## 2 1
## Jordan Kazakhstan
## 4 2
## Kenya Kiribati
## 3 3
## Kuwait Kyrgyz Republic
## 5 4
## Lao Latvia
## 3 4
## Lebanon Lesotho
## 4 3
## Liberia Libya
## 3 2
## Lithuania Luxembourg
## 4 5
## Macedonia, FYR Madagascar
## 4 3
## Malawi Malaysia
## 3 4
## Maldives Mali
## 4 3
## Malta Mauritania
## 5 3
## Mauritius Micronesia, Fed. Sts.
## 4 4
## Moldova Mongolia
## 4 2
## Montenegro Morocco
## 4 2
## Mozambique Myanmar
## 3 2
## Namibia Nepal
## 3 2
## Netherlands New Zealand
## 1 1
## Niger Nigeria
## 3 3
## Norway Oman
## 1 2
## Pakistan Panama
## 3 4
## Paraguay Peru
## 4 2
## Philippines Poland
## 2 4
## Portugal Qatar
## 1 5
## Romania Russia
## 2 2
## Rwanda Samoa
## 3 4
## Saudi Arabia Senegal
## 2 3
## Serbia Seychelles
## 4 4
## Sierra Leone Singapore
## 3 5
## Slovak Republic Slovenia
## 4 4
## Solomon Islands South Africa
## 4 3
## South Korea Spain
## 4 1
## Sri Lanka St. Vincent and the Grenadines
## 2 4
## Sudan Suriname
## 3 4
## Sweden Switzerland
## 1 1
## Tajikistan Tanzania
## 2 3
## Thailand Timor-Leste
## 4 3
## Togo Tonga
## 3 2
## Tunisia Turkey
## 4 2
## Turkmenistan Uganda
## 2 3
## Ukraine United Arab Emirates
## 4 5
## United Kingdom United States
## 1 1
## Uruguay Uzbekistan
## 2 2
## Vanuatu Venezuela
## 4 2
## Vietnam Yemen
## 4 3
## Zambia
## 3
##
## Within cluster sum of squares by cluster:
## [1] 46.96944 104.32294 260.44689 120.43834 115.75637
## (between_SS / total_SS = 56.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The percentage of variance explained (between_SS / total_SS) shows how much of the total variance is accounted for by the clustering. A higher percentage suggests that the clustering has captured a significant portion of the dataset’s variability. A good K-means clustering solution aims to minimize the within-cluster sum of squares (making clusters compact) while maximizing the between-cluster sum of squares (separating clusters from each other).
The percentage of variance explained is calculated as (Between_SS / Total_SS) * 100. In this case, it is 56.6%, indicating that 56.6% of the total variance in the data is accounted for by the separation of the clusters. This value may be considered good enough.
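The ratio can be reproduced by hand from the definitions, which makes clear what the figure measures. The following sketch uses the built-in `iris` data and 3 clusters as stand-ins (both are illustrative assumptions) and checks the manual sums against the components returned by `kmeans()`.

```r
set.seed(123)
x  <- scale(iris[, 1:4])                   # stand-in data
km <- kmeans(x, centers = 3, nstart = 25)

# Total SS: squared distances of all points to the overall mean.
total_ss <- sum(sweep(x, 2, colMeans(x))^2)

# Within SS: squared distances of each point to its own cluster centre.
within_ss <- sum((x - km$centers[km$cluster, ])^2)

# Between SS is what remains once the within-cluster scatter is removed.
between_ss <- total_ss - within_ss
pct_explained <- 100 * between_ss / total_ss
print(round(pct_explained, 1))
```

The manual values correspond to `km$totss`, `km$tot.withinss` and `km$betweenss`; for the model in this report the same ratio is `clust$betweenss / clust$totss`.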
country_clust <- country_clean %>% bind_cols(cluster = as.factor(clust$cluster)) %>%
select(cluster, 1:10)
country_clust
In this section, we will analyze the characteristics of each cluster and see whether the clusters show distinct traits. First, let’s examine life_expec and income.
country_clust %>%
  ggplot(aes(life_expec, income, color = cluster)) +
  geom_point(alpha = 0.5) +
  geom_mark_hull() +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "top")
The analysis of the clusters based on “life_expec” (Life Expectancy) and
“income” (Net Income per person) reveals distinct patterns in the
socio-economic and health conditions of the countries within each
group.
Cluster 1 represents countries with high life expectancy and high income, indicating that they are relatively affluent and have better healthcare and living conditions. These countries are likely to have well-established healthcare systems and higher standards of living.
Cluster 2 consists of countries with moderate life expectancy and moderate income. These countries may face some challenges in healthcare and development but still maintain moderate living standards.
Cluster 3 includes countries with low life expectancy and low income, suggesting they may have inadequate healthcare facilities and lower socio-economic development. These countries may face significant health and economic challenges.
Cluster 4 represents countries with moderate life expectancy and moderate income levels. These countries are in an intermediate position between the high and low-income clusters in terms of life expectancy and income.
Cluster 5 consists of countries with high life expectancy and the highest average income of all clusters (a scaled mean income of about 2.85, against 1.19 for Cluster 1), indicating that they are highly developed and prosperous. These countries likely have robust healthcare systems and high living standards.
Next, we will analyze the centroid of each cluster:
country_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% mutate_if(is.numeric,
    .funs = "round", digits = 2) %>% select(1:10)
According to the cluster means above, we can observe that:
Cluster 1 : Contains high-income countries with low child mortality, high life expectancy, and moderate levels of exports and imports.
Cluster 2 : Represents countries with moderate socio-economic indicators. Although child mortality and inflation are higher than in most other clusters, the remaining variables show moderate values. These countries might face some developmental difficulties and are not as economically developed as those in Cluster 1 and Cluster 5.
Cluster 3 : Includes low-income countries with high child mortality, low life expectancy, and low health spending.
Cluster 4 : Represents countries with moderate socio-economic indicators. The countries in this cluster have relatively higher levels of exports and imports, suggesting good international trade connections. Their socio-economic status lies between that of Cluster 1 (higher socio-economic status) and Cluster 3 (lower socio-economic status).
Cluster 5 : Represents high-income countries with high export and import levels, low child mortality, and high life expectancy.
These clusters provide valuable insights into the socio-economic characteristics of different groups of countries. The analysis can help policymakers, researchers, and development organizations in understanding the diverse challenges and opportunities faced by countries in different clusters and tailoring appropriate policies and interventions accordingly.
In this section, we will perform PCA on the dataset. First we will examine the eigenvalues, which measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs; that is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
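When the variables are standardized, these eigenvalues are the eigenvalues of the correlation matrix, and they sum to the number of variables. The sketch below illustrates this with base R only; it uses a few columns of the built-in `mtcars` data as a stand-in and `prcomp()` rather than the `PCA()` function used in this report.

```r
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]   # stand-in numeric data

# Eigenvalues of the correlation matrix = variances of the principal
# components when the variables are standardised first.
eig <- eigen(cor(x))$values

# prcomp() on centred and scaled data gives the same variances (sdev squared).
pc_var <- prcomp(x, center = TRUE, scale. = TRUE)$sdev^2

# Percentage and cumulative percentage of variance per component.
pct_var <- 100 * eig / sum(eig)
print(round(cbind(eigenvalue = eig, pct_var = pct_var, cum_pct = cumsum(pct_var)), 3))
```

The printed table has the same three rows of information as the "Eigenvalues" block of the `summary()` output for a `PCA()` object.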
country_pca <- PCA(country_clean, scale.unit = T, ncp = 9, graph = T)
summary(country_pca)
##
## Call:
## PCA(X = country_clean, scale.unit = T, ncp = 9, graph = T)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 4.136 1.546 1.170 0.995 0.661 0.224 0.113
## % of var. 45.952 17.182 13.004 11.053 7.340 2.484 1.260
## Cumulative % of var. 45.952 63.133 76.138 87.191 94.531 97.015 98.276
## Dim.8 Dim.9
## Variance 0.088 0.067
## % of var. 0.981 0.743
## Cumulative % of var. 99.257 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## Afghanistan | 3.230 | -2.913 1.229 0.814 | 0.096 0.004 0.001 |
## Albania | 1.473 | 0.430 0.027 0.085 | -0.588 0.134 0.160 |
## Algeria | 1.664 | -0.285 0.012 0.029 | -0.455 0.080 0.075 |
## Angola | 3.900 | -2.932 1.245 0.565 | 1.696 1.113 0.189 |
## Antigua and Barbuda | 1.415 | 1.034 0.155 0.533 | 0.137 0.007 0.009 |
## Argentina | 2.223 | 0.022 0.000 0.000 | -1.779 1.226 0.641 |
## Armenia | 1.719 | -0.102 0.001 0.003 | -0.568 0.125 0.109 |
## Australia | 3.405 | 2.342 0.794 0.473 | -1.988 1.531 0.341 |
## Austria | 3.341 | 2.974 1.280 0.792 | -0.735 0.209 0.048 |
## Azerbaijan | 1.581 | -0.181 0.005 0.013 | -0.403 0.063 0.065 |
## Dim.3 ctr cos2
## Afghanistan -0.718 0.264 0.049 |
## Albania -0.333 0.057 0.051 |
## Algeria 1.222 0.763 0.539 |
## Angola 1.525 1.190 0.153 |
## Antigua and Barbuda -0.226 0.026 0.025 |
## Argentina 0.870 0.387 0.153 |
## Armenia 0.242 0.030 0.020 |
## Australia 0.190 0.019 0.003 |
## Austria -0.520 0.138 0.024 |
## Azerbaijan 0.867 0.385 0.301 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## child_mort | -0.853 17.600 0.728 | 0.240 3.720 0.058 | -0.032
## exports | 0.577 8.060 0.333 | 0.762 37.597 0.581 | 0.157
## health | 0.307 2.275 0.094 | -0.302 5.909 0.091 | -0.645
## imports | 0.328 2.608 0.108 | 0.835 45.134 0.698 | -0.324
## income | 0.810 15.876 0.657 | 0.028 0.051 0.001 | 0.326
## inflation | -0.393 3.732 0.154 | -0.010 0.007 0.000 | 0.695
## life_expec | 0.866 18.134 0.750 | -0.277 4.960 0.077 | 0.123
## total_fer | -0.821 16.300 0.674 | 0.193 2.410 0.037 | 0.021
## gdpp | 0.798 15.417 0.638 | -0.057 0.212 0.003 | 0.133
## ctr cos2
## child_mort 0.087 0.001 |
## exports 2.096 0.025 |
## health 35.597 0.417 |
## imports 8.996 0.105 |
## income 9.093 0.106 |
## inflation 41.283 0.483 |
## life_expec 1.298 0.015 |
## total_fer 0.038 0.000 |
## gdpp 1.512 0.018 |
Visualization of the percentage of variance captured by each dimension:
fviz_eig(country_pca, ncp = 9,choice = "variance", addlabels = T, main = "Variance explained by each dimensions")
We keep about 87% of the information in our data by using only four dimensions (giving up the remaining 12.8% of the variance). This means that we can reduce the number of features in our dataset from nine to just four numeric features.
We can extract the values of PC1 to PC4 for all of the observations and put them into a new data frame. This data frame can later be used for supervised classification or other purposes.
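The choice of four dimensions can be checked directly from the "% of var." row of the PCA summary above. The sketch below copies those nine percentages by hand and finds the smallest number of components whose cumulative share reaches a threshold; the 80% threshold is an illustrative assumption.

```r
# Percentage of variance per dimension, copied from the PCA summary above.
pct_var <- c(45.952, 17.182, 13.004, 11.053, 7.340, 2.484, 1.260, 0.981, 0.743)

cum_var <- cumsum(pct_var)

# Smallest number of components whose cumulative variance reaches the threshold.
threshold <- 80                       # illustrative cutoff, in percent
n_keep <- which(cum_var >= threshold)[1]

print(n_keep)
print(round(cum_var[n_keep], 1))
```

With the object from this report, the same percentages are available as the second column of `country_pca$eig`.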
country_inpca <- data.frame(country_pca$ind$coord[, 1:4]) %>% bind_cols(cluster = as.factor(clust$cluster)) %>%
select(cluster, 1:4)
country_inpca
By integrating PCA with K-means clustering, we can create more informative visualizations, making it easier to interpret the clustering results. The reduced-dimensional visualization can also help to identify overlaps or separations between clusters.
fviz_cluster(object = clust, data = country_clean, labelsize = 0) + theme_minimal()
We can add one more dimension to the visualization above to reduce the overlap between clusters seen in the 2D plot.
plot_ly(country_inpca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster,
        colors = c("black", "red", "green", "blue")) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = "Dim.1"),
                      yaxis = list(title = "Dim.2"),
                      zaxis = list(title = "Dim.3")))
The conclusions that can be drawn from this report are:
Applying K-means clustering to the country dataset has grouped the countries into distinct clusters based on their socio-economic indicators. The clustering has provided valuable insights into the diversity of socio-economic conditions across different groups of countries.
Each cluster represents a set of countries with specific socio-economic profiles. Cluster 1 comprises high-income and highly developed countries with low child mortality, high life expectancy, and moderate trade activities. Cluster 2 consists of countries with moderate socio-economic indicators but facing challenges in terms of inflation and health spending. Cluster 3 represents low-income countries with high child mortality, low life expectancy, and limited economic development. Cluster 4 includes countries with moderate socio-economic indicators and relatively higher trade activities. Cluster 5 represents high-income countries with high trade activities, low child mortality, and high life expectancy.
Using PCA, we can reduce our nine features to four dimensions and still retain more than 80% of the variance. This dimensionality reduction can be useful if we use the principal components as features in machine learning applications.
To enhance the project, further analysis can be conducted to explore the relationships between different variables within each cluster. Additionally, considering the impact of other socio-economic indicators and external factors on the clusters could provide a more comprehensive understanding of the countries’ socio-economic status.