1 Introduction

1.1 Introduction

The objective of this report is to categorise countries using the socio-economic and health factors that determine a country's overall development. The data is obtained from HELP International, an international humanitarian NGO committed to fighting poverty and providing the people of underdeveloped countries with basic amenities and relief during disasters and natural calamities.

In this report, we will perform an unsupervised learning analysis on the country dataset. The analysis includes clustering using the K-means algorithm and dimensionality reduction using principal component analysis (PCA). The dataset is obtained from: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data

1.2 Packages

To build the model, we will need the following libraries:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggforce)
library(cowplot)

2 Data Input and Explanation

country <- read.csv("archive_2/Country-data.csv")
head(country,10)
glimpse(country)
## Rows: 167
## Columns: 10
## $ country    <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua and…
## $ child_mort <dbl> 90.2, 16.6, 27.3, 119.0, 10.3, 14.5, 18.1, 4.8, 4.3, 39.2, …
## $ exports    <dbl> 10.0, 28.0, 38.4, 62.3, 45.5, 18.9, 20.8, 19.8, 51.3, 54.3,…
## $ health     <dbl> 7.58, 6.55, 4.17, 2.85, 6.03, 8.10, 4.40, 8.73, 11.00, 5.88…
## $ imports    <dbl> 44.9, 48.6, 31.4, 42.9, 58.9, 16.0, 45.3, 20.9, 47.8, 20.7,…
## $ income     <int> 1610, 9930, 12900, 5900, 19100, 18700, 6700, 41400, 43200, …
## $ inflation  <dbl> 9.440, 4.490, 16.100, 22.400, 1.440, 20.900, 7.770, 1.160, …
## $ life_expec <dbl> 56.2, 76.3, 76.5, 60.1, 76.8, 75.8, 73.3, 82.0, 80.5, 69.1,…
## $ total_fer  <dbl> 5.82, 1.65, 2.89, 6.16, 2.13, 2.37, 1.69, 1.93, 1.44, 1.92,…
## $ gdpp       <int> 553, 4090, 4460, 3530, 12200, 10300, 3220, 51900, 46900, 58…

In this dataset, we have 167 rows and 10 columns. The details of each column are:

  • country : Name of the country

  • child_mort : Deaths of children under 5 years of age per 1000 live births

  • exports : Exports of goods and services per capita, given as a percentage of the GDP per capita

  • health : Total health spending per capita, given as a percentage of the GDP per capita

  • imports : Imports of goods and services per capita, given as a percentage of the GDP per capita

  • income : Net income per person

  • inflation : The measurement of the annual growth rate of the Total GDP

  • life_expec : The average number of years a new born child would live if the current mortality patterns are to remain the same

  • total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same.

  • gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.

3 Data Wrangling

We will check whether this dataset has any missing values:

anyNA(country)
## [1] FALSE

This dataset doesn’t have any missing values, so we can proceed to the next step.
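
Had the check returned TRUE, a per-column count would show where the gaps are. The following is only a minimal sketch on a made-up data frame; `toy` and its values are hypothetical and are not part of the country data.

```r
# Hypothetical toy data frame with two missing values (not the country data)
toy <- data.frame(
  child_mort = c(90.2, NA, 27.3),
  income     = c(1610, 9930, NA),
  gdpp       = c(553, 4090, 4460)
)

# TRUE if at least one value anywhere in the data frame is missing
has_na <- anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

print(has_na)
print(na_per_column)
```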

Next, since the country column only holds the name of each country, it acts as an identifier and is not relevant to our analysis. Therefore we will use this column as the row names and drop it from the data.

rownames(country) <- country[, 1]
country_clean <- country %>%
  select(-country)
head(country_clean)

4 Exploratory Data Analysis

4.1 Clustering Opportunity

4.2 Dimensionality Reduction (PCA) Opportunity

We want to see whether there are high correlations between the numeric variables. Strong correlation between some variables implies that we can reduce the dimensionality, or number of features, using Principal Component Analysis (PCA).

ggcorr(country_clean, low = "navy", high = "red")

Some features are highly correlated, such as child_mort with total_fer and income with gdpp. Based on this result, we will try to reduce the dimensions using PCA.
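
Highly correlated pairs can also be listed numerically rather than read off the plot. The sketch below is an illustration in base R, using the built-in mtcars data as a stand-in for country_clean; the 0.8 threshold is an arbitrary illustrative cut-off, not a value used in this report.

```r
# Built-in mtcars data used as a stand-in for country_clean (assumption)
num_data <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pearson correlation matrix of the numeric variables
cor_mat <- cor(num_data)

# Keep each pair once (upper triangle) and flag |r| above the threshold
threshold <- 0.8  # illustrative cut-off, not taken from the report
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 2)
)
print(high_pairs)
```

The same steps should apply to country_clean, since all of its columns are numeric.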

5 Preprocessing

5.1 Scaling Data

country_scaled <- scale(country_clean)
summary(country_scaled)
##    child_mort         exports            health           imports       
##  Min.   :-0.8845   Min.   :-1.4957   Min.   :-1.8223   Min.   :-1.9341  
##  1st Qu.:-0.7444   1st Qu.:-0.6314   1st Qu.:-0.6901   1st Qu.:-0.6894  
##  Median :-0.4704   Median :-0.2229   Median :-0.1805   Median :-0.1483  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5909   3rd Qu.: 0.3736   3rd Qu.: 0.6496   3rd Qu.: 0.4899  
##  Max.   : 4.2086   Max.   : 5.7964   Max.   : 4.0353   Max.   : 5.2504  
##      income          inflation         life_expec        total_fer      
##  Min.   :-0.8577   Min.   :-1.1344   Min.   :-4.3242   Min.   :-1.1877  
##  1st Qu.:-0.7153   1st Qu.:-0.5649   1st Qu.:-0.5910   1st Qu.:-0.7616  
##  Median :-0.3727   Median :-0.2263   Median : 0.2861   Median :-0.3554  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2934   3rd Qu.: 0.2808   3rd Qu.: 0.7021   3rd Qu.: 0.6157  
##  Max.   : 5.5947   Max.   : 9.1023   Max.   : 1.3768   Max.   : 3.0003  
##       gdpp         
##  Min.   :-0.69471  
##  1st Qu.:-0.63475  
##  Median :-0.45307  
##  Mean   : 0.00000  
##  3rd Qu.: 0.05924  
##  Max.   : 5.02140

Scaling the data is crucial in k-means clustering because the algorithm groups observations by distance. Without scaling, variables measured on large scales (such as income and gdpp) would dominate the distance calculation and bias the result. Standardising each variable to mean 0 and standard deviation 1 ensures that all variables are treated equally when forming clusters and gives a more accurate representation of the underlying data structure.
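
The effect of scaling on distances can be illustrated with a few values. The sketch below takes the income and life_expec values of entries 1, 2, 8 and 9 from the glimpse() output; the object names (`toy`, `manual`, `max_gap`) are introduced here for illustration only.

```r
# Income and life expectancy of entries 1, 2, 8 and 9 in the glimpse() output
toy <- data.frame(
  income     = c(1610, 9930, 41400, 43200),
  life_expec = c(56.2, 76.3, 82.0, 80.5)
)

# scale() subtracts the column mean and divides by the column standard deviation
manual <- apply(toy, 2, function(v) (v - mean(v)) / sd(v))
scaled <- scale(toy)
same_result <- isTRUE(all.equal(manual, scaled, check.attributes = FALSE))

# Unscaled Euclidean distances are almost entirely determined by income
raw_all    <- as.matrix(dist(toy))
raw_income <- as.matrix(dist(toy["income"]))
max_gap    <- max(abs(raw_all - raw_income))

# After scaling, both variables contribute on a comparable footing
scaled_dist <- as.matrix(dist(scaled))

print(same_result)
print(max_gap)
print(round(scaled_dist, 2))
```

In the unscaled case, adding life_expec changes the pairwise distances by well under one unit, which is why k-means on raw data would effectively cluster on income alone.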

6 Clustering

6.1 Finding Optimal Number of Cluster

To find the optimal number of clusters, we will use the silhouette method:

fviz_nbclust(country_scaled, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")

The silhouette method computes the silhouette coefficient of each observation from its mean intra-cluster distance and its mean nearest-cluster distance. The optimal number of clusters is the one with the highest average silhouette score (the peak).

The method suggests that the optimal number of clusters is 13.
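
To make the silhouette coefficient concrete, it can be computed by hand. For observation i, let a be its mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster; the silhouette is (b - a) / max(a, b). The sketch below is a hand-rolled base R illustration on the built-in USArrests data, not the code that fviz_nbclust() runs internally; `mean_silhouette` is a hypothetical helper name.

```r
# Mean silhouette width of a clustering, computed in base R (illustrative helper)
mean_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  ids <- unique(cluster)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                # exclude the observation itself
    if (!any(own)) {               # singleton cluster: silhouette defined as 0
      s[i] <- 0
      next
    }
    a <- mean(d[i, own])           # mean intra-cluster distance
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(d[i, cluster == k])))  # nearest other cluster
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

set.seed(123)
x <- scale(USArrests)              # built-in stand-in for country_scaled
scores <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean_silhouette(x, km$cluster)
})
names(scores) <- 2:6

# Number of clusters with the highest mean silhouette width
best_k <- as.integer(names(scores)[which.max(scores)])
print(round(scores, 3))
print(best_k)
```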

6.2 K-means Clustering

The main goal of K-means clustering is to partition a given dataset into ‘k’ clusters, where each cluster represents a group of data points that are similar to each other and dissimilar to points in other clusters. The ‘k’ in K-means refers to the number of clusters, which needs to be specified before running the algorithm.
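
The partitioning idea can be sketched as a simple loop that alternates between assigning each point to its nearest centre and recomputing each centre as the mean of its points (Lloyd's iteration). This is an illustration only: `simple_kmeans` is a hypothetical helper, the data are simulated, and R's kmeans() uses the Hartigan-Wong algorithm by default rather than this exact procedure.

```r
# Minimal Lloyd-style k-means (illustrative; not the algorithm kmeans() uses by default)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centres
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared distance from every point (rows) to every centre (columns)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")  # index of nearest centre
    if (all(new_cluster == cluster)) break              # assignments stable: stop
    cluster <- new_cluster
    # Move each centre to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Two simulated, well-separated groups of ten points each
set.seed(123)
sim <- rbind(matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2))
fit <- simple_kmeans(sim, k = 2)
print(table(fit$cluster))
```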

set.seed(123)
clust <- kmeans(country_scaled, centers = 5)
clust
## K-means clustering with 5 clusters of sizes 22, 40, 45, 52, 8
## 
## Cluster means:
##   child_mort    exports      health    imports     income  inflation life_expec
## 1 -0.8423248 -0.1170646  1.36809977 -0.4475776  1.1865034 -0.5902992  1.1647079
## 2 -0.2371141 -0.3634995 -0.56954175 -0.6532976 -0.2176016  0.4134368  0.1916427
## 3  1.4031544 -0.4676733 -0.14850199 -0.1595875 -0.7081369  0.3816471 -1.3076349
## 4 -0.5559776  0.3296093  0.08593233  0.5701085 -0.1597509 -0.3791089  0.3366978
## 5 -0.7769251  2.6276265 -0.63780206  1.6893011  2.8467745 -0.1264185  1.0057504
##    total_fer       gdpp
## 1 -0.7583089  1.7624828
## 2 -0.2526437 -0.3692435
## 3  1.3810523 -0.6138338
## 4 -0.5605195 -0.2418234
## 5 -0.7764745  2.0240571
## 
## Clustering vector:
##                    Afghanistan                        Albania 
##                              3                              4 
##                        Algeria                         Angola 
##                              2                              3 
##            Antigua and Barbuda                      Argentina 
##                              4                              2 
##                        Armenia                      Australia 
##                              2                              1 
##                        Austria                     Azerbaijan 
##                              1                              2 
##                        Bahamas                        Bahrain 
##                              4                              4 
##                     Bangladesh                       Barbados 
##                              2                              4 
##                        Belarus                        Belgium 
##                              4                              1 
##                         Belize                          Benin 
##                              4                              3 
##                         Bhutan                        Bolivia 
##                              4                              2 
##         Bosnia and Herzegovina                       Botswana 
##                              4                              3 
##                         Brazil                         Brunei 
##                              2                              5 
##                       Bulgaria                   Burkina Faso 
##                              4                              3 
##                        Burundi                       Cambodia 
##                              3                              4 
##                       Cameroon                         Canada 
##                              3                              1 
##                     Cape Verde       Central African Republic 
##                              4                              3 
##                           Chad                          Chile 
##                              3                              2 
##                          China                       Colombia 
##                              2                              2 
##                        Comoros               Congo, Dem. Rep. 
##                              3                              3 
##                    Congo, Rep.                     Costa Rica 
##                              3                              4 
##                  Cote d'Ivoire                        Croatia 
##                              3                              4 
##                         Cyprus                 Czech Republic 
##                              4                              4 
##                        Denmark             Dominican Republic 
##                              1                              2 
##                        Ecuador                          Egypt 
##                              2                              2 
##                    El Salvador              Equatorial Guinea 
##                              4                              3 
##                        Eritrea                        Estonia 
##                              3                              4 
##                           Fiji                        Finland 
##                              4                              1 
##                         France                          Gabon 
##                              1                              2 
##                         Gambia                        Georgia 
##                              3                              4 
##                        Germany                          Ghana 
##                              1                              3 
##                         Greece                        Grenada 
##                              1                              4 
##                      Guatemala                         Guinea 
##                              2                              3 
##                  Guinea-Bissau                         Guyana 
##                              3                              4 
##                          Haiti                        Hungary 
##                              3                              4 
##                        Iceland                          India 
##                              1                              2 
##                      Indonesia                           Iran 
##                              2                              2 
##                           Iraq                        Ireland 
##                              2                              5 
##                         Israel                          Italy 
##                              1                              1 
##                        Jamaica                          Japan 
##                              2                              1 
##                         Jordan                     Kazakhstan 
##                              4                              2 
##                          Kenya                       Kiribati 
##                              3                              3 
##                         Kuwait                Kyrgyz Republic 
##                              5                              4 
##                            Lao                         Latvia 
##                              3                              4 
##                        Lebanon                        Lesotho 
##                              4                              3 
##                        Liberia                          Libya 
##                              3                              2 
##                      Lithuania                     Luxembourg 
##                              4                              5 
##                 Macedonia, FYR                     Madagascar 
##                              4                              3 
##                         Malawi                       Malaysia 
##                              3                              4 
##                       Maldives                           Mali 
##                              4                              3 
##                          Malta                     Mauritania 
##                              5                              3 
##                      Mauritius          Micronesia, Fed. Sts. 
##                              4                              4 
##                        Moldova                       Mongolia 
##                              4                              2 
##                     Montenegro                        Morocco 
##                              4                              2 
##                     Mozambique                        Myanmar 
##                              3                              2 
##                        Namibia                          Nepal 
##                              3                              2 
##                    Netherlands                    New Zealand 
##                              1                              1 
##                          Niger                        Nigeria 
##                              3                              3 
##                         Norway                           Oman 
##                              1                              2 
##                       Pakistan                         Panama 
##                              3                              4 
##                       Paraguay                           Peru 
##                              4                              2 
##                    Philippines                         Poland 
##                              2                              4 
##                       Portugal                          Qatar 
##                              1                              5 
##                        Romania                         Russia 
##                              2                              2 
##                         Rwanda                          Samoa 
##                              3                              4 
##                   Saudi Arabia                        Senegal 
##                              2                              3 
##                         Serbia                     Seychelles 
##                              4                              4 
##                   Sierra Leone                      Singapore 
##                              3                              5 
##                Slovak Republic                       Slovenia 
##                              4                              4 
##                Solomon Islands                   South Africa 
##                              4                              3 
##                    South Korea                          Spain 
##                              4                              1 
##                      Sri Lanka St. Vincent and the Grenadines 
##                              2                              4 
##                          Sudan                       Suriname 
##                              3                              4 
##                         Sweden                    Switzerland 
##                              1                              1 
##                     Tajikistan                       Tanzania 
##                              2                              3 
##                       Thailand                    Timor-Leste 
##                              4                              3 
##                           Togo                          Tonga 
##                              3                              2 
##                        Tunisia                         Turkey 
##                              4                              2 
##                   Turkmenistan                         Uganda 
##                              2                              3 
##                        Ukraine           United Arab Emirates 
##                              4                              5 
##                 United Kingdom                  United States 
##                              1                              1 
##                        Uruguay                     Uzbekistan 
##                              2                              2 
##                        Vanuatu                      Venezuela 
##                              4                              2 
##                        Vietnam                          Yemen 
##                              4                              3 
##                         Zambia 
##                              3 
## 
## Within cluster sum of squares by cluster:
## [1]  46.96944 104.32294 260.44689 120.43834 115.75637
##  (between_SS / total_SS =  56.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The ratio between_SS / total_SS measures how much of the total variance is accounted for by the separation between clusters. A good K-means solution minimises the within-cluster sum of squares (making clusters compact) while maximising the between-cluster sum of squares (separating clusters from each other), so a higher ratio indicates that the clustering has captured more of the underlying patterns in the data.

The percentage of variance explained is calculated as (between_SS / total_SS) * 100. In this case it is 56.6%, meaning that 56.6% of the total variance in the data is accounted for by the separation of the clusters. This value may be considered good enough.
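
The ratio can be reproduced from the definitions of the sums of squares. The sketch below does this in base R on the built-in USArrests data (a stand-in, not the country data) and compares the result with the components stored in the kmeans object.

```r
set.seed(123)
x <- scale(USArrests)            # built-in stand-in for country_scaled
km <- kmeans(x, centers = 3)

# Total SS: squared distances of all points to the overall mean
total_ss <- sum(sweep(x, 2, colMeans(x))^2)

# Within SS: squared distances of all points to their own cluster centre
within_ss <- sum((x - km$centers[km$cluster, ])^2)

# Between SS is what remains; its share is the percentage of variance explained
between_ss <- total_ss - within_ss
pct_explained <- 100 * between_ss / total_ss

print(round(pct_explained, 1))
print(round(100 * km$betweenss / km$totss, 1))  # same value from the kmeans object
```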

country_clust <- country_clean %>% bind_cols(cluster = as.factor(clust$cluster)) %>% 
    select(cluster, 1:10)

country_clust 

6.3 Clustering Analysis

In this section, we will analyse the characteristics of each cluster and see whether the clusters differ or show specific traits. First, let’s examine life_expec and income:

country_clust %>% mutate(cluster = cluster) %>% ggplot(aes(life_expec, income, 
    color = cluster)) + geom_point(alpha = 0.5) + geom_mark_hull() + scale_color_brewer(palette = "Set1") + 
    theme_minimal() + theme(legend.position = "top")

The analysis of the clusters based on “life_expec” (Life Expectancy) and “income” (Net Income per person) reveals distinct patterns in the socio-economic and health conditions of the countries within each group.

  • Cluster 1 represents countries with high life expectancy and high income, indicating that they are relatively affluent and have better healthcare and living conditions. These countries are likely to have well-established healthcare systems and higher standards of living.

  • Cluster 2 consists of countries with moderate life expectancy and moderate income. These countries may face some challenges in healthcare and development but still maintain moderate living standards.

  • Cluster 3 includes countries with low life expectancy and low income, suggesting they may have inadequate healthcare facilities and lower socio-economic development. These countries may face significant health and economic challenges.

  • Cluster 4 represents countries with moderate life expectancy and moderate income levels. These countries are in an intermediate position between the high and low-income clusters in terms of life expectancy and income.

  • Cluster 5 consists of countries with high life expectancy and high income, indicating that they are highly developed and prosperous. These countries likely have robust healthcare systems and high living standards.

Next, we will analyse the clusters using the centroid of each cluster:

country_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% mutate_if(is.numeric, 
    .funs = "round", digits = 2) %>% select(1:10)

Based on the cluster centroids above, we can observe that:

  • Cluster 1 : Contains high-income countries with low child mortality, high life expectancy, and moderate levels of exports and imports.

  • Cluster 2 : Represents countries with moderate socio-economic indicators. Although the child mortality rate and inflation are relatively higher compared to other clusters, the other variables take moderate values. These countries might face some developmental difficulties and are not as economically developed as those in Cluster 1 and Cluster 5.

  • Cluster 3 : Includes low-income countries with high child mortality, low life expectancy, and low health spending.

  • Cluster 4 : Represents countries with moderate socio-economic indicators. The countries in this cluster have relatively higher levels of exports and imports, suggesting they have good international trade connections. Their socio-economic status lies between that of Cluster 1 (higher socio-economic status) and Cluster 3 (lower socio-economic status).

  • Cluster 5 : Represents high-income countries with high export and import levels, low child mortality, and high life expectancy.

These clusters provide valuable insights into the socio-economic characteristics of different groups of countries. The analysis can help policymakers, researchers, and development organizations in understanding the diverse challenges and opportunities faced by countries in different clusters and tailoring appropriate policies and interventions accordingly.

7 Principal Component Analysis (PCA)

7.1 Dimensionality Reduction

In this section, we will apply PCA to the dataset. First we will examine the eigenvalues, which measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs; that is, the first PCs correspond to the directions with the maximum amount of variation in the data set.

country_pca <- PCA(country_clean, scale.unit = T, ncp = 9, graph = T)

summary(country_pca)
## 
## Call:
## PCA(X = country_clean, scale.unit = T, ncp = 9, graph = T) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               4.136   1.546   1.170   0.995   0.661   0.224   0.113
## % of var.             45.952  17.182  13.004  11.053   7.340   2.484   1.260
## Cumulative % of var.  45.952  63.133  76.138  87.191  94.531  97.015  98.276
##                        Dim.8   Dim.9
## Variance               0.088   0.067
## % of var.              0.981   0.743
## Cumulative % of var.  99.257 100.000
## 
## Individuals (the 10 first)
##                         Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## Afghanistan         |  3.230 | -2.913  1.229  0.814 |  0.096  0.004  0.001 |
## Albania             |  1.473 |  0.430  0.027  0.085 | -0.588  0.134  0.160 |
## Algeria             |  1.664 | -0.285  0.012  0.029 | -0.455  0.080  0.075 |
## Angola              |  3.900 | -2.932  1.245  0.565 |  1.696  1.113  0.189 |
## Antigua and Barbuda |  1.415 |  1.034  0.155  0.533 |  0.137  0.007  0.009 |
## Argentina           |  2.223 |  0.022  0.000  0.000 | -1.779  1.226  0.641 |
## Armenia             |  1.719 | -0.102  0.001  0.003 | -0.568  0.125  0.109 |
## Australia           |  3.405 |  2.342  0.794  0.473 | -1.988  1.531  0.341 |
## Austria             |  3.341 |  2.974  1.280  0.792 | -0.735  0.209  0.048 |
## Azerbaijan          |  1.581 | -0.181  0.005  0.013 | -0.403  0.063  0.065 |
##                      Dim.3    ctr   cos2  
## Afghanistan         -0.718  0.264  0.049 |
## Albania             -0.333  0.057  0.051 |
## Algeria              1.222  0.763  0.539 |
## Angola               1.525  1.190  0.153 |
## Antigua and Barbuda -0.226  0.026  0.025 |
## Argentina            0.870  0.387  0.153 |
## Armenia              0.242  0.030  0.020 |
## Australia            0.190  0.019  0.003 |
## Austria             -0.520  0.138  0.024 |
## Azerbaijan           0.867  0.385  0.301 |
## 
## Variables
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## child_mort          | -0.853 17.600  0.728 |  0.240  3.720  0.058 | -0.032
## exports             |  0.577  8.060  0.333 |  0.762 37.597  0.581 |  0.157
## health              |  0.307  2.275  0.094 | -0.302  5.909  0.091 | -0.645
## imports             |  0.328  2.608  0.108 |  0.835 45.134  0.698 | -0.324
## income              |  0.810 15.876  0.657 |  0.028  0.051  0.001 |  0.326
## inflation           | -0.393  3.732  0.154 | -0.010  0.007  0.000 |  0.695
## life_expec          |  0.866 18.134  0.750 | -0.277  4.960  0.077 |  0.123
## total_fer           | -0.821 16.300  0.674 |  0.193  2.410  0.037 |  0.021
## gdpp                |  0.798 15.417  0.638 | -0.057  0.212  0.003 |  0.133
##                        ctr   cos2  
## child_mort           0.087  0.001 |
## exports              2.096  0.025 |
## health              35.597  0.417 |
## imports              8.996  0.105 |
## income               9.093  0.106 |
## inflation           41.283  0.483 |
## life_expec           1.298  0.015 |
## total_fer            0.038  0.000 |
## gdpp                 1.512  0.018 |
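
When the variables are standardised, the eigenvalues of a PCA are the eigenvalues of the correlation matrix, and they sum to the number of variables (here the nine variances sum to 9). The sketch below checks this relationship in base R with prcomp() on the built-in USArrests data; it is an illustration and does not use FactoMineR.

```r
x <- USArrests                        # built-in stand-in with four numeric variables

# Eigenvalues of the correlation matrix
eig_cor <- eigen(cor(x))$values

# Variances of the principal components from a standardised PCA
pca <- prcomp(x, scale. = TRUE)
eig_pca <- pca$sdev^2

print(round(eig_cor, 3))
print(round(eig_pca, 3))
print(sum(eig_cor))                   # equals the number of variables
```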

The percentage of variance captured by each dimension can be visualised as follows:

fviz_eig(country_pca, ncp = 9,choice = "variance", addlabels = T, main = "Variance explained by each dimensions")

We keep about 87% of the information in our data by using only four dimensions (a loss of about 12.8% of the variance). This means that we can reduce the number of features in our dataset from nine to just four numeric features.
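
The cumulative percentages can be reproduced from the eigenvalues reported in the PCA summary. In the sketch below the eigenvalues are copied from that table; the 80% target is an illustrative threshold, and the variable names are introduced only for this example.

```r
# Eigenvalues (variances) of the nine dimensions, copied from the PCA summary
eigenvalues <- c(4.136, 1.546, 1.170, 0.995, 0.661, 0.224, 0.113, 0.088, 0.067)

# Share of variance per dimension and its running total
pct_var <- 100 * eigenvalues / sum(eigenvalues)
cum_pct <- cumsum(pct_var)

# Smallest number of dimensions whose cumulative share reaches the target
target <- 80   # illustrative threshold
n_keep <- which(cum_pct >= target)[1]

print(round(cum_pct, 1))
print(n_keep)
```

Three dimensions reach only about 76% of the variance, so four are needed to pass the 80% mark.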

We can extract the values of PC1 to PC4 for all observations and put them into a new data frame. This data frame can later be analysed using supervised learning classification techniques or used for other purposes.

country_inpca <- data.frame(country_pca$ind$coord[, 1:4]) %>% bind_cols(cluster = as.factor(clust$cluster)) %>% 
    select(cluster, 1:4)
country_inpca

7.2 Clustering with PCA

By integrating PCA with K-means clustering, we can create more informative visualizations, making it easier to interpret the clustering results. The reduced-dimensional visualization can also help identify potential overlaps or separations between clusters.

fviz_cluster(object = clust, data = country_clean, labelsize = 0) + theme_minimal()

We can add a third dimension to the visualization above in order to reduce the overlap between clusters seen in the 2D plot.

# One colour per cluster (five clusters in total)
plot_ly(country_inpca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("black", 
    "red", "green", "blue", "purple")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"), 
    yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))

8 Conclusion

The conclusions that can be drawn from this report are:

  • Applying K-means clustering to the country dataset has grouped the countries into distinct clusters based on their socio-economic indicators. The clustering has provided valuable insights into the diversity of socio-economic conditions across different groups of countries.

  • Each cluster represents a set of countries with specific socio-economic profiles. Cluster 1 comprises high-income and highly developed countries with low child mortality, high life expectancy, and moderate trade activities. Cluster 2 consists of countries with moderate socio-economic indicators but facing challenges in terms of inflation and health spending. Cluster 3 represents low-income countries with high child mortality, low life expectancy, and limited economic development. Cluster 4 includes countries with moderate socio-economic indicators and relatively higher trade activities. Cluster 5 represents high-income countries with high trade activities, low child mortality, and high life expectancy.

  • We can reduce our dimensions from nine features to four dimensions and still retain more than 80% of the variance using PCA. The dimensionality reduction can be useful if we use the resulting principal components as inputs for machine learning applications.

  • To enhance the project, further analysis can be conducted to explore the relationships between different variables within each cluster. Additionally, considering the impact of other socio-economic indicators and external factors on the clusters could provide a more comprehensive understanding of the countries’ socio-economic status.