Welcome to my Rmd. I created it to improve my understanding of unsupervised machine learning.
This dataset contains information about the socio-economic and health factors that determine the overall development of a country.
Column insights:
1. country : Name of the country; 167 countries are listed
2. child_mort : Death of children under 5 years of age per 1000 live births
3. exports : Exports of goods and services per capita, given as a percentage of the GDP per capita
4. health : Total health spending per capita, given as a percentage of the GDP per capita
5. imports : Imports of goods and services per capita, given as a percentage of the GDP per capita
6. income : Net income per person
7. inflation : The measurement of the annual growth rate of the total GDP
8. life_expec : The average number of years a newborn child would live if the current mortality patterns remain the same
9. total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same
10. gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.
You may download the dataset from Kaggle: https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data
HELP International has been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using the socio-economic and health factors that determine their overall development, and then to suggest the countries the CEO needs to focus on the most.
1. Import the necessary libraries
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(arsenal)
library(tidyverse)
2. Read the dataset
country <- read.csv("Country-data.csv")
head(country)
1. Check missing values
colSums(is.na(country))
## country child_mort exports health imports income inflation
## 0 0 0 0 0 0 0
## life_expec total_fer gdpp
## 0 0 0
The inspection with colSums(is.na()) shows that there are no missing values in any column.
2. Check data type
glimpse(country)
## Rows: 167
## Columns: 10
## $ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua and~
## $ child_mort <dbl> 90.2, 16.6, 27.3, 119.0, 10.3, 14.5, 18.1, 4.8, 4.3, 39.2, ~
## $ exports <dbl> 10.0, 28.0, 38.4, 62.3, 45.5, 18.9, 20.8, 19.8, 51.3, 54.3,~
## $ health <dbl> 7.58, 6.55, 4.17, 2.85, 6.03, 8.10, 4.40, 8.73, 11.00, 5.88~
## $ imports <dbl> 44.9, 48.6, 31.4, 42.9, 58.9, 16.0, 45.3, 20.9, 47.8, 20.7,~
## $ income <int> 1610, 9930, 12900, 5900, 19100, 18700, 6700, 41400, 43200, ~
## $ inflation <dbl> 9.440, 4.490, 16.100, 22.400, 1.440, 20.900, 7.770, 1.160, ~
## $ life_expec <dbl> 56.2, 76.3, 76.5, 60.1, 76.8, 75.8, 73.3, 82.0, 80.5, 69.1,~
## $ total_fer <dbl> 5.82, 1.65, 2.89, 6.16, 2.13, 2.37, 1.69, 1.93, 1.44, 1.92,~
## $ gdpp <int> 553, 4090, 4460, 3530, 12200, 10300, 3220, 51900, 46900, 58~
The data type of each column is already suitable. The country column could stay as character, but it can also be stored as a factor. In this case, let’s convert the country column to a factor.
country <- country %>%
mutate(country = as.factor(country))
glimpse(country$country)
## Factor w/ 167 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
3. Check the data distribution
summary(country)
## country child_mort exports health
## Afghanistan : 1 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Albania : 1 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Algeria : 1 Median : 19.30 Median : 35.000 Median : 6.320
## Angola : 1 Mean : 38.27 Mean : 41.109 Mean : 6.816
## Antigua and Barbuda: 1 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Argentina : 1 Max. :208.00 Max. :200.000 Max. :17.900
## (Other) :161
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
##
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
##
From the data distribution above, it is very sad to see that there is a huge gap in welfare between countries.
Furthermore, the summary shows that the data need to be scaled, since the columns have very different ranges of values. Scaling puts every column on the same standard scale. This is important because a column measured on a larger scale has a larger variance and covariance, so it would dominate the analysis and bias the result.
4. Data Scaling
Since the first column is not numeric (it holds the country names), it must be excluded before the data are passed to the scale() function. If any non-numeric column is passed to scale(), an error will occur.
country_scale <- country %>%
mutate(across(2:10, ~ c(scale(.x))))
summary(country_scale)
## country child_mort exports
## Afghanistan : 1 Min. :-0.8845 Min. :-1.4957
## Albania : 1 1st Qu.:-0.7444 1st Qu.:-0.6314
## Algeria : 1 Median :-0.4704 Median :-0.2229
## Angola : 1 Mean : 0.0000 Mean : 0.0000
## Antigua and Barbuda: 1 3rd Qu.: 0.5909 3rd Qu.: 0.3736
## Argentina : 1 Max. : 4.2086 Max. : 5.7964
## (Other) :161
## health imports income inflation
## Min. :-1.8223 Min. :-1.9341 Min. :-0.8577 Min. :-1.1344
## 1st Qu.:-0.6901 1st Qu.:-0.6894 1st Qu.:-0.7153 1st Qu.:-0.5649
## Median :-0.1805 Median :-0.1483 Median :-0.3727 Median :-0.2263
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6496 3rd Qu.: 0.4899 3rd Qu.: 0.2934 3rd Qu.: 0.2808
## Max. : 4.0353 Max. : 5.2504 Max. : 5.5947 Max. : 9.1023
##
## life_expec total_fer gdpp
## Min. :-4.3242 Min. :-1.1877 Min. :-0.69471
## 1st Qu.:-0.5910 1st Qu.:-0.7616 1st Qu.:-0.63475
## Median : 0.2861 Median :-0.3554 Median :-0.45307
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.7021 3rd Qu.: 0.6157 3rd Qu.: 0.05924
## Max. : 1.3768 Max. : 3.0003 Max. : 5.02140
##
Comparing the summary of object country with that of country_scale, there is no longer a huge gap between the ranges of the columns in country_scale. Every column now has mean 0 and standard deviation 1, although the minimum and maximum still differ from column to column. Hopefully this will give a better result when modeling.
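As a quick check on what scale() does, here is a minimal sketch on a small made-up data frame (the values are hypothetical and not taken from Country-data.csv). It suggests that scaling equalizes the mean and standard deviation of the columns, not their minimum and maximum.

```r
# Toy data (hypothetical values, only for illustration)
toy <- data.frame(
  income = c(1600, 9900, 12900, 41400, 43200),
  health = c(7.6, 6.5, 4.2, 8.7, 11.0)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

round(colMeans(toy_scaled), 10)  # means of the scaled columns
apply(toy_scaled, 2, sd)         # standard deviations of the scaled columns
sapply(toy_scaled, range)        # minimum and maximum still differ by column
```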
PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity because smaller data sets are easier to explore and visualize and make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process.
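To make the trade-off between simplicity and accuracy concrete, here is a small sketch. It uses the USArrests data that ships with base R as a stand-in (an assumption for illustration, it is not the country data) and measures how much variance survives when only two of the four components are kept.

```r
# USArrests comes with base R (datasets package); used here only as a stand-in
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each component
var_share <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
round(var_share, 3)

# Keep two of the four components and check how much information survives
kept <- pca_demo$x[, 1:2]
dim(kept)
sum(var_share[1:2])
```

The fewer components are kept, the smaller the data become and the more variance is given up.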
PCA can be performed in R with the function prcomp().
Because prcomp() only accepts numeric input, the country names are moved into the row names and the country column is dropped before running PCA.
rownames(country_scale) <- country_scale[,"country"]
country_scale <- country_scale %>%
select(-country)
head(country_scale)
country_pca <- prcomp(country_scale)
country_pca
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
The PCA result above can be interpreted as follows:
- The first principal component has positive associations with exports, health, imports, income, life_expec and gdpp, and negative associations with child_mort, inflation and total_fer. The first component can be viewed as a measure of how stable a country is, since a stable country generally scores high on exports, imports, income, health, life_expec and gdpp but low on child_mort, inflation and total_fer.
- Another way to interpret PCA is to examine the coefficients in each column. The larger the absolute value of a coefficient, the more important the corresponding variable is in calculating that component. Take exports as an example: its largest absolute coefficient is on PC7, so most of the information about the exports column is carried by PC7.
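The second reading of the loadings can be sketched in a few lines. This example again uses the built-in USArrests data as a stand-in (an assumption, not the country data); dominant_pc is an illustrative name.

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)
loadings <- pca_demo$rotation

# For every variable, find the component with the largest absolute loading
dominant_pc <- colnames(loadings)[apply(abs(loadings), 1, which.max)]
names(dominant_pc) <- rownames(loadings)
dominant_pc

# Each component is a unit vector, so its squared loadings sum to 1
colSums(loadings^2)
```

The same two lines applied to country_pca$rotation should point to PC7 for exports, matching the reading above.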
Another way to interpret the PCA result is the function biplot(), which helps to observe:
- The overall data distribution on 2 PCs. The goal is to find similar observations and outliers.
- The correlation between variables and their contribution to each PC.
biplot(country_pca, cex = 0.5)
From the plot above, Malta, Singapore and Luxembourg might look like outliers, but in this case no country is treated as an outlier. Another insight from the plot is that columns gdpp and income are strongly correlated, as are columns total_fer and child_mort.
As stated above, the main purpose of PCA is dimensionality reduction. To determine the minimum number of principal components that account for most of the variation in the data, use the function summary().
Function summary() provides three pieces of information:
- Standard deviation: the standard deviation (square root of the variance) captured by each PC.
- Proportion of Variance: the share of information captured by each PC.
- Cumulative Proportion: the cumulative amount of information captured from PC1 to PC9.
summary(country_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
## PC8 PC9
## Standard deviation 0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion 0.99257 1.00000
The number of PCs to keep depends on how much information is required. If this project requires at least 80% of the information, PC1 to PC4 are enough, since together they capture 87.19% of the variance.
country_selected_pca <- data.frame(country_pca$x[,1:4])
head(country_selected_pca)
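The choice of four components can also be derived programmatically instead of being read off the table. This is a minimal sketch; n_components_for() is a hypothetical helper name, and the standard deviations are copied from the prcomp output printed earlier.

```r
# Hypothetical helper: smallest number of components whose cumulative
# proportion of variance reaches the target
n_components_for <- function(pca, target = 0.80) {
  cum_share <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cum_share >= target)[1]
}

# Standard deviations copied from the country_pca output shown earlier
sdev_country <- c(2.0336314, 1.2435217, 1.0818425, 0.9973889, 0.8127847,
                  0.4728437, 0.3368067, 0.2971790, 0.2586020)

n_components_for(list(sdev = sdev_country), target = 0.80)  # 4
```

With the real object, n_components_for(prcomp_result) works the same way, because a prcomp result also stores its standard deviations in $sdev.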
Once the PCs that summarize the required information are selected, they can be combined with the initial data and used for further analysis.
country_pca <- country %>%
select_if(purrr::negate(is.numeric)) %>%
cbind(country_selected_pca)
glimpse(country_pca)
## Rows: 167
## Columns: 5
## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua and Ba~
## $ PC1 <dbl> -2.90428986, 0.42862224, -0.28436983, -2.92362976, 1.03047668,~
## $ PC2 <dbl> -0.09533386, 0.58639208, 0.45380957, -1.69047094, -0.13624894,~
## $ PC3 <dbl> 0.7159652, 0.3324855, -1.2178421, -1.5204709, 0.2250441, -0.86~
## $ PC4 <dbl> -1.00224038, 1.15757715, 0.86551146, -0.83710739, 0.84452276, ~
country_final <- country_pca %>%
select(-country)
head(country_final)
1. Finding The Best K-Value For Clustering With PCA Value
To determine which countries are in the direst need of aid, a K-Means model might help. K-Means is a machine learning model that groups data based on similar characteristics.
The K-Value, that is, the number of groups desired in the final result, must be determined in advance, since K-Means needs it to group the data. There are several ways to find the optimum K-Value.
- Elbow Method
One of the best-known methods is the Elbow Method. It finds the optimum K-Value by inspecting the plot produced by function fviz_nbclust(), which shows the WSS (Within Sum of Squares) for each K.
WSS measures the variation that exists within each group: a higher WSS indicates that the observations in a cluster are spread far from their cluster mean, while a lower WSS indicates that they lie close to it. It follows that the optimum K-Value is the point after which increasing K no longer results in a considerable decrease of the total within sum of squares.
fviz_nbclust(country_final, kmeans, method = "wss") +
labs(subtitle = "Elbow Method With PCA Value")
The plot shows that 3 is the optimum number of K, since after k = 3 increasing K does not result in a considerable decrease of the total within sum of squares.
Another way to describe this is to choose the number of clusters at the “bend of the elbow”. This reading is somewhat subjective, because where the curve looks like an elbow depends on each person’s opinion, and opinions may differ.
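To make the elbow less dependent on the eye, the WSS values can be computed by hand. Here is a minimal sketch on simulated data with three well-separated groups (the data are an assumption for illustration, not the country data).

```r
set.seed(100)
# Simulated data: three groups of 30 points each in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Decrease in WSS gained by each additional cluster
data.frame(k = 2:6, decrease = -diff(wss))
```

The elbow is the k after which the decrease column becomes small compared with the earlier rows.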
- Silhouette Method
The second method is the Silhouette Method, which uses the same function as the Elbow Method. Function fviz_nbclust() visualizes the average silhouette width, a measure of how close each point in one cluster is to the points in the neighboring clusters, and so provides a visual way to assess parameters such as the number of clusters.
The K-Value with the highest average silhouette width, that is, the peak of the plot, is the optimal K-Value, because at that point the clusters are the most clearly separated from each other.
fviz_nbclust(country_final, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette Method With PCA Value")
The plot shows that 4 is the optimum number of K, since after k = 4 the average silhouette width decreases.
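The average silhouette width can also be computed directly from its definition. The sketch below is only an illustration on simulated data with three groups (an assumption, not the country data); avg_silhouette() is a hypothetical helper, not a function from factoextra.

```r
# Hypothetical helper: average silhouette width for a given clustering
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ids <- unique(cluster)
  widths <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)            # singleton cluster: width set to 0
    a <- sum(d[i, own]) / (sum(own) - 1)    # mean distance to own cluster, self excluded
    others <- setdiff(ids, cluster[i])
    b <- min(sapply(others, function(g) mean(d[i, cluster == g])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(widths)
}

set.seed(100)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# Average silhouette width for k = 2..6; the peak marks the suggested k
sil <- sapply(2:6, function(k) {
  avg_silhouette(toy, kmeans(toy, centers = k, nstart = 25)$cluster)
})
names(sil) <- 2:6
round(sil, 3)
```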
- Gap Statistic
The last method is the Gap Statistic, which can also be visualized with function fviz_nbclust(). The basic idea is to compare the within-cluster variation for each K with the variation expected when the data have no cluster structure; the larger the gap, the better the clustering. The optimal K-Value is read from the plot as the first K at which the gap statistic reaches its highest value before dropping.
Based on the gap statistic plot below, the optimal k is 3.
fviz_nbclust(country_final, kmeans, method = "gap_stat") +
labs(subtitle = "Gap Statistic method With PCA Value")
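The gap statistic can also be computed without a plot. This is a hedged sketch that uses clusGap() and maxSE() from the cluster package on simulated data (an assumption, not the country data). The selection rule "firstSEmax" is chosen here for illustration; whether it matches the rule behind the plot above is not confirmed in this document.

```r
library(cluster)  # recommended package, normally installed together with R

set.seed(100)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# Gap statistic for k = 1..6, using B = 50 reference data sets
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 25, verbose = FALSE)

# Pick k from the gap values and their simulation standard errors
k_hat <- maxSE(f = gap$Tab[, "gap"], SE.f = gap$Tab[, "SE.sim"],
               method = "firstSEmax")
k_hat
```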
Of the three methods above, two suggest that k = 3 is the optimum number of clusters, so the data will be divided into 3 clusters.
Disclaimer: the number of clusters does not have to come from the 3 methods above. It can also be determined by the business question or by mutual agreement.
2. Clustering With PCA Value
set.seed(100)
km_pca <- kmeans(country_final, centers = 3)
country_final$cluster <- km_pca$cluster
unique(country_final$cluster)
## [1] 3 1 2
#Drop the cluster column so that only PC1-PC4 are used to draw the plot
cluster_pca <- fviz_cluster(km_pca, data = select(country_final, -cluster)) +
labs(subtitle = "K-Means With PCA & K-Value = 3")
cluster_pca
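K-Means starts from randomly chosen centers, which is why set.seed() is called above, and the cluster numbers (1, 2, 3) are arbitrary labels without any ranking. A common safeguard, sketched here on simulated data (an assumption, not the country data), is the nstart argument of kmeans(), which tries several random starts and keeps the best one.

```r
set.seed(100)
toy <- matrix(rnorm(400), ncol = 4)  # simulated data: 100 rows, 4 columns

set.seed(100)
single <- kmeans(toy, centers = 3)               # one random start
set.seed(100)
multi <- kmeans(toy, centers = 3, nstart = 25)   # best of 25 random starts

# Compare the total within-cluster sum of squares of both runs
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

A lower total within-cluster sum of squares means a tighter clustering, so the run with the smaller value is preferred.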
The purpose of profiling is to understand the characteristics of each cluster, in this case to understand which cluster of countries is in the direst need of aid.
To find out the characteristics of each cluster, the values in each column can be averaged per cluster.
country_final %>%
group_by(cluster) %>%
summarise_all(mean)
Instead of profiling with the PCA values, it is better to assign the clusters to the scaled values of the original columns, which are easier to interpret.
#Assign into new object
country_profile <- country_scale
#Assign cluster result into the new object
country_profile$cluster <- km_pca$cluster
country_profile %>%
group_by(cluster) %>%
summarise_all(mean)
Profiling:
Cluster 1:
- From an economic point of view, the population of cluster 1 countries does not have a good economic level, as seen from the negative average values in columns income and gdpp and the positive average value in column inflation. Furthermore, cluster 1 countries are in an unfavorable position on the industrial side, since columns exports and imports have negative average values.
- From a health point of view, this is very sad, because cluster 1 countries have a high positive average value in column child_mort and negative averages in columns health and life_expec.
Cluster 2:
- From an economic point of view, the population of cluster 2 countries has a good economic level, as seen from the positive average values in columns income and gdpp and the negative average value in column inflation. Other than that, cluster 2 can be described as developed countries, since columns exports and imports have high positive average values.
- From a health point of view, cluster 2 countries have a strongly negative average in column child_mort, but the average in column health is also negative.
Cluster 3:
- From an economic point of view, most of the population of cluster 3 countries does not have a good economic level, as seen from the negative average values in columns income and gdpp. Furthermore, cluster 3 countries are in an unfavorable industrial position, since column exports has a negative average value, but fortunately the average for imports is still positive.
- From a health point of view, cluster 3 countries have the lowest average value in column health, but fortunately the averages in columns child_mort and life_expec show a good result.
Based on the cluster profiling, the countries in cluster 1 need aid the most compared to the countries in clusters 2 and 3. To determine which countries in cluster 1 should receive donations first, find the countries whose values fall beyond the cluster 1 profiling averages, which are used as thresholds in the filter below.
country_profile %>%
filter(child_mort > 1.32,
exports < -0.42,
health < -0.13,
imports < -0.15,
income < -0.68,
inflation < 0.39,
life_expec < -1.27,
total_fer > 1.35,
gdpp < -0.60)
From the results above, there are 2 countries whose values all fall beyond the average profiling values for cluster 1.
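The thresholds in the filter above were typed in by hand from the profiling table. As an alternative, here is a minimal sketch that derives them from the cluster averages. It uses a tiny made-up data frame (hypothetical values and country labels A to E) and only two columns to keep the idea visible.

```r
# Toy scaled data with a cluster column (hypothetical values)
toy <- data.frame(
  child_mort = c(1.8, 1.1, 1.5, -0.5, -0.7),
  income     = c(-0.9, -0.6, -0.7, 0.8, 1.2),
  cluster    = c(1, 1, 1, 2, 2),
  row.names  = c("A", "B", "C", "D", "E")
)

# Averages of cluster 1 serve as the thresholds
cl1 <- toy[toy$cluster == 1, ]
avg <- colMeans(cl1[, c("child_mort", "income")])

# Worse than the cluster 1 average: higher child mortality AND lower income
worst <- cl1[cl1$child_mort > avg["child_mort"] & cl1$income < avg["income"], ]
worst
```

In this toy example only country A passes both conditions, which mirrors how few countries pass when all nine conditions must hold at once.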
The distribution of donation funds can also be decided per segment, based on whether the economic or the health segment is most in need of aid. Let’s split the columns into those that are suitable as assessment parameters for each segment.
- Economic Sector
Columns income, exports, imports and gdpp must be lower than the average profiling values for cluster 1.
#Parameter filter
economic <- country_profile %>%
filter(exports < -0.42,
imports < -0.15,
income < -0.68,
gdpp < -0.60) %>%
select(income,exports,imports, gdpp)
economic
From the results above alone, it is quite difficult to sort which countries need the most aid in the economic sector, even though all 18 countries have values for columns income, exports, imports and gdpp below the cluster 1 profiling averages. It would be wise to give aid first to the countries with the lowest values.
To help figure out which countries will receive aid first, let’s visualize it.
#Change negative values into positive ones, for the sake of visualization
economic <- abs(economic)
#Move the country names from the row names into a column, for the sake of visualization
economic <- tibble::rownames_to_column(economic, "country")
head(economic)
Use function pivot_longer() to stack income, exports, imports and gdpp into one column, so that the visualization is easier to interpret.
eco_piv_long <- pivot_longer(data = economic,
cols = c("income", "exports", "imports", "gdpp"))
head(eco_piv_long,8)
ggplot(data = eco_piv_long, aes(x = value, y = reorder(country, value))) +
geom_col(aes(fill = name),position = "dodge") +
scale_x_continuous(labels = scales::comma,
expand = c(0,0),
breaks = seq(0, 2.5, 0.25)) +
labs(title = "Most Needed Aid Country In Economic Segment",
subtitle = "Comparison between Income, Exports, Imports & GDP",
x = "Value",
y = "Country",
color = "") +
theme_bw() +
theme(legend.position = "bottom",
legend.title = element_blank())
- Health Sector
Column child_mort must be higher than, and columns health and life_expec lower than, the average profiling values for cluster 1.
#Parameter filter
health <- country_profile %>%
filter(child_mort > 1.32,
health < -0.13,
life_expec < -1.27) %>%
select(child_mort, health, life_expec)
health
Let’s visualize the comparison between child_mort, health and life_expec to find out which countries are in the direst need of aid in the health sector.
#Change negative values into positive ones, for the sake of visualization
health <- abs(health)
#Move the country names from the row names into a column, for the sake of visualization
health <- tibble::rownames_to_column(health, "country")
health
Use function pivot_longer() to stack child_mort, health and life_expec into one column, so that the visualization is easier to interpret.
health_piv_long <- pivot_longer(data = health,
cols = c("child_mort", "health", "life_expec"))
head(health_piv_long,9)
ggplot(data = health_piv_long, aes(x = value, y = reorder(country, value))) +
geom_col(aes(fill = name),position = "dodge") +
scale_x_continuous(labels = scales::comma,
expand = c(0,0),
breaks = seq(0, 4, 0.25)) +
labs(title = "Most Needed Aid Country In Health Segment",
subtitle = "Comparison between Child Mortality, Health & Life Expectancy",
x = "Value",
y = "Country",
color = "") +
theme_bw() +
theme(legend.position = "bottom",
legend.title = element_blank())
The K-Value, that is, the number of groups desired in the final result, must be determined in advance, since K-Means needs it to group the data based on similar characteristics. There are several ways to find the optimum K-Value, such as the Elbow Method, the Silhouette Method and the Gap Statistic Method. It is important to apply several methods, because, as this case shows, the K-Value can differ from one method to another. Even though the results differ, the final suggestion is K-Value = 3, since two out of three methods produced that result.
Once the ideal number of groups is known, the characteristics of the countries in each cluster can be determined by averaging the values in each column, a method called Cluster Profiling. The cluster profiling shows that the countries in cluster 1 are the ones most in need of aid compared to the countries in cluster 2 and cluster 3.
The profiling shows that cluster 1 countries really need aid in both the economic and the health sectors.
- From the economic side, the averages of columns exports, imports and gdpp are negative, which can be interpreted as weak or even stagnant economic development in cluster 1 countries.
- From the health side, the results are very sad, because the mortality rate of children under 5 is very high and life expectancy is low, meaning that a large part of the population does not live a long life. This may be related to the poor health figures.
There are 2 ways to select which countries in Cluster 1 to assist:
1. Allocate all funds to the countries whose values all fall beyond the cluster 1 profiling averages.
The filter results above show that 2 countries meet this condition, namely Cameroon & Central African Republic.
2. Prioritize based on economic and health urgency parameters.
If the first method is used, only 2 countries can be assisted, while there may still be many countries whose economic indicators are better than the cluster 1 profiling average but whose health indicators are worse, or vice versa. From the computations above, a total of 18 countries need assistance in the economic segment and 7 countries need assistance in the health segment. Aid can be prioritized starting from the country at the top of each plot: for the economic segment, Eritrea, then the Central African Republic, then Sudan, and so on; for the health segment, the Central African Republic, then Chad, then Niger, and so on.
Conclusion: Of the two methods above, the second is the better one, because more countries will receive aid and the aid can be targeted more precisely, based on which segment needs it and how urgently the country needs assistance. The countries in the first to third positions of each segment will be the earliest to receive aid.
The list of countries that will receive aid first in each segment: