The objective of this work is to categorize the countries using socio-economic and health factors that determine the overall development of the country.
HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using socio-economic and health factors that determine the overall development of a country, and then to suggest the countries the CEO should focus on the most.
The dataset consists of 167 countries with information about the socio-economic and health factors that determine the overall development of each country. The main dataset contains 167 rows and 10 columns.
Columns description:
1. country : Name of the country; 167 countries are listed
2. child_mort : (Child mortality) Deaths of children under 5 years of age per 1000 live births
3. exports : Exports of goods and services per capita, given as a percentage of the GDP per capita
4. health : Total health spending per capita, given as a percentage of the GDP per capita
5. imports : Imports of goods and services per capita, given as a percentage of the GDP per capita
6. income : Net income per person
7. inflation : The measurement of the annual growth rate of the total GDP
8. life_expec : (Life expectancy) The average number of years a new born child would live if the current mortality patterns are to remain the same
9. total_fer : (Total fertility) The number of children that would be born to each woman if the current age-fertility rates remain the same
10. gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.
The dataset can be obtained from Kaggle
library(tidyverse)
library(gridExtra)
library(factoextra)
library(FactoMineR)
data <- read.csv("Country-data.csv")
data
Let’s see the data types.
glimpse(data)
## Rows: 167
## Columns: 10
## $ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua and~
## $ child_mort <dbl> 90.2, 16.6, 27.3, 119.0, 10.3, 14.5, 18.1, 4.8, 4.3, 39.2, ~
## $ exports <dbl> 10.0, 28.0, 38.4, 62.3, 45.5, 18.9, 20.8, 19.8, 51.3, 54.3,~
## $ health <dbl> 7.58, 6.55, 4.17, 2.85, 6.03, 8.10, 4.40, 8.73, 11.00, 5.88~
## $ imports <dbl> 44.9, 48.6, 31.4, 42.9, 58.9, 16.0, 45.3, 20.9, 47.8, 20.7,~
## $ income <int> 1610, 9930, 12900, 5900, 19100, 18700, 6700, 41400, 43200, ~
## $ inflation <dbl> 9.440, 4.490, 16.100, 22.400, 1.440, 20.900, 7.770, 1.160, ~
## $ life_expec <dbl> 56.2, 76.3, 76.5, 60.1, 76.8, 75.8, 73.3, 82.0, 80.5, 69.1,~
## $ total_fer <dbl> 5.82, 1.65, 2.89, 6.16, 2.13, 2.37, 1.69, 1.93, 1.44, 1.92,~
## $ gdpp <int> 553, 4090, 4460, 3530, 12200, 10300, 3220, 51900, 46900, 58~
The data types are already appropriate for each column.
First, we check for any missing values or any values that indicate a missing value.
colSums(is.na(data))
##    country child_mort    exports     health    imports     income  inflation
##          0          0          0          0          0          0          0
## life_expec  total_fer       gdpp
##          0          0          0
It seems that there are no missing values. Now, let’s see the summary of the dataset.
summary(data)
## country child_mort exports health
## Length:167 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Class :character 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Mode :character Median : 19.30 Median : 35.000 Median : 6.320
## Mean : 38.27 Mean : 41.109 Mean : 6.816
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Max. :208.00 Max. :200.000 Max. :17.900
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
Insights:
1. There are 167 unique countries
2. Most of the variables have a very wide range, which means that some countries are well developed while others are not.
3. Some variables, such as income, are on a very different scale from the others. Before applying PCA for dimensionality reduction we must scale the data first. Standardization is important because a variable with a larger range has a larger variance and covariance, so it would dominate the principal components and bias the model.
As we now know, our data contains variables with different ranges of values. We will scale the data so that every variable contributes equally to the model and no variable introduces bias because of its units.
data.scaled <- data %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))
summary(data.scaled)
## country child_mort exports health
## Length:167 Min. :-0.8845 Min. :-1.4957 Min. :-1.8223
## Class :character 1st Qu.:-0.7444 1st Qu.:-0.6314 1st Qu.:-0.6901
## Mode :character Median :-0.4704 Median :-0.2229 Median :-0.1805
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5909 3rd Qu.: 0.3736 3rd Qu.: 0.6496
## Max. : 4.2086 Max. : 5.7964 Max. : 4.0353
## imports income inflation life_expec
## Min. :-1.9341 Min. :-0.8577 Min. :-1.1344 Min. :-4.3242
## 1st Qu.:-0.6894 1st Qu.:-0.7153 1st Qu.:-0.5649 1st Qu.:-0.5910
## Median :-0.1483 Median :-0.3727 Median :-0.2263 Median : 0.2861
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4899 3rd Qu.: 0.2934 3rd Qu.: 0.2808 3rd Qu.: 0.7021
## Max. : 5.2504 Max. : 5.5947 Max. : 9.1023 Max. : 1.3768
## total_fer gdpp
## Min. :-1.1877 Min. :-0.69471
## 1st Qu.:-0.7616 1st Qu.:-0.63475
## Median :-0.3554 Median :-0.45307
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6157 3rd Qu.: 0.05924
## Max. : 3.0003 Max. : 5.02140
Now all variables are on the same scale. Some countries appear to be outliers on certain variables, such as life_expec, inflation, and gdpp, where the extreme values lie more than 4 standard deviations from the mean. We will assess later whether to remove or keep the outliers.
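To make the standardization step concrete, the following minimal sketch reproduces by hand what `scale()` does: each column has its mean subtracted and is divided by its standard deviation. The data frame `toy` and its values are made up for illustration and are not rows from the dataset.

```r
# Toy data: two columns on very different scales (illustrative values only)
toy <- data.frame(
  income = c(1000, 5000, 20000, 60000),
  health = c(2.5, 5.0, 7.5, 10.0)
)

# Manual z-score: subtract the column mean, divide by the column standard deviation
z.manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same centering and scaling (it returns a matrix)
z.scale <- as.data.frame(scale(toy))

# Before scaling, income dominates the variance
sapply(toy, var)
# After scaling, every column has variance 1
sapply(z.manual, var)
```

Because both columns end up with unit variance, neither can dominate a variance-based method such as PCA purely because of its units.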
Next, we use the country variable as the row names.
rownames(data.scaled) <- data.scaled[,"country"]
data.scaled <- data.scaled %>% select(-country)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each a linear combination of the variables, containing n observations) form an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
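The definition above can be checked numerically. The sketch below, which uses simulated data rather than the country dataset, computes the principal components from the eigen decomposition of the covariance matrix and compares the result with `prcomp()`. The variable names `x1`, `x2`, `x3` are assumptions made only for this illustration.

```r
set.seed(1)
# Toy data: 50 observations of 3 variables, two of them correlated (simulated)
x1 <- rnorm(50)
x2 <- 0.8 * x1 + rnorm(50, sd = 0.5)
x3 <- rnorm(50)
X <- scale(cbind(x1, x2, x3))   # standardize, since PCA is scale-sensitive

# Eigen decomposition of the covariance matrix of the standardized data
eig <- eigen(cov(X))

# Principal component scores: project the data onto the eigenvectors
scores <- X %*% eig$vectors

# The scores are uncorrelated: off-diagonal covariances are numerically zero,
# and the diagonal holds the eigenvalues in decreasing order
round(cov(scores), 10)

# prcomp() reports the same standard deviations (loadings may differ in sign)
pca <- prcomp(X)
cbind(eigen = sqrt(eig$values), prcomp = pca$sdev)
```

The sign of each loading vector is arbitrary, so two implementations can return mirrored components that describe the same directions.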
PCA can be performed using the function prcomp from the package stats.
pca.data <- prcomp(data.scaled)
pca.data
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
fviz_eig(pca.data, addlabels = TRUE, main = "Variance explained by each dimension")
summary(pca.data)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
## PC8 PC9
## Standard deviation 0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion 0.99257 1.00000
If we tolerate no more than 20% information loss, that is, we want to retain at least 80% of the information, we can use the first 4 principal components (PC1 to PC4). In this case we retain about 87% of the information.
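The choice of 4 components can be reproduced from the standard deviations printed by `prcomp` above. The following sketch copies those values into a vector (the names `sdev`, `prop.var`, `cum.var`, and `n.pc` are introduced here for illustration) and finds the smallest number of components whose cumulative share of variance reaches 80%.

```r
# Standard deviations of the principal components, copied from the prcomp output above
sdev <- c(2.0336314, 1.2435217, 1.0818425, 0.9973889, 0.8127847,
          0.4728437, 0.3368067, 0.2971790, 0.2586020)

# Proportion of variance explained by each component, and the running total
prop.var <- sdev^2 / sum(sdev^2)
cum.var  <- cumsum(prop.var)
round(cum.var, 4)

# Smallest number of components that retains at least 80% of the variance
n.pc <- which(cum.var >= 0.80)[1]
n.pc   # 4
```

With three components the cumulative proportion is about 76%, which falls short of the threshold, so the fourth component is needed.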
country.pca.selected <- data.frame(pca.data$x[,1:4])
head(country.pca.selected)
Once the number of principal components is selected, the scores can be combined with the initial data and used for further analysis.
country.pca <- data %>%
  select_if(negate(is.numeric)) %>%
  cbind(country.pca.selected) %>%
  select(-country)
country.pca %>% head()
Before we do the cluster analysis, we first need to determine the optimal number of clusters. In clustering, we seek to minimize the total within-cluster sum of squares (meaning that the distance between observations in the same cluster is as small as possible). To find the optimal number of clusters, we can use 3 methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters by majority vote.
Choosing the number of clusters using the elbow method is somewhat subjective. The rule of thumb is to choose the number of clusters at the “bend of the elbow”, where the total within-cluster sum of squares starts to level off as the number of clusters increases.
fviz_nbclust(country.pca, kmeans, method = "wss") + labs(subtitle = "Elbow method with PCA")
Arguably, the optimal k is 3, since there is no significant decline in the total within-cluster sum of squares for higher numbers of clusters. However, the elbow is vague, so this method alone may not be enough and we need more methods to determine the best k for our model.
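The quantity behind the elbow plot can also be computed directly. The sketch below runs k-means for several values of k on simulated data (three made-up groups, not the country data) and collects the total within-cluster sum of squares. The setting `nstart = 25` is an assumption chosen here to make the result less dependent on the random starting centers.

```r
set.seed(100)
# Toy data: three well-separated groups in two dimensions (simulated)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
# nstart = 25 reruns k-means from several random starts and keeps the best fit
k.values <- 1:8
wss <- sapply(k.values, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Share of the k = 1 value that remains at each k; the elbow is where it stops dropping sharply
data.frame(k = k.values, wss = round(wss, 1), remaining = round(wss / wss[1], 3))
```

Calling `plot(k.values, wss, type = "b")` draws the same curve that the elbow plot shows.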
The silhouette method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. The optimal number of clusters is the one with the highest average silhouette score (the peak).
fviz_nbclust(country.pca, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method with PCA")
The graph shows that the silhouette score peaks at k = 2, so this method suggests 2 clusters.
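As a rough illustration of the definition, the silhouette coefficient can be written out in a few lines of base R. The helper `avg.silhouette` below is hypothetical (it is not part of factoextra) and is applied to simulated data with two groups; for each observation it compares the mean distance to its own cluster, a, with the mean distance to the nearest other cluster, b, and returns the average of (b - a) / max(a, b).

```r
# Average silhouette width for a given clustering, computed from the definition
avg.silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))              # pairwise Euclidean distances
  ids <- sort(unique(cluster))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)       # convention: a singleton cluster scores 0
    a <- sum(d[i, own]) / (sum(own) - 1)   # mean distance to own cluster, excluding i itself
    b <- min(sapply(ids[ids != cluster[i]], function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
  mean(s)
}

set.seed(100)
# Toy data: two simulated groups
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))

# Average silhouette width for k = 2..6; the largest value marks the suggested k
k.values <- 2:6
sil <- sapply(k.values, function(k) {
  avg.silhouette(toy, kmeans(toy, centers = k, nstart = 25)$cluster)
})
best.k <- k.values[which.max(sil)]
data.frame(k = k.values, silhouette = round(sil, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate observations that sit between clusters.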
The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic.
fviz_nbclust(country.pca, kmeans, method = "gap_stat") + labs(subtitle = "Gap Statistic with PCA method")
The gap statistic agrees with the elbow method that 3 is the optimal number of clusters. So, for this model we will cluster the data into 3 clusters. In terms of country development level, the 3 clusters might correspond to underdeveloped, developing, and developed countries.
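For readers who want the numbers behind the plot, the gap statistic can be computed with `clusGap()` from the cluster package, which normally ships with R as a recommended package. The sketch below uses simulated data, and the settings `K.max = 6`, `B = 50`, and `nstart = 25` are assumptions picked for a quick illustration rather than the settings used in the analysis above.

```r
library(cluster)   # recommended package, normally installed together with R

set.seed(100)
# Toy data: three simulated groups
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# Gap statistic for k = 1..6, using B = 50 simulated reference data sets
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

# Columns: logW (observed), E.logW (expected under the reference), gap, SE.sim
gap$Tab

# k with the largest gap, and k chosen by a more conservative standard-error rule
k.max.gap <- which.max(gap$Tab[, "gap"])
k.rule    <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
c(largest_gap = k.max.gap, first_SE_max = k.rule)
```

The two selection rules can disagree, which is one more reason to compare the gap statistic with the elbow and silhouette results instead of relying on it alone.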
RNGkind(sample.kind = "Rounding")
set.seed(100)
km.pca <- kmeans(country.pca, centers = 3)
# add the cluster as new column
country.pca$cluster <- km.pca$cluster
Visualizing the clusters:
# plot on the PC scores only; the cluster label is not a feature
fviz_cluster(km.pca, data = select(country.pca, -cluster)) +
  labs(subtitle = "K-Means With PCA & K-Value = 3")
outliers <- c("Singapore","Malta","Luxembourg")
data %>% filter(country %in% outliers)
As we can see, Malta, Luxembourg, and Singapore are outliers that lie very far from the other countries. We can exclude these outliers and consider them developed countries, since their indicators are better than those of the average country.
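The outliers above were picked by eye from the cluster plot. As a complement, one possible way to flag such points numerically is to measure how far each observation lies from the centre of the score space. This is a sketch on simulated scores, not the method used in this analysis, and the threshold of 3 robust deviations is an assumption.

```r
set.seed(100)
# Toy scores: 40 ordinary points plus two extreme ones (simulated, not the country data)
scores <- rbind(matrix(rnorm(80), ncol = 2),
                c(8, 9),
                c(-9, 7))
rownames(scores) <- c(paste0("obs", 1:40), "extreme_a", "extreme_b")

# Euclidean distance of every point from the column-wise median (a robust centre)
centre <- apply(scores, 2, median)
dist.centre <- sqrt(rowSums(sweep(scores, 2, centre)^2))

# Flag points whose distance exceeds the median distance by more than 3 MADs
threshold <- median(dist.centre) + 3 * mad(dist.centre)
flagged <- names(dist.centre)[dist.centre > threshold]
flagged
```

A rule like this only suggests candidates; whether a flagged country should be dropped remains a judgment call that depends on the goal of the analysis.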
# remove outliers from data
country.pca2 <- data %>%
  select_if(negate(is.numeric)) %>%
  cbind(country.pca.selected) %>%
  filter(!country %in% outliers) %>%
  select(-country)
# Clustering
RNGkind(sample.kind = "Rounding")
set.seed(100)
km.pca2 <- kmeans(country.pca2, centers = 3)
# add the cluster as new column
country.pca2$cluster <- km.pca2$cluster
# Visualize the new cluster (PC scores only; the cluster label is not a feature)
fviz_cluster(km.pca2, data = select(country.pca2, -cluster)) +
  labs(subtitle = "K-Means With PCA & K-Value = 3")
Now the separation between the clusters is more distinct than before, especially between the developing and developed countries, while the underdeveloped countries sit further apart from the other two clusters.
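Why a handful of extreme points can blur the clusters is easy to reproduce on simulated data. In the sketch below (made-up groups and outliers, not the country data), k-means with k = 3 spends one whole cluster on two far-away points, so two of the real groups are forced to share a cluster. The value `nstart = 100` is an assumption chosen to make the fit stable.

```r
set.seed(100)
# Toy data: three simulated groups plus two far-away points acting as outliers
groups <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                matrix(rnorm(60, mean = 4), ncol = 2),
                matrix(rnorm(60, mean = 8), ncol = 2))
with.outliers <- rbind(groups, c(30, 30), c(32, 28))

# Same k-means setup on both versions of the data
km.with    <- kmeans(with.outliers, centers = 3, nstart = 100)
km.without <- kmeans(groups, centers = 3, nstart = 100)

# With the outliers present, one cluster is spent on the two extreme points
sort(km.with$size)
# Without them, each simulated group gets its own cluster
sort(km.without$size)
```

Comparing the cluster sizes before and after removal is a quick check that the outliers, rather than the bulk of the data, were driving the first solution.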
The purpose of profiling is to understand the characteristics of each cluster, in this case to understand which cluster of countries is in the direst need of aid.
To find the characteristics of each cluster, the values in each column can be averaged per cluster.
# Assign into new object
country.clustered <- data %>%
  filter(!country %in% outliers) %>%
  select(-country)
# Assign cluster result into the new object
country.clustered$cluster <- km.pca2$cluster
criteria.mean <- country.clustered %>%
  group_by(cluster) %>%
  summarise_all(mean)
criteria.mean
Profiling from the insights:
The number of groups (the k value) must be determined in advance, since k-means requires it to group the data based on similar characteristics. There are several ways to find the optimal k, such as the elbow method, the silhouette method, and the gap statistic. It is important to apply several methods because, as the case above shows, the k suggested by one method may differ from another. Even though the results differ, the final suggestion is k = 3, since two out of three methods produced the same result.
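The majority vote can be written down explicitly. The sketch below takes the k suggested by each of the three methods in this analysis and picks the most frequent value; the object names are introduced here only for illustration.

```r
# Number of clusters suggested by each method in the analysis above
k.suggested <- c(elbow = 3, silhouette = 2, gap_statistic = 3)

# Count how many methods voted for each k and pick the most frequent one
votes <- table(k.suggested)
k.final <- as.integer(names(votes)[which.max(votes)])
k.final   # 3
```

Note that `which.max()` returns the first maximum, so in the event of a tie this sketch would pick the smallest k among the tied values.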
Once the ideal number of groups is known, the characteristics of each cluster can be determined by averaging the values in each column; this is called cluster profiling. From the results of cluster profiling, the countries in cluster 1 are the ones most in need of aid compared with the countries in cluster 2 and cluster 3.
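Cluster profiling itself is a grouped average. The minimal sketch below shows the idea on a made-up data frame (the values are illustrative, not taken from the dataset) and then identifies the neediest cluster from the profile instead of assuming a fixed label, since the numbering of k-means clusters is arbitrary.

```r
# Toy data with a cluster label already attached (made-up values for illustration)
toy <- data.frame(
  child_mort = c(90, 110, 20, 25, 4, 5),
  income     = c(1500, 1200, 9000, 11000, 40000, 45000),
  cluster    = c(1, 1, 2, 2, 3, 3)
)

# Mean of every indicator within each cluster
profile <- aggregate(. ~ cluster, data = toy, FUN = mean)
profile

# Identify the neediest cluster from its profile, here by the highest mean child mortality
neediest <- profile$cluster[which.max(profile$child_mort)]
neediest   # 1
```

In practice it is worth checking several indicators (for example income and life expectancy as well) before deciding which cluster is the priority.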
To help the countries with the greatest needs, we should separate the economic and health criteria so that the aid will be more effective.
# get all mean values for cluster 1
mean.clus.1 <- criteria.mean[criteria.mean$cluster==1,]
# Parameter filter
economic.help <- data %>%
  filter(exports < mean.clus.1 %>% pull(exports),
         imports < mean.clus.1 %>% pull(imports),
         income < mean.clus.1 %>% pull(income),
         gdpp < mean.clus.1 %>% pull(gdpp)
  ) %>%
  select(country,income,exports,imports, gdpp)
list.metrics <- economic.help %>% select(-country) %>% colnames()
# see the worst country on each economic category
for (col in list.metrics) {
  help.country <- economic.help %>%
    arrange(!!as.symbol(col)) %>%
    select(country, !!as.symbol(col)) %>%
    head(1)
  print(help.country)
}
## country income
## 1 Burundi 764
## country exports
## 1 Myanmar 0.109
## country imports
## 1 Myanmar 0.0659
## country gdpp
## 1 Burundi 231
In the economic sector, among the countries that fall below the cluster 1 average on all four indicators, the two that need help the most are Myanmar and Burundi. Their needs differ: Myanmar is weakest in trade (imports and exports), while Burundi is weakest in income and GDP per capita.
# get all mean values for cluster 1
mean.clus.1 <- criteria.mean[criteria.mean$cluster==1,]
# Parameter filter
health.help <- data %>%
  filter(health < mean.clus.1 %>% pull(health),
         life_expec < mean.clus.1 %>% pull(life_expec)
  ) %>%
  select(country, health, life_expec)
list.metrics <- health.help %>% select(-country) %>% colnames()
# see the worst country on each health category
for (col in list.metrics) {
  help.country <- health.help %>%
    arrange(!!as.symbol(col)) %>%
    select(country, !!as.symbol(col)) %>%
    head(1)
  print(help.country)
}
## country health
## 1 Central African Republic 3.98
## country life_expec
## 1 Central African Republic 47.5
In the health sector, among the countries below the cluster 1 average on both indicators, the country that needs help the most is the Central African Republic.