The categorization of countries based on social class, economy and health is what determines the overall development of a country. HELP international is an NGO committed to fighting poverty, providing facilities and aid in underdeveloped countries after natural disasters. The organization raised $10 million US dollars. The CEO of the NGO wants the funds to be used strategically and effectively. The CEO will decide which countries will receive the aid. A data scientist’s job is to categorize countries based on social, economic and overall health factors. Then suggest which countries should get the aid. Some of the libraries we need include:
library(dplyr)
library(tidyverse)
library(factoextra)
library(FactoMineR)
library(plotly)country <- read.csv("Country-data-f.csv")We use head to see the top data
head(country)We first separate the data according to the name
country_separate <- separate(country, col = "country.child_mort.exports.health.imports.income.inflation.life_expec.total_fer.gdpp", into = c("country", "child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer", "gdpp"), sep = ";" )
country_separateChange data types
country_separate$country <- as.factor(country_separate$country)# meng-assign nilai dari kolom country menjadi rownames
rownames(country_separate) <- country_separate$country
country_clean <- country_separate %>%
mutate_at(vars(child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp), as.numeric) %>%
select(-country)
glimpse(country_clean)#> Rows: 167
#> Columns: 9
#> $ child_mort <dbl> 90.2, 16.6, 27.3, 119.0, 10.3, 14.5, 18.1, 4.8, 4.3, 39.2, …
#> $ exports <dbl> 10.0, 28.0, 38.4, 62.3, 45.5, 18.9, 20.8, 19.8, 51.3, 54.3,…
#> $ health <dbl> 7.58, 6.55, 4.17, 2.85, 6.03, 8.10, 4.40, 8.73, 11.00, 5.88…
#> $ imports <dbl> 44.9, 48.6, 31.4, 42.9, 58.9, 16.0, 45.3, 20.9, 47.8, 20.7,…
#> $ income <dbl> 1610, 9930, 12900, 5900, 19100, 18700, 6700, 41400, 43200, …
#> $ inflation <dbl> 9.440, 4.490, 16.100, 22.400, 1.440, 20.900, 7.770, 1.160, …
#> $ life_expec <dbl> 56.2, 76.3, 76.5, 60.1, 76.8, 75.8, 73.3, 82.0, 80.5, 69.1,…
#> $ total_fer <dbl> 5.82, 1.65, 2.89, 6.16, 2.13, 2.37, 1.69, 1.93, 1.44, 1.92,…
#> $ gdpp <dbl> 553, 4090, 4460, 3530, 12200, 10300, 3220, 51900, 46900, 58…
From the glimps function above, we can see that the data has 167 rows and 10 columns. Here is the explanation of the variables:
country : Name of the countrychild_mort : Death of children under 5 years of age per
1000 live birthsexports : Exports of goods and services per capita.
Given as %age of the GDP per capitahealth : Total health spending per capita. Given as
%age of GDP per capitaimports : Imports of goods and services per capita.
Given as %age of the GDP per capitaincome : Net income per personinflation : The measurement of the annual growth rate
of the Total GDPlife_expec : The average number of years a new born
child would live if the current mortality patterns are to remain the
sametotal_fer : The number of children that would be born
to each woman if the current age-fertility rates remain the samegdpp : The GDP per capita. Calculated as the Total GDP
divided by the total population.Check Missing values
anyNA(country_clean)#> [1] FALSE
No missing values
By using the summary we will have the following information
summary(country_clean)#> child_mort exports health imports
#> Min. : 2.60 Min. : 0.109 Min. : 1.810 Min. : 0.0659
#> 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920 1st Qu.: 30.2000
#> Median : 19.30 Median : 35.000 Median : 6.320 Median : 43.3000
#> Mean : 38.27 Mean : 41.109 Mean : 6.816 Mean : 46.8902
#> 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600 3rd Qu.: 58.7500
#> Max. :208.00 Max. :200.000 Max. :17.900 Max. :174.0000
#> income inflation life_expec total_fer
#> Min. : 609 Min. : -4.210 Min. :32.10 Min. :1.150
#> 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30 1st Qu.:1.795
#> Median : 9960 Median : 5.390 Median :73.10 Median :2.410
#> Mean : 17145 Mean : 7.782 Mean :70.56 Mean :2.948
#> 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80 3rd Qu.:3.880
#> Max. :125000 Max. :104.000 Max. :82.80 Max. :7.490
#> gdpp
#> Min. : 231
#> 1st Qu.: 1330
#> Median : 4660
#> Mean : 12964
#> 3rd Qu.: 14050
#> Max. :105000
From the data above, each dimension has a different scale.
Because our data has different scales, we will scale first
country_scale <- scale(country_clean)
summary(country_scale)#> child_mort exports health imports
#> Min. :-0.8845 Min. :-1.4957 Min. :-1.8223 Min. :-1.9341
#> 1st Qu.:-0.7444 1st Qu.:-0.6314 1st Qu.:-0.6901 1st Qu.:-0.6894
#> Median :-0.4704 Median :-0.2229 Median :-0.1805 Median :-0.1483
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.5909 3rd Qu.: 0.3736 3rd Qu.: 0.6496 3rd Qu.: 0.4899
#> Max. : 4.2086 Max. : 5.7964 Max. : 4.0353 Max. : 5.2504
#> income inflation life_expec total_fer
#> Min. :-0.8577 Min. :-1.1344 Min. :-4.3242 Min. :-1.1877
#> 1st Qu.:-0.7153 1st Qu.:-0.5649 1st Qu.:-0.5910 1st Qu.:-0.7616
#> Median :-0.3727 Median :-0.2263 Median : 0.2861 Median :-0.3554
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.2934 3rd Qu.: 0.2808 3rd Qu.: 0.7021 3rd Qu.: 0.6157
#> Max. : 5.5947 Max. : 9.1023 Max. : 1.3768 Max. : 3.0003
#> gdpp
#> Min. :-0.69471
#> 1st Qu.:-0.63475
#> Median :-0.45307
#> Mean : 0.00000
#> 3rd Qu.: 0.05924
#> Max. : 5.02140
With the factoextra library, we will use the elbow method to determine it by noting that in terms of business, it does not determine what the optimal K is
fviz_nbclust(
x = country_scale, # data
FUNcluster = kmeans,
method = "wss" # method
)From the plot above we can see that K starts to slope at number 6, so we can conclude that the optimum K is 6.
From the country_scale data we will create a clustering with an optimum K of 6
country_cluster <- kmeans(x = country_scale,
centers = 6)country_cluster$size#> [1] 37 48 31 21 27 3
From the data above, we can see that there is 1 cluster that is only filled with 3 countries, namely cluster 6
country_cluster$centers#> child_mort exports health imports income inflation
#> 1 -0.2509045 -0.3371408 -0.606008744 -0.72003193 -0.1908814 0.42283666
#> 2 -0.5832573 0.3691122 0.038430150 0.54839641 -0.1135088 -0.37881810
#> 3 0.6551993 -0.6036538 0.223878019 0.07169025 -0.7068393 0.03910137
#> 4 1.9864373 -0.2369286 -0.540872412 -0.26533075 -0.6882963 0.72971904
#> 5 -0.8224879 0.1364063 0.926673853 -0.36798367 1.5400218 -0.46258692
#> 6 -0.8464575 4.9208731 -0.008138555 4.53442030 2.4322274 -0.50269428
#> life_expec total_fer gdpp
#> 1 0.2222766 -0.3291179 -0.3547909
#> 2 0.4130672 -0.6588108 -0.2150469
#> 3 -0.9446404 1.0175549 -0.6054523
#> 4 -1.5446278 1.6738414 -0.6070734
#> 5 1.1111163 -0.7328648 1.7654312
#> 6 1.2231457 -1.0357477 2.4334786
To see which countries are included in each cluster
country_cluster$cluster#> Afghanistan Albania
#> 3 2
#> Algeria Angola
#> 1 4
#> Antigua and Barbuda Argentina
#> 2 1
#> Armenia Australia
#> 1 5
#> Austria Azerbaijan
#> 5 1
#> Bahamas Bahrain
#> 2 2
#> Bangladesh Barbados
#> 1 2
#> Belarus Belgium
#> 2 5
#> Belize Benin
#> 2 4
#> Bhutan Bolivia
#> 2 1
#> Bosnia and Herzegovina Botswana
#> 2 3
#> Brazil Brunei
#> 1 5
#> Bulgaria Burkina Faso
#> 2 4
#> Burundi Cambodia
#> 3 2
#> Cameroon Canada
#> 4 5
#> Cape Verde Central African Republic
#> 2 4
#> Chad Chile
#> 4 1
#> China Colombia
#> 1 1
#> Comoros Congo Dem. Rep.
#> 3 4
#> Congo Rep. Costa Rica
#> 4 2
#> Cote d'Ivoire Croatia
#> 4 2
#> Cyprus Czech Republic
#> 2 2
#> Denmark Dominican Republic
#> 5 1
#> Ecuador Egypt
#> 1 1
#> El Salvador Equatorial Guinea
#> 2 4
#> Eritrea Estonia
#> 3 2
#> Fiji Finland
#> 2 5
#> France Gabon
#> 5 1
#> Gambia Georgia
#> 3 2
#> Germany Ghana
#> 5 3
#> Greece Grenada
#> 5 2
#> Guatemala Guinea
#> 1 4
#> Guinea-Bissau Guyana
#> 3 2
#> Haiti Hungary
#> 4 2
#> Iceland India
#> 5 1
#> Indonesia Iran
#> 1 1
#> Iraq Ireland
#> 3 5
#> Israel Italy
#> 5 5
#> Jamaica Japan
#> 1 5
#> Jordan Kazakhstan
#> 2 1
#> Kenya Kiribati
#> 3 3
#> Kuwait Kyrgyz Republic
#> 5 2
#> Lao Latvia
#> 3 2
#> Lebanon Lesotho
#> 2 3
#> Liberia Libya
#> 3 1
#> Lithuania Luxembourg
#> 2 6
#> Macedonia FYR Madagascar
#> 2 3
#> Malawi Malaysia
#> 4 2
#> Maldives Mali
#> 2 4
#> Malta Mauritania
#> 6 4
#> Mauritius Micronesia Fed. Sts.
#> 2 3
#> Moldova Mongolia
#> 2 1
#> Montenegro Morocco
#> 2 1
#> Mozambique Myanmar
#> 4 1
#> Namibia Nepal
#> 3 1
#> Netherlands New Zealand
#> 5 5
#> Niger Nigeria
#> 4 4
#> Norway Oman
#> 5 1
#> Pakistan Panama
#> 4 2
#> Paraguay Peru
#> 2 1
#> Philippines Poland
#> 1 2
#> Portugal Qatar
#> 5 5
#> Romania Russia
#> 1 1
#> Rwanda Samoa
#> 3 3
#> Saudi Arabia Senegal
#> 1 3
#> Serbia Seychelles
#> 2 2
#> Sierra Leone Singapore
#> 4 6
#> Slovak Republic Slovenia
#> 2 2
#> Solomon Islands South Africa
#> 3 3
#> South Korea Spain
#> 2 5
#> Sri Lanka St. Vincent and the Grenadines
#> 1 2
#> Sudan Suriname
#> 3 2
#> Sweden Switzerland
#> 5 5
#> Tajikistan Tanzania
#> 3 3
#> Thailand Timor-Leste
#> 2 3
#> Togo Tonga
#> 3 3
#> Tunisia Turkey
#> 2 1
#> Turkmenistan Uganda
#> 1 3
#> Ukraine United Arab Emirates
#> 2 5
#> United Kingdom United States
#> 5 5
#> Uruguay Uzbekistan
#> 1 1
#> Vanuatu Venezuela
#> 3 1
#> Vietnam Yemen
#> 2 3
#> Zambia
#> 4
Create a new column containing the label information of the cluster formed using k optimum
country_clean$cluster <- as.factor(country_cluster$cluster)
country_clean %>% head()Melakukan grouping berdasarkan cluster yang terbentuk, untuk mengetahui karakteristik dari masing-masing cluster
country_centroid <- country_clean %>%
group_by(cluster) %>%
summarise_all(mean)
country_centroidPerform grouping based on the clusters formed, to determine the characteristics of each cluster
country_centroid %>%
pivot_longer(-cluster) %>%
group_by(name) %>%
summarize(
group_min = which.min(value),
group_max = which.max(value))The explanation of the above plot is as follows:
Cluster 1 belongs to the minimum group in the
health and imports variablesCluster 2 is not included in all groups both minimum
and maximum in all variablesCluster 3 belongs to the minimum group in the
exports and income variables.Cluster 4 belongs to the minimum group with the
variables gdpp and life_expec, and belongs to
the maximum group with the variables child_mort,
inflation and total_fer.Cluster 5 Included in the maximum group with
health variablesCluster 6 belongs to the minimum group with
child_mort, inflation and
total_fer variables, and belongs to the maximum group with
exports, gdpp, imports,
income and life_expec variables.The goodness of clustering results can be seen from 3 values
the sum of the squared distances from each observation to the centroid of each cluster. From our case, we can find the WSS value below:
country_cluster$withinss#> [1] 93.42705 96.29759 106.26163 151.18155 121.75399 20.87409
sum of the weighted squared distances from each centroid to the global mean
country_cluster$betweenss#> [1] 904.2041
sum of squared distances from each observation to the global mean
country_cluster$totss#> [1] 1494
The ratio between BSS and TSS is as follows
country_cluster$betweenss/country_cluster$totss#> [1] 0.6052236
The ratio is quite good because it is close to 1
we visualize it on a 2 dimensional plot, where the object is
country_cluster and the data is
country_clean
# visualisasi 2 dimensi
fviz_cluster(object = country_cluster,
data = country_clean %>% select(-cluster))create a new axis that can capture as much information (variance) as possible from the initial variables. This new axis is called the Principal Component (PC).
We want to create a visualization that simplifies cluster profiling, where the individual views and variables factor map together. Visualizations can be created using the fviz_pca_biplot() function from the factoextra package
# buat model PCA
country_pca <- PCA(X = country_clean, # data untuk di PCA
scale.unit = T,
quali.sup = 10, # quali.sup -> indeks dari kolom kategori
graph = F)
summary(country_pca)#>
#> Call:
#> PCA(X = country_clean, scale.unit = T, quali.sup = 10, graph = F)
#>
#>
#> Eigenvalues
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> Variance 4.136 1.546 1.170 0.995 0.661 0.224 0.113
#> % of var. 45.952 17.182 13.004 11.053 7.340 2.484 1.260
#> Cumulative % of var. 45.952 63.133 76.138 87.191 94.531 97.015 98.276
#> Dim.8 Dim.9
#> Variance 0.088 0.067
#> % of var. 0.981 0.743
#> Cumulative % of var. 99.257 100.000
#>
#> Individuals (the 10 first)
#> Dist Dim.1 ctr cos2 Dim.2 ctr cos2
#> Afghanistan | 3.230 | -2.913 1.229 0.814 | 0.096 0.004 0.001 |
#> Albania | 1.473 | 0.430 0.027 0.085 | -0.588 0.134 0.160 |
#> Algeria | 1.664 | -0.285 0.012 0.029 | -0.455 0.080 0.075 |
#> Angola | 3.900 | -2.932 1.245 0.565 | 1.696 1.113 0.189 |
#> Antigua and Barbuda | 1.415 | 1.034 0.155 0.533 | 0.137 0.007 0.009 |
#> Argentina | 2.223 | 0.022 0.000 0.000 | -1.779 1.226 0.641 |
#> Armenia | 1.719 | -0.102 0.001 0.003 | -0.568 0.125 0.109 |
#> Australia | 3.405 | 2.342 0.794 0.473 | -1.988 1.531 0.341 |
#> Austria | 3.341 | 2.974 1.280 0.792 | -0.735 0.209 0.048 |
#> Azerbaijan | 1.581 | -0.181 0.005 0.013 | -0.403 0.063 0.065 |
#> Dim.3 ctr cos2
#> Afghanistan -0.718 0.264 0.049 |
#> Albania -0.333 0.057 0.051 |
#> Algeria 1.222 0.763 0.539 |
#> Angola 1.525 1.190 0.153 |
#> Antigua and Barbuda -0.226 0.026 0.025 |
#> Argentina 0.870 0.387 0.153 |
#> Armenia 0.242 0.030 0.020 |
#> Australia 0.190 0.019 0.003 |
#> Austria -0.520 0.138 0.024 |
#> Azerbaijan 0.867 0.385 0.301 |
#>
#> Variables
#> Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
#> child_mort | -0.853 17.600 0.728 | 0.240 3.720 0.058 | -0.032
#> exports | 0.577 8.060 0.333 | 0.762 37.597 0.581 | 0.157
#> health | 0.307 2.275 0.094 | -0.302 5.909 0.091 | -0.645
#> imports | 0.328 2.608 0.108 | 0.835 45.134 0.698 | -0.324
#> income | 0.810 15.876 0.657 | 0.028 0.051 0.001 | 0.326
#> inflation | -0.393 3.732 0.154 | -0.010 0.007 0.000 | 0.695
#> life_expec | 0.866 18.134 0.750 | -0.277 4.960 0.077 | 0.123
#> total_fer | -0.821 16.300 0.674 | 0.193 2.410 0.037 | 0.021
#> gdpp | 0.798 15.417 0.638 | -0.057 0.212 0.003 | 0.133
#> ctr cos2
#> child_mort 0.087 0.001 |
#> exports 2.096 0.025 |
#> health 35.597 0.417 |
#> imports 8.996 0.105 |
#> income 9.093 0.106 |
#> inflation 41.283 0.483 |
#> life_expec 1.298 0.015 |
#> total_fer 0.038 0.000 |
#> gdpp 1.512 0.018 |
#>
#> Supplementary categories
#> Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
#> cluster_1 | 1.253 | -0.268 0.046 -0.907 | -0.686 0.299 -3.790 |
#> cluster_2 | 1.263 | 0.832 0.434 3.347 | 0.290 0.053 1.909 |
#> cluster_3 | 1.915 | -1.746 0.832 -5.282 | 0.130 0.005 0.645 |
#> cluster_4 | 3.316 | -3.021 0.830 -7.259 | 0.804 0.059 3.158 |
#> cluster_5 | 3.036 | 2.637 0.754 7.337 | -0.954 0.099 -4.342 |
#> cluster_6 | 7.779 | 5.460 0.493 4.679 | 5.432 0.488 7.613 |
#> Dim.3 cos2 v.test
#> cluster_1 0.728 0.337 4.623 |
#> cluster_2 -0.388 0.094 -2.933 |
#> cluster_3 -0.614 0.103 -3.490 |
#> cluster_4 0.354 0.011 1.598 |
#> cluster_5 0.098 0.001 0.514 |
#> cluster_6 0.212 0.001 0.341 |
From the data above we can explore again by looking at the proportion of dimensions with the plot below:
fviz_eig(country_pca, ncp = 9, addlabels = T, main = "Variance by each dimensions")Dimensions 1 and 2 have a variance of about 63%, from these dimensions we will make a visualization
The goal is to display the distribution of data
plot.PCA(
x = country_pca,
choix = "ind",
select = "contrib 5"
)With the visual plot above, the countries of Singapore, Malta, Luxembourg, Haiti and Nigeria are outliers
plot.PCA(x = country_pca,
choix = "var")The insights we can take from the plot above are:
life_expec,
child_mort, total_fer, income and
gdpp.imports and
exportsimports-exports, income-gdpp,
gdpp-life_expecchild_mort-total_ferWe will look at the variable contributions for dimension 1
fviz_contrib(X = country_pca,
choice = "var",
axes = 1)From the plot above, we can conclude that the variables
life_expec, child_mort,
total_fer, income and gdpp have a
contribution in dimension 1.
We will look at the variable contributions for dimension 2
fviz_contrib(X = country_pca,
choice = "var",
axes = 2)From the plot above, we can conclude that the imports and exports variables have a contribution in dimension 2
# visualisasi biplot + cluster
fviz_pca_biplot(X = country_pca,
habillage = "cluster",
geom.ind = "point",
addEllipses = TRUE)By using PCA - Biplot we can draw conclusions: -
child_mort is strongly positively correlated with
total_fer - child_mort, total_fer
and inflation, strongly negatively correlated with
health, life_expec, gdpp and
income
Which countries will receive assistance from HELP International?
From the case above, we will categorize the data with low social, economic and health class filtration. Below are the countries that are nominated to receive assistance from HELP International.
# variabel child_mort
country_clean %>%
filter(cluster == "4") %>% arrange(-child_mort) %>% head()# variabel total_fer
country_clean %>%
filter(cluster == "4") %>% arrange(-total_fer)%>% head() # variabel inflation
country_clean %>%
filter(cluster == "4")%>% arrange(-inflation) %>% head()From the child_mort, total_fer and inflation variables, our recommended country is in cluster 4.
We will look for countries with strong negative correlations to
child_mort, total_fer and
inflation, namely health,
life_expec, gdpp and income
# variabel health
country_clean %>%
filter(cluster == "1") %>% arrange(health) %>% head()# variabel life_expec
country_clean %>%
filter(cluster == "4") %>% arrange(life_expec) %>% head()# variabel gdpp
country_clean %>%
filter(cluster == "4") %>% arrange(gdpp) %>% head()# variabel life_expec
country_clean %>%
filter(cluster == "4") %>% arrange(income) %>% head()From the life_expec, gdpp and
income variables, our recommendation is in
cluster 4. From the health variable, our
recommendation is in Cluster 1.
From the PCA-biplot we can conclude that the variable
child_mort is strongly positively correlated with
total_fer, which means that the number of deaths of
children under 5 years per 1000 births is strongly correlated with the
number of children born. The variables child_mort,
total_fer and inflation are strongly
negatively correlated with health, life_expec,
gdpp and income, which means that if child
mortality under 5 years old, the number of child births and
inflation increase, health, the number of live
births, gdpp and income per person will
decrease. From the above recommendations, cluster no. 4 is
a priority that receives assistance from the HELP International
organization. As for cluster 1 with the value of
health, we have an assumption that maybe the country uses
subsidies, so the total health expenditure per capita is
low, therefore cluster 1 is not recommended.