HELP International
knitr::include_graphics("HELP-int.jfif")
HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.
HELP International have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively.
Dataset
With this dataset, we will be using K-Means Clustering to cluster the countries, to categorise the countries using socio-economic and health factors that determine the overall development of the country. And then, with PCA, we will decide which countries need help the most.
country <- read.csv("Country-data.csv")
rownames(country) <- country$country
country_clean <- country[,-1]
str(country_clean)
## 'data.frame': 167 obs. of 9 variables:
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
About the columns :
country: Name of the country, we will use this column as rownames
child_mort: Death of children under 5 years of age per 1000 live births
exports: Exports of goods and services per capita. Given as %age of the GDP per capita
health: Total health spending per capita. Given as %age of GDP per capita
imports: Imports of goods and services per capita. Given as %age of the GDP per capita
income: Net income per person
inflation: The measurement of the annual growth rate of the Total GDP
life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.
Exploratory Data Analysis
There are no missing value in the dataset.
colSums(is.na(country_clean))
## child_mort exports health imports income inflation life_expec
## 0 0 0 0 0 0 0
## total_fer gdpp
## 0 0
The columns are in different scales. We will scale it while doing clustering and PCA.
summary(country_clean)
## child_mort exports health imports
## Min. : 2.60 Min. : 0.109 Min. : 1.810 Min. : 0.0659
## 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920 1st Qu.: 30.2000
## Median : 19.30 Median : 35.000 Median : 6.320 Median : 43.3000
## Mean : 38.27 Mean : 41.109 Mean : 6.816 Mean : 46.8902
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600 3rd Qu.: 58.7500
## Max. :208.00 Max. :200.000 Max. :17.900 Max. :174.0000
## income inflation life_expec total_fer
## Min. : 609 Min. : -4.210 Min. :32.10 Min. :1.150
## 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30 1st Qu.:1.795
## Median : 9960 Median : 5.390 Median :73.10 Median :2.410
## Mean : 17145 Mean : 7.782 Mean :70.56 Mean :2.948
## 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80 3rd Qu.:3.880
## Max. :125000 Max. :104.000 Max. :82.80 Max. :7.490
## gdpp
## Min. : 231
## 1st Qu.: 1330
## Median : 4660
## Mean : 12964
## 3rd Qu.: 14050
## Max. :105000
country_clean_s <- scale(country_clean)
There are outliers in the dataset. But we will include them, as it might be as represent the real case in the country.
country_clean %>%
gather(attr, values, c(1:9)) %>%
ggplot(aes(x = attr, y = values)) +
geom_boxplot() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
facet_wrap(facets = ~ attr, scales = "free_y")
There are strong positive correlation between :
exports - imports
child_mort - total_fer
income - gdpp
Strong negative correlation between :
child_mort - life_expec
life_expec - total_fer
ggcorr(country_clean, label = T)
K-Means Clustering
Optimal Number of Clusters
We will use two method, Elbow and Silhouette, to determine the optimal number of clusters. Both method recommend different number of clusters. Elbow method recommend 2 clusters (where the elbow is), but Silhouette recommend 5 clusters. With logical thinking, it will be not logic to devide the world to only 2 cluster. We will build models with 3, 4 and 5 clusters. And decide later which cluster we will use.
Elbow Method
fviz_nbclust(x = country_clean_s,
FUNcluster = kmeans,
method = "wss") +
# geom_vline(xintercept = 6, linetype = 2)+
labs(subtitle = "Elbow method")
Silhouette Method
set.seed(310)
fviz_nbclust(country_clean_s, FUN = kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")
Clustering
K = 3
options(warn = - 1)
RNGkind(sample.kind = "Rounding")
set.seed(310)
country_km3 <- kmeans(x = country_clean_s, centers = 3)
country_k3 <- country_clean
country_k3$cluster <- country_km3$cluster
country_k3 %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 3 x 10
## cluster child_mort exports health imports income inflation life_expec
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 93.0 29.2 6.39 42.3 3942. 12.0 59.2
## 2 2 5 58.7 8.81 51.5 45672. 2.67 80.1
## 3 3 21.9 40.2 6.20 47.5 12306. 7.60 72.8
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>
K = 4
options(warn = - 1)
RNGkind(sample.kind = "Rounding")
set.seed(310)
country_km4 <- kmeans(x = country_clean_s, centers = 4)
country_k4 <- country_clean
country_k4$cluster <- country_km4$cluster
country_k4 %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 4 x 10
## cluster child_mort exports health imports income inflation life_expec
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 93.0 29.2 6.39 42.3 3942. 12.0 59.2
## 2 2 4.22 39.5 10.3 38.1 39292. 1.46 80.8
## 3 3 21.2 41.4 6.24 48.0 13167. 7.37 73.0
## 4 4 6.94 113. 5.06 87.8 72025 6.45 79.5
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>
K = 5
options(warn = - 1)
RNGkind(sample.kind = "Rounding")
set.seed(310)
country_km5 <- kmeans(x = country_clean_s, centers = 5)
country_k5 <- country_clean
country_k5$cluster <- country_km5$cluster
country_k5 %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 5 x 10
## cluster child_mort exports health imports income inflation life_expec
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 94.9 28.3 6.41 43.0 3493. 11.8 58.9
## 2 2 4.3 37.9 10.6 36.1 40018. 1.54 80.9
## 3 3 15.2 52.2 7.19 62.2 14792. 3.85 74.0
## 4 4 6.94 113. 5.06 87.8 72025 6.45 79.5
## 5 5 27.7 31.6 5.35 33.5 12369. 11.0 72.0
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>
Goodness of Fit
A good number of cluster can be judge by its :
Within Sum of Squares (wss) : the sum of square of the distance of the point from its centroids. wss should be close to 0.
Between Sun of Squares (bss) : the sum of square of the distance of the centroids from center of the centroid.
Total Sum of Squres (tss) : the sum of square of the distance of the point from the center of the centroids. bss/tss should be close to 1.
All of our wss and bss/tss is far from the goodness. But for this dataset, we will ignore this numbers.
K = 3
country_km3$withinss
## [1] 269.6604 297.2279 259.5575
country_km3$betweenss/country_km3$totss
## [1] 0.4468234
K = 4
country_km4$withinss
## [1] 269.66044 55.13569 277.01987 115.75637
country_km4$betweenss/country_km3$totss
## [1] 0.5196972
K = 5
country_km5$withinss
## [1] 260.44689 46.96944 107.03581 115.75637 117.75126
country_km5$betweenss/country_km3$totss
## [1] 0.566292
Choosing Cluster
Based on the wss and bss/tss above, and from the plot generated below, there are no significat differences between 3 and 4 or 5 clusters. One thing to notice, Cluster 1 is consistent, no shape change between k=3, k=4, k=5. Hence, we will continue with 3 clusters.
K = 3
fviz_cluster(object = country_km3, data = country_clean, geom = "point")+ ggtitle("k = 3")
K = 4
fviz_cluster(object = country_km4, data = country_clean, geom = "point") + ggtitle("k = 4")
K = 5
fviz_cluster(object = country_km5, data = country_clean, geom = "point") + ggtitle("k = 5")
Cluster Characteristic
We will use PCA to find out what are the characteristic of each cluster. From the resulting plot :
Cluster 1 : high on child mortality, fertility, and low GDP. We can say, cluster 1 are Poor Countries.
Cluster 2 : high income, GDP, and good health. We can say, cluster 3 are Developed Countries.
Cluster 3 : high on export, imports, and middle income. We can say, cluster 2 are Developing Countries.
FactoMineR
country_k3_pca <- PCA(X = country_k3,
quali.sup = 10,
scale.unit = T,
graph = F)
fviz_pca_biplot(country_k3_pca,
habillage = 10,
addEllipses = T,
geom.ind = "point")
prcomp & pca3d
country_k3_pr <- prcomp(country_k3, scale = T)
# pca3d(country_k3_pr, group = country_k3$cluster,
# biplot = TRUE, biplot.vars = 5,
# show.ellipses=TRUE, show.group.labels = T)
knitr::include_graphics("pca-3d.JPG")
Contribution of Variable
From Dimension 1 of the PCA, the contributing variables are life_expec, child_mort, total_fer, income, gdpp.
fviz_contrib(X = country_k3_pca, choice = "var", axes = 1)
Which Country to HELP
It is clear that countries in Cluster 1 are the cluster need help. But what countries need the most ? We will choose top 20 countries of each contributing variables, and join. The resulting dataframe shows countries that need the most help.
# low life expectation, ascending
c1 <- country_k3 %>%
filter(cluster == 1) %>%
tibble::rownames_to_column("row_names") %>%
arrange(life_expec) %>%
head(20)
# high child mortality, descending
c2 <- country_k3 %>%
filter(cluster == 1) %>%
tibble::rownames_to_column("row_names") %>%
arrange(desc(child_mort)) %>%
head(20)
# high fertility, descending
c3 <- country_k3 %>%
filter(cluster == 1) %>%
tibble::rownames_to_column("row_names") %>%
arrange(desc(total_fer)) %>%
head(20)
# lowest income, ascending
c4 <- country_k3 %>%
filter(cluster == 1) %>%
tibble::rownames_to_column("row_names") %>%
arrange(income) %>%
head(20)
# lowest GDP, ascending
c5 <- country_k3 %>%
filter(cluster == 1) %>%
tibble::rownames_to_column("row_names") %>%
arrange(gdpp) %>%
head(20)
# merge forward
m1 <- merge(c1, c2)
m2 <- merge(m1, c3)
m3 <- merge(m2, c4)
m4 <- merge(m3, c5)
m4
## row_names child_mort exports health imports income inflation
## 1 Burkina Faso 116.0 19.20 6.74 29.6 1430 6.81
## 2 Burundi 93.6 8.92 11.60 39.2 764 12.30
## 3 Central African Republic 149.0 11.80 3.98 26.5 888 2.01
## 4 Congo, Dem. Rep. 116.0 41.10 7.91 49.6 609 20.80
## 5 Guinea 109.0 30.30 4.93 43.2 1190 16.10
## 6 Mozambique 101.0 31.50 5.21 46.2 918 7.64
## life_expec total_fer gdpp cluster
## 1 57.9 5.87 575 1
## 2 57.7 6.26 231 1
## 3 47.5 5.21 446 1
## 4 57.5 6.54 334 1
## 5 58.0 5.34 648 1
## 6 54.5 5.56 419 1
#merge backward
m10 <- merge(c4,c5)
m11 <- merge(m10,c3)
m12 <- merge(m11,c2)
m13 <- merge(m12,c1)
m13
## row_names child_mort exports health imports income inflation
## 1 Burkina Faso 116.0 19.20 6.74 29.6 1430 6.81
## 2 Burundi 93.6 8.92 11.60 39.2 764 12.30
## 3 Central African Republic 149.0 11.80 3.98 26.5 888 2.01
## 4 Congo, Dem. Rep. 116.0 41.10 7.91 49.6 609 20.80
## 5 Guinea 109.0 30.30 4.93 43.2 1190 16.10
## 6 Mozambique 101.0 31.50 5.21 46.2 918 7.64
## life_expec total_fer gdpp cluster
## 1 57.9 5.87 575 1
## 2 57.7 6.26 231 1
## 3 47.5 5.21 446 1
## 4 57.5 6.54 334 1
## 5 58.0 5.34 648 1
## 6 54.5 5.56 419 1
Both merged dataframe shows the same result. Countries need the most help are : Burkina Faso, Burundi, Central African Republic, Democratic Republic Congo, Guinea, and Mozambique.
Alternative Choices
Or we can plot the PCA, and find the outliers in Cluster 1. We got Haiti, Chad, Niger, Central African Republic, Nigeria
options(ggrepel.max.overlaps = Inf)
plot.PCA(x = country_k3_pca ,
choix = "ind",
invisible = "quali",
select = "contrib 13")
Indonesia
What cluster is Indonesia in ? Indonesia is in Cluster 3.
country_k3["Indonesia",]$cluster
## [1] 3