Country Data

Before We Start

In this article we will learn how to use PCA in an unsupervised learning workflow together with K-means clustering. We will cluster the countries in this dataset into levels of state development.

Read Data and Processing

library(dplyr)   # %>%, select(), group_by()
library(knitr)   # kable()
dataset = read.csv("Country-data.csv")
kable(head(dataset))

country	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Afghanistan	90.2	10.0	7.58	44.9	1610	9.44	56.2	5.82	553
Albania	16.6	28.0	6.55	48.6	9930	4.49	76.3	1.65	4090
Algeria	27.3	38.4	4.17	31.4	12900	16.10	76.5	2.89	4460
Angola	119.0	62.3	2.85	42.9	5900	22.40	60.1	6.16	3530
Antigua and Barbuda	10.3	45.5	6.03	58.9	19100	1.44	76.8	2.13	12200
Argentina	14.5	18.9	8.10	16.0	18700	20.90	75.8	2.37	10300

rownames(dataset) = dataset$country
dataset = dataset %>% 
  select(- country)
kable(head(dataset))

	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Afghanistan	90.2	10.0	7.58	44.9	1610	9.44	56.2	5.82	553
Albania	16.6	28.0	6.55	48.6	9930	4.49	76.3	1.65	4090
Algeria	27.3	38.4	4.17	31.4	12900	16.10	76.5	2.89	4460
Angola	119.0	62.3	2.85	42.9	5900	22.40	60.1	6.16	3530
Antigua and Barbuda	10.3	45.5	6.03	58.9	19100	1.44	76.8	2.13	12200
Argentina	14.5	18.9	8.10	16.0	18700	20.90	75.8	2.37	10300

Check the covariance between the variables in our data

cov(dataset)

##               child_mort       exports        health      imports        income
## child_mort    1626.42271   -351.651128   -22.1999431  -124.201982   -407635.982
## exports       -351.65113    751.418298    -8.6145337   489.350622    273094.598
## health         -22.19994     -8.614534     7.5451162     6.365141      6861.669
## imports       -124.20198    489.350622     6.3651406   586.104198     57128.722
## income     -407635.98227 273094.598023  6861.6690711 57128.721588 371643894.155
## inflation      122.89363    -31.090078    -7.4150930   -63.208898    -30110.122
## life_expec    -318.00826     77.110598     5.1468078    11.710284    104916.786
## total_fer       51.80116    -13.279671    -0.8178281    -5.829066    -14645.728
## gdpp       -357046.30615 210378.470377 17417.9712174 51250.050217 316443012.157
##                inflation    life_expec     total_fer         gdpp
## child_mort    122.893627   -318.008262  5.180116e+01   -357046.31
## exports       -31.090078     77.110598 -1.327967e+01    210378.47
## health         -7.415093      5.146808 -8.178281e-01     17417.97
## imports       -63.208898     11.710284 -5.829066e+00     51250.05
## income     -30110.122438 104916.785517 -1.464573e+04 316443012.16
## inflation     111.739781    -22.533965  5.071509e+00    -42940.42
## life_expec    -22.533965     79.088507 -1.024358e+01     97814.72
## total_fer       5.071509    -10.243585  2.291734e+00    -12622.33
## gdpp       -42940.421636  97814.722603 -1.262233e+04 335941419.96

The variances differ by several orders of magnitude between columns (compare income with total_fer), so we need to scale the data before running PCA.
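To make the effect of scaling concrete, here is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions, chosen only to mimic two columns on very different scales). It shows that after `scale()` every column has variance 1, and that the covariance of the scaled data is the same as the correlation of the raw data.

```r
# Toy data: two columns on very different scales (values are made up)
set.seed(1)
toy <- data.frame(
  income    = rnorm(50, mean = 17000, sd = 19000),
  total_fer = rnorm(50, mean = 3, sd = 1.5)
)

# Unscaled variances differ by many orders of magnitude
apply(toy, 2, var)

# After scale(), every column has variance 1 ...
toy_scaled <- scale(toy)
apply(toy_scaled, 2, var)

# ... and the covariance of the scaled data equals the correlation of the raw data
all.equal(cov(toy_scaled), cor(toy))
```

This is why PCA on unscaled data would be dominated by the columns with the largest units, such as income and gdpp.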

dataset_pca <- prcomp(dataset, scale. = TRUE)
biplot(dataset_pca,
       cex = 0.6,
       choices = c(1, 2),
       scale = FALSE)

From this plot we can see that the exports and imports columns are strongly correlated, because their arrows point in almost the same direction.

summary(dataset_pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion  0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
##                            PC8     PC9
## Standard deviation     0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion  0.99257 1.00000

From the summary of the PCs, I want to use as few PCs as possible while losing less than 15% of the information. PC1 to PC4 together retain 87.19% of the variance, so I will take those 4 PCs.
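The same choice can be made programmatically. The sketch below recomputes the cumulative proportion of variance from the standard deviations printed in the summary above and picks the first PC that reaches the threshold. The 0.85 threshold comes from the "less than 15% lost" rule; the variable names are only illustrative.

```r
# Standard deviations copied from the summary(dataset_pca) output above
sdev <- c(2.0336, 1.2435, 1.0818, 0.9974, 0.8128,
          0.47284, 0.3368, 0.29718, 0.25860)

# Proportion of variance per PC and its running total
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# First PC at which at least 85% of the variance is retained
n_pc <- which(cum_var >= 0.85)[1]
n_pc
round(cum_var[n_pc], 4)
```

With a fitted model you would use `dataset_pca$sdev` instead of the hard-coded vector.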

library(factoextra)

fviz_contrib(X = dataset_pca,
             choice = "var", 
             axes = 4)
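For a `prcomp` model, the contribution of a variable to a component shown by this kind of plot is the squared loading expressed as a percentage. A minimal sketch, using the built-in `USArrests` data as a stand-in for the country data (an assumption made only so that the example runs on its own):

```r
# Stand-in data: USArrests ships with R; it is not the country dataset
pca <- prcomp(USArrests, scale. = TRUE)

# Each column of the rotation matrix is a unit vector,
# so the squared loadings of one PC sum to 1 (100%)
contrib <- 100 * pca$rotation^2

# Contribution (%) of every variable to PC1
round(contrib[, "PC1"], 1)

# Sanity check: contributions add up to 100 for every PC
colSums(contrib)
```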

Modeling

Scaling Dataset

data_scale = scale(dataset)
kable(head(as.data.frame(data_scale)))

	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Afghanistan	1.2876597	-1.1348666	0.2782514	-0.0822077	-0.8058219	0.1568645	-1.6142372	1.8971765	-0.6771431
Albania	-0.5373329	-0.4782202	-0.0967253	0.0706243	-0.3742433	-0.3114109	0.6459238	-0.8573942	-0.4841671
Algeria	-0.2720146	-0.0988244	-0.9631762	-0.6398380	-0.2201823	0.7869076	0.6684130	-0.0382892	-0.4639802
Angola	2.0017872	0.7730562	-1.4437289	-0.1648196	-0.5832892	1.3828944	-1.1756985	2.1217698	-0.5147203
Antigua and Barbuda	-0.6935483	0.1601861	-0.2860339	0.4960755	0.1014267	-0.5999442	0.7021467	-0.5403213	-0.0416917
Argentina	-0.5894047	-0.8101914	0.4675600	-1.2759496	0.0806778	1.2409928	0.5897009	-0.3817849	-0.1453543

Determine the value of k

A note for this part: you do not always need to search for an optimal K for the clusters, because sometimes the value of K is chosen based on the business problem. For this problem we want to find levels of state development, and we will use the elbow method.

fviz_nbclust(data_scale, FUNcluster = kmeans, method = "wss")
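The elbow method looks at the total within-cluster sum of squares (WSS) for each K and picks the point where adding another cluster stops paying off. Here is a minimal base R sketch of that computation on made-up data with three well-separated groups (the `toy` matrix is an assumption, not the country data):

```r
set.seed(42)
# Toy data: three well-separated groups of 50 points each (made up)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for K = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# WSS drops sharply up to K = 3 and then flattens: that bend is the "elbow"
data.frame(k = ks, wss = round(wss, 1))
```

On real data the bend is usually less sharp, which is one reason the business problem can be a better guide for K.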

Tuning Model

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(123)
clusters = kmeans(data_scale, 3)

Add the cluster labels to the dataset

dataset$cluster = clusters$cluster
head(dataset)

##                     child_mort exports health imports income inflation
## Afghanistan               90.2    10.0   7.58    44.9   1610      9.44
## Albania                   16.6    28.0   6.55    48.6   9930      4.49
## Algeria                   27.3    38.4   4.17    31.4  12900     16.10
## Angola                   119.0    62.3   2.85    42.9   5900     22.40
## Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44
## Argentina                 14.5    18.9   8.10    16.0  18700     20.90
##                     life_expec total_fer  gdpp cluster
## Afghanistan               56.2      5.82   553       1
## Albania                   76.3      1.65  4090       2
## Algeria                   76.5      2.89  4460       2
## Angola                    60.1      6.16  3530       1
## Antigua and Barbuda       76.8      2.13 12200       2
## Argentina                 75.8      2.37 10300       2

Compare the mean of each variable per cluster

dataset %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

## # A tibble: 3 x 10
##   cluster child_mort exports health imports income inflation life_expec
##     <int>      <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1       1       93.0    29.2   6.39    42.3  3942.     12.0        59.2
## 2       2       21.9    40.2   6.20    47.5 12306.      7.60       72.8
## 3       3        5      58.7   8.81    51.5 45672.      2.67       80.1
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>

From this table we get several insights:
1. cluster 1 has the highest child_mort and inflation
2. cluster 2 has the lowest total_fer
3. cluster 3 has the highest gdpp, income and life_expec

From that we can conclude that:
cluster 1 is the under-developing countries
cluster 2 is the developing countries
cluster 3 is the developed countries
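The cluster numbers returned by kmeans are arbitrary, so the mapping above only holds for this particular run. One way to make the labels independent of the numbering is to order the clusters by a summary variable. The sketch below assumes mean income as that variable and uses a small made-up data frame; both are illustrative choices, not part of the original analysis.

```r
# Toy data: cluster ids in arbitrary order, as kmeans might return them
toy <- data.frame(
  income  = c(1610, 5900, 9930, 12900, 45000, 62000),
  cluster = c(2, 2, 3, 3, 1, 1)
)

# Mean income per cluster, then its rank (1 = lowest mean income)
cluster_income   <- tapply(toy$income, toy$cluster, mean)
level_of_cluster <- rank(cluster_income)

# Attach a development label according to that rank
labels <- c("under-developing", "developing", "developed")
toy$level <- factor(
  labels[level_of_cluster[as.character(toy$cluster)]],
  levels = labels
)
toy
```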

Visualize Cluster

data_viz = dataset
data_viz$cluster = factor(data_viz$cluster,
                         levels = c(1,2,3),
                         labels = c("under-developing", "developing", "developed"))

library(ggradar)
library(scales)
dat_radar <- data_viz %>% 
             group_by(cluster) %>% 
             summarise_all("mean") %>% 
             rename(group = cluster) %>% 
             mutate(group = as.character(group)) %>%
             # pass the function directly; funs() is deprecated as of dplyr 0.8.0
             mutate_at(vars(-group), rescale)

ggradar(dat_radar, 
        grid.label.size = 4,
        axis.label.size = 4, 
        group.point.size = 5,
        group.line.width = 1.5,
        legend.text.size= 10)

From this plot we can conclude in more detail:
- developed countries have the highest gdpp, exports, health, imports, income and life_expec
- under-developing countries have the highest inflation, total_fer and child_mort, and the lowest values on most of the other variables
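The radar chart is drawn from rescaled cluster means, so on every axis the lowest cluster sits at 0 and the highest at 1. A minimal sketch of that min-max rescaling, using three columns of means copied from the summary table above; `min_max` is a hypothetical helper that mirrors the default behaviour of `scales::rescale`.

```r
# Cluster means copied from the summary table above (clusters 1, 2, 3)
cluster_means <- data.frame(
  child_mort = c(93.0, 21.9, 5),
  income     = c(3942, 12306, 45672),
  inflation  = c(12.0, 7.60, 2.67)
)

# Hypothetical helper: map the smallest value to 0 and the largest to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Values that end up on the radar axes
radar_values <- as.data.frame(lapply(cluster_means, min_max))
round(radar_values, 2)
```

Because only three values are rescaled per variable, the chart shows the ranking of the clusters well but exaggerates small differences between them.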

# dataset now also holds the cluster column, so plot the scaled data the model was fitted on
fviz_cluster(object = clusters, data = data_scale)

If we look at cluster 3 in this plot, some observations lie far from the centroid, mainly Malta, Luxembourg and Singapore. We can treat these observations as outliers, but we do not remove them because the dataset has only a few observations.
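One hedged way to back up what the plot shows is to compute the distance from every observation to its own centroid and look at the largest values. The sketch below does this on made-up data with one far-away point added on purpose (the `toy` matrix is an assumption, not the country data).

```r
set.seed(1)
# Toy data: two tight groups plus one far-away point in row 41 (made up)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2),
  c(8, 8)
)

km <- kmeans(toy, centers = 2, nstart = 25)

# Centroid assigned to each row, then Euclidean distance to it
assigned_centroid <- km$centers[km$cluster, , drop = FALSE]
dist_to_centroid  <- sqrt(rowSums((toy - assigned_centroid)^2))

# Rows ordered from farthest to nearest; the first ones are outlier candidates
head(order(dist_to_centroid, decreasing = TRUE), 3)
```

With the article's objects, the same idea would use `data_scale` and `clusters` in place of `toy` and `km`.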

library(FactoMineR)
end_pca = PCA(X = data_viz, scale.unit = TRUE,
              quali.sup = 10, graph = FALSE)

plot.PCA(x = end_pca, choix = "var")

From this plot we can conclude:
The variables that contribute most to PC1 are income, gdpp, life_expec, health, inflation, child_mort and total_fer.
The variables that contribute most to PC2 are imports and exports.

fviz_pca_biplot(X = end_pca,
                habillage = 10,
                geom.ind = "point",
                addEllipses = TRUE,
                col.var = "navy")

Summary

From the unsupervised learning analysis above, we can summarize that:

1. K-means clustering can be done using this dataset. With K = 3, the clusters can be read as three levels of state development: under-developing, developing and developed.
2. Dimensionality reduction can be performed using this dataset. We can pick PCs from the total of 9 PCs according to how much information we want to retain. In this article I used 4 PCs, which cuts the dimension of my original data by more than 50% while retaining about 87% of the information.
3. The improved dataset obtained from unsupervised learning (e.g. PCA) can be used further for supervised learning (classification) or for better visualization of high-dimensional data.
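As a minimal sketch of how points 2 and 3 could be combined, the retained PC scores can be passed on to another model. The example below uses the built-in `USArrests` data as a stand-in and feeds the scores to kmeans purely for illustration. The article itself clusters on the full scaled data, so this is an alternative under stated assumptions, not what was run above.

```r
# Stand-in data: USArrests ships with R; it is not the country dataset
pca <- prcomp(USArrests, scale. = TRUE)

# Keep the smallest number of PCs that retains at least 85% of the variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc    <- which(cum_var >= 0.85)[1]

# Reduced data: one row per observation, one column per retained PC
scores <- pca$x[, 1:n_pc, drop = FALSE]

# The reduced data can then be used by a downstream model, here kmeans
set.seed(123)
km <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster)
```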