Country Data
Before We Start
In this article we will learn how to use PCA in an unsupervised learning model with K-means clustering. We will use this data to cluster countries by their level of development.
Read Data and Processing
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 |
| | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 |
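The step from the first table to the second (country names moved out of the columns and into the row names) can be sketched as follows. This is only an illustration: the original file name is not given here, so the six rows shown above are typed in as text, and the name `dataset` is borrowed from the plotting code later in the article.

```r
# First rows of the data, typed in from the table above so the sketch runs
# without the original file (whose name and location are not given here).
csv_text <- "country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
Afghanistan,90.2,10.0,7.58,44.9,1610,9.44,56.2,5.82,553
Albania,16.6,28.0,6.55,48.6,9930,4.49,76.3,1.65,4090
Algeria,27.3,38.4,4.17,31.4,12900,16.10,76.5,2.89,4460
Angola,119.0,62.3,2.85,42.9,5900,22.40,60.1,6.16,3530
Antigua and Barbuda,10.3,45.5,6.03,58.9,19100,1.44,76.8,2.13,12200
Argentina,14.5,18.9,8.10,16.0,18700,20.90,75.8,2.37,10300"

dataset <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Move the country names into the row names so only numeric columns remain,
# which is what cov(), scale() and kmeans() expect
rownames(dataset) <- dataset$country
dataset$country <- NULL

str(dataset)
```

With the real file, `read.csv()` would be given the file path instead of `text = csv_text`.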
Check the covariance between the variables in our data
## child_mort exports health imports income
## child_mort 1626.42271 -351.651128 -22.1999431 -124.201982 -407635.982
## exports -351.65113 751.418298 -8.6145337 489.350622 273094.598
## health -22.19994 -8.614534 7.5451162 6.365141 6861.669
## imports -124.20198 489.350622 6.3651406 586.104198 57128.722
## income -407635.98227 273094.598023 6861.6690711 57128.721588 371643894.155
## inflation 122.89363 -31.090078 -7.4150930 -63.208898 -30110.122
## life_expec -318.00826 77.110598 5.1468078 11.710284 104916.786
## total_fer 51.80116 -13.279671 -0.8178281 -5.829066 -14645.728
## gdpp -357046.30615 210378.470377 17417.9712174 51250.050217 316443012.157
## inflation life_expec total_fer gdpp
## child_mort 122.893627 -318.008262 5.180116e+01 -357046.31
## exports -31.090078 77.110598 -1.327967e+01 210378.47
## health -7.415093 5.146808 -8.178281e-01 17417.97
## imports -63.208898 11.710284 -5.829066e+00 51250.05
## income -30110.122438 104916.785517 -1.464573e+04 316443012.16
## inflation 111.739781 -22.533965 5.071509e+00 -42940.42
## life_expec -22.533965 79.088507 -1.024358e+01 97814.72
## total_fer 5.071509 -10.243585 2.291734e+00 -12622.33
## gdpp -42940.421636 97814.722603 -1.262233e+04 335941419.96
The variances differ enormously between columns (income and gdpp are in the hundreds of millions, while health is about 7.5 and total_fer about 2.3), so we need to scale the data.
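A small sketch of why scaling matters, using three columns from the six rows shown earlier. Because it uses only those six rows, its numbers will not match the full-data output above; the name `sample_rows` is just for illustration.

```r
# Three columns from the six rows shown earlier (not the full data set)
sample_rows <- data.frame(
  health = c(7.58, 6.55, 4.17, 2.85, 6.03, 8.10),
  income = c(1610, 9930, 12900, 5900, 19100, 18700),
  gdpp   = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Covariance depends on the units of each column, so income and gdpp dominate
cov(sample_rows)

# Correlation is the covariance of the standardised columns and is unit-free
cor(sample_rows)

# scale() centres each column to mean 0 and rescales it to standard deviation 1,
# so the covariance of the scaled data equals the correlation of the raw data
all.equal(cov(scale(sample_rows)), cor(sample_rows))
```

After scaling, every column contributes on the same footing, which is what both PCA and K-means need.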
From this plot we can see that the exports and imports columns have a fairly strong positive correlation.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
## PC8 PC9
## Standard deviation 0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion 0.99257 1.00000
From the summary of the PCs, I want to use the smallest possible number of PCs while losing less than 15% of the information. PC1 to PC3 retain only about 76%, while PC1 to PC4 retain about 87%, so I will take 4 PCs, from PC1 to PC4.
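The choice of 4 PCs can be checked directly from the standard deviations printed in the summary. The sketch below copies those nine values by hand; the 85% threshold is my reading of "lose less than 15%", and the variable names are illustrative.

```r
# Standard deviations of the nine components, copied from the summary above
pc_sdev <- c(2.0336, 1.2435, 1.0818, 0.9974, 0.8128,
             0.47284, 0.3368, 0.29718, 0.25860)

# Proportion of variance explained by each component and its running total
prop_var <- pc_sdev^2 / sum(pc_sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components that keeps at least 85% of the variance
target <- 0.85  # assumed threshold: lose less than 15% of the information
n_pc   <- which(cum_var >= target)[1]

round(cum_var, 4)
n_pc
```

With a fitted `prcomp` object (hypothetically named `pca`), the same vector is available as `pca$sdev`, and the reduced data would be `pca$x[, 1:n_pc]`.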
Modeling
Scaling Dataset
| | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 1.2876597 | -1.1348666 | 0.2782514 | -0.0822077 | -0.8058219 | 0.1568645 | -1.6142372 | 1.8971765 | -0.6771431 |
| Albania | -0.5373329 | -0.4782202 | -0.0967253 | 0.0706243 | -0.3742433 | -0.3114109 | 0.6459238 | -0.8573942 | -0.4841671 |
| Algeria | -0.2720146 | -0.0988244 | -0.9631762 | -0.6398380 | -0.2201823 | 0.7869076 | 0.6684130 | -0.0382892 | -0.4639802 |
| Angola | 2.0017872 | 0.7730562 | -1.4437289 | -0.1648196 | -0.5832892 | 1.3828944 | -1.1756985 | 2.1217698 | -0.5147203 |
| Antigua and Barbuda | -0.6935483 | 0.1601861 | -0.2860339 | 0.4960755 | 0.1014267 | -0.5999442 | 0.7021467 | -0.5403213 | -0.0416917 |
| Argentina | -0.5894047 | -0.8101914 | 0.4675600 | -1.2759496 | 0.0806778 | 1.2409928 | 0.5897009 | -0.3817849 | -0.1453543 |
Determine the value of k
A note before we continue: you do not always need to search for an optimum K, because sometimes the value of K is dictated by the business problem. In this problem we are looking for levels of state development, and we will use the elbow method to choose K.
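The elbow method can be sketched in base R as below. To keep the example self-contained it uses the built-in `USArrests` data as a stand-in, not the country data, and the seed, the range of k and `nstart = 25` are arbitrary choices for illustration.

```r
# Stand-in data: the built-in USArrests set, scaled to mean 0 and sd 1.
# Replace it with the scaled country data to reproduce the article's analysis.
scaled_data <- scale(USArrests)

set.seed(100)  # arbitrary seed, only so repeated runs give the same curve
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(scaled_data, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The factoextra package provides `fviz_nbclust()` with `method = "wss"` for a similar plot, if you prefer a ready-made helper.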
Tuning Model
Add the cluster to the dataset
## child_mort exports health imports income inflation
## Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## Albania 16.6 28.0 6.55 48.6 9930 4.49
## Algeria 27.3 38.4 4.17 31.4 12900 16.10
## Angola 119.0 62.3 2.85 42.9 5900 22.40
## Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp cluster
## Afghanistan 56.2 5.82 553 1
## Albania 76.3 1.65 4090 2
## Algeria 76.5 2.89 4460 2
## Angola 60.1 6.16 3530 1
## Antigua and Barbuda 76.8 2.13 12200 2
## Argentina 75.8 2.37 10300 2
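A minimal sketch of fitting K-means and attaching the cluster label to each row. It uses three of the scaled columns from the "Scaling Dataset" table for the six countries shown, so it is a toy version of the real fit: `k = 2` is used because those six rows only contain members of clusters 1 and 2, whereas the article uses three clusters on the full data. The seed and `nstart` are arbitrary.

```r
# Scaled values for three of the nine variables, copied from the
# "Scaling Dataset" table above (first six countries only)
scaled_rows <- data.frame(
  child_mort = c(1.2876597, -0.5373329, -0.2720146, 2.0017872, -0.6935483, -0.5894047),
  life_expec = c(-1.6142372, 0.6459238, 0.6684130, -1.1756985, 0.7021467, 0.5897009),
  total_fer  = c(1.8971765, -0.8573942, -0.0382892, 2.1217698, -0.5403213, -0.3817849),
  row.names  = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina")
)

set.seed(100)  # arbitrary seed for reproducible starting centroids

# k = 2 here because these six rows only contain members of two clusters;
# the article itself uses three clusters on the full data
km <- kmeans(scaled_rows, centers = 2, nstart = 25)

# Attach the cluster label of each row as a new column
clustered <- scaled_rows
clustered$cluster <- km$cluster
clustered
```

The cluster numbers themselves are arbitrary and can change between runs; what matters is which countries end up together (here Afghanistan with Angola, as in the output above).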
Comparison of the data between clusters
## # A tibble: 3 x 10
## cluster child_mort exports health imports income inflation life_expec
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 93.0 29.2 6.39 42.3 3942. 12.0 59.2
## 2 2 21.9 40.2 6.20 47.5 12306. 7.60 72.8
## 3 3 5 58.7 8.81 51.5 45672. 2.67 80.1
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>
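A per-cluster summary like the one above can be sketched in base R with `aggregate()`. The example uses three columns of the six rows printed earlier together with their cluster labels, so its means differ from the full-data means in the tibble.

```r
# Six rows and their cluster labels as printed above (a subset of the data)
sample_rows <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119.0, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  cluster    = c(1, 2, 2, 1, 2, 2)
)

# Mean of every variable within each cluster (a base R counterpart of
# group_by() followed by summarise_all("mean") in dplyr)
cluster_means <- aggregate(. ~ cluster, data = sample_rows, FUN = mean)
cluster_means
```

Reading the rows side by side is what lets us label each cluster: high child mortality with low income and life expectancy on one side, the reverse on the other.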
From this comparison we get several insights:
1. cluster 1 has the highest child_mort and inflation
2. cluster 2 has the lowest total_fer
3. cluster 3 has the highest gdpp, income and life_expec
From that we can conclude that:
cluster 1 contains the under-developing countries
cluster 2 contains the developing countries
cluster 3 contains the developed countries
Visualize Cluster
data_viz <- dataset
data_viz$cluster <- factor(data_viz$cluster,
                           levels = c(1, 2, 3),
                           labels = c("under-developing", "developing", "developed"))
library(ggradar)
library(scales)
library(dplyr)  # provides %>%, group_by(), summarise_all(), mutate_at()
# Average every variable per cluster, then rescale each column to [0, 1] for the radar chart
dat_radar <- data_viz %>%
  group_by(cluster) %>%
  summarise_all("mean") %>%
  rename(group = cluster) %>%
  mutate(group = as.character(group)) %>%
  # pass the function directly; funs() is deprecated as of dplyr 0.8.0
  mutate_at(vars(-group), rescale)
ggradar(dat_radar,
        grid.label.size = 4,
        axis.label.size = 4,
        group.point.size = 5,
        group.line.width = 1.5,
        legend.text.size = 10)
From this plot we can conclude in more detail:
- developed countries have the highest gdpp, exports, health, imports, income and life_expec
- under-developing countries have the highest inflation, total_fer and child_mort, and the lowest values for most of the remaining variables
If we look at cluster 3 in this plot, the centroid is far from some of the observations, mainly Malta, Luxembourg and Singapore. We can assume that these observations are outliers, but we do not delete them because we only have a small number of observations.
From this plot we can conclude:
The variables that contribute most to PC1 are income, gdpp, life_expec, health, inflation, child_mort and total_fer.
The variables that contribute most to PC2 are imports and exports.
Summary
From the unsupervised learning analysis above, we can summarize that:
1. K-means clustering can be done using this dataset. With three clusters, the groups can be read as under-developing, developing and developed countries.
2. Dimensionality reduction can be performed using this dataset. We can pick PCs out of the total of 9 according to how much information we want to retain. In this article I used 4 PCs, which reduces the number of dimensions by more than 50% while retaining about 87% of the information in the data.
3. The improved dataset obtained from unsupervised learning (e.g. PCA) can be used further for supervised learning (classification) or for better visualization of high-dimensional data, to gain further insights.