Clusters of Countries

Clusters of Countries

Data Used

The data used is Population Growth, Fertility and Mortality Indicators.csv, tells about the number of some variables related to population fertility and mortality of each country around the world.

We have some variables from the data, and they are :

  • T03 The country code

  • Population.growth.and.indicators.of.fertility.and.mortality The country list

  • X The year column

  • X.1 Variable which contains some indicators, this variable is going to be spread to some variables.

  • X.2 The values of the observations.

  • X.3 Footnotes

  • X.4 Data source

The Goal

Assume that we are going to classify countries listed based on the indicators contained in the data.

The Flow

  1. Libraries Importing and Data Preparation.

  2. Exploratory Data Analyst

  3. PCA Transformation.

  4. Biplotting and Interpretation.

Libraries Importing and Data Preparation

Libraries Used

library(tidyverse) # Data manipulating
library(FactoMineR) # performing PCA
library(factoextra) # Creating some plot (biplot)
library(plotly) # Interactive plotting
library(tibble) # Creating rownames
library(corrplot) # Creating correlation plot
library(countrycode) #generating continent variable from country code
library(gganimate) #animating plot

Data Importing

mortal_total <- read.csv("source/Population Growth, Fertility and Mortality Indicators.csv")

str(mortal_total)
## 'data.frame':    4979 obs. of  7 variables:
##  $ T03                                                        : chr  "Region/Country/Area" "1" "1" "1" ...
##  $ Population.growth.and.indicators.of.fertility.and.mortality: chr  "" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
##  $ X                                                          : chr  "Year" "2005" "2005" "2005" ...
##  $ X.1                                                        : chr  "Series" "Population annual rate of increase (percent)" "Total fertility rate (children per women)" "Infant mortality for both sexes (per 1,000 live births)" ...
##  $ X.2                                                        : chr  "Value" "1.3" "2.6" "49.1" ...
##  $ X.3                                                        : chr  "Footnotes" "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." ...
##  $ X.4                                                        : chr  "Source" "United Nations Population Division, New York, World Population Prospects: The 2017 Revision, last accessed June 2017." "United Nations Population Division, New York, World Population Prospects: The 2017 Revision; supplemented by da"| __truncated__ "United Nations Statistics Division, New York, \"Demographic Yearbook 2015\" and the demographic statistics data"| __truncated__ ...
  • We only need some variables to process the data, the last 2 columns and the first column will be eliminated

  • There is a year column (from 2000 to 2016 ), most of the countries only have values for 2005, 2010, and 2015.

  • The X.1 contains 8 indicators, we’re going to spread them into their own column

Data cleaning

  • In the chunk below we’re going to remove the last 2 variables and filter the year, we only need the 2015 data to interpret the latest condition of each country.
mortal_lastyears <- mortal_total %>% 
  select(-c(X.3,X.4)) %>% 
  filter(X == 2015) %>% 
  spread(key = X.1, value = X.2) %>% # create column for each indicator in X.1
  rename(year = X,
         Code = T03,
         Country = Population.growth.and.indicators.of.fertility.and.mortality,
         inf.mort = `Infant mortality for both sexes (per 1,000 live births)`,
         life.exp.both = `Life expectancy at birth for both sexes (years)`,
         life.exp.male = `Life expectancy at birth for males (years)`,
         life.exp.female = `Life expectancy at birth for females (years)`,
         maternal.mortality.ratio = `Maternal mortality ratio (deaths per 100,000 population)`,
         pop.increase = `Population annual rate of increase (percent)`,
         tot.fertil.rate = `Total fertility rate (children per women)`) %>% 
  mutate(inf.mort = as.numeric(as.character(inf.mort)),
         life.exp.both = as.numeric(as.character(life.exp.both)),
         life.exp.female = as.numeric(as.character(life.exp.female)),
         life.exp.male = as.numeric(as.character(life.exp.male)),
         maternal.mortality.ratio = as.numeric(as.character(maternal.mortality.ratio)),
         pop.increase = as.numeric(as.character(pop.increase)),
         tot.fertil.rate = as.numeric(as.character(tot.fertil.rate)))
## Warning: Problem with `mutate()` input `maternal.mortality.ratio`.
## ℹ NAs introduced by coercion
## ℹ Input `maternal.mortality.ratio` is `as.numeric(as.character(maternal.mortality.ratio))`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
head(mortal_lastyears)
##   Code                       Country year inf.mort life.exp.both
## 1    1 Total, all countries or areas 2015     35.0          70.8
## 2  100                      Bulgaria 2015      8.3          74.3
## 3  104                       Myanmar 2015     45.0          66.0
## 4  108                       Burundi 2015     77.9          56.1
## 5   11                Western Africa 2015     70.5          54.7
## 6  112                       Belarus 2015      3.6          72.1
##   life.exp.female life.exp.male maternal.mortality.ratio pop.increase
## 1            73.1          68.6                      216          1.2
## 2            77.8          70.8                       11         -0.6
## 3            68.3          63.7                      178          0.9
## 4            58.0          54.2                      712          3.0
## 5            55.6          53.9                       NA          2.7
## 6            77.7          66.5                        4          0.0
##   tot.fertil.rate
## 1             2.5
## 2             1.5
## 3             2.3
## 4             6.0
## 5             5.5
## 6             1.6

Country = Country list ; inf.mort = Infant mortality for both sexes (per 1,000 live births) ; life.exp.both = Life expectancy at birth for both sexes (years) ; life.exp.male = Life expectancy at birth for males (years) ; life.exp.female = Life expectancy at birth for females (years) ; maternal.mortality.ratio = Maternal mortality ratio (deaths per 100,000 population) ; pop.increase = Population annual rate of increase (percent) ; tot.fertil.rate = Total fertility rate (children per women)

NA checking

mortal_lastyears %>% 
  is.na() %>% 
  colSums()
##                     Code                  Country                     year 
##                        0                        0                        0 
##                 inf.mort            life.exp.both          life.exp.female 
##                       31                       31                       29 
##            life.exp.male maternal.mortality.ratio             pop.increase 
##                       29                       73                        0 
##          tot.fertil.rate 
##                       29

There are so many NAs in the data, it means that not all country listed have the data we need.

  • We’re going to replace the NAs to the average value of each variable/indicator.
# Assigning the average value of each variables
life.exp.both.avg <- mean(mortal_lastyears$life.exp.both,na.rm = T)
life.exp.male.avg <- mean(mortal_lastyears$life.exp.male, na.rm = T)
life.exp.female.avg <- mean(mortal_lastyears$life.exp.female,na.rm = T)
inf.mort.avg <- mean(mortal_lastyears$inf.mort,na.rm = T)
maternal.mortality.ratio.avg <- mean(mortal_lastyears$maternal.mortality.ratio,na.rm = T)
pop.increase.avg <- mean(mortal_lastyears$pop.increase,na.rm = T)
tot.fertil.rate.avg <- mean(mortal_lastyears$tot.fertil.rate,na.rm = T)

# Replacing the NAs with the avg value
mortal_lastyears <- mortal_lastyears %>% 
  mutate(inf.mort = replace_na(inf.mort,inf.mort.avg),
         life.exp.both = replace_na(life.exp.both,life.exp.both.avg),
         life.exp.female = replace_na(life.exp.female,life.exp.female.avg),
         life.exp.male = replace_na(life.exp.male,life.exp.male.avg),
         maternal.mortality.ratio = replace_na(maternal.mortality.ratio,maternal.mortality.ratio.avg),
         pop.increase = replace_na(pop.increase,pop.increase.avg),
         tot.fertil.rate = replace_na(tot.fertil.rate,tot.fertil.rate.avg))

head(mortal_lastyears)
##   Code                       Country year inf.mort life.exp.both
## 1    1 Total, all countries or areas 2015     35.0          70.8
## 2  100                      Bulgaria 2015      8.3          74.3
## 3  104                       Myanmar 2015     45.0          66.0
## 4  108                       Burundi 2015     77.9          56.1
## 5   11                Western Africa 2015     70.5          54.7
## 6  112                       Belarus 2015      3.6          72.1
##   life.exp.female life.exp.male maternal.mortality.ratio pop.increase
## 1            73.1          68.6                 216.0000          1.2
## 2            77.8          70.8                  11.0000         -0.6
## 3            68.3          63.7                 178.0000          0.9
## 4            58.0          54.2                 712.0000          3.0
## 5            55.6          53.9                 162.1842          2.7
## 6            77.7          66.5                   4.0000          0.0
##   tot.fertil.rate
## 1             2.5
## 2             1.5
## 3             2.3
## 4             6.0
## 5             5.5
## 6             1.6

There is an odd thing on the data as we replace the NA with the average number of each column. There are some rows/countries which have no observation value or only have 1 or 2 value for their indicator and we have filled them with the average values and it’s not supposed to be like that. We supposed to eliminate them.

  • eliminating some rows
# we will create a vector that indicates wether a rows' values are mostly the avg values of each column or not
cb <- 
mortal_lastyears$inf.mort == inf.mort.avg &
mortal_lastyears$life.exp.both == life.exp.both.avg &
mortal_lastyears$life.exp.male == life.exp.male.avg &
mortal_lastyears$life.exp.female == life.exp.female.avg &
mortal_lastyears$maternal.mortality.ratio == maternal.mortality.ratio.avg &
mortal_lastyears$tot.fertil.rate == tot.fertil.rate.avg

# join the vector to the data
mortal_lastyears <- cbind(mortal_lastyears, cb)

# eliminating the rows have mostly the average values in its columns
mortal_lastyears <- mortal_lastyears %>% 
  filter(cb == F)

mortal_lastyears <- mortal_lastyears[,-c(1,3,11)]

str(mortal_lastyears)
## 'data.frame':    235 obs. of  8 variables:
##  $ Country                 : chr  "Total, all countries or areas" "Bulgaria" "Myanmar" "Burundi" ...
##  $ inf.mort                : num  35 8.3 45 77.9 70.5 3.6 29.9 27.7 67.5 4.7 ...
##  $ life.exp.both           : num  70.8 74.3 66 56.1 54.7 72.1 67.6 75.3 56.4 81.8 ...
##  $ life.exp.female         : num  73.1 77.8 68.3 58 55.6 77.7 69.6 76.5 57.7 83.8 ...
##  $ life.exp.male           : num  68.6 70.8 63.7 54.2 53.9 66.5 65.5 74.1 55.1 79.7 ...
##  $ maternal.mortality.ratio: num  216 11 178 712 162 ...
##  $ pop.increase            : num  1.2 -0.6 0.9 3 2.7 0 1.6 2 2.7 1 ...
##  $ tot.fertil.rate         : num  2.5 1.5 2.3 6 5.5 1.6 2.7 3 5 1.6 ...

Continent Column

I think by giving the Continent column, we’re going to have some more insights, so let’s just do it.

mortal_lastyears$Continent <- countrycode(sourcevar = mortal_lastyears[,"Country"],
                                          origin = "country.name",
                                          destination = "continent")
## Warning in countrycode(sourcevar = mortal_lastyears[, "Country"], origin = "country.name", : Some values were not matched unambiguously: Africa, Asia, Australia and New Zealand, Caribbean, Central America, Central Asia, Channel Islands, Eastern Africa, Eastern Asia, Eastern Europe, Europe, Latin America & the Caribbean, Melanesia, Micronesia, Middle Africa, Northern Africa, Northern America, Northern Europe, Oceania, Other non-specified areas, Polynesia, South America, South-central Asia, South-eastern Asia, Southern Africa, Southern Asia, Southern Europe, Sub-Saharan Africa, Total, all countries or areas, Western Africa, Western Asia, Western Europe
## Warning in countrycode(sourcevar = mortal_lastyears[, "Country"], origin = "country.name", : Some strings were matched more than once, and therefore set to <NA> in the result: Australia and New Zealand,Oceania,Oceania

so some rows cannot be defined by its continent and all of them are not even a country actually. They are just regions or certain areas of the continent.

Our observations are countries so we wil just eliminate rows that represent some areas or regions.

mortal_lastyears <- mortal_lastyears %>% 
  mutate(Continent = replace_na(Continent, "?"))

mortal_lastyears <- mortal_lastyears %>% 
  filter(Continent != "?") %>% 
  mutate(Continent = as.factor(Continent))
  • We better assign the Country as rownames instead.
mortal_lastyears <- column_to_rownames(mortal_lastyears,"Country")

Now the data is ready to be proceed.

Exploratory Data Analyst

  • Variables Correlation
corrplot(cor(mortal_lastyears[,-8]), method = "circle")

# cor(mortal_lastyears[,-8]) %>% 
#   as.data.frame() %>%
#   rownames_to_column("series") %>% 
#   pivot_longer(!series, names_to ="series2", values_to ="correlation") %>% 
#   write_csv("corr_data.csv")

From the plot above we can conclude that :

  • the correlation between life expectancy of birth of male, female, and both are really high. In this case we better use the life expectancy of both
mortal_lastyears <- mortal_lastyears[,-c(3,4)]

# mortal_lastyears %>% 
#   rownames_to_column("country") %>% 
#   write_csv("mortal_latest.csv")
  • all variables have relatively strong correlation to each other but pop.increase

  • the pop.increase has the least correlation with other variables

Life Expectantion of the World

g <- ggplot(mortal_lastyears, aes(life.exp.both, inf.mort, text = rownames(mortal_lastyears))) +
  geom_jitter(aes(col = Continent)) +
  geom_smooth(aes(group = 1), method = loess)+
  theme_classic()

ggplotly(g)
## `geom_smooth()` using formula 'y ~ x'
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
g6 <- ggplot(mortal_lastyears, aes(life.exp.both, pop.increase, text = rownames(mortal_lastyears))) +
  geom_jitter(aes(col = Continent)) +
  geom_smooth(aes(group = 1), method = loess)+
  theme_classic()

ggplotly(g6)
## `geom_smooth()` using formula 'y ~ x'
g7 <- ggplot(mortal_lastyears, aes(life.exp.both, maternal.mortality.ratio, text = rownames(mortal_lastyears))) +
  geom_jitter(aes(col = Continent)) +
  geom_smooth(aes(group = 1), method = loess)+
  theme_classic()

ggplotly(g7)
## `geom_smooth()` using formula 'y ~ x'
g8 <- ggplot(mortal_lastyears, aes(life.exp.both, tot.fertil.rate, text = rownames(mortal_lastyears))) +
  geom_jitter(aes(col = Continent)) +
  geom_smooth(aes(group = 1), method = loess)+
  theme_classic()

ggplotly(g8)
## `geom_smooth()` using formula 'y ~ x'
  • Africa dominates the low life expectancy area but Europe are mostly on the high area of life expectancy . The rest are spread from the middle to the high.

  • Usualy the countries which infant mortality is high have less life expectancy. The infants die and the life expectancy is lower than other countries, Africa dominates this area and Europe is on the other side.

  • The higher fertility rate the lower life expectancy,Africa dominates this area and Europe is on the other side.

Total Fertility of the World

g1 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,life.exp.both, text = rownames(mortal_lastyears))) + 
  geom_jitter(aes(col = Continent)) +
  geom_smooth(method = loess, aes(group = 1))+
  theme_classic()

ggplotly(g1)
## `geom_smooth()` using formula 'y ~ x'
g2 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,inf.mort, text = rownames(mortal_lastyears))) + 
  geom_jitter(aes(col = Continent)) +
  geom_smooth(method = loess, aes(group = 1))+
  theme_classic()

ggplotly(g2)
## `geom_smooth()` using formula 'y ~ x'
g3 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,maternal.mortality.ratio, text = rownames(mortal_lastyears))) + 
  geom_jitter(aes(col = Continent)) +
  geom_smooth(method = loess, aes(group = 1))+
  theme_classic()

ggplotly(g3)
## `geom_smooth()` using formula 'y ~ x'

* Africa dominates the area which total fertility rate is high, means that Africans are “productive”.

  • it’s kinda make sense countries with low fertility rate have low infant mortality number.

  • Usualy the countries which total fertility is high have low life expectancy.

  • Countries with high fertility rate tend to have high maternal mortality ratio and this still dominated by African countries.

Population Increase of The world

g4 <- ggplot(mortal_lastyears, aes(pop.increase,inf.mort, text = rownames(mortal_lastyears))) + 
  geom_jitter(aes(col = Continent)) +
  geom_smooth(method = loess, aes(group = 1))+
  theme_classic()

ggplotly(g4)
## `geom_smooth()` using formula 'y ~ x'
g5 <- ggplot(mortal_lastyears, aes(pop.increase,tot.fertil.rate, text = rownames(mortal_lastyears))) + 
  geom_jitter(aes(col = Continent)) +
  geom_smooth(method = loess, aes(group = 1))+
  theme_classic()

ggplotly(g5)
## `geom_smooth()` using formula 'y ~ x'

* Europe has low infant mortality number but also low population increase which is rational i think.

  • Most African countries and some Asian country have high pop increase and high infant mortality, it’s not really good though, it seems like they produce babies as much as possible but can’t really keep them alive until adult.

  • Some Asian countries even keep their infant mortality low but still their population increase greatly. And they are the “oil well” of the world.

  • The higher total fertility rate, the higher population increase.

Data Clustering

Data Scalling

mortal_lastyears_scaled <- scale(mortal_lastyears[,-6])
summary(mortal_lastyears_scaled)
##     inf.mort       life.exp.both     maternal.mortality.ratio  pop.increase    
##  Min.   :-1.0369   Min.   :-2.6949   Min.   :-0.7804          Min.   :-2.7139  
##  1st Qu.:-0.8089   1st Qu.:-0.6741   1st Qu.:-0.7142          1st Qu.:-0.7343  
##  Median :-0.3700   Median : 0.2131   Median :-0.4225          Median :-0.0744  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000          Mean   : 0.0000  
##  3rd Qu.: 0.7055   3rd Qu.: 0.6937   3rd Qu.: 0.1243          3rd Qu.: 0.6588  
##  Max.   : 2.9556   Max.   : 1.4947   Max.   : 3.5298          Max.   : 3.8116  
##  tot.fertil.rate  
##  Min.   :-1.1726  
##  1st Qu.:-0.7470  
##  Median :-0.3215  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6359  
##  Max.   : 3.2245

Now the data has been scaled, we’re ready for clustering.

Optimal K value

wss <- function(data, maxCluster = 9) {
    # Initialize within sum of squares
    SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
    for (i in 2:maxCluster) {
      set.seed(10)
      SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}

wss(mortal_lastyears_scaled)

The elbow method shows that the optimum K value is 2. But i think we should try 3 as well since 2 clusters will not give us much information.

K-means

  • k-means modelling
set.seed(11)
mortal_cluster2 <- kmeans(mortal_lastyears_scaled,2) 
mortal_cluster3 <- kmeans(mortal_lastyears_scaled,3)
  • cluster distribution
table(mortal_cluster2$cluster)
## 
##   1   2 
## 153  50
table(mortal_cluster3$cluster)
## 
##   1   2   3 
##  42 104  57
  • assigning cluster to new columns
mortal_lastyears$clust2 <- factor(mortal_cluster2$cluster)
mortal_lastyears$clust3 <- factor(mortal_cluster3$cluster)

Biplotting

Performing PCA on the Data

mortal_pca <- PCA(mortal_lastyears_scaled, graph = F)
mortal_pca_pr <- prcomp(mortal_lastyears_scaled)

Information gathered in each dimension

mortal_pca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  3.9451216              78.902432                          78.90243
## comp 2  0.6727678              13.455355                          92.35779
## comp 3  0.1954976               3.909951                          96.26774
## comp 4  0.1354516               2.709031                          98.97677
## comp 5  0.0511615               1.023230                         100.00000
fviz_screeplot(mortal_pca)

The dimension 1 contains 80% of information and dimension 2 contains 12% information. The total is arround 92% of information.

Variables Contribution

fviz_contrib(mortal_pca_pr, choice = "var", axes = 1)

fviz_contrib(mortal_pca_pr, choice = "var", axes = 2)

fviz_pca_var(mortal_pca_pr,
             col.var = "contrib",
             gradient.cols = c("pink", "red", "black") ,
             repel = T)

Cluster Plot

2 Clusters

fviz_cluster(mortal_cluster2,
             data = mortal_lastyears_scaled,
             labelsize = 5,
             repel = T)
## Warning: ggrepel: 110 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

pr <- prcomp(mortal_lastyears_scaled)

mortal_arrow <- data.frame(pr$rotation)

mortal_pr_c2 <- data.frame(pr$x, 
                        clust = factor(mortal_cluster2$cluster), 
                        continent = mortal_lastyears$Continent,
                        country = factor(rownames(mortal_lastyears)),
                        inf.mort = mortal_lastyears$inf.mort,
                        life.exp.both = mortal_lastyears$life.exp.both,
                        maternal.mort.ratio = mortal_lastyears$maternal.mortality.ratio,
                        pop.increase = mortal_lastyears$pop.increase,
                        tot.fertil.rate = mortal_lastyears$tot.fertil.rate)
mortal_pr_c2 <- mortal_pr_c2 %>% 
  mutate(text = paste('clust :', factor(mortal_cluster2$cluster), 
                        '
                      continent :' ,mortal_lastyears$Continent,
                        '
                      country :' ,factor(rownames(mortal_lastyears)),
                        '
                      inf.mort :' ,mortal_lastyears$inf.mort,
                        '
                      life.exp.both :' , mortal_lastyears$life.exp.both,
                        '
                      maternal.mort.ratio :', mortal_lastyears$maternal.mortality.ratio,
                        '
                      pop.increase :', mortal_lastyears$pop.increase,
                        '
                      tot.fertil.rate :', mortal_lastyears$tot.fertil.rate))

po <- ggplot(mortal_pr_c2, aes(PC1,PC2, text = text ))+
  geom_hline(aes(yintercept=0), size=.2, alpha = 0.5, linetype = 2) + 
  geom_vline(aes(xintercept=0), size=.2, alpha = 0.5, linetype = 2)+
  geom_point(aes(col = clust, pch = continent)) +
  theme_classic()

ggplotly(po, tooltip = 'text')

When we divide the data into 2 clusters, we can conclude that the cluster 1 is :

  • the countries which have low life expectancy for male and female

  • the countries which have high fertility rate

  • the countries which have high population increase

  • and African countries dominate this cluster.

This cluster indicates the countries contained maybe are not a healthy country since they have low life expectancy. This countries will have more young people in the future since the are high fertility rate and the population grows rapidly.

cluster 2 is :

  • the countries which have high life expectancy for male and female

  • the countries which have low fertility rate

  • the countries which have low infant mortality number

  • the countries which have low maternal mortality ratio

  • and all Europe countries are in cluster2.

This cluster indicates the countries contained will tend to have less productive people in the future since the fertility rate is not really good and the population is not growing well. In this case, high life expectancy will make this countries population dominated by old people one day.

3 Clusters

fviz_cluster(mortal_cluster3,
             data = mortal_lastyears_scaled,
             labelsize = 5,
             repel = T)
## Warning: ggrepel: 110 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

mortal_pr_c3 <- data.frame(pr$x, 
                        clust = factor(mortal_cluster3$cluster), 
                        continent = mortal_lastyears$Continent,
                        country = factor(rownames(mortal_lastyears)),
                        inf.mort = mortal_lastyears$inf.mort,
                        life.exp.both = mortal_lastyears$life.exp.both,
                        maternal.mort.ratio = mortal_lastyears$maternal.mortality.ratio,
                        pop.increase = mortal_lastyears$pop.increase,
                        tot.fertil.rate = mortal_lastyears$tot.fertil.rate)
mortal_pr_c3 <- mortal_pr_c3 %>% 
  mutate(text = paste('clust :', factor(mortal_cluster3$cluster), 
                        '
                      continent :' ,mortal_lastyears$Continent,
                        '
                      country :' ,factor(rownames(mortal_lastyears)),
                        '
                      inf.mort :' ,mortal_lastyears$inf.mort,
                        '
                      life.exp.both :' , mortal_lastyears$life.exp.both,
                        '
                      maternal.mort.ratio :', mortal_lastyears$maternal.mortality.ratio,
                        '
                      pop.increase :', mortal_lastyears$pop.increase,
                        '
                      tot.fertil.rate :', mortal_lastyears$tot.fertil.rate))


po <- ggplot(mortal_pr_c3, aes(PC1,PC2, text = text ))+
  geom_hline(aes(yintercept=0), size=.2, alpha = 0.5, linetype = 2) + 
  geom_vline(aes(xintercept=0), size=.2, alpha = 0.5, linetype = 2)+
  geom_point(aes(col = clust, pch = continent)) +
  theme_classic()

ggplotly(po, tooltip = 'text')

When we divide the data into 3 clusters, we can conclude that the cluster 1 is :

  • the countries which have low life expectancy for male and female

  • the countries which have high fertility rate

  • the countries which have high population increase

  • African countries still dominate this cluster

This cluster is not really different with the cluster 1 from the case before.

cluster 2 is :

  • the countries in the middle, their observation values are near the average.

  • there are some outliers in this cluster. they are countries with high population growing and low infant mortality, the “oil well” i’ve told you before.

cluster 3 is :

  • the countries which have high life expectancy for male and female

  • the countries which have low fertility rate

  • the countries which have low population increase

  • the countries which have low maternal mortality ratio

This cluster indicates the countries contained will more likely to have less young people than countries in other clusters. the have low pop. increase, fertility rate. These countries should be more “productive”.

Animated Plot

So we’re going to see the animated plot of each country of each cluster from 2005 to 2015. We expect to see some countries change their cluster from time to time.

Data Tidying

The data tidying is not really different from the data preparation above, but we have the year column this time.

mortal_allyears <- mortal_total %>% 
  select(-c(X.3,X.4)) %>% 
  filter(Population.growth.and.indicators.of.fertility.and.mortality %in% rownames(mortal_lastyears)) %>% 
  spread(key = X.1, value = X.2) %>% # create column for each indicator in X.1
  rename(year = X,
         Code = T03,
         Country = Population.growth.and.indicators.of.fertility.and.mortality,
         inf.mort = `Infant mortality for both sexes (per 1,000 live births)`,
         life.exp.both = `Life expectancy at birth for both sexes (years)`,
         maternal.mortality.ratio = `Maternal mortality ratio (deaths per 100,000 population)`,
         pop.increase = `Population annual rate of increase (percent)`,
         tot.fertil.rate = `Total fertility rate (children per women)`) %>% 
  mutate(inf.mort = as.numeric(as.character(inf.mort)),
         life.exp.both = as.numeric(as.character(life.exp.both)),
         maternal.mortality.ratio = as.numeric(as.character(maternal.mortality.ratio)),
         pop.increase = as.numeric(as.character(pop.increase)),
         tot.fertil.rate = as.numeric(as.character(tot.fertil.rate)))
## Warning: Problem with `mutate()` input `maternal.mortality.ratio`.
## ℹ NAs introduced by coercion
## ℹ Input `maternal.mortality.ratio` is `as.numeric(as.character(maternal.mortality.ratio))`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
str(mortal_allyears)
## 'data.frame':    615 obs. of  10 variables:
##  $ Code                                        : chr  "100" "100" "100" "104" ...
##  $ Country                                     : chr  "Bulgaria" "Bulgaria" "Bulgaria" "Myanmar" ...
##  $ year                                        : chr  "2005" "2010" "2015" "2005" ...
##  $ inf.mort                                    : num  12.7 9.5 8.3 57.9 52.2 45 94.6 86.2 77.9 9.6 ...
##  $ life.exp.both                               : num  72.2 73.1 74.3 62.9 64.3 66 52 53.7 56.1 67.8 ...
##  $ Life expectancy at birth for females (years): chr  "75.8" "76.7" "77.8" "65.0" ...
##  $ Life expectancy at birth for males (years)  : chr  "68.8" "69.7" "70.8" "60.9" ...
##  $ maternal.mortality.ratio                    : num  15 11 11 248 205 178 863 808 712 13 ...
##  $ pop.increase                                : num  -0.8 -0.7 -0.6 1 0.7 0.9 3 3.3 3 -0.6 ...
##  $ tot.fertil.rate                             : num  1.2 1.5 1.5 2.9 2.6 2.3 6.9 6.5 6 1.3 ...
# adding continent
mortal_allyears$Continent <- countrycode(sourcevar = mortal_allyears[,"Country"],
                                          origin = "country.name",
                                          destination = "continent")


mortal_allyears <- mortal_allyears %>% 
  filter(Continent != "?") %>% 
  select( -c("Code","Life expectancy at birth for females (years)", "Life expectancy at birth for males (years)")) %>% 
  mutate(Continent = as.factor(Continent))


mortal_allyears %>% 
  is.na() %>% 
  colSums()
##                  Country                     year                 inf.mort 
##                        0                        0                       17 
##            life.exp.both maternal.mortality.ratio             pop.increase 
##                       16                       74                        5 
##          tot.fertil.rate                Continent 
##                        5                        0
# assigning a rownames from Country var.
mortal_lastyears$Country <- rownames(mortal_lastyears)

#creating a new dataframe to merge its cluster to the data.
country_cluster <- mortal_lastyears %>% 
  select(c("Country","clust2","clust3"))


mortal_allyears <- merge(country_cluster,mortal_allyears,by = "Country")
#replacing NAs with avg value

mortal_allyears %>% 
  is.na() %>% 
  colSums()
##                  Country                   clust2                   clust3 
##                        0                        0                        0 
##                     year                 inf.mort            life.exp.both 
##                        0                       17                       16 
## maternal.mortality.ratio             pop.increase          tot.fertil.rate 
##                       74                        5                        5 
##                Continent 
##                        0
life.exp.both.avg.2 <- mean(mortal_allyears$life.exp.both,na.rm = T)
life.exp.male.avg.2 <- mean(mortal_allyears$life.exp.male, na.rm = T)
## Warning in mean.default(mortal_allyears$life.exp.male, na.rm = T): argument is
## not numeric or logical: returning NA
life.exp.female.avg.2 <- mean(mortal_allyears$life.exp.female,na.rm = T)
## Warning in mean.default(mortal_allyears$life.exp.female, na.rm = T): argument is
## not numeric or logical: returning NA
inf.mort.avg.2 <- mean(mortal_allyears$inf.mort,na.rm = T)
maternal.mortality.ratio.avg.2 <- mean(mortal_allyears$maternal.mortality.ratio,na.rm = T)
pop.increase.avg.2 <- mean(mortal_allyears$pop.increase,na.rm = T)
tot.fertil.rate.avg.2 <- mean(mortal_allyears$tot.fertil.rate,na.rm = T)


mortal_allyears <- mortal_allyears %>% 
  mutate(inf.mort = replace_na(inf.mort,inf.mort.avg.2),
         life.exp.both = replace_na(life.exp.both,life.exp.both.avg.2),
         maternal.mortality.ratio = replace_na(maternal.mortality.ratio,maternal.mortality.ratio.avg.2),
         pop.increase = replace_na(pop.increase,pop.increase.avg.2),
         tot.fertil.rate = replace_na(tot.fertil.rate,tot.fertil.rate.avg.2))

mortal_allyears %>% 
  is.na() %>% 
  colSums()
##                  Country                   clust2                   clust3 
##                        0                        0                        0 
##                     year                 inf.mort            life.exp.both 
##                        0                        0                        0 
## maternal.mortality.ratio             pop.increase          tot.fertil.rate 
##                        0                        0                        0 
##                Continent 
##                        0
#eliminating some rows
cb2 <- 
mortal_allyears$inf.mort == inf.mort.avg.2 &
mortal_allyears$life.exp.both == life.exp.both.avg.2 &
mortal_allyears$maternal.mortality.ratio == maternal.mortality.ratio.avg.2 &
mortal_allyears$tot.fertil.rate == tot.fertil.rate.avg.2 |
  mortal_allyears$maternal.mortality.ratio == maternal.mortality.ratio.avg.2

mortal_allyears <- cbind(mortal_allyears, cb2)

mortal_allyears <- mortal_allyears %>% 
  filter(cb2 == F)
mortal_allyears <-  mortal_allyears[,-11]
str(mortal_allyears)
## 'data.frame':    541 obs. of  10 variables:
##  $ Country                 : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
##  $ clust2                  : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
##  $ clust3                  : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
##  $ year                    : chr  "2005" "2010" "2015" "2005" ...
##  $ inf.mort                : num  89.5 76.7 68.6 21.1 16.8 ...
##  $ life.exp.both           : num  56.9 60 62.3 74.8 75.6 77.7 71.5 73.9 75.3 50 ...
##  $ maternal.mortality.ratio: num  821 584 396 30 30 29 148 147 140 705 ...
##  $ pop.increase            : num  4.4 2.8 3.2 -0.3 -0.9 -0.1 1.3 1.6 2 3.5 ...
##  $ tot.fertil.rate         : num  7.2 6.4 5.3 1.9 1.6 1.7 2.4 2.7 3 6.6 ...
##  $ Continent               : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...

Plotting

mortal_allyears_scaled <- scale(mortal_allyears[,-c(1,2,3,4,10)])
summary(mortal_allyears_scaled)
##     inf.mort       life.exp.both     maternal.mortality.ratio
##  Min.   :-1.0953   Min.   :-2.7946   Min.   :-0.7531         
##  1st Qu.:-0.8480   1st Qu.:-0.6549   1st Qu.:-0.7024         
##  Median :-0.3681   Median : 0.2701   Median :-0.5165         
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000         
##  3rd Qu.: 0.7282   3rd Qu.: 0.7158   3rd Qu.: 0.4170         
##  Max.   : 2.8433   Max.   : 1.5739   Max.   : 3.2301         
##   pop.increase      tot.fertil.rate  
##  Min.   :-2.50753   Min.   :-1.2305  
##  1st Qu.:-0.66838   1st Qu.:-0.7849  
##  Median :-0.07722   Median :-0.3393  
##  Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.57962   3rd Qu.: 0.6791  
##  Max.   : 8.46171   Max.   : 2.9707
summary(mortal_lastyears_scaled)
##     inf.mort       life.exp.both     maternal.mortality.ratio  pop.increase    
##  Min.   :-1.0369   Min.   :-2.6949   Min.   :-0.7804          Min.   :-2.7139  
##  1st Qu.:-0.8089   1st Qu.:-0.6741   1st Qu.:-0.7142          1st Qu.:-0.7343  
##  Median :-0.3700   Median : 0.2131   Median :-0.4225          Median :-0.0744  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000          Mean   : 0.0000  
##  3rd Qu.: 0.7055   3rd Qu.: 0.6937   3rd Qu.: 0.1243          3rd Qu.: 0.6588  
##  Max.   : 2.9556   Max.   : 1.4947   Max.   : 3.5298          Max.   : 3.8116  
##  tot.fertil.rate  
##  Min.   :-1.1726  
##  1st Qu.:-0.7470  
##  Median :-0.3215  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6359  
##  Max.   : 3.2245
pr.2 <-  prcomp(mortal_allyears_scaled,scale. = F)
pr.3 <- PCA(mortal_allyears_scaled, graph = F, scale.unit = F)
mortal_allyears_clust <- data.frame(pr.2$x,
                                    mortal_allyears)


mortal_allyears_clust <-  mortal_allyears_clust %>% 
  select(c(1,2,6,7,8,9,15)) %>% 
  filter(year %in% c(2005,2010,2015))

mortal_allyears_clust$year <- as.character(mortal_allyears_clust$year)
mortal_allyears_clust$year <- as.numeric(mortal_allyears_clust$year)
str(mortal_allyears_clust)
## 'data.frame':    541 obs. of  7 variables:
##  $ PC1      : num  -4.8 -3.38 -2.48 1.48 1.8 ...
##  $ PC2      : num  -0.674 -0.027 -0.489 0.799 1.122 ...
##  $ Country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
##  $ clust2   : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
##  $ clust3   : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
##  $ year     : num  2005 2010 2015 2005 2010 ...
##  $ Continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...
  • 2 Clusters
ggplot(mortal_allyears_clust, aes(-PC1,-PC2)) +
  geom_hline(aes(yintercept=0), size=.2, alpha = 0.8, linetype = 2) + 
  geom_vline(aes(xintercept=0), size=.2, alpha = 0.8, linetype = 2)+
  geom_point(aes(col = clust2, pch = Continent))+
  theme_classic() +
 transition_time(year,range = c(2005,2015))

Some countries are moving from cluster 1 to cluster 2.

 ggplot(mortal_allyears_clust, aes(PC1,PC2)) +
  geom_hline(aes(yintercept=0), size=.2, alpha = 0.8, linetype = 2) + 
  geom_vline(aes(xintercept=0), size=.2, alpha = 0.8, linetype = 2) +
  geom_point(aes(col = clust3, pch = Continent))+
  theme_classic() +
  transition_time(year,range = c(2005,2015))

There are some countries change their cluster.

The cluster position are flipped, its because the “var” plot is different

fviz_pca_var(pr.2) #the animated

fviz_pca_var(mortal_pca_pr) #the 2015 data

They’re flipped 180 degrees for each arrow, so the information gained from animated plot is still valid anyway.

Recommendation

Based on the previous analyst, I recommend to use 3 cluster because it gives us some more information. The 2 cluster is too general while the 3 cluster is more specific.

The use of 2 cluster only give us information that there are 2 groups of country, the first which have high life expectancy, low fertility rate, and low pop. increase. and the other one is the opposite.

But when we use 3 cluster we can see the middle cluster between the extremes.