Clusters of Countries
Clusters of Countries
Data Used
The data used is Population Growth, Fertility and Mortality Indicators.csv
, tells about the number of some variables related to population fertility and mortality of each country around the world.
We have some variables from the data, and they are :
T03
The country codePopulation.growth.and.indicators.of.fertility.and.mortality
The country listX
The year columnX.1
Variable which contains some indicators, this variable is going to be spread to some variables.X.2
The values of the observations.X.3
FootnotesX.4
Data source
The Goal
Assume that we are going to classify countries listed based on the indicators contained in the data.
The Flow
Libraries Importing and Data Preparation.
Exploratory Data Analyst
PCA Transformation.
Biplotting and Interpretation.
Libraries Importing and Data Preparation
Libraries Used
library(tidyverse) # Data manipulating
library(FactoMineR) # performing PCA
library(factoextra) # Creating some plot (biplot)
library(plotly) # Interactive plotting
library(tibble) # Creating rownames
library(corrplot) # Creating correlation plot
library(countrycode) #generating continent variable from country code
library(gganimate) #animating plot
Data Importing
mortal_total <- read.csv("source/Population Growth, Fertility and Mortality Indicators.csv")
str(mortal_total)
## 'data.frame': 4979 obs. of 7 variables:
## $ T03 : chr "Region/Country/Area" "1" "1" "1" ...
## $ Population.growth.and.indicators.of.fertility.and.mortality: chr "" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
## $ X : chr "Year" "2005" "2005" "2005" ...
## $ X.1 : chr "Series" "Population annual rate of increase (percent)" "Total fertility rate (children per women)" "Infant mortality for both sexes (per 1,000 live births)" ...
## $ X.2 : chr "Value" "1.3" "2.6" "49.1" ...
## $ X.3 : chr "Footnotes" "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." ...
## $ X.4 : chr "Source" "United Nations Population Division, New York, World Population Prospects: The 2017 Revision, last accessed June 2017." "United Nations Population Division, New York, World Population Prospects: The 2017 Revision; supplemented by da"| __truncated__ "United Nations Statistics Division, New York, \"Demographic Yearbook 2015\" and the demographic statistics data"| __truncated__ ...
We only need some variables to process the data, the last 2 columns and the first column will be eliminated
There is a year column (from 2000 to 2016 ), most of the countries only have values for 2005, 2010, and 2015.
The
X.1
contains 8 indicators, we’re going to spread them into their own column
Data cleaning
- In the chunk below we’re going to remove the last 2 variables and filter the year, we only need the 2015 data to interpret the latest condition of each country.
mortal_lastyears <- mortal_total %>%
select(-c(X.3,X.4)) %>%
filter(X == 2015) %>%
spread(key = X.1, value = X.2) %>% # create column for each indicator in X.1
rename(year = X,
Code = T03,
Country = Population.growth.and.indicators.of.fertility.and.mortality,
inf.mort = `Infant mortality for both sexes (per 1,000 live births)`,
life.exp.both = `Life expectancy at birth for both sexes (years)`,
life.exp.male = `Life expectancy at birth for males (years)`,
life.exp.female = `Life expectancy at birth for females (years)`,
maternal.mortality.ratio = `Maternal mortality ratio (deaths per 100,000 population)`,
pop.increase = `Population annual rate of increase (percent)`,
tot.fertil.rate = `Total fertility rate (children per women)`) %>%
mutate(inf.mort = as.numeric(as.character(inf.mort)),
life.exp.both = as.numeric(as.character(life.exp.both)),
life.exp.female = as.numeric(as.character(life.exp.female)),
life.exp.male = as.numeric(as.character(life.exp.male)),
maternal.mortality.ratio = as.numeric(as.character(maternal.mortality.ratio)),
pop.increase = as.numeric(as.character(pop.increase)),
tot.fertil.rate = as.numeric(as.character(tot.fertil.rate)))
## Warning: Problem with `mutate()` input `maternal.mortality.ratio`.
## ℹ NAs introduced by coercion
## ℹ Input `maternal.mortality.ratio` is `as.numeric(as.character(maternal.mortality.ratio))`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
## Code Country year inf.mort life.exp.both
## 1 1 Total, all countries or areas 2015 35.0 70.8
## 2 100 Bulgaria 2015 8.3 74.3
## 3 104 Myanmar 2015 45.0 66.0
## 4 108 Burundi 2015 77.9 56.1
## 5 11 Western Africa 2015 70.5 54.7
## 6 112 Belarus 2015 3.6 72.1
## life.exp.female life.exp.male maternal.mortality.ratio pop.increase
## 1 73.1 68.6 216 1.2
## 2 77.8 70.8 11 -0.6
## 3 68.3 63.7 178 0.9
## 4 58.0 54.2 712 3.0
## 5 55.6 53.9 NA 2.7
## 6 77.7 66.5 4 0.0
## tot.fertil.rate
## 1 2.5
## 2 1.5
## 3 2.3
## 4 6.0
## 5 5.5
## 6 1.6
Country
= Country list ; inf.mort
= Infant mortality for both sexes (per 1,000 live births) ; life.exp.both
= Life expectancy at birth for both sexes (years) ; life.exp.male
= Life expectancy at birth for males (years) ; life.exp.female
= Life expectancy at birth for females (years) ; maternal.mortality.ratio
= Maternal mortality ratio (deaths per 100,000 population) ; pop.increase
= Population annual rate of increase (percent) ; tot.fertil.rate
= Total fertility rate (children per women)
NA checking
## Code Country year
## 0 0 0
## inf.mort life.exp.both life.exp.female
## 31 31 29
## life.exp.male maternal.mortality.ratio pop.increase
## 29 73 0
## tot.fertil.rate
## 29
There are so many NAs in the data, it means that not all country listed have the data we need.
- We’re going to replace the NAs to the average value of each variable/indicator.
# Assigning the average value of each variables
life.exp.both.avg <- mean(mortal_lastyears$life.exp.both,na.rm = T)
life.exp.male.avg <- mean(mortal_lastyears$life.exp.male, na.rm = T)
life.exp.female.avg <- mean(mortal_lastyears$life.exp.female,na.rm = T)
inf.mort.avg <- mean(mortal_lastyears$inf.mort,na.rm = T)
maternal.mortality.ratio.avg <- mean(mortal_lastyears$maternal.mortality.ratio,na.rm = T)
pop.increase.avg <- mean(mortal_lastyears$pop.increase,na.rm = T)
tot.fertil.rate.avg <- mean(mortal_lastyears$tot.fertil.rate,na.rm = T)
# Replacing the NAs with the avg value
mortal_lastyears <- mortal_lastyears %>%
mutate(inf.mort = replace_na(inf.mort,inf.mort.avg),
life.exp.both = replace_na(life.exp.both,life.exp.both.avg),
life.exp.female = replace_na(life.exp.female,life.exp.female.avg),
life.exp.male = replace_na(life.exp.male,life.exp.male.avg),
maternal.mortality.ratio = replace_na(maternal.mortality.ratio,maternal.mortality.ratio.avg),
pop.increase = replace_na(pop.increase,pop.increase.avg),
tot.fertil.rate = replace_na(tot.fertil.rate,tot.fertil.rate.avg))
head(mortal_lastyears)
## Code Country year inf.mort life.exp.both
## 1 1 Total, all countries or areas 2015 35.0 70.8
## 2 100 Bulgaria 2015 8.3 74.3
## 3 104 Myanmar 2015 45.0 66.0
## 4 108 Burundi 2015 77.9 56.1
## 5 11 Western Africa 2015 70.5 54.7
## 6 112 Belarus 2015 3.6 72.1
## life.exp.female life.exp.male maternal.mortality.ratio pop.increase
## 1 73.1 68.6 216.0000 1.2
## 2 77.8 70.8 11.0000 -0.6
## 3 68.3 63.7 178.0000 0.9
## 4 58.0 54.2 712.0000 3.0
## 5 55.6 53.9 162.1842 2.7
## 6 77.7 66.5 4.0000 0.0
## tot.fertil.rate
## 1 2.5
## 2 1.5
## 3 2.3
## 4 6.0
## 5 5.5
## 6 1.6
There is an odd thing on the data as we replace the NA with the average number of each column. There are some rows/countries which have no observation value or only have 1 or 2 value for their indicator and we have filled them with the average values and it’s not supposed to be like that. We supposed to eliminate them.
- eliminating some rows
# we will create a vector that indicates wether a rows' values are mostly the avg values of each column or not
cb <-
mortal_lastyears$inf.mort == inf.mort.avg &
mortal_lastyears$life.exp.both == life.exp.both.avg &
mortal_lastyears$life.exp.male == life.exp.male.avg &
mortal_lastyears$life.exp.female == life.exp.female.avg &
mortal_lastyears$maternal.mortality.ratio == maternal.mortality.ratio.avg &
mortal_lastyears$tot.fertil.rate == tot.fertil.rate.avg
# join the vector to the data
mortal_lastyears <- cbind(mortal_lastyears, cb)
# eliminating the rows have mostly the average values in its columns
mortal_lastyears <- mortal_lastyears %>%
filter(cb == F)
mortal_lastyears <- mortal_lastyears[,-c(1,3,11)]
str(mortal_lastyears)
## 'data.frame': 235 obs. of 8 variables:
## $ Country : chr "Total, all countries or areas" "Bulgaria" "Myanmar" "Burundi" ...
## $ inf.mort : num 35 8.3 45 77.9 70.5 3.6 29.9 27.7 67.5 4.7 ...
## $ life.exp.both : num 70.8 74.3 66 56.1 54.7 72.1 67.6 75.3 56.4 81.8 ...
## $ life.exp.female : num 73.1 77.8 68.3 58 55.6 77.7 69.6 76.5 57.7 83.8 ...
## $ life.exp.male : num 68.6 70.8 63.7 54.2 53.9 66.5 65.5 74.1 55.1 79.7 ...
## $ maternal.mortality.ratio: num 216 11 178 712 162 ...
## $ pop.increase : num 1.2 -0.6 0.9 3 2.7 0 1.6 2 2.7 1 ...
## $ tot.fertil.rate : num 2.5 1.5 2.3 6 5.5 1.6 2.7 3 5 1.6 ...
Continent Column
I think by giving the Continent
column, we’re going to have some more insights, so let’s just do it.
mortal_lastyears$Continent <- countrycode(sourcevar = mortal_lastyears[,"Country"],
origin = "country.name",
destination = "continent")
## Warning in countrycode(sourcevar = mortal_lastyears[, "Country"], origin = "country.name", : Some values were not matched unambiguously: Africa, Asia, Australia and New Zealand, Caribbean, Central America, Central Asia, Channel Islands, Eastern Africa, Eastern Asia, Eastern Europe, Europe, Latin America & the Caribbean, Melanesia, Micronesia, Middle Africa, Northern Africa, Northern America, Northern Europe, Oceania, Other non-specified areas, Polynesia, South America, South-central Asia, South-eastern Asia, Southern Africa, Southern Asia, Southern Europe, Sub-Saharan Africa, Total, all countries or areas, Western Africa, Western Asia, Western Europe
## Warning in countrycode(sourcevar = mortal_lastyears[, "Country"], origin = "country.name", : Some strings were matched more than once, and therefore set to <NA> in the result: Australia and New Zealand,Oceania,Oceania
so some rows cannot be defined by its continent and all of them are not even a country actually. They are just regions or certain areas of the continent.
Our observations are countries so we wil just eliminate rows that represent some areas or regions.
mortal_lastyears <- mortal_lastyears %>%
mutate(Continent = replace_na(Continent, "?"))
mortal_lastyears <- mortal_lastyears %>%
filter(Continent != "?") %>%
mutate(Continent = as.factor(Continent))
- We better assign the
Country
as rownames instead.
Now the data is ready to be proceed.
Exploratory Data Analyst
- Variables Correlation
# cor(mortal_lastyears[,-8]) %>%
# as.data.frame() %>%
# rownames_to_column("series") %>%
# pivot_longer(!series, names_to ="series2", values_to ="correlation") %>%
# write_csv("corr_data.csv")
From the plot above we can conclude that :
- the correlation between life expectancy of birth of male, female, and both are really high. In this case we better use the
life expectancy of both
mortal_lastyears <- mortal_lastyears[,-c(3,4)]
# mortal_lastyears %>%
# rownames_to_column("country") %>%
# write_csv("mortal_latest.csv")
all variables have relatively strong correlation to each other but
pop.increase
the
pop.increase
has the least correlation with other variables
Life Expectantion of the World
g <- ggplot(mortal_lastyears, aes(life.exp.both, inf.mort, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g)
## `geom_smooth()` using formula 'y ~ x'
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
g6 <- ggplot(mortal_lastyears, aes(life.exp.both, pop.increase, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g6)
## `geom_smooth()` using formula 'y ~ x'
g7 <- ggplot(mortal_lastyears, aes(life.exp.both, maternal.mortality.ratio, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g7)
## `geom_smooth()` using formula 'y ~ x'
g8 <- ggplot(mortal_lastyears, aes(life.exp.both, tot.fertil.rate, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g8)
## `geom_smooth()` using formula 'y ~ x'
Africa dominates the low life expectancy area but Europe are mostly on the high area of life expectancy . The rest are spread from the middle to the high.
Usualy the countries which infant mortality is high have less life expectancy. The infants die and the life expectancy is lower than other countries, Africa dominates this area and Europe is on the other side.
The higher fertility rate the lower life expectancy,Africa dominates this area and Europe is on the other side.
Total Fertility of the World
g1 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,life.exp.both, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g1)
## `geom_smooth()` using formula 'y ~ x'
g2 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,inf.mort, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g2)
## `geom_smooth()` using formula 'y ~ x'
g3 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,maternal.mortality.ratio, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g3)
## `geom_smooth()` using formula 'y ~ x'
* Africa dominates the area which total fertility rate is high, means that Africans are “productive”.
it’s kinda make sense countries with low fertility rate have low infant mortality number.
Usualy the countries which total fertility is high have low life expectancy.
Countries with high fertility rate tend to have high maternal mortality ratio and this still dominated by African countries.
Population Increase of The world
g4 <- ggplot(mortal_lastyears, aes(pop.increase,inf.mort, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g4)
## `geom_smooth()` using formula 'y ~ x'
g5 <- ggplot(mortal_lastyears, aes(pop.increase,tot.fertil.rate, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g5)
## `geom_smooth()` using formula 'y ~ x'
* Europe has low infant mortality number but also low population increase which is rational i think.
Most African countries and some Asian country have high pop increase and high infant mortality, it’s not really good though, it seems like they produce babies as much as possible but can’t really keep them alive until adult.
Some Asian countries even keep their infant mortality low but still their population increase greatly. And they are the “oil well” of the world.
The higher total fertility rate, the higher population increase.
Data Clustering
Data Scalling
## inf.mort life.exp.both maternal.mortality.ratio pop.increase
## Min. :-1.0369 Min. :-2.6949 Min. :-0.7804 Min. :-2.7139
## 1st Qu.:-0.8089 1st Qu.:-0.6741 1st Qu.:-0.7142 1st Qu.:-0.7343
## Median :-0.3700 Median : 0.2131 Median :-0.4225 Median :-0.0744
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7055 3rd Qu.: 0.6937 3rd Qu.: 0.1243 3rd Qu.: 0.6588
## Max. : 2.9556 Max. : 1.4947 Max. : 3.5298 Max. : 3.8116
## tot.fertil.rate
## Min. :-1.1726
## 1st Qu.:-0.7470
## Median :-0.3215
## Mean : 0.0000
## 3rd Qu.: 0.6359
## Max. : 3.2245
Now the data has been scaled, we’re ready for clustering.
Optimal K value
wss <- function(data, maxCluster = 9) {
# Initialize within sum of squares
SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
for (i in 2:maxCluster) {
set.seed(10)
SSw[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}
wss(mortal_lastyears_scaled)
The elbow method shows that the optimum K value is 2. But i think we should try 3 as well since 2 clusters will not give us much information.
Biplotting
Performing PCA on the Data
Information gathered in each dimension
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.9451216 78.902432 78.90243
## comp 2 0.6727678 13.455355 92.35779
## comp 3 0.1954976 3.909951 96.26774
## comp 4 0.1354516 2.709031 98.97677
## comp 5 0.0511615 1.023230 100.00000
The dimension 1 contains 80% of information and dimension 2 contains 12% information. The total is arround 92% of information.
Variables Contribution
fviz_pca_var(mortal_pca_pr,
col.var = "contrib",
gradient.cols = c("pink", "red", "black") ,
repel = T)
Cluster Plot
2 Clusters
## Warning: ggrepel: 110 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
pr <- prcomp(mortal_lastyears_scaled)
mortal_arrow <- data.frame(pr$rotation)
mortal_pr_c2 <- data.frame(pr$x,
clust = factor(mortal_cluster2$cluster),
continent = mortal_lastyears$Continent,
country = factor(rownames(mortal_lastyears)),
inf.mort = mortal_lastyears$inf.mort,
life.exp.both = mortal_lastyears$life.exp.both,
maternal.mort.ratio = mortal_lastyears$maternal.mortality.ratio,
pop.increase = mortal_lastyears$pop.increase,
tot.fertil.rate = mortal_lastyears$tot.fertil.rate)
mortal_pr_c2 <- mortal_pr_c2 %>%
mutate(text = paste('clust :', factor(mortal_cluster2$cluster),
'
continent :' ,mortal_lastyears$Continent,
'
country :' ,factor(rownames(mortal_lastyears)),
'
inf.mort :' ,mortal_lastyears$inf.mort,
'
life.exp.both :' , mortal_lastyears$life.exp.both,
'
maternal.mort.ratio :', mortal_lastyears$maternal.mortality.ratio,
'
pop.increase :', mortal_lastyears$pop.increase,
'
tot.fertil.rate :', mortal_lastyears$tot.fertil.rate))
po <- ggplot(mortal_pr_c2, aes(PC1,PC2, text = text ))+
geom_hline(aes(yintercept=0), size=.2, alpha = 0.5, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.5, linetype = 2)+
geom_point(aes(col = clust, pch = continent)) +
theme_classic()
ggplotly(po, tooltip = 'text')
the countries which have low life expectancy for male and female
the countries which have high fertility rate
the countries which have high population increase
and African countries dominate this cluster.
This cluster indicates the countries contained maybe are not a healthy country since they have low life expectancy. This countries will have more young people in the future since the are high fertility rate and the population grows rapidly.
cluster 2 is :
the countries which have high life expectancy for male and female
the countries which have low fertility rate
the countries which have low infant mortality number
the countries which have low maternal mortality ratio
and all Europe countries are in cluster2.
This cluster indicates the countries contained will tend to have less productive people in the future since the fertility rate is not really good and the population is not growing well. In this case, high life expectancy will make this countries population dominated by old people one day.
3 Clusters
## Warning: ggrepel: 110 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
mortal_pr_c3 <- data.frame(pr$x,
clust = factor(mortal_cluster3$cluster),
continent = mortal_lastyears$Continent,
country = factor(rownames(mortal_lastyears)),
inf.mort = mortal_lastyears$inf.mort,
life.exp.both = mortal_lastyears$life.exp.both,
maternal.mort.ratio = mortal_lastyears$maternal.mortality.ratio,
pop.increase = mortal_lastyears$pop.increase,
tot.fertil.rate = mortal_lastyears$tot.fertil.rate)
mortal_pr_c3 <- mortal_pr_c3 %>%
mutate(text = paste('clust :', factor(mortal_cluster3$cluster),
'
continent :' ,mortal_lastyears$Continent,
'
country :' ,factor(rownames(mortal_lastyears)),
'
inf.mort :' ,mortal_lastyears$inf.mort,
'
life.exp.both :' , mortal_lastyears$life.exp.both,
'
maternal.mort.ratio :', mortal_lastyears$maternal.mortality.ratio,
'
pop.increase :', mortal_lastyears$pop.increase,
'
tot.fertil.rate :', mortal_lastyears$tot.fertil.rate))
po <- ggplot(mortal_pr_c3, aes(PC1,PC2, text = text ))+
geom_hline(aes(yintercept=0), size=.2, alpha = 0.5, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.5, linetype = 2)+
geom_point(aes(col = clust, pch = continent)) +
theme_classic()
ggplotly(po, tooltip = 'text')
When we divide the data into 3 clusters, we can conclude that the cluster 1 is :
the countries which have low life expectancy for male and female
the countries which have high fertility rate
the countries which have high population increase
African countries still dominate this cluster
This cluster is not really different with the cluster 1 from the case before.
cluster 2 is :
the countries in the middle, their observation values are near the average.
there are some outliers in this cluster. they are countries with high population growing and low infant mortality, the “oil well” i’ve told you before.
cluster 3 is :
the countries which have high life expectancy for male and female
the countries which have low fertility rate
the countries which have low population increase
the countries which have low maternal mortality ratio
This cluster indicates the countries contained will more likely to have less young people than countries in other clusters. the have low pop. increase, fertility rate. These countries should be more “productive”.
Animated Plot
So we’re going to see the animated plot of each country of each cluster from 2005 to 2015. We expect to see some countries change their cluster from time to time.
Data Tidying
The data tidying is not really different from the data preparation above, but we have the year
column this time.
mortal_allyears <- mortal_total %>%
select(-c(X.3,X.4)) %>%
filter(Population.growth.and.indicators.of.fertility.and.mortality %in% rownames(mortal_lastyears)) %>%
spread(key = X.1, value = X.2) %>% # create column for each indicator in X.1
rename(year = X,
Code = T03,
Country = Population.growth.and.indicators.of.fertility.and.mortality,
inf.mort = `Infant mortality for both sexes (per 1,000 live births)`,
life.exp.both = `Life expectancy at birth for both sexes (years)`,
maternal.mortality.ratio = `Maternal mortality ratio (deaths per 100,000 population)`,
pop.increase = `Population annual rate of increase (percent)`,
tot.fertil.rate = `Total fertility rate (children per women)`) %>%
mutate(inf.mort = as.numeric(as.character(inf.mort)),
life.exp.both = as.numeric(as.character(life.exp.both)),
maternal.mortality.ratio = as.numeric(as.character(maternal.mortality.ratio)),
pop.increase = as.numeric(as.character(pop.increase)),
tot.fertil.rate = as.numeric(as.character(tot.fertil.rate)))
## Warning: Problem with `mutate()` input `maternal.mortality.ratio`.
## ℹ NAs introduced by coercion
## ℹ Input `maternal.mortality.ratio` is `as.numeric(as.character(maternal.mortality.ratio))`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
## 'data.frame': 615 obs. of 10 variables:
## $ Code : chr "100" "100" "100" "104" ...
## $ Country : chr "Bulgaria" "Bulgaria" "Bulgaria" "Myanmar" ...
## $ year : chr "2005" "2010" "2015" "2005" ...
## $ inf.mort : num 12.7 9.5 8.3 57.9 52.2 45 94.6 86.2 77.9 9.6 ...
## $ life.exp.both : num 72.2 73.1 74.3 62.9 64.3 66 52 53.7 56.1 67.8 ...
## $ Life expectancy at birth for females (years): chr "75.8" "76.7" "77.8" "65.0" ...
## $ Life expectancy at birth for males (years) : chr "68.8" "69.7" "70.8" "60.9" ...
## $ maternal.mortality.ratio : num 15 11 11 248 205 178 863 808 712 13 ...
## $ pop.increase : num -0.8 -0.7 -0.6 1 0.7 0.9 3 3.3 3 -0.6 ...
## $ tot.fertil.rate : num 1.2 1.5 1.5 2.9 2.6 2.3 6.9 6.5 6 1.3 ...
# adding continent
mortal_allyears$Continent <- countrycode(sourcevar = mortal_allyears[,"Country"],
origin = "country.name",
destination = "continent")
mortal_allyears <- mortal_allyears %>%
filter(Continent != "?") %>%
select( -c("Code","Life expectancy at birth for females (years)", "Life expectancy at birth for males (years)")) %>%
mutate(Continent = as.factor(Continent))
mortal_allyears %>%
is.na() %>%
colSums()
## Country year inf.mort
## 0 0 17
## life.exp.both maternal.mortality.ratio pop.increase
## 16 74 5
## tot.fertil.rate Continent
## 5 0
# assigning a rownames from Country var.
mortal_lastyears$Country <- rownames(mortal_lastyears)
#creating a new dataframe to merge its cluster to the data.
country_cluster <- mortal_lastyears %>%
select(c("Country","clust2","clust3"))
mortal_allyears <- merge(country_cluster,mortal_allyears,by = "Country")
## Country clust2 clust3
## 0 0 0
## year inf.mort life.exp.both
## 0 17 16
## maternal.mortality.ratio pop.increase tot.fertil.rate
## 74 5 5
## Continent
## 0
life.exp.both.avg.2 <- mean(mortal_allyears$life.exp.both,na.rm = T)
life.exp.male.avg.2 <- mean(mortal_allyears$life.exp.male, na.rm = T)
## Warning in mean.default(mortal_allyears$life.exp.male, na.rm = T): argument is
## not numeric or logical: returning NA
## Warning in mean.default(mortal_allyears$life.exp.female, na.rm = T): argument is
## not numeric or logical: returning NA
inf.mort.avg.2 <- mean(mortal_allyears$inf.mort,na.rm = T)
maternal.mortality.ratio.avg.2 <- mean(mortal_allyears$maternal.mortality.ratio,na.rm = T)
pop.increase.avg.2 <- mean(mortal_allyears$pop.increase,na.rm = T)
tot.fertil.rate.avg.2 <- mean(mortal_allyears$tot.fertil.rate,na.rm = T)
mortal_allyears <- mortal_allyears %>%
mutate(inf.mort = replace_na(inf.mort,inf.mort.avg.2),
life.exp.both = replace_na(life.exp.both,life.exp.both.avg.2),
maternal.mortality.ratio = replace_na(maternal.mortality.ratio,maternal.mortality.ratio.avg.2),
pop.increase = replace_na(pop.increase,pop.increase.avg.2),
tot.fertil.rate = replace_na(tot.fertil.rate,tot.fertil.rate.avg.2))
mortal_allyears %>%
is.na() %>%
colSums()
## Country clust2 clust3
## 0 0 0
## year inf.mort life.exp.both
## 0 0 0
## maternal.mortality.ratio pop.increase tot.fertil.rate
## 0 0 0
## Continent
## 0
#eliminating some rows
cb2 <-
mortal_allyears$inf.mort == inf.mort.avg.2 &
mortal_allyears$life.exp.both == life.exp.both.avg.2 &
mortal_allyears$maternal.mortality.ratio == maternal.mortality.ratio.avg.2 &
mortal_allyears$tot.fertil.rate == tot.fertil.rate.avg.2 |
mortal_allyears$maternal.mortality.ratio == maternal.mortality.ratio.avg.2
mortal_allyears <- cbind(mortal_allyears, cb2)
mortal_allyears <- mortal_allyears %>%
filter(cb2 == F)
mortal_allyears <- mortal_allyears[,-11]
str(mortal_allyears)
## 'data.frame': 541 obs. of 10 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
## $ clust2 : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
## $ clust3 : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
## $ year : chr "2005" "2010" "2015" "2005" ...
## $ inf.mort : num 89.5 76.7 68.6 21.1 16.8 ...
## $ life.exp.both : num 56.9 60 62.3 74.8 75.6 77.7 71.5 73.9 75.3 50 ...
## $ maternal.mortality.ratio: num 821 584 396 30 30 29 148 147 140 705 ...
## $ pop.increase : num 4.4 2.8 3.2 -0.3 -0.9 -0.1 1.3 1.6 2 3.5 ...
## $ tot.fertil.rate : num 7.2 6.4 5.3 1.9 1.6 1.7 2.4 2.7 3 6.6 ...
## $ Continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...
Plotting
## inf.mort life.exp.both maternal.mortality.ratio
## Min. :-1.0953 Min. :-2.7946 Min. :-0.7531
## 1st Qu.:-0.8480 1st Qu.:-0.6549 1st Qu.:-0.7024
## Median :-0.3681 Median : 0.2701 Median :-0.5165
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7282 3rd Qu.: 0.7158 3rd Qu.: 0.4170
## Max. : 2.8433 Max. : 1.5739 Max. : 3.2301
## pop.increase tot.fertil.rate
## Min. :-2.50753 Min. :-1.2305
## 1st Qu.:-0.66838 1st Qu.:-0.7849
## Median :-0.07722 Median :-0.3393
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.57962 3rd Qu.: 0.6791
## Max. : 8.46171 Max. : 2.9707
## inf.mort life.exp.both maternal.mortality.ratio pop.increase
## Min. :-1.0369 Min. :-2.6949 Min. :-0.7804 Min. :-2.7139
## 1st Qu.:-0.8089 1st Qu.:-0.6741 1st Qu.:-0.7142 1st Qu.:-0.7343
## Median :-0.3700 Median : 0.2131 Median :-0.4225 Median :-0.0744
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7055 3rd Qu.: 0.6937 3rd Qu.: 0.1243 3rd Qu.: 0.6588
## Max. : 2.9556 Max. : 1.4947 Max. : 3.5298 Max. : 3.8116
## tot.fertil.rate
## Min. :-1.1726
## 1st Qu.:-0.7470
## Median :-0.3215
## Mean : 0.0000
## 3rd Qu.: 0.6359
## Max. : 3.2245
pr.2 <- prcomp(mortal_allyears_scaled,scale. = F)
pr.3 <- PCA(mortal_allyears_scaled, graph = F, scale.unit = F)
mortal_allyears_clust <- data.frame(pr.2$x,
mortal_allyears)
mortal_allyears_clust <- mortal_allyears_clust %>%
select(c(1,2,6,7,8,9,15)) %>%
filter(year %in% c(2005,2010,2015))
mortal_allyears_clust$year <- as.character(mortal_allyears_clust$year)
mortal_allyears_clust$year <- as.numeric(mortal_allyears_clust$year)
str(mortal_allyears_clust)
## 'data.frame': 541 obs. of 7 variables:
## $ PC1 : num -4.8 -3.38 -2.48 1.48 1.8 ...
## $ PC2 : num -0.674 -0.027 -0.489 0.799 1.122 ...
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
## $ clust2 : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
## $ clust3 : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
## $ year : num 2005 2010 2015 2005 2010 ...
## $ Continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...
- 2 Clusters
ggplot(mortal_allyears_clust, aes(-PC1,-PC2)) +
geom_hline(aes(yintercept=0), size=.2, alpha = 0.8, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.8, linetype = 2)+
geom_point(aes(col = clust2, pch = Continent))+
theme_classic() +
transition_time(year,range = c(2005,2015))
Some countries are moving from cluster 1 to cluster 2.
ggplot(mortal_allyears_clust, aes(PC1,PC2)) +
geom_hline(aes(yintercept=0), size=.2, alpha = 0.8, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.8, linetype = 2) +
geom_point(aes(col = clust3, pch = Continent))+
theme_classic() +
transition_time(year,range = c(2005,2015))
There are some countries change their cluster.
The cluster position are flipped, its because the “var” plot is different
They’re flipped 180 degrees for each arrow, so the information gained from animated plot is still valid anyway.
Recommendation
Based on the previous analyst, I recommend to use 3 cluster because it gives us some more information. The 2 cluster is too general while the 3 cluster is more specific.
The use of 2 cluster only give us information that there are 2 groups of country, the first which have high life expectancy, low fertility rate, and low pop. increase. and the other one is the opposite.
But when we use 3 cluster we can see the middle cluster between the extremes.