This project aims to explore the clustering of countries based on the sex ratio of newborns and GDP per capita in 2022. With the natural sex ratio of male vs. female, the ratio should be close to one. According to certain researches, there are theories mentioning that there might be more males than females because of the mobility of the Y-carrying sperm, which might have a higher success rate of fertilizing the egg due to its lighter weight and other reasons. However, with such factors, they should not influence the ratio significantly. In this clustering report, we will not look at such factors.
This clustering report aims to see if the economy, geographic locations, and culture could potentially influence the sex ratio of the newborns, which could also potentially represent or contribute to the equality of sexes in certain countries.
With the economy, geographic locations, and culture, economy is more scalable; hence we are adding the GDP per capita for each country.
With the initial hypothesis, there should be countries having sex ratio below 1, near 1, and above 1 without human interference. Hence, the initial guess of the clustering \(k\) value should be 3.
Countries without GDP per capita data or sex ratio in 2022 were removed from the dataset.
Dataset Resources: - Sex Ratio: World Bank Group - GDP per Capita: World Bank Group
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("factoextra")
install.packages("flexclust")
install.packages("fpc")
install.packages("clustertend")
install.packages("cluster")
install.packages("ClusterR")
install.packages('hopkins')
install.packages("moments")
install.packages("e1071")
library(dplyr)
library(factoextra)
library(stats)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(hopkins)
library(ggplot2)
library(dplyr)
library(rnaturalearth)
library(rnaturalearthdata)
library(moments)
library(e1071)
# upload the dataset
sex_ratio <- read.csv("USE ME! countries by rows transposed - 2022 sex ratio of new born.csv")
# inspect the dataset
str(sex_ratio)
## 'data.frame': 205 obs. of 3 variables:
## $ country : chr "ABW" "AFG" "AGO" "ALB" ...
## $ year_2022: num 1.04 1.05 1.03 1.07 1.06 ...
## $ gdp_2022 : int 30560 357 2930 6846 42414 49899 13936 6572 18017 20118 ...
# check if there is any missing values
table(sex_ratio$year_2022, useNA = "ifany")
##
## 1.011 1.014 1.02 1.021 1.022 1.023 1.024 1.025 1.026 1.027 1.028 1.029 1.03
## 2 1 1 1 1 1 1 1 2 2 3 1 5
## 1.031 1.032 1.033 1.035 1.036 1.037 1.038 1.039 1.04 1.041 1.042 1.043 1.044
## 3 4 6 3 3 3 2 9 5 6 3 2 3
## 1.045 1.046 1.047 1.048 1.049 1.05 1.051 1.052 1.053 1.054 1.055 1.056 1.057
## 5 2 2 7 3 4 7 10 5 4 8 3 6
## 1.058 1.059 1.06 1.061 1.062 1.063 1.064 1.065 1.066 1.067 1.068 1.069 1.07
## 3 4 5 3 2 4 3 3 3 4 1 1 6
## 1.071 1.072 1.075 1.076 1.077 1.08 1.081 1.091 1.106 1.111 1.122 1.151 1.163
## 4 2 1 1 5 3 1 1 1 1 1 1 1
summary(sex_ratio)
## country year_2022 gdp_2022
## Length:205 Min. :1.011 Min. : 251
## Class :character 1st Qu.:1.039 1st Qu.: 2579
## Mode :character Median :1.051 Median : 7606
## Mean :1.052 Mean : 20365
## 3rd Qu.:1.061 3rd Qu.: 26802
## Max. :1.163 Max. :226052
# Min ratio is 1.011; 1st Qu.:1.039; Median :1.052; Mean :1.052; 3rd Qu.:1.061; Max ratio is 1.163
# GDP: Min.:251; 1st Qu.:2348; Median:7606; Mean:18970; 3rd Qu.:22915; Max.:226052
# This summary indicates that the initial hypothesis with ratio less than 1 is false. However, we should still be able to perform the clustering on this dataset even there is no country having sex ratio below 1
# The countries with lower ratio would still be explained based on the factors mentioned above
# get the values only for this dataset
sex_ratio_data <- sex_ratio[1:205,2:3]
head(sex_ratio_data)
## year_2022 gdp_2022
## 1 1.043 30560
## 2 1.052 357
## 3 1.027 2930
## 4 1.072 6846
## 5 1.062 42414
## 6 1.046 49899
# Standardising GDP per capita
sex_ratio_data$gdp_2022 <- scale(sex_ratio_data$gdp_2022, center = TRUE, scale = TRUE)
head(sex_ratio_data)
## year_2022 gdp_2022
## 1 1.043 0.3332368
## 2 1.052 -0.6539686
## 3 1.027 -0.5698683
## 4 1.072 -0.4418712
## 5 1.062 0.7206928
## 6 1.046 0.9653451
sex_ratio_data$gdp_2022 <- as.numeric(sex_ratio_data$gdp_2022)
skewness(sex_ratio_data$gdp_2022)
## [1] 3.193584
# Based on the outlier table and summary of the dataset, we can observe that gdp_2022 has a very wide range of data
# The skewness of gdp_2022 is 3.193584 which indicates that it is highly right skewed. With such data/outlier, K-Means and Hierarchical will be highly impacted. In this dataset, we can not remove the outliers since these are some countries with either significant wealth or poverty, they will still contribute great value and insights to our results. Hence these two methods will not be able to provide a robust result.
# Initial number of clusters was defined before which is 3
# Let's still compare the predefined number of clusters and the elbow method optimal number of clusters
opt_md <- Optimal_Clusters_Medoids(sex_ratio_data, 10, 'euclidean', plot_clusters = TRUE)
##
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?
knitr::include_graphics("~/Downloads/optimal cluster.png")
# After applying the elbow method, cluster number 3 or 4 is optimal in this case.
# Let's see the result of 3 and 4 clusters with average silhouette score
set.seed(123)
sex_ratio_clusters_3 <- pam(sex_ratio_data, 3)
sex_ratio_clusters_4 <- pam(sex_ratio_data, 4)
# For 3 clusters: 0.681149
sil_3 <- silhouette(sex_ratio_clusters_3$clustering, dist(sex_ratio_data))
avg_sil_3 <- mean(sil_3[, 3]) # Average silhouette width for 3 clusters
avg_sil_3
## [1] 0.681149
# For 4 clusters: 0.6392575
sil_4 <- silhouette(sex_ratio_clusters_4$clustering, dist(sex_ratio_data))
avg_sil_4 <- mean(sil_4[, 3]) # Average silhouette width for 4 clusters
avg_sil_4
## [1] 0.6392575
# After comparing the silhouette score, the optimal cluster number should be 3
# 3 centers: 93(1.051, 0.4462312); 154(1.051, -0.5414318); 46(1.036, 2.3480471)
sex_ratio_clusters <- pam(sex_ratio_data, 3)
attributes(sex_ratio_clusters)
## $names
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
##
## $class
## [1] "pam" "partition"
sex_ratio_clusters$medoids
## year_2022 gdp_2022
## 93 1.051 0.4462312
## 154 1.051 -0.5414318
## 46 1.036 2.3480471
#Average silhouette width per cluster:
# [1] 0.4785270 0.8007543 0.2617221
#Average silhouette width of total data set:
# [1] 0.681149
# Visualisation of the clustering/silhouette value
class(sex_ratio_clusters)
## [1] "pam" "partition"
head(silhouette(sex_ratio_clusters))
## cluster neighbor sil_width
## 195 1 3 0.4573847
## 52 1 3 0.4490736
## 11 1 3 0.4448303
## 34 1 3 0.4443968
## 88 1 3 0.4421323
## 163 1 3 0.4391601
plot(sex_ratio_clusters)
# silhouette width for 3 clusters: 0.68
# cluster 1: 0.48
# cluster 2: 0.80
# cluster 3: 0.26
# However, there are two clusters overlapping, the medoids are too close(1 and 2).
# Let's apply Manhattan distance
pam2 <- eclust(sex_ratio_data, "pam", k = 3, hc_metric = "manhattan")
fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 51 0.48
## 2 2 139 0.80
## 3 3 15 0.26
fviz_cluster(pam2)
# We have more distinct 3 clusters
# Overall silhouette score is 0.68
# Now let's put the clusters in the dataset for further visualization
cluster_assignments <- pam2$cluster
sex_ratio_data$Cluster <- cluster_assignments
sex_ratio$Cluster <- cluster_assignments
head(sex_ratio_data)
## year_2022 gdp_2022 Cluster
## 1 1.043 0.3332368 1
## 2 1.052 -0.6539686 2
## 3 1.027 -0.5698683 2
## 4 1.072 -0.4418712 2
## 5 1.062 0.7206928 1
## 6 1.046 0.9653451 1
# Checking the different shapes of the cluster result
fviz_cluster(pam2, geom = "point", ellipse.type = "norm") # factoextra::
fviz_cluster(pam2, geom = "point", ellipse.type = "convex") # factoextra::
# Convex shape is more suitable to capture all the data points
# More elegant plots
fviz_cluster(list(data = sex_ratio_data, cluster = sex_ratio_clusters_3$cluster),
ellipse.type = "norm", geom = "point", stand = TRUE, palette = "jco", ggtheme = theme_classic())
# This plot offers a more balanced and fair representation of clusters, making it suitable for reliable interpretation
# General characteristics of clusters
aggregate(cbind(year_2022, gdp_2022) ~ Cluster, data = sex_ratio_data, mean)
## Cluster year_2022 gdp_2022
## 1 1 1.055745 0.4966754
## 2 2 1.049439 -0.4839292
## 3 3 1.059667 2.7957145
# Cluster 2 has the lowest GDP number and average ratio(1.049439 -0.4839292)
# Cluster 1 is in the middle(1.055745 0.4966754)
# Cluster 3 has the highest GDP number and average ratio(1.059667 2.7957145)
# List countries by the cluster
cluster_1 <- sex_ratio[sex_ratio$Cluster == 1, "country"]
cluster_2 <- sex_ratio[sex_ratio$Cluster == 2, "country"]
cluster_3 <- sex_ratio[sex_ratio$Cluster == 3, "country"]
cat("Countries in Cluster 1:\n", paste(cluster_1, collapse = ", "), "\n\n")
## Countries in Cluster 1:
## ABW, AND, ARE, ATG, AUT, BEL, BHR, BHS, BRB, BRN, CAN, CUW, CYP, CZE, DEU, ESP, EST, FIN, FRA, GBR, GRC, GUM, HKG, ISR, ITA, JPN, KNA, KOR, KWT, LTU, LVA, MAC, MLT, MNP, NCL, NLD, NZL, OMN, PRI, PRT, PYF, SAU, SMR, SVK, SVN, SWE, SXM, TCA, TTO, URY, VIR
cat("Countries in Cluster 2:\n", paste(cluster_2, collapse = ", "), "\n\n")
## Countries in Cluster 2:
## AFG, AGO, ALB, ARG, ARM, ASM, AZE, BDI, BEN, BFA, BGD, BGR, BIH, BLR, BLZ, BOL, BRA, BTN, BWA, CAF, CHL, CHN, CIV, CMR, COD, COG, COL, COM, CPV, CRI, DJI, DMA, DOM, DZA, ECU, EGY, ETH, FJI, FSM, GAB, GEO, GHA, GIN, GMB, GNB, GNQ, GRD, GTM, GUY, HND, HRV, HTI, HUN, IDN, IND, IRN, IRQ, JAM, JOR, KAZ, KEN, KGZ, KHM, KIR, LAO, LBN, LBR, LBY, LCA, LKA, LSO, MAR, MDA, MDG, MDV, MEX, MHL, MKD, MLI, MMR, MNE, MNG, MOZ, MRT, MUS, MWI, MYS, NAM, NER, NGA, NIC, NPL, NRU, PAK, PAN, PER, PHL, PLW, PNG, POL, PRY, PSE, ROU, RUS, RWA, SDN, SEN, SLB, SLE, SLV, SOM, SRB, STP, SUR, SWZ, SYC, SYR, TCD, TGO, THA, TJK, TKM, TLS, TON, TUN, TUR, TUV, TZA, UGA, UKR, UZB, VCT, VNM, VUT, WSM, YEM, ZAF, ZMB, ZWE
cat("Countries in Cluster 3:\n", paste(cluster_3, collapse = ", "), "\n\n")
## Countries in Cluster 3:
## AUS, BMU, CHE, CYM, DNK, FRO, IRL, ISL, LIE, LUX, MCO, NOR, QAT, SGP, USA
# Visualize the countries by clusters on world map. This would better represent the difference of countries geographically
# Countries not in the dataset were removed from the world map for better visualization result
world <- ne_countries(scale = "medium", returnclass = "sf")
# Fix the ISO codes for France and Norway. There was some issue with these two countries, hence these two countries were pre-processed to be reflected on the map correctly
world$iso_a3[world$name == "France"] <- "FRA"
world$iso_a3[world$name == "Norway"] <- "NOR"
print(world[world$iso_a3 %in% c("FRA", "NOR"), c("name", "iso_a3")])
## Simple feature collection with 2 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -61.79409 ymin: -21.36904 xmax: 55.83906 ymax: 80.47783
## Geodetic CRS: WGS 84
## name iso_a3 geometry
## 89 Norway NOR MULTIPOLYGON (((20.62217 69...
## 161 France FRA MULTIPOLYGON (((9.480371 42...
colnames(sex_ratio)[colnames(sex_ratio) == "country"] <- "iso_a3"
# Merge World Map with Cluster Data
world_clusters <- left_join(world, sex_ratio, by = "iso_a3")
# Filter out rows where `Cluster` is NA
world_clusters_filtered <- world_clusters %>%
filter(!is.na(Cluster))
# Plot World Map with Clusters, excluding NA
ggplot(data = world_clusters_filtered) +
geom_sf(aes(fill = as.factor(Cluster)), color = "gray70", size = 0.2) +
scale_fill_manual(values = c("blue", "green", "purple")) + # Remove `na.value` argument
theme_minimal() +
labs(
title = "World Map of Clusters (Corrected)",
fill = "Cluster"
)
# Identify countries not reflected in the map in case there is any
unmatched_countries <- setdiff(sex_ratio$iso_a3, world$iso_a3)
# Print unmatched countries
cat("Countries not reflected in the world map:\n")
## Countries not reflected in the world map:
unmatched_countries
## character(0)
Cluster features:
Based on the result of the year 2022:
The popular understanding is that the richer the countries are, the more equal of sex ratio they should have. However, based on our analysis of 2022, this was not explained or supported
Geographically, most countries in Asia, Africa, Middle East and Latin America are clustered in cluster 2, which has the lowest GDP per capita
Overall, we could see that most countries with similar economy levels have the similar sex ratio of newborns
Based on the analysis, the most dominated factor of the sex ratio would be their economy levels, cultures, and locations
However, this report only consists of one year data, which is 2022, will not explicitly or fully explain this phenomenon cross time for all the countries