Clustering on the Sex Ratio of Newborns in 2022

1 Objectives

This project aims to explore the clustering of countries based on the sex ratio of newborns and GDP per capita in 2022. With the natural sex ratio of male vs. female, the ratio should be close to one. According to certain researches, there are theories mentioning that there might be more males than females because of the mobility of the Y-carrying sperm, which might have a higher success rate of fertilizing the egg due to its lighter weight and other reasons. However, with such factors, they should not influence the ratio significantly. In this clustering report, we will not look at such factors.

This clustering report aims to see if the economy, geographic locations, and culture could potentially influence the sex ratio of the newborns, which could also potentially represent or contribute to the equality of sexes in certain countries.

With the economy, geographic locations, and culture, economy is more scalable; hence we are adding the GDP per capita for each country.

With the initial hypothesis, there should be countries having sex ratio below 1, near 1, and above 1 without human interference. Hence, the initial guess of the clustering \(k\) value should be 3.

Countries without GDP per capita data or sex ratio in 2022 were removed from the dataset.

Dataset Resources: - Sex Ratio: World Bank Group - GDP per Capita: World Bank Group

2 Packages and Libraries

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("factoextra")
install.packages("flexclust")
install.packages("fpc")
install.packages("clustertend")
install.packages("cluster")
install.packages("ClusterR")
install.packages('hopkins')
install.packages("moments")
install.packages("e1071")
library(dplyr)
library(factoextra)
library(stats)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(hopkins)
library(ggplot2)
library(dplyr)
library(rnaturalearth)
library(rnaturalearthdata)
library(moments)
library(e1071)

3 Pre-processing Dataset

# upload the dataset
sex_ratio <- read.csv("USE ME! countries by rows transposed - 2022 sex ratio of new born.csv")
# inspect the dataset
str(sex_ratio)

## 'data.frame':    205 obs. of  3 variables:
##  $ country  : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ year_2022: num  1.04 1.05 1.03 1.07 1.06 ...
##  $ gdp_2022 : int  30560 357 2930 6846 42414 49899 13936 6572 18017 20118 ...

# check if there is any missing values
table(sex_ratio$year_2022, useNA = "ifany")

## 
## 1.011 1.014  1.02 1.021 1.022 1.023 1.024 1.025 1.026 1.027 1.028 1.029  1.03 
##     2     1     1     1     1     1     1     1     2     2     3     1     5 
## 1.031 1.032 1.033 1.035 1.036 1.037 1.038 1.039  1.04 1.041 1.042 1.043 1.044 
##     3     4     6     3     3     3     2     9     5     6     3     2     3 
## 1.045 1.046 1.047 1.048 1.049  1.05 1.051 1.052 1.053 1.054 1.055 1.056 1.057 
##     5     2     2     7     3     4     7    10     5     4     8     3     6 
## 1.058 1.059  1.06 1.061 1.062 1.063 1.064 1.065 1.066 1.067 1.068 1.069  1.07 
##     3     4     5     3     2     4     3     3     3     4     1     1     6 
## 1.071 1.072 1.075 1.076 1.077  1.08 1.081 1.091 1.106 1.111 1.122 1.151 1.163 
##     4     2     1     1     5     3     1     1     1     1     1     1     1

summary(sex_ratio)

##    country            year_2022        gdp_2022     
##  Length:205         Min.   :1.011   Min.   :   251  
##  Class :character   1st Qu.:1.039   1st Qu.:  2579  
##  Mode  :character   Median :1.051   Median :  7606  
##                     Mean   :1.052   Mean   : 20365  
##                     3rd Qu.:1.061   3rd Qu.: 26802  
##                     Max.   :1.163   Max.   :226052

# Min ratio is 1.011; 1st Qu.:1.039; Median :1.052; Mean :1.052; 3rd Qu.:1.061; Max ratio is 1.163
# GDP: Min.:251; 1st Qu.:2348; Median:7606; Mean:18970; 3rd Qu.:22915; Max.:226052
# This summary indicates that the initial hypothesis with ratio less than 1 is false. However, we should still be able to perform the clustering on this dataset even there is no country having sex ratio below 1
# The countries with lower ratio would still be explained based on the factors mentioned above

# get the values only for this dataset
sex_ratio_data <- sex_ratio[1:205,2:3]
head(sex_ratio_data)

##   year_2022 gdp_2022
## 1     1.043    30560
## 2     1.052      357
## 3     1.027     2930
## 4     1.072     6846
## 5     1.062    42414
## 6     1.046    49899

# Standardising GDP per capita
sex_ratio_data$gdp_2022 <- scale(sex_ratio_data$gdp_2022, center = TRUE, scale = TRUE)
head(sex_ratio_data)

##   year_2022   gdp_2022
## 1     1.043  0.3332368
## 2     1.052 -0.6539686
## 3     1.027 -0.5698683
## 4     1.072 -0.4418712
## 5     1.062  0.7206928
## 6     1.046  0.9653451

sex_ratio_data$gdp_2022 <- as.numeric(sex_ratio_data$gdp_2022)
skewness(sex_ratio_data$gdp_2022)

## [1] 3.193584

# Based on the outlier table and summary of the dataset, we can observe that gdp_2022 has a very wide range of data
# The skewness of gdp_2022 is 3.193584 which indicates that it is highly right skewed. With such data/outlier, K-Means and Hierarchical will be highly impacted. In this dataset, we can not remove the outliers since these are some countries with either significant wealth or poverty, they will still contribute great value and insights to our results. Hence these two methods will not be able to provide a robust result.

4 Clustering Analysis

4.1 Determine the Optimal Number of Clusters

# Initial number of clusters was defined before which is 3
# Let's still compare the predefined number of clusters and the elbow method optimal number of clusters

opt_md <- Optimal_Clusters_Medoids(sex_ratio_data, 10, 'euclidean', plot_clusters = TRUE)

##   
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?

knitr::include_graphics("~/Downloads/optimal cluster.png")

# After applying the elbow method, cluster number 3 or 4 is optimal in this case.

4.2 Perform PAM Clustering

# Let's see the result of 3 and 4 clusters with average silhouette score
set.seed(123)
sex_ratio_clusters_3 <- pam(sex_ratio_data, 3)
sex_ratio_clusters_4 <- pam(sex_ratio_data, 4)
# For 3 clusters: 0.681149
sil_3 <- silhouette(sex_ratio_clusters_3$clustering, dist(sex_ratio_data))
avg_sil_3 <- mean(sil_3[, 3])  # Average silhouette width for 3 clusters
avg_sil_3

## [1] 0.681149

# For 4 clusters: 0.6392575
sil_4 <- silhouette(sex_ratio_clusters_4$clustering, dist(sex_ratio_data))
avg_sil_4 <- mean(sil_4[, 3])  # Average silhouette width for 4 clusters
avg_sil_4

## [1] 0.6392575

# After comparing the silhouette score, the optimal cluster number should be 3
# 3 centers: 93(1.051, 0.4462312); 154(1.051, -0.5414318); 46(1.036, 2.3480471)
sex_ratio_clusters <- pam(sex_ratio_data, 3)
attributes(sex_ratio_clusters)

## $names
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## 
## $class
## [1] "pam"       "partition"

sex_ratio_clusters$medoids

##     year_2022   gdp_2022
## 93      1.051  0.4462312
## 154     1.051 -0.5414318
## 46      1.036  2.3480471

#Average silhouette width per cluster:
#  [1] 0.4785270 0.8007543 0.2617221
#Average silhouette width of total data set:
#  [1] 0.681149

4.3 Visualisation of the clusters

# Visualisation of the clustering/silhouette value
class(sex_ratio_clusters)

## [1] "pam"       "partition"

head(silhouette(sex_ratio_clusters))

##     cluster neighbor sil_width
## 195       1        3 0.4573847
## 52        1        3 0.4490736
## 11        1        3 0.4448303
## 34        1        3 0.4443968
## 88        1        3 0.4421323
## 163       1        3 0.4391601

plot(sex_ratio_clusters)

# silhouette width for 3 clusters: 0.68
# cluster 1: 0.48
# cluster 2: 0.80
# cluster 3: 0.26
# However, there are two clusters overlapping, the medoids are too close(1 and 2).

4.4 PAM Clustering with Manhattan Distance

# Let's apply Manhattan distance
pam2 <- eclust(sex_ratio_data, "pam", k = 3, hc_metric = "manhattan")

fviz_silhouette(pam2)

##   cluster size ave.sil.width
## 1       1   51          0.48
## 2       2  139          0.80
## 3       3   15          0.26

fviz_cluster(pam2)

# We have more distinct 3 clusters
# Overall silhouette score is 0.68

5 Visualization of Clusters

# Now let's put the clusters in the dataset for further visualization
cluster_assignments <- pam2$cluster
sex_ratio_data$Cluster <- cluster_assignments
sex_ratio$Cluster <- cluster_assignments
head(sex_ratio_data)

##   year_2022   gdp_2022 Cluster
## 1     1.043  0.3332368       1
## 2     1.052 -0.6539686       2
## 3     1.027 -0.5698683       2
## 4     1.072 -0.4418712       2
## 5     1.062  0.7206928       1
## 6     1.046  0.9653451       1

# Checking the different shapes of the cluster result
fviz_cluster(pam2, geom = "point", ellipse.type = "norm") # factoextra::

fviz_cluster(pam2, geom = "point", ellipse.type = "convex") # factoextra::

# Convex shape is more suitable to capture all the data points

# More elegant plots
fviz_cluster(list(data = sex_ratio_data, cluster = sex_ratio_clusters_3$cluster), 
             ellipse.type = "norm", geom = "point", stand = TRUE, palette = "jco", ggtheme = theme_classic())

# This plot offers a more balanced and fair representation of clusters, making it suitable for reliable interpretation

6 Statistics by Clusters

# General characteristics of clusters
aggregate(cbind(year_2022, gdp_2022) ~ Cluster, data = sex_ratio_data, mean)

##   Cluster year_2022   gdp_2022
## 1       1  1.055745  0.4966754
## 2       2  1.049439 -0.4839292
## 3       3  1.059667  2.7957145

# Cluster 2 has the lowest GDP number and average ratio(1.049439 -0.4839292)
# Cluster 1 is in the middle(1.055745  0.4966754)
# Cluster 3 has the highest GDP number and average ratio(1.059667  2.7957145)

# List countries by the cluster
cluster_1 <- sex_ratio[sex_ratio$Cluster == 1, "country"]
cluster_2 <- sex_ratio[sex_ratio$Cluster == 2, "country"]
cluster_3 <- sex_ratio[sex_ratio$Cluster == 3, "country"]
cat("Countries in Cluster 1:\n", paste(cluster_1, collapse = ", "), "\n\n")

## Countries in Cluster 1:
##  ABW, AND, ARE, ATG, AUT, BEL, BHR, BHS, BRB, BRN, CAN, CUW, CYP, CZE, DEU, ESP, EST, FIN, FRA, GBR, GRC, GUM, HKG, ISR, ITA, JPN, KNA, KOR, KWT, LTU, LVA, MAC, MLT, MNP, NCL, NLD, NZL, OMN, PRI, PRT, PYF, SAU, SMR, SVK, SVN, SWE, SXM, TCA, TTO, URY, VIR

cat("Countries in Cluster 2:\n", paste(cluster_2, collapse = ", "), "\n\n")

## Countries in Cluster 2:
##  AFG, AGO, ALB, ARG, ARM, ASM, AZE, BDI, BEN, BFA, BGD, BGR, BIH, BLR, BLZ, BOL, BRA, BTN, BWA, CAF, CHL, CHN, CIV, CMR, COD, COG, COL, COM, CPV, CRI, DJI, DMA, DOM, DZA, ECU, EGY, ETH, FJI, FSM, GAB, GEO, GHA, GIN, GMB, GNB, GNQ, GRD, GTM, GUY, HND, HRV, HTI, HUN, IDN, IND, IRN, IRQ, JAM, JOR, KAZ, KEN, KGZ, KHM, KIR, LAO, LBN, LBR, LBY, LCA, LKA, LSO, MAR, MDA, MDG, MDV, MEX, MHL, MKD, MLI, MMR, MNE, MNG, MOZ, MRT, MUS, MWI, MYS, NAM, NER, NGA, NIC, NPL, NRU, PAK, PAN, PER, PHL, PLW, PNG, POL, PRY, PSE, ROU, RUS, RWA, SDN, SEN, SLB, SLE, SLV, SOM, SRB, STP, SUR, SWZ, SYC, SYR, TCD, TGO, THA, TJK, TKM, TLS, TON, TUN, TUR, TUV, TZA, UGA, UKR, UZB, VCT, VNM, VUT, WSM, YEM, ZAF, ZMB, ZWE

cat("Countries in Cluster 3:\n", paste(cluster_3, collapse = ", "), "\n\n")

## Countries in Cluster 3:
##  AUS, BMU, CHE, CYM, DNK, FRO, IRL, ISL, LIE, LUX, MCO, NOR, QAT, SGP, USA

7 Geographic Visualization

# Visualize the countries by clusters on world map. This would better represent the difference of countries geographically
# Countries not in the dataset were removed from the world map for better visualization result
world <- ne_countries(scale = "medium", returnclass = "sf")
# Fix the ISO codes for France and Norway. There was some issue with these two countries, hence these two countries were pre-processed to be reflected on the map correctly
world$iso_a3[world$name == "France"] <- "FRA"
world$iso_a3[world$name == "Norway"] <- "NOR"
print(world[world$iso_a3 %in% c("FRA", "NOR"), c("name", "iso_a3")])

## Simple feature collection with 2 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -61.79409 ymin: -21.36904 xmax: 55.83906 ymax: 80.47783
## Geodetic CRS:  WGS 84
##       name iso_a3                       geometry
## 89  Norway    NOR MULTIPOLYGON (((20.62217 69...
## 161 France    FRA MULTIPOLYGON (((9.480371 42...

colnames(sex_ratio)[colnames(sex_ratio) == "country"] <- "iso_a3"
# Merge World Map with Cluster Data
world_clusters <- left_join(world, sex_ratio, by = "iso_a3")
# Filter out rows where `Cluster` is NA
world_clusters_filtered <- world_clusters %>%
  filter(!is.na(Cluster))
# Plot World Map with Clusters, excluding NA
ggplot(data = world_clusters_filtered) +
  geom_sf(aes(fill = as.factor(Cluster)), color = "gray70", size = 0.2) +
  scale_fill_manual(values = c("blue", "green", "purple")) +  # Remove `na.value` argument
  theme_minimal() +
  labs(
    title = "World Map of Clusters (Corrected)",
    fill = "Cluster"
  )

# Identify countries not reflected in the map in case there is any
unmatched_countries <- setdiff(sex_ratio$iso_a3, world$iso_a3)
# Print unmatched countries
cat("Countries not reflected in the world map:\n")

## Countries not reflected in the world map:

unmatched_countries

## character(0)

8 Observations

Cluster features:
- Cluster 2 has the lowest GDP number and average ratio(1.049439 -0.4839292)
- Cluster 1 is in the middle(1.055745 0.4966754)
- Cluster 3 has the highest GDP number and average ratio(1.059667 2.7957145) We are clustering countries based on their similarities on sex ratio of newborn and economical matrix GDP per capita.
Based on the result of the year 2022:
- these countries with higher GDP per capita numbers also have a higher sex ratio, which indicates more male were born in the year of 2022
- Countries with the GDP per capita in the middle range are having the sex ratio in the middle range as well
- Countries with the worse GDP per capita numbers have the lowest sex ratio. They still have more male newborns, but comparing to the richer countries, it seems that they have more female newborn ratio than the richer countries.
- Bulgaria, Croatia, Hungary and Poland are classified as developed countries, however, they were clustered in cluster 2. The rest countries in cluster 2 are all developing countries.
The popular understanding is that the richer the countries are, the more equal of sex ratio they should have. However, based on our analysis of 2022, this was not explained or supported
Geographically, most countries in Asia, Africa, Middle East and Latin America are clustered in cluster 2, which has the lowest GDP per capita
- Most countries in North America, Europe are in cluster 1 or 3 which have better economy figures.
- Countries like Japan, Korea, New Zealand these developed countries are clustered in cluster 1
- Countries like Australia, Norway, Luxembourg, Denmark are clustered in cluster 3
- Some outliers from cluster 1: Uruguay, French Guiana in Latin America (most countries in LA are in cluster 2); Saudi Arabia, United Arab Emirates, Bahrain, Kuwait and Oman in Middle East (most countries there are clustered in cluster 2), Brunei and Japan in Asia (Most countries there are clustered in cluster 2)
- Some outliers from cluster 2: Bulgaria, Croatia, Hungary and Poland in Europe (most countries in Europe are clustered in cluster 1 or 3)
- Some outliers from cluster 3 (most of them are located in North America and Europe): Australia, Qatar and Singapore
Overall, we could see that most countries with similar economy levels have the similar sex ratio of newborns
- Most countries located in the similar locations (potentially having similar cultures) have the similar sex ratio of newborns and economy levels
- Some countries like South Korea, Japan, Singapore, Brunei have higher economy levels than other Asian countries have different sex ratio than most Asian countries as well even though their cultures may similar
- Some countries like Saudi Arabia, United Arab Emirates, Bahrain, Kuwait and Oman with higher economy levels due to their natural resources and other factors, they have different sex ratio than other Middle Eastern countries as well even though they have similar culture
Based on the analysis, the most dominated factor of the sex ratio would be their economy levels, cultures, and locations

However, this report only consists of one year data, which is 2022, will not explicitly or fully explain this phenomenon cross time for all the countries