The world is divided into countries with many differences between them. Gross Domestic Product (GDP) per capita is considered as one of the primary indicators that reflect how countries are classified based on economic development. It reflects the market value of all final goods and services produced in a country over a given period of time, making it the primary indicator for a country’s economic development. Nevertheless, there are various socio-economic indicators that aid in comprehending the level of poverty, lack of healthcare, lack of education, and inequality between men and women in parts of the world. Statistical surveys can be conducted to identify the countries that face severe issues regarding socio-economic problems. Graphical representations present the results in a comprehensible way and help draw clearer conclusions.
The aim of this paper is to demonstrate the vast diversity of countries worldwide by analyzing their socio-economic data, clustering them based on similarities and differences, and applying appropriate statistical methods to accurately characterize the resulting groups.
The dataset is compiled from the World Bank’s Health, Nutrition, Population Statistics Database, the Gender Statistics Database and the World Development Indicators Database. The dataset contains indicators on the socio-economic life of countries around the world as of 2019.
data <- read.xlsx(file.choose(), sheetName = "Data",
endRow = 218, header = TRUE)
rownames(data) <- data[,1]
| Variable | Description |
|---|---|
POP |
Population, total |
POP_GR |
Population growth (annual %) |
URB |
Urban population (% of total population) |
DENS |
Population density (people per sq. km of land area) |
BRTH |
Birth rate, crude (per 1,000 people) |
DTH |
Death rate, crude (per 1,000 people) |
GDP |
GDP per capita (current US$) |
GDP_GR |
GDP growth (annual %) |
LEXP_M |
Life expectancy at birth, male (years) |
LEXP_F |
Life expectancy at birth, female (years) |
FERT |
Fertility rate, total (births per woman) |
MOR |
Mortality rate, infant (per 1,000 live births) |
HLTH |
Current health expenditure (% of GDP) |
SUIC_F |
Suicide mortality rate, female (per 100,000 female population) |
SUIC_M |
Suicide mortality rate, male (per 100,000 female population) |
IMM |
Immunization, DPT (% of children ages 12-23 months) |
ELEC |
Access to electricity (% of population) |
INT |
Individuals using the Internet (% of population) |
WTR |
People using at least basic drinking water services (% of population) |
WAGE_F |
Wage and salaried workers, female (% of female employment) |
WAGE_M |
Wage and salaried workers, male (% of male employment) |
LBR |
Labor force, total |
EDU_F |
School enrollment, secondary, female (% gross) |
EDU_M |
School enrollment, secondary, male (% gross) |
The dataset contains 217 observations (countries) and 26 variables (socio-economic indicators).
str(data)
## 'data.frame': 217 obs. of 26 variables:
## $ CNTRY : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ CODE : chr "AFG" "ALB" "DZA" "ASM" ...
## $ POP : num 38041757 2854191 43053054 55312 77146 ...
## $ POP_GR: num 2.313 -0.426 1.934 -0.269 0.179 ...
## $ URB : num 25.8 61.2 73.2 87.1 88 ...
## $ DENS : num 58.3 104.2 18.1 276.6 164.1 ...
## $ BRTH : num 31.8 11.6 23.6 NA 7 ...
## $ DTH : num 6.29 8.08 4.72 NA 3.9 ...
## $ GDP : num 494 5396 3990 11715 40897 ...
## $ GDP_GR: num 3.912 2.113 1 -0.488 2.016 ...
## $ LEXP_F: num 66.4 80.2 78.1 NA NA ...
## $ LEXP_M: num 63.4 77 75.7 NA NA ...
## $ FERT : num 4.32 1.6 2.99 NA NA ...
## $ MORT : num 46.4 8.6 20 NA 2.5 49.9 5.6 8.1 10.2 NA ...
## $ HLTH : num 13.24 NA 6.24 NA 6.71 ...
## $ SUIC_F: num 3.6 2.7 1.8 NA NA 2.3 0.8 3.3 1.3 NA ...
## $ SUIC_M: num 4.6 5.9 3.1 NA NA 10 0 13.7 5.6 NA ...
## $ IMM : num 72 99 91 NA 99 57 95 83 92 NA ...
## $ ELEC : num 97.7 100 99.5 NA 100 ...
## $ INT : num 17.6 68.6 57.9 NA NA ...
## $ WTR : num 72.4 94.8 94.2 99.8 100 ...
## $ WAGE_F: num 8.21 47.87 74.52 NA NA ...
## $ WAGE_M: num 20.3 44.2 66.4 NA NA ...
## $ LBR : num 10382034 1421853 12500369 NA NA ...
## $ EDU_F : num NA 95.7 NA NA NA ...
## $ EDU_M : num NA 94.6 NA NA NA ...
summary(data)
## CNTRY CODE POP POP_GR
## Length:217 Length:217 Min. :1.076e+04 Min. :-1.6095
## Class :character Class :character 1st Qu.:7.779e+05 1st Qu.: 0.4265
## Mode :character Mode :character Median :6.661e+06 Median : 1.1058
## Mean :3.545e+07 Mean : 1.1996
## 3rd Qu.:2.544e+07 3rd Qu.: 1.9556
## Max. :1.408e+09 Max. : 4.4687
## NA's :1 NA's :1
## URB DENS BRTH DTH
## Min. : 13.25 Min. : 0.137 Min. : 5.90 Min. : 1.244
## 1st Qu.: 42.54 1st Qu.: 38.177 1st Qu.:10.61 1st Qu.: 5.708
## Median : 62.03 Median : 92.872 Median :17.05 Median : 7.118
## Mean : 61.21 Mean : 446.165 Mean :19.30 Mean : 7.531
## 3rd Qu.: 80.67 3rd Qu.: 233.011 3rd Qu.:26.87 3rd Qu.: 9.112
## Max. :100.00 Max. :19466.444 Max. :45.64 Max. :15.500
## NA's :3 NA's :1 NA's :11 NA's :12
## GDP GDP_GR LEXP_F LEXP_M
## Min. : 228.2 Min. :-11.143 Min. :55.49 Min. :51.08
## 1st Qu.: 2369.7 1st Qu.: 1.183 1st Qu.:69.83 1st Qu.:65.73
## Median : 7027.6 Median : 2.605 Median :77.30 Median :71.30
## Mean : 18605.3 Mean : 2.811 Mean :75.44 Mean :70.58
## 3rd Qu.: 23330.8 3rd Qu.: 4.778 3rd Qu.:81.05 3rd Qu.:76.21
## Max. :189487.1 Max. : 19.536 Max. :88.10 Max. :82.60
## NA's :12 NA's :14 NA's :18 NA's :18
## FERT MORT HLTH SUIC_F
## Min. :0.918 Min. : 1.60 Min. : 1.525 Min. : 0.300
## 1st Qu.:1.659 1st Qu.: 5.70 1st Qu.: 4.444 1st Qu.: 2.000
## Median :2.181 Median :14.30 Median : 6.272 Median : 3.300
## Mean :2.612 Mean :20.97 Mean : 6.595 Mean : 4.351
## 3rd Qu.:3.518 3rd Qu.:31.50 3rd Qu.: 8.202 3rd Qu.: 6.050
## Max. :6.824 Max. :82.40 Max. :23.962 Max. :30.100
## NA's :17 NA's :24 NA's :31 NA's :34
## SUIC_M IMM ELEC INT
## Min. : 0.00 Min. :35.00 Min. : 6.707 Min. : 5.60
## 1st Qu.: 6.70 1st Qu.:85.00 1st Qu.: 84.223 1st Qu.:35.94
## Median : 11.30 Median :93.00 Median : 99.996 Median :68.09
## Mean : 14.65 Mean :88.34 Mean : 86.352 Mean :60.99
## 3rd Qu.: 17.95 3rd Qu.:97.00 3rd Qu.:100.000 3rd Qu.:83.17
## Max. :116.00 Max. :99.00 Max. :100.000 Max. :99.70
## NA's :34 NA's :24 NA's :1 NA's :61
## WTR WAGE_F WAGE_M LBR
## Min. : 38.21 Min. : 1.05 Min. : 7.28 Min. : 31687
## 1st Qu.: 84.02 1st Qu.:29.75 1st Qu.:43.90 1st Qu.: 1055847
## Median : 96.90 Median :66.91 Median :65.76 Median : 4155002
## Mean : 89.10 Mean :59.42 Mean :60.59 Mean : 18416926
## 3rd Qu.: 99.87 3rd Qu.:88.17 3rd Qu.:80.45 3rd Qu.: 12534828
## Max. :100.00 Max. :99.47 Max. :99.61 Max. :800020955
## NA's :16 NA's :30 NA's :30 NA's :30
## EDU_F EDU_M
## Min. : 14.28 Min. : 26.81
## 1st Qu.: 86.40 1st Qu.: 83.70
## Median : 99.04 Median : 97.61
## Mean : 94.38 Mean : 93.26
## 3rd Qu.:106.63 3rd Qu.:108.16
## Max. :165.74 Max. :146.90
## NA's :101 NA's :101
The data needs to be edited before statistical methods can be used to analyze it.
data <- na.omit(round(data[,-c(1,2)],2))
nrow(data)
## [1] 94
Observations with missing values in their respective rows are removed from the dataset, which results in the exclusion of 113 countries, leaving a total of 94 countries in the dataset.
The correlation coefficients between the variables are calculated and a correlation matrix is constructed from them. The resulting matrix values are displayed.
cordata <- cor(data)
corrplot(cor(data), method = "color", tl.col = "black", tl.cex = 0.9)
The blue color denotes a positive correlation between variables, whereas the red color represents a negative correlation. The correlation matrix depicts several interesting relationships between the variables:
Access to electricity is strongly positively correlated with access to at least basic drinking water services. Access to electricity and drinking water are basic needs in society. Since both are essential requirements of a society and are being developed at a similar rate, they should be accessible at a similar level in countries.
Population has a very strong positive correlation with labor force. It is natural that countries with a larger population tend to have a higher number of people who are employed or looking for work.
The share of urban population is strongly positively correlated with access to the internet. The higher the share of urban population, the more developed the country’s economy is.
Birth rate, fertility rate and infant mortality are strongly negatively correlated with life expectancy, access to electricity, internet, at least basic drinking water services and the number of men and women who have completed secondary education. In general, countries with higher birth rates and infant mortality rates tend to have lower access to basic public services, such as access to electricity and internet.
Mortality is weakly positively correlated with access to secondary education and the number of employees. The stress levels experienced by individuals at schools or workplaces can contribute to higher mortality rates in countries.
Indeed, there are many other interesting relationships between variables. In order to identify clearly understandable relationships, the 24 variables can be reduced to a smaller number of factors.
It is checked whether the data is suitable for factor analysis. Bartlett’s test of sphericity is applied.
cortest.bartlett(cor(data), n = nrow(data))
## $chisq
## [1] 3305.226
##
## $p.value
## [1] 0
##
## $df
## [1] 276
After performing the test, we found that the p-value is smaller than the chosen significance level α = 0.05. Therefore, the null hypothesis is rejected. The data is suitable for factor analysis.
The Kaiser-Meyer-Olkin measure is calculated.
KMO(data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data)
## Overall MSA = 0.83
## MSA for each item =
## POP POP_GR URB DENS BRTH DTH GDP GDP_GR LEXP_F LEXP_M FERT
## 0.42 0.79 0.90 0.28 0.84 0.52 0.90 0.71 0.88 0.83 0.85
## MORT HLTH SUIC_F SUIC_M IMM ELEC INT WTR WAGE_F WAGE_M LBR
## 0.92 0.78 0.69 0.68 0.79 0.87 0.91 0.90 0.90 0.90 0.42
## EDU_F EDU_M
## 0.87 0.87
The obtained measure value of 0.83 indicates strong partial correlation. Thus, factor analysis can be applied.
The number of factors is determined by applying the Kaiser criterion and drawing a scree plot, where the x-axis indicates the number of factors, and the y-axis shows the corresponding eigenvalues.
ev <- eigen(cor(data))
plot(
x = seq(1:length(ev$values)), y = ev$values, type = "o",
xlab = "Factors", ylab = "Eigenvalues",
main = "Scree plot");
abline(1, 0, col = "red", lty = 2)
The number of factors to be extracted is the number of points on the graph with test values greater than 1. Five factors have test values greater than 1, but we choose to extract only four factors for better interpretability. In this case, therefore, 4 factors are distinguished.
The correlation matrix of the original variables is subjected to factor analysis using the maximum likelihood method.
fakt <- factanal(covmat = cor(data), factors = 4,
method = "mle", lower = 0.01)
fakt$loadings
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## POP -0.106 0.989
## POP_GR -0.276 -0.571 -0.583
## URB 0.793
## DENS -0.343 0.134
## BRTH -0.755 -0.228 -0.605
## DTH 0.900 0.260
## GDP 0.735 0.153
## GDP_GR -0.438 -0.134
## LEXP_F 0.878 0.346
## LEXP_M 0.888 0.264
## FERT -0.716 -0.679
## MORT -0.844 -0.113 -0.342
## HLTH 0.517 0.370
## SUIC_F 0.320 0.540 0.257
## SUIC_M 0.255 0.612 0.110
## IMM 0.485 0.107
## ELEC 0.687 0.492
## INT 0.909 0.159 0.161
## WTR 0.757 0.495
## WAGE_F 0.855 0.253 0.182
## WAGE_M 0.836 0.160 0.138 -0.128
## LBR 0.991
## EDU_F 0.822 0.136 0.193
## EDU_M 0.840 0.150 0.188
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 10.169 2.356 2.237 2.104
## Proportion Var 0.424 0.098 0.093 0.088
## Cumulative Var 0.424 0.522 0.615 0.703
The table indicates the proportion of the total variance of the variables explained by each factor. The cumulative variance indicates the proportion of the total variance of the explanatory variables explained by each factor. Factor 1 explains 42% of the total variance, Factor 2 10%, Factor 3 9% and Factor 4 9%. The identified factors explain 70% of the total variance of the variables, which indicates a reasonably good model.
The loadings of the factors in the expression of the original variables through the factors are given.
pheatmap(fakt$loadings, angle_col = 45, cluster_cols = F)
| Factor 1 | Factor 2 | Factor 3 | Factor 4 | |
|---|---|---|---|---|
| Population, total | -0.106 | 0.989 | ||
| Population growth (annual %) | -0.276 | -0.571 | -0.583 | |
| Urban population (% of total population) | 0.793 | |||
| Population density (people per sq. km of land area) | -0.343 | 0,134 | ||
| Birth rate, crude (per 1,000 people) | -0.755 | -0.228 | -0.605 | |
| Death rate, crude (per 1,000 people) | 0.900 | 0.260 | ||
| GDP per capita (current US$) | 0.735 | 0.153 | ||
| GDP growth (annual %) | -0.438 | -0.134 | ||
| Life expectancy at birth, male (years) | 0.878 | 0.346 | ||
| Life expectancy at birth, female (years) | 0.888 | 0.264 | ||
| Fertility rate, total (births per woman) | -0.716 | -0.679 | ||
| Mortality rate, infant (per 1,000 live births) | -0.844 | -0.113 | -0.342 | |
| Current health expenditure (% of GDP) | 0.517 | 0.370 | ||
| Suicide mortality rate, female (per 100,000 female population) | 0.320 | 0.540 | 0.257 | |
| Suicide mortality rate, male (per 100,000 female population) | 0.255 | 0.612 | 0.110 | |
| Immunization, DPT (% of children ages 12-23 months) | 0.485 | 0.107 | ||
| Access to electricity (% of population) | 0.687 | 0.492 | ||
| Individuals using the Internet (% of population) | 0.909 | 0.159 | 0.160 | |
| People using at least basic drinking water services (% of population) | 0.757 | 0.495 | ||
| Wage and salaried workers, female (% of female employment) | 0.855 | 0.253 | 0.182 | |
| Wage and salaried workers, male (% of male employment) | 0.836 | 0.160 | 0.138 | -0.128 |
| Labor force, total | 0.991 | |||
| School enrollment, secondary, female (% gross) | 0.822 | 0.136 | 0.193 | |
| School enrollment, secondary, male (% gross) | 0.840 | 0.150 | 0.188 |
These values represent the extent of the variability explained by each variable in each factor. The loadings of the factors are utilized for interpreting them as:
The original data is plotted on the coordinate plane of the first two factors.
scores <- factanal(data, factors = 4, method = "mle", lower = 0.01,
scores = "regression")$scores
plot(scores[,1], scores[,2], xlab = "Factor 1",
ylab = "Factor 2",
main = "The projection of world countries onto the plane of the first two factors",
type = "n")
text(scores[,1], scores[,2], labels = rownames(scores), cex = 0.7)
The graph illustrates that the lowest quality of life is observed in European countries such as Latvia, Lithuania, and Bulgaria, whereas the highest quality of life is seen in the Maldives, Bahrain, and the United Arab Emirates. The countries in the Middle East (Bahrain, UAE, Oman, Saudi Arabia, etc.) exhibit strong economic development and a very high quality of life. Chad is distinctive from the rest in that it has both a very poor economic development and a very poor overall quality of life.
As the units of measurement of the variables in the dataset differ, standardization of the data is performed. Once the units of measurement are standardised, cluster analysis can be performed.
data_sc <- scale(data)
It is necessary to decide according to which similarity measure the distance between clusters will be evaluated. The agglomerative coefficient is calculated, which measures the amount of clustering structure found.
m <- c("single", "complete", "average", "ward")
names(m) <- c("single", "complete", "average", "ward")
ac <- function(x) {
agnes(data_sc, method = x)$ac
}
round(map_dbl(m, ac),2)
## single complete average ward
## 0.79 0.83 0.80 0.93
Ward’s method identifies the strongest clustering structure of the four methods assessed, because values closer to 1 suggest strong clustering structure. The Ward’s method is applied to the standardized data, and a dendrogram is generated.
plot(hc, cex =0.5, col = "black")
rect.hclust(hc, k = 6, border = 2:7)
After reviewing the dendrogram, it was concluded that 6 clusters will be used in further analysis. The associated countries for each cluster are listed below.
clust <- as.data.frame(cbind(names(cutree(hc, k = 6)),(cutree(hc, k = 6))))
rownames(clust) <- NULL
clust1 <- clust[clust[,2]==1,]
colnames(clust1) <- c("Country", "Cluster")
clust1
## Country Cluster
## 1 Armenia 1
## 3 Austria 1
## 12 Bulgaria 1
## 22 Cuba 1
## 24 Czech Republic 1
## 31 Estonia 1
## 33 France 1
## 34 Georgia 1
## 35 Germany 1
## 37 Greece 1
## 40 Hungary 1
## 45 Italy 1
## 47 South Korea 1
## 50 Latvia 1
## 51 Lithuania 1
## 58 Mauritius 1
## 61 Montenegro 1
## 71 Poland 1
## 72 Portugal 1
## 73 Romania 1
## 74 Russia 1
## 78 Serbia 1
## 79 Slovakia 1
## 80 Slovenia 1
## 81 South Africa 1
## 83 Saint Lucia 1
## 87 Thailand 1
## 92 United States of America 1
## 93 Uruguay 1
clust2 <- clust[clust[,2]==2,]
colnames(clust2) <- c("Country", "Cluster")
clust2
## Country Cluster
## 2 Australia 2
## 7 Belgium 2
## 16 Canada 2
## 18 Chile 2
## 20 Costa Rica 2
## 25 Denmark 2
## 32 Finland 2
## 41 Iceland 2
## 43 Ireland 2
## 52 Luxembourg 2
## 64 Netherlands 2
## 65 New Zealand 2
## 66 Norway 2
## 82 Spain 2
## 84 Sweden 2
## 85 Switzerland 2
## 91 United Kingdom 2
clust3 <- clust[clust[,2]==3,]
colnames(clust3) <- c("Country", "Cluster")
clust3
## Country Cluster
## 4 Azerbaijan 3
## 8 Belize 3
## 9 Bolivia 3
## 10 Brazil 3
## 15 Cape Verde 3
## 19 Colombia 3
## 27 Dominican Republic 3
## 28 Ecuador 3
## 29 Egypt 3
## 30 El Salvador 3
## 38 Guatemala 3
## 39 Honduras 3
## 46 Kazakhstan 3
## 48 Kyrgyzstan 3
## 59 Mexico 3
## 60 Mongolia 3
## 62 Morocco 3
## 69 Peru 3
## 70 Philippines 3
## 94 Uzbekistan 3
clust4 <- clust[clust[,2]==4,]
colnames(clust4) <- c("Country", "Cluster")
clust4
## Country Cluster
## 5 Bahrain 4
## 11 Brunei Darussalam 4
## 23 Cyprus 4
## 44 Israel 4
## 54 Malaysia 4
## 55 Maldives 4
## 56 Malta 4
## 67 Oman 4
## 76 Saudi Arabia 4
## 89 Turkey 4
## 90 United Arab Emirates 4
clust5 <- clust[clust[,2]==5,]
colnames(clust5) <- c("Country", "Cluster")
clust5
## Country Cluster
## 6 Bangladesh 5
## 13 Burkina Faso 5
## 14 Burundi 5
## 17 Chad 5
## 21 Cote d'Ivoire 5
## 26 Djibouti 5
## 36 Ghana 5
## 49 Laos 5
## 53 Malawi 5
## 57 Mauritania 5
## 63 Nepal 5
## 68 Pakistan 5
## 75 Rwanda 5
## 77 Senegal 5
## 86 Tanzania 5
## 88 East Timor 5
clust6 <- clust[clust[,2]==6,]
colnames(clust6) <- c("Country", "Cluster")
clust6
## Country Cluster
## 42 India 6
Country codes are added to the dataset.
malDF <- data.frame(country = c(
# Country codes for cluster 1
"ARM", "AUT", "BGR", "CUB", "CZE", "EST", "FRA",
"GEO", "DEU", "GRC", "HUN", "ITA", "KOR", "LVA",
"LTU", "MUS", "MNE", "POL", "PRT", "ROU", "RUS",
"SRB", "SVK", "SVN", "ZAF", "LCA", "THA", "USA",
"URY",
# Country codes for cluster 2
"AUS", "BEL", "CAN", "CHL", "CRI", "DNK", "FIN",
"ISL", "IRL", "LUX", "NLD", "NZL", "NOR", "ESP",
"SWE", "CHE", "GBR",
# Country codes for cluster 3
"AZE", "BLZ", "BOL", "BRA", "CPV", "COL", "DOM",
"ECU", "EGY", "SLV", "GTM", "HND", "KAZ", "KGZ",
"MEX", "MNG", "MAR", "PER", "PHL", "UZB",
# Country codes for cluster 4
"BHR", "BRN", "CYP", "ISR", "MYS", "MDV", "MLT",
"OMN", "SAU", "TUR", "ARE",
# Country codes for cluster 5
"BGD", "BFA", "BDI", "TCD", "CIV", "DJI", "GHA",
"LAO", "MWI", "MRT", "NPL", "PAK", "RWA", "SEN",
"TZA", "TLS",
# Country codes for cluster 6
"IND"),
Clusters = c(rep(1,29),rep(2,17),rep(3,20),
rep(4,11),rep(5,16),6))
clusterNames <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3",
nameJoinColumn = "country")
## 94 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 149 codes from the map weren't represented in your data
The countries assigned to the clusters are shown on the map with different colours.
mapP <- mapCountryData(malMap, nameColumnToPlot = "Clusters",
catMethod = "categorical", missingCountryCol = "lightgrey",
colourPalette = c("red2", "seagreen", "navy",
"royalblue", "mediumvioletred",
"yellow2"), addLegend=TRUE)
Countries that belong to different clusters are marked on the map. You can see how the clusters of countries are distributed geographically.
A k-means clustering is also performed.
set.seed(1)
km <- kmeans(data_sc, centers = 6)
In this non-hierarchical cluster analysis method, the number of clusters is chosen in advance. Since the hierarchical cluster analysis resulted in 6 clusters in the printed dendrogram, the same number of clusters are used for k-means clustering. The associated countries for each cluster are listed below.
clustt <- cbind(data, cluster=km$cluster)
clustt1_var <- clustt[clustt$cluster==6,]
clustt1_var$cluster[clustt1_var$cluster == 6] <- 1
clustt1 <- as.data.frame(cbind(rownames(clustt1_var), clustt1_var[,25]))
colnames(clustt1) <- c("Country", "Cluster")
clustt1
## Country Cluster
## 1 Australia 1
## 2 Austria 1
## 3 Belgium 1
## 4 Canada 1
## 5 Denmark 1
## 6 Finland 1
## 7 France 1
## 8 Germany 1
## 9 Iceland 1
## 10 Ireland 1
## 11 South Korea 1
## 12 Luxembourg 1
## 13 Netherlands 1
## 14 New Zealand 1
## 15 Norway 1
## 16 Spain 1
## 17 Sweden 1
## 18 Switzerland 1
## 19 United Kingdom 1
## 20 United States of America 1
clustt2_var <- clustt[clustt$cluster==2,]
clustt2 <- as.data.frame(cbind(rownames(clustt2_var), clustt2_var[,25]))
colnames(clustt2) <- c("Country", "Cluster")
clustt2
## Country Cluster
## 1 Armenia 2
## 2 Azerbaijan 2
## 3 Bangladesh 2
## 4 Belize 2
## 5 Bolivia 2
## 6 Brazil 2
## 7 Cape Verde 2
## 8 Colombia 2
## 9 Dominican Republic 2
## 10 Ecuador 2
## 11 Egypt 2
## 12 El Salvador 2
## 13 Guatemala 2
## 14 Honduras 2
## 15 Kyrgyzstan 2
## 16 Laos 2
## 17 Mexico 2
## 18 Mongolia 2
## 19 Morocco 2
## 20 Nepal 2
## 21 Peru 2
## 22 Philippines 2
## 23 Saint Lucia 2
## 24 Thailand 2
## 25 Uzbekistan 2
clustt3_var <- clustt[clustt$cluster==4,]
clustt3_var$cluster[clustt3_var$cluster == 4] <- 3
clustt3 <- as.data.frame(cbind(rownames(clustt3_var), clustt3_var[,25]))
colnames(clustt3) <- c("Country", "Cluster")
clustt3
## Country Cluster
## 1 Bulgaria 3
## 2 Cuba 3
## 3 Czech Republic 3
## 4 Estonia 3
## 5 Georgia 3
## 6 Greece 3
## 7 Hungary 3
## 8 Italy 3
## 9 Kazakhstan 3
## 10 Latvia 3
## 11 Lithuania 3
## 12 Mauritius 3
## 13 Montenegro 3
## 14 Poland 3
## 15 Portugal 3
## 16 Romania 3
## 17 Russia 3
## 18 Serbia 3
## 19 Slovakia 3
## 20 Slovenia 3
## 21 South Africa 3
## 22 Uruguay 3
clustt4_var <- clustt[clustt$cluster==1,]
clustt4_var$cluster[clustt4_var$cluster == 1] <- 4
clustt4 <- as.data.frame(cbind(rownames(clustt4_var), clustt4_var[,25]))
colnames(clustt4) <- c("Country", "Cluster")
clustt4
## Country Cluster
## 1 Burkina Faso 4
## 2 Burundi 4
## 3 Chad 4
## 4 Cote d'Ivoire 4
## 5 Djibouti 4
## 6 Ghana 4
## 7 Malawi 4
## 8 Mauritania 4
## 9 Pakistan 4
## 10 Rwanda 4
## 11 Senegal 4
## 12 Tanzania 4
## 13 East Timor 4
clustt5_var <- clustt[clustt$cluster==5,]
clustt5 <- as.data.frame(cbind(rownames(clustt5_var), clustt5_var[,25]))
colnames(clustt5) <- c("Country", "Cluster")
clustt5
## Country Cluster
## 1 Bahrain 5
## 2 Brunei Darussalam 5
## 3 Chile 5
## 4 Costa Rica 5
## 5 Cyprus 5
## 6 Israel 5
## 7 Malaysia 5
## 8 Maldives 5
## 9 Malta 5
## 10 Oman 5
## 11 Saudi Arabia 5
## 12 Turkey 5
## 13 United Arab Emirates 5
clustt6_var <- clustt[clustt$cluster==3,]
clustt6_var$cluster[clustt6_var$cluster == 3] <- 6
clustt6 <- as.data.frame(cbind(rownames(clustt6_var), clustt6_var[,25]))
colnames(clustt6) <- c("Country", "Cluster")
clustt6
## Country Cluster
## 1 India 6
Country codes are added to the dataset.
malDF <- data.frame(country = c(
# Country codes for cluster 1
"AUS", "AUT", "BEL", "CAN", "DNK", "FIN", "FRA",
"DEU", "ISL", "IRL", "KOR", "LUX", "NLD", "NZL",
"NOR", "ESP", "SWE", "CHE", "GBR", "USA",
# Country codes for cluster 2
"ARM", "AZE", "BGD", "BLZ", "BOL", "BRA", "CPV",
"COL", "DOM", "ECU", "EGY", "SLV", "GTM", "HND",
"KGZ", "LAO", "MEX", "MNG", "MAR", "NPL", "PER",
"PHL", "LCA", "THA", "UZB",
# Country codes for cluster 3
"BGR", "CUB", "CZE", "EST", "GEO", "GRC", "HUN",
"ITA", "KAZ", "LVA", "LTU", "MUS", "MNE", "POL",
"PRT", "ROU", "RUS", "SRB", "SVK", "SVN", "ZAF",
"URY",
# Country codes for cluster 4
"BHR", "BDI", "TCD", "DJI", "CIV", "GHA", "MWI",
"MRT", "PAK", "RWA", "SEN", "TZA", "TLS",
# Country codes for cluster 5
"BHR", "BRN", "CHL", "CRI", "CYP", "ISR", "MYS",
"MDV", "MLT", "OMN", "SAU", "TUR", "ARE",
# Country codes for cluster 6
"IND"),
Clusters = c(rep("Cluster 1",nrow(clustt1)),rep("Cluster 2",nrow(clustt2)),rep("Cluster 3",nrow(clustt3)),
rep("Cluster 4",nrow(clustt4)),rep("Cluster 5",nrow(clustt5)),"Cluster 6"))
clusterNames <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3",
nameJoinColumn = "country")
## 94 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 150 codes from the map weren't represented in your data
The countries assigned to the clusters are shown on the map with different colours.
mapP <- mapCountryData(malMap, nameColumnToPlot = "Clusters",
catMethod = "categorical",missingCountryCol = "lightgrey",
colourPalette = c("red2", "seagreen", "navy",
"royalblue", "mediumvioletred",
"yellow2"), addLegend=TRUE)
Countries that belong to different clusters are marked on the map. You can see how the clusters of countries are distributed geographically.
Often, countries that are geographically close have more similarities in terms of historical background and existing culture. These similarities also affect the socio-economic indicators of the countries. It can be observed that in the map showing countries extracted using non-hierarchical cluster analysis, the countries belonging to each cluster are geographically slightly closer to each other than in the map showing countries extracted using hierarchical cluster analysis. For this reason, the countries extracted by the k-means method will be used for further interpretation and analysis.
The countries identified through the k-means method will be utilized for subsequent analysis and interpretation.
In the light of the factors highlighted above, the countries extracted by the k-means method are subject to cluster interpretation:
A principal component analysis is performed on standardized baseline data.
pca <- princomp(data_sc)
summary(pca, importance = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.4661706 1.5876652 1.43132211 1.17114920 1.07814493
## Proportion of Variance 0.5059802 0.1061577 0.08627966 0.05776411 0.04895397
## Cumulative Proportion 0.5059802 0.6121379 0.69841756 0.75618168 0.80513565
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.91978272 0.86909633 0.79280023 0.73551691 0.62617326
## Proportion of Variance 0.03562904 0.03181043 0.02647044 0.02278342 0.01651288
## Cumulative Proportion 0.84076470 0.87257512 0.89904556 0.92182899 0.93834186
## Comp.11 Comp.12 Comp.13 Comp.14
## Standard deviation 0.60404575 0.56849465 0.416171361 0.381876678
## Proportion of Variance 0.01536644 0.01361089 0.007294206 0.006141577
## Cumulative Proportion 0.95370831 0.96731919 0.974613399 0.980754977
## Comp.15 Comp.16 Comp.17 Comp.18
## Standard deviation 0.353216832 0.294074338 0.27972160 0.245110833
## Proportion of Variance 0.005254319 0.003642067 0.00329523 0.002530222
## Cumulative Proportion 0.986009296 0.989651363 0.99294659 0.995476815
## Comp.19 Comp.20 Comp.21 Comp.22
## Standard deviation 0.192458862 0.179906694 0.1245330421 0.1141824713
## Proportion of Variance 0.001559946 0.001363102 0.0006531348 0.0005490761
## Cumulative Proportion 0.997036760 0.998399862 0.9990529971 0.9996020732
## Comp.23 Comp.24
## Standard deviation 0.0720568309 0.0652415321
## Proportion of Variance 0.0002186674 0.0001792594
## Cumulative Proportion 0.9998207406 1.0000000000
There are 24 variables in total, which means there are 24 components. The tables show the standard deviations (or matrix test values) of the first 10 components and the proportion of variance explained and the cumulative proportion of variance explained of each component. The standard deviations of the first five components are greater than 1, indicating that they explain more variance than individual variables. The remaining components have standard deviations smaller than 1 and account for less variance than the individual variables. Each successive component explains less of the variance. The first three explain 70% of the total variance.
var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
plot(a)
var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
plot(a)
var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=1:2, xtickslab.rt=90)
plot(a)
A graphical representation is generated for the initial two principal components, which account for the majority of the variance.
plot(pca$scores,type = "n", xlim = xlim, ylim = xlim,
main = "Scatterplot of First Two Principal Components")
text(pca$scores[,1], pca$scores[,2], labels = rownames(data_sc), cex = 0.5)
This scatterplot offers valuable insights into the relationships among countries, and provides a clear depiction of those that fall outside the expected patterns. Notably, the East African countries (Burundi, Tanzania, Malawi, Rwanda) and West African countries (Burkina Faso, Ivory Coast, Mauritania) are clustered together. Meanwhile, the Central African nation of Chad is an outlier, although it is grouped with other African countries. India is also a standout, most likely due to its exceedingly large population. Furthermore, clusters consisting of larger numbers of countries are formed, such as Colombia, Morocco, Ecuador, Cape Verde, Belize, and others. Additionally, European countries such as Germany, France, Austria, Portugal, and Estonia are also grouped together.
Within each cluster, a one-sided hypothesis is tested that the mean number of suicides among males is less than or equal to the mean number of suicides among females. The results of the t-test to test the hypotheses about the mean number of suicides between men and women are presented.
hyp_suic <- as.data.frame(rbind(t.test(clustt1_var$SUIC_M, clustt1_var$SUIC_F,
alternative="greater")$p.val,
t.test(clustt2_var$SUIC_M, clustt2_var$SUIC_F,
alternative="greater")$p.val,
t.test(clustt3_var$SUIC_M, clustt3_var$SUIC_F,
alternative="greater")$p.val,
t.test(clustt4_var$SUIC_M, clustt4_var$SUIC_F,
alternative="greater")$p.val,
t.test(clustt5_var$SUIC_M, clustt5_var$SUIC_F,
alternative="greater")$p.val))
rownames(hyp_suic) <- clusterNames[1:5]
colnames(hyp_suic) <- "p-value"
hyp_suic
## p-value
## Cluster 1 1.013598e-08
## Cluster 2 9.300975e-07
## Cluster 3 1.204952e-08
## Cluster 4 1.492020e-06
## Cluster 5 1.267298e-05
It can be concluded that the null hypothesis is rejected in all presented clusters because the p-values of the test performed are less than the significance level α = 0.05. Therefore, the average number of male suicides is not lower than that of females in each of the presented clusters. In the sixth cluster, there is only one country - India. In this country, the number of male suicides (14.1 suicides per 100,000 male population) is also higher than that of females (11.1 suicides per 100,000 female population).
| p-value | Null hypothesis | Result interpretation | |
|---|---|---|---|
| Cluster 1 | < α | Rejected | The average number of male suicides is not lower than that of females |
| Cluster 2 | < α | Rejected | The average number of male suicides is not lower than that of females |
| Cluster 3 | < α | Rejected | The average number of male suicides is not lower than that of females |
| Cluster 4 | < α | Rejected | The average number of male suicides is not lower than that of females |
| Cluster 5 | < α | Rejected | The average number of male suicides is not lower than that of females |
| Cluster 6 | - | - | The number of male suicides is higher than that of females |
Within each cluster, a two-sided hypothesis is tested that the average proportion of male employees who receive wages is equal to the average proportion of female employees who receive wages. The results of the t-test for testing hypotheses about the average proportion of male and female employees who receive wages are presented.
hyp_wage <- as.data.frame(rbind(t.test(clustt1_var$WAGE_M, clustt1_var$WAGE_F,
alternative="two.sided")$p.val,
t.test(clustt2_var$WAGE_M, clustt2_var$WAGE_F,
alternative="two.sided")$p.val,
t.test(clustt3_var$WAGE_M, clustt3_var$WAGE_F,
alternative="two.sided")$p.val,
t.test(clustt4_var$WAGE_M, clustt4_var$WAGE_F,
alternative="two.sided")$p.val,
t.test(clustt5_var$WAGE_M, clustt5_var$WAGE_F,
alternative="two.sided")$p.val))
rownames(hyp_wage) <- clusterNames[1:5]
colnames(hyp_wage) <- "p-value"
hyp_wage
## p-value
## Cluster 1 0.0001058282
## Cluster 2 0.4836540070
## Cluster 3 0.0122322417
## Cluster 4 0.0212397381
## Cluster 5 0.8945942619
It can be concluded that the null hypothesis is rejected in the first, third, and fourth clusters because the p-values of the test performed are less than the significance level α = 0.05. Therefore, in these clusters, the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant. The first and third cluster countries are mostly located in Europe, and these clusters also include Russia, the United States, and Canada. These countries differ significantly from the countries in the fourth cluster, which are located throughout the African continent. In the second and fifth clusters, the p-values are greater than the significance level, so the null hypothesis is not rejected for these clusters. Therefore, in the second and fifth clusters, the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically insignificant. In India, the proportion of male employees differs from that of females by only one hundredth (male - 24.17% of total employment, female - 24.16% of total employment).
| p-value | Null hypothesis | Result interpretation | |
|---|---|---|---|
| Cluster 1 | < α | Rejected | the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant |
| Cluster 2 | > α | Not rejected | the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically insignificant |
| Cluster 3 | < α | Rejected | the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant |
| Cluster 4 | < α | Rejected | the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically significant |
| Cluster 5 | > α | Not rejected | the difference between the average proportion of male employees who receive wages and the average proportion of female employees who receive wages is statistically insignificant |
| Cluster 6 | - | - | The difference in this indicator is not significant |
Within each cluster, a two-sided hypothesis is tested that the average expected lifespan of men is equal to the average expected lifespan of women. The results of the t-test for testing hypotheses about the average expected lifespan of men and women are presented.
hyp_lexp <- as.data.frame(rbind(t.test(clustt1_var$LEXP_M, clustt1_var$LEXP_F,
alternative="two.sided")$p.val,
t.test(clustt2_var$LEXP_M, clustt2_var$LEXP_F,
alternative="two.sided")$p.val,
t.test(clustt3_var$LEXP_M, clustt3_var$LEXP_F,
alternative="two.sided")$p.val,
t.test(clustt4_var$LEXP_M, clustt4_var$LEXP_F,
alternative="two.sided")$p.val,
t.test(clustt5_var$LEXP_M, clustt5_var$LEXP_F,
alternative="two.sided")$p.val))
rownames(hyp_lexp) <- clusterNames[1:5]
colnames(hyp_lexp) <- "p-value"
hyp_lexp
## p-value
## Cluster 1 1.547953e-13
## Cluster 2 1.078688e-09
## Cluster 3 1.663873e-06
## Cluster 4 6.406870e-02
## Cluster 5 1.144371e-03
It can be concluded that the null hypothesis is rejected in the first, second, third, and fifth clusters since the test p-values are smaller than the significance level of α = 0.05. Therefore, in these clusters, the difference between the expected average lifespan of men and women is statistically significant. These clusters are located throughout the world, so gender inequality in terms of the average expected lifespan prevails in both men and women on every continent. In the fourth cluster, the p-value is greater than the significance level, so the null hypothesis cannot be rejected. Therefore, in the fourth cluster, the difference in the expected average lifespan between men and women is statistically insignificant. Although there are some African countries where these indicators are statistically significant, this inequality does not apply to most African countries. In India, the average life expectancy for men is 70.95 and for women, it is 68.46, so the difference in this indicator is not significant.
| p-value | Null hypothesis | Result interpretation | |
|---|---|---|---|
| Cluster 1 | < α | Rejected | The difference between the expected average lifespan of men and women is statistically significant |
| Cluster 2 | < α | Rejected | The difference between the expected average lifespan of men and women is statistically significant |
| Cluster 3 | < α | Rejected | The difference between the expected average lifespan of men and women is statistically significant |
| Cluster 4 | > α | Not rejected | The difference in the expected average lifespan between men and women is statistically insignificant |
| Cluster 5 | < α | Rejected | The difference between the expected average lifespan of men and women is statistically significant |
| Cluster 6 | - | - | The difference in this indicator is not significant |
World countries can be classified into distinct clusters or groups using a range of analytical methods, including factor analysis, hierarchical and non-hierarchical clustering analysis, and principal component analysis. Among these methods, the k-means approach is considered the most precise as it enables the geographical grouping of countries in each cluster to be visualized effectively through maps.
The geographical position of countries within each cluster appears to contribute to similarities and differences in economic and social factors. A factor analysis was conducted to identify four key factors that explain the variability in the data. The resulting factor interpretations were used to compare the clusters and their respective economic and social characteristics.
Through the application of statistical methods, certain countries were identified as distinctive when compared to others. The initial data was plotted on the coordinate plane of the first two factors, revealing Chad as an outlier. Chad is a Central African country characterized by underdeveloped economy, and high rates of mortality and suicide. Further analysis was carried out using principal component analysis, which generated a graph of the first two principal components that accounted for the majority of the variance in the data. The resulting graph not only highlighted Chad, but also India, which is situated in South Asia and is the second most populous country in the world, with a sizeable population.
Through the formulation and testing of hypotheses, it was observed that there exists a significant difference between men and women in terms of the average number of suicides across all clusters. Furthermore, there are discernible disparities between men and women when investigating the hypothesis that posits the equivalence of expected life expectancies between the two genders.
To view the full R code for this project, click here.
sessionInfo()
## R version 4.3.0 (2023-04-21 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Europe/Vilnius
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.42 DT_0.27 lubridate_1.9.2 forcats_1.0.0
## [5] stringr_1.5.0 dplyr_1.1.2 readr_2.1.4 tidyr_1.3.0
## [9] tibble_3.2.1 tidyverse_2.0.0 rworldmap_1.3-6 sp_1.6-0
## [13] psych_2.3.3 factoextra_1.0.7 ggplot2_3.4.2 corrplot_0.92
## [17] pheatmap_1.0.12 cluster_2.1.4 purrr_1.0.1 xlsx_0.6.5
##
## loaded via a namespace (and not attached):
## [1] dotCall64_1.0-2 gtable_0.3.3 spam_2.9-1 xfun_0.39
## [5] bslib_0.4.2 htmlwidgets_1.6.2 rstatix_0.7.2 ggrepel_0.9.3
## [9] rJava_1.0-6 lattice_0.21-8 tzdb_0.3.0 crosstalk_1.2.0
## [13] maptools_1.1-6 vctrs_0.6.2 tools_4.3.0 generics_0.1.3
## [17] parallel_4.3.0 fansi_1.0.4 highr_0.10 pkgconfig_2.0.3
## [21] RColorBrewer_1.1-3 lifecycle_1.0.3 farver_2.1.1 compiler_4.3.0
## [25] fields_14.1 munsell_0.5.0 mnormt_2.1.1 carData_3.0-5
## [29] htmltools_0.5.5 maps_3.4.1 sass_0.4.5 yaml_2.3.7
## [33] car_3.1-2 ggpubr_0.6.0 pillar_1.9.0 jquerylib_0.1.4
## [37] ellipsis_0.3.2 cachem_1.0.7 viridis_0.6.2 abind_1.4-5
## [41] nlme_3.1-162 tidyselect_1.2.0 digest_0.6.31 stringi_1.7.12
## [45] labeling_0.4.2 fastmap_1.1.1 grid_4.3.0 colorspace_2.1-0
## [49] cli_3.6.1 magrittr_2.0.3 xlsxjars_0.6.1 utf8_1.2.3
## [53] broom_1.0.4 foreign_0.8-84 withr_2.5.0 backports_1.4.1
## [57] scales_1.2.1 timechange_0.2.0 rmarkdown_2.21 gridExtra_2.3
## [61] ggsignif_0.6.4 hms_1.1.3 evaluate_0.20 viridisLite_0.4.1
## [65] rlang_1.1.0 Rcpp_1.0.10 glue_1.6.2 rstudioapi_0.14
## [69] jsonlite_1.8.4 R6_2.5.1